How Our Pollster Ratings Work


The Details

Longtime readers of FiveThirtyEight are probably familiar with our pollster ratings: letter grades that we assign to pollsters based on their historical accuracy and transparency. Since 2008, we have been evaluating pollsters and using these ratings to inform both the public and our models about the quality of individual polls. Over the years, the methodology for these ratings has evolved, but the fundamental principle has remained the same: look at all the polls we have that were conducted within three weeks of an election, and try to determine how accurate each pollster has been and might be in the future.

Our pollster ratings can be found at this dashboard. There, you’ll see a graph with every pollster we have evaluated, organized by their most recent rating, as well as a searchable and sortable table of all the pollsters. Each pollster also has an individual rating page (for example, here’s the one for Selzer & Co.) that shows details about its rating, including all the polls we’ve analyzed by that pollster and their accuracy. If you want even more detail, you can download the associated datasets. Pollster grades are also included next to every poll we publish on our polls page, to help give context to the data.

Our pollster ratings are based on a metric called Predictive Plus-Minus. This metric is based on several key factors, including:

  • Simple error for polls (i.e., how far away the poll results are from the actual election margin).
  • How well other pollsters performed in the same races (i.e., whether this pollster is as good as, better than or worse than others).
  • Methodological quality (i.e., whether this pollster is conducting polls in accordance with professional standards).
  • Herding (i.e., whether this pollster appears to just be copying others’ results).

While our dataset includes several other metrics for understanding how well a pollster has historically performed, our letter grades are based entirely on Predictive Plus-Minus.

Below are all the methodological details of how we currently calculate Predictive Plus-Minus, as well as a few other metrics that appear in the data. If you want to see the methodology for previous versions of our pollster ratings, scroll to the bottom of this page for links to all the methodological updates we’ve published over the years.

Step 1: Collect and classify polls

Almost all of the work is in this step; we’ve spent hundreds (thousands?) of hours over the years collecting polls. The ones represented in the pollster-ratings database meet our basic standards as well as three simple criteria:

  • They were conducted in 1998 or later. (We chose 1998 as the cutoff point because there are multiple sources that make polling data available from 1998 to the present, meaning that the data ought to be reasonably comprehensive. If you are aware of errors or omissions from this data, please reach out to let us know!)
  • They have a median field date within 21 days of the election date.
  • They were conducted for one of the following types of elections:
    • Presidential general elections
    • Presidential primaries or caucuses
    • U.S. Senate general elections
    • U.S. House general elections
    • Gubernatorial general elections

Of course, it’s not so simple. A number of other considerations come up from time to time:

  • Sample sizes are sometimes missing from older polls. In these cases, we’ve estimated a poll’s sample size from its reported margin of error (a minimal sketch of that back-calculation appears after this list) or from how many people a polling firm surveyed in other polls where the sample size was listed. As a last resort, we use 600 as a default sample size.
  • If a pollster lists results among likely voters and registered voters (or all adults), we include only the likely-voter version in the pollster-ratings database. Because the database covers the final three weeks of the campaign, and because almost all polling firms publish likely-voter polls by that time, almost all polls in the database should be likely-voter surveys.
  • When a pollster publishes multiple versions of the same survey (for example, versions of the poll with and without a third-party candidate included), FiveThirtyEight’s policy is to average the versions together. However, some of the older polls in our database were taken from sources that may have followed different rules, so the treatment of these cases may be inconsistent.
  • Polls of special elections and runoffs are included.
  • In races that use an instant runoff, polls of all rounds of the race are included. Polls are evaluated based on the results of the round(s) they polled, if those results are published, or the results of the final round if a candidate got to 50 percent of the vote before all runoff rounds were calculated.
  • Polls of all-party primaries (such as in Louisiana) are included.
  • National polls for the presidential popular vote and the generic congressional ballot are included.
  • The use of tracking polls is restricted to nonoverlapping dates. For instance, if a firm’s final tracking poll was conducted on the Friday through the Sunday before an election, we wouldn’t also list the version that covered Thursday through Saturday.
  • Polls are included in the database even if they were not used in FiveThirtyEight’s forecasts.
  • Although virtually all polls conducted in the final three weeks of a campaign are included, there are some exceptions in the case of the presidential primaries.
    • We exclude polls of the New Hampshire primary that are conducted before the Iowa caucus.
    • We exclude polls of primaries in states beyond New Hampshire that are conducted before the New Hampshire primary.
    • We exclude primary polls whose leader or runner-up dropped out before that primary was held.
    • We exclude primary polls if any candidate receiving at least 15 percent in the poll dropped out before that primary was held.
    • We exclude primary polls if any combination of candidates receiving at least 25 percent in the poll dropped out before that primary was held.
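
To illustrate the margin-of-error fallback mentioned in the list above, here is a minimal sketch. It assumes the poll reported a 95 percent confidence level and uses the maximum-variance case (p = 0.5); the function name and its details are ours, for illustration only.

```python
import math

def estimate_sample_size_from_moe(moe_points: float, z: float = 1.96) -> int:
    """Back out an approximate sample size from a poll's reported margin of
    error (in percentage points), assuming a 95 percent confidence level
    (z = 1.96) and maximum variance (p = 0.5)."""
    moe = moe_points / 100.0                 # convert points to a proportion
    n = (z ** 2) * 0.25 / (moe ** 2)         # n = z^2 * p * (1 - p) / MOE^2
    return round(n)

# A reported +/- 3.1-point margin of error implies roughly 1,000 respondents.
print(estimate_sample_size_from_moe(3.1))  # ~999
print(estimate_sample_size_from_moe(4.0))  # ~600 (incidentally, the last-resort default noted above)
```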

One challenge comes in how to identify which pollster we associate with each survey. For instance, Fabrizio, Lee & Associates and Impact Research began a partnership to conduct surveys for the Wall Street Journal in late 2021. Theoretically, these could be classified as polls conducted by Fabrizio, Lee & Associates, Impact Research, the Wall Street Journal or some combination thereof. Our policy is to classify polls based on the pollster that conducted them, regardless of sponsorship, so these surveys are attributed to the partnership “Fabrizio, Lee & Associates/Impact Research.”

However, a few media companies have in-house polling operations. Confusingly, media companies sometimes also act as the sponsors of polls conducted by other firms. Our goal is to associate the poll with the company that, in our estimation, contributed the most intellectual property to the survey’s methodology. In some cases, this does include the media company that funded the poll. This is why, for example, Siena College/The New York Times Upshot is listed as a separate pollster from regular old Siena College.

When the same pollster or polling team operates multiple companies with different names but the same polling methodology, their polls are evaluated together. This is why, on some pollster pages, you may see alternative names listed, indicating other companies operated by the same principal researchers or previous branding for that pollster’s work.

Step 2: Calculate simple average error

This part’s really simple: We compare the margin in each poll against the actual margin of the election and see how far apart they were. If the poll showed the Republican leading by 4 percentage points and they won by 9 instead, the poll’s simple error was 5 points. We draw election results from officially certified state or federal sources.

Simple error is calculated based on the margin separating the top two finishers in the election — not the top two candidates in the poll. For instance, if a certain poll of the 2008 Iowa Democratic caucus showed Hillary Clinton at 32 percent, Barack Obama at 30 percent and John Edwards at 28 percent, we’d look at its margin between Obama and Edwards since they were the top two finishers in the election (Clinton narrowly finished third).

We then calculate a simple average error for each pollster based on the average of the simple error of all its polls. This average is calculated using root-mean-square error.
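
To make Step 2 concrete, here is a minimal sketch of the calculation as just described. The field names and data structure are ours, for illustration; the margins are measured between the election’s top two finishers, and the per-pollster average is a root-mean-square error.

```python
import math
from collections import defaultdict

def simple_error(poll_margin: float, result_margin: float) -> float:
    """Absolute gap between a poll's margin and the certified result,
    both measured between the election's top two finishers."""
    return abs(poll_margin - result_margin)

def simple_average_error(polls: list[dict]) -> dict[str, float]:
    """Root-mean-square simple error for each pollster. Each poll is a dict
    with 'pollster', 'poll_margin' and 'result_margin' (hypothetical keys)."""
    squared_errors = defaultdict(list)
    for p in polls:
        err = simple_error(p["poll_margin"], p["result_margin"])
        squared_errors[p["pollster"]].append(err ** 2)
    return {name: math.sqrt(sum(errs) / len(errs))
            for name, errs in squared_errors.items()}

polls = [
    {"pollster": "Pollster A", "poll_margin": 4.0, "result_margin": 9.0},   # 5-point miss
    {"pollster": "Pollster A", "poll_margin": -2.0, "result_margin": 1.0},  # 3-point miss
]
print(simple_average_error(polls))  # {'Pollster A': 4.12...}
```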

Step 3: Calculate Simple Plus-Minus

Some elections are more conducive than others to accurate polling. In particular, polls of presidential general elections are historically quite accurate, while presidential primaries are much more challenging to poll. Polls of general elections for Congress and for governor are somewhere in between.

This step seeks to account for that fact, along with a couple of other factors. We run a regression analysis that predicts polling error based on the type of election surveyed, a poll’s margin of sampling error and the number of days between the poll and the election.

We then calculate a Simple Plus-Minus score for each pollster by comparing its simple average error against the error one would expect from these factors. For instance, suppose a pollster has a simple average error of 4.6 points. By comparison, the average pollster, surveying the same types of races on the same dates and with the same sample sizes, would have an error of 5.3 points according to the regression. Our pollster therefore gets a Simple Plus-Minus score of -0.7. This is a good score: As in golf, negative scores indicate better-than-average performance. Specifically, it means this pollster’s polls have been 0.7 points more accurate than other polls under similar circumstances.
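
The plus-minus itself is then just a subtraction; the expected error comes from the regression described above and is treated here as a given input. A minimal sketch:

```python
def simple_plus_minus(simple_avg_error: float, expected_error: float) -> float:
    """A pollster's simple average error minus the error the regression would
    expect for an average pollster surveying the same races, on the same
    dates, with the same sample sizes. As in golf, negative is better."""
    return simple_avg_error - expected_error

# The worked example from the text: 4.6 points observed vs. 5.3 expected.
print(simple_plus_minus(4.6, 5.3))  # approximately -0.7
```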

A few words about the other factors Simple Plus-Minus considers. In the past, we’ve described the error in polls as resulting from three major components: sampling error, temporal error and pollster error (or “pollster-induced error”). These are related by a sum of squares formula:

\begin{equation*}Total\ Error = \sqrt{Sampling\ Error^2 + Temporal\ Error^2 + Pollster\ Error^2}\end{equation*}

Sampling error reflects the fact that a poll surveys only some portion of the electorate rather than everybody. This matters less than you might expect; theoretically, a poll of 1,000 voters will miss the final margin in the race by an average of only about 2.5 points because of sampling error alone — even in a state with 10 million voters. Unfortunately, sampling error isn’t the only problem pollsters have to worry about.
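
As a rough back-of-the-envelope check on that 2.5-point figure (our arithmetic, treating the race as effectively two-way and sampling error as normally distributed): with 1,000 respondents, each candidate’s vote share has a standard error of about 1.6 points, the margin between the two candidates has a standard error of roughly twice that, and the average absolute value of a normally distributed error is about 0.8 times its standard deviation.

\begin{equation*}SE_{share} = \sqrt{\frac{0.5 \times 0.5}{1{,}000}} \approx 1.6 \text{ pts}, \qquad SE_{margin} \approx 2 \times 1.6 \approx 3.2 \text{ pts}, \qquad \text{avg. miss} \approx \sqrt{\tfrac{2}{\pi}} \times 3.2 \approx 2.5 \text{ pts}\end{equation*}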

Another concern is that polls are (almost) never conducted on Election Day itself. We refer to this property as temporal (or time-dependent) error. There have been elections when important news events occurred in the 48 to 72 hours that separated the final polls from the election, such as the New Hampshire Democratic presidential primary debate in 2008.

If late-breaking news can sometimes affect the outcome of elections, why go back three weeks in evaluating pollster accuracy? Well, there are a number of considerations we need to balance against the possibility of last-minute shifts in the polls:

  • The overwhelming majority of elections do not feature important late-breaking developments. There will often be head-fakes and media-hyped “game changers,” but the evidence suggests they rarely make much difference.
  • Herding (see below) becomes more prominent in the final few days before an election. It’s fairly common for a pollster to publish some wild-seeming results earlier in the cycle — which can affect media coverage of the campaign — only to “fall in line” with its final poll.
  • Some of the apparent movement in the polls in the late days of the election is probably artificial, reflecting response bias (i.e., voters for a certain candidate might be more likely to respond to polls after the candidate has a strong news cycle) and badly designed turnout models rather than genuine changes in public opinion.
  • “Election Day” is something of a misnomer. Most states allow people to vote by mail or early in person; in the 2022 Senate election in Arizona, for example, over 80 percent of votes were cast by early or mail-in ballot rather than at a polling place on Nov. 8.
  • Accounting for all polls in the final three weeks of the campaign increases the sample size of polls we can analyze, making us much more confident in our evaluations.

Three weeks is an arbitrary cutoff point; we have found no significant difference between ratings based on polls conducted three, four or five weeks out from an election. But we feel strongly that evaluating a polling firm’s accuracy based only on its very last poll before an election is a mistake.

Nonetheless, the pollster ratings account for the fact that polling on the eve of the election is slightly easier than doing so a couple of weeks out. So a firm shouldn’t be at any advantage or disadvantage because of when it surveys a race.

The final component is pollster error (what we’ve referred to in the past as “pollster-induced error”); it’s the residual error component that can’t be explained by sampling error or temporal error. Certain things (like projecting turnout or ensuring a representative sample of the population) are inherently pretty hard. Our research suggests that even if all polls were conducted on Election Day itself (i.e., no temporal error) and took an infinite sample size (i.e., no sampling error), the average poll would still miss the final margin in the race by about 2 points.

However, some polling firms are associated with more of this type of error. That’s what our Simple Plus-Minus scores seek to evaluate.

Step 4: Calculate Advanced Plus-Minus

In 2014, House Majority Leader Eric Cantor lost the Republican primary in Virginia’s 7th Congressional District to David Brat, a college professor. It was a stunning upset, at least according to the polls. For instance, a Vox Populi Polling/Daily Caller poll had put Cantor ahead by 12 points. Instead, Brat won by 11 points. The poll missed by 23 points.

According to Simple Plus-Minus, that poll would score very poorly. We don’t have a comprehensive database of House primary polls and don’t include them in the pollster ratings, but we’d guess that such polls are off by something like 10 points on average. Because the aforementioned poll missed by 23 points, it would get a Simple Plus-Minus score somewhere around +13.

That seems pretty terrible — until you compare it with the only other poll of the race, an internal poll released by McLaughlin & Associates on behalf of Cantor’s campaign. That poll had Cantor up by 34 points — a 45-point error! If we calculated something called Relative Plus-Minus (how the poll stacks up against others of the same race), the Vox Populi/Daily Caller poll would get a score of -22, since it was 22 points more accurate than the McLaughlin & Associates survey.

Advanced Plus-Minus, the next step in the calculation, seeks to balance these considerations. Advanced Plus-Minus is a combination of Relative Plus-Minus and Simple Plus-Minus, weighted by the number of other polling firms that surveyed the same race (let’s call this number n). Relative Plus-Minus gets the weight of n, and Simple Plus-Minus gets a weight of three. For example, if six other polling firms surveyed a certain race, Relative Plus-Minus would get two-thirds of the weight and Simple Plus-Minus would get one-third.

In other words, when there are a lot of polls in the field, Advanced Plus-Minus is mostly based on how well a poll did in comparison to the work of other pollsters that surveyed the same election. But when there is scant polling, it’s mostly based on Simple Plus-Minus.
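
Here is that core weighting as a minimal sketch. It omits the additional adjustments described in the next few paragraphs (strength of competition, number of polls in the field, recency and election-type volatility), and the function name is ours.

```python
def advanced_plus_minus_core(relative_pm: float, simple_pm: float,
                             n_other_pollsters: int) -> float:
    """Blend Relative Plus-Minus and Simple Plus-Minus: Relative Plus-Minus
    is weighted by n (the number of other firms that surveyed the same race)
    and Simple Plus-Minus by a constant weight of 3."""
    n = n_other_pollsters
    return (n * relative_pm + 3 * simple_pm) / (n + 3)

# With six other pollsters in the field, Relative Plus-Minus carries
# two-thirds of the weight (6/9) and Simple Plus-Minus one-third (3/9).
print(advanced_plus_minus_core(relative_pm=-1.0, simple_pm=2.0, n_other_pollsters=6))  # 0.0
```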

Meticulous readers might wonder about another problem. If we’re comparing a poll against its competitors, shouldn’t we account for the strength of the competition? If a pollster misses every election by 40 points, it’s easy to look good by comparison if you happen to poll the same races it does. The problem is similar to the one you’ll encounter if you try to design college football or basketball rankings: Ideally, you’ll want to account for the strength of a team’s schedule in addition to its wins and losses and margins of victory. Advanced Plus-Minus addresses this by means of iteration (see a good explanation here), a technique commonly applied in sports power ratings.
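
For readers curious what such an iteration can look like, here is a generic strength-of-schedule sketch in the style of sports rating systems. It is our illustration of the general technique, not FiveThirtyEight’s actual procedure; the fixed round count and per-round re-centering are simplifications we’ve added to keep the toy example stable.

```python
def iterate_schedule_adjustment(raw_scores: dict[str, float],
                                opponents: dict[str, list[str]],
                                rounds: int = 25) -> dict[str, float]:
    """Each round, restate every pollster's raw relative score against the
    average adjusted score of the pollsters it was compared with. Lower is
    better, so facing strong (negative-scoring) competition earns extra
    credit, while beating only weak competition is discounted."""
    adjusted = dict(raw_scores)
    for _ in range(rounds):
        new = {}
        for pollster, raw in raw_scores.items():
            opp = opponents.get(pollster, [])
            opp_strength = sum(adjusted[o] for o in opp) / len(opp) if opp else 0.0
            new[pollster] = raw + opp_strength
        # Re-center so the adjusted scores keep a zero average across pollsters.
        mean = sum(new.values()) / len(new)
        adjusted = {name: score - mean for name, score in new.items()}
    return adjusted

# Pollster C looks as good as Pollster A on raw scores, but C earned its score
# only against the weak Pollster B, so iteration pulls C back toward average.
print(iterate_schedule_adjustment(
    {"A": -1.0, "B": 2.0, "C": -1.0},
    {"A": ["B", "C"], "B": ["A", "C"], "C": ["B"]},
))
```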

  • Advanced Plus-Minus also addresses another problem. Polls tend to be more accurate when there are more of them in the field. This may reflect herding, selection bias (pollsters may be more inclined to survey easier races; consider how many of them avoided the Kansas gubernatorial race in 2022) or some combination thereof. So Advanced Plus-Minus also adjusts scores based on how many other polling firms surveyed the same election. This has the effect of rewarding polling firms that survey races few other pollsters do and penalizing those that swoop in only after there are already a dozen polls in the field.

Two final wrinkles. Advanced Plus-Minus puts slightly more weight on more recent polls. It also contains a subtle adjustment to account for the higher volatility of certain election types, especially presidential primaries.

Step 5: Calculate Predictive Plus-Minus

If you’re interested in a purely retrospective analysis of poll accuracy, Simple Plus-Minus and Advanced Plus-Minus can be useful. You’ll also find a number of other measures of historical accuracy in our pollster-ratings database. The version we’d personally recommend is called “Mean-Reverted Advanced Plus-Minus,” which is retrospective but discounts the results for pollsters with a small number of polls in the database.

However, that may not be your purpose. At FiveThirtyEight, we’re more interested in predicting which polling firms will be most accurate going forward. This is useful to know if you’re using polls to forecast election results, for example. For that purpose, we use a measure called Predictive Plus-Minus.

The difference with Predictive Plus-Minus is that it also accounts for a polling firm’s methodological standards — albeit in a slightly roundabout way. A pollster gets a boost in Predictive Plus-Minus if it is a member of the American Association for Public Opinion Research’s Transparency Initiative or contributes polls to the Roper Center for Public Opinion Research’s archive. Participation in these organizations is a proxy variable for methodological quality. That is, it’s a correlate of methodological quality rather than a direct measure of it.

We’ve previously discussed at length the value of including this sort of methodological component in our pollster ratings. In every cycle we have evaluated, pollsters that participate in professional organizations such as these have performed significantly better than pollsters that do not.

But let’s say you have one polling firm that passes our methodological tests but hasn’t been so accurate, and another that doesn’t meet the methodological standards but has a reasonably good track record. Which one should you expect to be more accurate going forward?

That’s the question Predictive Plus-Minus is intended to address. But the answer isn’t straightforward; it depends on how large a sample of polls you have from each firm. Our finding is that past performance reflects more noise than signal until you have about 30 polls to evaluate, so you should probably go with the firm with the higher methodological standards up to that point. If you have more than 30 polls from each pollster, however, you should tend to value past performance over methodology.

One further complication is “herding,” or the tendency for polls to produce very similar results to other polls, especially toward the end of a campaign. A methodologically inferior pollster may be posting superficially good results by manipulating its polls to match those of the stronger polling firms. If left to its own devices — without stronger polls to guide it — it might not do so well. When we looked at Senate polls from 2006 to 2013, we found that methodologically poor pollsters improve their accuracy by roughly 2 points when there are also strong polls in the field. As a result, Predictive Plus-Minus includes a “herding penalty” for pollsters that show too little variation from the average of previous polls of the race.
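
To give a flavor of what “too little variation” can mean statistically, here is an illustrative indicator. It is our sketch, not the actual penalty (which is a calibrated adjustment inside Predictive Plus-Minus), and it ignores noise in the prior polling average itself: if a pollster’s results sit closer to the average of prior polls, race after race, than its own sampling error would plausibly allow, that is a sign of herding.

```python
import math
from statistics import pstdev

def herding_indicator(deviations: list[float], sample_sizes: list[int]) -> float:
    """Ratio of observed to expected spread. 'deviations' are how far (in
    percentage points) each of a pollster's polls landed from the average of
    prior polls of the same race; 'sample_sizes' are those polls' sample
    sizes. Independent sampling alone implies margin noise of roughly
    2 * sqrt(0.25 / n) points per poll, so a ratio well below 1 suggests the
    pollster hugs the consensus more tightly than chance allows."""
    observed_sd = pstdev(deviations)
    expected_sd = 100 * math.sqrt(sum(1.0 / n for n in sample_sizes) / len(sample_sizes))
    return observed_sd / expected_sd

# Five polls of 600-1,000 people that all land within 0.6 points of the prior
# average produce a ratio of roughly 0.1, which is suspiciously tight.
print(herding_indicator([0.4, -0.6, 0.2, -0.3, 0.5], [800, 600, 1000, 750, 900]))
```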

The full formula for how to calculate Predictive Plus-Minus has evolved over the years. The formula we currently use is as follows:

\begin{equation*}PPM = \frac{\max(-2,\; APM + herding\_penalty) \times disc\_pollcount + prior \times 18}{18 + disc\_pollcount}\end{equation*}
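
Read literally, the formula blends a floored, herding-adjusted Advanced Plus-Minus with the methodology-based prior, weighted by a discounted poll count against a constant of 18. Here is a minimal sketch of that arithmetic; the interpretation of the variable names in the comments is our reading, not an official gloss.

```python
def predictive_plus_minus(apm: float, herding_penalty: float,
                          disc_pollcount: float, prior: float) -> float:
    """Direct translation of the formula above. Assumed meanings: apm is
    Advanced Plus-Minus, herding_penalty is the (non-negative) herding
    adjustment, disc_pollcount is a discounted count of the pollster's polls,
    and prior is the methodology-based prior score. Lower output is better."""
    adjusted = max(-2.0, apm + herding_penalty)  # cap how negative (good) the track record can count
    return (adjusted * disc_pollcount + prior * 18) / (18 + disc_pollcount)

# A pollster with few polls leans mostly on the prior; one with many polls
# leans mostly on its capped, herding-adjusted track record.
print(predictive_plus_minus(apm=-1.0, herding_penalty=0.0, disc_pollcount=5, prior=0.3))    # ~0.02
print(predictive_plus_minus(apm=-1.0, herding_penalty=0.0, disc_pollcount=100, prior=0.3))  # ~-0.80
```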
