The article that I will be talking about today is How the Economist presidential forecast works by G. Elliott Morris. I have always wanted to know more about how American presidential forecasts work, and how reliable they are. This is my attempt to understand it. Note that the authors have written code for their statistical model, which they have posted here.

## Poll position

How does one predict who will win the Presidential election? Simple. Randomly select a group of people from amongst the population, and note their voting preferences. If this selection process is unbiased and selects a large enough group of people, you should have a good indicator of who will win. Right? Unfortunately, this is not the full picture.

The time at which this poll is conducted is of great importance. At the beginning of the election year, a lot of people are undecided. “I don’t want to choose between two bad options, I’d rather go for a third party”. However, as the election draws closer, people often return to their inherent preferences for either the Democrats or the Republicans. Hence, polls conducted at the beginning of the year are much less accurate than polls conducted right before the election. For example, even as late as June 1988, George HW Bush was trailing his opponent by 12 percentage points in polling averages. He went on to win by 8 points just five months later. Of course, national polls taken close to the election can also be unreliable. Hillary Clinton was leading Donald Trump by 8 percentage points as late as October 2016. She won the popular vote by only 2 points (and of course lost the election). For a fascinating explanation of the electoral college, and how a candidate can lose the election despite winning the popular vote, watch this.

So if national polls are not completely reliable (at least the ones conducted in the early stages), how can one predict the election? A lot of things, like market stability, global business, and even the stability of foreign governments, ride on being able to predict the American presidential election successfully. Hence, political theorists have put a lot of thought into it. It turns out that there are some “fundamentals” that predict the election outcome better than polls. The “fundamentals” that we are concerned with here are the state of the economy, the state of the country, etc. One such model that uses “fundamentals” is “Time for Change”, developed by the political scientist Alan Abramowitz. It predicts the election outcome using GDP growth, the president’s net approval rating, and whether the incumbent is running for re-election. The error margins for this model have historically been comparable to those of polls taken late in the election season, and in 1992 it did a better job of predicting the election than national polls.

## Something simple, please

To develop a prediction model using “fundamentals”, we have to choose the factors that are important in determining the election outcome. In selecting these factors using the given data, we might select factors that “seem” important, given the limited data, but do not really matter in predicting elections. This pitfall is known as **overfitting**, and can introduce substantial error into our predictions. To mitigate it, we borrow two techniques from machine learning: “elastic-net regularization” and “leave-one-out cross-validation”. It is heartening to see that although statistics heralded the machine learning revolution, new insights into how machines learn have also started changing the world of statistics.

Elastic-net regularization is the process of “shrinking” the impact of the factors we consider in our model. The mantra is that simpler equations do a better job of predicting the future than more convoluted ones. Hence, we may reduce the weights of the various factors we are considering, or remove the weak ones entirely. But how does one know by how much to reduce the weights of these factors, or which factors to remove completely? For this, we use **leave-one-out cross-validation**. We leave out one part of our data set and train the model on the remainder, using a pre-determined “shrinkage” algorithm to reduce the weights of certain factors or remove them completely. We then test whether the model is able to predict the correct outcome for the left-out data. For instance, if we are training an election prediction model on data from 1952 to 2016, we leave out the data from 1952, and train our model on all the other election years to identify relevant factors and assign weights to them. Then we feed the data for 1952 into the model and see if it is able to predict the election result correctly. In the next iteration, we leave out 1956, and run the same procedure. After we have run this procedure for all election years, we change the “shrinkage” algorithm and run the whole process all over again, for a total of 100 times. We select the shrinkage algorithm that is the most successful on average.
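The loop described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the authors’ code: the data is synthetic (17 made-up “elections” with three made-up factors, of which only the first truly matters), the function names are hypothetical, and I try five shrinkage strengths rather than 100 to keep it short.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; weak values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net(X, y, lam, l1_ratio=0.5, n_iter=200):
    """Coordinate descent for the elastic-net objective
    (1/2n)*sum((y - b0 - X.b)^2) + lam*(l1_ratio*|b|_1 + (1-l1_ratio)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    intercept = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # how strongly factor j tracks the leftover error
            zj = 0.0   # mean squared value of factor j
            for i in range(n):
                others = sum(coefs[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - intercept - others)
                zj += X[i][j] ** 2
            rho /= n
            zj /= n
            coefs[j] = soft_threshold(rho, lam * l1_ratio) / (zj + lam * (1 - l1_ratio))
        residuals = [y[i] - sum(coefs[k] * X[i][k] for k in range(p)) for i in range(n)]
        intercept = sum(residuals) / n
    return intercept, coefs

def loo_cv_error(X, y, lam):
    """Leave each election out once, train on the rest, score the held-out one."""
    squared_errors = []
    for held_out in range(len(X)):
        X_train = X[:held_out] + X[held_out + 1:]
        y_train = y[:held_out] + y[held_out + 1:]
        b0, b = fit_elastic_net(X_train, y_train, lam)
        prediction = b0 + sum(bk * xk for bk, xk in zip(b, X[held_out]))
        squared_errors.append((y[held_out] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Synthetic data (an assumption): only the first factor drives the vote share.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(17)]
y = [50 + 2.0 * row[0] + random.gauss(0, 1) for row in X]

candidate_lams = [0.01, 0.1, 0.3, 1.0, 3.0]
cv_errors = {lam: loo_cv_error(X, y, lam) for lam in candidate_lams}
best_lam = min(cv_errors, key=cv_errors.get)
print("best shrinkage strength:", best_lam)
```

The soft-threshold step is what lets the elastic net remove weak factors entirely instead of merely shrinking them, and the cross-validation loop is what picks how aggressive that shrinkage should be.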

The “shrinkage” algorithm that the authors found after running this procedure was pretty close to Alan Abramowitz’s model. Some small differences were that the authors applied a penalty to parties that had been in power for two previous terms, and used a cocktail of economic indicators like real disposable income, non-farm payrolls, the stock market, etc., rather than just second-quarter GDP growth. Interestingly, they found that these economic factors have become less important in predicting elections as the voter base gets more polarized. Hence, ideology has slowly come to trump economic concerns, which is a worrying indicator of major ideological upheaval in the coming years.

Of course, economic indicators are important in the 2020 elections. The pandemic has caused a major economic depression, which is likely to reverse quickly once the health scare is mitigated. The authors see fit to assign a weight to economic factors that is 40% higher than that assigned to economic factors during the 2008-2009 Great Recession.

The authors find that their “fundamentals” model does exceedingly well in back-testing, and in fact better than both early polls and late polls.

When the authors try to include polls in the model to possibly make it even more accurate, the machine learning algorithms they use decline to use early polls, and only incorporate polls conducted very close to the actual election.

## There’s no margin like an error margin

Suppose the model predicts that Biden will win 51.252% of the vote. The probability of the actual election result being exactly this is effectively zero. The most important information that a model produces is the **uncertainty** estimate around that prediction. If the model predicts that Biden will get between 50% and 53% of the vote with 95% certainty, we can be pretty sure that Biden will win the popular vote.

To calculate these ranges of outcomes, we use a beta distribution, which is essentially like the normal distribution, but for values between 0 and 1. The width of the beta distribution can be made larger or smaller, increasing or decreasing the uncertainty of the model’s prediction. If the beta distribution is wide, the margin of error is large. If the margin of error (half the width of the 95% confidence interval) is, say, 10 percentage points, then a candidate predicted to win 52% of the vote has a 2.5% chance of getting less than 42% of the vote, and a 2.5% chance of getting more than 62%. Hence, in closely contested elections, predictions with such wide distributions say little about who will actually win.
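To get a feel for these numbers, here is a rough sketch that builds a beta distribution with a mean of 52% and a margin of error of about 10 points, then draws from it. The conversion from mean and standard deviation to the two beta parameters is the standard method-of-moments formula; treating the margin of error as 1.96 standard deviations is my own normal-style approximation, and the function name is hypothetical.

```python
import random

def beta_params(mean, sd):
    """Method-of-moments beta parameters for a given mean and standard deviation."""
    k = mean * (1 - mean) / sd ** 2 - 1
    if k <= 0:
        raise ValueError("sd is too large for a beta distribution with this mean")
    return mean * k, (1 - mean) * k

random.seed(1)
mean_share = 0.52
sd = 0.10 / 1.96  # assumption: a 10-point margin of error is about 1.96 sd
a, b = beta_params(mean_share, sd)

# Draw many possible vote shares and read off the interval empirically.
draws = sorted(random.betavariate(a, b) for _ in range(100_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
p_below_half = sum(d < 0.5 for d in draws) / len(draws)

print(f"95% interval: {lower:.3f} to {upper:.3f}")
print(f"chance of falling below 50%: {p_below_half:.3f}")
```

Even though the central estimate is a win, a distribution this wide leaves a substantial chance of the candidate ending up below 50%.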

## Modeling uncertainty

How does one model uncertainty though, now that we’ve calculated the correct amount of “shrinkage”? We again use elastic-net regularization and leave-one-out cross-validation. The size of the uncertainty also depends on certain parameters, and these parameters can be determined using the same two techniques. Uncertainties, in the authors’ model, are smaller closer to the election, in polarized elections, when there’s an incumbent running for re-election, and when economic conditions are similar to the long-term average. For instance, 11 months before the election in 1960, when the economy was unusually buoyant and the incumbent was set to retire, the 95% confidence interval of the Republican vote share was quite large: from 42.7% to 62.4%. However, in 2004, with George W Bush seeking re-election, when the economy was in a stable state and the electorate was polarized, the 95% confidence interval of Bush’s vote share was from 49.6% to 52.6%. He ended up getting 51.1%, which was almost identical to the authors’ prediction.

## Moral victories are for losers

Winning the popular vote does not guarantee winning the election. Hillary Clinton famously won the popular vote by 2 points, but still lost the election to Donald Trump. The election outcome depends upon the “electoral college”, through which states, rather than people, do the voting. The authors, in trying to predict national election outcomes, choose to forecast a state’s “partisan lean” rather than the actual state election outcome. “**Partisan lean**” can be defined as how much a state favors Democrats or Republicans as compared to the whole nation, and hence how it would be expected to vote in the event of a national tie.

Why would we forecast partisan lean instead of the actual outcome though? This is because partisan lean is a more stable predictor of the actual voting outcome in the state. For instance, in the early stages, our model might predict that the Democrats will win Minnesota with 52% of the vote. However, as the election approaches, events might cause the voting pattern across the whole nation to shift to the Republican side, including in Minnesota. Hence, although our model would predict that Minnesota would vote Democrat, events might transpire such that Minnesota eventually votes Republican. However, if one forecasts partisan lean, this lean towards a particular party will remain unchanged even if national events cause voters to swing, as long as this swing is spread uniformly across all states. Hence, partisan lean is a better predictor of the eventual election outcome.
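The Minnesota example can be written out as simple arithmetic. The numbers below are made up for illustration (two-party shares in percentage points), and `partisan_lean` is a hypothetical helper, not something from the authors’ code.

```python
def partisan_lean(state_dem_share, national_dem_share):
    """Lean in percentage points: positive means more Democratic than the nation."""
    return state_dem_share - national_dem_share

# Hypothetical early forecast: Minnesota at 52% Democratic, the nation at 51%.
lean_early = partisan_lean(52.0, 51.0)

# A uniform 3-point national swing toward the Republicans moves both numbers.
swing = -3.0
minnesota_late = 52.0 + swing  # the state now falls below 50% and votes Republican
national_late = 51.0 + swing
lean_late = partisan_lean(minnesota_late, national_late)

# The state outcome flipped, but the lean is the same before and after the swing.
print(lean_early, lean_late)
```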

To produce central estimates of partisan lean, the authors use a variety of parameters like the state’s partisan lean during the previous two elections, the home states of the presidential candidates, the actual national popular vote, etc. But how do we use this analysis for 2020? The national popular vote has not even taken place yet. In this case, we can take the various possible outcomes, calculate the partisan lean based on these numbers, and then attach a weight to each based on the probability of that outcome. For instance, if there’s a 10% chance of Trump getting 52% of the vote and Biden getting 45% of the vote, we plug those numbers into the algorithm to calculate each state’s partisan lean, and then attach a weight of 0.10 to it.
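A minimal sketch of this weighting follows. Everything in it is an assumption for illustration: the three scenarios and their probabilities are invented, and `lean_given_national` is a hypothetical stand-in for whatever regression the authors actually use to turn a national result into a state’s lean.

```python
# (probability, national Democratic margin in points): hypothetical scenarios.
scenarios = [
    (0.10, -7.0),  # e.g. Trump 52%, Biden 45%
    (0.30, 0.0),   # a national tie
    (0.60, 6.0),   # a comfortable Democratic win
]

def lean_given_national(national_margin, previous_lean, sensitivity=0.1):
    """Hypothetical rule: the lean drifts slightly with the national margin."""
    return previous_lean + sensitivity * national_margin

# Weight the lean computed under each scenario by that scenario's probability.
expected_lean = sum(
    prob * lean_given_national(margin, previous_lean=1.5)
    for prob, margin in scenarios
)
print(round(expected_lean, 2))
```

The design point is simply that the unknown national result is never fixed to one value; each plausible value contributes in proportion to how likely it is.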

## Bayesian analysis

The principle of Bayesian statistics is pretty powerful. First assume that a certain hypothesis or “prior” is true. Now study the actual real-world data, and calculate the probability of observing that data, assuming that your prior is true. If the probability is low, discard your prior and choose one for which the real-world data is likely.
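In practice, rather than discarding a hypothesis outright, Bayes’ rule reweights every hypothesis by how likely the data is under it. Here is a small sketch with three competing hypotheses about how biased the polls are. The observed poll errors, the normal likelihood and its spread of 2 points are all assumptions I chose for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, used here as the likelihood."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def update(beliefs, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: beliefs[h] * likelihoods[h] for h in beliefs}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Hypotheses: polls are biased by -2, 0 or +2 points, each equally likely at first.
beliefs = {-2.0: 1 / 3, 0.0: 1 / 3, 2.0: 1 / 3}

# Hypothetical observed poll errors, in points.
observed_errors = [1.5, 2.5, 1.0]

for error in observed_errors:
    likelihoods = {h: normal_pdf(error, mean=h, sd=2.0) for h in beliefs}
    beliefs = update(beliefs, likelihoods)

for hypothesis, probability in beliefs.items():
    print(f"bias {hypothesis:+.0f}: {probability:.3f}")
```

After seeing the data, most of the belief moves to the hypothesis under which that data was most likely.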

How does Bayesian analysis fit in here though? Don’t we already have a model? Why don’t we just plug in all the data and get a prediction already? This is because poll numbers are unreliable, and often have a bias. We can remove this bias by using Bayesian analysis.

What are some sources of errors while taking polls? One source of error is **sampling error**, in which the sample chosen is not representative of the whole population. For instance, in a population where half the people vote Democrat and the other half vote Republican, choosing a group of people who are mostly Republican will give us a biased election forecast.

However, there are other, **non-sampling errors** too. Even if the sample we select is representative of the population, not all of the people who are polled are eligible to vote, or will actually go out and vote even if they are eligible. Polls that try to correct for these and other non-sampling errors also don’t do a very good job of it, and inevitably introduce a bias. The model developed by the authors corrects for these biases by comparing the predictions of polls that have a bias towards the Democrats, and others that are biased towards the Republicans. Simplistically speaking, both of these kinds of biases cancel out.

There is another source of error that is more subtle: **partisan non-response**. Let me illustrate that with an example. Given the generally negative media coverage of Donald Trump across multiple news outlets, many Republican voters will not agree to be surveyed at all. They might be scared of social ridicule should they voice their support for Trump, and probably don’t want to lie that they support Biden. Hence, any sample that polling organizations can construct will have a pro-Biden bias. However, this might change if the overall news coverage of Trump becomes more favorable. This introduces a lot of uncertainty into any forecasting model. The authors correct for partisan non-response by separating all polls into two groups: those that correct for partisan non-response, and those that don’t. Then they observe how the predictions given by these polls change every day. The difference in prediction between the two types of polls can be attributed to partisan non-response, and the authors can then incorporate this difference into their model.
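As a toy version of this comparison, the sketch below splits a handful of invented polls into the two groups and measures the gap between their averages on each day. The poll numbers and the `non_response_gap` helper are hypothetical; the authors’ model estimates this effect jointly with everything else rather than by simple averaging.

```python
from statistics import mean

# (day, adjusts_for_party, democratic_margin_in_points): hypothetical polls.
polls = [
    (1, True, 6.0), (1, False, 8.0), (1, False, 9.0),
    (2, True, 6.5), (2, True, 5.5), (2, False, 10.0),
]

def non_response_gap(polls, day):
    """Average margin of unadjusted polls minus that of adjusted polls on one day."""
    adjusted = [m for d, adjusts, m in polls if d == day and adjusts]
    unadjusted = [m for d, adjusts, m in polls if d == day and not adjusts]
    return mean(unadjusted) - mean(adjusted)

gaps = {day: non_response_gap(polls, day) for day in (1, 2)}
print(gaps)
```

A gap that grows from one day to the next would suggest that one side’s supporters have become less willing to answer pollsters.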

However, what about the far-flung states that are not polled as often as other, always in-the-news states? How do we reliably predict how these states will vote? The authors conclude that neighboring states, with similar geographies and demographics, vote similarly. Hence, if we have a forecast for Michigan but not for Wisconsin, we can reasonably assume that Wisconsin is likely to have a similar polling result to Michigan. Given below is the correlation matrix for various states.
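One simple way to see how a correlation lets us borrow information from a neighbor is the conditional mean of a two-variable normal distribution. This is my own illustration with invented numbers (including the correlation of 0.8), not the authors’ actual calculation.

```python
import math

def infer_from_neighbor(own_mean, own_sd, neighbor_mean, neighbor_sd,
                        neighbor_observed, correlation):
    """Conditional mean and sd of one state given its neighbor (bivariate normal)."""
    shift = correlation * own_sd / neighbor_sd * (neighbor_observed - neighbor_mean)
    conditional_sd = own_sd * math.sqrt(1 - correlation ** 2)
    return own_mean + shift, conditional_sd

# Hypothetical: we expected Wisconsin at 50 and Michigan at 51 (both with sd 4),
# then Michigan polls come in at 54.
wisconsin_mean, wisconsin_sd = infer_from_neighbor(
    own_mean=50.0, own_sd=4.0,
    neighbor_mean=51.0, neighbor_sd=4.0,
    neighbor_observed=54.0, correlation=0.8,
)
print(f"Wisconsin estimate: {wisconsin_mean:.1f} +/- {wisconsin_sd:.1f}")
```

The stronger the correlation, the more a surprise in Michigan moves the Wisconsin estimate, and the more it narrows the uncertainty around it.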

## Bayes-in the model

Let us now put all the pieces together. The authors use an extension of a technique first published by the political scientist Drew Linzer, which relies on the Markov Chain Monte Carlo (MCMC) method. What does it do? It performs a random walk.

Let me illustrate this with an example. Let us choose the prior that polls are biased towards the Democrats by about 5%. Also, we know the partisan lean for Michigan in the month of June. In the coming days until the election, Michigan can swing Republican, swing Democrat, or stay the same. Because of our prior, however, we have to assign different probabilities to Michigan swinging Republican or Democrat (or staying the same). We perform this random walk every day for Michigan until the election, to get a prediction for how Michigan will vote, assuming our prior is true.

Now we may assume a different prior: that polls over-estimate Republican chances by 2%. We again perform a random walk for each state, including Michigan. The authors take 20,000 such priors, and perform random walks for various states. They now calculate each candidate’s chances of winning as the total number of priors which led to them winning, divided by the total number of priors.
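To make the counting concrete, here is a heavily simplified sketch for a single state. It is plain Monte Carlo simulation rather than the MCMC fitting the authors use, and every number in it (the current margin, the daily drift, the spread of possible poll biases) is an assumption chosen for illustration.

```python
import random

def simulate_final_margin(current_margin, days_left, daily_sd, poll_bias, rng):
    """Correct today's margin for an assumed poll bias, then random-walk to election day."""
    margin = current_margin - poll_bias
    for _ in range(days_left):
        margin += rng.gauss(0, daily_sd)  # small random daily movement
    return margin

rng = random.Random(42)
n_simulations = 20_000
democratic_wins = 0

for _ in range(n_simulations):
    # Each simulation assumes a different poll bias (positive = polls favor Democrats).
    poll_bias = rng.gauss(0, 2.0)
    final_margin = simulate_final_margin(
        current_margin=5.0,  # hypothetical Democratic lead today, in points
        days_left=100,
        daily_sd=0.3,
        poll_bias=poll_bias,
        rng=rng,
    )
    if final_margin > 0:
        democratic_wins += 1

win_probability = democratic_wins / n_simulations
print(f"Democratic win probability: {win_probability:.3f}")
```

The win probability is simply the share of simulated futures in which the candidate comes out ahead, which is the same counting rule described above.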

Using this model, the authors predict a comfortable win for **Biden**.
