We are in the final months of the 2016 US presidential election, and people have gone a little nuts analyzing and overanalyzing the various election forecasts out there on the internets. I want to take a minute and explore what exactly the election forecasters are doing, using silly toy examples.

Conveniently it is also the presidential election season of the small island nation of *Nove*, which has nine states: uma, duas, três, quatro, cinco, seis, sete, oito, and nove (where the capital city is). Naturally *Nove* uses the most rational and freedom-loving of election systems: the electoral college; however, each state has only 1 electoral college vote. Each state also has a population of 10,000. This is a small country after all.

The country is divided into two types of people: those who think Circles are the best shape and those who think Squares are the best shape. This otherwise perfectly rational nation of math enthusiasts is fiercely divided on this issue and it has become a campaign issue in this presidential election. Of the two candidates, Águeda Roundrect and Gervasio Pentágono, Roundrect is expected to dominate amongst the Square fanatics, whereas Pentágono has been fervently campaigning on a platform of the perfection of circles and the mathematical properties of $\pi$. This is such an issue that Circle preference is tracked nationally, in the census.

```
State    Proportion of Circle Lovers
uma      0.5054
duas     0.4956
três     0.5001
quatro   0.5060
cinco    0.4989
seis     0.5024
sete     0.5036
oito     0.4951
nove     0.5101
```

Of course, being a country of rational math enthusiasts, the people of *Nove* are very interested in election forecasting and statistics. To help them out, I have done a small survey, sampling 50 people from each of the nine states and asking who they are going to vote for (and happily the people of *Nove* are very honest, so they tell the truth). The results of the survey, based purely on aggregation, suggest that Pentágono is favoured to win, leading in 4 of 9 states, with the 3 states in the middle of the island in a dead heat.

I can go a little further and try to weight the *observed* data using the census. During the survey I asked whether or not the Circle was the respondent's preferred shape, so I can use that data to try to map my survey results (which likely oversampled Circle enthusiasts in some states and undersampled them in others, just randomly) to the known demographics from the national census.

Naively I am going to simply weight the data like so:

$$ \text{Vote for Roundrect} = w_{circle} \cdot y_{circle} + w_{square} \cdot y_{square}$$

where *w* is the weight from the census data, and *y* is the observed proportion of votes for Roundrect, per state.
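As a sketch, the weighting for a single state might look like the following. The numbers here are made up for illustration, not taken from the survey:

```python
# Hypothetical per-state polling results (not the article's data):
# proportion of surveyed Circle fans voting Roundrect, likewise for Squares.
y_circle = 4 / 27    # e.g. 4 of 27 polled Circle enthusiasts say Roundrect
y_square = 21 / 23   # e.g. 21 of 23 polled Square enthusiasts say Roundrect

# Census weights: the known share of Circle lovers in this state.
w_circle = 0.5060
w_square = 1.0 - w_circle

# Post-stratified estimate of Roundrect's vote share in the state.
vote_roundrect = w_circle * y_circle + w_square * y_square
print(round(vote_roundrect, 4))
```

Even though the raw poll is lopsided within each shape camp, the census weights pull the combined estimate back toward the state's true demographic balance.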

Bad news for the Pentágono campaign! Their victory is no longer so assured. It looks like many of those strong majorities have evaporated, and two of the previously deadlocked states are now favouring Roundrect!

As an aside, The Upshot dealt with exactly this sort of thing in an article where they gave the same polling data to several different pollsters and, depending on how each sliced and weighted the results using demographic data, they ended up getting different final answers. Polling requires a lot of assumptions and modeling to get good answers out the end. But also keep in mind that different polls will get different answers because *they ask different people*; a poll is just a sample of the voting population, after all, and there is variability inherent to sampling.

So this has been a very short course in polling, but how do organizations like FiveThirtyEight and The Upshot end up with their election predictions? For one, they aggregate many polls together and average polls over time. The basic idea is that every poll is wrong, but each is wrong in its own special way, so if you average them all together the errors in the polls will tend to cancel out, leaving a *more* correct result. For another, they run simulations to really capture the uncertainty. Even with all the aggregations and regressions in the world there is still uncertainty in how the voting public will actually vote.
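The error-cancelling intuition behind poll averaging can be sketched with a toy simulation. Every number here is hypothetical: each imaginary pollster samples 50 voters and also carries its own small "house" bias, yet the average of many such polls lands close to the true support level:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical true Roundrect support in the population.
true_p = 0.502

# 100 pollsters, each with its own small house bias (methodology quirks).
biases = rng.normal(0, 0.01, size=100)

# Each pollster samples 50 voters from a population shifted by its bias.
polls = [rng.binomial(50, true_p + b) / 50 for b in biases]

print(min(polls), max(polls))   # individual polls scatter widely
print(np.mean(polls))           # the average sits much closer to true_p
```

Any single poll can easily miss by several points, but the average of all of them is a far tighter estimate; this is the whole point of poll aggregation.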

The election maps I've shown so far only represent *point estimates* from a small sample of the population. There is no representation of the *uncertainty* in those estimates. By building a model that incorporates uncertainty, we can start to make probabilistic claims about who will win the election. Usually these models end up being very complicated and simulations are the only practical way of getting the math done.

As a simple model, suppose that there is some proportion of voters *p* who will vote for Roundrect. In other words, if we poll *n* people at random then on average *np* people will say they will vote for Roundrect, but any given poll is unlikely to land on *exactly* that number. In fact the count of Roundrect supporters in the sample will follow a Binomial distribution.
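A quick simulation shows this scatter around *np*; the values of `n`, `p`, and the seed below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose p = 0.51 of voters favour Roundrect and each poll asks n = 50 people.
n, p = 50, 0.51

# Simulate many polls: each draw is the count of Roundrect supporters.
counts = rng.binomial(n, p, size=100_000)

print(counts[:10])     # individual polls bounce around
print(counts.mean())   # but the long-run average hovers near n*p = 25.5
```

Each simulated poll is a whole number between 0 and 50, but across many polls the mean converges on *np*.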

Furthermore we can model our *estimate* of *p* using a Beta distribution; there are lots of hand-waving reasons why, but the main reason is that it makes the math easier. It makes the math easier in a really convenient way, actually: supposing I use a uniform prior $\text{Beta}(1,1)$ -- that is, I walk in with complete ignorance of *p* and give equal probability to any number between 0 and 1 -- then after polling *n* people with *k* of them deciding to vote Roundrect, my belief about *p* is best captured by $\text{Beta}(1+k,1+n-k)$.
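That update rule can be sketched directly; the poll numbers here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical poll: n people asked, k say they will vote Roundrect.
n, k = 50, 27

# Uniform prior Beta(1, 1) updated by the poll gives Beta(1 + k, 1 + n - k).
posterior = rng.beta(1 + k, 1 + n - k, size=100_000)

# The posterior mean is (1 + k) / (2 + n); sampled draws agree closely.
print(posterior.mean())   # near 28/52, i.e. about 0.538
```

Each draw from this posterior is one plausible value of *p* consistent with the poll, which is exactly what the simulations below need.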

At this point it doesn't really matter what the nitty gritty details of the Beta distribution or the Binomial distribution *are*, since they are provided by software packages. What I am going to do is: for every plausible value of *p* given the polling data I have, do a trial "election" and see how that works out. If I do this enough times I can get a sense of what are the possible election outcomes, given my uncertainty in *p*.

In terms of math I can write this:

$$ p_{Circles} \sim \text{Beta} ( 1 + \text{# Circles polled Yes}, 1 + \text{# Circles polled No} ) $$

$$ p_{Squares} \sim \text{Beta} ( 1 + \text{# Squares polled Yes}, 1 + \text{# Squares polled No} ) $$

$$ \text{Votes for Roundrect} \sim \text{Bin}(n_{Circles}, p_{Circles}) + \text{Bin}(n_{Squares}, p_{Squares}) $$

Or as code, do the following:

For each state:

1. Randomly sample $p_{Circles}$ and $p_{Squares}$ from their probability distributions
2. Randomly sample from the Binomial distributions, given $p_{Circles}$ and $p_{Squares}$

Add up the results and repeat.

Below is a table of the polling results combined with the census data; e.g. in the state Cinco only 4 of the 27 polled Circle enthusiasts would vote for Roundrect, and there are a total of 5,060 Circle enthusiasts in the state.

State | Circle | Yes | No | Population
---|---|---|---|---
cinco | 0 | 21 | 2 | 4940
cinco | 1 | 4 | 23 | 5060
duas | 0 | 23 | 3 | 5044
duas | 1 | 2 | 22 | 4956
nove | 0 | 22 | 4 | 4946
nove | 1 | 4 | 20 | 5054
oito | 0 | 20 | 2 | 4999
oito | 1 | 5 | 23 | 5001
quatro | 0 | 25 | 1 | 5049
quatro | 1 | 1 | 23 | 4951
seis | 0 | 21 | 3 | 4976
seis | 1 | 2 | 24 | 5024
sete | 0 | 21 | 1 | 4964
sete | 1 | 1 | 27 | 5036
três | 0 | 19 | 2 | 4899
três | 1 | 2 | 27 | 5101
uma | 0 | 20 | 2 | 5011
uma | 1 | 3 | 25 | 4989

Using that table of data I generated 10,000 trial elections, based on the model I gave above. The code and the first few rows of the simulated results are below.

```
import numpy.random as rnd
import pandas as pd

def model(row):
    # Draw a plausible support level p from the Beta posterior for this
    # (State, Circle) group, then simulate the whole group voting at that p.
    p = rnd.beta(1 + row['Yes'], 1 + row['No'])
    return rnd.binomial(row['Population'], p)

# df is the polling/census table above, indexed by State and Circle
sim = []
for i in range(10000):
    series = df.apply(model, axis=1).groupby(level='State').sum()
    sim.append(series)
sim = pd.DataFrame(sim)
```

State | cinco | duas | nove | oito | quatro | seis | sete | três | uma
---|---|---|---|---|---|---|---|---|---
0 | 4834 | 5227 | 5097 | 4350 | 5242 | 5099 | 5261 | 5542 | 5683
1 | 5374 | 4908 | 3709 | 5734 | 5184 | 4843 | 5247 | 4030 | 4455
2 | 5557 | 4979 | 4752 | 5482 | 5003 | 4485 | 4961 | 4366 | 5083
3 | 4706 | 5323 | 4498 | 5862 | 5610 | 4883 | 4957 | 4374 | 5642
4 | 5626 | 4452 | 5004 | 5286 | 5001 | 4836 | 4699 | 5199 | 5595

Plotted as box-plots, it is pretty clear that this will be a close election. Most states are pretty close to an even split.

Going further, the distribution of electoral college votes also shows this is a *very* close election. There is only a 49.4% chance of Roundrect winning!
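For illustration, tallying a win probability out of a table of trial elections could look like the following sketch. The `sim` table here is synthetic stand-in data (three states, coin-flip voters), not the simulations above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Toy stand-in for the simulation table: rows are trial elections,
# columns are states, values are votes for Roundrect (synthetic data).
states = ['uma', 'duas', 'tres']
pop = 10_000
sim = pd.DataFrame(rng.binomial(pop, 0.5, size=(1000, 3)), columns=states)

# A state is won with a majority of its 10,000 votes; each state is
# worth one electoral college vote, so count states won per trial.
ec_votes = (sim > pop / 2).sum(axis=1)

# Roundrect wins a trial election with a majority of the electoral votes.
p_win = (ec_votes >= 2).mean()
print(p_win)
```

The fraction of trial elections won is the Monte Carlo estimate of the win probability; with evenly split toy voters it comes out very near 50%.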

In fact I can reveal the true election data, which I generated at the very beginning, wherein Roundrect only won 3 Electoral College Votes, capturing 49.82% of the popular vote. Her dreaded nemesis Gervasio Pentágono won 6 Electoral College Votes and only 50.18% of the popular vote. This is also a reminder of how first-past-the-post election systems can lead to crazy results, where the number of electoral college votes won doesn't match the popular vote.

But let's back up and think about what all that election analysis told us. Our first naive look at the survey told us that Pentágono was likely to win, but it predicted much more decisive wins in some states than occurred in reality (in reality the differences between states were only 1% or 2%). The weighted survey results were closer in this regard, but they predicted that Roundrect would win. We can put the two survey approaches side by side with the actual results and... the surveys only predicted about half of the state results correctly. So, not great.

The simulation was somewhat more meaningful: it put the most probability behind Roundrect winning Cinco and Oito, which she did, but it's hard to assess this because the real result of the simulations was that the election was simply too close to call. Which, in fact, it was. By design, actually.