James Ball sent me the data for the Russian election vote counts this morning and asked me to test whether it deviates from Benford’s law, a test that can give a hint at whether numbers are the product of fraud. Posted below is my analysis, and also a check for last digit preference, which is another method for spotting sneakiness.
You might remember Benford’s Law from this post. It’s a good way to check if data has been faked, and was used on shady Greek economic data submitted to the EU:
Imagine you have the data on, say, the population of every country in the world. Now, take only the “leading digit” from each number: the first number in the number, if you like. For the UK population, which was 61,838,154 in 2009, that leading digit would be “six”. Andorra’s was 85,168, so that’s “eight”. And so on.
If you take all those leading digits, from all the countries, then overall, you might naively expect to see the same number of ones, fours, nines, and so on. But in fact, for naturally occurring data, you get more ones than twos, more twos than threes, and so on, all the way down to nine. This is Benford’s law: the distribution of leading digits follows a logarithmic distribution, so you get a “one” most commonly, appearing as first digit around 30% of the time, and a nine as first digit a little under 5% of the time.
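To make those percentages concrete, here is a minimal Python sketch (not the Stata analysis used for this post). It assumes the usual statement of Benford's law, that digit d leads with probability log10(1 + 1/d). The function names are made up for illustration.

```python
import math


def leading_digit(n: int) -> int:
    """Return the first (most significant) digit of an integer."""
    return int(str(abs(n))[0])


def benford_expected(d: int) -> float:
    """Expected share of leading digit d (1 to 9) under Benford's law."""
    return math.log10(1 + 1 / d)


# The two populations quoted in the text
print(leading_digit(61838154))  # UK, 2009
print(leading_digit(85168))     # Andorra

# Expected share of each leading digit
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

The nine shares add up to one, as they must, since every positive number has exactly one leading digit.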
The next time you’re waiting for a bus, you can think about why this happens (bear in mind what leading digits do when quantities repeatedly double, perhaps). Reality agrees with this theory pretty neatly: if you go to the website testingbenfordslaw.com you’ll see the proportions of each leading digit from lots of real-world datasets, graphed alongside what Benford’s law predicts they should be, with data ranging from Twitter users’ follower counts to the number of books in different libraries across the US.
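If you would rather not wait for the bus, the doubling hint can be tried directly. This is only a sketch: the starting value of 1 and the 1,000 doublings are arbitrary choices, and the helper name is hypothetical.

```python
from collections import Counter


def leading_digit_shares(values):
    """Return the share of each leading digit (1 to 9) among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Start at 1 and keep doubling: 1, 2, 4, 8, 16, ...
doublings = [2 ** k for k in range(1000)]
shares = leading_digit_shares(doublings)

for d in range(1, 10):
    print(d, f"{shares[d]:.1%}")
```

A quantity that keeps doubling spends longer with a leading one than with a leading nine, which is why the ones pile up.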
Benford’s law doesn’t work perfectly. For example, it only works when you’re examining groups of numbers that span several orders of magnitude: for the age, in years, of the graduate working population, which goes from around 20 to 70, it wouldn’t be much good, but for personal savings, from nothing to millions, it should work fine. And of course, Benford’s law works in other counting systems, so if three-fingered sloths ever develop numeracy, and count in base-6, or maybe base-12, the law would still hold.
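The ages example is easy to check by brute force. The sketch below assumes a hypothetical population with exactly one person at each age from 20 to 70, which is obviously not real data, and simply counts leading digits.

```python
from collections import Counter


def leading_digit_counts(values):
    """Count how often each leading digit appears among positive integers."""
    return Counter(int(str(v)[0]) for v in values)


# Hypothetical: one person at every age from 20 to 70 inclusive
ages = range(20, 71)
counts = leading_digit_counts(ages)

for d in range(1, 10):
    print(d, counts.get(d, 0))
```

Run as written, this finds no leading ones at all, and the digits two to six tie at ten apiece. That is nothing like Benford's distribution, and nobody has faked anything.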
I analysed the vote counts for signs of fraud using the stats package Stata, beloved of economists and epidemiologists. James has posted graphs from the analysis I did on his blog, where you can also (I assume!) download the data he sent me:
The bottom line is this: the data massively do not conform to Benford’s Law, which would make you think that fraud is at work; but I think we shouldn’t expect the data to conform to Benford’s law, as the numbers don’t span several orders of magnitude. Essentially, this tool won’t work on these numbers. Anyway, here are the graphs and tables. Firstly, a graph showing the leading digit for all vote counts, in blue, against Benford’s distribution in red.
But here’s the distribution of the vote counts, not spanning multiple orders of magnitude:
Here are the tables with the Benford figures, in case you want them, and the second table reports a chi-squared test for whether the real figures deviate from Benford’s distribution (they do, but we could have guessed that from eyeballing the data!):
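For anyone who wants to see what a chi-squared test against Benford's distribution involves, here is a rough Python sketch. It is not the Stata routine used above. The counts are invented purely for illustration and are not the Russian election figures. The function name is hypothetical, and 15.507 is the standard 5% critical value for 8 degrees of freedom.

```python
import math


def benford_chi_squared(observed):
    """Chi-squared statistic for leading-digit counts against Benford's law.

    observed maps each leading digit (1 to 9) to how often it was seen.
    """
    total = sum(observed.get(d, 0) for d in range(1, 10))
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat


# 5% critical value of the chi-squared distribution, 8 degrees of freedom
CRITICAL_5PCT_DF8 = 15.507

# Invented counts, for illustration only (NOT the election data)
made_up = {1: 120, 2: 130, 3: 140, 4: 150, 5: 140, 6: 120, 7: 90, 8: 60, 9: 50}
stat = benford_chi_squared(made_up)
print(f"chi-squared = {stat:.1f}, deviates at 5% level: {stat > CRITICAL_5PCT_DF8}")
```

A statistic above the critical value only says the digits don't look like Benford's. It says nothing about why, which is the whole problem with data that don't span several orders of magnitude.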
Lastly, I looked at last digit preference, which is a crude tool for looking at whether someone has made up figures. If they’re all made up by the same human, you might expect them to show some preference for specific digits, as humans are quite bad at generating random numbers. There’s no evidence of digit preference in the data. You might say that’s not surprising: if there was fraud, it would not have been centralised. In any case, I’m only really posting this as an illustration of how you can take data and play around with it.
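A last digit check can be sketched in much the same way, this time in Python rather than Stata. The assumption is that honest last digits should be roughly evenly spread across 0 to 9, which is only reasonable when the counts are comfortably large. Both lists below are invented and the function name is hypothetical.

```python
from collections import Counter


def last_digit_chi_squared(values):
    """Chi-squared statistic for last digits against an even spread over 0 to 9."""
    counts = Counter(v % 10 for v in values)
    expected = len(values) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))


# 5% critical value of the chi-squared distribution, 9 degrees of freedom
CRITICAL_5PCT_DF9 = 16.919

# Invented numbers, for illustration only
even_spread = list(range(1000, 2000))       # each last digit appears 100 times
rounded = [n - n % 5 for n in even_spread]  # everything ends in 0 or 5

print(last_digit_chi_squared(even_spread))
print(last_digit_chi_squared(rounded))
```

The evenly spread list scores zero, while the list rounded to fives scores far above the critical value. That is the kind of digit preference this crude tool is meant to catch.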
Finally, bear in mind that all these tests only check whether people have made up numbers after the votes have been counted, and won’t detect people stuffing dodgy voting slips into ballot boxes.
For completeness, here’s the Stata code I used. Open source dorks: in the future, I’m planning to do more public data analyses and detailed walk-throughs, for which I’ll use R, the open source statistics package, but because I did this in a hurry between bits of work this morning I had to use Stata (I know it well, so I can code without thinking too hard).
Here’s the Stata code:
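* install packages if needed (do this first, or the commands below will not be found)
ssc install benford
ssc install firstdigit
ssc install digits
* load the data and summarise the vote counts
insheet using CandCount.csv, names
summ votecount, detail
* benford action: first-digit distribution for each candidate
firstdigit votecount, by(candidate)
* digit preference: pull out the last digit of each count
gen lastdigit = mod(votecount,10)
* nice pics
hist votecount, bin(100) freq
* eqprhistogram is also user-written; install it too if Stata cannot find it
eqprhistogram votecount, bin(100)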