Benford’s Law: using stats to bust an entire nation for naughtiness.

September 23rd, 2011 by Ben Goldacre in crime, economics, statistics, structured data

Ben Goldacre, The Guardian, Saturday 17 September 2011

This week we might bust an entire nation for handing over dodgy economic statistics. But first: why would they bother? Well, it turns out that whole countries have an interest in distorting their accounts, just like companies and individuals. If you’re an Euro member like Greece, for example, you have to comply with various economic criteria, and there’s the risk of sanctions if you miss them.

Government figures are subjected to various forms of audit already, of course, but alongside checking that things marry up with each other, forensic statisticians also have a few interesting tricks to try and spot suspicious patterns in the raw numbers, and so estimate the chances that figures from a set of accounts have been tampered with. One of the cleverest tools is something called Benford’s law.

Imagine you have the data on, say, the population of every country in the world. Now, take only the “leading digit” from each number: the first number in the number, if you like. For the UK population, which was 61,838,154 in 2009, that leading digit would be “six”. Andorra’s was 85,168, so that’s “eight”. And so on.

If you take all those leading digits, from all the countries, then overall, you might naively expect to see the same number of ones, fours, nines, and so on. But in fact, for naturally occurring data, you get more ones than twos, more twos than threes, and so on, all the way down to nine. This is Benford’s law: the distribution of leading digits follows a logarithmic distribution, so you get a “one” most commonly, appearing as the first digit around 30% of the time, and a nine as the first digit a little under 5% of the time.
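The logarithmic distribution in question gives a leading digit d a share of log10(1 + 1/d). Here is a minimal sketch of that, using the two population figures quoted above. The function names are my own, and the digit-picking helper assumes non-zero whole numbers.

```python
import math

def leading_digit(n: int) -> int:
    """Return the first digit of a non-zero whole number."""
    return int(str(abs(n))[0])

def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d (1 to 9)."""
    return math.log10(1 + 1 / d)

print(leading_digit(61838154))  # UK population in 2009
print(leading_digit(85168))     # Andorra

# Predicted share for each possible leading digit.
for d in range(1, 10):
    print(d, f"{benford_probability(d):.1%}")
```

The nine shares add up to exactly one, because the logarithms telescope to log10(10).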

The next time you’re waiting for a bus, you can think about why this happens (bear in mind what leading digits do when quantities repeatedly double, perhaps) but reality agrees with this theory pretty neatly, and if you go to the website you’ll see the proportions of each leading digit from lots of real-world datasets, graphed alongside what Benford’s law predicts they should be, with data from Twitter users’ follower counts to the number of books in different libraries across the US.
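For anyone who would rather not wait for the bus, the doubling hint can be tried out directly. The sketch below is an illustration rather than a proof: it takes the first 1,000 powers of two (my arbitrary choice of sequence and length) and compares the share of each leading digit with the logarithmic prediction.

```python
import math
from collections import Counter

# Assumption for illustration: a quantity that starts at 1 and doubles 999 times.
values = [2 ** n for n in range(1000)]

# Tally the first digit of every value in the sequence.
counts = Counter(int(str(v)[0]) for v in values)

for d in range(1, 10):
    observed = counts[d] / len(values)
    predicted = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {predicted:.3f}")
```

A doubling quantity spends a long time between 1 and 2, and races through the stretch from 8 to 10 in a fraction of a step, which is one way to see where the surplus of ones comes from.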

Benford’s law doesn’t work perfectly: it only works when you’re examining groups of numbers that span several orders of magnitude, for example, so for age, in years, of the graduate working population, which goes from around 20 to 70, it wouldn’t be much good, but for personal savings, from nothing to millions, it should work fine. And of course, Benford’s law works in other counting systems, so if three-fingered sloths ever develop numeracy, and count in base-6, or maybe base-12, the law would still hold.
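Both caveats can be checked in a few lines. This is a rough sketch: the helper names are mine, the age example simply uses every whole-number age from 20 to 70, and the base-6 figures come from the same logarithmic rule with the base swapped, log_b(1 + 1/d).

```python
import math

def leading_digit(n: int, base: int = 10) -> int:
    """First digit of a positive whole number when written in the given base."""
    while n >= base:
        n //= base
    return n

def benford_probability(d: int, base: int = 10) -> float:
    """Expected share of leading digit d (1 to base - 1) in the given base."""
    return math.log(1 + 1 / d, base)

# Ages 20 to 70 cover less than one order of magnitude,
# so the leading digits 1, 8 and 9 can never appear.
age_digits = sorted({leading_digit(age) for age in range(20, 71)})
print(age_digits)

# The same logarithmic rule for a sloth counting in base 6: digits run from 1 to 5.
for d in range(1, 6):
    print(d, f"{benford_probability(d, base=6):.1%}")
```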

This property of naturally occurring data has been used to check for dubious behaviour in figures for four decades now: it was first used on socioeconomic data submitted to support planning applications, and then on company accounts: it’s even admissible in US courts. But in 2009, an economist from the Bundesbank suggested using Benford’s law on countries’ economic data, and last month the results were published (hat-tip to Tim Harford for the paper).

Researchers took macroeconomic data on all 27 EU nations, looking specifically at the accounting data that countries have to hand over for monitoring, which is all posted for free at the online repository Eurostat: things like government deficit, debt, revenue, expenditure, and so on. Then they took just the first digits from all the numbers, and checked to see if that deviated from what you would predict, using Benford’s law.
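One common way to put a number on “deviated from what you would predict” is a Pearson chi-square statistic over the nine digit counts. I am not claiming this is the exact measure the researchers used, and the two datasets below are synthetic stand-ins rather than Eurostat figures. The sketch assumes whole-number values, with at least one of them non-zero.

```python
import math
from collections import Counter

def leading_digit(n: int) -> int:
    """Return the first digit of a non-zero whole number."""
    return int(str(abs(n))[0])

def benford_chi_square(values) -> float:
    """Pearson chi-square distance between observed leading digits and Benford's law."""
    digits = [leading_digit(v) for v in values if v != 0]  # zero has no leading digit
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

# Synthetic stand-ins, NOT Eurostat figures.
benford_like = [2 ** n for n in range(1000)]  # doubling sequence
too_even = list(range(100, 1000))             # every leading digit equally common

print(benford_chi_square(benford_like))
print(benford_chi_square(too_even))
```

With nine digits there are eight degrees of freedom, so a statistic above roughly 15.5 would be unusual at the 5% level for data that really follows Benford’s law. The bigger the number, the more suspicious the digits.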

The results were fun. Greece – whose economy has tanked – showed the largest and most suspicious deviation from Benford’s law of any country in the Euro.

This isn’t a massive surprise: the EU have run several investigations into Greece’s numbers already, and the ones from 2005 to 2008 were repeatedly revised upwards after the fact. But it’s neat, and if you wanted to while away a very nerdy afternoon, I reckon you could even download the data, for free from Eurostat, and repeat the analysis for yourself. Joy!

8 Responses

  1. elder_pegasus said,

    September 23, 2011 at 1:14 pm

    Nice – it’s the sort of sanity checking you might do by eye on smaller data sets. Makes you wonder what other public data sets this might be used on… 🙂

  2. nohoval_turrets said,

    September 23, 2011 at 2:38 pm

    Interesting. So who were the other suspicious ones then? Who’s the next Greece?

    All the papers linked here seem to be behind paywalls!

  3. lemoutan said,

    September 23, 2011 at 2:40 pm

    Votes cast in elections would, across the country and across the parties, exhibit the required range of orders of magnitude.

  4. keristor said,

    September 23, 2011 at 2:47 pm

    Oh fun! I’m so tempted to download the data and do it myself (well, write a program to do it, I’m not going to do that much counting!).

    @nohoval_turrets: the data isn’t behind a paywall…

  5. nohoval_turrets said,

    September 23, 2011 at 4:01 pm

    Actually, found the answer in the Tim Harford link.

    “Romania, Latvia and Belgium also have abnormally distributed data, while Portugal, Italy and Spain have a clean bill of health.”

  6. Chris Neville-Smith said,

    September 24, 2011 at 5:19 pm

    Some articles from the Guardian seem to be making it on to this site quite randomly at the moment. Can anyone advise me what’s going on?

  7. huxley_leopard said,

    September 26, 2011 at 12:52 pm

    Great article, tres interessant.

    Would have been quite nice to see a graph on this page though, to illustrate.

    Also “an Euro member”? Might be *correct* but doesn’t read well. Or is it because I read aloud?

  8. Mooks said,

    September 30, 2011 at 1:11 pm

    Sorry for coming to the discussion late – I normally read on the Guardian but haven’t got round to it the last couple of weeks. And the comments section is now closed for this article.

    Basically, I just wanted to say that, for recommending R, you are an absolute hero.