Losing the lottery

April 6th, 2007 by Ben Goldacre in bad science, statistics

Ben Goldacre
Saturday April 7, 2007
The Guardian

It is possible to be very unlucky indeed. A nurse called Lucia de Berk has been in prison for 5 years in Holland, convicted of 7 counts of murder and 3 of attempted murder. An unusually large number of people died when she was on shift, and that, essentially, along with some very weak circumstantial evidence, is the substance of the case against her. She has never confessed, but her trial has generated a small collection of theoretical papers in the statistics literature (below), and a major government enquiry will report on her sentence in the next few weeks.

The judgement was largely based on a figure of “one in 342 million against”. Now, even if we found errors in this figure – and we will – the figure itself would still be largely irrelevant. Unlikely things do happen: somebody wins the lottery every week; children are struck by lightning; I have an extremely fit girlfriend. It is only significant that something very specific and unlikely happens if you have specifically predicted it beforehand.

Here is an analogy. Imagine I am standing near a large wooden barn with an enormous machine gun. I place a blindfold over my eyes and – laughing maniacally – I fire off many thousands and thousands of bullets into the side of the barn. I then drop the gun, walk over to the wall, examine it closely for some time, all over, pacing up and down: I find one spot where there are three bullet holes close to each other, and then I draw a target around them, announcing proudly that I am an excellent marksman. You would, I think, disagree with both my methods and conclusions for that deduction. But this is exactly what has happened in Lucia’s case: the prosecutors have found 7 deaths, on one nurse’s shifts, in one hospital, in one city, in one country, in the world, and then drawn a target around them. A very similar thing happened with the Sally Clark cot death case.

Before you go to your data, with your statistical tool, you have to have a specific hypothesis to test. If your hypothesis comes from analysing the data, then there is no sense in analysing the same data again to confirm it. This is a rather complex, philosophical, mathematical form of circularity: but there were also very concrete forms of circular reasoning in the case. To collect more data, the investigators went back to the wards to find more suspicious deaths. But all the people who have been asked to remember ‘suspicious incidents’ know that they are being asked because Lucia may be a serial killer. There is a high risk that “incident was suspicious” became synonymous with “Lucia was present”. Some sudden deaths when Lucia was not present are not listed in the calculations: because they are in no way suspicious, because Lucia was not present.

“We were asked to make a list of incidents that happened during or shortly after Lucia’s shifts,” said one hospital employee. In this manner more patterns were unearthed, and so it became even more likely that investigators found more suspicious deaths on Lucia’s shifts. This is the stuff of nightmares.

Meanwhile, a huge amount of corollary statistical information was almost completely ignored. In the three years before Lucia worked on the ward in question, there were 7 deaths. In the three years that Lucia did work on that ward, there were 6 deaths. It seems odd that the death rate should go down on a ward at the precise moment that a serial killer – on a killing spree – arrives on the scene. In fact, if Lucia killed them all, then there must have been no natural deaths on that ward at all, in the 3 years that she worked there.

On the other hand, as they revealed at her trial, Lucia did like tarot. And she does sound a bit weird in her private diary. So she might have done it after all.

But the strangest crime of all is that the prosecution’s statistician made a simple mathematical error to produce the figure of one in 342 million. He combined individual statistical tests by multiplying p-values. This bit’s for the hardcore science nerds, and will be edited out by the paper, but I intend to write it anyway. You do not just multiply p-values together, you weave them with a clever tool, like maybe “Fisher’s method for combination of independent p-values”.
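
For readers who want to check the difference, here is a minimal Python sketch (not the court statistician's actual calculation). It compares the naive product with Fisher's method on 20 hypothetical p-values of 0.5, the scenario discussed in the next paragraph. The function names are made up for illustration, and the closed-form chi-square tail is used only because Fisher's statistic always has an even number of degrees of freedom.

```python
import math


def naive_product(p_values):
    """Multiply the p-values together (the flawed approach)."""
    return math.prod(p_values)


def fisher_combined(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of p-values).
    For even degrees of freedom the upper tail has a closed form:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is x / 2
    tail_sum = sum(half_x ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_x) * tail_sum


# Hypothetical input: 20 entirely unremarkable p-values of 0.5.
p_values = [0.5] * 20
print("naive product:", naive_product(p_values))
print("Fisher's method:", fisher_combined(p_values))
```

The product comes out just under one in a million, while Fisher's method returns a combined p-value of roughly 0.9, i.e. nothing remarkable at all.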

If you multiply p-values together, then harmless incidents rapidly become dangerously unlikely. Let’s say you worked in 20 hospitals, each with a harmless incident pattern: say p=0.5. If you multiply those harmless p-values, you end up with a final p-value of 0.5 to the power of 20, which is p < 0.000001, which is extremely, very, highly statistically significant. With this mathematical error, if you change hospital a lot, you automatically become a suspect. Have you worked in 20 hospitals? For god's sake don't tell the Dutch police if you have.

References:

Here's a presentation to the UCL "Evidence" Group by Dutch statistician Peter Grunwald:

Statistician Richard Gill’s page on the case:

“Elffers’ [court statistician] method, and Elffers’ mistake”

The wikipedia page is excellent for the basic story:

arXiv pre-print

“Lucia: Killed by Innocent Heterogeneity”

And lastly – because he always got there first – Richard Feynman used an excellent example to illustrate this phenomenon of post hoc coincidence detection: “You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing…”

oh, and note the very informative post below from Peter Grunwald.



34 Responses

  1. Damien said,

    April 6, 2007 at 11:41 pm

    Yey I don’t feel like such a pseudo-science nonce (aka Psychology undergraduate) after understanding your stats 😀

  2. Kimpatsu said,

    April 7, 2007 at 2:22 am

    Hey, the Guardian did indeed edit out the science bit, Ben. Are you psychic? What are the odds on THAT?

  3. jackpt said,

    April 7, 2007 at 10:08 am

    It’s alarming how often this kind of thing happens. Multiplying p-values makes no sense at all. That’s such a basic error that the court itself deserves some opprobrium by letting it slip through unchecked.

  4. wotsisnameinlondon said,

    April 7, 2007 at 10:43 am

    Er, where was Ms de Berk’s defence lawyer while all this was going on?

    If the prosecution is going to attack you on the basis of statistics, then you need to be able to defend yourself in the same area. As BG has clearly pointed out, the prosecution’s use of statistics was nonsense. Why was it unchallenged?

  5. davehodg said,

    April 7, 2007 at 10:51 am

    Isn’t this the old “correlation is not causation” thing?

  6. jackpt said,

    April 7, 2007 at 10:53 am

    #4, absolutely, although one hopes that courts/prosecution services would have chucked it out before it got to that.

  7. Toby Hopkins said,

    April 7, 2007 at 2:15 pm

    “On the other hand, as they revealed at her trial, Lucia did like tarot. And she does sound a bit weird in her private diary. So she might have done it after all.”

    A rather Dawkins-esque analysis some would say Ben…you do make a fair point though – if she is a serial killer, she is also an exceptionally capable nurse.

  8. Evil Monster said,

    April 7, 2007 at 3:38 pm

    I once worked out that if 2 people play snakes and ladders, have 20 throws each and write down the sequence of die throws, they could then play a game every day until all of the galaxies in the universe collapse into black holes before they’d expect to see the same sequence of throws again. Post hoc probability is one. Creationists don’t understand it either.

  9. SciencePunk said,

    April 7, 2007 at 6:06 pm

    wow, some seriously heavy bad science consequences there. good work spreading the word on this.

  10. BobP said,

    April 8, 2007 at 8:21 am

    According to the Wiki page, and I haven’t done any more research than that, all the evidence against her was either –
    a) circumstantial (… she was there) ,
    b) statistical or
    c) irrelevant (….tarot cards & dodgy diary entries).

    Circumstantial and irrelevant information should be discounted by a court and are not by themselves sufficient for conviction. This means that the basis of the conviction was the statistical evidence, as in the Sally Clark case.

    Courts do convict on balance of probabilities. Statements like “there is only a one in 70 million chance of another person’s DNA matching the sample” are regularly reported. However if the statistics presented in the court case are flawed I would expect that to be a valid ground for appeal – it doesn’t seem to have worked that way yet in the Dutch court system.

    “There are three kinds of lies: lies, damned lies, and statistics”

  11. John Coffin said,

    April 8, 2007 at 5:01 pm

    The ‘machine gun’ illustration is called a ‘Texas bullseye.’

  12. grunwald said,

    April 9, 2007 at 7:38 pm

    As one of several Dutch statisticians who has protested against the use of statistics in the Lucia de B. trial, let me post a few quick clarifications which will answer some of the questions raised (in particular, those by #4 and #12).

    (1) The defense actually did an excellent job. The problem is that the judges didn’t listen. In the final verdict one finds many places where the defense raises an entirely valid point, and the judges write, essentially, ‘we are not convinced’. (Note that we do not have juries in the Netherlands; we have the ‘Napoleonic system’, where a team of judges decides. If one doesn’t like the verdict, one can go to the court of appeals; but if the judges in the court of appeals make mistakes in their reasoning, then nobody is in a position to correct them or re-appeal.)

    (2) In the court of appeals, the defense had two expert witnesses (a probability theorist and a logician, both full professors) who claimed that the statistics was flawed. Unfortunately, the judges were not impressed. The two witnesses both claimed that the number 1 in 342 million made no sense. The judge kept asking: so what IS the right number? They refused to answer that, saying the question was ill-defined. This didn’t exactly help and this may have caused the bizarre outcome:

    (3) The judges did realize that the statistics was highly controversial. Officially, in the verdict made by the court of appeals ‘statistics in the form of probability calculations plays no role’. That is what they write on the front page of the 90-page (!) report accompanying the verdict. But if one goes on to read the report, one finds example after example of the following line of reasoning: ‘so many people died while she was there. That cannot have been a coincidence!’ …so they use statistics after all. A fine example: one of the ‘murders’ concerns a 73-year-old woman who suffered from terminal cancer. She was going to die soon, but when it actually happened, it was a bit sudden. The court asked six medical experts to give their opinion. Five say: this was a natural death. One says: ‘at first I thought it was natural, but *given all the many other cases in which L. was involved*, I now think it was unnatural’. The court follows this sole dissenting expert against the other five. But this dissenting expert has used statistics, once again! The upshot is that statistics has played an essential role, although this is denied. Apparently, once the 1 in 342 million number was publicised in all the Dutch media, it stuck in people’s (both judges’ and experts’) heads, and it influenced their opinions.

    [now it gets more technical]
    (4) I should note that the court’s statistician did realize the main problem noted by Ben, i.e. one cannot simply use the data that suggested a hypothesis to confirm that same hypothesis. To counter this problem, he used what he calls a post-hoc correction (he multiplied the p-value by 27, this being the number of nurses in the ward). The problem is that this correction doesn’t make a lot of sense. No matter how one does a post-hoc correction, the resulting number will not be very meaningful.

    (5) I should also note that the statistician very explicitly warned that a small p-value doesn’t say anything about whether Lucia is guilty or not (an explicit example that he gave is that Lucia may always do the night shifts, when more people die). The problem, however, is still that, because of the post-hoc confirmation problem, a small p-value doesn’t even indicate that it’s not a coincidence!

    (6) BTW if one uses Fisher’s combination method rather than multiplying p-values, the number becomes about 1 in a million rather than 1 in 342 million. So, still very small. But again, the number in itself doesn’t say much.

  13. onegoodmove said,

    April 9, 2007 at 8:39 pm

    Would the same reasoning apply to the anthropic argument for the existence of God?

  14. tms said,

    April 10, 2007 at 4:57 pm

    Google tells me that there are 6 million nurses in Europe. So even if you accept the Fisher’s method 1-in-a-million figure, you would expect to find 6 perfectly innocent nurses a year with a comparable record of happening to be there when people died.
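
The arithmetic behind this estimate can be sketched in a few lines of Python. Both inputs are simply the figures quoted in the comment above, not verified data, and the sketch assumes the 1-in-a-million chance applies independently to each nurse over the period in question.

```python
# Figures quoted in the comment above (assumptions, not verified data).
nurses = 6_000_000        # nurses in Europe, per the commenter
p_chance = 1e-6           # chance of such a record arising innocently

# Expected number of innocent nurses with a comparable record.
expected_innocent = nurses * p_chance

# Probability that at least one innocent nurse has such a record,
# assuming nurses are independent of one another.
p_at_least_one = 1 - (1 - p_chance) ** nurses

print("expected innocent nurses:", round(expected_innocent, 2))
print("P(at least one):", round(p_at_least_one, 4))
```

Under those assumptions, at least one innocent nurse with such a record is close to a certainty rather than a rare fluke.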

  15. Ben Goldacre said,

    April 10, 2007 at 5:04 pm

    incarcerating six innocent nurses is a small price to pay for patient safety.

  16. Ben Goldacre said,

    April 11, 2007 at 11:18 am

    um, when you say “my web designer”…

    i just botched about with a free theme, looks alright to me in IE (6), i needed 3 columns to fit all the info in, especially with the new minilinks blog, if there are any wordpress gurus who want to help i’m always listening. if you didnt pick a forum login that was obviously a humorous sceptic you might have been misidentified as a spambot, email your login and i’ll enable it.

  17. wewillfixit said,

    April 11, 2007 at 1:48 pm

    On my screen, I am losing half of the number of the comment. Eg comment 24 above is labelled as 4.

  18. Ginger Yellow said,

    April 11, 2007 at 3:26 pm

    “if one doesn’t like the verdict, one can go to the court of appeals; but if the judges in the court of appeals make mistakes in their reasoning, then nobody is in a position to correct them or re-appeal)”

    Is this true? Is there really no grounds for appeal? That sounds like a recipe for injustice.

    I still can’t understand how these judges came to their conclusions.
    “Officially, in the verdict made by the court of appeals ’statistics in the form of probability calculations plays no role’. That is what they write on the front page of the 90-page (!) report accompanying the verdict”

    But they must have said somewhere what does play a role. And they can’t seriously mean the tarot and diary stuff. What’s the evidential burden in the Netherlands? Is it reasonable doubt or something else?

  19. epepke said,

    April 11, 2007 at 6:48 pm

    Thanks for fighting the good fight, Ben. I appreciate your wry and ironic sense of humor, but there is a danger in that, as it does require at least minimal intelligence.

    In any event, I’m a mathematician, was a research scientist for 13 years, and got to be Mr. March in the Studmuffins of Skepticism calendar, and even I like tarot. It can be a lot of fun.

  20. Robert Carnegie said,

    April 11, 2007 at 9:45 pm

    In the UK I thought the basis of an appeal was that the original court trial was or is deficient, either because a legal error was made or because important evidence was not considered, including new evidence not previously available. Otherwise you take it that the original court got it right. In practice it seems that the thing to do is to appeal every case until you run out of higher courts or until you win one. But perhaps this is an illusion brought on by reporting – for instance, every loser in court tells the press that they’re considering an appeal, but I wonder how many of them do it.

    In the case of new evidence, I think you always can petition for a new trial, but a court decides whether that happens or not.

  21. CaptainKirkham said,

    April 12, 2007 at 10:28 am

    In UK courts, generally, findings of fact in criminal cases are not appealable. It is incorrect summing up, incorrect law, that kind of thing, that gets something appealed. However, that is in a jury-based system, and comparisons between a civil law and common law system are pretty much apples and oranges.

  22. grunwald said,

    April 12, 2007 at 12:47 pm

    Re #27:

    In the Netherlands, if you don’t like the verdict of the court of appeals, you can go to the high court, but, as far as I understood, they can only reconsider the case if something didn’t go according to protocol (e.g. the prosecution withheld relevant evidence, things like that).

    The judges listen to prosecution, defense and their experts, and then decide whether guilt has been established ‘beyond reasonable doubt’. However, whether or not something actually IS beyond reasonable doubt is up to their discretion. If the court of appeals claims that it’s beyond reasonable doubt, but you disagree because you think that their reasoning is flawed, then there’s nothing you can do – this is not sufficient reason to reopen the case in high court.
    Now if there’s new evidence, the case CAN be reopened. If the committee who has to decide on the Lucia case decides that there’s sufficient new evidence, then they can advise to reopen it.

    Now what did the court write in this 90-page report? They first claim that for two of the 10 victims, they can actually prove murder by digoxin poisoning (in 1 case) and murder attempt by strangulation in the other case. So, according to them, these 2 cases have classical ‘beyond reasonable doubt status’. For the other 8 cases, they use something called a chain proof, a peculiar ‘tool’ in Dutch law. Basically it says that if you’ve been found guilty of a crime, and you’re a suspect in N more similar crimes, then the amount of evidence that constitutes ‘beyond reasonable doubt’ decreases with each further case. So for the second case, you need less evidence than for the first. For the third, you need less than for the second, etc.
    In Lucia’s case, for the eight subsequent cases, they only have the following: the patient died or needed reanimation rather suddenly; no immediate cause could be found; some doctors think it might be an unnatural death; the flawed statistics (‘it cannot have been a coincidence’, as they write a few times); and then there’s the weird diary. This is despite the fact that for all these 8 cases, there originally was no suspicion at all (a natural death certificate was signed), all 8 cases were about VERY sick people, and in all 8 cases some doctors (often the majority of doctors asked!) still thought it was a natural death when they were asked in court (for some cases this was 5 years after the deaths happened).

    So via the strange chain proof construction, everything hinges on the first two cases. In these cases, a lot of new evidence has been found: for example, the prosecution has withheld crucial evidence from medical expert witnesses.
    (For example, in the digoxin case, the heart of the patient was not contracted, which is a very strong indication AGAINST digoxin poisoning, but the medical expert was not given this information, although the prosecution seems to have been aware of it.)
    So if the case gets reopened, and the defense manages to convince the judge that there is considerable doubt about the first two cases, then there’s nothing left.

    BTW (technical aside), the chain proof construction may itself be thought of as a kind of flawed statistics. From a probabilistic point of view, you can argue that it makes sense if you find 10 bodies ‘with knives in their chests’, so to speak: if it is clear that 10 MURDERS have been committed, and you know that somebody was around during all of them, then it is indeed true that, given that it has been established that that person already killed two of them, the conditional probability that the same person has killed the other 8 as well goes up tremendously. This reasoning is justified if, for example, you assume that only a very small part of the population is murderous.
    The problem with the Lucia case is that there is hardly any evidence that these other 8 cases were murders/crimes. They may very well have been natural deaths. Under those circumstances, the chain proof construction makes no statistical/probabilistic sense at all, I think.

    Peter Grunwald

  23. Tony Gardner-Medwin said,

    April 12, 2007 at 4:16 pm

    “incarcerating six innocent nurses is a small price to pay for patient safety.”
    [Ben Goldacre, post (2)1: 10/4/07]

    When I read this I thought hard about whether it was actually seriously meant. Possibly not, but it touches what seems to me the crux of the issue. As ‘tms’ concluded in the previous post “you would expect to find [in Europe] 6 perfectly innocent nurses a year with a comparable record of happening to be there when people died.”

    If such a murder trial were to be decided on the statistical evidence alone, and if the statistics were properly gathered and interpreted [which they weren’t] then it is this bottom line figure that it seems to me should be the basis for a decision about whether to convict or not. We start with the presumption of innocence. Is it reasonable to believe that such a weight of evidence could amass, by chance, against an innocent person? Yes: it would be expected to happen to about 6 people in Europe every year. Then we must acquit, even if, for whatever reason, we remain highly suspicious. I have argued the logic of this strongly elsewhere [ Significance 2:9-12 (2005) : www.ucl.ac.uk/~ucgbarg/doubt.htm ]

    Consider the implications of accepting “incarcerating six innocent nurses [per year, in Europe]” as a price worth paying. If one convicts on a standard of evidence that would lead to this statistic, one must also believe (as well as accepting this dire cost in human misery), that all these people one would convict are ‘probably’ (with a threshold maybe of P=0.99) guilty. The implication is that the six innocent victims are the tip of a large iceberg : there are 600 truly homicidal nurses [per year, in Europe] with no greater weight of evidence against them, who are or should be being convicted. This seems rather implausible to me, and even if it were correct (and this is the really important point) it should not justify conviction.

    Whether simplified statistical scenarios, based on cases such as Lucia de B. or Sally Clark, should lead to conviction must depend on the facts of the case, and on judgement about whether such facts could plausibly and with reasonable probability or frequency have arisen so as to incriminate an innocent person. This judgement can of course be difficult, as can any legal decision, and may lead to acquittal of defendants one believes are probably guilty. But I believe it is what society and lawyers perceive as proper. It interprets “beyond reasonable doubt” as rather different from the way it is often viewed, as simply a high threshold on the probability of guilt. Utilitarianism has gone wild if one is to use statistics that would arise without guilt and with reasonable frequency as the basis for locking people up, justifying this to reduce a supposedly high level of crime or to improve patient safety. This is not where I want to live.

    That said, Ben’s post does look a bit like an uncharacteristically incautious and hasty remark from someone who does vastly more than most to sustain good standards of thinking, and to keep this a place where I do want to live.
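
The "iceberg" arithmetic in the comment above can be reproduced with a short sketch. The two inputs are the commenter's illustrative figures (6 innocent convictions a year and a guilt threshold of P=0.99), not real data, and the variable names are made up for this example.

```python
# Illustrative figures from the comment above (assumptions, not data).
innocent_per_year = 6     # innocent nurses convicted per year
p_guilty = 0.99           # assumed probability that a convicted nurse is guilty

# If innocents are only (1 - p_guilty) of all convictions, the total
# number of convictions implied by 6 innocents is:
total_convicted = innocent_per_year / (1 - p_guilty)
guilty_convicted = total_convicted - innocent_per_year

print("implied convictions per year:", round(total_convicted))
print("of which guilty:", round(guilty_convicted))
```

Strictly, the 600 is the implied total number of convictions, of which about 594 would be guilty; the commenter's round figure of 600 homicidal nurses is the same order of magnitude.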

  24. Tony Gardner-Medwin said,

    April 12, 2007 at 7:33 pm

    Fine. Sorry I didn’t get the joke. Of course nobody thinks it’s a good idea to incarcerate innocent nurses, and I wasn’t suggesting for a moment that you did. But convicting the innocent is an inevitable risk if you set a criterion for conviction that is less than certain proof. I (and presumably you – though you still leave it to inference) would set the acceptable risk much less than the figure calculated here, though others perhaps would not.

    My point was not about whether you were serious or not. It was that this figure that tms had estimated (the frequency with which criteria could convict the innocent) is what should govern the verdict, not a judgement of the probability of guilt. I know these may seem at first sight two sides of the same coin, but they are not. There are all sorts of things (like the incidence of the alleged crime, the personal profile of the defendant, or previous convictions) that may affect your judgement of the probability of guilt but that do not affect the probability that such evidence could have arisen to incriminate an innocent defendant.

  25. Tim Wogan said,

    April 16, 2007 at 1:59 pm

    I suppose it makes some kind of twisted sense given that, in Bayesian analysis, the probability of the hypothesis given the data increases if the a priori probability of the hypothesis is higher. In other words, the probability that the nurse killed the patient (the hypothesis) given the circumstances of the death (the data) is higher because, in advance of doing the calculation, we know that this nurse has a higher probability of being a murderer (the a priori probability). But read that last sentence again (if you can face it). We are assuming that a nurse, who has not been convicted of anything, is a probable killer, and using that information to obtain a probability that she killed someone. If she had been convicted of murder in the past, then it would be ethically tricky (a kind of mathematicized version of the argument about disclosure of previous convictions that we had in the British courts). To use it within a single multiple murder trial is frankly Orwellian.
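
The effect of the prior described above can be illustrated with a small sketch of Bayes' rule in odds form. All the numbers are hypothetical, chosen only to show how the same evidence yields very different conclusions depending on the prior; nothing here is taken from the trial.

```python
def posterior(prior, likelihood_ratio):
    """Posterior probability of the hypothesis via Bayes' rule in odds form.

    likelihood_ratio = P(data | hypothesis) / P(data | not hypothesis).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical: the evidence is 20 times more likely if the nurse is guilty.
likelihood_ratio = 20

# Hypothetical priors: a random nurse, a vaguely suspected one, and one the
# court already "knows" to be a likely killer.
for prior in (0.0001, 0.01, 0.5):
    print(prior, "->", round(posterior(prior, likelihood_ratio), 4))
```

With a tiny prior the same evidence leaves the posterior well below 1%; with a prior of 0.5 it pushes the posterior above 95%. That is the sense in which assuming guilt in advance largely determines the answer.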

  26. Robert Carnegie said,

    April 20, 2007 at 12:32 am

    Not Orwell, Kafka. Have we done Kafka this time? (search) Apparently not.

  27. Tim Wogan said,

    April 23, 2007 at 3:58 pm

    I defer to the superior knowledge of literature.

  28. gill1109 said,

    April 30, 2007 at 7:12 pm

    “incarcerating six innocent nurses is a small price to pay for patient safety.”
    [Ben Goldacre, post (2)1: 10/4/07]

    Actually I think the price is rather higher. Some of the new evidence which has come to light in the last year is that the original statistics were badly biased – there are a few “missing incidents” during shifts of other nurses, and a few “incidents” attributed to Lucia which aren’t incidents at all. Secondly, I think the probability model which is used in almost everyone’s calculation is very poor. The statistician for the prosecution refused to consider data from other years because “too many things could have been different”. This is an argument why it is dangerous to even combine different months. The analysis should have been stratified by time. Using the latest data and taking account of a modest amount of heterogeneity, I guesstimate that about 1 in 9 nurses would experience such a large concentration of incidents in their shifts.

    So do we need to build a new Gulag for approx 10% of Europe’s nurses? No, most of these coincidences won’t be noticed at all, by anyone. They were noticed in Lucia’s case because of her not quite usual clothing, way of talking, difficult youth… gossip made people keep her under scrutiny and THEN she had the bad (1 in 9) luck.

    So just one small concentration camp should be quite enough.

  29. Candy said,

    May 10, 2007 at 4:52 pm

    All these miscarriages of justice based on statistics, always seem to happen to women. Has anyone worked out the probability that this has happened by chance?…

  30. gill1109 said,

    May 21, 2007 at 5:20 pm

    I don’t know the answer to Candy’s question just above, but something more can be said about the role of women in Lucia’s case. The board of three judges was chaired by a woman. One of the other two was home sick/overworked when the trial started, the other got sick during the trial (though I suppose they both added their signatures). When you see the judge on TV reading her conclusions, or read them yourself, you see that she is white-hot with emotion (hate and disgust and shock). Totally convinced.

    The “chef de clinique” who at the outset persuaded Director Smits that there was something wrong with Lucia, and in fact did his “simple man’s statistics” for him, was also a woman – in fact, the sister-in-law of Ton Derksen and Metta de Noo. Apparently a very very powerful personality, who could exert a great deal of influence on everyone round her. She’s currently being treated for depression.

  31. tinus42 said,

    November 19, 2007 at 5:59 pm

    @Deano, April 10, 2007

    The reporter who asked the question didn’t mention the name of the magazine. So it’s likely that official never heard of Nature, but that can’t be deduced from the quoted text on that site.

  32. ChrisSamsDad said,

    January 5, 2009 at 1:06 pm

    Update on the case: She has now been released, pending a re-trial. luciadeb.nl/english/time-line.html
