Ben Goldacre, The Guardian, Saturday 10th September 2011

We all like to laugh at quacks when they misuse basic statistics. But what if academics, en masse, deploy errors that are equally foolish? This week Sander Nieuwenhuis and colleagues publish a mighty torpedo in the journal Nature Neuroscience.

They’ve identified one direct, stark statistical error that is so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.

To understand the scale of this problem, first we have to understand the statistical error they’ve identified. This is slightly difficult, and it will take 400 words of pain. At the end, you will understand an important aspect of statistics better than half the professional university academics currently publishing in the field of neuroscience.

Let’s say you’re working on some nerve cells, measuring the frequency with which they fire. When you drop a chemical on them, they seem to fire more slowly. You’ve got some normal mice, and some mutant mice. You want to see if their cells are differently affected by the chemical. So you measure the firing rate before and after applying the chemical, first in the mutant mice, then in the normal mice.

When you drop the chemical on the mutant mice nerve cells, their firing rate drops, by 30%, say. With the number of mice you have (in your imaginary experiment) this difference is statistically significant, which means it is unlikely to be due to chance. That’s a useful finding which you can maybe publish. When you drop the chemical on the normal mice nerve cells, there is a bit of a drop in firing rate, but not as much – let’s say the drop is 15% – and this smaller drop doesn’t reach statistical significance.

But here is the catch. You can say that there is a statistically significant effect for your chemical reducing the firing rate in the mutant cells. And you can say there is *no* such statistically significant effect in the normal cells. But you cannot say that mutant cells and normal cells respond to the chemical differently. To say that, you would have to do a third statistical test, specifically comparing the “difference in differences”, the difference between the chemical-induced change in firing rate for the normal cells against the chemical-induced change in the mutant cells.

Now, looking at the figures I’ve given you here (entirely made up, for our made up experiment) it’s very likely that this “difference in differences” would not be statistically significant, because the responses to the chemical only differ from each other by 15%, and we saw earlier that a drop of 15% on its own wasn’t enough to achieve statistical significance.
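To put rough numbers on the made-up experiment, here is a minimal Python sketch. The 30% and 15% drops come from the example above; the standard errors of 12 percentage points, the normal approximation, and every variable name are assumptions added purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, std_error):
    """Two-sided p-value for 'true effect is zero', using a normal approximation."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Mean percentage drop in firing rate, with an ASSUMED standard error for each group.
mutant_drop, mutant_se = 30.0, 12.0
normal_drop, normal_se = 15.0, 12.0

# Tests one and two: is each drop distinguishable from zero on its own?
p_mutant = two_sided_p(mutant_drop, mutant_se)
p_normal = two_sided_p(normal_drop, normal_se)

# Test three, the one that is usually skipped: the difference in differences.
# For two independent groups the standard errors combine in quadrature.
diff = mutant_drop - normal_drop
diff_se = sqrt(mutant_se ** 2 + normal_se ** 2)
p_diff = two_sided_p(diff, diff_se)

print(f"mutant drop:               p = {p_mutant:.3f}")
print(f"normal drop:               p = {p_normal:.3f}")
print(f"difference in differences: p = {p_diff:.3f}")
```

On these assumed figures the mutant drop comes in under 0.05, the normal drop does not, and neither does the difference between them. Different standard errors would give different answers, which is the point: only the third test speaks to whether the two kinds of cell respond differently.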

But in exactly this situation, academics in neuroscience papers are routinely claiming that they have found a difference in response, in every field imaginable, with all kinds of stimuli and interventions: comparing responses in younger versus older participants; in patients against normal volunteers; in one task against another; between different brain areas; and so on.

How often? Nieuwenhuis looked at 513 papers published in five prestigious neuroscience journals over two years. In half the 157 studies where this error could have been made, it was made. They broadened their search to 120 cellular and molecular articles in Nature Neuroscience, during 2009 and 2010: they found 25 studies committing this statistical fallacy, and not one single paper analysed differences in effect sizes correctly.

These errors are appearing throughout the most prestigious journals for the field of neuroscience. How can we explain that? Analysing data correctly, to identify a “difference in differences”, is a little tricksy, so thinking very generously, we might suggest that researchers worry it’s too longwinded for a paper, or too difficult for readers. Alternatively, perhaps less generously, we might decide it’s too tricky for the researchers themselves.

But the darkest thought of all is this: analysing a “difference in differences” properly is much less likely to give you a statistically significant result, and so it’s much less likely to produce the kind of positive finding you need to look good on your CV, get claps at conferences, and feel good in your belly. Seriously: I hope this is all just incompetence.

## digitrev said,

October 3, 2011 at 2:08 pm

Just out of curiosity, what test would one have to apply? My (admittedly poor) statistical background suggests a Fisher test, but I’ll be damned if I know.

## JamesSW said,

October 3, 2011 at 2:09 pm

It’s clearly all part of a vast right-brain conspiracy.

## Philip Kendall said,

October 3, 2011 at 2:36 pm

This example perhaps becomes much more obvious if the threshold for statistical significance was 29%, but the two measurements were at 28% and 30%. More generally, it’s a symptom of simplifying results to a binary “significant” or “not significant” from the underlying “probability that this result could be obtained by chance”.

## Rob K said,

October 3, 2011 at 3:21 pm

For this example, I’d have thought the simplest analysis would be a 2-factor ANOVA, look for a significant chemical X mouse strain interaction, assuming that neuron firing rate meets the usual assumptions.

## Andy Grieve said,

October 3, 2011 at 5:32 pm

This is not only a problem of being untutored in statistics. It’s also a problem that too many biomedical journals don’t keep to the standards set by the CONSORT guidelines. It’s also a problem that not enough statisticians act as referees or associate editors of scientific journals. It’s also a problem that not enough statisticians are involved in the design, analysis and conduct of medical research. Having worked both as an academic and industrial statistician (in big pharma), in my opinion the design and analysis of industrial studies are uniformly of a higher standard than academic studies. I leave aside interpretation because I accept that not all big pharma sponsors are blameless as far as the interpretation of study results goes. My experience in academia both as a statistician providing input into grant proposals and as a statistician sitting on granting bodies is that many academic researchers treat their institution’s statisticians as the providers of sample sizes (on average no more than 48 hours before the grant deadline) and as t-ers and p-ers to provide a t-test and p-value. In the pharmaceutical industry statisticians were treated better than that when I joined it more than 35 years ago.

From a UK perspective there are two fundamental issues here. First, UK PLC and its university system is not producing enough statisticians and therefore there are not enough to go around. The argument is sometimes made that because the pharma industry takes the majority of those statisticians trained to masters’ level, industry should fund the courses itself. However, with the UK-based pharma industry contracting – witness the reduction of Pfizer, Novartis and GSK statistical groups – who is going to fund these courses? And if they are not funded, where will the next generation of statisticians come from to support academic and research organisations in the public sector? Secondly, the salaries of statisticians in non-pharma environments are not competitive. Surely market forces, with too few statisticians available to fill the required slots, should push up salaries in academia and non-profit organisations such as the MRC precisely as they did within pharma 20 years ago. It is all very well to have what amounts to a free market at the top end of the ladder – the professorial end – but if the system doesn’t recognise the problems at the bottom end the current failings will continue indefinitely.

As far as the MRC is concerned, I remember attending a meeting about 10 years ago at the MRC organised by MRC basic laboratory scientists who were concerned that they had no statisticians to support them, nor statisticians involved in the majority of journals to which they would naturally submit their work, and hence were using the same methods that were inadequate 20-30 years ago.

Don’t forget that many of the great statisticians of previous generations cut their statistical teeth not in academia but in research units and industry: RA Fisher at Rothamsted research station (Agriculture); Henry Daniels and David Cox at the Wool Industry Research Association; Maurice Bartlett at ICI’s Jealott’s Hill facility; OL Davies and George Box at ICI’s Dyestuffs and Chemical Divisions; Edgar Fieller at Boots – yes that’s Boots the Chemist – Pure Drug Company; and Frank Wilcoxon at the Cyanamid Company.

## Oeh said,

October 3, 2011 at 6:11 pm

I suggest anyone interested in this read the paper abstract. This has a better explanation of the issue with results for the experimental and control groups being on either side of .05 test for statistical significance.

## paulbrichardson said,

October 3, 2011 at 7:42 pm

I am with Rob K. I would use a 2-way Anova. Most second-year biologists would know this. It’s a bit shocking that senior researchers would be this ignorant.

## rnlanders said,

October 3, 2011 at 8:44 pm

2-way ANOVA is a better way to go. Hierarchical multiple regression might be more informative, but same basic idea.

This story is an excellent example of a classic error: concluding that statistical significance means something is “real” and a lack of statistical significance means it is not. Since the difference the researchers are interested in is “real” and the other is not, they conclude their hypotheses are supported. A misinterpretation, certainly, but not an uncommon one in most academic fields.

## jb0713 said,

October 3, 2011 at 9:58 pm

I would suggest it is incompetence, willful incompetence. Honesty is only the best policy when there is money in it.

## KeithLaws said,

October 3, 2011 at 10:58 pm

Obviously significance for a simple effect does not entail a significant interaction – though the authors estimate that in at least one third of cases, it would have produced a significant interaction – so it doesn’t undermine the data, only the approach used.

In this context, it is also worth noting the less obvious – that a significant interaction does not entail that the simple effects will also be significant.

## ajcarr said,

October 3, 2011 at 11:19 pm

Surely a 1-way ANOVA would do? We have four groups: are one or more of them different to the others? If the 1-way ANOVA indicates that this is the case, then we can carry out post-hoc analysis using boxplots and (better) paired t-tests with (e.g.) Bonferroni correction to determine which of the groups differ. Simple and pretty rigorous. It also takes about a second to do in decent software like DataDesk (I don’t mean a second of CPU time, I mean a second of user time, clicking on things).

As an engineering undergrad at Newcastle I was very lucky to get a lot of stats, taught by a very practical lecturer, Dr Andrew Metcalfe (presumably retired now), but a lot of places neglect stats because the mathematicians who teach maths to engineers despise stats as somehow ‘soft’ or ‘imprecise’. Unfortunately, modern manufacturing industry runs on stats (e.g., six-sigma and SPC). I would never advocate that even partial differential equations be dumped to make way for statistics, but stuff like set theory could maybe be reduced.

Heaven only knows what medics are taught. Maybe counting in octal after they’ve mistakenly removed two of the patient’s fingers?

## UrbanAchieverAndy said,

October 4, 2011 at 7:19 am

I am no statistician but even I know that’s wrong! They are willfully misleading people. And the journals are equally at fault for publishing the papers.

## PsyPro said,

October 4, 2011 at 7:36 am

I think there is a fundamental confusion here. Only some of the authors cited as making a statistical mistake were actually making a statistical claim. Some of the claims were with respect to substantive (i.e., theory-corroborative) conclusions, not statistical ones. For example, a before-after manipulation was found to be statistically significant under conditions x, precisely as predicted by theory Z, but not theory Q (which predicted no such effect). The same result was not found to be statistically significant (with all the same statistical parameters, e.g., sample-size, etc.), under conditions not-z, again, precisely as predicted by theory Z, but not theory Q (which now predicted a real effect). I see no problem with the authors concluding that they have provided support for theory Z, and a challenge to theory Q. And, in fact, most science is indeed conducted in just that way, as it should be: statistical hypotheses are mere tools in the testing of theories: they are not, in themselves, the hypotheses of real interest, except in the testing of extremely applied (atheoretical/nonsubstantive) claims. The confusion between the two uses of statistical “hypotheses” is the source of too much of the false debate regarding null-hypothesis statistical testing and its alleged alternatives.

Most of what is said in the piece is accurate and damning of a lot of really sloppy neuroscience (where it really was the statistical claim that was made). But, one should not use that sloppiness to condemn a rather common bit of logical reasoning about substantive hypotheses.

There is the claim in some of the comments that, well, why not just run a two-way ANOVA and test for a significant interaction? Two points: 1) it is not obvious that it is that *statistical* conclusion that is at issue in every case, and 2) interaction tests are notoriously of low-power relative to main-effects (as KeithLaws noted). Again, if the experimental conditions were such that there were NO theoretical bases for any of the statistical results, there may be a point in conducting the two-way ANOVA and subsequent simple-effects using the interaction error-term (but then we really are in the domain of dust-bowl empiricism).

## PsyPro said,

October 4, 2011 at 7:38 am

that should read “not-x” not “not-z”

## Craig said,

October 4, 2011 at 11:22 am

I can’t imagine any neuroscientist or psychologist that I know (I’m a behavioural neuroscience/psychopharmacology researcher) making that sort of error as a mistake.

I _can_ imagine just about every researcher I know playing as fast and loose with their statistics as they think they can get away with. When you spend your life fighting for those significant results, it’s easy to see the statistical tests as an obstacle to be overcome.

If the peer reviewers consistently fail to pick up on it, then it’s going to get through.

## gRg said,

October 4, 2011 at 11:30 am

Agree with Craig’s last point: what about reviewers? Superficial, incompetent or what else?

## KeithLaws said,

October 4, 2011 at 11:53 am

Ditto Craig and gRg

There are several possible explanations:

1) they ‘don’t know their stats’ – impossible – as mentioned above – this is quite basic and these are clearly capable individuals

2) they do it on purpose – some may, but – see PsyPro above – there are power issues and so, good reasons sometimes to do simple predicted comparisons. Interestingly, a large proportion (one third at least) do it even though the interaction would probably have been significant (so, in no way a deliberate ploy)

3) the most likely explanation is that a culture has developed in that area – such that it becomes a common – and therefore accepted practice – to analyse data in this manner. It happens e.g. in my area, it is common for patient data to be analysed without reference to any control data in the papers – i.e. all comparisons are within-patient – totally erroneous of course, but the culture implicitly accepts it and perpetuates it

If there is a fault, it surely lies with the reviewers and editors, editorial boards etc …who ‘permit’ this culture to develop and thrive

## linzel said,

October 4, 2011 at 12:51 pm

I’m afraid that in this instance Mr. Goldacre falls for mainstream exaggeration to acquire interest as much of the mainstream – and really poor – science ‘reporting’ does. Extraordinary claims require extraordinary evidence. I expect to read an article like this from theglobeandmail.com or thestar.com

Still love your blog Ben.

## DrBSteamJets said,

October 7, 2011 at 8:50 pm

I see this or related issues all the time when serving as a journal reviewer for various physiology journals. The most straight-forward way I find to illustrate the error in the authors’ reasoning is to question what the conclusion would be if in fact, to use this example, it had been the normal mice that showed a ‘significant’ decrease of 15% and it had been the mutant mice that failed to attain statistical significance despite the absolute magnitude of change being double (i.e. 30%) – which is not only possible but actually fairly common as control groups are often more consistent in whatever small effect may occur, whereas many interventions turn out to involve responders and non-responders (increasing variability in response). It is clear in such cases that we couldn’t conclude that the normal group had a significantly greater change as the actual data would illustrate the opposite interpretation.

As with so many of these statistical glitches, we find support for the strangely underused analytical approach of looking at the data.

## malaria_rules said,

October 10, 2011 at 2:53 pm

Certainly a lot of the comments seem to point out that the reviewers should have picked up the “bad statistics”. Although I have to agree that yes, they should have seen that, it is the responsibility of the authors, in the first place, to make sure they are doing good science and not leading themselves and others into a false story. Having said that, we all know how the game is played and sometimes getting that value under .05 seems to be an obsession…

## afterglow said,

October 11, 2011 at 4:20 pm

As someone who is writing up a manuscript at the moment, this is quite scary and I’m going to double check my stats so that I haven’t made this mistake. But would I be correct in thinking that a 2-way anova is the best way to look at these kind of differences?

## DrBSteamJets said,

October 11, 2011 at 8:51 pm

This is a common issue in published papers in my field (physiology) and one I have also picked up several times now when serving as a journal reviewer. I find the most graphic illustration (literally) to highlight the error in authors’ reasoning is to imagine (using Ben’s example) if it had been the normal mice whose 15% decrease from baseline was deemed ‘significant’, whereas the larger 30% absolute decrease in the mutant mice may not have attained the required level of significance (quite possible considering that intervention/experimental groups often vary in response magnitude or even direction, relative to the smaller yet more consistent effect in the control/normal group). In this instance, none could claim that the decrease was significantly greater in the normal mice than the mutants because the actual/graphical data (if reported of course) would directly contradict such a conclusion.

As with so many statistical ‘glitches’, this neatly supports a much underused analytical tool involving i) looking at the actual data and ii) making a personal decision about what it may mean.

see ‘A picture is worth a thousand p-values’ (Loftus 1993)

## DrBSteamJets said,

October 11, 2011 at 8:54 pm

Now my earlier post appears – good job I didn’t type that all out a third time!

## Quackonomics said,

October 16, 2011 at 6:19 am

Great stuff as usual Ben!

## thom said,

October 17, 2011 at 11:35 am

@PsyPro “There is the claim in some of the comments that, well, why not just run a two-way ANOVA and test for a significant interaction? Two points: 1) it is not obvious that it is that *statistical* conclusion that is at issue in every case”

It is the focus of the paper – and their examples. They note that the error can occur in other contexts – but it is clear that they are talking about factorial ANOVA designs.

“and 2) interaction tests are notoriously of low-power relative to main-effects (as KeithLaws noted). Again, if the experimental conditions were such that there were NO theoretical bases for any of the statistical results, there may be a point in conducting the two-way ANOVA and subsequent simple-effects using the interaction error-term (but then we really are in the domain of dust-bowl empiricism).”

I think you may have misunderstood Keith. In a 2×2 ANOVA design the interaction test is the most statistically powerful test of the hypothesis of a difference in differences. Main effects are tests of different hypotheses. Interaction effects have notoriously low power when predictors are continuous, but this isn’t really such an issue for experimental designs where predictors are categorical and levels of factors chosen to be extreme. Switching to simple main effects analysis to detect an interaction decreases power – Keith noted that all simple main effects can be NS when there is a significant interaction.

More generally, simple main effects are not true follow-ups to an interaction effect as they are decompositions of SS from one of the main effects plus the interaction. For a 2×2 design, the interaction test is a highly focused test and if your hypothesis is about a difference in differences there is no more powerful alternative. For designs with several df for the effect (e.g., 3×3 or 3×4) you can get more powerful tests than the omnibus test of the interaction effect. These are interaction contrasts and allow you to test a focused pattern among means.

(None of this is predicated on using significance tests. The same is true for confidence intervals, likelihood ratios, Bayes factors etc.)

In short, researchers who make the error are using an unfocused procedure that has low statistical power to detect an interaction. Thus lack of power to detect interactions can not be a motivating factor.

@ Keith:

I think you are right about the culture here, but wrong that incompetence/ignorance is not a contributor:

psychologicalstatistics.blogspot.com/2011/09/problem-of-significance.html

## BenHemmens said,

October 17, 2011 at 12:00 pm

I’m a total fool with numbers, but as a scientist I eventually gravitated to a simple rule: just don’t bother with any effects around 20%. Whatever you do, set up your experiments so that something changes relative to something else by at least twofold. There is always something waiting to be done that will give you such results; it’s just a matter of selection.

Whether or not you know how to use statistical tools, this is a good rule of thumb because the fact is, you never know everything that’s going on. Some of your assumptions about your system are always wrong. Even if your ANOVA or whatever tells you the 15% are significant, don’t believe it. There’s ALWAYS something lurking in the undergrowth that will wipe out an effect of that size. You need leeway.

Yes, it’ll probably lead to you publishing less. But wouldn’t it be nice for all of us if we had to spend fewer weeks wading through bullshit every time we have to read up on a new area of research.

## Enky said,

October 24, 2011 at 4:02 am

As I’m sure this blog has noticed – most science is junk science because its methodology is flawed in ways like this and others.

Pyramids built off of mistakes, carelessness, ignorance, and the need to be published.

The example experiment described here doesn’t even include a placebo. You would need to somehow get a baseline for what happens when *anything* is dropped onto the cells (water, for example).

## pe51ter said,

October 24, 2011 at 4:20 pm

The correct procedure is as follows:

1 Carry out a (significance) test for the interaction (i.e. the difference of the difference). If this is significant stop there and try to interpret the interaction. In the mice/chemical example this means that the effect of chemical depends upon whether the mouse is normal or mutant.

2 If the interaction is non-significant then test the two main effects separately – i.e. (1) mutant versus normal and (2) effect of chemical.

The correct approach to the analysis is pretty basic and will be covered very early on in all experimental design texts.

The fault therefore lies entirely with journals for setting poor standards.

## bert said,

October 24, 2011 at 10:41 pm

The ‘error’ is far more subtle than stated and PsyPro has made that reasonably clear.

BenHemmens, your comment really does show you are “a total fool with numbers”. Statements about “effects around 20%” or “at least twofold” are meaningless. I could obtain a twofold difference that is both statistically and clinically meaningless, but get a 10% change that is both statistically significant and massively important at an individual level.

## William Stewart said,

October 25, 2011 at 1:07 am

I suggest the following: Suppose there are ten normal mice and ten mutants. Each mouse has a before and after score. So there are two sets of ten difference scores, one set for the normal mice and one set for the mutant mice. Carry out a one-sample (matched pairs) t-test for each set of difference scores. (Null hypothesis equal to zero for each.) The two sets of difference scores would be compared by a two-sample t-test.
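The procedure in this comment can be sketched in a few lines of Python. The twenty difference scores below are invented for illustration (percentage drop in firing rate per mouse, with the normal group given the same spread as the mutants, shifted down by 15), the two-sample test is assumed to be the equal-variance (pooled) version, and the critical values are the usual two-sided 5% figures from a t-table.

```python
from math import sqrt
from statistics import mean, stdev, variance

# INVENTED difference scores: percentage drop in firing rate for each mouse.
mutant_drops = [70, -10, 45, 5, 60, 20, -5, 50, 30, 35]   # mean drop of 30
normal_drops = [55, -25, 30, -10, 45, 5, -20, 35, 15, 20]  # mean drop of 15

# Two-sided 5% critical values from a standard t-table.
T_CRIT_DF9 = 2.262    # one-sample test on ten difference scores
T_CRIT_DF18 = 2.101   # pooled two-sample test, 10 + 10 - 2 degrees of freedom


def one_sample_t(scores):
    """t statistic for 'mean difference score is zero' (matched pairs)."""
    return mean(scores) / (stdev(scores) / sqrt(len(scores)))


def two_sample_t(a, b):
    """Pooled-variance t statistic for 'the two groups have the same mean'."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


t_mutant = one_sample_t(mutant_drops)
t_normal = one_sample_t(normal_drops)
t_between = two_sample_t(mutant_drops, normal_drops)

print(f"mutant drop:      t = {t_mutant:.2f} (critical {T_CRIT_DF9})")
print(f"normal drop:      t = {t_normal:.2f} (critical {T_CRIT_DF9})")
print(f"mutant vs normal: t = {t_between:.2f} (critical {T_CRIT_DF18})")
```

With these invented scores the mutant drop clears its critical value, the normal drop does not, and the comparison between the two groups does not either. That comparison is the only one of the three that tests whether the groups respond differently.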

## Jon Wade said,

October 27, 2011 at 2:38 am

Could a part of it be personal pride or a determination to make a difference? I mean, if someone has spent years studying something to find a relationship, maybe they will feel personally cheated if they simply conclude that there is no connection, no cause and effect. The researchers may simply want to see something so much that they omit to calculate it correctly, maybe on a subconscious level? I guess that is the same as incompetence, but somehow it seems more forgivable!

## Olhado said,

October 30, 2011 at 6:01 am

I suspect an additional factor at work here is the effect of these statistical errors in the presence of publication bias.

The classic form of publication bias is in which positive results are more likely to be published than negative. This can cause all sorts of problems with statistical analyses–after all, if a research question is tested over and over, one would expect the test to show a false positive eventually. If only the false positive is published and the others are ignored . . .

In any case, with a situation like this, publication bias can be even worse. In this case, bad statistics is more likely to show “positive” results. If positive results are in turn more likely to be published, then an indirect correlation is formed between bad statistics and publication.

This is the exact opposite of what we want.

-A Biostatistics PhD candidate.
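The “tested over and over” point in this comment comes down to simple arithmetic. The sketch below assumes independent tests of a question where there is truly no effect, each run at the conventional 0.05 threshold; the function name is made up for illustration.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of a true null comes up 'significant'."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 14, 20):
    print(f"{n:>2} tests: {prob_at_least_one_false_positive(n):.2f}")
```

Under those assumptions, by the fourteenth attempt a false positive somewhere is more likely than not.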

## shoi said,

November 6, 2011 at 5:40 pm

Perhaps the better journals should particularly note the statistics were reviewed by a named specialist statistician (in addition to the anonymous senior geneticist/whatever). Or something, but SOMETHING SHOULD BE DONE as they say.

There’s another similar article out today

www.sciencenews.org/view/feature/id/335872/title/Odds_Are%2C_Its_Wrong

## Wmuntean said,

November 28, 2011 at 3:27 pm

The classic “poor man’s interaction”, where one main effect is significant and the other isn’t. These are the easiest statistical criticisms to catch.

What scares me are these more difficult errors to spot (caused by either intention or otherwise) pps.sagepub.com/content/6/2/163.abstract?etoc

These statistical errors unfortunately require more data than what are typically published in articles. Without the raw data they go undetected.

## VAL20002 said,

December 7, 2011 at 2:44 pm

One of my biggest pet peeves – NO statisticians involved in the review process of academic journals!

## bizwiz said,

March 31, 2012 at 9:18 pm

@thom – I think you are right about there not being a more powerful test for a “difference in difference” test. However, often times that is not what is being tested. Perhaps this is obvious to you (as you correctly discuss simple effects using SS from both the main effect and interaction), but maybe not to other readers, so I elaborate.

Psypro has a point that the interaction test is not the most powerful way to test a THEORY. One could be interested in more than just a difference in slopes, but also the relative levels of all conditions. Suppose we have a 2×2 and the predicted deviations from the grand mean are:

A=+1, B=0,C=0,D=-1

A difference in slope test compares A-B to C-D. This ignores the relative levels of the AB and CD lines. If one wanted a stronger test, one could incorporate this other information and use contrast coding. Appropriate use of contrast coding is not trivial – I refer the reader to Rosnow and Rosenthal’s work. Of particular importance is the “semi-omnibus” or “residual” test. Also note, contrast coding requires that tests be specified a priori.
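A small Python sketch of the arithmetic behind this comment, using its predicted cell means. The weights are one reading of the comment rather than something it spells out: the slope comparison is taken as the contrast (A − B) − (C − D), and the theory-driven contrast simply reuses the predicted pattern as its weights. Only the expected value of each contrast is computed; noise and standard errors are ignored.

```python
# Predicted deviations from the grand mean, as given in the comment.
predicted = {"A": 1, "B": 0, "C": 0, "D": -1}

# ASSUMED weights for the slope comparison: (A - B) - (C - D).
interaction_weights = {"A": 1, "B": -1, "C": -1, "D": 1}

# ASSUMED theory-driven contrast: weights follow the predicted pattern (they sum to zero).
theory_weights = dict(predicted)


def contrast_value(weights, means):
    """Weighted sum of cell means for a given set of contrast weights."""
    return sum(weights[cell] * means[cell] for cell in means)


interaction_value = contrast_value(interaction_weights, predicted)
theory_value = contrast_value(theory_weights, predicted)

print("interaction contrast:", interaction_value)
print("theory contrast:     ", theory_value)
```

On this reading the two slopes are identical, so the interaction contrast comes out at zero even when the theory is exactly right, while the pattern-matched contrast does not.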

## PsyStef said,

April 2, 2012 at 6:58 pm

Well, I’m certainly no stats expert, but I do have some different thoughts on this issue. First of all, the paper mentioned 2 possible solutions to the problem, (1) a comparison of effect sizes between different groups and (2) reporting the interaction term of a mixed ANOVA. These are not the same: To compare effect sizes in a mixed design, some quite advanced stats are necessary that involve computing the confidence intervals for the respective effect sizes and some other things that can be quite time-consuming… Given that this is rather impractical, I don’t think we can blame anyone for not using this option for analysing data.

Second, reporting the interaction term for a mixed ANOVA seems easy enough. However, if you’re playing with single neurons, you can have 340 ‘good’ neurons in one sample, and maybe 120 ‘good’ neurons in another sample and now you’ll have to compare the experimental condition to the control condition. Given the different sample sizes, I could imagine that the mixed ANOVA wouldn’t immediately suggest itself to the analyst. It also seems wildly optimistic to think that the pre-conditions for the between-group comparisons would not be violated from the very beginning (homogeneity of variance-covariance matrix etc.). Doing the t-tests within groups is probably just a way to avoid non-parametric tests that would become necessary when the pre-conditions for the omnibus ANOVA are violated. So, all in all, I don’t think that the preference for t-tests reflects incompetence or dishonesty on part of the researcher. More likely, it’s just a shortcut that they chose to avoid problems with a potentially difficult data set. This is certainly not correct, and should definitely be discussed, but I guess the solution to this problem would be a straightforward correction for a ‘faulty’ mixed ANOVA that violates the assumptions (e.g., similar to Greenhouse-Geisser correction for repeated measurements).

What IS a bit disturbing is however the overall tendency to depict the standard error of the mean in within-subjects comparisons; and this ~20 years after it’s been pointed out that these measures of variance don’t tell us anything about the critical comparisons (e.g., Loftus & Masson, 1994). Big applause for Nieuwenhuis, Forstmann and Wagenmakers to squeeze a discussion of this issue into their paper.

PS: I’m myself not guilty of the analysis/interpretation error that Nieuwenhuis and colleagues pointed out, and I’m not trying to justify the use of t-tests where ANOVAs etc. would be more appropriate. All I’m saying is that I can *maybe* understand how this custom evolved.

PPS: I am however guilty of reporting the standard error in within-subjects contrasts (instead of the more appropriate within-subjects confidence intervals). 🙂

## David @ Therapy said,

August 8, 2012 at 6:06 am

I heard that 23.75 % of statistics are useless anyway.

## kcharneski said,

February 6, 2013 at 1:54 pm

Suggesting the appropriate statistical test to carry out is impossible without knowing the nature of the data. I’d say this is a big problem in biology in general, applying tests which do not suit the data at hand. Most biological data is rarely normally distributed and hence does not meet the underlying assumptions for the most common, straight-out-of-the-box parametric tests such as T tests. But people apply them anyway — I suspect they do not really understand how they work.