Serious power failure threatens the entire field of neuroscience
"Low statistical power is an endemic problem in neuroscience."
10 April 2013
Psychology has had a torrid time of late, with fraud scandals and question marks about the replicability of many of the discipline's key findings. Today it is joined in the dock by its more biologically oriented sibling: neuroscience. A team led by Katherine Button at the School of Experimental Psychology in Bristol, and including psychologist Brian Nosek, founder of the new Center for Open Science, makes the case in a new paper that the majority of neuroscience studies involve woefully small sample sizes, rendering their results highly unreliable. "Low statistical power is an endemic problem in neuroscience," they write.
At the heart of their case is a comprehensive analysis of 49 neuroscience meta-analyses published in 2011 (that's all the meta-analyses published that year that contained the information required for their purposes). This took in 730 individual papers, including genetic studies, drug research and papers on brain abnormalities.
Meta-analyses collate all the findings in a given field to provide the most accurate estimate possible of the size of any relevant effects. Button's team compared these effect size estimates for neuroscience's subfields against the average sample sizes used in those same areas of research. If the meta-analyses for a particular subfield suggested that an effect – such as a brain abnormality associated with a mental illness – is real but subtle, then suitable investigations in that field ought to involve large samples in order to be adequately powered. A larger effect size can be detected with more modest samples.
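To give a feel for that relationship, here is a minimal Python sketch (using the statsmodels library; the effect sizes are standard illustrative conventions, not figures from Button's paper) showing how many participants per group a simple two-group comparison needs to reach the conventional 80 per cent power:

```python
# Minimal sketch (illustrative effect sizes, not values from the paper):
# participants per group needed for 80% power in a two-sample t-test, alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for d in (0.8, 0.5, 0.2):  # Cohen's conventions for large, medium and small effects
    n = power_calc.solve_power(effect_size=d, alpha=0.05, power=0.8,
                               alternative='two-sided')
    print(f"Cohen's d = {d}: roughly {n:.0f} participants per group")
```

The subtler the effect, the steeper the sample-size bill – which is exactly the squeeze the meta-analytic comparison exposes.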
Based on this comparison, the researchers estimate that the median statistical power of a neuroscience study is just 21 per cent. In other words, a typical study has only about a one-in-five chance of detecting a genuine effect it is looking for, so the large majority (around 79 per cent) of such effects will be missed. More worrying still, when an underpowered study does turn up a significant result, the low power increases the chance that the finding is spurious. A third problem is that significant effect sizes from underpowered studies tend to be overestimates of the true effect size, even when the reported effect is real. This is because, by their very nature, underpowered studies are only likely to reach significance when the sampled data happen to show a large effect.
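Those three consequences are easy to see in a quick simulation. The sketch below (plain NumPy/SciPy, with a true effect size and sample size chosen purely for illustration rather than taken from the paper) runs thousands of small two-group "studies" of a genuine but modest effect: only a minority reach significance, and those that do overestimate the effect.

```python
# Illustrative simulation (assumed parameters, not from the paper): many small
# two-group studies of a modest but real effect. Low power means most studies
# miss the effect, and the ones that reach significance overestimate its size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n_per_group, n_studies = 0.3, 30, 10_000  # modest true effect, small samples

observed_d, significant = [], []
for _ in range(n_studies):
    a = rng.normal(true_d, 1.0, n_per_group)  # group carrying the effect
    b = rng.normal(0.0, 1.0, n_per_group)     # control group
    _, p = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    observed_d.append((a.mean() - b.mean()) / pooled_sd)
    significant.append(p < 0.05)

observed_d, significant = np.array(observed_d), np.array(significant)
print(f"Power (share of studies detecting the effect): {significant.mean():.0%}")
print(f"True effect size: {true_d}")
print(f"Mean effect size among significant studies: {observed_d[significant].mean():.2f}")
```

With these made-up numbers the detection rate comes out close to the 21 per cent median power reported for the field, and the studies that reach significance report an effect roughly twice its true size.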
It gets more worrying. The issues above are what you get when everything else in the methodology is sound, bar the inadequate sample size. Trouble is, Button and her colleagues say underpowered studies often have other problems too. For instance, small studies are more vulnerable to the "file-drawer effect", in which negative results tend to get swept under the carpet (simply because it's easier to ignore a quick and easy study than a massive, expensive one). Underpowered studies are also more vulnerable to an issue known as "vibration of effects", whereby the results vary considerably with the particular choice of analysis. And yes, there is often a huge choice of analysis methods in neuroscience: a recent paper found that, across 241 fMRI studies, 223 unique analysis strategies were used.
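To see why that abundance of analytic options is hazardous, here is a toy Python sketch (an invented setup, not an analysis from the paper): a researcher who can choose among 20 analysis variants of a dataset containing no real effect at all, and who reports whichever variant crosses p < 0.05, will stumble on a spurious "finding" far more often than the nominal 5 per cent of the time.

```python
# Toy illustration (assumed setup, not from the paper): with no true effect,
# trying many analysis variants and keeping whichever gives p < 0.05 pushes
# the false-positive rate far above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_datasets, n_variants, n_per_group = 5_000, 20, 20

hits = 0
for _ in range(n_datasets):
    # One null dataset, analysed 20 different ways (here: 20 outcome measures).
    a = rng.normal(0.0, 1.0, (n_variants, n_per_group))
    b = rng.normal(0.0, 1.0, (n_variants, n_per_group))
    pvals = stats.ttest_ind(a, b, axis=1).pvalue
    hits += (pvals < 0.05).any()

print(f"Datasets yielding at least one 'significant' result: {hits / n_datasets:.0%}")
```

Roughly two-thirds of these null datasets produce at least one nominally significant result – a reminder that flexible analysis plus small samples is a recipe for false positives.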
Because of the relative paucity of brain imaging papers in their main analysis, Button's team also turned their attention specifically to the brain imaging field. Based on findings from 461 studies published between 2006 and 2009, they estimate that the median statistical power in the sub-discipline of brain volume abnormality research is just 8 per cent.
Switching targets to the field of animal research (focusing on studies involving rats and mazes), they estimate that most studies had statistical power that was "severely" inadequate, in the range of 18 to 31 per cent. This raises important ethical issues, Button's team said, because it makes it highly likely that animals are being sacrificed with minimal chance of discovering true effects. It's clearly a sensitive area, but one logical implication is that it would be more justifiable to conduct studies with larger samples of animals, because at least then there would be a realistic chance of discovering the effects under investigation (a similar logic also applies to human studies).
The prevalence of inadequately powered studies in neuroscience is all the more disconcerting, Button and her colleagues conclude, because most of the low-hanging fruit in brain science has already been picked. Today, the discipline is largely on the search for more subtle effects, and for this mission, suitable studies need to be as highly powered as possible. Yet sample sizes have stood still, while at the same time it has become easier than ever to run repeated, varied analyses on the same data until a seemingly positive result crops up. This leads to a "disquieting conclusion", the researchers said – "a dramatic increase in the likelihood that statistically significant findings are spurious." They end their paper with a number of suggestions for how to rehabilitate the field, including performing routine power calculations prior to conducting studies (to ensure they are suitably powered), disclosing methods and findings transparently, and working collaboratively to increase study power.
Further reading
KS Button, JPA Ioannidis, C Mokrysz, BA Nosek, J Flint, ESJ Robinson, and MR Munafò (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. DOI: 10.1038/nrn3475