To P or not to P?
Marcus Munafò, our new Associate Editor for Research, on credibility in how psychologists report results.
08 March 2024
It is an enduring irony that psychologists are at pains to emphasise that behaviour lies along a continuum and diagnostic categories are somewhat arbitrary, yet when it comes to statistical inference we largely still focus on a 5 per cent P-value threshold. Arguments about whether and how we should use P-values have raged for years, but thinking beyond P-values and tests of point nulls (i.e., simply whether or not there is a difference from the null) could help us get a better handle on whether reported results are credible.
A statistically significant finding may give us grounds to reject the null hypothesis, but on its own it doesn't tell us much more than that. However, if we have designed our study with a particular effect size in mind - the minimum clinically, theoretically or biologically interesting effect - then we can say a bit more. If a study is designed to detect, say, the smallest effect we would deem clinically interesting, then even a non-significant result is meaningful - we can be more confident that if there is an effect we've failed to detect, it's probably too small to be interesting.
Unfortunately, this approach remains fairly uncommon. Sample size calculations are often based on the prior literature (which is probably overly optimistic, given that publication bias is ubiquitous and will lead to reported effect sizes being inflated). Worse still, they are sometimes based on historical precedent ('A sample size of 12 was good enough for my PhD supervisor, so it's good enough for me!'). Ultimately, effect size and sample size are design considerations and can help ensure results are interpretable no matter which way they turn out.
But thinking in terms of effect size can also help us get a handle on whether published studies are credible or not. Some effects are so large that we don't really need statistics to tell whether or not they're real…
Visible to the naked eye
Take the example of the average height of men and women - we know that men are on average taller than women, and don't have to calculate a t-statistic and P-value to arrive at that conclusion.
If we wanted to, though, then the effect size for this difference in average height (about d = 2, or two standard deviations) means we would need 6 men and 6 women to have an 80 per cent chance (i.e., 80 per cent statistical power) to detect a statistically significant difference (at a 5 per cent alpha level). So an effect size of d = 2 is a big effect - one that is visible to the naked eye, so to speak. And by extension, a study with N = 12 participants is only able to reliably detect effects that big.
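For readers who like to check the arithmetic, here is a minimal sketch of that power calculation in Python (assuming SciPy is available): a two-sided, two-sample t-test with d = 2, a 5 per cent alpha level, and 6 participants per group.

```python
# Sketch of the power calculation behind the '6 men and 6 women' figure:
# two-sided two-sample t-test, d = 2, alpha = 0.05, n = 6 per group.
from scipy import stats

d, n_per_group, alpha = 2.0, 6, 0.05
df = 2 * n_per_group - 2                 # degrees of freedom for the t-test
ncp = d * (n_per_group / 2) ** 0.5       # noncentrality parameter, d * sqrt(n/2)
t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
power = stats.nct.sf(t_crit, df, ncp)    # P(reject null | true effect is d = 2)
print(f"Power with n = {n_per_group} per group: {power:.2f}")  # comfortably above 0.80
```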
How is this helpful? The reproducibility crisis has thrown up plenty of studies that may not be as reliable as we'd hope. But if we'd been thinking in these terms, it should have been obvious that something was up with some of these studies, simply because if the reported effect sizes were real then the phenomena in question would have already been known. They'd have been visible to the naked eye!
Take one example from a 2012 study - that participants randomised to engage in a creative thinking task whilst standing outside a cardboard box (thinking outside the box!) generated more ideas than those standing inside the box. The effect size (calculated from the means and standard deviations reported in the paper) is d > 3 - much larger than the difference in average height between men and women. If this effect size estimate is even remotely accurate, we would feel wild swings in creativity on every recycling day when we walked past our neighbours' empty cardboard boxes.
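The calculation itself is straightforward. Here is a generic sketch of how Cohen's d is derived from two groups' reported means and standard deviations - the actual values come from the 2012 paper; the numbers below are purely illustrative, not the paper's.

```python
# Cohen's d from reported group means and standard deviations,
# using the pooled standard deviation as the denominator.
def cohens_d(m1, sd1, n1, m2, sd2, n2):
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / pooled_var**0.5

# Hypothetical example: a 1.6-point difference against a pooled SD of 0.5 gives d > 3.
print(cohens_d(m1=6.0, sd1=0.5, n1=20, m2=4.4, sd2=0.5, n2=20))  # ~3.2
```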
Unstandardised effect sizes
But part of the problem here is that the size of the effect is not reported in a way that makes intuitive sense. Standardised effect sizes like Cohen's d are helpful if we want to compare across different study designs and methods, but they don't give us a sense of the real-world effect size (unless we happen to know the standard deviation of the outcome measure in question off the top of our heads). Reporting unstandardised effect sizes can be more helpful for interpretation.
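To see why the standard deviation matters, consider converting a standardised effect back into the outcome's own units - the same Cohen's d can mean very different things in the real world. The standard deviations below are illustrative assumptions, not taken from any particular study.

```python
# Converting a standardised effect size back into the outcome's own units.
def raw_difference(d, outcome_sd):
    return d * outcome_sd

# d = 0.3 on a blood-pressure measure with SD 12 mmHg is a 3.6 mmHg difference...
print(raw_difference(0.3, 12))   # 3.6 (mmHg)
# ...while the same d on a reaction-time task with SD 150 ms is a 45 ms difference.
print(raw_difference(0.3, 150))  # 45.0 (ms)
```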
Here's an example. Choice architecture refers to how we design the environment around us to promote certain choices. This can be used to 'nudge' healthier behaviour, and there is evidence that the relative availability of products influences choice. If something is more available, or more prominent, we are more likely to select it.
Our Tobacco and Alcohol Research Group at the University of Bristol explored this in the context of alcohol-free beer, with the rationale being that increasing the availability of alcohol-free beer on draught would increase sales of alcohol-free beer, and reduce sales of alcoholic beer. If that were the case, that could have a positive impact on health.
We randomised pubs and bars in Bristol to serve alcohol-free beer on draught for a period, then revert to selling only alcoholic beer on draught, and then back again (called an ABBA design). See the preprint.
The key thing is how we reported the results. We found 'an adjusted mean difference of -29 litres per week (95 per cent CI -53 to -5), equivalent to a 5 per cent reduction (95 per cent CI 8 per cent reduction to 0.8 per cent reduction)'. In other words, pubs sold (on average) 29 litres or 5 per cent less alcoholic beer when alcohol-free beer was available on draught. A plausible effect size (I think), but also potentially a useful one. (We also found that, because sales of alcohol-free beer went up, overall takings weren't affected.)
So, thinking in terms of effect size rather than just P-values, and linking these to real-world examples of those effect sizes, can help calibrate our intuitions, and perhaps more reliably sift the robust findings from the less robust. And reporting our results in terms of effect sizes - standardised for comparison to other studies and unstandardised so we can get a clearer sense of the real-world effect - is a much more helpful approach than simply relying on P-values.
You'll note the example I've used is from the research group I'm part of. But as your new Associate Editor for Research, I'm keen to hear your examples of good research practice - design, analysis, reporting. The editor and I believe we can tell stories - of psychologists and the public - through the design, analysis and reporting of the research methods we use.
Marcus Munafò is Professor of Biological Psychology and MRC Investigator, and Associate Pro Vice-Chancellor - Research Culture, at the University of Bristol.
[email protected]
The editor, Dr Jon Sutton, comments: Methods and stats are not my strong point. I've admitted that a few times in these pages. Part of the reason I left academia all those years ago was the dawning realisation that, as Marcus says here, 'Ultimately, effect size and sample size are design considerations'. Without properly understanding the methods side of things, I was going to struggle to design research that made a difference.
I've made a few attempts over the years to face my demons in these pages… to drive coverage of research methods which will genuinely engage and inform. We've reached a lot of people, for example through our Research Digest. But in terms of the magazine, I don't think those attempts have been particularly successful. I've found it easier to tell stories about working lives than about methods and statistics.
So while I am proud of our focus on psychologists as people - we are, I am fond of saying, The Psychologist - I do think it's time to nudge in a different direction. Because I still believe it's possible to combine the two, in order to tell stories of psychologists and the public through research methods. And I think that's a vital and sometimes overlooked part of science communication.