Predicting replicability
Marcus Munafò on heuristics, statistical literacy, writing style and intuition in research prediction.
06 February 2025
Some outcomes can be hard to predict – US presidential elections, for example. But sometimes we can in fact make predictions that, intuitively, we might expect to be quite difficult.
A lot has been written about the extent to which psychological research is reproducible – most famously, perhaps, via the Reproducibility Project: Psychology, which found that only around 40 per cent of published findings could be reliably reproduced. What is (in my experience) less well known is that there was a parallel project looking at whether scientists could predict in advance which findings would be replicable.
This study, led by Professor of Economics Anna Dreber, used a prediction market – similar to trading on the stock market, except that traders buy or sell positions based on whether they think a future outcome is likely to transpire (in this case, a finding being reproduced). This is intended to capture what is sometimes referred to as the 'wisdom of crowds'.
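To make the mechanism concrete, here is a toy sketch of the basic trading logic – my own illustrative numbers and function name, not anything taken from the study. A contract pays out if the finding replicates, so a trader compares their own estimate of that probability with the current market price, and the price that emerges can be read as the crowd's collective estimate.

```python
# Toy sketch of prediction-market trading logic (illustrative numbers only).
# A contract pays 1 if the finding replicates and 0 if it does not, so the
# market price can be read as the crowd's implied probability of replication.

def trade_decision(own_probability: float, market_price: float) -> str:
    """Buy if you think replication is more likely than the price implies,
    sell if you think it is less likely, otherwise hold."""
    if own_probability > market_price:
        return "buy"
    if own_probability < market_price:
        return "sell"
    return "hold"

print(trade_decision(own_probability=0.55, market_price=0.40))  # buy
print(trade_decision(own_probability=0.55, market_price=0.70))  # sell
```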
The results were striking. A simple survey of participants' beliefs about which findings would reproduce correctly predicted only 58 per cent of the outcomes. In contrast, the prediction market – which was dynamic, allowing participants to buy and sell positions over a defined period using real money provided as an incentive to participate – correctly predicted 71 per cent of the outcomes.
Clearly, collectively, we are able to do a reasonable (albeit not perfect) job of predicting which research findings are likely to be robust. The question this raises is: What information was being used to make these – reasonably accurate – predictions?
What information signals replicability?
Another study, led by Bonnie Wintle and published in Royal Society Open Science in 2023, addressed a similar question using a mixed methods approach. It suggested that people use heuristics such as effect size and the reputation of a field to judge whether a finding is likely to be reproducible. There was also a relationship between statistical literacy and the accuracy of predictions.
Having taken part in the Dreber prediction market study myself, I can say this resonates with my own experience – I used exactly those heuristics when making judgements. I also used a bit of statistical understanding – for example, even if an effect is real, a replication study powered at 80 per cent will detect it at most about 80 per cent of the time; so if the market has priced successful replication at 90 per cent, it makes sense to sell that position, because it is over-priced given the study's power to detect the effect!
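A back-of-the-envelope version of that reasoning, with my own illustrative numbers rather than figures from the study, looks something like this:

```python
# Back-of-the-envelope check (illustrative numbers only): even if the original
# effect is certainly real, a replication with 80 per cent statistical power
# succeeds at most about 80 per cent of the time, so a contract priced at 0.90
# is over-priced and selling it has positive expected value.

power = 0.80          # probability the replication detects a genuinely real effect
market_price = 0.90   # market's implied probability of successful replication

# Upper bound on the true probability of successful replication,
# assuming the original effect is real.
p_replicate = power

# Expected profit per contract from selling at the market price: you receive
# the price now and pay out 1 only if the replication succeeds.
expected_profit = market_price - p_replicate
print(f"Expected profit per contract sold: {expected_profit:.2f}")  # 0.10
```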
It's also likely that there is information contained within the article itself that might provide clues as to whether a finding is likely to be robust or not. It's worth noting that we may not know that this information is in there, or that we're using it – we are very good at extracting information without even realising we're doing it. And, as authors, the way we write may lead us to unconsciously leave a trail of breadcrumbs as a clue. Is there something about the writing style of articles that predicts robustness?
A 2024 study by Herzenstein and colleagues suggests exactly that. They found that the language used to describe a study can predict the outcomes. Specifically, 'the language in replicable studies is transparent and confident, written in a detailed and complex manner, and generally exhibits markers of truthful communication'. Nonreplicable studies, by contrast, are 'vaguely written and have markers of persuasion techniques, such as the use of positivity and clout'.
So it's possible that this information – as well as the other factors that have been identified, such as effect size and the reputation of a field – might shape our intuitions as to whether a finding is likely to be reproducible or not. And this information may arise because authors themselves have a sense of whether the findings they are reporting are likely to be robust or not – they are most likely to know, after all – and unconsciously write in a particular way as a result.
Not only the experts…
Do you have to be an expert to be able to predict which scientific findings are robust? The prediction market study (implicitly) assumes so – only researchers were invited to participate. The study by Herzenstein and colleagues provides a possible mechanism (or at least one of potentially many). But there are intriguing suggestions that we may not need researchers themselves to provide us with reasonable predictions.
Perhaps unsurprisingly (given the Herzenstein study) machine learning models seem to be able to do a pretty good job – but not necessarily because of how the papers are written. Wu Youyou and colleagues' 'discipline-wide investigation of the replicability of Psychology papers over the past two decades' showed that author track record (publications, citation record) is positively related to the likelihood of replication, but other proxies of quality (institutional prestige, citations to that paper) are not. Most interestingly, they find that 'media attention is positively related to the likelihood of replication failure' (emphasis added).
So some things we would hope would be related to replicability (such as author track record) are indeed so, while others we might expect to be (such as institutional prestige) are not – aficionados of the UK's Research Excellence Framework exercise, take note! But perhaps most surprisingly, media coverage is associated with poor replicability – perhaps because the media (and university press offices) like to cover novel, eye-catching findings, which are precisely those that are, a priori, less likely to replicate.
But most intriguing, for me, is the suggestion that even non-experts can do a reasonable job of predicting which findings will stand the test of time. Hoogeveen and colleagues presented 27 high-profile social science findings to 233 people without a PhD in psychology and asked them to evaluate their intuitive plausibility. These participants predicted replication success with 59 per cent accuracy (i.e., above chance level), and this increased to 67 per cent when they were also given information about the strength of evidence in the original study.
I've written before about the role of story and narrative in scholarly communication, and that it can be both positive and negative. It's notable, therefore, that one of the conclusions of Herzenstein and colleagues is that 'nonreplicable studies have the archetypical structure of many stories'. So perhaps we do need to be careful if the story feels too compelling. If you want to find out more about story and scholarly communication, a new podcast by Anna Ploszajski, Storyology, explores exactly that!
Once bitten…
There is another (perhaps related) factor that seems to be at play here. If a finding feels too good to be true, or implausible, then perhaps it is. As I say, by definition, novel and unexpected findings are more likely to be flukes – our gut instinct or intuition in these cases perhaps reflects our Bayesian priors about how the world works. In fact, I've written about this in the context of effect size – an implausibly large effect size should be another reason to be cautious.
There's nothing wrong, of course, with high-risk / high-return research – it's an essential part of science. But we need to treat some findings with more caution than others – and attempt to replicate them, of course. And the media being interested in a study might be one proxy for that – 'Man Bites Dog' is a more interesting headline than 'Dog Bites Man', after all. Good science involves both discovery and replication, but we need to remember that not all our 'discoveries' are in fact discoveries.
Marcus Munafò is Professor of Biological Psychology and MRC Investigator, and Associate Pro Vice-Chancellor - Research Culture, at the University of Bristol. [email protected]