Dispelling myths about randomisation
Marcus Munafò explores randomisation processes.
14 October 2024
When we are interested in cause and effect relationships (which is much of the time!) we have two options: we can simply observe the world to identify associations between X and Y, or we can randomise people to different levels of X and then measure Y.
The former – observational methods – generally provides us with only a weak basis for inferring causality at best. This approach has given us the oft-repeated (but slightly fallacious) line that 'correlation does not imply causation' (I would say that it can imply it, just often not much more). Of course, sometimes this is the best that we can do – if we want to understand the effects of years spent in education on mental health outcomes (for example), it would be unethical and impractical to conduct an experiment where we randomise children to stay in school for 1 or 2 more years (which option is unethical may depend on whether you're the child or the parent…).
But when we can randomise, that gives us remarkable inferential power. The lack of causal pathways between how we allocate participants to conditions (our randomisation procedure – hopefully, something more robust than tossing a coin!) and other factors is critical. If our randomisation mechanism influences our exposure (which by definition it should) and nothing else (ditto), and we see a difference in our outcome, then this difference must have been caused by the exposure we manipulated. But a lot remains poorly understood about exactly how and why randomisation has this magic property of allowing us to infer cause and effect. And this leads to misconceptions about what we should report in randomised studies.
I want to dispel a couple of common but persistent myths.
The first myth is that randomisation works because it balances confounders. Confounders exist in observational studies because the associations we observe between an exposure and an outcome are also influenced by myriad other variables – age, sex, social position and so on – via a complicated web of causal chains. In principle, if we measure all of these perfectly and statistically adjust for them then we are left with the causal effect of the exposure on the outcome. But in practice, we are never able to do this.
When we randomise people, these influences will still be operating on the outcome, which will vary across the people randomised to our conditions. Does randomisation mean that all these different effects are balanced somehow?
No – not least because confounders do not exist in experimental studies! This is for the simple reason that a confounder is something that affects both the exposure and the outcome, and in an experimental (i.e., randomised) study we test for a difference in our outcome between the two randomised groups. We know that randomisation influences the exposure, but we don't directly compare levels of exposure and the outcome – we compare the randomised arms. And variables such as age, sex and social position can't influence the randomisation mechanism (there is no causal pathway between, for example, participant age and our random number generator!).
So, to be accurate, we need to be talking about covariates in experimental studies – factors that influence or strongly predict the outcome – not confounders. Does randomisation balance these? Well, yes, but in a more technical and subtle sense than is generally appreciated. We know (mathematically) that the chance of a difference between our randomised groups in terms of covariates and the distribution of future outcomes becomes smaller as our sample size becomes larger (all other things being equal, larger experiments will provide narrower confidence intervals, and more precise estimates – as well as smaller p-values, if that's your thing!).
In other words, a smaller study has a higher chance of imbalance, and this will be reflected in a wider confidence interval (and correspondingly larger p-value).
This means that it doesn't matter if our groups are in fact balanced, because we've been able to turn complexity into error. If our sample is small our standard error will be large, reflecting the greater likelihood of imbalance, and our statistical test will take that into account when generating a confidence interval and p-value. That is exactly why larger studies are more precise – they are more likely to be balanced. Darren Dahly, a statistician at University College Cork, gives a more complete treatment of the issue here. In his words: 'randomisation allows us to make probabilistic statements about the likely similarity of the two randomised groups with respect to the outcome'.
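To make that 'probabilistic statement' concrete, here is a minimal simulation sketch (in Python; the covariate, effect sizes and sample sizes are purely illustrative assumptions, not drawn from any real study). It randomises people 1:1, records how far apart the arms end up on a prognostic covariate (age), and records the width of the 95% confidence interval for the difference in outcome:

```python
# Illustrative simulation only: variable names, effect sizes and sample sizes
# are arbitrary assumptions chosen to show the pattern, not a real study.
import numpy as np

rng = np.random.default_rng(42)
TRUE_EFFECT = 0.5  # assumed effect of the randomised exposure on the outcome

def one_trial(n_per_arm):
    age = rng.normal(50, 10, size=2 * n_per_arm)               # prognostic covariate
    arm = rng.permutation([0] * n_per_arm + [1] * n_per_arm)   # 1:1 randomisation
    outcome = 0.1 * age + TRUE_EFFECT * arm + rng.normal(0, 5, size=2 * n_per_arm)
    imbalance = abs(age[arm == 1].mean() - age[arm == 0].mean())
    se = np.sqrt(outcome[arm == 1].var(ddof=1) / n_per_arm +
                 outcome[arm == 0].var(ddof=1) / n_per_arm)
    return imbalance, 2 * 1.96 * se  # baseline imbalance, width of the 95% CI

for n in (20, 200, 2000):
    results = np.array([one_trial(n) for _ in range(2000)])
    print(f"n per arm = {n:4d}: mean |age imbalance| = {results[:, 0].mean():.2f}, "
          f"mean 95% CI width = {results[:, 1].mean():.2f}")
```

Across many simulated trials, the small studies show more chance imbalance and correspondingly wider confidence intervals – the statistical machinery is already pricing in the possibility of imbalance, which is exactly the point above.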
This leads to the second myth, which is that we should test for baseline differences between randomised groups. We see this all the time – usually Table 1 in an experiment – a range of demographic variables (the covariates we've measured – the known knowns) for each of the two groups, and then a column of p-values. Now, this is a valid approach in an observational study, where we might want to test whether something is in fact a confounder by testing whether it is associated with the level of the exposure (e.g. whether or not someone drinks alcohol). But is it valid in an experimental study (e.g. if we're randomising people to consume a dose of alcohol or not)?
Once we start to think about what those p-values in Table 1 might be telling us, the conceptual confusion becomes clear. A randomisation procedure should be robust (i.e., immune to outside influence), and the methods section should give us the information to evaluate this. What would a statistical test add to this? As Doug Altman said in 1985: 'performing a significance test to compare baseline variables is to assess the probability of something having occurred by chance when we know that it did occur by chance'. If our randomisation procedure is robust, by definition any difference between the groups must be due to chance. It's not a null hypothesis we're testing, it's a non-hypothesis.
Aha! But what if our randomisation process is not robust for reasons we're not aware of? Surely we can test for that this way? But how should we do that? In particular, what alpha level should we set for declaring statistical significance? The usual 5%? If we did that, we would find baseline differences in 1 in 20 studies (more, probably, since multiple baseline variables are usually included in Table 1) even if all of them had perfectly robust randomisation. Better to invest our energies in ensuring that our randomisation mechanism is indeed robust by design (e.g., computer-generated random numbers that are generated by someone not involved in data collection).
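A quick sketch of that arithmetic (again illustrative Python, with ten made-up baseline covariates and a randomisation that is perfect by construction): each baseline test still comes up 'significant' about 5% of the time, and with ten covariates in Table 1 roughly two in five studies will show at least one 'significant' baseline difference by chance alone.

```python
# Illustrative sketch: randomisation here is perfect by construction, yet
# baseline tests at alpha = 0.05 still 'detect' differences. The covariate
# count and sample sizes are arbitrary assumptions.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_arm, n_covariates, n_studies = 100, 10, 2000
per_test_hits, any_hit = 0, 0

for _ in range(n_studies):
    p_values = [ttest_ind(rng.normal(size=n_per_arm),          # covariate in arm A
                          rng.normal(size=n_per_arm)).pvalue   # same distribution in arm B
                for _ in range(n_covariates)]
    per_test_hits += sum(p < 0.05 for p in p_values)
    any_hit += any(p < 0.05 for p in p_values)

print(f"per-test false positive rate: {per_test_hits / (n_studies * n_covariates):.3f}")  # ~0.05
print(f"studies with >=1 'significant' baseline difference: {any_hit / n_studies:.2f}")   # ~0.40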
OK, OK – but what about deciding which of our baseline characteristics to adjust for in our analysis? It's true that adjusting for baseline covariates that are known to influence the outcome can increase the precision of our estimates (and shrink our p-values – hurrah!). But testing for baseline differences to decide what to adjust for is again conceptually flawed. A statistically significant difference is not necessarily a meaningful difference in terms of the impact on our outcome. It depends in large part on whether the covariate does in fact strongly influence the outcome, and we aren't testing that! Much better to select covariates based on theory or prior evidence – identify the variables we think a priori are likely to be relevant and adjust for these.
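As a rough illustration of that precision gain, here is a Python sketch with an invented covariate ('baseline severity') that strongly predicts the outcome; the effect sizes are assumptions, and the decision to adjust is made a priori rather than by testing for baseline differences:

```python
# Illustrative sketch: adjusting for a pre-specified prognostic covariate
# narrows the standard error of the treatment effect estimate. All values
# here are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_per_arm = 100
arm = np.repeat([0.0, 1.0], n_per_arm)                  # 1:1 allocation indicator
baseline_severity = rng.normal(size=2 * n_per_arm)      # strongly predicts the outcome
outcome = 0.5 * arm + 2.0 * baseline_severity + rng.normal(size=2 * n_per_arm)

unadjusted = sm.OLS(outcome, sm.add_constant(arm)).fit()
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([arm, baseline_severity]))).fit()

print(f"unadjusted: effect = {unadjusted.params[1]:.2f}, SE = {unadjusted.bse[1]:.2f}")
print(f"adjusted:   effect = {adjusted.params[1]:.2f}, SE = {adjusted.bse[1]:.2f}")  # smaller SE
```

Both analyses target the same causal effect; the adjusted one simply spends less of its error budget on variation the covariate already explains.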
Randomisation is extremely powerful but also surprisingly simple. Its power comes from the ability it gives us to control some of the key causal pathways operating, and to convert complexity into measurable, predictable error. So we can relax! We don't need to worry about 'balance' – our sample size and the standard error will take care of that (which is why we need to power our studies properly!) – and we don't need to have that column of p-values in Table 1 – they don't tell us anything useful or give us any information we can usefully act on. We should all – including the editors and reviewers who ask for these things – take note!
How should we report randomisation?
If we accept that the key to successful randomisation is getting the process right (rather than testing whether or not it works post hoc, which is fraught with conceptual and practical issues), how do we report randomisation in a way that allows readers to evaluate its robustness?
In medical studies – particularly clinical trials – journals expect authors to follow reporting guidelines (these exist for a vast range of study designs, many of which are relevant to psychology). A full description might look something like this:
Randomisation was generated by an online automated algorithm (at a ratio of 1:1), which tracked counts to ensure each intervention was displayed equally. Allocation was online and participants and researchers were masked to study arm. If participants raised technical queries the researcher would be unblinded; participants seeking technical assistance received no information on the intervention in the other condition and so were not unblinded. The trial statistician had no contact with participants throughout the trial and remained blinded for the analysis. At the end of the baseline survey, participants were randomised to view one of two pages with the recommendation to either download Drink Less (intervention) or the recommendation to view the NHS alcohol advice webpage (comparator).
This example was taken from a recent article published by Claire Garnett and colleagues (disclosure: I'm a co-author!), which tested the efficacy of an app to reduce alcohol consumption. As it was a clinical trial and published in a medical journal it had to follow the relevant reporting guidelines and describe the randomisation process fully.
Of course, sometimes the randomisation process is robust and can be described very briefly – a computer task may have randomisation built in, so the experimenter doesn't need to be involved at all. But that should still be described clearly. And sometimes the randomisation process does involve humans (and therefore may be potentially biased!).
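For readers who build their own tasks, here is a minimal sketch of what 'computer-generated and independent of the experimenter' can look like in practice (illustrative Python, and emphatically not the algorithm used in the trial quoted above): a permuted-block 1:1 allocation list, generated and stored by someone not involved in data collection.

```python
# Minimal sketch only: not the algorithm from the trial described above.
# Permuted blocks keep the arms equal in size; the seed (and the list itself)
# should be held by someone not involved in data collection.
import random

def blocked_allocation(n_participants, block_size=4, seed=2024):
    """Return a 1:1 allocation sequence ('intervention'/'comparator') in permuted blocks."""
    assert block_size % 2 == 0, "block size must be even for a 1:1 ratio"
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_participants:
        block = ["intervention"] * (block_size // 2) + ["comparator"] * (block_size // 2)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_participants]

print(blocked_allocation(8))  # e.g. ['comparator', 'intervention', 'intervention', ...]
```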
Something I've learned throughout my career is that we can learn a lot from how things are done in other disciplines (and also showcase what we do well in psychology). This is perhaps one example of that – there's lots of good practice in psychology when it comes to reporting randomised studies, but we can still look to learn and improve.
Marcus Munafò is a Professor of Biological Psychology and MRC Investigator, and Associate Pro Vice-Chancellor - Research Culture, at the University of Bristol. [email protected]