Prediction vs causal inference
Marcus Munafò, our Associate Editor for Research, on important differences in what our research questions actually are, and how we ask them.
16 September 2024
Statistical tests are a vital tool for psychology researchers, but it's important to remember that they are just that – a tool. Why does that need saying? Because, in my experience, we sometimes lose sight of our research question and focus on the statistical tools we're using, and the results they give us.
This matters because we can sometimes use the same tool to answer very different questions (in the same way that a hammer can be used both to hammer in a nail and to pull one out!). We can end up confused – conceptually – if we're not clear about how or why we're using a particular statistical tool in the context of a particular question.
A perfect example of this is linear regression. We can use this to estimate the association between variables – the independent variable (or variables) and the dependent variable. So far, so straightforward. We collect our data, we run our regression, and we have our result. X predicts Y.
Or does X cause Y?
That depends very much on our research question. Prediction and causal inference can appear very similar – not least because we can use linear regression in both cases. The distinction is not in the tools we use to answer the question, but the conceptual framework we apply when using that tool.
If our goal is to identify cause-and-effect relationships (and we don't have the luxury of experimentation via randomisation) then we need to deal with potential confounding – factors that can influence both our exposure and our outcome. We can adjust for these statistically – albeit imperfectly.
What does that look like? We include multiple variables in a regression model. However, we're only really interested in one of these variables – the putative causal risk factor (the 'exposure'). The other variables are our confounders, and appear in the model simply to adjust for their influence.
We might want to report results for our causal risk factor both unadjusted and adjusted for a range of potential confounders. In fact, if we did this perfectly – including all confounders, each measured without error – then what would remain in the fully adjusted model would be the causal effect. Unfortunately, that never happens!
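To make this concrete, here's a minimal simulated sketch in Python (using numpy and statsmodels; the variables and effect sizes are invented for illustration). A confounder C influences both the exposure X and the outcome Y; the unadjusted estimate for X is inflated, while adjusting for C recovers something close to the true causal effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical setup: a confounder C influences both the exposure X
# and the outcome Y; the true causal effect of X on Y is 0.3.
C = rng.normal(size=n)
X = 0.8 * C + rng.normal(size=n)
Y = 0.3 * X + 0.8 * C + rng.normal(size=n)

# Unadjusted model: the estimate for X absorbs the confounding.
unadjusted = sm.OLS(Y, sm.add_constant(X)).fit()

# Adjusted model: including C recovers (approximately) the true effect.
adjusted = sm.OLS(Y, sm.add_constant(np.column_stack([X, C]))).fit()

print(f"Unadjusted estimate for X: {unadjusted.params[1]:.2f}")  # ~0.69 (biased)
print(f"Adjusted estimate for X:   {adjusted.params[1]:.2f}")    # ~0.30 (close to truth)
```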
However, if our goal is prediction, concepts such as confounding simply don't apply. Anything that is predictive is useful. Yellow fingers may predict who is likely to get lung cancer, not because yellow fingers are carcinogenic, but because they're a marker for something that is – cigarette smoking. (Another favourite example is the correlation between ice cream consumption and shark attacks – cutting back on Flake 99s won't fend off Jaws).
What does that look like? We include multiple independent variables in a regression model. But in this case what we're interested in is the extent to which all of these variables, collectively, predict our outcome. If they do that well, then our model is predictive, and we can use it to estimate the value of the dependent variable for new cases.
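Again, a minimal simulated sketch (this time using scikit-learn; all variables invented). Notice that the question we ask of the model is collective – how well does it predict held-out cases? – rather than what any individual coefficient means.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2_000

# Hypothetical predictors: none needs to be causal, they just need
# to carry information about the outcome.
X = rng.normal(size=(n, 4))
y = X @ np.array([0.5, 0.3, 0.0, 0.2]) + rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# The question for a prediction model is collective: how well do all
# the variables together predict the outcome in new (held-out) cases?
print(f"Held-out R^2: {r2_score(y_test, model.predict(X_test)):.2f}")
```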
So we're using the same tool in these two cases, but our conceptual framework is very different because our research question is different.
In one case, we're trying to identify causal risk factors. If we can do that, we can target these – for example by developing new treatments or preventive interventions. If these are effective, and we're correct in our identification of a causal risk factor, then we can change the likelihood of the outcome.
In the other case, we're simply trying to predict something – for example, who is likely to develop a particular outcome. That can still be used to inform clinical decisions – who to treat for example. But it doesn't matter if any of the predictive variables are causal, and the treatments don't need to target those factors.
Clinical prediction models are increasingly being developed in psychology and psychiatry to group patients into discrete subgroups (stratification) or individualised care pathways (personalisation), and, in turn, to help clinicians make decisions about the best course of treatment for those groups or individuals.
So, prediction and causal inference are fundamentally different types of research questions, even if we use the same statistical tools to answer them. But because of that we need to be precise about the language we use. If our question is a causal one, don't talk about 'predictors' in your regression model!
Causal graphs
What else can we do to help us be clear about our conceptual framework? One valuable tool is causal graphs – sometimes called directed acyclic graphs. These are used to refine causal questions and identify confounders that should be adjusted for (as well as variables that should not be adjusted for).
Julia Rohrer has written an excellent review of the use of causal graphs in psychological research, including how to deal with familiar issues – such as mediators – as well as ones that may be less familiar – such as colliders – in the context of observational data. The use of these graphs is common in epidemiology, but less common in psychology.
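To see why 'variables that should not be adjusted for' matter, here's a simulated sketch of a collider (invented effect sizes): X and Y are causally unrelated, but both influence Z. Adjusting for Z conjures an association between X and Y out of nothing.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

# X and Y are causally unrelated, but both influence the collider Z.
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = X + Y + rng.normal(size=n)

# Without adjustment, we correctly find no association between X and Y.
crude = sm.OLS(Y, sm.add_constant(X)).fit()

# 'Adjusting' for the collider Z induces a spurious association.
collider = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()

print(f"Estimate for X, unadjusted:     {crude.params[1]:.2f}")    # ~0.00
print(f"Estimate for X, adjusted for Z: {collider.params[1]:.2f}") # ~-0.50
```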
The title of Julia's review, published in 2018 in Advances in Methods and Practices in Psychological Science, is 'Thinking Clearly About Correlations and Causation'. This is what we need – to be clear on our conceptual framework, and to understand the subtle but important distinction between, for example, prediction and causal inference. And not to unthinkingly apply a tool before we've thought about what our question is.
Being clear on this distinction can also help with our interpretation. For example, there is clear evidence that the number of books in the home that a child grows up in is associated with their educational attainment. So providing books (as an intervention) should boost educational outcomes.
Well, only if the effect is causal.
It's possible, for example, that books in the home are a marker of parental educational attainment, or of the extent to which parents more generally provide intellectual support and nurturance for their children. In that case, books in the home may not, in and of themselves, be having a causal effect on the children's outcomes.
They might still be a predictor – like the hypothetical example of a smoker's yellow fingers – but if they're non-causal then interventions targeting them will be doomed to fail. A complete causal model would include this possibility – a causal path from parents to the number of books.
We can then test whether books are a mediator (and therefore themselves causal) or just an offshoot of parental behaviour. Causal graphs allow us to test competing explanations, so we can be more confident that our answer is a correct one, and the knowledge we generate can be used effectively – for example, to develop interventions.
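Here's a simulated sketch of the 'offshoot' explanation (all effect sizes invented): parental factors drive both the number of books and attainment, and books themselves do nothing. Books still 'predict' attainment, but the association vanishes once the parental pathway is modelled.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 10_000

# Hypothetical scenario: parental factors P drive both the number of
# books B and the child's attainment A; books have no direct effect.
P = rng.normal(size=n)
B = 0.9 * P + rng.normal(size=n)
A = 0.7 * P + rng.normal(size=n)

# Books strongly 'predict' attainment...
crude = sm.OLS(A, sm.add_constant(B)).fit()

# ...but the association disappears once the parental path is modelled.
adjusted = sm.OLS(A, sm.add_constant(np.column_stack([B, P]))).fit()

print(f"Books estimate, crude:    {crude.params[1]:.2f}")    # ~0.35
print(f"Books estimate, adjusted: {adjusted.params[1]:.2f}") # ~0.00
```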
Collinearity: Should we worry?
All statistical tests are only as good as the data they are used on.
In regression analysis, we are usually taught that a source of potentially significant error is collinearity, where supposedly independent variables are not truly independent (i.e., they are correlated). For example, two questionnaires, ostensibly measuring different things, may both be influenced by the same phenomenon or underlying construct.
(As an aside, the word 'independent' may be an echo of experimental science that doesn't really apply in the context of observational data; perhaps its continued use has fostered the notion that independence is an important property of the data.)
In fact, collinearity is usually not a major problem (certainly if the correlation between variables is less than, say, r = 0.8). But, more importantly, whether and how it matters depends on your research question.
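If you want to gauge how much collinearity you actually have, variance inflation factors (VIFs) are a common diagnostic. Here's a sketch using statsmodels (invented data); a rough rule of thumb is that VIFs above about 5–10 deserve attention.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 1_000

# Two questionnaires influenced by the same underlying construct,
# plus one unrelated variable.
construct = rng.normal(size=n)
q1 = construct + 0.3 * rng.normal(size=n)
q2 = construct + 0.3 * rng.normal(size=n)
other = rng.normal(size=n)

exog = sm.add_constant(np.column_stack([q1, q2, other]))

# VIF = 1 means no collinearity; values above ~5-10 are often taken
# as a warning sign. (Skip column 0, the constant.)
for name, idx in [("q1", 1), ("q2", 2), ("other", 3)]:
    print(f"VIF for {name}: {variance_inflation_factor(exog, idx):.1f}")
```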
In causal inference, we use regression analysis to estimate the effect of an exposure on an outcome, and include potential confounders as other variables in our regression model. In this case, we might want the variables we include to be correlated. For example, socioeconomic position is an important potential confounder, but no single variable can capture it adequately.
So we might want to include multiple variables – parental education, household income, deprivation index for the household postcode, and so on. By doing so, we include multiple correlated variables but better capture the underlying construct we're interested in. Since the only estimate from our model that we're interested in is that of our exposure, collinearity among these adjustment variables doesn't matter.
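A sketch of that point (invented data and effect sizes): the two socioeconomic measures are highly correlated and their individual coefficients aren't worth interpreting, but the estimate for the exposure – the only one we care about – stays close to the truth.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5_000

# A latent socioeconomic construct, imperfectly captured by two
# highly correlated measures (e.g., parental education and a
# deprivation index).
sep = rng.normal(size=n)
measure1 = sep + 0.3 * rng.normal(size=n)
measure2 = sep + 0.3 * rng.normal(size=n)

# The construct confounds the exposure-outcome relationship;
# the true causal effect of the exposure is 0.2.
exposure = 0.6 * sep + rng.normal(size=n)
outcome = 0.2 * exposure + 0.7 * sep + rng.normal(size=n)

model = sm.OLS(
    outcome,
    sm.add_constant(np.column_stack([exposure, measure1, measure2])),
).fit()

# The coefficients on the two collinear measures are unstable and not
# worth interpreting; the exposure estimate comes out close to the
# true 0.2 (a little residual confounding remains, because the
# measures are imperfect).
print(f"Exposure estimate: {model.params[1]:.2f}")
```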
In prediction, the situation is a bit more complicated. If we include multiple correlated (i.e., collinear) variables in our model, the estimates for each of them will be unstable and difficult to interpret. This matters if, for example, we want to develop a parsimonious clinical prediction model that uses the fewest variables possible (e.g., to make it easy to use in clinical practice).
In this case, we can't select variables based on the strength with which they individually predict the outcome. We would be better off simply selecting the one variable (from those that are correlated) that is easiest or cheapest to measure. So pragmatic considerations might offer the simplest solution.
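A sketch of that pragmatic point (invented data): two highly correlated measures carry almost the same information, so dropping the more expensive one barely dents held-out predictive performance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)
n = 2_000

# Two highly correlated candidate predictors (say, two questionnaires
# tapping the same construct), one cheaper to administer than the other.
construct = rng.normal(size=n)
cheap = construct + 0.3 * rng.normal(size=n)
expensive = construct + 0.3 * rng.normal(size=n)
y = construct + rng.normal(size=n)

candidates = {
    "both measures": np.column_stack([cheap, expensive]),
    "cheap measure only": cheap.reshape(-1, 1),
}

for label, X in candidates.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    # The two R^2 values come out almost identical.
    print(f"Held-out R^2, {label}: {r2:.2f}")
```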
Ultimately, collinearity matters if variables are very highly correlated (e.g., r > 0.8) and we want to interpret the estimates for all the variables in the model. But we can avoid getting bogged down in that either by remembering that our question is a causal one (in which case the only estimate that matters is the one for our exposure of interest), or by taking a pragmatic approach to variable selection if our question is related to prediction (e.g., developing a clinical prediction model).
Of course, these aren't the only scenarios – we might be interested in the causal effects of two (or more!) exposures simultaneously, in which case we need to be confident that these are reasonably independent. But the key point is that we need to be clear what our question is.