
Should psychologists embrace AI-powered hypothesis generation?

Emma Young on recent papers and commentaries.

13 November 2024

By Emma Young

Artificial intelligence is already transforming how people do their jobs, and it's poised to have a dramatic impact on psychological research and treatments. AI chatbots are being used to tailor behavioural change interventions to the individual, for example, and some researchers are even suggesting that they could be used to stand in for expensive human participants in pilot studies.

One of the biggest debates, though, centres on the extent to which AI and machine learning might be used not just to help investigate new ideas, but to come up with those ideas in the first place…

'Aha!' moments, or pattern recognition?

In 2023, Jens Ludwig and Sendhil Mullainathan at the University of Chicago co-authored a working paper titled 'Machine Learning as a Tool for Hypothesis Generation'. Though the testing of a hypothesis follows highly formalised procedures, the generation of hypotheses is often rooted more in intuition, inspiration, and creativity, they write. And yet, coming up with these hypotheses is, they argue, itself an empirical activity: 'creativity begins with "data" (albeit data stored in the mind), which are then "analysed" (albeit analysed through a purely psychological process of pattern recognition).'

What might feel to a researcher like an 'aha' moment of realisation of what to explore next is, in the view of Ludwig and Mullainathan, the output of data analysis run by the human brain. They think that machine learning algorithms, which can also spot patterns – and which are capable of finding patterns in almost unimaginably vast amounts of data – could in theory not only do this job, but do it even better than humans, by noticing patterns that people might not.

There are groups of researchers who are actively exploring this approach. Earlier this year, Song Tong of the AI for Wellbeing Lab at Tsinghua University and colleagues introduced a method that used GPT-4, an advanced large language model (LLM). The team asked GPT-4 to analyse 43,312 articles taken from a range of psychology journals, and to identify causal links between concepts mentioned in these papers. Next, they asked GPT-4 to come up with novel, plausible hypotheses that might explain these relationships.

Focusing on the area of wellbeing, GPT-4 identified a link with online social connectivity, for example. In the second step, it generated this hypothesis: 'Online social connectivity and access to well-being resources can build "virtual resilience" and enhance wellbeing during stressful events like pandemics.'
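To make the two-step process concrete, here is a minimal sketch in Python of how such a pipeline might look. It assumes the OpenAI Python client and an API key; the prompts, function names, and single-abstract input are illustrative assumptions, not the Tong et al. code, which additionally built causal graphs across tens of thousands of articles.

```python
# Illustrative sketch only: NOT the Tong et al. pipeline. Prompts and
# helper names are assumptions. Requires `pip install openai` and an
# OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def extract_causal_links(abstract: str) -> str:
    """Step 1: ask the model to pull causal concept pairs from one abstract."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "List any causal links between psychological concepts "
                "in this abstract, one 'cause -> effect' pair per line:\n\n"
                + abstract
            ),
        }],
    )
    return response.choices[0].message.content

def generate_hypothesis(causal_links: str) -> str:
    """Step 2: ask the model for a novel, plausible hypothesis explaining the links."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Given these causal links drawn from the psychology "
                "literature, propose one novel, plausible hypothesis "
                "that might explain them:\n\n" + causal_links
            ),
        }],
    )
    return response.choices[0].message.content

# In the study, step 1 ran over tens of thousands of articles;
# here, a single abstract stands in for the corpus.
links = extract_causal_links("Online social connectivity predicted wellbeing...")
print(generate_hypothesis(links))
```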

The researchers compared 130 such machine-generated hypotheses with a selection generated by human experts. AI gave these experts a run for their money, they concluded, stating that GPT-4's hypotheses were just as novel as the human-generated ideas.

The team believes that combining LLMs with machine learning techniques, such as the causal graphs included in their study, could allow researchers to extract novel insights from vast tracts of literature. The impact that integrating AI with traditional approaches will have on research will, they think, be 'profound'.

'Generate counterintuitive yet plausible hypotheses…'

In work also published this year, a team led by Sachin Banker at the University of Utah reached similar conclusions about the potential for AI. Devising a new research project in social psychology begins with generating a testable idea 'that relies heavily on a researcher's ability to assimilate, recall and accurately process available research findings,' the team notes in their paper in American Psychologist. However, a rapid increase in the sheer volume of new research findings is making this more challenging – and making it more likely that researchers will miss potentially important links. The team explored whether the LLMs GPT-3 and GPT-4 could do this kind of research synthesis and hypothesis generation instead.

Their findings with the more advanced GPT-4 were the most compelling. In this stage of the study, they instructed GPT-4 to act as an expert social psychologist with a wide range of research interests, including (but not limited to) social cognition, violence and aggression, group behaviour, and interpersonal relationships. The instructions went on: 'Your task is to generate counterintuitive yet plausible hypotheses. They should combine different subfields of social psychology and advance theoretical knowledge… Begin each hypothesis with 'Hypothesise that' and generate 100 hypotheses.'
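For readers curious how such an instruction might be issued programmatically, here is a hedged sketch using the OpenAI Python client. Only the quoted wording comes from the paper as reported here; the system/user framing, model name, and parsing step are assumptions for illustration.

```python
# A sketch of issuing a Banker et al.-style prompt. The role framing and
# parsing are illustrative assumptions; only the quoted instructions are
# from the paper as reported above.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are an expert social psychologist with a wide range of research "
    "interests, including (but not limited to) social cognition, violence "
    "and aggression, group behaviour, and interpersonal relationships."
)
TASK = (
    "Your task is to generate counterintuitive yet plausible hypotheses. "
    "They should combine different subfields of social psychology and "
    "advance theoretical knowledge. Begin each hypothesis with "
    "'Hypothesise that' and generate 100 hypotheses."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": TASK},
    ],
)

# Each line beginning 'Hypothesise that' is one candidate hypothesis.
hypotheses = [
    line for line in response.choices[0].message.content.splitlines()
    if line.strip().startswith("Hypothesise that")
]
print(f"{len(hypotheses)} hypotheses generated")
```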

GPT-4 obliged. The resulting 100 hypotheses were then mixed with ones generated by human researchers, and presented to social psychology experts for a blind review. These experts rated each hypothesis in their set for clarity, impact, and originality, plus plausibility and relevance (theoretically or practically). The results were striking, and perhaps a little uncomfortable: GPT-4's hypotheses were judged to be higher in quality on all five dimensions.

Onwards to the future!

The idea that AI might be able to take on the job of coming up with hypotheses – often seen as the most creative, human aspect of research – could generate some alarm, writes Jonah Berger of the University of Pennsylvania, in a commentary on this study.

This was certainly apparent in responses to a recent question to readers in the British Psychological Society's Research Digest newsletter. When given a brief description of the Tong et al. paper and asked, 'Do you think that psychology researchers should embrace AI-powered hypothesis generation?', 41.5 per cent of readers voted 'no, let's keep this between humans', 32.5 per cent said 'yes, onwards to the future!', and 30 per cent were undecided.

However, Berger asks: 'Why should we believe that the current (human) method of hypothesis generation is somehow ideal in the first place?' Many researchers rely on age-old methods of using intuition, personal observations, or whatever literature they happen to be aware of, he writes. 'But is this the best way to generate ideas? And a good way to generate the best ideas? Probably not.'

Banker and colleagues do stress in their paper that LLMs also have some limitations that are important to keep in mind. Their responses are based on the texts they have been trained on, so if those texts are biased (null results going unpublished, for example), their responses will be biased, too. Also, the team writes, the boundaries of an LLM's 'creative capacity' are shaped by the knowledge it has been trained on. 'They are currently not capable of generating truly novel insights that often arise from deep, creative thought processes that fundamentally challenge existing models and assumptions,' the team notes. Not yet, anyway.

Given this, Alejandro Hermida Carillo at LMU Munich and colleagues argue, in a second commentary on the Banker paper, for a to-and-fro process between a researcher and an LLM in the development of a hypothesis for testing.

Both LLMs and people have weaknesses. GPT models have a tendency to 'hallucinate' (or, some would argue, 'bullshit') information that isn't actually there. People, on the other hand, are prone to confirmation bias, for example. But people using LLMs also have a specific failing: research suggests that we tend to cognitively disengage when interacting with sophisticated artificial intelligence systems, over-relying on their output without critically evaluating it. As the team writes in their commentary: 'As a result, using GPT models for hypothesis generation without appropriate guardrails could lead naive researchers to waste time and resources pursuing irrelevant research avenues.'

The ideal approach, they believe, is for a researcher to start by building personally relevant information – such as their own research interests and goals, along with any debates in the field or resource constraints they consider relevant – into the instruction to an LLM to generate hypotheses. Then, the researcher should critically evaluate these hypotheses, using that evaluation to produce a modified prompt, or instruction, to take back to the LLM. The team envisages this back-and-forth process continuing until the researcher is completely happy with the hypothesis. In this way, they believe, an LLM such as GPT-4 could aid in, rather than displace, the researcher's role in coming up with hypotheses.
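Here is a rough sketch of what this to-and-fro might look like in code, again assuming the OpenAI Python client. The loop structure, prompt wording, and input mechanism are illustrative assumptions, not anything specified in the commentary.

```python
# An illustrative researcher-in-the-loop sketch: the researcher seeds the
# prompt with personal context, critically evaluates the output, and
# refines until satisfied. Details are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def propose(prompt: str) -> str:
    """Ask the model for hypotheses given the researcher's current prompt."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Researcher-supplied context: interests, goals, debates, constraints.
prompt = (
    "I study workplace wellbeing, with limited access to longitudinal "
    "data. Given current debates about remote work, suggest three "
    "testable hypotheses."
)

while True:
    print(propose(prompt))
    feedback = input("Refinement (or press Enter to accept): ").strip()
    if not feedback:
        break  # researcher is happy with the hypotheses
    # Fold the researcher's critical evaluation into the next prompt.
    prompt += f"\n\nRefine the hypotheses in light of this feedback: {feedback}"
```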

Carillo and his colleagues firmly believe that AI should have a place in research in psychology, including in coming up with hypotheses. 'In our view, however, such work must take seriously the limitations and capabilities of both the human and the machine.'

Researchers will now have to draw their own conclusions, of course, and make their own choices. Join those who may already be using AI to generate hypotheses? Or, as two-fifths of our newsletter respondents voted, keep AI at bay?