Telling people together
Nadine Lavan, winner of the British Psychological Society’s Award for Outstanding Contributions through Doctoral Research, on voice perception.
15 October 2019
Being able to recognise someone by their voice is in fact an impressive feat that is anything but trivial to achieve. One of the key features of human voices is that they are highly flexible and incredibly variable. How we use our voice and what it sounds like can change dramatically from situation to situation: think, for example, about how different a person's voice may sound when they are speaking, singing, laughing, or shouting. How does this substantial within-person variability affect our ability to extract information about identity from voices?
Why and how does our voice change?
When we speak, we are usually trying to convey information to others. How we convey this information will involve changes to our tone of voice, depending on who we are talking to and what environment we are in, alongside variations in our choice of words. Some of these changes in the sound of the voice are under our full volitional control, others are not.
We can adapt our speaking style to suit a situation and might choose to sound very different when we are, for example, giving a public presentation compared to when we are talking to a good friend over dinner. When giving a speech, we may speak more loudly, at a slower rate and pronounce words more clearly. If we are in an environment that is noisy, we may have to shout to still be able to talk to other people. Shouting raises the pitch of the voice and introduces acoustic qualities that sound harsh or rough, alongside the obvious changes in loudness. We will also press more air out of our lungs when shouting, so our sentences may become shorter and more telegraphic. If we speak to a small child we will adopt a different way of speaking compared to speaking to an adult. In child-directed speech, we may over-articulate aspects of our speech. We also use more variable pitch to clearly emphasise certain aspects of our speech and to convey emotions. In other scenarios we may be amused and laugh helplessly: here, we produce a very high-pitched vocalisation and may be completely unable to continue to speak.
At its most extreme, voice change can be illustrated through voice artistry: impersonators and impressionists can convincingly sound like another person. Voice-over artists often traverse age and gender boundaries in their performances: the cartoon character Bart Simpson, a 10-year-old boy, has been voiced for decades by Nancy Cartwright, a female voice artist born in 1957. Such voice artistry and voice disguise are of course extreme cases of how voices can be changed and varied. Nonetheless, it is easy to imagine that the substantial variability in voices might present challenges for accurate identity perception.
Recognising other people from variable voices
During the first half of my PhD studies, I was researching how people perceive laughter. While running these studies, which involved playing lots of laughter clips recorded from different people to my participants, I was struck by how little insight participants had into how many different people's voices they had heard over the course of an experiment.
After a quick literature search to see if anyone had reported anything similar, I realised that the voice identity perception literature had almost exclusively used very uniform, tightly controlled recordings of speech – vowels, words, sentences and stories – to represent voice identities. While these stimuli were perfectly suited for the purposes of the original studies, they could not tell me why people were unable to make sense of identity when listening to laughter in my experiments.
I therefore ran a study to follow up this observation. In this study, listeners were presented with pairs of sounds, including recordings of spoken vowels ('ah') and laughter. Their task was to judge whether the sounds were made by one person, or two different people. Some pairs included two recordings of spoken vowels; some included two recordings of laughter. We also included pairs that included one recording of laughter and one of spoken vowels.
We tested people who were unfamiliar with the voices and people who already knew them (students who had been lectured by the individuals providing the recordings). Previous studies have shown that people listening to unfamiliar voices can perform very well at speaker discrimination tasks: when making these same/different judgements based on, for example, full sentences spoken in the listener's native language, listeners easily reach 80-90 per cent correct on average.
In our study, however, when trying to make sense of identity across vowels and laughter, unfamiliar listeners performed at chance level – they would have been as accurate if they had been giving random responses, indicating that this task was extremely hard. Familiar listeners, who knew the voices, were more accurate, but performance was overall still poor. By using these extreme examples, of spoken vowels and laughter, we had demonstrated how variability can dramatically disrupt identity perception.
Even when dealing with more linguistically informative stimuli, studies have shown that unfamiliar listeners have trouble dealing with variability. For example, listeners get worse at making speaker discrimination judgements when hearing bilingual speakers talk in their two native languages. Similarly, listeners struggle to make sense of who is who when comparing spoken voices to sung voices, or when making judgements across voice disguises produced by lay speakers (e.g. trying to sound hoarse or very nasal). Notably, however, in all of these studies, listeners' performance stayed well above chance level.
Introducing natural within-person variability
The voices we encounter in our everyday lives will generally produce more than just two different kinds of vocalisations. To get a better idea of how people deal with more natural variability in voices, we recently adapted a task that was popularised in the face perception literature. Here, participants were given 30 brief recordings of voices – 15 recordings from one voice and 15 from another voice, each including a full meaningful utterance. These were sound clips from a popular TV show and, crucially, the clips were taken randomly from different scenes, across different speaking situations, conversation partners and environments. Our voice clips thus varied in ways that you might find in real-life conversations.
The task for our participants was simple: 'Please sort these 30 recordings into different identities. You can listen to the sounds as many times as you want'. To solve the task correctly, listeners would sort the 30 recordings into the two identities that were actually present. We tested a group of listeners who were familiar with the voices – people who had watched at least one season of the relevant TV show – and listeners who did not know the voices, that is, people who were completely unfamiliar with the actors and their work.
Listeners who were familiar with the TV show did rather well on this sorting task. There were some recordings that proved troublesome for some familiar listeners: those recordings would on occasion end up being ascribed to a third or fourth identity, but overall familiar listeners could solve this task with relative ease. The data for unfamiliar listeners, on the other hand, looked quite different. This group of participants ended up thinking that many more identities were present, most of the time nine identities. Unfamiliar listeners also told me after testing that the task was incredibly hard for them.
To better understand what unfamiliar listeners did and where they went wrong, we looked at how they grouped the recordings into what they perceived to be many different identities. An interesting pattern emerged. Unfamiliar listeners almost never put two recordings of different voices into the same perceived identity: that is, they succeeded in telling people apart. The many different perceived identities thus arose from unfamiliar listeners selectively failing to 'tell people together'. In the context of these naturally varying voices, the recordings sounded sufficiently different from one another that listeners failed to link these different recordings produced by the same person to one identity only. They instead split recordings from the same person into a larger number of perceived identities.
These were striking findings, mirroring what had been reported before in the face perception literature. The variability included in this experiment was by no means extreme. These were relatively subtle changes in everyday styles of speech: being sarcastic, telling a joke, simply stating a fact, asking someone a question, being exasperated, and so on.
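For readers who like to see the logic spelled out, the two kinds of error in a sorting task can be separated with a few lines of code. The sketch below is an illustration only, not the analysis used in the published study: the function name `score_sorting`, the labels and the example responses are all made up. It counts how many identities a listener perceived, and how many of those perceived identities mix recordings from more than one real speaker.

```python
from collections import defaultdict


def score_sorting(true_speakers, perceived_groups):
    """Score one listener's sorting response (hypothetical helper).

    true_speakers:    the real speaker label for each recording
    perceived_groups: the group the listener placed each recording in
    Returns (number of perceived identities, number of mixed groups).
    """
    speakers_in_group = defaultdict(set)
    for speaker, group in zip(true_speakers, perceived_groups):
        speakers_in_group[group].add(speaker)

    # Every distinct group counts as one perceived identity.
    n_perceived = len(speakers_in_group)
    # A group is "mixed" if it holds recordings from more than one real
    # speaker, i.e. a failure to tell people apart.
    n_mixed = sum(1 for s in speakers_in_group.values() if len(s) > 1)
    return n_perceived, n_mixed


# Made-up example: six recordings, three from each of two speakers.
true_speakers = ["A", "A", "A", "B", "B", "B"]

# A listener who splits each speaker into two groups but never mixes them.
splitting_listener = [1, 1, 2, 3, 3, 4]
print(score_sorting(true_speakers, splitting_listener))
```

In this made-up response the listener perceives four identities where only two exist, yet none of the four groups is mixed. Too many perceived identities combined with few or no mixed groups is the signature of failing to 'tell people together' while still succeeding at telling people apart.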
Of course, these findings seem to be somewhat at odds with how we experience unfamiliar voices on a daily basis: when talking to someone we do not yet know, we do not constantly think we are talking to a new person as soon as their voice changes in any way, as our study may suggest. This discrepancy can be explained by the presence of rich contextual cues: most of the time, we can access a wealth of additional information that points us towards whether we are talking to the same person or not. Usually, alongside the voice, we can also see the person we are talking to, the voice comes from roughly the same point in space, and we know what the content and topic of our current interaction is, thus creating a sense of coherence. None of this contextual information was available to our participants, leading to the failure in 'telling people together'.
What next?
Our study therefore shows us that if we do not know a person's voice, we cannot make sense of it when it starts to vary, without additional information. When we encounter variable voices, we seem to systematically perceive voices that sound different to be different identities. Only when we get to know a voice and become familiar with it do we learn how this voice varies, allowing us to accurately group together the different-sounding signals. We have also seen that there may be limits to how accurate we can be when working out identity-related information from voices alone, because voices are so flexible – in extreme cases, familiarity may not be enough.
When thinking about whether we know someone's voice, then, it is probably best to think about which versions of that person's voice we actually know. We may recognise people from what they sound like when they are talking to us. But we might be a bit thrown if a friend slips into a completely different accent when talking to their parents on the phone, when we first hear our line manager laugh helplessly, or when we see a colleague change their entire way of speaking when interacting with their child.
Being familiar with a voice seems to be key to making accurate identity judgements in the context of within-person variability. When we get to know a voice we are likely to gradually form some kind of mental representation of that voice. Having this representation may then help us cope with the within-person variability. How we do this, and what cognitive processes and mechanisms are involved, is not yet fully understood. I'm currently looking at how we form these mental representations of voices, what information is included in them and how we use them to process familiar (and unfamiliar) voice identities.
Dr Nadine Lavan is a Post-doctoral Research Associate at University College London
Key sources
Lavan, N., Burton, A.M., Scott, S.K. & McGettigan, C. (2018). Flexible voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review, 1-13.
Lavan, N., Scott, S.K. & McGettigan, C. (2016). Impaired generalization of speaker identity in the perception of familiar and unfamiliar voices. Journal of Experimental Psychology: General, 145(12), 1604-1611.
Lavan, N., Burston, L.F., Ladwa, P. et al. (2019). Breaking voice identity perception: Expressive voices are more confusable for listeners. Quarterly Journal of Experimental Psychology, 1747021819836890.
Lavan, N., Burston, L.F. & Garrido, L. (2018). How many voices did you hear? Natural variability disrupts identity perception from unfamiliar voices. British Journal of Psychology.
Jenkins, R., White, D., Van Montfort, X. & Burton, A.M. (2011). Variability in photos of the same face. Cognition, 121(3), 313-323.
Perrachione, T.K. (2019). Speaker recognition across languages. In S. Frühholz & P. Belin (Eds.) The Oxford Handbook of Voice Perception. Oxford: Oxford University Press.