Guest blogger: Emotions Affecting Voice Recognition in Criminal Cases by Emma Keith

Emotions Affecting Voice Recognition in Criminal Cases: One Small Phonetically-Grounded Investigation

Introduction

The author, Emma Keith

Recognising a speaker by their voice might seem like second nature. If you know a person well at all, you probably have a pretty good idea of what they sound like, and you might even be able to imagine what words would sound like in their voice. You probably think you’d do a pretty good job of picking them out in a recording. But in reality, the identification of voices in recordings is hugely prone to error, and, needless to say, relies on subjective human judgment. Humans can quite easily be confident in mistaken judgments about the identity of a speaker. In courtroom scenarios, when the speaker behind a voice in a recording is disputed, it can help to be able to back up judgments about a speaker’s identity with additional evidence from acoustic measurements. Enter the field of Forensic Voice Comparison (FVC). FVC uses a wide array of techniques combining insights from acoustic phonetics, audio processing, and statistics to make probabilistic statements about the likelihood of a voice in two or more recordings being from the same speaker. These probabilistic statements, while necessarily contingent on the established facts of a case, can nonetheless be a crucial addition to legal proceedings, especially in cases involving low-quality audio. FVC is a young field, and it’s certainly not a panacea for the subjectivity inherent in trying real-life cases, but it has proven a useful tool and represents a fascinating practical application of phonetic knowledge.

Summer Scholars

Last year, I spent nine weeks over the uni break in the 2022-2023 Summer Scholars program, learning all about the world of FVC and working on a short experimental study that I have just submitted to a journal along with Dr Yuko Kinoshita, Honorary Senior Lecturer at ANU’s School of Culture, History, and Language and one of my supervisors from the program. During the program, where I was supervised by Dr Kinoshita and Associate Professor Shunichi Ishihara, also from the School of Culture, History, and Language, I was able to learn about a field of linguistic research that was completely new to me, build my skills in that field, and work towards a unique research project. I would strongly recommend that other university students with an interest in linguistics and related disciplines apply for the 2024-2025 program when admissions open – it’s a great opportunity to build connections and gain research experience while working towards a longer-term project than what you’re typically offered in undergrad.

Parametric Cepstral Distance (PCD)

The project that I worked on at Summer Scholars utilised a technique for FVC known as Sub-band Parametric Cepstral Distance (PCD) measurement, an approach pioneered in the 90s by ANU academic Dr Frantz Clermont (Clermont and Mokhtari 1994), who remains a Visiting Professional Fellow at the university. During the program, I received invaluable advice, feedback, and MATLAB code implementing the measurement itself from Dr Clermont, who was always keen to see how this new research was applying his technique to new material. The “cepstrum” (a reversal of “spectrum”) is a set of measurements mathematically derived from the frequency spectrum of a short clip of sound. These measurements are automatically extractable by computer, so they don’t require a human going through and checking them, as formants (another tool used in FVC) do. What sub-band PCD does is allow us to measure the difference between the cepstra of two sounds, but only within a certain frequency range. This is where the phonetic part comes in: particular speech sounds are associated with particular frequency ranges. Taking cepstral measurements of individual speech sounds from particular speakers, therefore, is one way to demonstrate in FVC that two recordings are likely to have come from the same speaker (see Rose 2022 for a practical demonstration of this). For the Summer Scholars project, we decided to look at /s/ sounds in English.
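To give a concrete flavour of the idea, here is a small illustrative sketch in Python with NumPy. This is my own simplified illustration, not Dr Clermont’s MATLAB implementation: it computes a basic real cepstrum (the inverse Fourier transform of the log magnitude spectrum) and uses a plain log-spectral distance restricted to a frequency band as a stand-in for the full parametric sub-band formulation. All function names and parameters here are hypothetical choices for the sketch.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum: inverse FFT of the log magnitude spectrum of a frame."""
    spectrum = np.abs(np.fft.rfft(frame, n_fft))
    log_spec = np.log(spectrum + 1e-12)  # small offset avoids log(0)
    return np.fft.irfft(log_spec)

def subband_log_spectral_distance(frame_a, frame_b, sr, band, n_fft=512):
    """RMS log-spectral distance between two frames, restricted to a
    frequency band (lo_hz, hi_hz). A simplified stand-in for sub-band
    parametric cepstral distance: by Parseval's relation, full-band
    cepstral distance corresponds to log-spectral distance."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    log_a = np.log(np.abs(np.fft.rfft(frame_a, n_fft)) + 1e-12)
    log_b = np.log(np.abs(np.fft.rfft(frame_b, n_fft)) + 1e-12)
    return np.sqrt(np.mean((log_a[mask] - log_b[mask]) ** 2))

# Toy usage: two noise frames standing in for two /s/ tokens,
# compared only in the 2-3 kHz band at a 16 kHz sample rate.
rng = np.random.default_rng(0)
tok_a = rng.standard_normal(400)
tok_b = rng.standard_normal(400)
dist = subband_log_spectral_distance(tok_a, tok_b, sr=16000, band=(2000, 3000))
```

Restricting the comparison to a band is the key move: it lets us ask not just “how different are these two sounds?” but “in which frequency regions are they different?”.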

Emotion in Forensic Voice Comparison (FVC)

One thing that can make FVC much more difficult is emotion. This is inevitable: people speak differently when stressed, happy, or anxious. They might emphasise a syllable in a way that they wouldn’t otherwise, or they might speak with a breathy voice quality when they normally wouldn’t. In the real world, speech is dynamic, much to the chagrin of anyone trying their hand at FVC. This is a major hurdle, since, as you might expect, many criminal cases deal with recordings of speech produced with a great deal of emotion. The question we were trying to grapple with was: are there particular speech sounds that could be good at identifying speakers in a way that takes into account how speakers vary according to their emotional state?

My Summer Scholars Experiment

For the experiment, we decided to use audio from a pre-existing database of podcast recordings (Lotfian and Busso 2019). This was a practical choice because compiling a large-scale database of criminal case recordings would be too time-consuming for such a short project, and would also require ethics approval. The database was already annotated for emotion by three human annotators. While there is no objective way to grade emotion, this at least created a distance between us as the experimenters and the data we were working with. We found the speakers with the highest number of recordings judged “very emotional” and “very non-emotional”, since the more data per speaker, the more reliable the results. We then isolated instances of /s/ from these very emotional and very non-emotional recordings and separated them into two groups. We ended up with 5 speakers in the final calculations, with between 14 and 136 tokens each, for a total of 340 tokens. We used a mathematical tool known as f-ratio (see Khodai-Joopari et al. 2004) to measure how much of the variation was happening between the different speakers and how much between the emotional groups. The beauty of sub-band PCD was that it also enabled us to determine in which frequency ranges each type of variation was most readily observable. Usefully, these frequency ranges were different for the different types of variation. Variation from emotion was mostly in the very high frequency range, whereas variation by speaker was around 2-3kHz. This is, of course, an oversimplification – emotional variation is multilayered and multivariate, and speech is dynamic. This short experiment was only intended to suggest a potential area of further investigation in later research. However, the potential indication of this finding for later research is that /s/ could be a useful segment for FVC, paying particular attention to the 2-3kHz frequency range.
Interestingly, this is outside of the frequency range where the spectral energy is typically the highest for /s/, suggesting that useful speaker-specific information may not always coincide with the frequency ranges to which we pay most attention in other areas of acoustic phonetics.
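The f-ratio itself is conceptually simple: it compares how much a measured feature varies *between* groups with how much it varies *within* them. Below is my own toy NumPy sketch of this classic variance ratio (the exact formulation in Khodai-Joopari et al. 2004 operates on cepstral sub-bands; the numbers here are invented purely for illustration).

```python
import numpy as np

def f_ratio(values, labels):
    """Between-group variation divided by within-group variation for a
    single measured feature. Higher values mean the grouping (e.g.
    speaker identity) explains more of the feature's variation."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    groups = [values[labels == lab] for lab in np.unique(labels)]
    # Sum of squared deviations of group means from the grand mean,
    # weighted by group size.
    between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Sum of squared deviations of each token from its own group mean.
    within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return between / within

# Invented feature values that separate two hypothetical speakers well...
feature = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
by_speaker = ["spk1"] * 3 + ["spk2"] * 3
# ...but are spread evenly across the two emotion groups.
by_emotion = ["emotional", "neutral", "emotional",
              "neutral", "emotional", "neutral"]

speaker_f = f_ratio(feature, by_speaker)  # large: speakers well separated
emotion_f = f_ratio(feature, by_emotion)  # small: emotion groups overlap
```

Computing an f-ratio like this per frequency sub-band is what let us ask where in the spectrum speaker variation dominates and where emotional variation does.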

Paper Submission

After presenting the preliminary findings from the project at Summer Scholars, I continued working on a writeup of the results with Dr Kinoshita. This process involved almost a year of refining, editing, and sending different versions of the paper back and forth. This allowed the initial tentative findings of the project to be strengthened, and gave me great experience of the journal submission and paper editing process. I would really recommend this aspect of Summer Scholars to undergrads looking to do postgrad later on – it has been a great way to get exposure to how the editing and submission processes work before starting my PhD, where I’m expected to write a lot more papers.

Implications

The findings of this project, while small and limited in scope, hint towards the promise of the /s/ sound as an indicator of speaker identity in FVC. They also point towards a number of questions relating to the effect of emotion on the realisation of /s/ – how do different emotions affect its length, the character of the spectral noise, and so on? Furthermore, the more existential question remains as to whether we can posit acoustic correlates of emotion at all, since emotion is, after all, a subjective measurement. Such questions aside, however, FVC experiments such as this provide a useful ground truth on which to establish fundamentals of how inter- and intra-speaker variation occurs phonetically, whether or not we can map particular features onto individual emotional parameters. The project also demonstrates the usability of sub-band PCD as a technique, and that phonetic insights about frequency range and types of speech sound have practical benefits for FVC, even when aided by computer. Research findings from projects such as this might not be applicable in the courts as they stand, but tendencies hinted at in this and other research projects homing in on particular speech sounds can get us closer to a model of how speech varies acoustically. Such models may not be fully objective, but they are more grounded in acoustics than listener opinion.

About the Author

I am a first year PhD candidate in Linguistics at the Australian National University. While my PhD focusses on language documentation, I also have research interests in audio processing, phonology, and forensic linguistics.

References

  • Clermont, F. and Mokhtari, P. (1994) Frequency-band specification in cepstral distance computation. Proceedings of the Vth Australian International Conference on Speech Science and Technology. 1: 354-359. https://www.researchgate.net/profile/Frantz-Clermont/publication/271909361_Frequency-band_specification_in_cepstral_distance_computation/links/5dbd2af2299bf1a47b0a7175/Frequency-band-specification-in-cepstral-distance-computation.pdf
  • Khodai-Joopari, M., Clermont, F., and Barlow, M. (2004) Speaker variability on a continuum of spectral sub-bands from 297 speakers’ non-contemporaneous cepstra of Japanese vowels. Proceedings of the 10th Australian International Conference on Speech Science & Technology. 504-511. https://www.frantz-clermont-acoustic-phonetics.net/pubs/SpeechSpeakerVariability/2004.Khodai-Joopari+Clermont+Barlow_SST.pdf
  • Lotfian, R. and Busso, C. (2019) Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective Computing. 10.4: 471-483. https://ieeexplore.ieee.org/document/8003425
  • Rose, P. (2022) Likelihood Ratio-based Forensic Semi-Automatic Speaker Identification with Alveolar Fricative Spectra in a Real-World Case. Proceedings of the 18th Australasian International Conference on Speech Science and Technology. 6-10.