Guest Blogger: ‘Automatic Speech Recognition Models, Neologisms, and Colloquial Language’ by Will Somers

About the author – Will Somers

I am a third-year Bachelor of Arts student majoring in linguistics and criminology at the University of Melbourne. I have a particular interest in transcription and the potential for artificial intelligence models to be used to advance the efficiency and accuracy of the transcription process.

Introduction

With the current buzz surrounding automatic speech recognition (ASR) and the role these new technologies will play in society, questions are being raised regarding what they can offer to improve the use of language as forensic evidence within the legal system (as observed by Loakes, 2022). Whilst traditional methods of orthographic transcription used for covert recordings in court have been identified as time-consuming, expensive and often unreliable (Harrington et al., 2022), ASR models may offer a rapid and inexpensive solution to these issues. But can these models be considered reliable in the transcription of indistinct audio? Recent research has shown that outputs of ASR systems exposed to indistinct audio are replete with problems (Loakes, 2022; Harrington et al., 2022), but technology is rapidly advancing in this area, so what about the newer generation of systems?

This blog post will first give a brief overview of a leading ASR model, ‘Whisper’, created by OpenAI, the company also responsible for ChatGPT; it will also cover the importance of its training data. Following this, I will discuss two “mini-experiments” I have recently conducted into the reliability of the Whisper ASR model in the transcription of spoken language under different conditions. The first was conducted at the Australian National University during the 2022-23 Summer Scholars program. This was a fantastic 9-week program that gave me an insight into the world of linguistics research, under the helpful supervision of Yuko Kinoshita and Shunichi Ishihara. The program is a great way to start building contacts in academia and has just opened applications for its upcoming 2023-24 round across a wide range of disciplines; I would recommend that any university students with an interest in further study apply. The second experiment was conducted with Helen Fraser and the Hub for Language in Forensic Evidence as a follow-on to the Summer Scholars experiment.

The Whisper Model

An ASR model functions to convert audio files, or direct audio input from a microphone, into text, in the script of the spoken language (Malik et al., 2021). Ideally, an ASR model should be able to “perceive” given input, “recognise” the spoken words and then use those words to execute an “action” (Malik et al., 2021). If you are interested in how current ASR models are faring in the transcription of forensic audio, be sure to check out the video presentations by Lauren Harrington and the Hub’s very own research fellow Dr Debbie Loakes that were recently posted on this blog. In addition to this, Lauren has just published a fantastic new paper investigating the potential practical applications for ASRs in the legal system, specifically examining their use in the transcription of police interviews.

The Whisper ASR model is a neural network trained on 680,000 hours of multilingual speech samples. With such a large set of training data, the creators of this model state that it can produce robust and accurate transcripts even when dealing with foreign accents, background noise, and technical language (Radford et al., 2022). Whisper is claimed to have the ability to work with a broad range of speech samples, as opposed to other similar speech models, which are often trained on a single set of speech data and can therefore only work with a very small range of audio recordings (Radford et al., 2022).

The Role of Training Data

What sets the Whisper ASR model apart from other models is the sheer amount of speech and audiovisual data it has been trained on. Reports demonstrate that the accuracy an ASR model can achieve in transcribing a particular language is directly correlated with the amount of training data the model has been exposed to in that language (Radford et al., 2022). Much of the time and effort required to build an ASR goes into humans manually labelling and captioning the audio samples which form the training data of these models; this is called “supervision” (Baevski et al., 2021). Whisper is what is called a “weakly supervised” model. This means that Whisper’s training approach strikes a balance between “unsupervised” models, which learn from unlabelled data but do not produce easily usable outputs, and “supervised” models, which are trained on data labelled by humans but are limited in size due to the cost of labelling data by hand (Radford et al., 2022). Ultimately, this means that Whisper can draw on a much broader training dataset to inform the transcriptions it creates from a diverse range of audio samples.

My Summer Scholars Experiment

In the Summer Scholars project, we investigated the use of the Whisper ASR model in the transcription of covert recordings. The audio samples used in this experiment were from the Morrison forensic voice comparison database (Morrison et al., 2012). This database contains high-quality non-contemporaneous recordings of multiple speakers using different speech styles; these recordings have been manually processed to remove extraneous noises and crosstalk. However, the audio samples we used were digitally altered to become less clear or ‘indistinct’, with background noises or human voices overlaid onto the original audio sample. The same recorded conversation was overlaid with sounds such as ocean waves, a Melbourne streetscape, and dialogue from a TV show, creating a set of audio samples with varying degrees of indistinctness. The ‘indistinct’ samples and a clear, unaltered copy of the original Morrison recording were then processed through the ‘Large’ model of Whisper and compared with a reference transcript created by a human transcriber. Whilst the results of this experiment demonstrated some ability for the Whisper model to transcribe sections of indistinct conversation with moderate accuracy, they overwhelmingly indicated that there is still a long way to go before we can rely on ASR models for the transcription of indistinct audio.
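For readers curious about the mechanics of this step, the sketch below shows roughly how an audio sample can be passed through Whisper’s ‘Large’ model from Python using the open-source openai-whisper package. The file name is a hypothetical placeholder, and this is an illustrative sketch of the general procedure rather than the exact script used in the experiment.

```python
# Minimal sketch of running one audio sample through Whisper's "Large" model
# with the open-source openai-whisper package (pip install openai-whisper).
# The file name below is a hypothetical placeholder.
import whisper

model = whisper.load_model("large")                      # downloads the checkpoint on first use
result = model.transcribe("morrison_sample_waves.wav",   # hypothetical indistinct sample
                          language="en")

print(result["text"])                                    # full transcript as one string
for seg in result["segments"]:                           # timestamped segments, useful when
    print(f"[{seg['start']:6.1f}s] {seg['text']}")       # comparing against a reference transcript
```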

Most prominently from this experiment, I observed that, even for the clear, unaltered recording, the model had particular difficulty in dealing with colloquial phrases and diminutives specific to the Australian context (see also Loakes, 2022, who showed ASR systems had difficulty with unpredictable content even in her clear recording).

This brought to light several questions regarding the ability of the Whisper model to transcribe neologisms and colloquial language. Whilst it is known that these models are greatly influenced by biases in their training data (Loakes, 2022), these results made me question the ability of ASRs to transcribe phrases that are similar, but not identical, to common colloquialisms. For example, it is predictable that the Whisper model would not be able to accurately transcribe neologisms such as “Pre-88er”, a phrase unique to the domain of NSW police which would likely not have appeared at all in Whisper’s training data. However, one might expect the model to correctly transcribe a phrase such as “left a fair foul taste in my mouth”, as each individual lexeme would likely have appeared in the model’s training data, even though the phrase is not identical to the more common “left a bitter taste in my mouth”. Despite this, both phrases were consistently mistranscribed by the Whisper model during the Summer Scholars experiment, across all degrees of indistinctness and even in the clear, unaltered recording. In other words, these colloquial phrases could not be correctly transcribed by the ASR model either in recordings with background noise added or in clear recordings with no background noise.

My Experiment with the Hub

As a result of these findings, I was interested to see whether the errors in the transcription of this colloquial speech were specific to the way the phrases were produced or recorded in the clear sample used for the Summer Scholars experiment, or whether the Whisper model was simply incapable of transcribing these phrases altogether.

To investigate this, I recorded two unique semi-structured conversations between myself and another speaker in a restaurant setting. We recorded each conversation as a separate audio file, with a smartphone held between us acting as a microphone. While the recordings were not manually altered to remove any extraneous noises or crosstalk, our speech in these close-talking recordings was clear, and the ambient noise of the restaurant had very little effect on the quality of our recorded speech. The two conversations were spontaneous and revolved around topics including job applications and the Australian Federal Police, which were also covered in the original Summer Scholars recording. In each conversation, I deliberately uttered four phrases from the Morrison recordings that had been consistently mistranscribed by the Whisper model in my previous experiment.

By doing this, I intended to determine whether the ASR could accurately transcribe any of these colloquial phrases in a different but still clear recording.

These phrases were:

  • “Pre-88er”
  • “Left a fair foul taste in my mouth”
  • “I’m the glass half full man”
  • “What’s your rego number”

These phrases either were neologisms such as “Pre-88er”, used diminutives like “rego”, or showed a similarity to common colloquialisms in Australian speech. In each instance, these phrases were examples of idiosyncratic language which were likely not contained within Whisper’s training data.

Once recorded, the two audio samples produced in this experiment were run through the “Large” model of Whisper, and the two resulting transcripts were compared with a reference transcript I created. As this second experiment used an updated version of the Whisper model compared to the one I used for the Summer Scholars experiment, we also ran the samples from the Summer Scholars experiment through this updated model to see if the results would be any different. The updated model was again unable to correctly transcribe any of the above colloquial phrases from the Summer Scholars samples, even from the clear, unaltered sample.
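To give a concrete sense of how this phrase-level checking can be done, the short sketch below tests whether each of the four target phrases appears in a transcript after simple text normalisation. The transcript string, helper functions and matching rule are invented for illustration; this is a hypothetical sketch, not the analysis script actually used in either experiment.

```python
# Illustrative check of whether the four target phrases appear in a Whisper
# transcript after simple text normalisation. The example output string is
# invented; this is a hypothetical sketch, not the analysis actually used.
import re

TARGET_PHRASES = [
    "pre-88er",
    "left a fair foul taste in my mouth",
    "i'm the glass half full man",
    "what's your rego number",
]

def normalise(text: str) -> str:
    """Lowercase and strip punctuation, keeping apostrophes, hyphens and digits."""
    text = re.sub(r"[^a-z0-9'\- ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def check_phrases(transcript: str) -> dict:
    """Report which target phrases occur verbatim in the normalised transcript."""
    cleaned = normalise(transcript)
    return {phrase: normalise(phrase) in cleaned for phrase in TARGET_PHRASES}

# Invented example of a Whisper output, for illustration only.
whisper_output = "It left a fair foul taste in my mouth, the whole prairie odour thing."
for phrase, found in check_phrases(whisper_output).items():
    print(f"{'found ' if found else 'missed'}  {phrase}")
```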

In comparison to this, all phrases except the neologism “Pre-88er” were transcribed correctly by the Whisper model in at least one of the transcriptions produced from the recordings in this second experiment. This suggests that the errors seen in the transcription of these phrases in the Summer Scholars experiment resulted from the way they were produced or recorded by the speakers in those particular recordings, rather than from an outright inability of the model to transcribe them; the model may therefore be able to transcribe these phrases correctly in other audio samples.

Interestingly, the term “Pre-88er” was not transcribed accurately in the Summer Scholars experiment, or in either of the transcriptions produced in the second experiment. Examples of the mistranscription of this phrase included:

  • “prairie odour”
  • “pre-A data”.

These inaccurate transcriptions may be indicative of an inability of the Whisper model to accurately transcribe unique neologisms which have not appeared in training datasets. That being said, the reference transcripts in this second experiment were created by the recorded speakers, and no other humans attempted to transcribe these audio samples. Because of this, there is no human data to compare the ASR data with, and it is not known whether the average human transcriber would be able to correctly transcribe this neologism. This makes it very difficult to judge the proficiency of these ASR systems compared to human transcribers, and future experiments should employ human transcribers to ensure this comparison can be made.

Implications and Further Directions

Whilst the results of this mini-experiment demonstrate that the Whisper model does in fact show a certain level of proficiency in transcribing colloquial phrases, it must be stressed that the audio transcribed in this experiment was very clear compared with the quality of audio usually investigated in covert recordings. These results in no way indicate that ASRs in their current state should be relied on for the transcription of covert and forensic audio.

Further to this, these results indicate that the Whisper ASR model demonstrates a particular difficulty in transcribing unfamiliar names or neologisms. This presents a challenge for ASR systems if they were to be used to transcribe covert recordings of speakers with personal relationships, who may use a high frequency of nicknames, in-group terms or unique slang. For court transcription, transcribers are typically provided with these sorts of terms alongside contextual information to assist in the transcription of unpredictable content (Fraser, 2022). Without purposefully embedding these nicknames and in-group terms into the training data of systems like Whisper, these models cannot gain the same contextual understanding and therefore cannot predict new lexical items that are not contained within their training data. Because of this, the Whisper ASR model remains unsuited to the transcription of covert recordings in its current form.

This research and mini-experiment form only a very small part of all the research that needs to be conducted regarding the role ASRs can play in the future of transcription and the use of language as forensic evidence in the courtroom. In future experiments investigating ASRs and colloquial language, it would be interesting to compare the proficiency of human transcribers with no contextual information against that of ASR systems in transcribing in-group slang and neologisms. The potential for these artificial intelligence models to assist with a multitude of tasks within linguistics in the future cannot be overstated, but the development of these models must continue to be investigated and scrutinised by researchers to ensure they become accurate, efficient and reliable.

References

Bardsley, D., & Simpson, J. (2009). Hypocoristics in New Zealand and Australian English. Comparative studies in Australian and New Zealand English: Grammar and beyond, 49-69.

Baevski, A., Hsu, W. N., Conneau, A., & Auli, M. (2021). Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34, 27826-27839.

Fraser, H. (2022). A Framework for Deciding How to Create and Evaluate Transcripts for Forensic and Other Purposes. Frontiers in Communication, 7, 1-14.

Harrington, L., Love, R., & Wright, D. (2022). Analysing the performance of automated transcription tools for covert audio recordings. [Poster] Conference of the International Association for Forensic Phonetics and Acoustics, Prague, July 10-13.

Loakes, D. (2022). Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?. Frontiers in Communication, 7, 1-13.

Malik, M., Malik, M. K., Mehmood, K., and Makhdoom, I. (2021). Automatic speech recognition: a survey. Multimed. Tools. Appl. 80, 9411–9457.

Morrison, G. S., Rose, P., & Zhang, C. (2012). Protocol for the collection of databases of recordings for forensic-voice-comparison research and practice. Australian Journal of Forensic Sciences, 44, 155–167. doi:10.1080/00450618.2011.630412

Morrison, G. S., Enzinger, E., Hughes, V., Jessen, M., Meuwly, D., Neumann, C., … & Anonymous, B. (2021). Consensus on validation of forensic voice comparison. Science & Justice, 61(3), 299-309.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://doi.org/10.48550/arXiv.2212.04356