Guest blogger: ‘Humans vs machines’ by Will Somers
This post is a small, informal follow-up to my previous blog post in August, where I examined the ability of the Whisper automatic speech recognition (ASR) system to transcribe neologisms and colloquial language.
The previous experiments showed that this artificial intelligence system was rather ineffective at transcribing neologisms and colloquial language. In particular, Whisper had distinct difficulty accurately transcribing the neologism “Pre-88er”, a slang term for an NSW police officer who joined the force before 1988.
Whilst this pointed to an area for future improvement in ASR systems, the results also made me wonder whether humans would be able to correctly identify these unknown words in the same recordings. I therefore ran a follow-up experiment to test this question, using the same neologisms and colloquial language that the ASR system struggled with so considerably in the previous experiments.
Credit for featured image:
human head-to-head with robot by Valeriy Kachaev
To conduct this mini-experiment, I contacted a number of my “non-linguist” friends online and asked them if they would be interested in attempting to identify an unknown word in a recording.
Twelve of my friends, each aged between 21 and 25, ended up participating. I selected a 10-second audio sample containing the target neologism “Pre-88er” from the recording I used in my previous experiment. The clip was incorporated into a survey prepared with Google Forms and sent to these friends via Facebook Messenger. The participants were first presented with the audio clip on the form, followed by the question:
“In the above recording, it is thought that the speaker says: I was talking to someone in the um, that’s been working in the AFP for ages, they were a ******, um, and, yeah, they, they love it, they think it’s a great place to work. What do you think is said in the “*****” part of the transcript?”
The participants were required to answer this question in a text box below the audio sample. Following this first question, the participants were also asked to describe how difficult they found the task to be. On the next page of the survey, they were presented with the proposed correct transcription of the unknown word and were asked if they agreed that was what was said.
Of the 12 participants, only two identified the unknown word as some variation of “Pre-88er” in their preliminary transcription. Of these two, one noted that identifying the word was “not too difficult”, and the other noted it was “very difficult”. The results from the experiment are presented in the Appendix below.
The ten participants who did not transcribe the unknown word as some variation of “Pre-88er” described the difficulty of the task in terms ranging from “pretty difficult” to “super difficult”. However, when presented with the proposed transcription of the sample sentence on the following page, all but one participant agreed that this was what was said in the audio recording. Seven of these ten participants guessed that the unknown word was a Standard English word, and the remaining three guessed a non-Standard word that was not “Pre-88er”.
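For readers who like to see the proportions, the counts reported above can be tallied in a few lines. This is only a minimal sketch: the numbers are the ones given in this post, while the variable names and the `share` helper are made up for illustration.

```python
# Counts as reported in the post; names are illustrative, not from the survey.
total_participants = 12
correct = 2                                  # wrote some variation of "Pre-88er"
incorrect = total_participants - correct     # everyone else
standard_english_guesses = 7                 # incorrect, but a Standard English word
non_standard_guesses = 3                     # incorrect, and not a Standard English word


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


correct_pct = share(correct, total_participants)
standard_pct = share(standard_english_guesses, incorrect)

print(f"Correct on first attempt: {correct}/{total_participants} ({correct_pct}%)")
print(f"Standard English guesses among the rest: "
      f"{standard_english_guesses}/{incorrect} ({standard_pct}%)")
```

With only twelve participants, these percentages describe this small group and should not be read as estimates for listeners in general.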
The fact that the majority of the participants identified the neologism as a Standard English word suggests a tendency for listeners to assume an unknown word they are hearing is one they have heard before based on the linguistic features they can identify.
For example, ten out of twelve human participants identified the initial consonant cluster /pr/ in “Pre-88er”, and gave answers which were Standard English words containing this cluster such as response 6 “predictor” and responses 8 & 9 “predator”.
Additionally, ten out of twelve participants identified the -er/-or/-ar suffix at the end of “Pre-88er”, a suffix often used for nouns representing roles people take on. This suffix was present in Standard English answers such as response 3 “predicator” and responses 10 & 11 “creator”. The inclusion of this suffix indicates that these human transcribers were aware that a noun should occupy the syntactic position of the unknown word in the sentence, and gave an answer that not only sounded similar to the unknown word but also made sense in the context of the sentence.
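The two features discussed above can be checked roughly in code. The sketch below is an assumption-heavy approximation: it uses spelling as a stand-in for sound (a word-initial "pr" and a final "er", "or" or "ar"), whereas the counts in this post were judged from what participants wrote in response to audio. The function names are hypothetical, and the word list contains only the Standard English responses quoted in the text.

```python
import re

# Standard English responses quoted in the post (not the full results table).
responses = ["predictor", "predator", "predicator", "creator"]


def has_initial_pr(word):
    """Spelling-based proxy for the initial consonant cluster /pr/."""
    return word.lower().startswith("pr")


def has_agentive_suffix(word):
    """Spelling-based proxy for the -er/-or/-ar suffix."""
    return re.search(r"(er|or|ar)$", word.lower()) is not None


# Map each response to (starts with "pr", ends in -er/-or/-ar).
features = {w: (has_initial_pr(w), has_agentive_suffix(w)) for w in responses}

for word, (initial_pr, suffix) in features.items():
    print(f"{word}: initial 'pr' = {initial_pr}, -er/-or/-ar suffix = {suffix}")
```

A spelling check like this would miss answers that capture the sound with a different spelling, so it is best treated as a quick first pass rather than a substitute for listening to the responses in context.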
A similarity can be drawn here to how ASR systems operate in transcription, as the Whisper ASR system tends to create its transcriptions using only words found in its “training data”, which it has been exposed to before (Radford et al., 2022).
For example, in my previous experiments investigating this model, its transcriptions of “Pre-88er” across a variety of samples included phrases such as “pre-A data”, “pre-aider” and “prayer”, all of which share linguistic features with “Pre-88er” but are drawn from its training data.
This presents an interesting dilemma for the transcription of audio files containing neologisms or colloquial language, because these results indicate that both humans and ASR systems tend to replace unfamiliar words with familiar words that share similar features.
Comparing the accuracy of the human transcribers in this experiment with that of the Whisper ASR system in the last experiment suggests that we should not be so critical of these systems, because we humans also struggle considerably to transcribe unknown words accurately. Whilst ASR technology is rapidly advancing and its uses are becoming more practical with each “update” and “bug-fix”, it is important that humans closely monitor the implementation of these technologies to ensure that transcript accuracy is maintained and that human discretion is used in the interpretation of colloquial language and neologisms.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://doi.org/10.48550/arXiv.2212.04356
Appendix: Table of Results
|Comments on Difficulty
|Agree with proposed transcription?
|If not, why?
|Not too difficult
|Very difficult. Didn’t really sound like a word- more like a string of sounds I couldn’t make sense of.
|Similar but the ‘p’ sounds are more like ‘d’ sounds.
|So difficult I couldn’t even discern an English word from it.
|Pretty difficult to transcribe
*These four responses were submitted together on one form.