Research report – Frontiers in Communication (Loakes 2022)
As of today, the Hub has a new open access publication – this is an article published in our special issue of Frontiers in Communication, called Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes?
Motivation for this study
This study was motivated by a question we are often asked in the Hub: have you tried solving the problem with computer transcription?
While to linguists this may intuitively seem like a bad idea, it is a fair enough question because automatic methods have certainly helped many of us in our everyday lives. Think, for example, about voice-activated software such as Alexa and Siri, which works very well these days (assuming you have a voice in the right “accent” for the task; see for example this article and this one, which talk about what happens when someone’s accent is not “mainstream”).
To phoneticians who have worked with forensic evidence, it seems obvious that there will be problems with this kind of approach, but the research in the Frontiers paper looks at what these problems are specifically. What actually happens when we try to use such systems to help us with this task? There is a general gap in the prior literature on this question, so it is an important one to start exploring. As Lindh (2017: 36) notes in his PhD thesis: “If only limited work has been done on the combination of auditory and automatic methods in comparing voices and speakers, even less work has been done on combining automatic speech recognition (ASR) and forensic phonetic transcription”.
The Frontiers paper discusses the fact that automatic transcription systems have already proved extremely useful in phonetics, sociophonetics and speech science more generally (e.g. Gonzalez et al. 2018, Mackenzie & Turton 2020, Villarreal et al. 2020), and for language documentation purposes, where there have been exceptionally interesting studies (e.g. Jones et al. 2019, Bird 2021, to name two). I mentioned in a previous blog post that one research paper shows a 30-fold increase in efficiency when automatic rather than manual methods of phonetic segmentation are used. Automatic methods also work very well for subtitling and captioning, although there are exceptions – in particular this interesting mismatch between system predictions and what the speaker actually said, which has surprisingly “exploded” in popularity.
n.b. If you are interested in the phonetic reasons behind “Canberrans” becoming “Ken Behrens”, you can read work about vowel raising before nasals in Australian English (Cox & Palethorpe 2014). I have a suspicion this is a regional feature of AusEng – I do not hear it so much in Melbourne – but this is awaiting verification!
The fact that an auto-caption can come up with Ken Behrens instead of Canberrans illustrates the point that when using automatic software to help with a task “…human confirmation is needed to correct errors …” (Mackenzie & Turton 2020: 1). But why do such errors occur? As noted by Villarreal et al. (2020: 1), automatic systems “[rest] on the assumption that there is some learnable, predictable pattern in the input that can be used to predict new cases”. While errors may not be problematic in research situations, as explored in this previous blog post (and may even give us a laugh in the case of Ken Behrens), the situation is far more serious in forensic contexts.
The main issues with automatic transcription of indistinct audio are that we do not know what the speakers are saying and we do not know who the speakers are. There is no definitive transcript to check an automatic version against. This is effectively the opposite of research situations, where we can verify these things.
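To make this contrast concrete: in research settings, the standard check is to score an automatic transcript against a trusted reference, usually as word error rate (WER). Below is a minimal sketch of that check in Python; the example strings are invented (echoing the Ken Behrens mishearing above), and in a forensic case there is simply no reference transcript to put in the first argument.

```python
# Minimal sketch: word error rate (WER), the standard accuracy check when a
# reliable reference transcript exists - exactly what is missing in casework.
# Pure-Python word-level edit distance; the example strings are invented.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the canberrans are here", "the ken behrens are here"))  # 0.5
```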
The systems we tried out are the Munich Automatic Segmentation System (MAUS) and Descript. To test how indistinct (forensic-like) audio fares when automatic systems are tasked with transcription, we used two types of audio with both systems:
Poor quality recording – The audio is forensic-like: it involves overlapping speech, changing volume (as the speakers move towards and away from the mobile phone through which the audio was recorded), background noise, and noises overlapping with the speech stream(s). There are three Australian English male speakers and one female speaker. We have been able to make a reliable transcription of the audio – this was not too problematic because, advantageously for our research purposes, we know the recording conditions (number of speakers, positioning, and we have video), and we also know the speakers, so we can verify some of the names and nouns used. n.b. an ongoing pilot study using this data showed that even experienced human transcribers do not get all the content correct; they get some correct, and some sections are harder than others.
Good quality recording – This was also recorded on a phone, but close up, and there was little background noise. The speaker was mindful of being understood and knew the recording would be used to test automatic transcription. She is a female speaker of Irish English, recorded in Australia. n.b. we have permission to use these recordings, and the forensic-like recording is not from a crime.
Results
So, what did we find when we ran the audio through the systems?
When we fed MAUS the poor quality recording, no speech at all was recognised – although the system did very well at finding silence intervals. The good quality recording, on the other hand, fared well: MAUS worked out what our speaker was saying, even though she speaks Irish English and there is no Irish English model. So we then tried using a transcript we had prepared, to see how MAUS would fare at recognising where particular words were. Obviously, this is beside the point of the research question, because we fed information to the machine when the aim was to see how the machine would go at feeding us the information (i.e. we wanted to see how automatic systems would go at solving the problem themselves). The important thing to note is that this was a reliable transcript – we know the subject matter and we know the speakers, so we can verify what was talked about (such as names etc.). When the transcript was used, MAUS was able to correctly segment some of the words, although there were errors too. The background noise and overlapping speech made the task very difficult for the machine: for example, drumming noise and laughter were recognised as speech. We liken this to what happens when software designed to recognise faces “believes” that clouds and trees are people. As a side note, have a look here to see what happens when face recognition algorithms go wrong and the results are turned into art.
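For readers who want to experiment, here is a minimal sketch of how a recording and a prepared transcript can be submitted to MAUS programmatically, via the public BAS CLARIN web services. The endpoint and parameter names follow the documented runMAUSBasic service, but the file names are hypothetical placeholders, and the current service documentation should be checked before relying on this.

```python
# Minimal sketch: forced alignment with WebMAUS via the BAS CLARIN services.
# Endpoint and parameter names per the documented runMAUSBasic service;
# the file names below are hypothetical placeholders.
import requests

MAUS_URL = ("https://clarin.phonetik.uni-muenchen.de/"
            "BASWebServices/services/runMAUSBasic")

def align(wav_path: str, transcript_path: str, language: str = "eng-AU") -> str:
    """POST a recording plus its transcript; returns the service's XML
    response, which carries a download link to a TextGrid on success."""
    with open(wav_path, "rb") as wav, open(transcript_path, "rb") as txt:
        response = requests.post(
            MAUS_URL,
            files={"SIGNAL": wav, "TEXT": txt},
            data={"LANGUAGE": language, "OUTFORMAT": "TextGrid"},
        )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(align("good_quality.wav", "reliable_transcript.txt"))
```

Note that this is exactly the “transcript supplied” scenario described above: the alignment can only ever be as good as the transcript we feed in.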
When we tried Descript, only three words were recognised by the system: yes, yeah and okay. In actual fact, the number of words uttered by the four speakers was 116. Even then, the word yes was not exactly correct (it was another repetition of yeah). Why were these words recognised when others were not? They were somewhat louder, so it is likely that they “stood out” from the background noise. As you will see in the paper, Descript did well at identifying what our Irish English speaker said.
Discussion and Conclusion
It is not surprising that the systems we tried out were unable to recognise what speakers were saying in indistinct audio: they are simply not designed to do this. With clear, non-overlapping speech the systems work quite well, and with a transcript (in the case of MAUS) they also do well – in both cases the tools are doing the job they are designed to do.
It is of course true that automatic methods can be used to solve some issues in forensics – for example, systems could be used to cut down work and make analysis more efficient by segmenting speech from non-speech in clear recordings. As we saw with MAUS, the system was very good at segmenting non-speech sections.
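As an illustration of that kind of pre-processing, here is a minimal sketch of energy-based sound/silence segmentation using librosa’s silence splitter. The file name and the 30 dB threshold are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: energy-based segmentation of sound from silence, the kind
# of pre-processing that could cut down manual work on clear recordings.
# The file name and top_db threshold are illustrative assumptions.
import librosa

y, sr = librosa.load("clear_recording.wav", sr=None)  # hypothetical file

# Intervals (in samples) where the signal is within 30 dB of the peak -
# a rough proxy for "foreground sound", not for speech specifically.
intervals = librosa.effects.split(y, top_db=30)

for start, end in intervals:
    print(f"sound: {start / sr:.2f}s to {end / sr:.2f}s")
```

A simple energy threshold cannot tell speech from drumming or laughter, which is exactly the failure mode we saw with the poor quality recording; proper speech detection would need a dedicated voice activity detection model.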
As things currently stand, when recordings are poor quality and there is no known transcript (typical of forensic contexts), automatic methods cannot solve the problem of what was said in indistinct forensic audio – that is what our research has demonstrated. However, some early piloting shows that people with an aptitude for transcription of indistinct forensic audio get quite a lot of the material correct, and their mistakes make sense phonetically. We think human intervention is a better way forward despite the time-consuming nature of transcribing such audio; human intervention would actually reduce future problems and be more cost-effective.
Via the Hub, we are currently engaged in more research on this topic. We are trying out some more speech samples (indistinct forensic audio this time), as well as other automatic systems.
Note that this paper explores ideas we presented at the conference run by the International Association for Forensic Phonetics and Acoustics in 2021. We have a version of that presentation you can watch/listen to here – the research in the Frontiers paper is more up to date, but if you are interested in the audio used you can hear it in that presentation.
References
Bird, S. (2021). Sparse Transcription. Computational Linguistics, 46(4), 713-744.
Cox, F. and S. Palethorpe. (2014). Phonologisation of vowel duration and nasalised /æ/ in Australian English. Proceedings of the 15th Australasian International Conference on Speech Science and Technology, 33-36.
Gonzalez, S., C. Travis, J. Grama, D. Barth and S. Ananthanarayan. (2018). Recursive forced alignment: A test on a minority language. In J. Epps, J. Wolfe, J. Smith & C. Jones (eds) Proceedings of the 17th Australasian International Conference on Speech Science and Technology, ASSTA Inc: Sydney. 145-148.
Jones, C., W. Li, A. Almeida and A. German. (2019). Evaluating cross-linguistic forced alignment of conversational data in north Australian Kriol, an under-resourced language. Language Documentation and Conservation, 13, 281-299.
Lindh, J. (2017). Forensic Comparison of Voices, Speech and Speakers: Tools and Methods in Forensic Phonetics. PhD dissertation, University of Gothenburg.
Loakes, D. (2022). Does Automatic Speech Recognition (ASR) Have a Role in the Transcription of Indistinct Covert Recordings for Forensic Purposes? Frontiers in Communication, 7, Article 803452, 1-13.
Mackenzie, L. and D. Turton. (2020). Assessing the accuracy of existing forced alignment software on varieties of British English. Linguistics Vanguard, 6(s1). https://doi.org/10.1515/lingvan-2018-0061
Villarreal, D., L. Clark, J. Hay and K. Watson. (2020). From categories to gradience: Auto-coding sociophonetic variation with random forests. Laboratory Phonology, 11(1), 1-31. http://doi.org/10.5334/labphon.216
Software
MAUS Munich Automatic Segmentation System
Descript
Acknowledgements
Thanks to work experience student Caitlin Fox for her help editing this blog post.