Event Summary – Linguistics in the Pub

On Wednesday, 28 April 2021, I led a discussion at Naughton’s, a lovely old “pub” opposite The University of Melbourne on Royal Parade. This was a Linguistics in the Pub event. Impressively, Linguistics in the Pub has been running for 10 years now. You can read about Linguistics in the Pub here, and you can now (as of April 2021!) also follow the Facebook page for updates.

I was very happy to attend an in-person event for this discussion – after the heavy Melbourne lockdowns in 2020, it was refreshing to see friends, colleagues and students in person. The night started with us all having dinner and catching up. I was asked to come and tell people about our new Research Hub, so when the time came to start the discussion I gave a little introduction to our two main goals:

1) Change legal procedures around the use of transcripts in court (via engagement with judges);

2) Develop evidence-based processes for reliable transcripts.

These are the topics that guide our Hub, which you can navigate to here.

plaque about the founding of Naughton’s

I mentioned to everyone that, as I am a phonetician, a lot of my examples would probably be about phonetics – I had actually given a lecture that same day on forensic phonetics and twins’ speech, so it was all at the front of my mind. After we talked about the fact that it is not easy to maintain vocal disguise, my friend and collaborator Chloé Diskin-Holdaway started some discussion by mentioning a funny scene from Home Alone 2 in which the main character (a child) uses a “voice changer” to imitate an adult voice and book a hotel room in New York – that gave us all a laugh (mainly because it is completely unrealistic!) and I promised to look it up later. Here is the (very funny) scene for those interested:

For the session, I had said I would do some “mythbusting”. The main myths we talked through were:

  1. Everyone has a unique voice
  2. Automatic computational methods are the answer for forensic linguistics and phonetics
  3. Linguists and other experts are not driven by preconceived ideas (“bias”)

Myth 1: Everyone has a unique voice

We talked about the fact that instinctively it feels like people we know have unique voices – for example, we can generally tell when we hear the voice of a friend or family member on the phone. So popular opinion would probably agree with the assertion that people have unique voices, but for the purposes of forensic phonetics (and forensic speaker comparison) it is safer to assume that, no, people do not have completely unique voices. In the current state of knowledge, we cannot compare two speech samples and say, purely on the basis of their voices, that they are definitely from the same speaker. My takeaway message on this myth is that it is safer to assume, for forensic purposes, that voices are not unique. Voices are not like fingerprints or DNA.

As I mentioned in an earlier blog post, I did my PhD on twins’ speech, and so I explored this topic in detail with some references to twin research – twins really challenge the notion of individual differences in speech (and whether these differences are quantifiable to a reliable degree). In fact, in some cases, non-identical twins may have voices that are quantifiably more similar to each other than are the voices of identical twins – it really depends on the speakers and their situation, and which variable(s) are being looked at. I told people about a great study I read recently in the Journal of Phonetics by Donghui Zuo and Peggy Pik Ki Mok, about individual differences in the speech patterns of bilingual twins. There are other points about twins’ speech in my earlier blog post linked at the top of this paragraph, but I highly recommend reading this one!

On the point of idiolect, I read out this quote from Francis Nolan (former head of the Cambridge Phonetics Lab):

“…at each point where communicative intent is mapped, there may be thought of as existing default values which are peculiar to the speaker, though they (normally) fall within the range permitted by the particular variety of the language” (Nolan 1983: 72)

When we think about it, the point of language is that it is shared. That means it is natural for linguistic and phonetic features to overlap (to various degrees) – and that makes it normal that people could sound very similar to one another. On the flip side, it is possible that a voice may have features that fall outside typical/default values, and in this case we may indeed be able to say that a speaker has a very distinctive voice. My friend and collaborator Dr. Kirsty McDougall (also in the Cambridge Phonetics Lab) is currently working on a very large project exploring voice similarity and vocal distinctiveness. I am really looking forward to hearing more about the findings (there is a second website here which will have updates on the project). She already has some publications in this area, including this one in Speech Communication comparing voice similarity estimates made by listeners with those made by an automatic speaker recognition system. This brings me on to the next point…

outside Naughton’s

Myth 2: Automatic computational methods are the answer for forensic linguistics and phonetics

I have been thinking about this issue a lot lately, having submitted an abstract on this very topic to the International Association for Forensic Phonetics and Acoustics annual conference, and a more extended abstract to our special issue in Frontiers – itself the subject of another blog post.

I am also planning a larger publication on this topic, so it was very apt that Nick Thieberger asked about the role of AI in forensic linguistics and phonetics. This is a very common question for the Research Hub.

I talked first with people about how acoustics offers a lot of information to phoneticians. In general, for clear speech recordings, phoneticians can see a lot of fine phonetic detail and can effectively “read” a spectrogram if they know what words are represented. However, it is certainly not safe to assume that a spectrogram is a “voiceprint” by analogy with a fingerprint – although, interestingly, this term is used in some government services in Australia. Given that such services already have a lot of contextual information about people, this is a way of confirming identity, not establishing it – but it is confusing for these services to use the pseudo-scientific term “voiceprint” in this way.

Just on this idea of a voiceprint, we are far more likely to come close to being able to “read” what is said in a clear speech recording, compared with an indistinct forensic recording, because we have clear audio and the speaker likely has no reason to hide what is being said. However, as research shows, people can be exposed to exactly the same audio and interpret it differently depending on their (language) experience. For example, there are studies showing that listeners will hear differently depending on their own regional background (e.g. Loakes et al. 2014), or even depending on what they assume the speaker’s background to be (e.g. Jannedy & Weirich 2014).

And back to Nick’s question about the role of AI for forensics. Because forensic recordings contain so much background noise, and the speech in them is often extremely indistinct, we are of the opinion that we cannot reliably use AI to solve the problem of transcribing indistinct covert recordings. We are going to demonstrate this very soon in the Frontiers paper I mentioned, so in time Helen and I will have more to report on that. There are also other researchers around the world working on how we can harness automatic methods to our advantage in forensics (I will review this too in the Frontiers paper).

Of course, we also need to acknowledge how incredibly useful automation can be in linguistic and phonetic research (and in daily life – think of all the voice-activated software we have that works so much better than it used to!). To give an example from phonetics, Labov, Rosenfelder and Fruehwald (2013) describe the immense gain in efficiency when moving from manual to automatic methods of phonetic analysis, finding it possible to do 30 times the amount of analysis using automation compared with the manual method. While precision can then become a problem, in non-forensic cases it has been convincingly argued that the loss of precision is a risk worth taking. For example, in their paper on sociolinguistic analysis of large corpora, Evanini, Isard and Liberman (2009) note that “when very large corpora are used, errors in individual tokens and even individual speakers will not harm the analysis”. We cannot say the same for forensic situations, in which there is generally so little speech material and the stakes are so high.
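As a rough, concrete illustration of what that kind of automation looks like, here is a minimal sketch of batch formant measurement using the Praat-based parselmouth library. This is my own toy example, not the pipeline used in the Philadelphia study – the “corpus” directory, file pattern and sampling settings are placeholder assumptions:

```python
# A minimal sketch of batch formant measurement with the Praat-based
# parselmouth library (pip install praat-parselmouth). The "corpus"
# directory and the sampling settings are placeholders for illustration.
from pathlib import Path

import parselmouth


def measure_formants(wav_path, step=0.05):
    """Sample F1 and F2 at regular intervals across one recording."""
    snd = parselmouth.Sound(str(wav_path))
    # Burg-method formant tracking with standard Praat settings
    formant = snd.to_formant_burg(
        time_step=0.01, max_number_of_formants=5, maximum_formant=5500
    )
    rows = []
    t = step
    while t < snd.duration:
        # get_value_at_time returns NaN in silent/unvoiced stretches
        f1 = formant.get_value_at_time(1, t)
        f2 = formant.get_value_at_time(2, t)
        rows.append((t, f1, f2))
        t += step
    return rows


# The same routine runs unchanged over an entire corpus of recordings –
# the scale of analysis that would take weeks of manual measurement.
for wav in sorted(Path("corpus").glob("*.wav")):
    for t, f1, f2 in measure_formants(wav):
        print(f"{wav.name}\t{t:.2f}\t{f1}\t{f2}")
```

Even this toy loop shows where the 30-fold speed-up comes from: one measurement routine, written once, applied to every recording without fatigue – and also where the precision risk comes from, since nobody checks each individual token.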

Myth 3: Linguists and other experts are not driven by preconceived ideas

The goals of the Research Hub for Language in Forensic Evidence centre on transcripts: we want to change the way they are handled in court, and we want to understand how to create better transcripts. At this Linguistics in the Pub event, we first discussed the idea that everyone uses some preconceived ideas when they decode speech, even experts. For example, if we are looking at a spectrogram and see a fricative followed by a nasal at the onset of a word, we can already make a lot of predictions about what this is (or is not) – assuming the utterance is in English – by using our phonological knowledge. A combination of /s/ + /m/ is likely (e.g. a word like smart or small), whereas /s/ + /ŋ/ or /θ/ + /z/ are highly unlikely to occur as onset sequences. If we decide we are seeing evidence for /s/ + /m/, we are gradually narrowing down the possibilities (there is a toy sketch of this filtering below).

It is important to realise that this is normal – this is what language is designed to do: be efficient. However, it is also important to acknowledge that everyone uses top-down as well as bottom-up techniques when interpreting speech; we just do it to differing degrees and need to be aware of it. Contextual information is needed to make an informed judgement about what we are faced with, and if that contextual information is correct it is useful. In everyday life the stakes are relatively low when information incorrectly guides our judgements, yet in forensic situations this can be highly problematic.
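Here is that toy sketch of phonotactic filtering. It is my own illustration – the segment sets and the legal-onset list are deliberately simplified assumptions, not an exhaustive phonology of English:

```python
# Toy illustration of top-down phonotactic filtering: given bottom-up
# acoustic evidence of "fricative followed by nasal" at a word onset,
# English phonotactics already rules most candidates out.
FRICATIVES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h"}
NASALS = {"m", "n", "ŋ"}

# Fricative + nasal onsets that English actually permits (simplified).
LEGAL_ONSETS = {("s", "m"), ("s", "n")}  # e.g. "small", "snow"


def candidate_onsets(first_class, second_class):
    """Onsets consistent with both the acoustic evidence and phonotactics."""
    return sorted(
        (a, b)
        for a in first_class
        for b in second_class
        if (a, b) in LEGAL_ONSETS
    )


print(candidate_onsets(FRICATIVES, NASALS))
# [('s', 'm'), ('s', 'n')] – /s/+/ŋ/, /θ/+/z/ and the rest never surface
```

Of the 27 acoustically conceivable fricative-plus-nasal pairs here, only two survive the filter – the same kind of narrowing an analyst does implicitly when “reading” a spectrogram.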

Helen has written a lot on this topic of what happens when a transcript exists – it acts to prime people, whether it is correct or not, no matter how objective they would like to remain, how well trained they are, or how much they resist. Nobody is immune to priming – you can try this out here. Sadly, there have also been miscarriages of justice because of this issue; see for example this recent paper.

Helen and I also have a paper coming out very soon called Acoustic Injustice: The experience of listening to indistinct covert recordings presented as evidence in court, in which we talk about some of the issues explored at Linguistics in the Pub and go deeper into the implications for justice.

Naughton’s facade

UPDATE (25 Oct 2021)

For people wanting more information on Linguistics in the Pub, there is now an academic article about this by Gawne & Singer (2021) in the Australian Journal of Linguistics.

References

Evanini, K., S. Isard and M. Liberman (2009). “Automatic formant extraction for sociolinguistic analysis of large corpora”. Proceedings of Interspeech 2009, Brighton, UK, 1655-1658.

Fraser, H. and D. Loakes (2020). “Acoustic injustice: The experience of listening to indistinct covert recordings presented as evidence in court”. In M. San Roque, S. Ramshaw and J. Parker (Eds.) Law, Text, Culture (special issue “The Acoustics of Justice: Law, Listening, Sound”), Vol. 24.

French, P. and H. Fraser (2018). “Why ‘ad hoc experts’ should not provide transcripts of indistinct forensic audio, and a proposal for a better approach”. Criminal Law Journal, 42, 298-302.

Gawne, L. and R. Singer (2021). “Ten years of Linguistics in the Pub”. Australian Journal of Linguistics. Published online 21 October 2021.

Gerlach, L., K. McDougall, F. Kelly, A. Alexander and F. Nolan (2020). “Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features”. Speech Communication, 124, 85-95.

Jannedy, S. and M. Weirich (2014). “Sound change in an urban setting: Category instability of the palatal fricative in Berlin”. Laboratory Phonology, 5, 91-122.

Labov, W., I. Rosenfelder and J. Fruehwald (2013). “One hundred years of sound change in Philadelphia: Linear incrementation, reversal, and reanalysis”. Language, 89(1), 30-65.

Loakes, D., J. Hajek, J. Clothier and J. Fletcher (2014). “Identifying /el/-/æl/: A comparison between two regional Australian towns”. In J. Hay and E. Parnell (Eds.) Proceedings of the 15th Australasian International Conference on Speech Science and Technology, Canterbury: ASSTA, 41-44.

Nolan, F. (1983). The Phonetic Bases of Speaker Recognition. Cambridge: Cambridge University Press.

Zuo, D. and P.P.K. Mok (2015). “Formant dynamics of bilingual identical twins”. Journal of Phonetics, 52, 1-12.