[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]
This week, we’re turning to speech recognition. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them? For that matter, how does human speech recognition work? Why do baristas screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries this week and next.
Textbook reading
Read Section 1.4. It covers the basics of speech recognition from a computer’s perspective, along with a quick overview of the sound patterns of human language, which are covered in much more detail in the reading below.
Additional articles/videos
Read through Sections 2.1, 2.2, 2.3, and 2.6 of Language Files. These provide more detail, from a linguistic perspective, on how linguistic sounds are produced (2.1–2.3) and perceived (2.6). Any reasonable speech recognition system will need to incorporate this sort of information to accurately determine what sounds people are making.
Also, the discussion of syllable structure should help clarify the nature of syllabaries and abugidas from our discussion of writing systems. (Sections 2.4 and 2.5 are less important for English speech recognition, so you can skip them. But in case you’re interested in the structure of language more generally, I left them in the file for you.)
Since that’s pretty dense reading, I want to wrap up the week with one short article from Scientific American that examines which dialects of English are actually captured by current speech recognition technology. Think about cases where you or your friends are misunderstood, whether by humans or computers, and we’ll talk about how these failures arise and can be countered.
(One last thing, strictly optional: the Proceedings of the National Academy of Sciences article that forms the basis of the Scientific American piece is quite good, and worth a look if you have the time/interest.)