[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]
This week, we’re digging deeper into spell checking and language models. We’ll focus on simple models first, like n-grams, which we talked about at the end of class. That will require us to spend a little time talking about probabilities (especially conditional probabilities) as well.
Textbook reading
First, please read the remainder of Chapter 2 of the textbook. Don’t panic if you’re having trouble with Section 2.4.1; it’s a too-brief overview of syntax, which is a quite complex part of linguistics. We’ll go into more depth on syntax next week, once we understand simpler language models like n-grams, but getting some familiarity with the concepts now will help when we come back to them.
Additional articles, etc.
I have some additional notes that should help with understanding the concepts in this chapter. The first is a basic overview of probability theory for linguistics (PDF). This is optional reading, but if you’re not familiar with probabilities or hate math, I hope you’ll find it an accessible introduction to the topic, which will come up a few times this semester. I originally developed it for my Ling 502 class, but the concepts apply equally well in this class. We’ll talk about conditional probability and Bayes’ Rule as ways of working with n-gram models this week.
The second set of notes looks at how we collect and use linguistic data to try to build better language models (PDF). In particular, you might find it interesting to play around on Google Books N-grams to look at real-world usage data and see how increased context changes the probabilities of certain words.
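To make the idea of context changing a word's probability concrete, here is a minimal sketch of unsmoothed unigram and bigram estimates. The toy corpus and the function names (`unigram_prob`, `bigram_prob`) are invented for illustration; they are not taken from the notes or from Google Books N-grams, and real models use far more data plus smoothing.

```python
from collections import Counter

# Toy corpus, invented for illustration (not real usage data).
corpus = "the cat sat on the mat . the dog sat on the log . the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))


def unigram_prob(word):
    """P(word): relative frequency of the word in the corpus."""
    return unigram_counts[word] / len(corpus)


def bigram_prob(word, prev):
    """P(word | prev) = count(prev, word) / count(prev), with no smoothing."""
    if unigram_counts[prev] == 0:
        return 0.0  # unseen context: no estimate is available without smoothing
    return bigram_counts[(prev, word)] / unigram_counts[prev]


p_cat = unigram_prob("cat")                # 2 of 17 tokens
p_cat_given_the = bigram_prob("cat", "the")  # 2 of the 5 occurrences of "the"
print(f"P(cat) = {p_cat:.3f}, P(cat | the) = {p_cat_given_the:.3f}")
```

In this toy corpus, knowing that the previous word is "the" raises the estimated probability of "cat" from about 0.118 to 0.4, which is the kind of shift you can look for with real data.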
The last set of notes (PDF) digs into the topic of Under the Hood 3: dynamic programming. I don’t think the book does a great job of explaining how dynamic programming (in the form of topological orderings on directed acyclic graphs) works, so I worked through a few examples.
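As one possible illustration of dynamic programming in a spell-checking setting, here is a minimal sketch of minimum edit distance. Whether this matches the examples in the notes is an assumption on my part; the function name `edit_distance` and the unit costs for insertion, deletion, and substitution are also choices made for this sketch. Each table cell depends only on the cells above, to the left, and diagonally up-left, so filling the table row by row visits the cells in a topological order of that dependency graph.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and substitutions
    (each with cost 1, an assumption of this sketch) turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]

    # Base cases: turning a prefix into the empty string, or vice versa.
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j

    # Row-by-row filling guarantees every cell's three predecessors are done.
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # delete from source
                dist[i][j - 1] + 1,                 # insert into source
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[-1][-1]


print(edit_distance("teh", "the"))         # 2 (no transposition operation here)
print(edit_distance("kitten", "sitting"))  # 3
```

Note that "teh" to "the" costs 2 in this version because a swap of adjacent letters is counted as two substitutions; treating transposition as a single operation would be a different variant.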
Finally, let’s wrap it up with a look at how spell checkers succeed and fail in practice. First, here are two blog posts on the Cupertino effect, an unintended consequence of early automatic spelling correction systems (link, link). Second, here is a blog post from the team at Microsoft that worked on Office 2007’s spell checker, discussing how they chose to trade off between high precision (if it labels something an error, it’s probably right, but it also misses some errors) and high recall (it catches most errors, but also flags a lot of non-errors). I found their discussion of user preferences really interesting, and I’d like us to talk on Thursday about user design in these kinds of systems.
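The precision versus recall trade-off can be sketched with a small calculation. Everything here is hypothetical: the word lists, the two imagined checkers ("cautious" and "aggressive"), and the function name `precision_recall` are made up for illustration and do not come from the Microsoft post.

```python
def precision_recall(flagged, true_errors):
    """Return (precision, recall) for a set of flagged words
    against the set of words that really are errors."""
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = len(flagged & true_errors)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(true_errors) if true_errors else 0.0
    return precision, recall


# Hypothetical document: these four words are the real misspellings.
true_errors = {"teh", "recieve", "adress", "occured"}

# A cautious checker flags little, and everything it flags is a real error.
cautious = {"teh", "recieve"}

# An aggressive checker catches every error but also flags correct words.
aggressive = {"teh", "recieve", "adress", "occured", "Canvas", "ngram", "Bayes", "Ling"}

print(precision_recall(cautious, true_errors))    # (1.0, 0.5)
print(precision_recall(aggressive, true_errors))  # (0.5, 1.0)
```

The cautious checker never bothers the user with a false alarm but misses half the errors; the aggressive one misses nothing but is wrong about half of what it flags.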