Categories
Uncategorized

Ling 354: Language & Computers

So much of our modern lives happens through computers, phones, and other electronic devices. The same is true of our language. How do computers shape our language, and how do we adapt computers to our language use? This class considers a range of topics connecting language and computers, including speech recognition systems (Alexa, Siri, Google Assistant), emoji use, and sentiment analysis. It also covers basic linguistic and algorithmic concepts to help understand the strengths and failures of our contemporary language systems.

Students in this class tend to have a pretty wide range of majors, interests, and experiences. Some know much more about language, others much more about computers, and some know both or neither. My goal in the class is to fill in the gaps in students’ linguistic or computational knowledge, then build out into the applications they find interesting. For instance, one Ling 354 student ended up submitting an emoji proposal to the Unicode Consortium after this class!

Here are some of the topics we cover, along with links to the reading lists (extracted from Canvas, so apologies for their imperfections). I’ll be adding to these slowly, because extracting them out from Canvas is exhausting!

Writing Systems, Unicode, and Emoji

Computers think in ones and zeros, or more accurately, in “ons” and “offs”. Human writing systems are much more complex; even the smallest, Rotokas, contains 12 letters. Furthermore, many writing systems lack “letters” in the sense of the English alphabet, with characters that indicate whole syllables or even words.

In this topic, we’ll look at the diversity of linguistic writing systems (alphabets, abjads, syllabaries, etc.), and how these get represented on a computer through Unicode, with each character getting its own number.

We’ll also examine emoji and start asking: what’s their relationship to language? Are they linguistic elements, like words? Are they paralinguistic, like gestures or intonation? Or are they something else entirely?

Readings:

Project: Propose an emoji

Speech recognition

Speech recognition is probably one of the most common ways that you encounter “natural language processing”: getting a computer to understand human language. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them?

For that matter, how does human speech recognition work? Why do they screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries and more.

In this topic, we’ll look at the range of linguistic sounds, how language structure their sound inventories, how humans hear and understand spoken language, and how computers can (try to) do the same. We’ll also examine some of the shortcomings of speech recognition, especially involving less-studied languages and dialects. What biases do these systems have, and what can we do about them?

Readings:

Spell-check, autocorrect, & grammar checking

Once we have the basics of speech recognition down, we can turn to how computers understand larger linguistic structures, like words and sentences. What “language models” does a computer have, and how are these used?

We’ll look through the lens of autocorrect and grammar checking, to understand how computers deal with input that doesn’t fit their expectations. When I type “langauge”, did I mean to type “language”, or did I mean to type this weird non-word? That’s pretty easy to tell, but what if I typed “causal”? What are the odds I meant to type “casual”? Should the system ask me? Should it autocorrect? What information can the system use to improve its guesses?

We’ll examine language models from the most basic (word frequency) to more clever ones. We’ll see how they can be used in speech recognition, autocorrect, and even autocompletion.

Readings:

And more to come…

Categories
LangComp

Ling 354: Week 6

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re digging deeper into spell checking and language models. We’ll focus on simple models first, like n-grams, which we talked about at the end of class. That will require us to spend a little time talking about probabilities (especially conditional probabilities) as well.

Textbook reading

First, please read the remainder of Chapter 2 of the textbook. Don’t panic if you’re having trouble with Section 2.4.1; that’s a too-brief overview of syntax, a quite complex part of linguistics. We’ll go into more depth on syntax next week once we understand simpler language models like n-grams, but getting some familiarity with the concepts now will help when we come back to it next week.

Additional articles, etc.

I have some additional notes that should help with understanding the concepts in this chapter. The first is a basic overview of probability theory for linguistics (PDF).  This is optional reading, but if you’re not familiar with probabilities or hate math, I hope you’ll find it an accessible introduction to the topic, which will come up a few times this semester. I originally developed it for my Ling 502 class, but the concepts apply equally well in this class. We’ll talk about conditional probability and Bayes’ Rule as ways of working with n-gram models this week.

The second set of notes looks at how we collect and use linguistic data to try to build better language models (PDF). In particular, you might find it interesting to play around on Google Books N-grams to look at real-world usage data and see how increased context changes the probabilities of certain words.

The last set of notes (PDF) digs into the topic of Under the Hood 3: dynamic programming. I don’t think the book does a great job of explaining how dynamic programming (in the form of topological orderings on directed acyclic graphs) works, so I worked through a few examples.

Finally, let’s wrap it up with a look at how spell checkers succeed and fail in practice. First, here are two blogposts on the Cupertino effect, an unintended consequence of early automatic spelling correction systems (linklink). Second, a blogpost from the team at Microsoft that worked on Office 2007’s, discussing how they chose to trade off between high precision (if it labels something an error, it’s probably right, but it also misses some errors) versus high recall (it catches most errors, but also flags a lot of non-errors). I found their discussion of user preferences really interesting, and I’d like us to talk on Thursday about user design in these kinds of systems.

Categories
LangComp

Ling 354: Week 5

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re building on last week’s core ideas about speech recognition. We’ll start by discussing biases in speech recognition and how to overcome them, based on the Scientific American article as well as two others I’m posting here on machine learning biases. Next, we’ll talk about a simple but conceptually useful algorithm, which can help us design a simple speech recognition model: the nearest-neighbors algorithm. Those readings will help us shore up our phonetic model, and the last topic of this week’s class will be starting in on the language model. We’ll go back to the textbook and examine how spell checkers and autocorrect systems work, and what information they use to infer what you meant when the input data is noisy or erroneous.

Additional articles/videos

We’ll start by talking about the short article from Scientific American that examines which dialects of English are actually captured by current speech recognition technology. This is the same article as last week’s reading, so hopefully you’ve already read it.

I wanted to add a little more context for how biases emerge in machine learning/AI algorithms like speech recognition systems, and I think these two are pretty good. The first, from the MIT Technology Review (HTML), is a brief summary of some of the common sources of AI bias and why fixing them is nontrivial.

The second, a Medium post by a data scientist (HTML), digs deeper into how we can quantify biases and makes the argument that the very process of machine learning is a biased perception of data, so addressing bias in machine learning is even more complex than it initially seems. (I’ll confess I’m not entirely won over by this argument, which feels a little too hand-washy, but I think the idea is worth ruminating on.)

You might be wondering how we actually implement speech recognition. We talked in class about the features that different sounds have, but how does an algorithm classify them? I’ve written up some notes on a simple classification algorithm, known as Nearest Neighbors. This algorithm trains on labelled examples of different sounds in a language or dialect (e.g., a bunch of examples of people producing a specified vowel) and classifies a new sound based on which labelled examples it’s most similar to. While modern speech recognition systems use more complex algorithms than this one, nearest neighbors is a nice tradeoff between effectiveness and ease of implementation. We’ll discuss the algorithm and how to apply it to speech recognition and other linguistic tasks in class.

Textbook reading

Lastly, please read Sections 2.1-2.3, excluding “Under the Hood 3: Dynamic programming” from the textbook. This section covers the basics of spell checking/autocorrect, as well as our first exposure to trigram models, which will pop up a few more times through the semester. We’ll cover the rest of the chapter next week, including the dynamic programming section, so read ahead if you’re interested.

Categories
LangComp

Ling 354: Week 4

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re turning to speech recognition. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them? For that matter, how does human speech recognition work? Why do they screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries this week and next.

Textbook reading

Sect 1.4. This covers the basics of speech recognition from a computer’s perspective, as well as a quick overview of the sound patterns of human language, which is covered in much more detail in the reading below.

Additional articles/videos

Read through Sections 2.1, 2.2, 2.3 and 2.6 of Language Files. This provides more detail, from a linguistic perspective, on how linguistic sounds are produced (2.1-2.3) and perceived (2.6). Any reasonable speech recognition system will need to incorporate this sort of information to accurately determine what sounds people are making.

Also, the discussion of syllable structure should help clarify the nature of syllabaries and abugidas from our discussion of writing systems.  (Sections 2.4 and 2.5 are less important for English speech recognition, so you can skip them. But in case you’re interested in the structure of language more generally, I left them in the file for you.)

Since that’s pretty dense reading, I want to wrap up the week with one short article from Scientific American that examines which dialects of English are actually captured by current speech recognition technology. Think about cases where you or your friends are misunderstood, whether by humans or computers, and we’ll talk about how these failures arise and can be countered.

(One last thing, and strictly optional, but the Proceedings of the National Academy of Sciences article that forms the basis of the SA article is pretty good, and worth a look if you have the time/interest.)

Categories
LangComp

Ling 354: Week 3

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 3 meeting, we’ll first wrap up emoji, including digging a bit deeper into how deeply we share our understanding of emoji. We’ll then turn to the QWERTY effect, research that argues that the ways we type language has a subtle but significant influence on our perception of it. 

Textbook reading

No textbook reading for this week.  

Additional articles/videos

We’ll start class by discussing the Bai et al 2019 paper (A Systematic Review of Emoji: Current Research and Future Perspectives) that I’d meant to get to last class. Hopefully, you’ve already read it, but here’s the links again in case it’s helpful (HTML).

A couple people in the pre-class discussion had questions about a point that Bai et al made, which is that emoji are prone to “inefficiency” and “misunderstanding”. I’ll be honest: I was also confused by Bai et al’s discussion on this point. So I went back to the papers they cited, and I found one that both clarifies this point and is interesting in its own right: Tigwell & Flatla 2016. We’ll discuss this paper alongside the Bai et al one, and talk more generally about how messages are understood and misunderstood. (Optionally, if you’re interested in these issues, you may also want to read this paper.)

For the QWERTY effect, we’ll be reading an original research paper: Jasmin and Casasanto 2012. The statistical analysis in this paper may be a little tough if you’re not familiar with such things, so if you’re feeling stuck, focus on the higher-level concepts over the specific results. What is the QWERTY effect supposed to be? What do J&C think might cause it? How do they propose testing it? Do you find their methods convincing? How could you adapt this work to investigate languages/cultures with other keyboards and other writing systems?

Categories
Uncategorized

Ling 354: Week 2

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 2 meeting (Feb 3), we’ll first wrap up writing systems and Unicode. Then we’ll be looking into the implications of getting languages onto computers, including ways that computerized language may support or impede certain languages’ use and ways that the Unicode system encourages novel uses of “language” via emoji.

Textbook reading

No new textbook reading, but we’ll keep using concepts from Ch. 1.1-1.3

Additional articles/videos

If you didn’t read the Daniels 2003 chapter last week, please skim through Sections 1, 2.1, 2.4, & 2.5. It’s imperfect reading for our class, unfortunately; there’s a lot of linguistic jargon that Daniels doesn’t define and some details he elides, but as Daniels himself notes, no one else has written much on this topic. So that’s why I’d like you to just skim it — take in some of the history and examples of different writing systems, but don’t panic if parts are a little hard to follow.

I also found this video from Tom Scott Links to an external site.on Canadian Aboriginal syllabics to be a helpful (and largely linguistically accurate) discussion of why different languages might want different writing systems.

One big question I want us to discuss next class is whether computers and the Internet are helping us preserve endangered languages or are encouraging the endangerment of languages. Here are two articles that focus on the positive and negative aspects, respectively. The first is a short interview on European minority languages with a Finnish professor, and the second is a special report from the Guardian.

Lastly, I want us to examine a new form of “language” that can only really exist on computers and other electronic devices: emoji. I’ve found a review article on emoji usage (HTML), however you prefer to read it) and its impact in various fields. If you’re interested in emoji, take a look at the process for proposing a new emoji, which will be one of your options for the first assignment.

Project ideas and extensions (optional)

Arts/Humanities

I’m personally fascinated by the idea of swapping out writing systems across languages, and seeing what works and doesn’t. Try taking your name, or any words that you’re fond of, and transliterate them into different writing systems. For instance, Gabe turns into 겹 in Korean, గేబ్ in Telugu, and がいぶ in Japanese (where it’s pronounced “Gabe-u”). How do you have to adapt the word to the writing system — does the pronunciation change? Is there much choice in the transliteration? For instance, I could write my name as “Gaib” or “Gaybe” and still pronounce it the same, but 겹 is essentially the only option in Korean because of its transparent orthography.

In general, how does changing the writing system of a text change your impression of the text?

Social Science

Why do people switch languages in online communication? Is it for psycholinguistic reasons (for instance, the author is more familiar with one language or another), sociolinguistic reasons (the author wants to identify with a certain group of speakers), or other reasons entirely? 

Nguyen et al (2015) examined how and why people change their language use within Twitter conversations based on audience size; for larger audiences, more common languages are preferred, while for personal communication, less common but more familiar languages are preferred. Does this fit with how you use different languages, different dialects, or even just different slang?

Emoji are largely intended as human representations, but there’s a long and complex history of how different races and cultures actually get represented by emoji. I found this paper by Kate Miltner to be a helpful overview of the history of the human side of emoji (although at least some of Miltner’s points have been addressed by more recent Unicode updates, despite the paper only being two years old!).

Engineering

One big engineering problem with emoji is understanding what people mean by them. Because different systems render emoji differently, there may be mismatches between the intended use and the interpreted meaning. Here’s a study of how much people varied in their interpretations of emoji, both in terms of inter-person and inter-platform variance, showing that there are some serious ambiguities that can arise from them. In a similar vein, we might try to use people’s emoji usage to get a better sense of their emotional meanings

Categories
Uncategorized

Ling 354: Week 1

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 1 meeting (Jan 27), we’ll be looking at the world’s writing systems, and how they can be encoded into a format for computers to use. 

Textbook reading

The textbook reading for Week 1 is Ch. 1.1-1.3. This covers the basics of the Unicode system and a range of writing systems.

Additional articles

I’d also like you to read at least Sections 1 & 2 of Daniels 2003. You don’t have to read the rest unless you find the subject interesting. This digs deeper into the history of writing systems and some of the difficulties in getting sounds down on the page. I know Section 2 is a little complex, so do your best — it’ll make more sense in Week 3 as we dig into speech recognition and see how consonants and vowels differ.

Finally, I’d like to discuss these two blogposts where character encodings lead to unexpected problems. First, Xudong Zheng’s experiment with a “homograph attack”, using identical or nearly-identical characters from different languages to create URLs that look right to the naked eye but actually link to a separate (potentially dangerous) website. Second, the Telugu character that will break your iPhone! (Well, not anymore, they fixed it.)

Getting familiar with Unicode (optional)

Unicode is a pretty complex system, and it’s really hard to wrap your head around all the characters that it encodes. I found the website https://decodeunicode.org to be pretty helpful in this regard; you can search through different Unicode “blocks”, representing different languages’ writing systems or other useful characters (including emoji). The video on their home page that scrolls through every Unicode character is really interesting to scan through and surprisingly hypnotic.

Project ideas and extensions (optional)

Arts/Humanities

There are a lot of artistic directions to take writing systems, but one that I have to confess a childish fondness for is “zalgo text“, which uses Unicode’s combining characters to stack up unnerving levels of diacritics onto text for a creepy aesthetic. There’s at least one online generator to convert your boring English text into glitchy goop.

Social Science

People can be surprisingly clever when they have to adapt a language for different electronic devices. When keypad-based mobile phones were dominant, sending text messages in Arabic was tricky, since the keypads were optimized for numbers and the Latin alphabet. “Arabizi”, a rendering of Arabic into English letters and numbers, developed as a way of easily transcribing Arabic writing into something phones could handle. 

Abu-Liel et al (2021) examine how accurately and quickly students read standard (vowelless) Arabic writing, Arabizi, and explicitly vowelled Arabic writing, and find that Arabizi falls in between the two. I’d be curious to build on this by looking at how people process text that has had its diacritics removed because of keyboard limitations (e.g., writing Spanish on a keyboard that makes it hard to include accents and tildes). I’m sure such research has been done, but I can’t find any right now.

Engineering

Thinking about the homograph attacks, could you build a system to detect and warn people about such attacks? Or, as an evil hacker, could you automate the generation of such attacks by generating lists of similar letterforms and building a system to swap them out? More generally, how do you build robust systems for writing systems you don’t personally know?