Ling 354: Language & Computers

So much of our modern lives happens through computers, phones, and other electronic devices. The same is true of our language. How do computers shape our language, and how do we adapt computers to our language use? This class considers a range of topics connecting language and computers, including speech recognition systems (Alexa, Siri, Google Assistant), emoji use, and sentiment analysis. It also covers basic linguistic and algorithmic concepts to help understand the strengths and failures of our contemporary language systems.

Students in this class tend to have a pretty wide range of majors, interests, and experiences. Some know much more about language, others much more about computers, and some know both or neither. My goal in the class is to fill in the gaps in students’ linguistic or computational knowledge, then build out into the applications they find interesting. For instance, one Ling 354 student ended up submitting an emoji proposal to the Unicode Consortium after this class!

Here are some of the topics we cover, along with links to the reading lists (extracted from Canvas, so apologies for their imperfections). I’ll be adding to these slowly, because extracting them from Canvas is exhausting!

Writing Systems, Unicode, and Emoji

Computers think in ones and zeros, or more accurately, in “ons” and “offs”. Human writing systems are much more complex; even the smallest, Rotokas, contains 12 letters. Furthermore, many writing systems lack “letters” in the sense of the English alphabet, with characters that indicate whole syllables or even words.

In this topic, we’ll look at the diversity of linguistic writing systems (alphabets, abjads, syllabaries, etc.), and how these get represented on a computer through Unicode, with each character getting its own number.
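To make "each character getting its own number" concrete, here is a minimal Python sketch. It looks up the Unicode code point for a handful of characters and counts how many bytes each takes in UTF-8, one common encoding. The sample characters and the helper name `describe` are my own picks for illustration.

```python
# Each character maps to a Unicode code point (an integer). An encoding such
# as UTF-8 then turns that integer into one to four bytes for storage.
SAMPLES = ["a", "\u00e9", "\u0c24", "\uac00", "\U0001F600"]  # a, é, Telugu ta, Hangul ga, grinning face

def describe(ch):
    """Return (code point written as U+XXXX, number of UTF-8 bytes) for one character."""
    code_point = ord(ch)              # character -> integer
    utf8_bytes = ch.encode("utf-8")   # integer -> bytes on disk
    return f"U+{code_point:04X}", len(utf8_bytes)

for ch in SAMPLES:
    label, n_bytes = describe(ch)
    print(ch, label, n_bytes)
```

Plain English letters fit in a single byte, while characters from most other writing systems (and emoji) need two to four.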

We’ll also examine emoji and start asking: what’s their relationship to language? Are they linguistic elements, like words? Are they paralinguistic, like gestures or intonation? Or are they something else entirely?

Readings:

Project: Propose an emoji

Speech recognition

Speech recognition is probably one of the most common ways that you encounter “natural language processing”: getting a computer to understand human language. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them?

For that matter, how does human speech recognition work? Why do they screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries and more.

In this topic, we’ll look at the range of linguistic sounds, how languages structure their sound inventories, how humans hear and understand spoken language, and how computers can (try to) do the same. We’ll also examine some of the shortcomings of speech recognition, especially involving less-studied languages and dialects. What biases do these systems have, and what can we do about them?

Readings:

Spell-check, autocorrect, & grammar checking

Once we have the basics of speech recognition down, we can turn to how computers understand larger linguistic structures, like words and sentences. What “language models” does a computer have, and how are these used?

We’ll look through the lens of autocorrect and grammar checking, to understand how computers deal with input that doesn’t fit their expectations. When I type “langauge”, did I mean to type “language”, or did I mean to type this weird non-word? That’s pretty easy to tell, but what if I typed “causal”? What are the odds I meant to type “casual”? Should the system ask me? Should it autocorrect? What information can the system use to improve its guesses?

We’ll examine language models from the most basic (word frequency) to more clever ones. We’ll see how they can be used in speech recognition, autocorrect, and even autocompletion.
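Here is a rough sketch of the most basic model mentioned above: a spell checker that knows only word frequencies. It generates every string one edit away from the input and ranks the known words by how common they are. The word counts are invented for illustration, and `one_edit_away` and `suggest` are hypothetical names; a real system would count words in a large corpus.

```python
from collections import Counter

# Toy word counts, made up for this example.
WORD_COUNTS = Counter({"language": 500, "casual": 120, "causal": 40, "lineage": 15})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def one_edit_away(word):
    """All strings reachable by one deletion, adjacent swap, substitution, or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + substitutions + inserts)

def suggest(word):
    """Return known words within one edit of `word` (or equal to it), most frequent first."""
    candidates = {w for w in one_edit_away(word) | {word} if w in WORD_COUNTS}
    return sorted(candidates, key=lambda w: -WORD_COUNTS[w])

print(suggest("langauge"))  # only one known word is a single swap away
print(suggest("causal"))    # two known words compete
```

With these toy counts, "langauge" has a single candidate, but "causal" is ranked below "casual" even though it is a real word typed correctly. Frequency alone cannot settle that case, which is why we move on to models that use more information.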

Readings:

And more to come…

Ling 354: Week 6

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re digging deeper into spell checking and language models. We’ll focus on simple models first, like n-grams, which we talked about at the end of class. That will require us to spend a little time talking about probabilities (especially conditional probabilities) as well.
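As a small illustration of how n-grams and conditional probabilities fit together, here is a sketch of a bigram model in Python. It estimates P(word | previous word) by dividing the count of the pair by the count of the previous word as a context. The three-sentence corpus and the function names are invented for this example, and there is no smoothing, so unseen pairs get probability zero.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent word pairs, and how often each word starts a pair."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()  # <s> marks the sentence start
        for previous, word in zip(tokens, tokens[1:]):
            contexts[previous] += 1
            bigrams[(previous, word)] += 1
    return contexts, bigrams

def conditional_prob(word, previous, contexts, bigrams):
    """Estimate P(word | previous) = count(previous, word) / count(previous)."""
    if contexts[previous] == 0:
        return 0.0  # we never saw this context at all
    return bigrams[(previous, word)] / contexts[previous]

corpus = ["the dog barks", "the dog sleeps", "the cat sleeps"]
contexts, bigrams = train_bigrams(corpus)
print(conditional_prob("dog", "the", contexts, bigrams))     # 2 of the 3 uses of "the"
print(conditional_prob("sleeps", "cat", contexts, bigrams))
```

A trigram model works the same way, except the context is the previous two words rather than one.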

Textbook reading

First, please read the remainder of Chapter 2 of the textbook. Don’t panic if you’re having trouble with Section 2.4.1; that’s a too-brief overview of syntax, a quite complex part of linguistics. We’ll go into more depth on syntax next week, once we understand simpler language models like n-grams, but getting some familiarity with the concepts now will help when we come back to them.

Additional articles, etc.

I have some additional notes that should help with understanding the concepts in this chapter. The first is a basic overview of probability theory for linguistics (PDF).  This is optional reading, but if you’re not familiar with probabilities or hate math, I hope you’ll find it an accessible introduction to the topic, which will come up a few times this semester. I originally developed it for my Ling 502 class, but the concepts apply equally well in this class. We’ll talk about conditional probability and Bayes’ Rule as ways of working with n-gram models this week.
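For a feel of what Bayes’ Rule does, here is a small worked example in Python. Suppose someone typed "causal" and we want to know which word they intended. Bayes’ Rule combines how common each word is (the prior) with how likely a typist who meant that word would be to produce exactly this string (the likelihood). All the numbers below are made up for illustration and do not come from the notes.

```python
def posterior(prior, likelihood):
    """Bayes' Rule: P(h | data) = P(data | h) * P(h) / P(data)."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(data), summed over all hypotheses
    return {h: p / evidence for h, p in unnormalized.items()}

# Hypothetical numbers. Prior: "casual" is three times as common as "causal".
prior = {"casual": 0.75, "causal": 0.25}
# Likelihood: chance of typing the string "causal" given the intended word.
likelihood = {"casual": 0.02, "causal": 0.95}

for word, prob in posterior(prior, likelihood).items():
    print(word, round(prob, 3))
```

Even though "casual" starts out more probable, the typed evidence favors "causal" so strongly that it wins once the two sources of information are combined.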

The second set of notes looks at how we collect and use linguistic data to try to build better language models (PDF). In particular, you might find it interesting to play around on Google Books N-grams to look at real-world usage data and see how increased context changes the probabilities of certain words.

The last set of notes (PDF) digs into the topic of Under the Hood 3: dynamic programming. I don’t think the book does a great job of explaining how dynamic programming (in the form of topological orderings on directed acyclic graphs) works, so I worked through a few examples.
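As one concrete instance of dynamic programming in spell checking, here is a sketch of minimum edit distance in Python. I am assuming unit costs for insertion, deletion, and substitution, which may not match the costs the textbook uses. Each cell of the table depends only on the cells above, to the left, and diagonally up-left, so filling the table row by row visits the cells in a topological order of that dependency graph.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and substitutions (unit costs assumed)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete everything
    for j in range(cols):
        dist[0][j] = j  # insert everything
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]

print(edit_distance("langauge", "language"))
```

Note that this version has no transposition operation, so swapping two adjacent letters counts as two edits rather than one.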

Finally, let’s wrap it up with a look at how spell checkers succeed and fail in practice. First, here are two blogposts on the Cupertino effect, an unintended consequence of early automatic spelling correction systems (link, link). Second, a blogpost from the team at Microsoft that worked on Office 2007, discussing how they chose to trade off between high precision (if the system labels something an error, it’s probably right, but it also misses some errors) and high recall (it catches most errors, but also flags a lot of non-errors). I found their discussion of user preferences really interesting, and I’d like us to talk on Thursday about user design in these kinds of systems.
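To pin down the precision/recall tradeoff, here is a short Python sketch that scores two imaginary spell checkers against the same text. The word lists are invented for illustration, and returning 1.0 when a set is empty is just one convention I picked to avoid dividing by zero.

```python
def precision_recall(flagged, true_errors):
    """Compare the words a checker flagged against the words that really are errors."""
    flagged, true_errors = set(flagged), set(true_errors)
    correct_flags = flagged & true_errors
    # Precision: of everything flagged, how much was really an error?
    precision = len(correct_flags) / len(flagged) if flagged else 1.0
    # Recall: of all the real errors, how many did we catch?
    recall = len(correct_flags) / len(true_errors) if true_errors else 1.0
    return precision, recall

true_errors = {"teh", "recieve", "langauge", "wierd"}
cautious = {"teh", "recieve"}                          # flags little, but is always right
aggressive = true_errors | {"Gabe", "emoji", "Rotokas", "abugida"}  # flags anything unusual

print(precision_recall(cautious, true_errors))
print(precision_recall(aggressive, true_errors))
```

The cautious checker never cries wolf but misses half the errors; the aggressive one catches everything but is wrong half the time it flags a word.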

Ling 354: Week 5

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re building on last week’s core ideas about speech recognition. We’ll start by discussing biases in speech recognition and how to overcome them, based on the Scientific American article as well as two others I’m posting here on machine learning biases. Next, we’ll talk about a simple but conceptually useful algorithm, which can help us design a simple speech recognition model: the nearest-neighbors algorithm. Those readings will help us shore up our phonetic model, and the last topic of this week’s class will be starting in on the language model. We’ll go back to the textbook and examine how spell checkers and autocorrect systems work, and what information they use to infer what you meant when the input data is noisy or erroneous.

Additional articles/videos

We’ll start by talking about the short article from Scientific American that examines which dialects of English are actually captured by current speech recognition technology. This is the same article as last week’s reading, so hopefully you’ve already read it.

I wanted to add a little more context for how biases emerge in machine learning/AI algorithms like speech recognition systems, and I think these two are pretty good. The first, from the MIT Technology Review (HTML), is a brief summary of some of the common sources of AI bias and why fixing them is nontrivial.

The second, a Medium post by a data scientist (HTML), digs deeper into how we can quantify biases and makes the argument that the very process of machine learning is a biased perception of data, so addressing bias in machine learning is even more complex than it initially seems. (I’ll confess I’m not entirely won over by this argument, which feels a little too hand-wavy, but I think the idea is worth ruminating on.)

You might be wondering how we actually implement speech recognition. We talked in class about the features that different sounds have, but how does an algorithm classify them? I’ve written up some notes on a simple classification algorithm, known as Nearest Neighbors. This algorithm trains on labelled examples of different sounds in a language or dialect (e.g., a bunch of examples of people producing a specified vowel) and classifies a new sound based on which labelled examples it’s most similar to. While modern speech recognition systems use more complex algorithms than this one, nearest neighbors is a nice tradeoff between effectiveness and ease of implementation. We’ll discuss the algorithm and how to apply it to speech recognition and other linguistic tasks in class.
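Here is a minimal sketch of the idea in Python, not the version from my notes. I am assuming each sound has already been reduced to two numbers (its first two formant frequencies, in Hz), and the training values below are made up rather than measured. The classifier labels a new sound with the vowel of the single closest training example.

```python
import math

# Invented (F1, F2) values in Hz, each labelled with the vowel the speaker intended.
TRAINING = [
    ((280, 2250), "i"), ((310, 2300), "i"),
    ((700, 1200), "a"), ((750, 1100), "a"),
    ((320, 850), "u"), ((300, 900), "u"),
]

def classify(sound, training=TRAINING):
    """Label `sound` with the vowel of its nearest labelled example (1-nearest-neighbor)."""
    nearest_point, nearest_label = min(
        training, key=lambda example: math.dist(sound, example[0])
    )
    return nearest_label

print(classify((290, 2200)))
print(classify((720, 1150)))
```

Two natural extensions are letting the k closest examples vote instead of trusting just one, and rescaling the features so that no single dimension dominates the distance.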

Textbook reading

Lastly, please read Sections 2.1-2.3 of the textbook, excluding “Under the Hood 3: Dynamic programming”. These sections cover the basics of spell checking/autocorrect, and give us our first exposure to trigram models, which will pop up a few more times through the semester. We’ll cover the rest of the chapter next week, including the dynamic programming section, so read ahead if you’re interested.

Ling 354: Week 4

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

This week, we’re turning to speech recognition. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them? For that matter, how does human speech recognition work? Why do they screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries this week and next.

Textbook reading

Sect 1.4. This covers the basics of speech recognition from a computer’s perspective, as well as a quick overview of the sound patterns of human language, which is covered in much more detail in the reading below.

Additional articles/videos

Read through Sections 2.1, 2.2, 2.3 and 2.6 of Language Files. This provides more detail, from a linguistic perspective, on how linguistic sounds are produced (2.1-2.3) and perceived (2.6). Any reasonable speech recognition system will need to incorporate this sort of information to accurately determine what sounds people are making.

Also, the discussion of syllable structure should help clarify the nature of syllabaries and abugidas from our discussion of writing systems.  (Sections 2.4 and 2.5 are less important for English speech recognition, so you can skip them. But in case you’re interested in the structure of language more generally, I left them in the file for you.)

Since that’s pretty dense reading, I want to wrap up the week with one short article from Scientific American that examines which dialects of English are actually captured by current speech recognition technology. Think about cases where you or your friends are misunderstood, whether by humans or computers, and we’ll talk about how these failures arise and can be countered.

(One last thing, and strictly optional, but the Proceedings of the National Academy of Sciences article that forms the basis of the SA article is pretty good, and worth a look if you have the time/interest.)

Ling 354: Week 3

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 3 meeting, we’ll first wrap up emoji, including a closer look at how much of our understanding of emoji we actually share. We’ll then turn to the QWERTY effect, research that argues that the way we type language has a subtle but significant influence on our perception of it.

Textbook reading

No textbook reading for this week.  

Additional articles/videos

We’ll start class by discussing the Bai et al 2019 paper (A Systematic Review of Emoji: Current Research and Future Perspectives) that I’d meant to get to last class. Hopefully, you’ve already read it, but here’s the link again in case it’s helpful (HTML).

A couple people in the pre-class discussion had questions about a point that Bai et al made, which is that emoji are prone to “inefficiency” and “misunderstanding”. I’ll be honest: I was also confused by Bai et al’s discussion on this point. So I went back to the papers they cited, and I found one that both clarifies this point and is interesting in its own right: Tigwell & Flatla 2016. We’ll discuss this paper alongside the Bai et al one, and talk more generally about how messages are understood and misunderstood. (Optionally, if you’re interested in these issues, you may also want to read this paper.)

For the QWERTY effect, we’ll be reading an original research paper: Jasmin and Casasanto 2012. The statistical analysis in this paper may be a little tough if you’re not familiar with such things, so if you’re feeling stuck, focus on the higher-level concepts over the specific results. What is the QWERTY effect supposed to be? What do J&C think might cause it? How do they propose testing it? Do you find their methods convincing? How could you adapt this work to investigate languages/cultures with other keyboards and other writing systems?

Ling 354: Week 2

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 2 meeting (Feb 3), we’ll first wrap up writing systems and Unicode. Then we’ll be looking into the implications of getting languages onto computers, including ways that computerized language may support or impede certain languages’ use and ways that the Unicode system encourages novel uses of “language” via emoji.

Textbook reading

No new textbook reading, but we’ll keep using concepts from Ch. 1.1-1.3.

Additional articles/videos

If you didn’t read the Daniels 2003 chapter last week, please skim through Sections 1, 2.1, 2.4, & 2.5. It’s imperfect reading for our class, unfortunately; there’s a lot of linguistic jargon that Daniels doesn’t define and some details he elides, but as Daniels himself notes, no one else has written much on this topic. So that’s why I’d like you to just skim it — take in some of the history and examples of different writing systems, but don’t panic if parts are a little hard to follow.

I also found this video from Tom Scott on Canadian Aboriginal syllabics to be a helpful (and largely linguistically accurate) discussion of why different languages might want different writing systems.

One big question I want us to discuss next class is whether computers and the Internet are helping us preserve endangered languages or are encouraging the endangerment of languages. Here are two articles that focus on the positive and negative aspects, respectively. The first is a short interview on European minority languages with a Finnish professor, and the second is a special report from the Guardian.

Lastly, I want us to examine a new form of “language” that can only really exist on computers and other electronic devices: emoji. I’ve found a review article on emoji usage and its impact in various fields (HTML). If you’re interested in emoji, take a look at the process for proposing a new emoji, which will be one of your options for the first assignment.

Project ideas and extensions (optional)

Arts/Humanities

I’m personally fascinated by the idea of swapping out writing systems across languages, and seeing what works and doesn’t. Try taking your name, or any words that you’re fond of, and transliterate them into different writing systems. For instance, Gabe turns into 겹 in Korean, గేబ్ in Telugu, and がいぶ in Japanese (where it’s pronounced “Gabe-u”). How do you have to adapt the word to the writing system — does the pronunciation change? Is there much choice in the transliteration? For instance, I could write my name as “Gaib” or “Gaybe” and still pronounce it the same, but 겹 is essentially the only option in Korean because of its transparent orthography.

In general, how does changing the writing system of a text change your impression of the text?

Social Science

Why do people switch languages in online communication? Is it for psycholinguistic reasons (for instance, the author is more familiar with one language or another), sociolinguistic reasons (the author wants to identify with a certain group of speakers), or other reasons entirely? 

Nguyen et al (2015) examined how and why people change their language use within Twitter conversations based on audience size; for larger audiences, more common languages are preferred, while for personal communication, less common but more familiar languages are preferred. Does this fit with how you use different languages, different dialects, or even just different slang?

Emoji are largely intended as human representations, but there’s a long and complex history of how different races and cultures actually get represented by emoji. I found this paper by Kate Miltner to be a helpful overview of the history of the human side of emoji (although at least some of Miltner’s points have been addressed by more recent Unicode updates, despite the paper only being two years old!).

Engineering

One big engineering problem with emoji is understanding what people mean by them. Because different systems render emoji differently, there may be mismatches between the intended use and the interpreted meaning. Here’s a study of how much people varied in their interpretations of emoji, both in terms of inter-person and inter-platform variance, showing that there are some serious ambiguities that can arise from them. In a similar vein, we might try to use people’s emoji usage to get a better sense of their emotional meanings.

Ling 354: Week 1

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 1 meeting (Jan 27), we’ll be looking at the world’s writing systems, and how they can be encoded into a format for computers to use. 

Textbook reading

The textbook reading for Week 1 is Ch. 1.1-1.3. This covers the basics of the Unicode system and a range of writing systems.

Additional articles

I’d also like you to read at least Sections 1 & 2 of Daniels 2003. You don’t have to read the rest unless you find the subject interesting. This digs deeper into the history of writing systems and some of the difficulties in getting sounds down on the page. I know Section 2 is a little complex, so do your best; it’ll make more sense in Week 4 as we dig into speech recognition and see how consonants and vowels differ.

Finally, I’d like to discuss these two blogposts where character encodings lead to unexpected problems. First, Xudong Zheng’s experiment with a “homograph attack”, using identical or nearly-identical characters from different languages to create URLs that look right to the naked eye but actually link to a separate (potentially dangerous) website. Second, the Telugu character that will break your iPhone! (Well, not anymore, they fixed it.)

Getting familiar with Unicode (optional)

Unicode is a pretty complex system, and it’s really hard to wrap your head around all the characters that it encodes. I found the website https://decodeunicode.org to be pretty helpful in this regard; you can search through different Unicode “blocks”, representing different languages’ writing systems or other useful characters (including emoji). The video on their home page that scrolls through every Unicode character is really interesting to scan through and surprisingly hypnotic.

Project ideas and extensions (optional)

Arts/Humanities

There are a lot of artistic directions to take writing systems, but one that I have to confess a childish fondness for is “zalgo text”, which uses Unicode’s combining characters to stack up unnerving levels of diacritics onto text for a creepy aesthetic. There’s at least one online generator to convert your boring English text into glitchy goop.
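If you want to see the trick behind zalgo text, here is a small Python sketch. It appends a few Unicode combining marks after every letter, and a second function undoes the damage by dropping every combining character. The choice of marks and the deterministic cycling are my own simplifications; the online generators presumably pick marks at random.

```python
import unicodedata

# A handful of combining marks: grave, acute, circumflex, tilde, diaeresis,
# dot below, and tilde below.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0303", "\u0308", "\u0323", "\u0330"]

def zalgo(text, depth=3):
    """Append `depth` combining marks after every letter in `text`."""
    out = []
    for index, ch in enumerate(text):
        out.append(ch)
        if ch.isalpha():
            # Cycle through MARKS so neighboring letters get different stacks.
            out.extend(MARKS[(index + k) % len(MARKS)] for k in range(depth))
    return "".join(out)

def strip_marks(text):
    """Undo zalgo by removing every character with a nonzero combining class."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

glitchy = zalgo("glitchy goop")
print(glitchy)
print(strip_marks(glitchy))
```

The letters themselves are untouched; the creepiness comes entirely from extra code points that the renderer stacks on top of (and below) each base character.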

Social Science

People can be surprisingly clever when they have to adapt a language for different electronic devices. When keypad-based mobile phones were dominant, sending text messages in Arabic was tricky, since the keypads were optimized for numbers and the Latin alphabet. “Arabizi”, a rendering of Arabic into English letters and numbers, developed as a way of easily transcribing Arabic writing into something phones could handle. 

Abu-Liel et al (2021) examine how accurately and quickly students read standard (vowelless) Arabic writing, Arabizi, and explicitly vowelled Arabic writing, and find that Arabizi falls in between the two. I’d be curious to build on this by looking at how people process text that has had its diacritics removed because of keyboard limitations (e.g., writing Spanish on a keyboard that makes it hard to include accents and tildes). I’m sure such research has been done, but I can’t find any right now.

Engineering

Thinking about the homograph attacks, could you build a system to detect and warn people about such attacks? Or, as an evil hacker, could you automate the generation of such attacks by generating lists of similar letterforms and building a system to swap them out? More generally, how do you build robust systems for writing systems you don’t personally know?
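As a starting point for the detection idea, here is a rough Python heuristic. It guesses each letter’s script from the first word of its official Unicode character name and flags any part of a domain that mixes scripts. This is a simplification I am assuming for illustration (Unicode has a proper Script property that the standard library does not expose), and the function names are hypothetical.

```python
import unicodedata

def scripts_used(text):
    """Guess each letter's script from the first word of its Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_suspicious(domain):
    """Flag a domain if any dot-separated label mixes letters from several scripts."""
    return any(len(scripts_used(label)) > 1 for label in domain.split("."))

print(looks_suspicious("apple.com"))        # all Latin letters
print(looks_suspicious("\u0430pple.com"))   # first letter is Cyrillic U+0430
```

This check has an obvious hole: a label written entirely in lookalike letters from a single non-Latin script mixes nothing, so it passes. Catching that case needs something more, such as a table of confusable letterforms.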

Ling 502: Language, Mind & Society

Class Overview

Language does not exist in a vacuum; every time we use language, it’s shaped by communicative, cognitive, and learning pressures. In this class, we combine concepts from theoretical linguistics with the real-world setting of its use, providing an overview of language acquisition, psycholinguistics (language in the mind), and sociolinguistics (language in social settings).

The core idea of this course unfolds from the ways that our minds shape language individually, the core of psycholinguistics. Once we understand the influence of individual minds on language, we can move outward to examine how interactions between people, along with aspects such as communicative goals and social identities, shape language: sociolinguistics. Psycho- and sociolinguistic influences, amassed over generations, lead to language change and standardization. And all of this structure is constrained by the fact that it must be learnable, generation by generation, through language acquisition.

The goal of this class is not only to discuss key concepts within these areas of linguistics, but to build bridges between them. Ideally, this will provide a chance for you to explore applications for your linguistic knowledge, and to spur new avenues for linguistic research. We also look at ways to bring social media and other emerging linguistic data sources into linguistic research to gain new insights into the interactions of languages, minds, and society.

Textbook

We use Julie Sedivy’s textbook Language in Mind as the building block for the first part of the class, as it provides a helpful overview of the state of research in language acquisition, psycholinguistics, and the basics of sociolinguistics. Throughout the class, we dive deeper into specific research papers on these topics, and they take on a more prominent role as the class progresses into sociolinguistics.

The course syllabus is available here.

Language Acquisition

By the time you’re an adult, it’s really easy to forget that you needed to learn language in the first place. Years of relatively effortless language use can make it seem like a trivial task. But the second you step inside a language classroom, that misconception evaporates. So how do kids do it, and why do they seem so much better at it than us adults?

In this portion of the class (Weeks 1-4), we examine what, if any, linguistic structure children are born with, and what they build through exposure to other people’s language use. We discuss the innatist vs. emergentist perspectives, Universal Grammar and linguistic relativism (e.g., the Whorf hypothesis), and probabilistic rational models of acquisition.

We cover chapters 4 and 5 of the textbook, along with the following papers: Maye et al 2002 (data visualization), Yurovsky et al 2017, Gentner & Goldin-Meadow 2003, and Braginsky et al 2016.

Additional readings that may be useful for this topic are in this Google Drive.

I’ve also made some introductory notes on reading linguistics research papers. The first is a video where you can read along with me on the Maye et al paper. The second is a set of notes on the Yurovsky et al paper.

Psycholinguistics

Psycholinguistics is all about the representation of language in the mind. Have you ever wondered why some sentences are harder to understand than others? Have you ever had a word stuck on the tip of your tongue? Ever said something that was perfectly clear to you but incomprehensible to everyone else?

Much of this comes from the fact that language has to filter not only through the brain of the producer but of the audience as well. Understanding the ways that language is structured in the mind can help us understand why some linguistic tasks are easy and others are hard.

We introduce probabilistic frameworks for understanding cognitive pressures on language, including Bayesian analysis and the Rational Speech Act model.
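To give a flavor of the Rational Speech Act model, here is a bare-bones Python sketch of its simplest reference-game form. This is my own stripped-down illustration, not code from the readings or my notes: the three-object scene is invented, all utterances cost the same, the prior over objects is uniform, and the speaker’s rationality parameter `alpha` defaults to 1.

```python
OBJECTS = ["blue square", "blue circle", "green square"]
UTTERANCES = ["blue", "green", "square", "circle"]

def normalize(scores):
    """Rescale nonnegative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}

def literal_listener(utterance):
    """L0: spread belief evenly over the objects the word truthfully describes."""
    return normalize({o: 1.0 if utterance in o.split() else 0.0 for o in OBJECTS})

def speaker(obj, alpha=1.0):
    """S1: prefer words that would lead a literal listener to the intended object."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})

def pragmatic_listener(utterance):
    """L1: Bayes' Rule over the speaker model, with a uniform prior on objects."""
    return normalize({o: speaker(o)[utterance] for o in OBJECTS})

for obj, prob in pragmatic_listener("blue").items():
    print(obj, round(prob, 2))
```

A literal listener hearing "blue" is split evenly between the two blue objects. The pragmatic listener leans toward the blue square, reasoning that a speaker who meant the blue circle had a better word available ("circle").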

This covers Chapters 8 through 10 of the textbook, as well as the following papers: Ferreira & Patson 2007, Doyle & Frank 2015, Goodman & Frank 2016, Yoon, Tessler, et al 2016, and Keysar et al 2012.

Additional readings on psycholinguistics can be found here.

I’ve made some notes on the RSA model to accompany Goodman & Frank 2016, as well as some notes on probability in psycholinguistics more generally.

Sociolinguistics

As we think about not just our own minds’ influences on language, but also other people’s, we are inevitably driven toward sociolinguistics, the study of how language is shaped by its use and users. We mainly consider cognition-focused aspects of sociolinguistics in this class, examining speaker and audience design, communicative goals, and assertions of identity. We also look at how small social & cognitive pressures can build up over time into language change on the historical level.

This covers Chapter 11 of the textbook, as well as the following papers: von der Malsburg et al 2020, Clark & Schaefer 1987, Guydish & Fox Tree 2021, Nevalainen & Raumolin-Brunberg 2003, Wagner 2012, Coupland 1998, Labov 1963, Eckert 2012, Eckert 2011, Lewis et al 2014, Mahowald et al 2012.

Additional readings on sociolinguistics can be found here.

Ling 521: Phonology

Despite the single-word name, Ling 521 covers basic phonetics and phonology. Phonetics is the study of linguistic sounds, both how they are produced (articulatory phonetics) and processed (acoustic & auditory phonetics). Phonology is the study of how sounds get used and organized within languages. In short, this class covers the key points of how spoken language is produced and perceived.

We use Elizabeth Zsiga’s book The Sounds of Language: An Introduction to Phonetics and Phonology for most of this class. I’ve included my own notes and worksheets on these various topics below.

Articulatory Phonetics

We start by covering the basics of articulation. How do we produce the various sounds of a language? What makes an “s” sound different from a “z” sound? (Hint: touch your throat as you make each of them. You should feel vibrations from one but not the other.)

How do languages differ in the set of sounds they contain? How do we explain and classify the differences between sounds? How do you make the Arabic sound that gets written as “Q” in English? Why aren’t the vowels in English lay and Spanish leche quite the same?

We cover the International Phonetic Alphabet, articulatory phonetic features, and airstream mechanisms. We also spend a lot of time making sounds to each other, feeling out how unfamiliar sounds are produced.

Acoustic Phonetics

These different articulations only matter because they change the sound of the airflow being produced; it’s primarily the sounds rather than the articulations that we use to tell what someone’s saying.

Acoustic phonetics covers the basics of sound waves, and how they are broken down into their components by our auditory system. We examine how articulatory differences induce acoustic differences, looking at waveforms, spectral slices, and spectrograms. We focus on such acoustic features as the fundamental frequency and the F1 and F2 formants.
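As a toy illustration of breaking a sound wave into its components, here is a Python sketch that builds a wave out of three sine waves and then recovers their frequencies with a naive discrete Fourier transform (in effect, one spectral slice). The sample rate, the 120 Hz fundamental, and the amplitudes are all values I chose so the arithmetic comes out cleanly; real speech is far messier.

```python
import cmath
import math

SAMPLE_RATE = 2000  # samples per second
N_SAMPLES = 200     # 0.1 seconds of signal, so frequency bins are 10 Hz apart

def synthesize(components):
    """Sum sine waves given as (frequency in Hz, amplitude) pairs."""
    return [
        sum(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for freq, amp in components)
        for n in range(N_SAMPLES)
    ]

def spectrum(signal):
    """Naive DFT: map each frequency bin (in Hz, below Nyquist) to its magnitude."""
    size = len(signal)
    magnitudes = {}
    for k in range(size // 2):
        total = sum(x * cmath.exp(-2j * math.pi * k * n / size) for n, x in enumerate(signal))
        magnitudes[k * SAMPLE_RATE / size] = abs(total)
    return magnitudes

def strongest_frequency(signal):
    """Return the frequency bin with the largest magnitude."""
    mags = spectrum(signal)
    return max(mags, key=mags.get)

# A 120 Hz fundamental plus two weaker harmonics.
wave = synthesize([(120, 1.0), (240, 0.5), (360, 0.25)])
print(strongest_frequency(wave))
```

Here the strongest component is the fundamental only because I built the wave that way. In real recordings the fundamental is not always the strongest peak, so practical pitch trackers use more robust methods than picking the tallest bin.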


Rule-Based Phonology

Transitioning to phonology, we now examine how languages put together their sounds. We briefly look at phonotactics, the relative acceptability of different sound sequences in a specific language. We focus on formalizing the relationship between the mental lexicon’s underlying forms of a word or intonational phrase, and the way that these actually surface in production.

We cover Sound Pattern of English-style notation for rules, including feature-value pairs, feature bundles, and alpha notation. We discuss common and uncommon phonological processes cross-linguistically, and transition from articulatory phonetic features to abstract phonological features. We handle phonological analyses from a wide variety of languages, including cases of rule ordering, feeding, and bleeding.
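To show why rule ordering matters, here is a small Python sketch using a simplified version of the familiar textbook analysis of the English plural. It assumes an underlying /z/ suffix and two rules written as regular expressions over rough transcriptions. The segment classes are deliberately incomplete (no affricates, for instance), and the function names are mine.

```python
import re

SIBILANTS = "szʃʒ"
VOICELESS = "ptkfθsʃ"

def epenthesis(form):
    """Insert [ɪ] between a sibilant and a word-final /z/."""
    return re.sub(f"([{SIBILANTS}])z$", r"\1ɪz", form)

def devoicing(form):
    """Devoice a word-final /z/ right after a voiceless consonant."""
    return re.sub(f"([{VOICELESS}])z$", r"\1s", form)

def derive(underlying, rules):
    """Apply each rule once, in the order given, to get the surface form."""
    form = underlying
    for rule in rules:
        form = rule(form)
    return form

for word in ["kætz", "dɔgz", "bʌsz"]:  # underlying cat+z, dog+z, bus+z
    print(word, derive(word, [epenthesis, devoicing]), derive(word, [devoicing, epenthesis]))
```

With epenthesis first, "bus" + /z/ surfaces with [ɪz], because the inserted vowel separates the /z/ from the voiceless /s/ and so removes the environment for devoicing: epenthesis bleeds devoicing. In the opposite order, devoicing applies first and the wrong form comes out.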

Constraint-Based Phonology

We wrap up with a brief overview of Optimality Theory, a prominent constraint-based approach to phonology. Whereas rule-based phonology treats the phonological processes like an assembly line, with a sequence of precisely defined changes applied unwaveringly, constraint-based phonology considers multiple possible surface forms, and chooses the one that best satisfies the violable constraints of the language. (This is my favorite part of the class.)

We cover the four key components of Optimality Theory (Gen, Con, H, and Eval) and the main constraints (Max, Dep, Ident, Agree, NoCoda, etc.). We look at how constraint rankings are determined for a language through ranking arguments, and apply these to a range of complex phenomena.
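Here is a toy Python sketch of how Eval picks a winner under different rankings. Everything in it is simplified for illustration: the candidate list is written by hand rather than produced by Gen, syllable breaks are marked with periods, and Max and Dep are approximated by comparing lengths instead of tracking which segments correspond.

```python
VOWELS = set("aeiou")

def no_coda(inp, candidate):
    """One violation per syllable that ends in a consonant."""
    return sum(1 for syllable in candidate.split(".") if syllable[-1] not in VOWELS)

def max_io(inp, candidate):
    """Violations for deleted input segments (length-based shortcut)."""
    return max(0, len(inp) - len(candidate.replace(".", "")))

def dep_io(inp, candidate):
    """Violations for inserted output segments (length-based shortcut)."""
    return max(0, len(candidate.replace(".", "")) - len(inp))

def evaluate(inp, candidates, ranking):
    """Eval: compare violation profiles constraint by constraint, highest-ranked first."""
    # Python compares lists left to right, which mirrors strict domination.
    return min(candidates, key=lambda c: [constraint(inp, c) for constraint in ranking])

candidates = ["pat", "pa", "pa.ta"]  # faithful, deletion, and epenthesis candidates
print(evaluate("pat", candidates, [no_coda, dep_io, max_io]))
print(evaluate("pat", candidates, [no_coda, max_io, dep_io]))
print(evaluate("pat", candidates, [max_io, dep_io, no_coda]))
```

The same three candidates and the same three constraints yield three different winners, depending only on the ranking. That is the core of how Optimality Theory models differences between languages.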
