Categories
Uncategorized

Ling 354: Language & Computers

So much of our modern lives happens through computers, phones, and other electronic devices. The same is true of our language. How do computers shape our language, and how do we adapt computers to our language use? This class considers a range of topics connecting language and computers, including speech recognition systems (Alexa, Siri, Google Assistant), emoji use, and sentiment analysis. It also covers basic linguistic and algorithmic concepts to help understand the strengths and failures of our contemporary language systems.

Students in this class tend to have a pretty wide range of majors, interests, and experiences. Some know much more about language, others much more about computers, and some know both or neither. My goal in the class is to fill in the gaps in students’ linguistic or computational knowledge, then build out into the applications they find interesting. For instance, one Ling 354 student ended up submitting an emoji proposal to the Unicode Consortium after this class!

Here are some of the topics we cover, along with links to the reading lists (extracted from Canvas, so apologies for their imperfections). I’ll be adding to these slowly, because extracting them from Canvas is exhausting!

Writing Systems, Unicode, and Emoji

Computers think in ones and zeros, or more accurately, in “ons” and “offs”. Human writing systems are much more complex; even the smallest, Rotokas, contains 12 letters. Furthermore, many writing systems lack “letters” in the sense of the English alphabet, with characters that indicate whole syllables or even words.

In this topic, we’ll look at the diversity of linguistic writing systems (alphabets, abjads, syllabaries, etc.), and how these get represented on a computer through Unicode, with each character getting its own number.
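To make "each character getting its own number" concrete, here is a minimal Python sketch. The helper name `describe` and the sample string are my own illustrative choices, not something from the textbook. It lists each character's Unicode code point, its official Unicode name, and the bytes that store it under UTF-8 (one common encoding, assumed here).

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name, UTF-8 bytes) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"             # the character's Unicode number, in hex
        name = unicodedata.name(ch, "<unnamed>")    # official Unicode character name
        rows.append((ch, code_point, name, ch.encode("utf-8")))
    return rows

# One letter each from Latin, Arabic, and Hangul, plus an emoji.
for row in describe("aب한😀"):
    print(row)
```

Notice that the number of bytes grows as the code points get larger: the Latin letter fits in one byte, while the emoji needs four.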

We’ll also examine emoji and start asking: what’s their relationship to language? Are they linguistic elements, like words? Are they paralinguistic, like gestures or intonation? Or are they something else entirely?

Readings:

Project: Propose an emoji

Speech recognition

Speech recognition is probably one of the most common ways that you encounter “natural language processing”: getting a computer to understand human language. How do Alexa, Siri, Google, and other voice-activated assistants work? What causes difficulties for them, and how can we overcome them?

For that matter, how does human speech recognition work? Why do they screw up your name at cafes? Why do so many people think my name is “Dave”? We’ll try to get to the bottom of these mysteries and more.

In this topic, we’ll look at the range of linguistic sounds, how languages structure their sound inventories, how humans hear and understand spoken language, and how computers can (try to) do the same. We’ll also examine some of the shortcomings of speech recognition, especially involving less-studied languages and dialects. What biases do these systems have, and what can we do about them?

Readings:

Spell-check, autocorrect, & grammar checking

Once we have the basics of speech recognition down, we can turn to how computers understand larger linguistic structures, like words and sentences. What “language models” does a computer have, and how are these used?

We’ll look through the lens of autocorrect and grammar checking, to understand how computers deal with input that doesn’t fit their expectations. When I type “langauge”, did I mean to type “language”, or did I mean to type this weird non-word? That’s pretty easy to tell, but what if I typed “causal”? What are the odds I meant to type “casual”? Should the system ask me? Should it autocorrect? What information can the system use to improve its guesses?

We’ll examine language models from the most basic (word frequency) to more clever ones. We’ll see how they can be used in speech recognition, autocorrect, and even autocompletion.
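As a rough illustration of the most basic model, here is a sketch of a word-frequency autocorrect in Python. Everything in it is a toy assumption: the tiny `CORPUS`, the function names `candidates` and `suggest`, and the decision to consider only swaps of adjacent letters. Real systems use far larger corpora and richer edit types.

```python
from collections import Counter

# Toy corpus: a hypothetical stand-in for the large text collection a real system would use.
CORPUS = ("language and computers shape language use in casual speech "
          "and casual text while causal claims are rare").split()
COUNTS = Counter(CORPUS)

def candidates(word):
    """The word itself plus every string made by swapping two adjacent letters."""
    swaps = {word[:i] + word[i + 1] + word[i] + word[i + 2:]
             for i in range(len(word) - 1)}
    return swaps | {word}

def suggest(word):
    """Return the candidate seen most often in the corpus, or the input if none was seen."""
    known = [c for c in candidates(word) if COUNTS[c] > 0]
    if not known:
        return word
    return max(known, key=lambda c: COUNTS[c])

print(suggest("langauge"))  # the only known candidate is "language"
print(suggest("causal"))    # "casual" is more frequent in this corpus, so it wins
```

The second call shows the weakness of frequency alone: "causal" is a real word, but the model swaps it for the more common "casual" anyway. That is the kind of mistake that motivates the more clever models, which also look at the surrounding words.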

Readings:

And more to come…

Ling 354: Week 2

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 2 meeting (Feb 3), we’ll first wrap up writing systems and Unicode. Then we’ll be looking into the implications of getting languages onto computers, including ways that computerized language may support or impede certain languages’ use and ways that the Unicode system encourages novel uses of “language” via emoji.

Textbook reading

No new textbook reading, but we’ll keep using concepts from Ch. 1.1-1.3

Additional articles/videos

If you didn’t read the Daniels 2003 chapter last week, please skim through Sections 1, 2.1, 2.4, & 2.5. It’s imperfect reading for our class, unfortunately; there’s a lot of linguistic jargon that Daniels doesn’t define and some details he elides, but as Daniels himself notes, no one else has written much on this topic. So that’s why I’d like you to just skim it — take in some of the history and examples of different writing systems, but don’t panic if parts are a little hard to follow.

I also found this video from Tom Scott on Canadian Aboriginal syllabics to be a helpful (and largely linguistically accurate) discussion of why different languages might want different writing systems.

One big question I want us to discuss next class is whether computers and the Internet are helping us preserve endangered languages or are encouraging the endangerment of languages. Here are two articles that focus on the positive and negative aspects, respectively. The first is a short interview on European minority languages with a Finnish professor, and the second is a special report from the Guardian.

Lastly, I want us to examine a new form of “language” that can only really exist on computers and other electronic devices: emoji. I’ve found a review article on emoji usage (HTML) and its impact in various fields. If you’re interested in emoji, take a look at the process for proposing a new emoji, which will be one of your options for the first assignment.

Project ideas and extensions (optional)

Arts/Humanities

I’m personally fascinated by the idea of swapping out writing systems across languages, and seeing what works and doesn’t. Try taking your name, or any words that you’re fond of, and transliterating them into different writing systems. For instance, Gabe turns into 겹 in Korean, గేబ్ in Telugu, and がいぶ in Japanese (where it’s pronounced “Gabe-u”). How do you have to adapt the word to the writing system — does the pronunciation change? Is there much choice in the transliteration? For instance, I could write my name as “Gaib” or “Gaybe” and still pronounce it the same, but 겹 is essentially the only option in Korean because of its transparent orthography.

In general, how does changing the writing system of a text change your impression of the text?

Social Science

Why do people switch languages in online communication? Is it for psycholinguistic reasons (for instance, the author is more familiar with one language or another), sociolinguistic reasons (the author wants to identify with a certain group of speakers), or other reasons entirely? 

Nguyen et al (2015) examined how and why people change their language use within Twitter conversations based on audience size; for larger audiences, more common languages are preferred, while for personal communication, less common but more familiar languages are preferred. Does this fit with how you use different languages, different dialects, or even just different slang?

Emoji are largely intended as human representations, but there’s a long and complex history of how different races and cultures actually get represented by emoji. I found this paper by Kate Miltner to be a helpful overview of the history of the human side of emoji (although at least some of Miltner’s points have been addressed by more recent Unicode updates, despite the paper only being two years old!).

Engineering

One big engineering problem with emoji is understanding what people mean by them. Because different systems render emoji differently, there may be mismatches between the intended use and the interpreted meaning. Here’s a study of how much people varied in their interpretations of emoji, both in terms of inter-person and inter-platform variance, showing that there are some serious ambiguities that can arise from them. In a similar vein, we might try to use people’s emoji usage to get a better sense of their emotional meanings.

Ling 354: Week 1

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 1 meeting (Jan 27), we’ll be looking at the world’s writing systems, and how they can be encoded into a format for computers to use. 

Textbook reading

The textbook reading for Week 1 is Ch. 1.1-1.3. This covers the basics of the Unicode system and a range of writing systems.

Additional articles

I’d also like you to read at least Sections 1 & 2 of Daniels 2003. You don’t have to read the rest unless you find the subject interesting. This digs deeper into the history of writing systems and some of the difficulties in getting sounds down on the page. I know Section 2 is a little complex, so do your best — it’ll make more sense in Week 3 as we dig into speech recognition and see how consonants and vowels differ.

Finally, I’d like to discuss these two blogposts where character encodings lead to unexpected problems. First, Xudong Zheng’s experiment with a “homograph attack”, using identical or nearly-identical characters from different languages to create URLs that look right to the naked eye but actually link to a separate (potentially dangerous) website. Second, the Telugu character that will break your iPhone! (Well, not anymore, they fixed it.)

Getting familiar with Unicode (optional)

Unicode is a pretty complex system, and it’s really hard to wrap your head around all the characters that it encodes. I found the website https://decodeunicode.org to be pretty helpful in this regard; you can search through different Unicode “blocks”, representing different languages’ writing systems or other useful characters (including emoji). The video on their home page that scrolls through every Unicode character is really interesting to scan through and surprisingly hypnotic.

Project ideas and extensions (optional)

Arts/Humanities

There are a lot of artistic directions to take writing systems, but one that I have to confess a childish fondness for is “zalgo text”, which uses Unicode’s combining characters to stack up unnerving levels of diacritics onto text for a creepy aesthetic. There’s at least one online generator to convert your boring English text into glitchy goop.
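If you’d like to see how the stacking works, here is a minimal Python sketch of a zalgo generator. It is my own guess at the general approach, not the code behind any particular online generator, and the function name `zalgo` and its parameters are made up for illustration. It draws marks from Unicode’s Combining Diacritical Marks block (U+0300 to U+036F) and appends a few after each letter.

```python
import random

# The "Combining Diacritical Marks" block runs from U+0300 to U+036F.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]

def zalgo(text, marks_per_char=5, seed=None):
    """Append `marks_per_char` random combining marks after every letter in `text`."""
    rng = random.Random(seed)   # pass a seed to get the same goop every time
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalpha():        # leave spaces and punctuation undecorated
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(marks_per_char))
    return "".join(out)

print(zalgo("boring English text", marks_per_char=8, seed=354))
```

The underlying letters are still there; the combining marks are separate characters that the renderer piles on top of whatever comes before them.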

Social Science

People can be surprisingly clever when they have to adapt a language for different electronic devices. When keypad-based mobile phones were dominant, sending text messages in Arabic was tricky, since the keypads were optimized for numbers and the Latin alphabet. “Arabizi”, a rendering of Arabic into English letters and numbers, developed as a way of easily transcribing Arabic writing into something phones could handle. 

Abu-Liel et al (2021) examine how accurately and quickly students read standard (vowelless) Arabic writing, Arabizi, and explicitly vowelled Arabic writing, and find that Arabizi falls in between the two. I’d be curious to build on this by looking at how people process text that has had its diacritics removed because of keyboard limitations (e.g., writing Spanish on a keyboard that makes it hard to include accents and tildes). I’m sure such research has been done, but I can’t find any right now.
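If you wanted to prepare stimuli for a study like that, one way to simulate diacritic-free typing is sketched below in Python. This is only an assumption about how such materials might be built, and `strip_diacritics` is a name I made up. It uses Unicode normalization to split each accented letter into a base letter plus combining marks, then drops the marks.

```python
import unicodedata

def strip_diacritics(text):
    """Remove accents and tildes by decomposing characters and dropping the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)   # "ñ" becomes "n" + U+0303
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")   # Mn = nonspacing mark

print(strip_diacritics("¿Cuántos años tiene la canción?"))
```

Note that this loses real information: in Spanish, “año” and “ano” are different words, which is part of what would make the reading experiment interesting.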

Engineering

Thinking about the homograph attacks, could you build a system to detect and warn people about such attacks? Or, as an evil hacker, could you automate the generation of such attacks by generating lists of similar letterforms and building a system to swap them out? More generally, how do you build robust systems for writing systems you don’t personally know?
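As a starting point for the detection side, here is a minimal Python sketch that flags text mixing letters from more than one script. The approach is a rough heuristic of my own, and the names `scripts_in` and `looks_mixed` are hypothetical. It guesses each letter’s script from the first word of its Unicode character name, since Python’s standard library has no direct script lookup.

```python
import unicodedata

def scripts_in(label):
    """Rough script guess per letter: the first word of its Unicode name (LATIN, CYRILLIC, ...)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}

def looks_mixed(label):
    """True if the letters in `label` appear to come from more than one script."""
    return len(scripts_in(label)) > 1

spoofed = "\u0430pple"   # starts with CYRILLIC SMALL LETTER A, not the Latin "a"
print(spoofed, scripts_in(spoofed), looks_mixed(spoofed))
print("apple", scripts_in("apple"), looks_mixed("apple"))
```

This check has an obvious hole: a name written entirely in one non-Latin script, using only letters that happen to look like Latin ones, would pass. Closing that hole takes a list of confusable letterforms, which is also exactly what the evil-hacker version of the project would need.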