Ling 354: Week 1 | Gabriel Doyle

[This is extracted from the Spring 2022 version on Canvas, so some links/formatting may be broken.]

In our Week 1 meeting (Jan 27), we’ll be looking at the world’s writing systems, and how they can be encoded into a format for computers to use.

Textbook reading

The textbook reading for Week 1 is Ch. 1.1-1.3. This covers the basics of the Unicode system and a range of writing systems.

Additional articles

I’d also like you to read at least Sections 1 & 2 of Daniels 2003. You don’t have to read the rest unless you find the subject interesting. The article digs deeper into the history of writing systems and some of the difficulties in getting sounds down on the page. I know Section 2 is a little complex, so do your best; it’ll make more sense in Week 3 as we dig into speech recognition and see how consonants and vowels differ.

Finally, I’d like to discuss two blog posts where character encodings lead to unexpected problems. The first is Xudong Zheng’s experiment with a “homograph attack”, which uses identical or nearly identical characters from different scripts to create URLs that look right to the naked eye but actually link to a separate (potentially dangerous) website. The second covers the Telugu character that will break your iPhone! (Well, not anymore; they fixed it.)
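If you want to see why homographs fool the eye but not the computer, here is a minimal Python sketch. The two strings and the helper name `code_points` are my own illustration (they are not taken from Zheng's post); the only library used is Python's built-in `unicodedata` module.

```python
import unicodedata

# Two strings that render almost identically in many fonts.
latin = "apple"        # every letter is from the Latin script
spoof = "\u0430pple"   # first letter is Cyrillic U+0430, not Latin U+0061


def code_points(text):
    """Return (code point, official Unicode name) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The strings look alike on screen but are different sequences of characters.
print(latin == spoof)
for cp, name in code_points(spoof):
    print(cp, name)
```

To the computer these are simply different strings, which is why a look-alike domain can be registered separately from the real one.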

Getting familiar with Unicode (optional)

Unicode is a pretty complex system, and it’s really hard to wrap your head around all the characters that it encodes. I found the website https://decodeunicode.org to be pretty helpful in this regard; you can search through different Unicode “blocks”, representing different languages’ writing systems or other useful characters (including emoji). The video on their home page that scrolls through every Unicode character is really interesting to scan through and surprisingly hypnotic.
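If you'd rather explore from code than from a website, here is a small sketch using Python's built-in `unicodedata` module. The function name `describe` and the sample characters are my own choices. Note that the standard library does not expose block names, so this shows each character's code point, official name, general category, and UTF-8 bytes instead.

```python
import unicodedata


def describe(ch):
    """Summarize how a single character is represented in Unicode."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),  # e.g. Ll = lowercase letter
        "utf8_bytes": " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
    }


# A Latin letter, an accented letter, a Telugu letter, and an emoji.
for ch in ["a", "\u00e9", "\u0c15", "\U0001F600"]:
    print(ch, describe(ch))
```

Characters further from ASCII take more bytes in UTF-8: one for "a", up to four for the emoji.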

Project ideas and extensions (optional)

Arts/Humanities

There are a lot of artistic directions to take writing systems, but one that I have to confess a childish fondness for is “zalgo text”, which uses Unicode’s combining characters to stack up unnerving levels of diacritics onto text for a creepy aesthetic. There’s at least one online generator to convert your boring English text into glitchy goop.
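As a rough idea of how such a generator might work, here is a sketch in Python. This is my own guess at the technique, not the code behind any particular online generator; the names `zalgo` and `unzalgo` and the `marks_per_char` parameter are made up for illustration. It draws marks only from the Combining Diacritical Marks range (U+0300 to U+036F).

```python
import random
import unicodedata

# Code points U+0300 to U+036F: the Combining Diacritical Marks range.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]


def zalgo(text, marks_per_char=5, seed=None):
    """Append random combining marks after each letter or digit."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch.isalnum():
            out.extend(rng.choice(COMBINING_MARKS) for _ in range(marks_per_char))
    return "".join(out)


def unzalgo(text):
    """Drop all nonspacing marks (category Mn) to recover the base text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


glitchy = zalgo("hello", marks_per_char=8, seed=354)
print(glitchy)
print(unzalgo(glitchy))
```

The base letters are untouched; the "goop" is just extra characters that the renderer stacks on top of them, which is why it can be stripped back out.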

Social Science

People can be surprisingly clever when they have to adapt a language for different electronic devices. When keypad-based mobile phones were dominant, sending text messages in Arabic was tricky, since the keypads were optimized for numbers and the Latin alphabet. “Arabizi”, a rendering of Arabic into Latin letters and numerals, developed as a way of easily transcribing Arabic writing into something phones could handle.

Abu-Liel et al. (2021) examine how accurately and quickly students read standard (vowelless) Arabic writing, Arabizi, and explicitly vowelled Arabic writing, and find that Arabizi falls in between the two. I’d be curious to build on this by looking at how people process text that has had its diacritics removed because of keyboard limitations (e.g., writing Spanish on a keyboard that makes it hard to include accents and tildes). I’m sure such research has been done, but I can’t find any right now.
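If you wanted to create stimuli for a study like that, one way to simulate keyboard-limited text is to strip the diacritics automatically. Here is a minimal sketch using Unicode normalization from Python's built-in `unicodedata` module; the function name `strip_diacritics` and the Spanish examples are my own, and this approach only handles accents that decompose into a base letter plus a combining mark.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining accents, keeping the base letters."""
    # NFD splits e.g. "ñ" into "n" followed by a combining tilde (U+0303).
    decomposed = unicodedata.normalize("NFD", text)
    # Category Mn (nonspacing mark) covers the combining accents.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("El niño está aquí"))
print(strip_diacritics("año"))
```

The second example shows why this matters for readers: without the tilde, "año" (year) becomes a different Spanish word.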

Engineering

Thinking about the homograph attacks, could you build a system to detect and warn people about such attacks? Or, as an evil hacker, could you automate the generation of such attacks by generating lists of similar letterforms and building a system to swap them out? More generally, how do you build robust systems for writing systems you don’t personally know?
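As a starting point for the detection side, here is a deliberately crude sketch. Python's standard library does not expose the Unicode Script property, so this guesses each letter's script from the first word of its official character name (e.g. "LATIN", "CYRILLIC"). That shortcut, and the function names `scripts_in` and `looks_suspicious`, are my own assumptions rather than an established method.

```python
import unicodedata


def scripts_in(label):
    """Guess the set of scripts used by the letters in a label.

    Heuristic: take the first word of each letter's Unicode name.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_suspicious(label):
    """Flag labels that mix letters from more than one script."""
    return len(scripts_in(label)) > 1


for label in ["apple", "\u0430pple"]:  # second one starts with Cyrillic U+0430
    print(repr(label), sorted(scripts_in(label)), looks_suspicious(label))
```

This only catches mixed-script labels. A label written entirely in one look-alike script would pass the check, so a real system would also need a table of confusable characters.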