Understanding language choices in software

I'm working on a passion project, tentatively called "Barriers of Babel: How Languages and Accessibility Intersect."

Today I had a minor breakthrough. I had spent a few days trying to understand which human languages Windows and Android support. I thought it was a matter of matching the language name, and sometimes the region and script, to the IANA language subtags.

Then I looked at the NVDA screen reader. It's more transparent than other software, telling us the IANA language subtags of its UI languages. Yet even here it's still hard to know "which languages it supports," because several of these are in fact macrolanguages which contain more than one language.

Screenshot of NVDA Language settings. Tooltip: Choose the language NVDA's messages and user interface should be presented in.

When the IANA language registry says two things are distinct languages, it's based on a measure of intelligibility between them. So for example, what do the macrolanguages "Chinese" and "Albanian" mean in NVDA's list of UI languages?

"Chinese (Simplified, China), zh-CN" — It's clear enough that "simplified" describes the script of the writing system. However, the IANA registry lists 15 spoken languages in the zh macrolanguage, most of which appear to be languages of mainland China.
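To make the structure of a tag like zh-CN concrete, here is a minimal Python sketch that splits a tag into language, script, and region subtags. The function name split_tag and the simplified pattern are my own assumptions for illustration. Real tags can also carry extended language, variant, and extension subtags, which this sketch simply rejects.

```python
import re

# Simplified pattern: language, optional script, optional region.
# Assumption: extlang, variant, extension, and private-use subtags
# are out of scope and are treated as unsupported.
TAG_PATTERN = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Split a simple language tag into its language, script, and region."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    parts = match.groupdict()
    return {
        # Conventional casing: language lower, script Title, region UPPER.
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


print(split_tag("zh-CN"))
print(split_tag("zh-Hans-CN"))
```

In this sketch, zh-CN comes back with a language and a region but no script subtag, so the "Simplified" in NVDA's label appears to be implied by convention rather than spelled out in the tag itself.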

"Albanian, sq" — The IANA registry says this is a macrolanguage with four languages in it:

  • Arbëreshë Albanian, aae
  • Tosk Albanian, als
  • Arvanitika Albanian, aat
  • Gheg Albanian, aln
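The relationship in the list above can be sketched as a small lookup table in Python. This is only an illustration: the names MACROLANGUAGE_MEMBERS and candidate_languages are hypothetical, and the table holds just the Albanian entries quoted above. A real tool would load the full IANA Language Subtag Registry instead of hard-coding them.

```python
# Data copied from the Albanian example above. Assumption: a real tool
# would read the full IANA registry rather than hard-code entries.
MACROLANGUAGE_MEMBERS = {
    "sq": {
        "aae": "Arbëreshë Albanian",
        "als": "Tosk Albanian",
        "aat": "Arvanitika Albanian",
        "aln": "Gheg Albanian",
    },
}


def candidate_languages(subtag):
    """Return the individual languages a UI language label might refer to."""
    members = MACROLANGUAGE_MEMBERS.get(subtag)
    if members is None:
        # Not a macrolanguage in this tiny table: treat it as itself.
        return [subtag]
    return sorted(members)


print(candidate_languages("sq"))
```

The point of the sketch is that a single label such as sq fans out into several candidates, and the software does not say which one it was actually designed and tested for.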

At this point I realized there's actually only a fuzzy match between the various language labels in software and the human languages they refer to.

Many of us have heard the stories of speech recognition having trouble understanding English speakers with Scottish and South Asian pronunciation. Yet the consensus says that these folks are all speaking the same language.

Instead of saying the software supports certain languages, it's more accurate to say the software offers options like English, Chinese, and Albanian. If you are a person who thinks of yourself as having skills in language variant X, and you choose language variant X in the software, then the software is saying it should work for you.

The IANA is the Internet Assigned Numbers Authority. When it assigns subtags to languages, it addresses one kind of ambiguity. When software (or a person in a technical role) sees sq, there is a shared understanding of what concept the token refers to, regardless of whether you would call that concept "Albanian" or "Shqip" in a non-technical context.

However, there are other kinds of ambiguity that IANA does not address. When IANA has a tag for a single language, like en for English or yo for Yoruba, it doesn't mean that everyone will agree on exactly which vocabulary, grammar, pronunciation, orthography, and cultural content deserve to be called "English" or "Yoruba." Nor does it mean all people who think of themselves as having skills in that single language will fully understand each other in speech and writing.

Back to the example of Chinese. When IANA lists 15 distinct spoken languages within the Chinese macrolanguage, this reflects a consensus that this is a good way to divide up the linguistic landscape, with mutual intelligibility of speakers as an important measure. But another way to look at this language group is as a dialect continuum, with many more than 15 different ways of speaking. Those differences accumulate to become different languages. Where exactly to draw lines in the continuum and declare separate languages is significantly influenced by cultural conventions.

Understanding these labels in the software as "offerings," I can formulate better questions about them. Before, I was only asking: Which languages and scripts does the software support?

Now I want to ask deeper questions:

  • For people who speak (or sign) this language, and who read and write this language and script, how usable is the software for them?
  • In what ways does usability vary depending on people's diverse experiences within the language, such as spoken accents, language registers, regional variations, education level, and disabilities?

My hypothesis is that software designers are quietly choosing a prestige language and dialect for most of their languages.
