Thank you to Irfan Ali and Janina Sajka for starting the Spoken Pronunciation Task Force in the W3C Accessible Platform Architectures (APA) Working Group. As an accessibility professional, amateur linguist, and person with a cognitive disability, I'm offering this blog post as a perspective and a starting point for discussion.
What would happen if we used lexical markup in HTML to improve pronunciation in text-to-speech (TTS)? Lexical markup specifies the lexeme of a word or phrase, not its phonemes.
- Lexical markup can help language learners as a basis for more efficient translation.
- Lexical markup can help end users with cognitive disabilities through more accurate presentation of lexical synonyms (PDF) or conversion to simplified language.
- Existing W3C standards already offer lexical markup as a basis for pronunciation. In the Pronunciation Lexicon Specification (PLS) and Speech Synthesis Markup Language (SSML), the role attribute and token element provide this capability.
- Implementations exist which offer lexical markup as an option, such as Amazon Polly.
- For authors, lexical specification can be simpler than phonetic specification, which encourages better quality for TTS end users. For example, consider which of these Amazon examples seems easier for authors to get right:
I read a lot of <w role="amazon:NN">content</w> <!-- interpret the word as a noun -->
I read a lot of <phoneme alphabet="ipa" ph="ˈkɑn.tɛnt">content</phoneme> <!-- use the specified IPA pronunciation (International Phonetic Alphabet) -->
- Phonetic markup breaks regional accents in TTS, while lexical markup allows them.
- Lexical markup makes it easier for search to match meaning.
- Example of the last two points: An author marks up "Jaguar" and "jaguar" lexically, not phonetically. TTS pronounces it correctly in both American and British English. Searches for "jaguar car" and "jaguar cat" match the right content.
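To make the accent point concrete, here is a minimal Python sketch. The two tiny lexicons, the role names, and the `pronounce` function are my own illustrative assumptions; they are not part of PLS, SSML, or any TTS product. The sketch only shows the shape of the argument: a phonetic override is spoken as written everywhere, while a lexical role leaves the final sounds to the listener's accent lexicon.

```python
# Hypothetical per-accent lexicons: (spelling, role) -> IPA.
# The en-US noun form is the one used in the Amazon example above;
# the other entries are illustrative.
LEXICONS = {
    "en-US": {
        ("content", "noun"): "ˈkɑn.tɛnt",
        ("content", "adjective"): "kənˈtɛnt",
    },
    "en-GB": {
        ("content", "noun"): "ˈkɒn.tɛnt",
        ("content", "adjective"): "kənˈtɛnt",
    },
}


def pronounce(word, accent, role=None, ipa=None):
    """Return the IPA a (hypothetical) TTS engine would speak for `word`."""
    if ipa is not None:
        # Phonetic markup: the author's IPA wins, whatever the listener's accent.
        return ipa
    # Lexical markup: the author names the lexeme, the accent lexicon picks the sounds.
    return LEXICONS[accent][(word.lower(), role)]


# Lexical markup adapts to each accent.
us_noun = pronounce("content", "en-US", role="noun")
gb_noun = pronounce("content", "en-GB", role="noun")

# Phonetic markup pins the American vowels even for a British voice.
gb_forced = pronounce("content", "en-GB", ipa="ˈkɑn.tɛnt")
```

In this sketch, `us_noun` and `gb_noun` differ in the first vowel, while `gb_forced` is the American pronunciation regardless of the requested accent.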
At the same time, the markup specification should also allow authors to express their desired pronunciation phonetically. This is important for some cases, such as:
- Where a pronunciation lexicon is not yet available, such as specialized technical vocabularies, or quotations of minority languages.
- Nonce words.
- Situations where precise pronunciation is more important than interoperability, such as a language comprehension test.
This task force will hit an existing limitation of HTML: there's no way for authors to mark up individual words in a flat text string such as an aria-label value, title attribute value, or <title> element text. Today, this limitation prevents authors from applying a lang attribute to a substring. Is now the time to devise a solution in the HTML spec?
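A small experiment shows the flat-string limitation in practice. This is a sketch using the HTML parser in Python's standard library as a stand-in for a browser; the sample markup and the `start_tags` helper are my own assumptions for illustration. When an author tries to put a `lang`-bearing span inside an `aria-label` value, the parser reports one element with one plain string, and the inner "markup" is just characters.

```python
from html.parser import HTMLParser

# An author's (unsuccessful) attempt to mark a French phrase inside aria-label.
SAMPLE = """<button aria-label="<span lang='fr'>Bon voyage</span>, traveler">Go</button>"""


class StartTagCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((tag, dict(attrs)))


def start_tags(html):
    """Return a list of (tag, attributes) pairs found in `html`."""
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.events


events = start_tags(SAMPLE)
for tag, attrs in events:
    print(tag, attrs)
```

Only the `button` start tag is reported. No `span` element exists, so there is nothing for a `lang` attribute (or any future pronunciation attribute) to attach to.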
With the rise of voice search and ubiquitous mainstream text-to-speech, accessibility specification writers will be wise to design for simultaneous benefits in mainstream voice applications. Avoiding changes to the mainstream browsing experience was a good choice for ARIA, but this philosophy should not be overextended. In most cases, pronunciation goals are the same for screen readers, literacy software, speech output as a mainstream feature in visual browsers, and voice-first or voice-only mainstream user agents such as smart speakers. Authors might wish to optimize speech for different use cases such as "read this text naturally for my enjoyment" and "speak all information literally," but these minor variants are not enough to justify making assistive speech pronunciation completely separate from mainstream speech pronunciation. On the contrary, if accessibility can pull in the same direction as other technology interests, then good web pronunciation won't end up isolated in an education niche.
I've heard concerns that standardization of pronunciation technology infrastructure could lead to (or could only succeed with) a high degree of pronunciation standardization that does not represent the diversity of real-world human speech. These concerns are valid, but not new. In traditional print dictionaries, lexicographers have always had to choose a specific midpoint between description and prescription. Fortunately, the PLS standard already supports more pronunciation options than print dictionaries could, by allowing authors to choose a pronunciation lexicon appropriate for their content and audience.
- A highly descriptive lexicon can be derived automatically from statistical analysis of a speech performance corpus.
- A balanced descriptive–prescriptive lexicon could be derived from community consensus sources such as Wikipedia and Wiktionary.
- A highly prescriptive lexicon could be written as a speech output style guide for a particular context, such as a foreign language course.
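Whichever point on the descriptive-to-prescriptive range a lexicon occupies, it can be published in the same PLS format. Here is a minimal sketch of reading one in Python. The `lexicon`, `lexeme`, `grapheme`, and `phoneme` elements, the `role` attribute, and the namespace come from PLS 1.0; the `ex:` role prefix, its example.com URI, the sample entries, and the `load_lexicon` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The PLS 1.0 namespace, bound to a local prefix for ElementTree lookups.
PLS_NS = {"pls": "http://www.w3.org/2005/01/pronunciation-lexicon"}

# A tiny illustrative lexicon. The "ex" role vocabulary is made up for this sketch.
SAMPLE_PLS = """<lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:ex="http://example.com/pos">
  <lexeme role="ex:noun">
    <grapheme>content</grapheme>
    <phoneme>ˈkɑn.tɛnt</phoneme>
  </lexeme>
  <lexeme role="ex:adjective">
    <grapheme>content</grapheme>
    <phoneme>kənˈtɛnt</phoneme>
  </lexeme>
</lexicon>"""


def load_lexicon(pls_text):
    """Map (grapheme, role) pairs to the phoneme string of their lexeme."""
    root = ET.fromstring(pls_text)
    entries = {}
    for lexeme in root.findall("pls:lexeme", PLS_NS):
        grapheme = lexeme.findtext("pls:grapheme", namespaces=PLS_NS)
        phoneme = lexeme.findtext("pls:phoneme", namespaces=PLS_NS)
        # The role attribute may list several space-separated roles.
        for role in lexeme.get("role", "").split():
            entries[(grapheme, role)] = phoneme
    return entries


lexicon = load_lexicon(SAMPLE_PLS)
```

Swapping in a different lexicon file changes the pronunciations without touching the content that refers to the lexemes.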
Likewise, TTS developers will expect the freedom to choose pronunciation lexicons that range from descriptive to prescriptive. In the future, TTS developers may want to give users a choice of lexicons along the descriptiveness range, as they already offer voice options for accent or gender. So web standards should apply the principle of "author proposes, user disposes" to the choice of lexicon.
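One way to read "author proposes, user disposes" is as a simple precedence rule. The sketch below is only my assumption about how a user agent might apply it; the function name, parameters, and lexicon identifiers are all hypothetical, and no current specification defines this behavior.

```python
def choose_lexicon(author_proposals, user_preference=None, tts_default="tts-builtin"):
    """Pick a lexicon identifier: the user's choice, else the author's, else the engine's."""
    if user_preference is not None:
        # User disposes: an explicit user setting overrides everything else.
        return user_preference
    if author_proposals:
        # Author proposes: use the author's first suggestion when the user has no preference.
        return author_proposals[0]
    # Nobody expressed a choice, so fall back to the TTS engine's own lexicon.
    return tts_default


# The author suggests a course style guide; this user prefers a descriptive lexicon.
picked_for_user = choose_lexicon(["course-style-guide"], user_preference="community-descriptive")

# Another user has no preference, so the author's proposal stands.
picked_for_default_user = choose_lexicon(["course-style-guide"])
```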
Where privacy allows, speech-to-text (STT) applications are a rich source for descriptive pronunciation lexicons. I don't know how STT might interoperate with pronunciation markup. At a minimum, STT developers should be invited to give their input to the task force.
Enterprise-scale machine learning is already using human speech sources to create pronunciation lexicons. Profit motives have funded spectacular progress, but there have been problematic side effects. Corporations have zealously protected their data sets inside walled gardens while adding languages slowly. This currently leaves assistive technology TTS users excluded from the web's promise of cultural and linguistic inclusion. I would love to see open, transparent processes for creating descriptive pronunciation lexicons from community sources like Wiktionary and linguistic research corpora.
Lexicon publication standards could yield some useful results. I bet a lot of people would like to define the pronunciation of their own name in their native language and in a dominant culture language. I also bet companies would like to define one or more pronunciations of their brand names – if this provides even a slight boost in organic search engine optimization (SEO) for voice search (analogous to schema markup for traditional search), then somebody will make a lot of money.
1 comment:
This sounds like something that would help AIs understand human language as well.
But I'm not sure it's going to keep my GPS from pronouncing Wiehle (WEE-lee) Avenue as WHY-ay-luh.