CLDR Ticket #9956 (closed data: fixed)
Provide mechanism for "hybrid" locales (Hinglish, etc.)
|Reported by:||mark||Owned by:||mark|
Description (last modified by mark)
We have received requests for locale identifiers to represent "hybrid" locales, such as Hinglish, which can be used to select content that is a "deep" mixture of two (or more) languages. See the Background below.
This ticket proposes adding new T extension keys 'h0' and 'h1' for identifying hybrid locales.
|es-t-h0-en||Spanglish||Spanish with an admixture of English|
|en-t-h0-es||Spanglish||English with an admixture of Spanish|
Note: the boundary between these two will be rather fuzzy, as in most cases of identifying languages. We recommend using es-t-h0-en unless English clearly predominates.
One could then also have
|es-t-hi-h0-en||Spanglish translated from Hindi|
A second key, 'h1', is defined to indicate that the source language of the transform is itself a hybrid, much as we have done with the transliteration keys s0 and d0. The value of h1 is a language tag for the language mixed into the source language for -t-, allowing formulations like
|es-t-hi-h1-en||Spanish translated from Hinglish|
|es-t-hi-h0-en-h1-en||Spanglish translated from Hinglish|
If needed, one could even indicate what the script of the "mixed-in" language is:
|ru-t-h0-en-latn||Runglish||Russian with an admixture of English in Latin script|
|ru-t-h0-en-cyrl||Runglish||Russian with an admixture of English in Cyrillic script|
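As a rough illustration of how a consumer might read the script out of such a value, the following Python sketch splits an h0 value into the mixed-in language and an optional script. The helper name split_mixed_in is hypothetical, and the only check applied is the BCP 47 convention that script subtags are four letters.

```python
def split_mixed_in(value):
    """Split an h0/h1 value such as 'en-latn' into (language, script).

    Hypothetical helper: the first subtag is taken as the language, and the
    first four-letter alphabetic subtag after it, if any, as the script.
    """
    parts = value.lower().split("-")
    language = parts[0]
    script = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()  # conventional casing, e.g. 'Latn'
            break
    return language, script
```

For the two Runglish rows above, split_mixed_in("en-latn") returns ("en", "Latn") and split_mixed_in("en-cyrl") returns ("en", "Cyrl"); a bare "en" yields ("en", None).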
Should we ever have need for hybrids of more than two languages, corresponding pairs of keywords such as h2 and h3 can be defined.
Background
Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. See also https://en.oxforddictionaries.com/definition/spanglish for the use of the term “hybrid”. This is not simply content that has two languages in it, such as a book of parallel text containing English and Spanish:
|On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg…||El domingo 24 de mayo de 1863, mi tío, el profesor Lidenbrock, regresó precipitadamente a su casa, situada en el número 19 de la König-strasse, una de las calles más antiguas del barrio viejo de Hamburgo…|
While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between a Spanish document that has some passages quoted in English and a Spanglish document. And fine-grained tagging doesn't handle combinations like the Denglisch "gedownloadet" (cf. http://www.duden.de/rechtschreibung/downloaden) or the Franglais "downloadé", which are in neither language.
More importantly, it doesn't work for a very common use case: locale selection. Requests for localized content and internationalization services are communicated using locales, which are an extension of language tags. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc.). If you want an application to support Spanglish or Hinglish, then you have to have a locale to represent that.
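To illustrate the locale-selection use case, here is a small Python sketch of how an application might match a requested hybrid locale against the locales it supports. The fallback policy shown (exact tag first, then the tag with its -t- extension stripped) is purely an assumption for illustration; this ticket does not specify any fallback behavior, and select_locale is a hypothetical name.

```python
def select_locale(requested, available):
    """Pick an available locale for a requested tag, or None.

    Assumed policy (not defined by this ticket): try the exact tag, then
    the tag with the -t- extension removed.
    """
    requested = requested.lower()
    candidates = [requested]

    # Everything before "-t-" is the base language tag.
    base = requested.split("-t-", 1)[0]
    if base not in candidates:
        candidates.append(base)

    # Compare case-insensitively but return the caller's original spelling.
    normalized = {locale.lower(): locale for locale in available}
    for candidate in candidates:
        if candidate in normalized:
            return normalized[candidate]
    return None
```

An application that ships a Spanglish localization would list es-t-h0-en among its available locales and get it back directly; one that ships only es would, under this assumed policy, fall back to plain Spanish.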
Luckily, this falls within the scope of the T extension. While the title of the RFC (https://tools.ietf.org/html/rfc6497) is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader:
This document specifies an Extension to BCP 47 that provides subtags
for specifying the source language or script of transformed content,
including content that has been transliterated, transcribed, or
translated, or in some other way influenced by the source. It also
provides for additional information used for identification.
BTW, the U extension was never in question. Syntactically it does not allow for values that have two letters, like language subtags, because they collide with valid key values in the U extension. As a matter of fact, that was the primary reason for the T extension. Had we been prescient when we devised U, we would have only used keys that could never collide with language subtags, and then never would have needed the T extension.
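The collision can be seen in a short Python sketch. It assumes the key syntax as I read it from the two RFCs: U extension keys are any two alphanumerics (RFC 6067), while T extension field keys are a letter followed by a digit (RFC 6497). The function name could_be_key is illustrative only.

```python
import re

U_KEY = re.compile(r"^[a-z0-9]{2}$")  # RFC 6067: key = 2alphanum
T_KEY = re.compile(r"^[a-z][0-9]$")   # RFC 6497: field key = ALPHA DIGIT


def could_be_key(subtag, extension):
    """Return True if subtag is syntactically a key in the given extension.

    extension is 'u' or 't'; any other value raises ValueError.
    """
    if extension == "u":
        pattern = U_KEY
    elif extension == "t":
        pattern = T_KEY
    else:
        raise ValueError("extension must be 'u' or 't'")
    return bool(pattern.match(subtag.lower()))
```

Under these patterns, a two-letter language subtag such as "en" looks like a key inside -u- but can never be mistaken for a key inside -t-, whereas "h0" is a well-formed -t- key.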
(Note, this was re-edited to reflect comments made here and on the ietf languages group.)
- Description modified
- Summary changed from Provide mechanism for "code-switch" locales (Hinglish, etc.) to Provide mechanism for "hybrid" locales (Hinglish, etc.)
comment:10 Changed 3 months ago by emmons
- Status changed from new to accepted
- Component changed from unknown to bcp47
- Priority changed from assess to major
- Phase changed from dsub to rc
- Milestone changed from UNSCH to 31
- Owner changed from anybody to mark
- Type changed from unknown to data
comment:11 Changed 3 months ago by mark
- Status changed from accepted to reviewing
- Review set to pedberg
comment:20 Changed 8 weeks ago by pedberg
- Status changed from reviewing to closed
- Resolution set to fixed