CLDR Ticket #9956 (closed data: fixed)

Opened 6 months ago

Last modified 3 months ago

Provide mechanism for "hybrid" locales (Hinglish, etc.)

Reported by: mark
Owned by: mark
Component: bcp47
Data Locale:
Phase: rc
Review: pedberg
Weeks:
Data Xpath:
Xref: ticket:9966

Description (last modified by mark) (diff)

We have gotten requests for locale identifiers to represent "hybrid" locales, such as Hinglish, which can be used to select content that is a "deep" mixture of two (or more) languages. See the Background below.

Proposal

This ticket proposes adding new T extension keys 'h0' and 'h1' for identifying hybrid locales.

Examples:

es-t-h0-en    Spanglish    Spanish with an admixture of English
en-t-h0-es    Spanglish    English with an admixture of Spanish

Note: the boundary between these two will be rather fuzzy, like most cases in identifying languages. We'd recommend that es-t-h0-en be used unless English clearly predominates.

One could then also have

es-t-hi-h0-en    Spanglish translated from Hindi

A second key 'h1' is defined, indicating that the source language for the transform is a hybrid, much as we have done with the transliteration s0 and d0 keys. The value of h1 is a language tag for the language mixed into the source language for -t-, allowing formulations like

es-t-hi-h1-en          Spanish translated from Hinglish
es-t-hi-h0-en-h1-en    Spanglish translated from Hinglish

If needed, one could even indicate what the script of the "mixed-in" language is:

ru-t-h0-en-latn    Runglish    Russian with an admixture of English in Latin script
ru-t-h0-en-cyrl    Runglish    Russian with an admixture of English in Cyrillic script

Should we ever have need for hybrids of more than two languages, corresponding pairs of keywords such as h2 and h3 can be defined.
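
As an illustration only, the proposed tags could be taken apart as in the following sketch. It assumes the syntax exactly as written in the examples above: a T-extension key is a letter followed by a digit, and everything up to the next key is that key's value. The function name parse_t_extension is hypothetical, and the sketch accepts two-letter values such as "en" without checking them against the value-length rules in RFC 6497.

```python
import re

# A T-extension field key is one letter followed by one digit (e.g. "h0", "s0").
_TKEY = re.compile(r"^[a-z][0-9]$")


def parse_t_extension(tag):
    """Split a tag into (base language tag, -t- source tag, fields).

    Loose sketch: subtags are not validated, and other extension
    singletons (such as -u-) after -t- are not handled.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags:
        return "-".join(subtags), None, {}
    t_index = subtags.index("t")
    base = "-".join(subtags[:t_index])
    source = []          # subtags of the -t- source language, if any
    fields = {}          # key -> list of value subtags
    current_key = None
    for subtag in subtags[t_index + 1:]:
        if _TKEY.match(subtag):
            current_key = subtag
            fields[current_key] = []
        elif current_key is None:
            source.append(subtag)
        else:
            fields[current_key].append(subtag)
    joined = {key: "-".join(values) for key, values in fields.items()}
    return base, "-".join(source) or None, joined
```

With this sketch, "es-t-hi-h0-en-h1-en" splits into the base "es", the source "hi", and the fields h0=en and h1=en, while "ru-t-h0-en-latn" has no source and a single field h0=en-latn.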

Background

Hybrid locales have intermixed content from 2 (or more) languages, often with one language's grammatical structure applied to words in another. See also https://en.oxforddictionaries.com/definition/spanglish for the use of the term “hybrid”. This is not simply content that has two languages in it, such as a book of parallel text containing English and Spanish:

On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg… El domingo 24 de mayo de 1863, mi tío, el profesor Lidenbrock, regresó precipitadamente a su casa, situada en el número 19 de la König-strasse, una de las calles más antiguas del barrio viejo de Hamburgo…

While text in a document can be tagged as partly in one language and partly in another, that is not the same as having a hybrid locale. There is a difference between a Spanish document that has some passages quoted in English and a Spanglish document. And fine-grained tagging doesn't handle combinations like the Denglisch "gedownloadet" or the Franglais "downloadé" (cf. http://www.duden.de/rechtschreibung/downloaden), which are in neither language.

More importantly, it doesn't work for a very common use case: locale selection. To communicate requests for localized content and internationalization services, locales are used, which are an extension of language tags. When people pick a language from a menu, internally they are picking a locale (en-GB, es-419, etc). If you want an application to support Spanglish or Hinglish, then you have to have a locale to represent that.

Luckily, this falls within the scope of the T extension. While the title of the RFC (https://tools.ietf.org/html/rfc6497) is “Transformed Content”, the abstract makes it clear that the scope is broader than the term "transformed" might indicate to a casual reader:

This document specifies an Extension to BCP 47 that provides subtags
for specifying the source language or script of transformed content,
including content that has been transliterated, transcribed, or
translated, or in some other way influenced by the source. It also
provides for additional information used for identification.

BTW, the U extension was never in question. Syntactically it does not allow for values that have two letters, like language subtags, because they collide with valid key values in the U extension. As a matter of fact, that was the primary reason for the T extension. Had we been prescient when we devised U, we would have only used keys that could never collide with language subtags, and then never would have needed the T extension.
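
The collision can be illustrated with a small sketch. The patterns below are assumptions based on my reading of the extension syntax: a U key is two characters ending in a letter, a T field key is a letter followed by a digit, and a primary language subtag is two or three letters (longer language subtags are left out for brevity). The helper names are hypothetical.

```python
import re

# Assumed subtag shapes (see the lead-in); not a full BCP 47 grammar.
U_KEY = re.compile(r"^[a-z0-9][a-z]$")   # -u- key, e.g. "co", "nu"
T_KEY = re.compile(r"^[a-z][0-9]$")      # -t- field key, e.g. "h0", "s0"
LANGUAGE = re.compile(r"^[a-z]{2,3}$")   # primary language subtag, e.g. "en"


def is_ambiguous_in_u(subtag):
    """True if a subtag could be read as both a -u- key and a language."""
    return bool(U_KEY.match(subtag) and LANGUAGE.match(subtag))


def is_ambiguous_in_t(subtag):
    """True if a subtag could be read as both a -t- field key and a language."""
    return bool(T_KEY.match(subtag) and LANGUAGE.match(subtag))
```

Under these assumptions "en" and "hi" are ambiguous inside -u-, whereas no subtag can be ambiguous inside -t-, because a field key always contains a digit and a language subtag never does.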


(Note, this was re-edited to reflect comments made here and on the ietf languages group.)

Change History

comment:1 Changed 6 months ago by cowan@…

I agree that there needs to be a way to encode the language of objects (texts, transcripts, recordings, etc.) that are code-switched. However, I believe this to be outside the scope of -t-, which per the RFC represents the language of an object whose creation has in some way been influenced by another object called the source. When texts are composed using code-switching, or people code-switch while conversing, there is no source object, and consequently -t- should not be used. In addition, using -t- in this way preempts the normal use of -t- to indicate a transformation: thus if a work is translated from English into Spanglish (English-Spanish code switching), it would need two -t- tags to represent this fact, which BCP 47 explicitly disallows.

I am not thrilled with a non-generative mechanism for code-switching, but I don't think any satisfactory alternative has been put forward as yet.

comment:2 follow-up: ↓ 14 Changed 6 months ago by duerst@…

Discussion on the IETF/IANA languages list seems to come to the conclusion that the use of the -t- extension is at best an extreme stretch, and at worst completely inappropriate. The overarching reason for this is that language mixing (code switching) cannot be explained with a 'source' and a 'target' language. For more arguments, please see the thread on Spanglish at http://www.alvestrand.no/pipermail/ietf-languages/2016-December/thread.html.

comment:3 Changed 6 months ago by mark

As to John's concern in comment:1 about being able to have a transformation of a code-switch language: I think that is a far less important requirement than having a general mechanism for code-switch languages.

However, I think we can accommodate that — and at the same time alleviate some of people's concerns about the terms 'source' and 'target' — by changing the syntax so that the value of the c0 key is the language that is mixed into the main language tag. We then get tags structured as follows:

es-t-c0-en    Spanglish    Spanish with an admixture of English
en-t-c0-es    Spanglish    English with an admixture of Spanish

Note: the boundary between these two will be rather fuzzy, like most cases with languages. It is probably best to recommend that es-t-c0-en be used unless English clearly predominates.

One could then have

es-t-hi-c0-en    Spanglish translated from Hindi

Although it would again be quite infrequently used, we can easily allow for the case of a code-switch language being the source, and even have the translation of one code-switch language into another. We do this by using another keyword, much as we have done with the transliteration s0 and d0 keys. So we define c1 as a language that is mixed into the source language for -t-, allowing formulations like

es-t-hi-c0-en-c1-en    Spanglish translated from Hinglish

The more I think about it, the more I like this formulation.

comment:4 Changed 6 months ago by mark

  • Description modified (diff)

comment:5 Changed 6 months ago by mark

  • Description modified (diff)

comment:6 Changed 6 months ago by mark

  • Description modified (diff)
  • Summary changed from Provide mechanism for "code-switch" locales (Hinglish, etc.) to Provide mechanism for "hybrid" locales (Hinglish, etc.)

comment:7 Changed 6 months ago by mark

  • Description modified (diff)

comment:8 Changed 6 months ago by mark

  • Description modified (diff)

comment:9 Changed 6 months ago by mark

  • Description modified (diff)

comment:10 Changed 6 months ago by emmons

  • Status changed from new to accepted
  • Component changed from unknown to bcp47
  • Priority changed from assess to major
  • Phase changed from dsub to rc
  • Milestone changed from UNSCH to 31
  • Owner changed from anybody to mark
  • Type changed from unknown to data

comment:11 Changed 6 months ago by mark

  • Status changed from accepted to reviewing
  • Review set to pedberg

comment:12 Changed 5 months ago by kristi

  • Cc shawnste@… added

comment:13 follow-up: ↓ 15 Changed 5 months ago by kristi

Mark, Not sure if you already responded to comment #2?

comment:14 in reply to: ↑ 2 Changed 5 months ago by mark

Replying to duerst@…:

Discussion on the IETF/IANA languages list seems to come to the conclusion that the use of the -t- extension is at best an extreme stretch, and at worst completely inappropriate. The overarching reason for this is that language mixing (code switching) cannot be explained with a 'source' and a 'target' language. For more arguments, please see the thread on Spanglish at http://www.alvestrand.no/pipermail/ietf-languages/2016-December/thread.html.

The proposal has been refined since then. In particular, the -t- extension is more general than many of the participants in the IANA mailing list realize, and the goal for CLDR is also somewhat different in that it focuses on locale selection, not document tagging.

comment:15 in reply to: ↑ 13 Changed 5 months ago by mark

Replying to kristi:

Mark, Not sure if you already responded to comment #2?

Thanks for the reminder. Just responded.

comment:16 Changed 5 months ago by shawn

I realize that -t- can be very general, however I do not like this use of -t- for this mechanism.

I also realize that the es-t-hi-c0-en-c1-en is enabled by this system, however I find that unlikely to be useful in practice and am unconvinced that the complexity of allowing/parsing this arrangement provides valuable benefit.

Additionally, I'm totally confused how "es-t-hi-c0-en-c1-en" is useful for locale selection. In that case I would need to know that the user speaks Spanglish (if it's translated from Hinglish, they may not speak Hindi) and it may be helpful to know that they presumably also speak English and Spanish.

For that last purpose something like http-accept-language that takes "es-m-en;es;en" seems far more useful to find a locale that might be interesting to that user than the unnecessary "translated from" information.

comment:17 Changed 5 months ago by mark

Note: some commits for this purpose have been made to ticket:9966 instead of this one.

comment:18 Changed 5 months ago by mark

Committee decided to revert to simpler format:

We will not support hybrid languages as a transform source/target, and will just use h0 as a suffix (with a required subtag): “hi-t-en-h0-hybrid”. Also remove “language-tag” as valueType: we’d use “single”.
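
A minimal sketch of working with the simpler format, assuming (as my reading of the example, not something spelled out in this comment) that the -t- value names the mixed-in language and that a trailing h0-hybrid marks the tag as hybrid. The function names are hypothetical and are not part of CLDR or ICU; fields following h0-hybrid are not handled.

```python
def make_hybrid_tag(main_language, mixed_in_language):
    """Build a tag of the form 'main-t-mixedin-h0-hybrid'."""
    return f"{main_language}-t-{mixed_in_language}-h0-hybrid"


def split_hybrid_tag(tag):
    """Return (main language, mixed-in language), or None if not hybrid."""
    subtags = tag.lower().split("-")
    # The sketch only recognizes tags that end in the h0-hybrid field.
    if subtags[-2:] != ["h0", "hybrid"] or "t" not in subtags:
        return None
    t_index = subtags.index("t")
    main = "-".join(subtags[:t_index])
    mixed_in = "-".join(subtags[t_index + 1:-2])
    if not main or not mixed_in:
        return None
    return main, mixed_in
```

Here make_hybrid_tag("hi", "en") produces the example tag from the decision, and split_hybrid_tag rejects tags in the earlier proposed form such as "es-t-h0-en".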

comment:19 Changed 5 months ago by pedberg

  • Xref set to 9966

Some of this was done under ticket:9966, specifically r13129 and r13171

comment:20 Changed 4 months ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed

comment:21 Changed 3 months ago by pedberg

And some of this was misticketed under cldrbug 7902, specifically r13127 and r13128

Last edited 3 months ago by pedberg (previous) (diff)