CLDR Ticket #7699(closed: as-designed)

Opened 5 years ago

Last modified 4 years ago

Add aliases from "special" Wikimedia language codes

Reported by: federicoleva@…
Owned by: mark
Component: design
Phase: dsub

Description

Some Wikimedia project subdomains and (rarely) MediaWiki language codes are not aligned with CLDR's. The standardisation is ongoing but it will take time; in the meantime we sometimes need to intersect the two sets of language codes to avoid false negatives. For instance, in https://bugzilla.wikimedia.org/57133 we look for "fil", based on CLDR data, in a set that only contains "tl", based on MediaWiki locales.

It seems to me that the CLDR feature we need to use as a workaround is the list of aliases: http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/aliases.html
Most of our special language codes are not in contradiction, but several are missing: https://meta.wikimedia.org/wiki/Special_language_codes

be-x-old -> be-tarask
roa-rup -> rup
zh-classical -> lzh
zh-min-nan -> nan
zh-yue -> yue
bat-smg -> sgs
cbk-zam -> cbk
fiu-vro -> vro
nds-nl -> nds
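
A minimal sketch of the intersection described above, assuming the proposed mapping is kept as a local table on the client side. The names `WIKIMEDIA_TO_CLDR`, `normalize` and `common_codes` are hypothetical, and the table only mirrors the list in this ticket, not any official CLDR data.

```python
# Proposed aliases copied from the list in this ticket (not an official CLDR table).
WIKIMEDIA_TO_CLDR = {
    "be-x-old": "be-tarask",
    "roa-rup": "rup",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
    "bat-smg": "sgs",
    "cbk-zam": "cbk",
    "fiu-vro": "vro",
    "nds-nl": "nds",
}


def normalize(code):
    """Return the aliased code if one is listed, else the lowercased code unchanged."""
    code = code.lower()
    return WIKIMEDIA_TO_CLDR.get(code, code)


def common_codes(cldr_codes, mediawiki_codes):
    """Intersect the two sets after mapping the MediaWiki side through the aliases."""
    return {c.lower() for c in cldr_codes} & {normalize(c) for c in mediawiki_codes}
```

Without the normalization step, a lookup for "yue" in a set that only contains "zh-yue" would be a false negative of the kind described above.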

Change History

comment:1 Changed 5 years ago by verdy_p@…

nds and nds-nl are distinguished in Wikipedia because they encode two really different dialects using distinct conventions (as well as different fallbacks, to either German or Dutch).
In Wikipedia, "nds" is for the German variants used in Niedersachsen, and "nds-nl" is for the part in the Netherlands. This affects the choice of words borrowed locally for many modern creations. There are also specific dialectal phonology differences caused by the proximity of different major languages.
Historically, when Dutch and German were not standardized, Low Saxon was part of a continuum; now it persists by attempting a sort of local unification, which works differently in the two countries.
These two dialects are not converging but diverging (just like Serbian and Croatian).

The list above has already been sent multiple times to the Wikimedia admins. They know the problem, and it is more complicated to correct than just changing a domain name and creating an alias.

(your list forgets the severe conflict caused by Wikipedia's use of "nrm" for Norman)

The main problem is that Wikimedia's use of these codes has forced TranslateWiki.net to support them without considering that other projects might start using them. The situation has worsened since the launch of Wikidata: these codes have spread to other open projects (e.g. OpenStreetMap, for toponyms). The list above, however, is not as critical, because the codes do not conflict with existing standardized languages. The following even conform to the BCP47 standard; even if their interoperability is very poor, their mapping (from Wikipedia to the recommended newer codes) is clear:

zh-min-nan -> nan
zh-yue -> yue

The same is true for the following ones for better interoperability, but they are not invalid if we don't remap them at all.

nds -> nds-de
(for nds-nl in Wikipedia, it is not equivalent to "nds")
be-x-old -> be-tarask

The following are not conforming but currently cause no conflict. Remapping them however may preserve compatibility with possible new changes in standards because they use reserved code space (or unregistered codes where an extension is possible):

simple -> en-x-simple (maybe we could register the "simple" variant subtag and use "en-simple")
de-formal -> de-x-formal (same case)
nl-informal -> nl-x-informal (same case)
zh-classical -> lzh (no need to register the "classical" variant, just use the existing ISO 639-3 code)
roa-rup -> rup
bat-smg -> sgs
fiu-vro -> vro (these 3 are not critical for now: they use the "extlang" code space, whose registration for new codes is now closed, but these 3-letter "extlang" subtags could be used for something else later, e.g. for classifying the ISO 639-5 language families into subfamilies).

Your proposed mapping is questionable:

cbk-zam -> cbk

My opinion is that it should better be:

cbk-zam -> cbk-x-zam

(dialectal differences are still not encoded for that language in the IANA registry; if this happens, no private use will be needed and the mapping will be something like "cbk-zambo", with 5 to 8 letters for the variant subtag)

Only the following is causing conflicts with an unrelated language and it has no clear remapping.

nrm -> roa-x-nrm ? or fr-x-nrm ? (and does it apply to Jersiais official in Jersey?)

Possibly we could request the allocation of a separate ISO 639-3 code for Norman as a macrolanguage, distinguished from "standard" French (just like the code "frc" for "Cajun", or the other codes for French-based creoles including Haitian, Guyanese, Guadeloupean, Martiniquese and Reunionese), and then other more specific codes for Jèrriais, Guernésiais, and Continental Norman.

Alternatively, we could just encode a single individual language code in ISO 639-3 for Norman, to be used as the primary language subtag in BCP47 and followed by the standard region subtags for France, Jersey and Guernsey (for the last one, some reports argue that it has two variants, one on Guernsey proper and another on Sark).

The other alternative is to leave Wikipedia and the other databases already using "nrm" for Norman unchanged, and instead allocate another free code for Narom (I have not found any place where the latter, a minority language of Malaysia, was encoded with ISO 639-3 and used in interchanged content). This would mean changing the description in ISO 639-3 (not exceptional: this has already occurred, even in incompatible ways), but also in the IANA registry (as an important erratum).

comment:2 Changed 5 years ago by verdy_p@…

Add also this conflicting usage:

roa-tara -> it-x-tara (or register the tarandine variant subtag and use "it-tarandine")

It conflicts with the code space reserved for script codes (the 4-letter subtag "tara" could plausibly be used for some old South Asian script used in Thailand or Vietnam).
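
The shape rules mentioned in these two comments (3-letter extlang, 4-letter script, 5 to 8 character variant, "x" for private use) can be sketched as a small classifier. This is only an illustration under simplifying assumptions: `classify_subtags` is a hypothetical helper, it looks at each subtag's shape alone, and it ignores both the ordering rules of BCP 47 and the IANA registry, so it cannot tell whether a subtag is actually registered.

```python
import re


def classify_subtags(tag):
    """Label each subtag by the BCP 47 slot its shape would fit (shape only, no registry lookup)."""
    parts = tag.lower().split("-")
    labels = [(parts[0], "language")]  # the first subtag is taken as the primary language
    private = False
    for sub in parts[1:]:
        if private:
            label = "private-use"  # everything after "x" is private use
        elif sub == "x":
            private = True
            label = "private-use singleton"
        elif re.fullmatch(r"[a-z]{3}", sub):
            label = "extlang-shaped"
        elif re.fullmatch(r"[a-z]{4}", sub):
            label = "script-shaped"
        elif re.fullmatch(r"[a-z]{2}|[0-9]{3}", sub):
            label = "region-shaped"
        elif re.fullmatch(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", sub):
            label = "variant-shaped"
        else:
            label = "not well-formed"
        labels.append((sub, label))
    return labels
```

Under these assumptions, "tara" in "roa-tara" lands in the script-shaped slot and "rup" in "roa-rup" in the extlang-shaped slot, while the subtags of "it-x-tara" after "x" are all private use.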

comment:3 Changed 5 years ago by federicoleva@…

The two comments above are unrelated to this request. The alias list doesn't contain any redirect *to* BCP47 subtags, file a separate request if you want such a change.

comment:4 Changed 4 years ago by 541329866@…

Also please set crh-latn and crh-cyrl -> crh (Crimean Tatar), see http://en.wikipedia.org/wiki/Crimean_Tatar_language.

comment:5 Changed 4 years ago by emmons

  • Owner changed from anybody to mark
  • Status changed from new to assigned
  • type changed from unknown to enhancement
  • Milestone changed from UNSCH to 27dsub

comment:6 Changed 4 years ago by emmons

  • Component changed from data-supplemental to design

comment:7 Changed 4 years ago by emmons

  • Status changed from assigned to design

comment:9 Changed 4 years ago by markus

  • Phase set to dsub
  • Milestone changed from 27dsub to 27

comment:10 Changed 4 years ago by mark

  • Status changed from design to closed
  • Resolution set to as-designed

I think there are 2 reasons not to do this.

  1. This appears to have fairly narrow usage, basically just Wikipedia. An important client, to be sure, but they can solve it themselves with a custom map.
  2. It appears to me that we can't really supply a full solution for Wikipedia, because some of the codes collide. That is, the language aliases are set up to always map from X to Y, where X and Y "mean" the same thing. But codes like "als" used in Wikipedia are valid in ISO (thus in CLDR); they are just used in Wikipedia to mean a different language.
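
A custom map of the kind suggested in point 1 could be keyed by the source of the code, which also sidesteps the collision in point 2. The sketch below is an assumption-laden illustration, not CLDR or MediaWiki API: `SOURCE_OVERRIDES` and `to_cldr` are hypothetical names, and the "als" -> "gsw" entry assumes that Wikipedia uses "als" for Alemannic while ISO 639-3 assigns "als" to Tosk Albanian.

```python
# Hypothetical per-source overrides, applied before any CLDR lookup.
# Assumption: Wikipedia's "als" means Alemannic ("gsw"), not ISO 639-3 "als".
SOURCE_OVERRIDES = {
    "wikipedia": {
        "als": "gsw",
        "zh-yue": "yue",
        "zh-min-nan": "nan",
    },
}


def to_cldr(code, source=None):
    """Map a code to its CLDR form, using overrides only for the named source."""
    overrides = SOURCE_OVERRIDES.get(source, {})
    return overrides.get(code, code)
```

Because the override applies only when the code is known to come from Wikipedia, "als" from any other source keeps its ISO meaning, which a global X-to-Y alias could not guarantee.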

If you disagree, please reopen and comment.
