[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #2448(closed enhancement: fixed)

Opened 8 years ago

Last modified 8 years ago

TR35 : unicode_language_subtag

Reported by: verdy_p@… Owned by: mark
Component: xxx-spec Data Locale:
Phase: Review:
Weeks: Data Xpath:


Currently, the Unicode language subtags are defined by this production rule:

unicode_language_subtag :=
    BCP47_language_subtag |
    ISO_639_3_code | ISO_639_5_code

However, this has been redundant since the publication of RFC 5646 in September 2009. That revision of BCP 47 incorporates into the IANA registry of valid subtags all ISO 639-3 and ISO 639-5 codes that were valid at that time; they were notified to IANA using its registration form, together with the publication of the informational RFC 5645.

It could now simply read:

unicode_language_subtag := BCP47_language_subtag
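The redundancy argument can be sketched as a set-membership check. This is only an illustration: the three sets below are tiny hypothetical excerpts, not the real IANA registry or ISO 639 tables, and the function names are made up for this example. The two productions accept the same subtags exactly when every ISO code in use is also a registered BCP 47 language subtag, which is the assumption made above.

```python
# Hypothetical excerpts for illustration only; real data would come from the
# IANA Language Subtag Registry and the ISO 639-3 / ISO 639-5 code tables.
BCP47_LANGUAGE_SUBTAGS = {"en", "fr", "zh", "cmn", "gem"}
ISO_639_3_CODES = {"cmn"}
ISO_639_5_CODES = {"gem"}


def old_production(subtag: str) -> bool:
    """BCP47_language_subtag | ISO_639_3_code | ISO_639_5_code"""
    return (
        subtag in BCP47_LANGUAGE_SUBTAGS
        or subtag in ISO_639_3_CODES
        or subtag in ISO_639_5_CODES
    )


def new_production(subtag: str) -> bool:
    """BCP47_language_subtag only"""
    return subtag in BCP47_LANGUAGE_SUBTAGS


def iso_alternatives_are_redundant() -> bool:
    # The extra alternatives add nothing if the ISO sets are subsets
    # of the BCP 47 set.
    return (ISO_639_3_CODES | ISO_639_5_CODES) <= BCP47_LANGUAGE_SUBTAGS
```

With these sample sets, `old_production` and `new_production` agree on every input; if an ISO code were missing from the registry excerpt, `iso_alternatives_are_redundant()` would report the difference.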

This production is still needed because of differences in the interpretation of some codes and the exclusion of some narrower ISO 639-3 codes, such as "cmn", which identify individual languages rather than macrolanguages, as explained in the large table of the article.

Removing "ISO_639_3_code | ISO_639_5_code" from the production changes nothing in practice, because the same codes already had to be restricted in use. Instead, it clarifies things by no longer mixing the concepts of ISO 639 language codes and BCP 47 language subtags when they are in fact now equivalent (and also because there is now a stability guarantee, thanks to the published agreement between the relevant ISO and IETF technical committees).

Note also that similar restrictions or modifications are applied to other subtags used in Unicode language identifiers.

This is not really a bug but a clarification (removing the redundancy) that allows a better interpretation of the restrictions applied to the ISO 639-3 macrolanguage codes, and to the legacy narrower definition of ISO 639-5 collections that were already encoded exclusively within ISO 639-1 and ISO 639-2.

Also, why do you maintain the UN M.49 economic groupings as valid Unicode region subtags? They are not stable even in UN statistics, and they were designed only for statistical purposes, not for linguistic purposes. They are completely unusable for identifying locales (try to guess whether languages or locales differ between "developing regions" and "developed regions").

Try to guess how these regions are defined: they are definitely NOT consistent across the various UN statistical documents, and their definitions change every year. The most notable change is the redefinition of "developing countries", which in most UN reports now means all countries that are not developed, while other reports exclude "transition countries". "Transition countries" are also report-dependent and vary.

Even the "developed countries" vary, depending on whether they are defined geographically (by continents or subcontinents) or geopolitically (independently of the location of some or all of their territory). Some reports also include Israel (part of Western Asia, which is not within the "developed regions"), or Japan, or countries of the Southern African Trade Union (but some do not, as all of Africa is part of the "developing regions").

Even the definition of Northern America (in economic groupings) is inconsistent, as it does not use the code assigned to all the Americas but then excludes Latin America and the Caribbean.

For all these reasons, the incoherent economic groupings of UN M.49 should not be allowed at all; only the supranational geographical groupings are useful.

Note that I have updated the English Wikipedia article about "UN M.49" to show clearly how these codes are separated and used in BCP 47 and ISO 3166-1. The Unicode and CLDR projects should need only the codes shown in the first column (including the '830' code for the Channel Islands), just as was done in the IANA registry for BCP 47 and explained in its RFCs.

Conversely, the use of a code for the European Union is meaningful for localization purposes, because it exists legally as the European Community (and will soon exist as the European Union, since the latest treaty was recently accepted by the Czech Republic, which will ratify it, giving the EU legal personality). This means that the European Union will itself be a party to international treaties, instead of the EC, which was just one of its three pillars, and it will succeed the EC in all past treaties signed by the EC.

But why not use the "EU" code, which was reserved for this purpose and is now widely used on the Internet as a ccTLD? (It does not make sense to use and maintain in LDML and CLDR locales a private-use code "QU", which is even less interoperable, when you are already adding restrictions and differences relative to BCP 47 and ISO standards.)

Note that ccTLDs, and TLDs in general, are an important part of localization concerns (see for example IDN applications), even if they are not needed just for identifying languages.

Conversely, it may make sense in the future to add (for localization purposes, not for language identification) a few other private-use region codes, for example for NATO countries, CIS countries, states of the Gulf, and maybe ASEAN or North American and Latin American trade unions (because there is currently no such code in UN M.49), if this influences the values or formats of some localized data. However, I am not requesting this for now, as there is no demonstrated need for such regions.


Change History

comment:1 Changed 8 years ago by mark

  • Owner changed from somebody to mark
  • Priority changed from assess to critical
  • Status changed from new to assigned
  • Milestone set to 1.8

We won't go to just BCP47 language tags, since we have a subset. But the formulation does need to be updated, and we do need to change QU to EU.
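For consumers that still hold data tagged with the old private-use code, the QU to EU change could be handled with a small alias lookup. This is a minimal sketch, not how CLDR itself implements it: the table name and function are hypothetical, and the only mapping taken from this ticket is QU to EU.

```python
# Hypothetical alias table; the only entry grounded in this ticket is QU -> EU.
REGION_ALIASES = {"QU": "EU"}


def canonical_region(code: str) -> str:
    """Return the preferred region code, upper-cased, following known aliases."""
    upper = code.upper()
    return REGION_ALIASES.get(upper, upper)
```

For example, `canonical_region("qu")` yields `"EU"`, while codes without an alias, such as `"FR"`, pass through unchanged.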

comment:2 Changed 8 years ago by mark

  • Milestone changed from 1.8 to 1.7.2

comment:3 Changed 8 years ago by mark

  • Status changed from assigned to closed
  • Resolution set to fixed
