Re: lowercased Unicode language tags? (was: ISO 15924)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun May 02 2004 - 22:16:57 CDT


Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> ISO 15924 alpha-4 codes are already distinguishable from ISO 639 and
>> ISO 3166 codes, simply by virtue of being four letters long.
>
> Not really: Many ISO 3166-3 codes (for former countries or territories
> or those that have changed their code) are also 4 letters.
>
> For example ZRCD designates the former Zaïre (now Dem. Rep. of Congo),
> DDDE the former Dem. Rep. of Germany (now unified with Germany), BUMM
> is the former Kingdom of Burma (now Myanmar).
>
> And there are also ISO 3166-2 codes for administrative regions in
> countries (such as FR2B for the department of Haute-Corse in France).

Neither ISO 3166-3 nor (perhaps more annoyingly) ISO 3166-2 codes are
allowed in RFC 3066 language tags. So at least in that context, there
is no possibility of confusing them with ISO 15924 script codes.

In any case, when using ISO 3166-2 region codes, they are supposed to be
separated from the ISO 3166-1 country code by a hyphen ("FR-2B").
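
As a minimal sketch of that convention (the function name is hypothetical, and the 1-to-3-character limit on the subdivision part is my reading of ISO 3166-2, not something taken from a published API), the hyphenated form splits cleanly into its two parts, while an unhyphenated string such as "FR2B" is rejected:

```python
def split_subdivision(code):
    """Split a hyphenated ISO 3166-2 style code into (country, subdivision).

    Hypothetical helper for illustration only. It checks the shape of the
    code, not whether the country or subdivision actually exists.
    """
    country, sep, subdivision = code.partition("-")
    if not sep:
        raise ValueError("expected a hyphen, as in 'FR-2B': %r" % code)
    if len(country) != 2 or not (country.isascii() and country.isalpha()):
        raise ValueError("country part must be two letters: %r" % code)
    if not (1 <= len(subdivision) <= 3) or not (
        subdivision.isascii() and subdivision.isalnum()
    ):
        raise ValueError("subdivision part must be 1-3 alphanumerics: %r" % code)
    return country.upper(), subdivision.upper()


print(split_subdivision("FR-2B"))  # ('FR', '2B')
```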

> Languages need not only distinctions by countries but also by regions
> in countries, if this is needed.
> So Catalan in the Spanish Canaries would use the ISO3166 code "ESCI"
> after the language tag "es" (the complete code would be "es-Latn-ESCI"
> or just "es-ESCI", distinct from "es-Latn" which could be used also
> for Castilian).

There isn't actually such a code as ES-CI (note the hyphen, which makes
it distinguishable from a 4-letter script code). You would have to use
ES-GC for Las Palmas, for example, or ES-TF for Santa Cruz de Tenerife.

And again, RFC 3066 language tags don't allow for the use of these ISO
3166-2 region codes. I'm not quite sure why this is; I think it might
be useful on occasion to be able to encode:

    es-US-CA
    es-US-FL
    es-US-NY

to identify the Mexican-, Cuban-, and Puerto Rican-influenced dialects of
Spanish spoken in California, Florida, and New York respectively. Perhaps
the variable length of ISO 3166-2 codes poses too many problems in
parsing.
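
To illustrate the kind of parsing problem I am guessing at, here is a rough sketch. The function name and the rules are my own assumptions (two letters for a country, four letters for a script, one to three alphanumerics for an ISO 3166-2 subdivision part); it is not the grammar of any RFC. It lists every role a non-initial subtag could play if length and character class were the only clues:

```python
def subtag_roles(subtag):
    """Return the roles a non-initial subtag could have, judged by shape alone.

    Illustrative assumptions only: country = 2 letters, script = 4 letters,
    subdivision = 1 to 3 letters or digits.
    """
    roles = []
    if subtag.isascii() and subtag.isalpha():
        if len(subtag) == 2:
            roles.append("country")
        if len(subtag) == 4:
            roles.append("script")
    if subtag.isascii() and subtag.isalnum() and 1 <= len(subtag) <= 3:
        roles.append("subdivision")
    return roles


print(subtag_roles("CA"))    # ['country', 'subdivision']
print(subtag_roles("2B"))    # ['subdivision']
print(subtag_roles("Latn"))  # ['script']
```

Under these assumptions the "CA" in es-US-CA has the same shape as a country code, so a parser would have to rely on position rather than length to read it as California.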

> I think that the wording of TUS 4.0 chapter 15 may create confusion,
> unless this confusion is already handled in RFC 3066 related to
> language tags (in which ISO 639, ISO 15924 and ISO 3166 are only
> defining a part of its subtags). The solution to this apparent
> contradiction is to be found in the successor of RFC 3066... And Unicode
> should then be updated to make a better normative reference than just
> the current RFC 3066...

The successor to RFC 3066 is already on its way. It will allow ISO
3166-1 country subtags and ISO 15924 script subtags to coexist, and be
used in a generative way instead of by registering each combination
(still no ISO 3166-2, though). Any potential confusion between country
and script subtags is resolved by the length of the subtag.
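
A minimal sketch of that length rule follows. The function name is hypothetical, and the simplified tag shape (a language subtag followed by optional script and country subtags) is an assumption for illustration, not the actual syntax of the successor draft:

```python
def classify_subtags(tag):
    """Label the subtags of a hyphen-separated language tag.

    Assumed simplification: the first subtag is the language; after that,
    a four-letter subtag is a script and a two-letter subtag is a country.
    Anything else is labeled "other" rather than guessed at.
    """
    subtags = tag.split("-")
    labels = [("language", subtags[0].lower())]
    for subtag in subtags[1:]:
        letters_only = subtag.isascii() and subtag.isalpha()
        if letters_only and len(subtag) == 4:
            labels.append(("script", subtag.title()))
        elif letters_only and len(subtag) == 2:
            labels.append(("country", subtag.upper()))
        else:
            labels.append(("other", subtag))
    return labels


print(classify_subtags("es-Latn-ES"))
# [('language', 'es'), ('script', 'Latn'), ('country', 'ES')]
```

The casing applied here (lowercase language, title-case script, uppercase country) is only the customary presentation; the classification itself depends on length alone.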

There is no "better normative reference" to language tags than RFC 3066
and its successor. In fact, the successor RFC will include special
"stability" provisions to handle situations where ISO 3166-1 codes are
reassigned, so if anything it will be greater than the sum of its parts.

As for chapter 15, or specifically section 15.10, if Unicode ever makes
a change it will probably be to deprecate the tag characters entirely,
or to downplay their existence. You will probably not see any other
"updating" of the language tag section.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


