Re: Language identifier proposals request

From: Misha Wolf (MISHA.WOLF@reuters.com)
Date: Mon Sep 04 1995 - 13:13:03 EDT


Reuters is likely to be using the approach described by Asmus in his mail of
September 2, with one difference: The primary language ID and the secondary
language ID will be separately encoded in the Private Use Area. The 64
values starting at F000 (F000 to F03F) will be used for the secondary
language ID, and the following 1024 values (F040 to F43F) will be used
for the primary language ID. In many
applications the secondary language ID will be superfluous and so will be
omitted. Hence, the language ID will occupy either 16 or 32 bits within a
data stream. Within an API, the primary and secondary IDs will be
recombined into 16 bits.
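
To make the arithmetic concrete, here is a rough sketch in Python of how the
two Private Use Area ranges might be handled. Only the range sizes and the
F000 starting point come from the description above. The function names, the
order of the two code units in the stream, and the bit layout of the
recombined 16-bit value are assumptions made for illustration only.

```python
# Ranges as described: 64 secondary values starting at U+F000,
# followed immediately by 1024 primary values.
SECONDARY_BASE = 0xF000
SECONDARY_COUNT = 64
PRIMARY_BASE = SECONDARY_BASE + SECONDARY_COUNT  # 0xF040
PRIMARY_COUNT = 1024                             # 0xF040..0xF43F


def encode_language_id(primary, secondary=None):
    """Return the 16-bit code units for a language ID (one or two units)."""
    if not 0 <= primary < PRIMARY_COUNT:
        raise ValueError("primary ID out of range")
    units = [PRIMARY_BASE + primary]
    if secondary is not None:
        if not 0 <= secondary < SECONDARY_COUNT:
            raise ValueError("secondary ID out of range")
        # Assumption: the optional secondary unit follows the primary unit.
        units.append(SECONDARY_BASE + secondary)
    return units


def decode_unit(unit):
    """Classify one code unit; return None if it is not a language ID."""
    if SECONDARY_BASE <= unit < PRIMARY_BASE:
        return ("secondary", unit - SECONDARY_BASE)
    if PRIMARY_BASE <= unit < PRIMARY_BASE + PRIMARY_COUNT:
        return ("primary", unit - PRIMARY_BASE)
    return None


def combine_for_api(primary, secondary=0):
    """Recombine both IDs into 16 bits.

    Assumption: secondary in the top 6 bits, primary in the low 10 bits.
    """
    return (secondary << 10) | primary
```

With these assumptions a primary-only ID occupies one code unit (16 bits) and
a primary plus secondary ID occupies two (32 bits), matching the sizes given
above.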

In his mail of September 2, Asmus added in passing that:

| Since the tags thus fit into 16-bits, one can play all sorts of games
| with how to insert them into a stream of Unicodes. For a pseudo
| plain-text approach you could insert ESC <xxx> <yyy> where <xxx> is a
| code that designates that this is a language id escape and <yyy> can
| immediately be the language id.

Note that any scheme using standardised control functions (such as Escape)
must agree with ISO 6429 'Control functions for coded character sets'. This
can, obviously, be achieved in one of two ways: (i) by conforming to the
current version of ISO 6429, or (ii) by causing that standard to be
modified.

Keld wrote on September 3:

| The language codes from ISO 639 I distributed earlier are normally
| extended with a country code from ISO 3166, so you can get
| "American English" or "British English" by saying resp. en_US or
| en_GB (in POSIX locale notation - a similar notation is being
| proposed in the Internet).

Reuters considered and rejected the ISO 639/3166 scheme, partly because of
the transient nature of the ISO 3166 country code qualifiers. As Asmus wrote
on September 3:

| One issue a lot of practitioners have is that the rules of the ISO
| standards have not addressed the issue of 'permanency' of tags, this is
| especially worrisome for the country tags. If these tags are to be useful,
| they need to be applicable to archiving purposes, so once a tag exists, it
| must exist forever, although if a country goes away, new data wouldn't use
| it any more.

Take the example of "ru_SU" (ie Russian as used in the USSR). This was a
valid ISO 639/3166 language ID until December 15, 1993 when the Fourth
edition of ISO 3166 did away with the USSR and introduced a raft of new
countries, eg the Russian Federation. So, now we have "ru_RU" (ie Russian
as used in the Russian Federation). The old code "ru_SU" no longer has any
meaning and, what is more, there is no guarantee that "SU" will not be
reallocated after the five year quarantine period is up. I have no quarrel
with the abolition of the USSR. I do, though, have a quarrel with a
language ID scheme which uses volatile IDs.
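
The archiving problem can be sketched in a few lines of Python. The two
country-code sets below are hypothetical snapshots containing only the codes
mentioned in this thread, not the real ISO 3166 tables, and is_valid() is an
illustrative helper rather than part of any standard.

```python
# Hypothetical snapshots of the country-code table before and after the
# change described above; only codes named in this thread are listed.
CODES_BEFORE = {"SU", "US", "GB"}
CODES_AFTER = {"RU", "US", "GB"}


def is_valid(tag, country_codes):
    """Check a POSIX-style tag such as 'ru_SU' against one snapshot."""
    language, _, country = tag.partition("_")
    return len(language) == 2 and country in country_codes


# Data archived under the old table stops validating under the new one.
old_tag_then = is_valid("ru_SU", CODES_BEFORE)  # True
old_tag_now = is_valid("ru_SU", CODES_AFTER)    # False
new_tag_now = is_valid("ru_RU", CODES_AFTER)    # True
```

A validator that consults only the current table rejects archived data tagged
"ru_SU", which is the kind of volatility objected to here.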

Regards,
misha.wolf@reuters.com


