From: verdy_p (verdy_p@wanadoo.fr)
Date: Sun Feb 01 2009 - 21:27:05 CST
> Message of 01/02/09 17:07
> From: "Doug Ewell" 
> To: "Unicode Mailing List" 
> Cc: verdy_p@wanadoo.fr
> Subject: Re: Error on Language Codes page.
> 
> 
> Philippe Verdy  wrote:
> 
> >> That page continues to trouble me, because of its recommendation to 
> >> use ISO 639-1 codes for Hebrew, Indonesian, and Yiddish that were 
> >> withdrawn from that standard 20 years ago.
> >
> > These three cases are not a problem: did you note the asterisk after 
> > these codes:
> 
> The text explains the asterisk as identifying the older codes that users 
> are being told to prefer over the newer codes.
No. That's not what the ISO 639-1/RA used in its code list.
> > they are also present in ISO 639, and mean deprecated codes.
> 
> Codes that are withdrawn from a standard in the ISO 639 family are not 
> still present in the standard. See the official text file provided by 
> ISO 639-2/RA at:
> 
> http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
We are speaking of old ISO 639-1 codes. Your reference to the ISO 639-2/RA is completely 
off topic here. ISO 639-2 was designed with a single code kept, ignoring the legacy codes that were still 
present in Part 1 of the standard.
Note however that the public website for ISO 639-1 changed a few years ago (I can't remember exactly when; in 2005, 
I think), without any change to the standard. It had been unmaintained for a long time and was moved back to the 
ISO web site for archiving. There have been no significant changes. Part 1 of the standard did not have much text; 
it was merely the code list itself. However, during the transition, the asterisk in the list was not explained in 
the same way.
It's true that some codes have been deprecated, but in fact none of them has really been deleted, 
because at that time there were still widely deployed applications that referenced the old codes. And I think 
that these applications still exist today.
The same was true when Part 2 of the standard was published (for 3-letter codes): there were still widely 
deployed codes used by librarians for bibliographic interchange, and they sometimes differed from the codes 
used by application and OS builders or in communication standards. Those standards include MIME in email and 
other related "RFC"/"BCP" standards published by the IETF, and SGML, initially created by content publishers like 
newspapers and advertising agencies, a set of standards partly adapted to the Web as HTML, published by CERN and 
later used by the other organisations that joined to create the W3 Consortium. So duplicate codes exist there too, 
but the ISO 639-2 standard clearly says that the bibliographic codes are needed for compatibility with existing 
best practices adopted and deployed long ago by librarians. Here also there's an asterisk in the published lists 
of codes.
The asterisk itself is not part of the standard or part of the code. It is just a reference to a note. However, 
this note was not formally described in Part 1; it was clearer in Part 2, where it directly indicated which of the 
two codes is the bibliographic code, the other one being the technical code recommended for all applications EXCEPT 
bibliographic ones. The bibliographic codes have NOT been deprecated.
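To make the two sets of Part 2 codes concrete, here is a minimal sketch in Python. The table is only a small 
illustrative subset of the pairs, and the names B_TO_T, to_terminology and same_language are mine, not anything 
defined by the standard:

```python
# Illustrative subset of ISO 639-2 pairs: bibliographic (B) code -> technical/terminology (T) code.
# Most languages have a single alpha-3 code and therefore do not appear here.
B_TO_T = {
    "fre": "fra",  # French
    "ger": "deu",  # German
    "chi": "zho",  # Chinese
    "dut": "nld",  # Dutch
    "gre": "ell",  # Modern Greek
    "cze": "ces",  # Czech
    "wel": "cym",  # Welsh
}


def to_terminology(code: str) -> str:
    """Return the T code for a known B code; any other code passes through unchanged."""
    code = code.lower()
    return B_TO_T.get(code, code)


def same_language(a: str, b: str) -> bool:
    """Treat the B and T codes of one language as equal."""
    return to_terminology(a) == to_terminology(b)
```

An application that receives library records could normalize with to_terminology() on input and still accept 
both forms, since neither set has been deprecated.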
> or the corresponding HTML versions, such as one sorted by ISO 639-2 code 
> at:
> 
> http://www.loc.gov/standards/iso639-2/php/code_list.php
Also off topic: wrong standard. None of the alpha-2 codes in this list is normative. ISO 639-2 says absolutely 
nothing about alpha-2 codes, only about alpha-3 codes (in two sets). The alpha-2 mapping there is just used to 
indicate one of the possible codes in Part 1.
Anyway, whichever part of the ISO 639 standards suite you look at, there has always remained a severe ambiguity 
about which code to use when several distinct parts of the standard had to be used simultaneously (because of their 
incompleteness). Only BCP 47 has solved these ambiguities, by defining effective recommendations for best practice 
and then allowing the other codes as aliases. (The most significant change in BCP 47 has been to abandon the 
exclusive meaning of codes for language families or collections; this decision was agreed in ISO 639 Part 5, but 
is still not applied in the older Parts 1 and 2 and has no consequence in Part 3.)
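As a rough illustration of the alias approach, here is a sketch in Python covering only the three withdrawn alpha-2 
codes discussed in this thread. The function name is hypothetical, and a real implementation would read the 
preferred values from the IANA language subtag registry instead of hard-coding them:

```python
# The three withdrawn ISO 639-1 codes from this thread and their current replacements.
DEPRECATED_TO_PREFERRED = {
    "iw": "he",  # Hebrew
    "in": "id",  # Indonesian
    "ji": "yi",  # Yiddish
}


def canonical_language_tag(tag: str) -> str:
    """Replace a deprecated primary language subtag by its preferred alias.

    Only the first subtag is touched; the remaining subtags are kept as given.
    Underscore separators (as in Java-style locale identifiers) are accepted.
    """
    subtags = tag.replace("_", "-").split("-")
    primary = subtags[0].lower()
    subtags[0] = DEPRECATED_TO_PREFERRED.get(primary, primary)
    return "-".join(subtags)
```

With this, canonical_language_tag("iw-IL") gives "he-IL", while a tag such as "en-US" comes back unchanged, so old 
data stays readable without encouraging new data to use the old codes.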
> There are also lists sorted by English or French language name. You 
> will not find the withdrawn codes in these lists, or anywhere on the 
> RA's official site except on their change page, where they do use both 
> "deprecated" and "withdrawn" to refer to these codes, which is 
> misleading since these are not synonyms.
> 
> You can certainly find older lists, provided by third parties, that 
> differ from the official standard. These lists are available at places 
> like:
> 
> http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
Unmaintained lists are not good references either. Why do you need to cite them? There are TONS of unmaintained 
copies on the web: they are just there to show which subset of the ISO 639 standard is supported by these sites 
(or by the applications that they describe). As long as these lists are not changed there, you can just assume 
that these applications do not support the newer codes, or have not deprecated the older codes.
To see whether new codes are usable, or whether deprecated codes are still supported or must be replaced, you have 
to look at each specific area of support. The bad thing about the sites that publish these old/unmaintained copies 
is that they persist in saying that these are the official code lists, but forget to give the date at which the 
snapshots of these codes were taken and the application for which they were copied, and they claim that these are 
standards without correctly referencing the actual source.
There's nothing wrong with these lists, however, as long as they are kept only as references for what a given site 
currently uses and supports, and provided that the official source of the list is given and that it is dated (ISO 
standards have no version number other than the date of publication). Keep in mind that these private copies are 
often maintained locally because they reflect local use, most often in locale identifiers spread throughout 
databases, documents and applications. Changing all these locations can take a lot of time (or may be impossible 
if the identifiers are present in digitally signed documents and applications that cannot be modified in archives), 
and it can generate additional costs for the customers of these apps if the changes are not properly announced and 
supported: documents must either be changed, or adapted on the fly by some proxying interface that has to be 
developed and maintained specifically, with the hope that this won't create ambiguities or conflicts.
The cost of changing codes (notably those used in locale identifiers) is really tremendous (and probably much 
higher than the cost of the changeover from the national currencies to the Euro, if it can affect all existing 
codes without notice). That's why you need stability. And stability means that, whatever the ISO 639 standard says, 
it cannot really "delete" a code from a standard: we know that this only has the effect of deprecating the code, 
except when a code is reused for something else. See the undesirable effects of the reassignment of "CS" from the 
deprecated Czechoslovakia to the temporary and now dead Serbia-and-Montenegro just a few years later: in fact, 
everywhere a conflict was detected, the "YU" code was maintained. Even today, the "SU" code is still used, and it 
will survive for a very long time as it is used in domain names, so ISO 3166-1 cannot even completely delete it and 
can only deprecate it indefinitely by giving it a special reserved status. The same is true for "FX", which is 
still used by INSEE in France in its published statistics, which are needed and used by lots of public or private 
organisations, or for "UK" in Britain...
Did you even know that Java running on a Hebrew version of Windows will not load the Hebrew localized resources 
if they use the recommended "he" code? ("iw" still had to be used at least in Java 5; I've not checked whether this 
is still the case in Java 6.) The current model for localized resources in Java is very simplistic and can't be 
changed significantly without creating compatibility problems. Unfortunately, this interface is widely used 
throughout the rest of the JRE API, so you can't even deprecate it or the way it resolves the various locales. 
There's still no clear extension in the API to define "aliases" for locale codes, and Java still does not support 
the full BCP 47 structure (for example, there's still no support for the script code between the language code and 
the region code, and there's still no aliasing mechanism, except within the internal/hidden part of the OS-specific 
implementation...).
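For what it's worth, the kind of aliasing I have in mind could look like the sketch below. It is written in Python 
to keep it short, and it is NOT how Java's resource loading actually works: the "Messages" base name, the function 
names and the lookup order are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional

# Current code -> legacy code still found in older resource files (assumed alias table).
LEGACY_ALIASES = {"he": "iw", "id": "in", "yi": "ji"}


def candidate_names(base: str, language: str) -> List[str]:
    """List resource names to try, most specific first.

    The current code is tried first, then its legacy alias (if any),
    then the unlocalized base name as a last resort.
    """
    names = [f"{base}_{language}"]
    legacy = LEGACY_ALIASES.get(language)
    if legacy is not None:
        names.append(f"{base}_{legacy}")
    names.append(base)
    return names


def find_bundle(available: Iterable[str], base: str, language: str) -> Optional[str]:
    """Return the first candidate that exists among the available names, else None."""
    present = set(available)
    for name in candidate_names(base, language):
        if name in present:
            return name
    return None
```

With such a lookup, resources shipped under either "Messages_he" or "Messages_iw" would be found for a Hebrew 
locale, without renaming anything that is already deployed.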
> > The HTML page above correctly gives the current recommended codes (the 
> > other codes with the asterisk are non-recommended codes, that are 
> > still implicitly aliases that may be supported as they have still not 
> > been reassigned to other languages;
> 
> They aren't still supported by the ISO 639 authorities. The reason they 
> have not been reassigned is so that *older, existing* data that uses 
> these codes can still be interpreted correctly. That is very different 
> from encouraging people to continue using these codes going forward.
> 
> > anyway, there will probably be no more alpha-2 code assigned in any 
> > part of ISO 639,
> 
> While probably true, this has little or no relevance to the rest of the 
> thread.
No, this is on topic. This thread started with the use of alpha-2 codes in an (old) page maintained by Unicode. 
Either this page should be deleted, or notes should be added to it to specify its status and remove the 
ambiguities, saying that none of these codes is a recommendation made by the UTC.
> 'in' and 'iw' and 'ji' were withdrawn from ISO 639 in 1989. That was a 
> *long* time ago in computing.
No. 20 years is definitely not old: applications and documents written 20 years ago will survive and will maintain 
compatibility. What does not last long is just their "freshness", or suitability for the current market: they have 
become insufficient, but certainly not old. Almost every technical standard ever made in computing has survived; 
the technologies have been widely reused and integrated into others that can't live now without the old ones on 
which they were built. We still find programmers for COBOL, FORTRAN, C, BASIC, or users of ASCII only, even though 
all of these were defined between the 1950s and the early 1970s. The same is true of most data compression 
algorithms. There are technical standards that have survived for centuries (think of weights and measures: even if 
the imperial measures are no longer an official international standard, they are still the norm in some domains 
like maritime navigation and aeronautics).
> Telling people that they should "write 
> the [oldest] one... for legacy applications that cannot manage correctly 
> the new standard code or for classes of applications for which you are 
> not certain that they can use the new standard," without citing specific 
> legacy applications that have this constraint, is like telling people 
> that they should continue to use the old Unicode 1.1 Hangul syllables in 
> the U+3400 to U+4DFF range instead of the newfangled Unicode 2.0 Hangul 
> syllables.
The case of Hangul is not really a problem: there was not even a single approved technical standard in use in 
Korea at that time. Even though Unicode was getting started, there was no clear agreement about the approach to 
use. In fact, even Unicode was not fully in agreement with ISO 10646 at that time... When the two standards were 
merged and agreed to cooperate, it was a good solution, but this effectively created a new standard that is 
unrelated to the standards used by ISO or Unicode before.
For me, Unicode 1.x and Unicode 2+ are unrelated standards, the same way as ISO/IEC 10646 before and after the 
merge; and in the same way, ISO 639-1 is not related to ISO 639-2 even if there's some large overlap: they 
obey different definitions and policies and can't be used in the same domains of application. You can still 
say that Unicode 1.1 is not recommended, but it is not deleted, and it will survive indefinitely (as long as there 
are people who need to refer to it).
However, the UTC could decide to stop bearing the cost of keeping these old versions alive, by putting these 
old standards in the public domain and allowing anyone to support them for as long as needed. (I think that 
there's no real problem in getting copies of these old documents from many places now, but the UTC could decide, 
before that, to digitally sign the authentic versions, if multiple independent remote archives are not judged 
reliable or can insert their own modifications, errors, or omissions.) My opinion is that the initial publication 
as a book is enough: the book is kept and archived in official public libraries, and can be made available online 
as a scanned PDF (using a printed and scanned format instead of plain text/tabular data could also prevent easy 
reuse of old standards in the future, when they are no longer recommended). If not everything was in a book, a 
summary of the additional documents could be converted to a "fascicule" in PDF format and sent to the same archives 
as a republication displaying the initial publication data in an encoded version.