From: verdy_p (verdy_p@wanadoo.fr)
Date: Sun Feb 01 2009 - 21:27:05 CST
> Message of 01/02/09 17:07
> From: "Doug Ewell"
> To: "Unicode Mailing List"
> Cc: verdy_p@wanadoo.fr
> Subject: Re: Error on Language Codes page.
>
>
> Philippe Verdy wrote:
>
> >> That page continues to trouble me, because of its recommendation to
> >> use ISO 639-1 codes for Hebrew, Indonesian, and Yiddish that were
> >> withdrawn from that standard 20 years ago.
> >
> > These three cases are not a problem: did you note the asterisk after
> > these codes:
>
> The text explains the asterisk as identifying the older codes that users
> are being told to prefer over the newer codes.
No. That's not how the ISO 639-*1*/RA used it in its code list.
> > they are also present in ISO 639, and denote deprecated codes.
>
> Codes that are withdrawn from a standard in the ISO 639 family are not
> still present in the standard. See the official text file provided by
> ISO 639-2/RA at:
>
> http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt
We are speaking of old ISO 639-*1* codes. Your reference to ISO 639-*2*/RA is completely off topic here. ISO 639-2
was designed with a single code kept per language, ignoring the legacy codes that were still present in Part 1 of
the standard.
Note however that the public website for ISO 639-1 changed a few years ago (I can't remember exactly when; it was
in 2005, I think), without any change to the standard. It had been unmaintained for a long time and was moved back
to the ISO web site for archiving. There have been no significant changes. Part 1 of the standard did not have much
text; it was merely the code list itself. However, during the transition, the asterisk in the list was not
commented the same way.
It's true that some codes have been deprecated, but in fact none of them have really been deleted, because at that
time there still existed widely deployed applications that referenced the old codes. And I think that these
applications still exist today.
The same was true when Part 2 of the standard was published (for 3-letter codes): there still existed widely
deployed codes used by librarians for bibliographic interchange, and they were sometimes different from the codes
used by application and OS builders or in communication standards (like MIME in email and other related "RFC"/"BCP"
standards published by the IETF, and SGML, initially created by content publishers like newspapers and advertising
agencies, a set of standards partly adapted to the Web as HTML, published by CERN and later used by other
organisations that joined to create the W3 Consortium). So there also exist duplicate codes, but the ISO 639-2
standard clearly says that the bibliographic codes are needed for compatibility with existing best practices
adopted and deployed long ago by librarians. Here also there's an asterisk in the published lists of codes.
The asterisk itself is not part of the standard or part of the code; it just references a note. However, this note
was not formally described in Part 1, while it was clearer in Part 2, as it directly indicated which of the two
codes is the bibliographic code, the other one being the terminology code recommended for all applications EXCEPT
bibliographic applications, where the bibliographic codes have NOT been deprecated.
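To illustrate the dual coding, here is a minimal Java sketch of the B-to-T mapping, using a few of the pairs from
the published ISO 639-2 list (the class and method names are hypothetical, not from any standard API):

import java.util.HashMap;
import java.util.Map;

// A few of the ISO 639-2 entries where the bibliographic (B) code
// differs from the terminology (T) code. The code pairs come from the
// published ISO 639-2 list; the class and method names are hypothetical.
public final class Iso639Part2 {
    private static final Map<String, String> B_TO_T = new HashMap<String, String>();
    static {
        B_TO_T.put("fre", "fra"); // French
        B_TO_T.put("ger", "deu"); // German
        B_TO_T.put("dut", "nld"); // Dutch
        B_TO_T.put("chi", "zho"); // Chinese
        B_TO_T.put("cze", "ces"); // Czech
    }

    /** Returns the terminology (T) code recommended outside bibliographic use. */
    public static String terminologyCode(String alpha3) {
        String t = B_TO_T.get(alpha3);
        return (t != null) ? t : alpha3; // most languages have a single code
    }

    public static void main(String[] args) {
        System.out.println(terminologyCode("fre")); // prints "fra"
        System.out.println(terminologyCode("eng")); // prints "eng" (single code)
    }
}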
> or the corresponding HTML versions, such as one sorted by ISO 639-2 code
> at:
>
> http://www.loc.gov/standards/iso639-2/php/code_list.php
Also off topic: wrong standard. None of the alpha-2 codes in this list are normative. ISO 639-2 says absolutely
nothing about alpha-2 codes, only about alpha-3 codes (in two sets). The alpha-2 mapping there is just used to
cross-reference one of the possible codes in Part 1.
Anyway, even if you look at any part of the ISO 639 standards suite, there has always remained a severe ambiguity
about which code to use when several distinct parts of the standard had to be used simultaneously (because of their
incompleteness). Only BCP 47 has solved these ambiguities, by defining effective recommendations for best practice
and then allowing the other codes as aliases (the most significant change in BCP 47 has been to abandon the
exclusive meaning of codes for language families or collections; this decision was agreed in ISO 639 Part 5, but is
still not applied in the older Parts 1 and 2 and has no consequence in Part 3).
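For illustration: the IANA Language Subtag Registry that backs BCP 47 records each deprecated subtag with a
Preferred-Value, so the aliasing can be resolved mechanically. A minimal Java sketch for the three codes discussed
in this thread (the class and method names are hypothetical):

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of BCP 47 primary-subtag canonicalization. The three mappings
// are the Preferred-Values recorded in the IANA Language Subtag Registry;
// the class and method names are hypothetical.
public final class Bcp47Canonicalizer {
    private static final Map<String, String> PREFERRED = new HashMap<String, String>();
    static {
        PREFERRED.put("iw", "he"); // Hebrew
        PREFERRED.put("in", "id"); // Indonesian
        PREFERRED.put("ji", "yi"); // Yiddish
    }

    /** Replaces a deprecated primary language subtag with its Preferred-Value. */
    public static String canonicalize(String tag) {
        String[] parts = tag.split("-", 2);
        // Locale.ROOT avoids locale-sensitive case mapping (e.g. Turkish dotless i).
        String primary = parts[0].toLowerCase(Locale.ROOT);
        if (PREFERRED.containsKey(primary)) {
            primary = PREFERRED.get(primary);
        }
        return (parts.length == 1) ? primary : primary + "-" + parts[1];
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("iw-IL")); // prints "he-IL"
        System.out.println(canonicalize("en-GB")); // prints "en-GB", unchanged
    }
}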
> There are also lists sorted by English or French language name. You
> will not find the withdrawn codes in these lists, or anywhere on the
> RA's official site except on their change page, where they do use both
> "deprecated" and "withdrawn" to refer to these codes, which is
> misleading since these are not synonyms.
>
> You can certainly find older lists, provided by third parties, that
> differ from the official standard. These lists are available at places
> like:
>
> http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
Unmaintained lists are also not good references. Why do you need to cite them? There are TONS of unmaintained
copies on the web: they are just there to display which subset of the ISO 639 standard is supported by these sites
(or by the applications that they describe). As long as these lists are not changed there, you can just assume that
these applications do not support the newer codes, or have not deprecated the older codes.
To see if new codes are usable, or if deprecated codes are still supported or must be replaced, you have to look at
their specific support area. The bad thing about the sites that publish these old/unmaintained copies is that they
persist in saying that these are the official code lists, but forget to give the date at which the snapshots of
these codes were taken, and for which application they were copied, and they claim that these are standards without
correctly referencing the actual source.
There's nothing wrong, however, about these lists, as long as they are left only as references for what a given
site currently uses and supports, provided that the official source of the list is given and dated (there's no
versioning number in ISO standards other than the date of publication). Care must be taken that these private
copies are often maintained locally because they reflect local use, most often in locale identifiers spread
throughout databases, documents and applications; changing all these locations can take a lot of time (or could be
impossible if they are present in digitally signed documents and applications that cannot be modified in archives),
and it can generate additional costs for the customers of these apps if changes are not properly announced and
supported as well (the additional costs may mean that documents must either be changed, or adapted on the fly by
some proxying interface that must be specifically developed and maintained, with the hope that this won't create
ambiguities or conflicts).
The cost of changing codes (notably those used in locale identifiers) is really tremendous (probably much higher
than the cost of the changeover from national currencies to the Euro, if it can affect all existing codes without
notice). That's why you need stability, and stability means that in fact, whatever the ISO 639 standard says, it
cannot really "delete" a code from a standard: we know that this only has the effect of deprecating codes, except
when a code is reused for something else. See the undesirable effects of the reassignment of "CS" from deprecated
Czechoslovakia to the temporary and now defunct Serbia-and-Montenegro only a few years later: in fact, everywhere a
conflict was detected, the "YU" code was maintained. Even today, the "SU" code is still used and will survive for a
very long time, as it is used in domain names, so ISO 3166-1 cannot even completely delete it and can only
deprecate it indefinitely by giving it a special reserved status. The same is true for "FX", which is still used by
INSEE in France in its published statistics, needed and used by lots of public and private organisations, or for
"UK" in Britain...
Did you even know that Java running on a Hebrew version of Windows will not load the Hebrew localized resources if
they use the recommended "he" code? ("iw" still had to be used, at least in Java 5; I've not checked whether this
is still the case in Java 6.) The current model for localized resources in Java is very simplistic and can't be
changed significantly without creating compatibility problems. Unfortunately, this interface is widely used
throughout the rest of the JRE API, so you can't even deprecate it or the way it resolves the various locales;
there's still no clear extension in the API to define "aliases" for locale codes, and Java still does not support
the full BCP 47 structure (for example, there's still no support for the script code between the language code and
the region code, and there's still no aliasing mechanism, except within the internal/hidden parts of the
OS-specific implementation...)
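The behaviour is easy to reproduce, because the Locale constructor itself silently maps the new codes back to the
old ones, so resource bundles must still be named with the legacy codes. A minimal sketch (this matches the Java
5/6 behaviour described above; more recent JDKs have changed this default):

import java.util.Locale;

// Demonstrates the legacy-code normalization in java.util.Locale: the
// constructor converts "he" back to "iw" (and "id" to "in", "yi" to "ji")
// for compatibility, so a bundle named Messages_he.properties is never
// found; it must be named Messages_iw.properties instead.
public class LegacyLocaleDemo {
    public static void main(String[] args) {
        Locale hebrew = new Locale("he");
        System.out.println(hebrew.getLanguage());           // prints "iw"
        System.out.println(new Locale("id").getLanguage()); // prints "in"
    }
}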
> > The HTML page above correctly gives the current recommended codes (the
> > other codes with the asterisk are non-recommended codes, which are
> > still implicitly aliases that may be supported, as they have still not
> > been reassigned to other languages;
>
> They aren't still supported by the ISO 639 authorities. The reason they
> have not been reassigned is so that *older, existing* data that uses
> these codes can still be interpreted correctly. That is very different
> from encouraging people to continue using these codes going forward.
>
> > anyway, there will probably be no more alpha-2 code assigned in any
> > part of ISO 639,
>
> While probably true, this has little or no relevance to the rest of the
> thread.
No, this is on topic. This thread started with the use of alpha-2 codes in an (old) page maintained by Unicode.
Either this page should be deleted, or notes should be added to it to specify its status and remove the
ambiguities, saying that none of these codes are a recommendation made by the UTC.
> 'in' and 'iw' and 'ji' were withdrawn from ISO 639 in 1989. That was a
> *long* time ago in computing.
No. 20 years is definitely not old: applications and documents written 20 years ago will survive and will maintain
compatibility. What is long gone is just their "freshness" or adequacy to the current market: they have become
insufficient, but certainly not old. Almost every technical standard created in computing has survived; the
technologies have been widely reused and integrated into others that can't live now without the old ones on which
they were built. We still find programmers for COBOL, FORTRAN, C, BASIC, or users of ASCII only, even though all of
these were defined in the 1960s. The same is true of most data compression algorithms. There are technical
standards that have survived centuries (think weights and measures: even if the imperial measures are no longer an
official international standard, they are still mandatory in some domains like maritime navigation and
aeronautics).
> Telling people that they should "write
> the [oldest] one... for legacy applications that cannot manage correctly
> the new standard code or for classes of applications for which you are
> not certain that they can use the new standard," without citing specific
> legacy applications that have this constraint, is like telling people
> that they should continue to use the old Unicode 1.1 Hangul syllables in
> the U+3400 to U+4DFF range instead of the newfangled Unicode 2.0 Hangul
> syllables.
The case of Hangul is not really a problem: there was not even a single approved technical standard for use in
Korea at that time. Even if Unicode was starting, there was no clear agreement about the approach to use. In fact,
even Unicode was not fully in agreement with ISO 10646 at that time... When the two standards were merged and
agreed to cooperate, it was a good solution, but this effectively created a new standard that is unrelated to the
standards used by ISO or Unicode before.
For me, Unicode 1.x and Unicode 2+ are unrelated standards, in the same way as ISO/IEC 10646 before and after the
merge; and likewise, ISO 639-1 is not related to ISO 639-2, even if there's some large overlap: they obey different
definitions and policies and can't be used in the same domains of application. You can still say that Unicode 1.1
is not recommended, but it is not deleted, and it will survive indefinitely (as long as there are people who need
to refer to it).
However, the UTC could decide to stop supporting the cost of keeping these old versions alive, by putting these old
standards in the public domain and allowing anyone to support them as long as they need them (I think there's no
real problem in getting copies of these old documents from many places now, but the UTC could decide, before that,
to digitally sign the authentic versions, if multiple independent remote archives are not judged reliable or could
insert their own modifications, errors, or omissions). My opinion is that the initial publication as a book is
enough: the book is kept and archived in official public libraries, and can be made available online as a scanned
PDF (using a printed-and-scanned format instead of plain text/tabular data could also prevent easy reuse of the old
standards in the future, when they are no longer recommended); if not everything was in a book, a summary of the
additional documents could be converted to a "fascicule" in PDF format and sent to the same archives as a
republication displaying the initial publication data in an encoded version.
This archive was generated by hypermail 2.1.5 : Sun Feb 01 2009 - 21:31:26 CST