UN/LOCODE perspective on character sets

Philippe Verdy verdy_p at wanadoo.fr
Thu Dec 17 15:55:00 CST 2015

Good catch. Once again, a lot of misconceptions from someone who wrote this
without looking at the conformance requirements in these standards.
The so-called "standard United States character set (437)" is in fact a
proprietary legacy charset, widely used in the US but never adopted as a US
standard. It should have been named "IBM/MS-DOS code page 437", without
reference to the US (in fact it was used worldwide as the default charset on
many PCs).

But basically what this says is that UN/LOCODE works only with the subset
of characters found in both ISO 8859-1 and CP437, and this is what "diacritic
signs, when practicable" means. Of course it is *interoperable* with ISO
10646-1, but only via a transcoding conversion. CP437 ***was*** widely used
in trade data interchange, but this has not been true for a long time: ISO
8859-1 was adopted much more widely, and now ISO 10646-1 is preferred (most
of the time using UTF-8).
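
To make that common subset concrete, here is a minimal Python sketch. It
assumes that Python's built-in "cp437" and "latin-1" codec tables are an
acceptable stand-in for the two charsets; the helper names and the sample
place names are only illustrative and are not taken from the UN/LOCODE Manual.

```python
# Sketch: repertoire shared by CP437 and ISO 8859-1, per Python's codec tables.

def repertoire(codec):
    """Return the set of characters a single-byte codec can decode."""
    chars = set()
    for byte in range(256):
        try:
            chars.add(bytes([byte]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte value is unassigned in this codec
    return chars

# Characters that exist in both charsets.
common = repertoire("cp437") & repertoire("latin-1")

# Non-ASCII letters usable in both, i.e. the "practicable" diacritics.
accented = sorted(c for c in common if ord(c) > 0x7F and c.isalpha())
print(len(accented), "".join(accented))

def fits_both(name):
    """True if every character of the name is in the common subset."""
    return all(c in common for c in name)

# Illustrative place names (assumed test inputs).
for name in ("Málaga", "São Paulo"):
    print(name, fits_both(name))
```

CP437 has no "ã", so a name such as "São Paulo" falls outside the subset
even though ISO 8859-1 alone could represent it.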

But there still exist some old files for dBase II/III (as used in the
1980s by old software running on MS-DOS) or similar that are encoded in
CP437, but those old files are not updated with the changes needed in 2015.
Modern databases run on SQL engines with interfaces exposing ISO 10646-1
(UTF-8), or only ISO 8859-1 in the US and Western Europe.
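
Transcoding such a legacy file is mechanical once the source charset is known.
A minimal sketch in Python, assuming the data really is CP437; the function
name and the sample bytes are hypothetical:

```python
# Sketch: transcode legacy CP437 bytes to UTF-8 (source charset is an assumption).

def cp437_to_utf8(raw):
    """Decode bytes as CP437 and re-encode the text as UTF-8."""
    return raw.decode("cp437").encode("utf-8")

legacy = b"M\xa0laga"  # hypothetical field from an old CP437 file

print(cp437_to_utf8(legacy).decode("utf-8"))

# The same bytes read as ISO 8859-1 give a different string: byte 0xA0 is
# "á" in CP437 but NO-BREAK SPACE in ISO 8859-1.
print(repr(legacy.decode("latin-1")))
```

This is why CP437 cannot be said to "conform" to ISO 8859-1: the same byte
values above 0x7F denote different characters in the two charsets.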

UN/LOCODE should not target just the US or Western Europe. It should work as
a worldwide standard, so it has to accept names in languages such as Czech or
Polish that need Latin letters with diacritics not found in ISO 8859-1 but
found in other legacy ISO 8859-* charsets: those languages are not
transliterated to simpler forms, unlike names in Russian, Chinese, Thai,
Hebrew or Arabic, which define their own standard romanizations that also
require other characters not found in ISO 8859-1.
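
A quick way to see which charsets can carry a given name is to try encoding
it. The sketch below is only an illustration: the candidate list, the helper
name and the sample names are my own assumptions, and it relies on Python's
codec tables for each charset.

```python
# Sketch: which candidate charsets can encode a given place name?

CANDIDATES = ["cp437", "latin-1", "iso8859_2", "utf-8"]  # assumed list

def encodable_in(name, codecs=CANDIDATES):
    """Return the codecs from the list that can encode the whole name."""
    usable = []
    for codec in codecs:
        try:
            name.encode(codec)
            usable.append(codec)
        except UnicodeEncodeError:
            pass  # at least one character is missing from this charset
    return usable

# Illustrative names: Spanish, Polish, Czech.
for name in ("Málaga", "Łódź", "Plzeň"):
    print(name, encodable_in(name))
```

The Polish and Czech samples fail in both CP437 and ISO 8859-1 but pass in
ISO 8859-2 and UTF-8.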

For UN/LOCODE, the romanizations should preferably follow the international
romanizations defined for toponyms. But there is not even a reference to
those existing standards (widely used in Russia, China, Japan, Israel, and
Arabic countries). This omission is not forgivable.

My opinion is that this paragraph has in fact not been updated for a very
long time, as it should have been in this 2015-2 version. Because of that,
the names listed in UN/LOCODE are very questionable (and anyway the location
codes in UN/LOCODE are largely deprecated in favor of ISO 3166-* codes, where
available, or names used by IATA or ICAO, or postal codes in countries that
have defined them, or region codes defined by their national or regional
statistics institute).

2015-12-17 22:19 GMT+01:00 Doug Ewell <doug at ewellic.org>:

> UN/LOCODE version 2015-2 has been released [1], and the Manual still
> contains the following about character sets:
> "27. Place names in UN/LOCODE are given in their national language
> versions as expressed in the Roman alphabet using the 26 characters of
> the character set adopted for international trade data interchange, with
> diacritic signs, when practicable (cf. Paragraph 3.2.2 of the UN/LOCODE
> Manual). International ISO Standard character sets are laid down in ISO
> 8859-1 (1987) and ISO10646-1 (1993). (The standard United States
> character set (437), which conforms to these ISO standards, is also
> widely used in trade data interchange)."
> Spot the errors.
> [1] http://www.unece.org/cefact/codesfortrade/codes_index.html
> --
> Doug Ewell | http://ewellic.org | Thornton, CO