>
> I am no authority, but I have heard several times that using
> Unicode order for collation is a bad idea (even Unicode says
> this).
The Unicode Consortium does not say this. The guidelines on
sorting to be published in the Unicode Standard, Version 2.0,
provide some principles and examples for language-specific
and culturally-relevant sorting. Then there are default fallback
principles:
  - Default to a common or culturally neutral ordering for
    out-of-scope characters. [E.g., if you are collating
    Swedish, it would be fine to default to a culturally
    neutral ordering for Han characters -- but if you are
    collating Japanese phonetically, you would obviously
    have to do a complex collation involving dictionary
    lookup, and a culturally neutral ordering for the Han
    characters would not be appropriate.]
  - Collate irrelevant characters in Unicode bit-order, in a
    specified position. [E.g., if you are simply sorting
    hex-formatted numbers, it doesn't matter what you do with the
    rest -- just use the Unicode bit-order.]
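To make the two fallback principles concrete, here is a minimal sketch in
Python. It is only an illustration, not anything taken from the Unicode
Standard: the name swedish_sketch_key and the 29-letter alphabet string are
assumptions made up for this example, and case and accent handling are
deliberately crude. In-scope letters get a tailored position; everything
else is collated in code point order, in a specified position after the
in-scope letters.

```python
import unicodedata

# Hypothetical tailoring for illustration: the Swedish alphabet, in which
# a-ring, a-umlaut and o-umlaut sort after z, in that order.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")


def swedish_sketch_key(text):
    """Build a sort key: tailored order in scope, code point order otherwise."""
    key = []
    for ch in unicodedata.normalize("NFC", text).lower():
        pos = SWEDISH_ALPHABET.find(ch)
        if pos >= 0:
            # In-scope character: use its position in the tailored alphabet.
            key.append((0, pos))
        else:
            # Out-of-scope character (e.g. Han): fall back to code point
            # order, placed after all in-scope characters.
            key.append((1, ord(ch)))
    return key


words = [unicodedata.normalize("NFC", w)
         for w in ["äpple", "水", "zebra", "åka", "一", "apa"]]

code_point_order = sorted(words)                      # plain binary order
tailored_order = sorted(words, key=swedish_sketch_key)

print(code_point_order)
print(tailored_order)
```

The two results differ only in the relative order of the words beginning
with a-umlaut and a-ring; the Han characters end up in the same place
either way, which is the point of the fallback.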
Toyoshima-san was correct in stating that it is a good idea to default
the sorting of Han characters in Unicode to their binary order, because
the encoding of the Han characters was carefully devised to give them
a meaningful, but culturally neutral order.
>
> Apparently the order may be close for one of the Chinese's (traditional
> or simplified, I forget which), but even this should not be counted
> on.
As Toyoshima-san pointed out, it is traditional radical-stroke order.
The exact placement followed a series of rules depending primarily on
the order in the Kangxi dictionary, with subsidiary rules for characters
not in the Kangxi dictionary. [No, that is not a typo for Kanji: "Kangxi" is
the name of a Qing dynasty reign period, during which a large, official
Chinese dictionary compendium was published.]
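As a small illustration of that ordering, the following Python sketch sorts
a handful of Han characters by code point alone. The five sample characters
are my own choice, all taken from the basic unified Han block; the radical
numbers in the comments are their Kangxi radical numbers.

```python
# Five Han characters, deliberately listed out of order.
han = ["龍", "水", "一", "口", "人"]

# Python compares strings by code point, so sorted() gives binary order.
binary_order = sorted(han)
code_points = [f"U+{ord(ch):04X}" for ch in binary_order]

# Kangxi radical numbers for the sample characters:
#   一 radical 1, 人 radical 9, 口 radical 30, 水 radical 85, 龍 radical 212
for ch, cp in zip(binary_order, code_points):
    print(ch, cp)
```

For these samples the binary order and the radical order coincide, without
any lookup table.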
>
> HOWEVER, I believe that major database vendors like Oracle and Sybase
> will "sort" Unicode (actually UTF-8) using Unicode order (ie. they
> don't sort!).
The default sort order for Unicode data will certainly be in Unicode
binary order. However, the database vendors, including Sybase, provide
mechanisms for defining collation orders for databases. There is no
reason to suppose these mechanisms will not apply to Unicode, as well
as to other character sets supported by the databases. Given the
complexities of language-specific and culturally-dependent sorting
rules, though, it is unlikely that the particular collation orders you
have in mind will be delivered "in-the-box" with database software.
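The difference between a default binary order and a user-defined collation
can be sketched with SQLite through Python's sqlite3 module. This is only a
stand-in: it says nothing about how Sybase or Oracle expose their own
mechanisms. The collation name swedish_sketch, the table, and the comparison
function are all hypothetical and exist only for this example.

```python
import sqlite3
import unicodedata

# Hypothetical tailoring: Swedish alphabet with its three extra letters last.
ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")


def sketch_key(text):
    # In-scope letters by alphabet position, everything else by code point.
    return [(0, ALPHABET.index(ch)) if ch in ALPHABET else (1, ord(ch))
            for ch in unicodedata.normalize("NFC", text).lower()]


def swedish_compare(a, b):
    # SQLite collation callbacks return a negative, zero or positive integer.
    ka, kb = sketch_key(a), sketch_key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("swedish_sketch", swedish_compare)
conn.execute("CREATE TABLE words (w TEXT)")
rows = [(unicodedata.normalize("NFC", w),)
        for w in ["äpple", "zebra", "åka", "apa"]]
conn.executemany("INSERT INTO words VALUES (?)", rows)

# Default: SQLite's built-in BINARY collation compares the stored bytes.
binary_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w")]

# Custom: the same query with the user-defined collation applied.
tailored_order = [r[0] for r in conn.execute(
    "SELECT w FROM words ORDER BY w COLLATE swedish_sketch")]

print(binary_order)
print(tailored_order)
conn.close()
```

The data is stored once; only the ORDER BY clause decides which ordering
the query returns.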
>
> My question is: what to do about this?
Press the database vendors to provide default collations for common
languages which work on tables with text stored in Unicode. And
also press them to provide simpler mechanisms for defining and
using custom collations.
--Ken Whistler
Technical Director, Unicode, Inc.