Re: Collation TR

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 07 1999 - 19:54:42 EST


John Cowan commented:

>
> I'm concerned that this collation algorithm will break in Unicode 3.0,
> because of the arrival of the new Han characters.

The addition of the CJK Vertical Extension A to Unicode 3.0
(6582 more Han characters, at 3400..4DB5) does not "break" the
collation algorithm. The default behavior of the algorithm,
which assumes that Han characters will simply be weighted automatically
according to their Unicode value, will then result in Han characters
sorting in two runs: the common-to-rare characters in 4E00..9FA5,
and, ahead of them in 3400..4DB5, the exceedingly rare, special-use,
and local-variant characters of CJK Vertical Extension A.
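
To make the "two runs" concrete, here is a minimal sketch in Python. It
assumes only what is stated above, namely that the default primary weight
of a Han character is its code point. The helper names
(han_primary_weight, block_of) and the sample characters are invented
for illustration and are not part of any collation table.

```python
# Toy illustration only: default Han weighting taken to be the code point.
EXT_A = range(0x3400, 0x4DB5 + 1)    # Extension A range as given above
UNIFIED = range(0x4E00, 0x9FA5 + 1)  # Unified Han range in Unicode 2.0

def han_primary_weight(ch):
    """Assumed default: weight a Han character by its Unicode value."""
    return ord(ch)

def block_of(ch):
    """Hypothetical helper: name the block a character falls in."""
    cp = ord(ch)
    if cp in EXT_A:
        return "ExtA"
    if cp in UNIFIED:
        return "Unified"
    return "other"

# Arbitrary sample drawn from both ranges, deliberately out of order.
sample = ["\u4e00", "\u3400", "\u9fa5", "\u4db5", "\u5b57"]
ordered = sorted(sample, key=han_primary_weight)
runs = [block_of(ch) for ch in ordered]
print(runs)
```

The Extension A characters come out as one run ahead of the 4E00..9FA5
run. A tailoring that wanted a single merged radical/stroke run would
have to replace han_primary_weight with a lookup in the large table
discussed below.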

While it would be nice to have the default of the algorithm
automatically merge the two sets of Han characters into a single
radical/stroke order run, doing so requires a large table, and
ought to be left to tailoring for those applications that
1) actually implement Extension A and 2) don't object to dealing
with the large expansion of the default table. It doesn't seem
reasonable to saddle run-of-the-mill alphabetic sort tailorings
with the large table needed to merge the exceedingly rare Chinese
characters into the rest of them, when the default behavior would
be good enough even if they did run into Chinese in the multilingual
material they were sorting.

There are ample precedents for this in Asian computing: nearly
every legacy Asian code page out there has the phenomenon of
Level 1, Level 2, etc., with separately ordered sets of characters
in each level (since additions to the standard can't be
accommodated by reordering all the characters that were previously
standardized). And yet most Asian implementations until recently have
been merrily sorting characters by the binary order of the native
character set. And the Unicode "Level 1", namely the 20902
characters in the existing set of Unified Han characters in
Unicode 2.0, is already so much more comprehensive than any
of the Level 1's of any particular Asian character encoding
standard (except those that are simply renamed forms of Unicode),
that the binary ordering of Unicode Han is generally far better than the
binary ordering of Han characters in the legacy encodings.

> Presumably the
> non-cultural sorting will result from a merge of the main block
> with the new block, but this will not be easily achievable, given
> that the primary weights of Han characters are defined to be
> just their Unicode codepoints. It would be clearly undesirable
> to make the sort order of Han characters depend on the date of their
> addition to the Unicode Standard!

That is not a Han character problem. The addition of *any*
character to the standard (all 10000+ that are going into
Unicode 3.0) after the establishment of the first version of
the default collation tables for the Unicode Collation Algorithm
means that the "sort order of [those] characters depend[s] on
the date of their addition to the Unicode Standard." The default
collation tables will need to be versioned to take all those
new characters into account. And in some instances, depending on
the relation of new characters to old, the establishment of
a default collation for a new character could affect that of
one previously in the table. Certainly the absolute values
of all the weights in the table will change, although most of
the relative ordering relations will be stable.
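
A rough sketch of that last point, in Python. It assumes a toy table
in which weights are simply handed out in sequence; the real default
tables are not built this way, and build_table, the base value, and the
choice of inserted character are all hypothetical.

```python
def build_table(ordered_chars, base=0x100):
    """Hypothetical: assign consecutive weights in the given order."""
    return {ch: base + i for i, ch in enumerate(ordered_chars)}

# Version 1 of a toy table, and version 2 with one new character
# slotted in after "b" (an arbitrary position chosen for illustration).
v1_order = ["a", "b", "c", "d"]
v2_order = ["a", "b", "\u0253", "c", "d"]

v1 = build_table(v1_order)
v2 = build_table(v2_order)

# Old characters whose absolute weight differs between versions.
changed = [ch for ch in v1 if v1[ch] != v2[ch]]

# Relative order of the old characters, compared across versions.
old_in_v2 = sorted((ch for ch in v2 if ch in v1), key=v2.get)
same_relative_order = sorted(v1, key=v1.get) == old_in_v2

print(changed, same_relative_order)
```

In this toy only the characters after the insertion point shift; the
point is just that absolute weights are version-dependent while the
ordering among previously tabled characters can stay put.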

>
> What is the reason for not making all secondary keys backwards,
> as in the DIS? Is there strong practice requiring a forward
> option?

What DIS? If you are referring to 14651, the International String
Ordering standard, that is currently ISO/IEC FCD 14651.2 (second
ballot for a final committee draft), not a DIS. And in the
document under ballot, the common tailorable template table
does not specify making the secondary keys backwards. That is
a tailoring that one can do -- and any use of the template table
requires *some* tailoring.
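
For readers unfamiliar with the forward/backward distinction, a small
Python sketch follows. It is a toy, not the FCD 14651 template table
and not the Unicode Collation Algorithm: the levels function and the
use of the combining mark's code point as a secondary weight are
assumptions made for illustration. It uses the commonly cited French
accent example.

```python
import unicodedata

def levels(word):
    """Toy split into primary (base letters) and secondary (accents).

    Assumption: the secondary weight of a letter is 0 if unaccented,
    otherwise the code point of its combining mark after NFD.
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            secondary[-1] = ord(ch)  # attach accent to preceding letter
        else:
            primary.append(ch)
            secondary.append(0)
    return primary, secondary

def forward_key(word):
    primary, secondary = levels(word)
    return (primary, secondary)

def backward_key(word):
    primary, secondary = levels(word)
    return (primary, secondary[::-1])  # compare accents from the end

words = ["c\u00f4t\u00e9", "cot\u00e9", "c\u00f4te", "cote"]
forward = sorted(words, key=forward_key)
backward = sorted(words, key=backward_key)
print(forward)
print(backward)
```

With forward secondaries the first accent difference from the left
decides; with backward secondaries the last one does. Which of the two
is wanted is a per-language choice, which is why it sits naturally in
a tailoring.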

--Ken Whistler

>
> --
> John Cowan cowan@ccil.org
> e'osai ko sarji la lojban.
>


