From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jan 09 2006 - 20:50:31 CST
> > <0069, 006A> --> 103C.1054.0020.0020.0002.0002
> > <0133>       --> 103C.1054.0020.0020.0004.0004
> > <0049, 004A> --> 103C.1054.0020.0020.0008.0008
> > <0132>       --> 103C.1054.0020.0020.000A.000A
> >                  ^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^
> >                  primary   secondary tertiary
>
Philippe asked:
> Shouldn't it be instead (leading zeroes suppressed only for clarity, to avoid
> line breaks in emails)?
>
> <0069, 006A>
> --> 103C.1054.0.20.20.0.2.2.0.69.6A
> <0133>
> --> 103C.1054.0.20.20.0.4.4.0.133
> <0049, 004A>
> --> 103C.1054.0.20.20.0.8.8.0.49.4A
> <0132>
> --> 103C.1054.0.20.20.0.A.A.0.132
>
> (note the addition of .0. to separate collation levels, to allow
> binary sort order, and the addition of the trailing collation
> level for the default codepoint ordering with unlimited collation keys)
Not necessary. The UCA generally assumes a maximum level of 3, to
simplify discussion, and because that is usually all that is
needed. The 4th level values in the DUCET table are just there
to make further distinctions if people need them in certain cases.

Furthermore, because the DUCET values are constructed with all
primary weights > all secondary weights > all tertiary weights,
I make use of the implementation technique discussed in 6.1.1,
Eliminating Level Separators. Level separators aren't needed in
constructing examples from DUCET, if no table tailoring has been
applied and no other compression techniques are used.

Note that the constructed keys I posted will sort in the exact
same relative order as those you posted. The 4th level differences
are irrelevant and are swamped by the tertiary differences.
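
To make the point about level separators concrete, here is a minimal Python sketch. It builds three-level keys from the collation elements that can be read off the keys quoted above, once without separators and once with a 0000 separator between levels, and compares the resulting orders. The table layout and the names `ELEMENTS`, `sort_key`, and `show` are hypothetical, and the single-element entry for <0069> is inferred from the first key rather than quoted from DUCET.

```python
# Collation elements (primary, secondary, tertiary) read off the keys quoted
# in this thread. The table layout and all names here are hypothetical.
ELEMENTS = {
    "<0069>":       [(0x103C, 0x0020, 0x0002)],  # inferred from the first key
    "<0069, 006A>": [(0x103C, 0x0020, 0x0002), (0x1054, 0x0020, 0x0002)],
    "<0133>":       [(0x103C, 0x0020, 0x0004), (0x1054, 0x0020, 0x0004)],
    "<0049, 004A>": [(0x103C, 0x0020, 0x0008), (0x1054, 0x0020, 0x0008)],
    "<0132>":       [(0x103C, 0x0020, 0x000A), (0x1054, 0x0020, 0x000A)],
}


def sort_key(elements, separator=None):
    """Concatenate all primaries, then all secondaries, then all tertiaries.

    If a separator is given it is placed between levels; otherwise the
    levels simply run together.
    """
    key = []
    for level in range(3):
        if separator is not None and level > 0:
            key.append(separator)
        # Zero weights are ignorable at that level and are left out.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


def show(key):
    """Format a key as dotted four-digit hex, as in the examples above."""
    return ".".join(f"{w:04X}" for w in key)


no_separator = sorted(ELEMENTS, key=lambda s: sort_key(ELEMENTS[s]))
with_separator = sorted(ELEMENTS, key=lambda s: sort_key(ELEMENTS[s], separator=0))

for s in no_separator:
    print(f"{s:<14} --> {show(sort_key(ELEMENTS[s]))}")
print("same order with separators:", no_separator == with_separator)
```

The shorter string still sorts first without a separator because its first secondary weight (0020) is compared against the longer string's second primary weight (1054), and every secondary weight is below every primary weight. That only holds while the weight ranges stay disjoint, which is why tailoring or compression can bring separators back.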
> Another related question: Why isn't there a standard 16-bit UTF
> that preserves the binary ordering of codepoints?
> (I mean for example UTF-16 modified simply by moving all
> code units or code points in E000..FFFF down to D800..F7FF
> and moving surrogate code units in D800..DFFF up to F800..FFFF).
Huh? Because it would confuse the hell out of everybody and lead
to problems, just like any other putative fixes by proliferation
of UTFs.
Sorting UTF-16 in binary order is easy. See "UTF-16 in UTF-8 Order",
p. 136 of TUS 4.0.
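
One way to get code point order out of ordinary UTF-16, without defining a new encoding, is to adjust code units only at comparison time. The Python sketch below illustrates that idea and is not the code from the book; `fixup`, `utf16_units`, and `code_point_order_key` are hypothetical names. It rotates E000..FFFF down and the surrogate range up while comparing, and leaves the stored text untouched.

```python
def fixup(unit):
    """Map a UTF-16 code unit to a value that compares in code point order."""
    if unit >= 0xE000:
        return unit - 0x0800   # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates D800..DFFF -> F800..FFFF
    return unit                # 0000..D7FF is unchanged


def utf16_units(text):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def code_point_order_key(text):
    """Sort key: the UTF-16 code units with the fix-up applied to each."""
    return [fixup(u) for u in utf16_units(text)]


samples = ["\U00010000", "\uFF5E", "a", "\uE000", "\U0010FFFF"]

raw_order = sorted(samples, key=utf16_units)             # plain binary UTF-16 order
fixed_order = sorted(samples, key=code_point_order_key)  # code point order

print([f"U+{ord(s):04X}" for s in raw_order])
print([f"U+{ord(s):04X}" for s in fixed_order])
```

In the raw order the supplementary characters sort ahead of U+E000 and U+FF5E because their lead surrogates start at D800; with the fix-up they sort last, which matches the order of the code points and therefore of the UTF-8 bytes.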
--Ken