I am not discouraged. The collation I am designing here is only for
use by customers who want something `better' than a binary sort (in
particular, they demand case-blind comparison) but who do not want to
pay the performance penalty required for a fully conformant collation.
In areas where performance isn't so critical, I believe I am already
taking into account all the issues you describe. I am not as naive
as I may appear.

On the other hand, it is my view that the Japanese, my main customers
for fullwidth ASCII and halfwidth Katakana, generally prefer that these
characters not be equivalent to their `ordinary' counterparts (for
compatibility with existing systems), so I do not intend to map them.
I do not know whether the same is true of the Arabic presentation forms,
and will research this.

Some of the Arabic presentation forms, all of the Hangul syllables,
and many other equivalences cannot be handled by this particular
collation algorithm, but customers will choose it when performance
concerns outweigh these deficiencies.
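
A minimal sketch of the kind of fast comparison described above, in
Python for brevity. It assumes the collation reduces to a per-code-point
table lookup followed by a binary compare. The names `SINGLETON_MAP`,
`fast_key` and `fast_compare` are hypothetical, `str.lower()` merely
stands in for whatever case-mapping table is actually used, and the two
table entries are the singleton mappings proposed in the message quoted
below.

```python
# Hypothetical one-to-one mapping table. The two entries are the singleton
# canonical equivalences proposed in the thread; fullwidth ASCII and
# halfwidth Katakana are deliberately left unmapped.
SINGLETON_MAP = {
    0x0340: 0x0300,  # COMBINING GRAVE TONE MARK -> COMBINING GRAVE ACCENT
    0x232A: 0x3009,  # RIGHT-POINTING ANGLE BRACKET -> RIGHT ANGLE BRACKET
}


def fast_key(s: str) -> str:
    """Build a comparison key with table lookups only (no normalization)."""
    # str.lower() is an assumption standing in for a simple case-mapping table.
    return s.translate(SINGLETON_MAP).lower()


def fast_compare(a: str, b: str) -> int:
    """Return -1, 0 or 1 by comparing the keys code point by code point."""
    ka, kb = fast_key(a), fast_key(b)
    return (ka > kb) - (ka < kb)
```

Under these assumptions "Unicode" and "UNICODE" compare equal, while
fullwidth A (U+FF21) stays distinct from "A" and a precomposed Hangul
syllable stays distinct from its Jamo sequence.
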
----- Begin Included Message -----
From: Kent Karlsson <firstname.lastname@example.org>
Gary Roberts wrote:
> Ken and Kent bring up certain canonical equivalences, which the
> technique I proposed will not handle.
> I am now tempted to include mapping
> U+0340 -> U+0300
> U+232A -> U+3009
And the fullwidth ASCII should be mapped to "ordinary ASCII",
the halfwidth Katakana should be mapped to ordinary Katakana,
the presentation forms for Arabic mapped to their ordinary forms,
the Hangul syllables mapped to their Hangul Jamo strings,
compositions should be normalised, ...
I don't want to discourage you, but comparison of Unicode strings
is non-trivial, even when case sensitive.
----- End Included Message -----
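
The mappings listed in the included message correspond roughly to
Unicode normalization: the fullwidth, halfwidth and presentation forms
are compatibility decompositions, while the Hangul syllables and the
two singletons are canonical ones. A minimal sketch using Python's
standard `unicodedata` module follows. It only illustrates which
equivalences normalization folds together and is not a conformant
collation; the helper name `normalized_key` is hypothetical.

```python
import unicodedata


def normalized_key(s: str) -> str:
    """Compatibility-decompose (NFKD), then lowercase. Illustration only."""
    return unicodedata.normalize("NFKD", s).lower()


# Canonical decomposition alone already turns the Hangul syllable U+AC00
# into its Jamo sequence U+1100 U+1161.
hangul_jamo = unicodedata.normalize("NFD", "\uAC00")
```

With this key, fullwidth A (U+FF21) and "A" compare equal, as do
halfwidth and ordinary Katakana KA (U+FF76 and U+30AB), at the cost of
a normalization pass over every string compared.
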