In a message dated 2001-06-13 5:29:33 Pacific Daylight Time,
Markus.Kuhn@cl.cam.ac.uk (through marco.cimarosti@essetre.it) writes:
> I think, Oracle et al. should consider to use instead of UTF-16 what I
> propose to call UTF-16F (F for "fixed") in their B-trees, to maintain
> UCS binary sorting order:
>
> Conversion between UTF-16 and UTF-16F works as follows:
>
> unsigned short utf16_to_utf16f(unsigned short u)
> {
>     assert(u <= 0xffff);
>     /* shift surrogates into the top 0x800 code positions
>        of 16-bit space */
>     if (u >= 0xe000)
>         return u - 0x800;
>     if (u >= 0xd800)
>         return u + 0x2000;
>     return u;
> }
This is what I alluded to in my earlier message about the user-defined
function supplied to qsort(). Any sorting mechanism for UTF-16 can easily
incorporate this efficient transformation to achieve binary order.
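A minimal sketch of that qsort() arrangement (the comparator name and the
single-code-unit framing are my illustration, not anything from the thread):

```c
#include <assert.h>
#include <stdlib.h>

/* Kuhn's transformation: shift the surrogate block above 0xe000..0xffff
   so that transformed code units compare in UCS binary order. */
static unsigned short utf16_to_utf16f(unsigned short u)
{
    if (u >= 0xe000) return u - 0x800;
    if (u >= 0xd800) return u + 0x2000;
    return u;
}

/* Comparator for qsort() over an array of 16-bit code units.
   (A real string comparator would loop over the code units of two
   strings; this single-unit version just illustrates the ordering.) */
static int cmp_utf16f(const void *a, const void *b)
{
    unsigned short fa = utf16_to_utf16f(*(const unsigned short *)a);
    unsigned short fb = utf16_to_utf16f(*(const unsigned short *)b);
    return (fa > fb) - (fa < fb);
}
```

Sorting {0xd800, 0x0041, 0xe000} with this comparator yields
{0x0041, 0xe000, 0xd800}: the lead surrogate (part of a character above
U+FFFF) correctly sorts after U+E000, as UCS binary order requires.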
By coding the transformation inline, and reordering things trivially so that
the test for (u < 0xd800) -- by far the most common case -- appears first,
the transformation will degenerate in most cases to:
    if (u < 0xd800)
        ;
and nobody can say that that is not efficient enough, on any hardware built
since 1985.
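Written out, that common-case-first version might look like this (my sketch,
not code from the original message; note that the no-change fast path is
u < 0xd800, since surrogate code units themselves still need shifting):

```c
/* Same mapping as utf16_to_utf16f, reordered so the overwhelmingly
   common case -- an ordinary BMP character below the surrogate
   range -- falls through with a single comparison. */
static unsigned short utf16_to_utf16f_inline(unsigned short u)
{
    if (u < 0xd800)       /* by far the most common case: unchanged */
        return u;
    if (u < 0xe000)       /* surrogates 0xd800..0xdfff: shift up */
        return u + 0x2000;
    return u - 0x800;     /* 0xe000..0xffff: shift down */
}
```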
If you remove the assert(u <= 0xffff) statement, then the same logic can be
used for data in either UTF-8 or UTF-16, provided that no unpaired surrogates
appear in your data (a reasonable constraint).
Oracle and PeopleSoft can use this transformation in their COBOL, in their
memory cache, on the beaches and in the fields and streets, etc. instead of
UTF-8s, and it will be much less work for *them* than maintaining two
separate-but-confusable encoding schemes and fielding all the tech support
calls from irate customers who have discovered that "UTF8" does not mean
UTF-8.
-Doug Ewell
Fullerton, California
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT