FW: UTF-8S ??? UTF-16F !!!

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Jun 13 2001 - 08:08:52 EDT


I guess that this message by Markus Kuhn has also been bounced, as Markus is
not subscribed to the Unicode list.

(And, ahemmm, if Markus wishes to continue the discussion on the Unicode
list as well, he should perhaps subscribe.)

-----Original Message-----
From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
Sent: Wednesday, June 13, 2001 13.19
To: linux-utf8@nl.linux.org
Cc: unicode@unicode.org
Subject: Re: UTF-8S ??? UTF-16F !!!

Marco Cimarosti wrote on 2001-06-13 10:35 UTC:
> > What is this UTF-8S please?
>
> UTF-8S is a proposal by Oracle, PeopleSoft, et al. to define a new UTF. It
> looks like UTF-8, but characters > 0xFFFF are represented with two 3-byte
> sequences, representing the two UTF-16 surrogate codes.
>
> The reason for this proposal, according to Oracle and PeopleSoft, is to
> allow UTF-8 databases to have a binary sort identical to UTF-16 databases.

Oh my god! Please don't. THIS IS UGLY AND AWFUL!!!

One of the very beautiful aspects of UTF-8 is that it preserves the UCS
binary sorting order. One of the more ugly aspects (of many) of UTF-16
is that it breaks binary sorting order. I can't believe that anyone is
seriously trying to transfer the UTF-16 mess into the beautiful and
innocent world of UTF-8 as well. Oracle can happily use this as a
proprietary encoding inside their database engine, but I strongly
recommend that they don't document it anywhere for the outside world and
that they never make this visible to users on any APIs. In particular,
they should not even think about proposing this evil idea for
standardization. Yuck!!!
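
To make the sorting claim concrete, here is a minimal sketch (the helper
names encode_utf8, encode_utf16, order_utf8 and order_utf16 are made up for
this illustration, and the encoders assume valid, non-surrogate code points).
It compares two code points by their encoded form: byte by byte for UTF-8,
16-bit code unit by code unit for UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode one code point as UTF-8; returns the number of bytes (1..4).
   Illustrative helper: surrogate code points are not rejected. */
size_t encode_utf8(unsigned long cp, unsigned char out[4])
{
  assert(cp <= 0x10ffffUL);
  if (cp < 0x80UL) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800UL) {
    out[0] = (unsigned char)(0xc0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3f));
    return 2;
  }
  if (cp < 0x10000UL) {
    out[0] = (unsigned char)(0xe0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
  }
  out[0] = (unsigned char)(0xf0 | (cp >> 18));
  out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
  out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
  out[3] = (unsigned char)(0x80 | (cp & 0x3f));
  return 4;
}

/* Encode one code point as UTF-16; returns the number of code units (1 or 2). */
size_t encode_utf16(unsigned long cp, unsigned short out[2])
{
  assert(cp <= 0x10ffffUL);
  if (cp < 0x10000UL) {
    out[0] = (unsigned short)cp;
    return 1;
  }
  cp -= 0x10000UL;
  out[0] = (unsigned short)(0xd800 | (cp >> 10));   /* high surrogate */
  out[1] = (unsigned short)(0xdc00 | (cp & 0x3ff)); /* low surrogate */
  return 2;
}

/* Order two code points by their UTF-8 bytes: <0, 0 or >0. */
int order_utf8(unsigned long a, unsigned long b)
{
  unsigned char ba[4], bb[4];
  size_t na = encode_utf8(a, ba), nb = encode_utf8(b, bb);
  int c = memcmp(ba, bb, na < nb ? na : nb);
  return c != 0 ? c : (int)(na > nb) - (int)(na < nb);
}

/* Order two code points by their UTF-16 code units: <0, 0 or >0. */
int order_utf16(unsigned long a, unsigned long b)
{
  unsigned short ua[2], ub[2];
  size_t na = encode_utf16(a, ua), nb = encode_utf16(b, ub);
  size_t i, n = na < nb ? na : nb;
  for (i = 0; i < n; i++)
    if (ua[i] != ub[i])
      return ua[i] < ub[i] ? -1 : 1;
  return (int)(na > nb) - (int)(na < nb);
}
```

With these helpers, order_utf8(0xff5e, 0x10000) is negative (U+FF5E sorts
first, as in UCS), while order_utf16(0xff5e, 0x10000) is positive, because
the leading surrogate 0xD800 is smaller than 0xFF5E.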

In any case, it's an interesting proposal for discussion, since UTF-8S
educates people on why UTF-16 was a bad idea to begin with and why
UTF-16 should definitely not be used in B-trees and similar access path
data structures for databases. Had UCS-2 left space for the surrogates
at the top end of the 16-bit space, the problem wouldn't have occurred.
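
For reference, the UTF-8S scheme as described in the quoted message can be
sketched roughly as follows. This is only my reading of that description,
not the text of the actual proposal; the names put_3byte, encode_utf8s and
order_utf8s are hypothetical, and input is assumed to be a valid,
non-surrogate code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write a value in 0x800..0xffff as a 3-byte UTF-8-style sequence. */
void put_3byte(unsigned long v, unsigned char *out)
{
  out[0] = (unsigned char)(0xe0 | (v >> 12));
  out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3f));
  out[2] = (unsigned char)(0x80 | (v & 0x3f));
}

/* Encode one code point in the UTF-8S style described above:
   BMP characters as in UTF-8, anything above 0xffff as two 3-byte
   sequences, one per UTF-16 surrogate. Returns 1, 2, 3 or 6. */
size_t encode_utf8s(unsigned long cp, unsigned char out[6])
{
  assert(cp <= 0x10ffffUL);
  if (cp < 0x80UL) {
    out[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800UL) {
    out[0] = (unsigned char)(0xc0 | (cp >> 6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3f));
    return 2;
  }
  if (cp < 0x10000UL) {
    put_3byte(cp, out);
    return 3;
  }
  cp -= 0x10000UL;
  put_3byte(0xd800UL | (cp >> 10), out);       /* high surrogate */
  put_3byte(0xdc00UL | (cp & 0x3ff), out + 3); /* low surrogate */
  return 6;
}

/* Order two code points by their UTF-8S bytes: <0, 0 or >0. */
int order_utf8s(unsigned long a, unsigned long b)
{
  unsigned char ba[6], bb[6];
  size_t na = encode_utf8s(a, ba), nb = encode_utf8s(b, bb);
  int c = memcmp(ba, bb, na < nb ? na : nb);
  return c != 0 ? c : (int)(na > nb) - (int)(na < nb);
}
```

Under this reading, U+10000 becomes ED A0 80 ED B0 80, which sorts before
U+FF5E (EF BD 9E). That is the UTF-16 order, not the UCS order.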

Engineering proposal:

I think Oracle et al. should consider using, instead of UTF-16, what I
propose to call UTF-16F (F for "fixed") in their B-trees, to maintain
UCS binary sorting order:

Conversion between UTF-16 and UTF-16F works as follows:

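  #include <assert.h>

  /* map one UTF-16 code unit to its UTF-16F position */
  unsigned short utf16_to_utf16f(unsigned short u)
  {
    assert(u <= 0xffff);  /* only matters if unsigned short is wider than 16 bits */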
    /* shift surrogates into the top 0x800 code positions of 16-bit space */
    if (u >= 0xe000)
      return u - 0x800;
    if (u >= 0xd800)
      return u + 0x2000;
    return u;
  }

  /* map one UTF-16F code unit back to its UTF-16 position */
  unsigned short utf16f_to_utf16(unsigned short u)
  {
    assert(u <= 0xffff);
    /* shift surrogates back into UTF-16 position */
    if (u >= 0xf800)
      return u - 0x2000;
    if (u >= 0xd800)
      return u + 0x800;
    return u;
  }

UTF-16F is not backwards compatible with UCS-2 (no need for that inside a
B-tree anyway!), but it has the same space requirement as UTF-16 and
conversion is absolutely trivial and extremely efficient (see above).
And UTF-16F binary sorts like UCS-4 and UTF-8, which might make life
simpler for B-tree hackers.
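
As a rough illustration of how a B-tree key comparison might use this, the
sketch below compares two UTF-16 strings in UCS order by mapping each code
unit to UTF-16F on the fly. The function name utf16_cmp_ucs_order and its
signature are assumptions made for this example. The mapping is the same as
in utf16_to_utf16f above, repeated (with explicit casts) so that the sketch
stands alone. Well-formed UTF-16 input is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Same mapping as utf16_to_utf16f in the message, with explicit casts. */
unsigned short utf16_to_utf16f(unsigned short u)
{
  if (u >= 0xe000)
    return (unsigned short)(u - 0x800);   /* 0xe000..0xffff -> 0xd800..0xf7ff */
  if (u >= 0xd800)
    return (unsigned short)(u + 0x2000);  /* surrogates -> 0xf800..0xffff */
  return u;
}

/* Compare two UTF-16 strings (lengths given in code units) in UCS code
   point order by comparing their UTF-16F images unit by unit.
   Returns -1, 0 or 1. */
int utf16_cmp_ucs_order(const unsigned short *a, size_t na,
                        const unsigned short *b, size_t nb)
{
  size_t i, n = na < nb ? na : nb;

  assert(a != NULL || na == 0);
  assert(b != NULL || nb == 0);
  for (i = 0; i < n; i++) {
    unsigned short fa = utf16_to_utf16f(a[i]);
    unsigned short fb = utf16_to_utf16f(b[i]);
    if (fa != fb)
      return fa < fb ? -1 : 1;
  }
  /* one string is a prefix of the other: the shorter one sorts first */
  return (int)(na > nb) - (int)(na < nb);
}
```

For example, the one-unit string { 0xff5e } now compares lower than the
surrogate pair { 0xd800, 0xdc00 } (U+10000), whereas a plain comparison of
UTF-16 code units would put the surrogate pair first. Storing the keys
already converted to UTF-16F would avoid the per-unit mapping at comparison
time.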

There is no need to standardize UTF-16F, as it is (like UTF-8) an
internal coding trick, not something you would want to use for data
transfer. Might be useful though to document it as tutorial information
somewhere in the next Unicode book.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

