Re: UTF-8S ??? UTF-16F !!!

From: Mark Davis (mark@macchiato.com)
Date: Wed Jun 13 2001 - 11:07:40 EDT

Next message: Michael \(michka\) Kaplan: "Re: informative due to variation across langauges"
Previous message: Edward Cherlin: "Re: UTF-8S: a modest proposal"
In reply to: Marco Cimarosti: "FW: UTF-8S ??? UTF-16F !!!"

Markus, there is an archive on
http://groups.yahoo.com/group/unicode/messages/ if you really want the gory
details. However, this topic is raging on at least 3 email lists that I know
of -- with one heck of a lot of repetition. To add to that, I will forward a
couple of messages I had sent to a different list.

Mark

----- Original Message -----
From: "Marco Cimarosti" <marco.cimarosti@essetre.it>
To: <unicode@unicode.org>
Cc: "'Markus Kuhn'" <Markus.Kuhn@cl.cam.ac.uk>
Sent: Wednesday, June 13, 2001 05:07
Subject: FW: UTF-8S ??? UTF-16F !!!

> I guess that this message by Markus Kuhn has also been bounced, as Markus
> is not subscribed to the Unicode list.
>
> (And, ahemmm, if Markus wishes to continue the discussion also on the
> Unicode List he should perhaps subscribe.)
>
> -----Original Message-----
> From: Markus Kuhn [mailto:Markus.Kuhn@cl.cam.ac.uk]
> Sent: Wednesday, June 13, 2001 13.19
> To: linux-utf8@nl.linux.org
> Cc: unicode@unicode.org
> Subject: Re: UTF-8S ??? UTF-16F !!!
>
>
> Marco Cimarosti wrote on 2001-06-13 10:35 UTC:
> > > What is this UTF-8S please?
> >
> > UTF-8S is a proposal by Oracle, PeopleSoft, et al. to define a new UTF.
> > It looks like UTF-8, but characters > 0xFFFF are represented with two
> > 3-byte sequences, representing the two UTF-16 surrogate codes.
> >
> > The reason for this proposal, according to Oracle and PeopleSoft, is to
> > allow UTF-8 databases to have a binary sort identical to UTF-16
> > databases.
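
To make the description above concrete, here is a minimal C sketch of how one supplementary character would be encoded in standard UTF-8 and in UTF-8S as described in this thread (two 3-byte sequences, one per UTF-16 surrogate). The function names are hypothetical and are not taken from the Oracle/PeopleSoft proposal; only the range U+10000..U+10FFFF is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Standard UTF-8: a code point in U+10000..U+10FFFF becomes one 4-byte sequence. */
static size_t encode_utf8_supplementary(unsigned long cp, unsigned char out[4])
{
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Helper (hypothetical): encode one 16-bit value in 0x0800..0xFFFF as 3 bytes. */
static void encode_3byte(unsigned long u, unsigned char *out)
{
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* UTF-8S as described in this thread: split the code point into its two
   UTF-16 surrogates, then encode each surrogate as a 3-byte sequence. */
static size_t encode_utf8s_supplementary(unsigned long cp, unsigned char out[6])
{
    unsigned long v, high, low;
    assert(cp >= 0x10000UL && cp <= 0x10FFFFUL);
    v = cp - 0x10000UL;
    high = 0xD800UL + (v >> 10);    /* high (leading) surrogate */
    low  = 0xDC00UL + (v & 0x3FF);  /* low (trailing) surrogate */
    encode_3byte(high, out);
    encode_3byte(low, out + 3);
    return 6;
}
```

For U+10000 this gives F0 90 80 80 in UTF-8 and ED A0 80 ED B0 80 in UTF-8S, so the same character takes six bytes instead of four.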
>
> Oh my god! Please don't. THIS IS UGLY AND AWFUL!!!
>
> One of the very beautiful aspects of UTF-8 is that it preserves the UCS
> binary sorting order. One of the more ugly aspects (of many) of UTF-16
> is that it breaks binary sorting order. I can't believe that anyone is
> seriously trying to transfer the UTF-16 mess into the beautiful and
> innocent world of UTF-8 as well. Oracle can happily use this as a
> proprietary encoding inside their database engine, but I strongly
> recommend that they don't document it anywhere for the outside world and
> that they never make this visible to users on any APIs. In particular,
> they should not even think about proposing this evil idea for
> standardization. Yuck!!!
>
> In any case, it's an interesting proposal for discussion, since UTF-8S
> educates people on why UTF-16 was a bad idea to begin with and why
> UTF-16 should definitely not be used in B-trees and similar access path
> data structures for databases. Had UCS-2 left space for the surrogates
> at the top end of the 16-bit space, the problem wouldn't have occurred.
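
The sorting claim above can be checked with a small C sketch. It compares U+FF5E (a BMP character above the surrogate range) with U+10000 (the first supplementary character), once as UTF-8 bytes and once as UTF-16 code units. The helper names and the choice of sample characters are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare two byte strings the way a binary sort of UTF-8 data would. */
static int compare_bytes(const unsigned char *a, size_t na,
                         const unsigned char *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    int c = memcmp(a, b, n);
    if (c != 0)
        return c < 0 ? -1 : 1;
    return (na > nb) - (na < nb);
}

/* Compare two UTF-16 strings unit by unit, as a binary sort would. */
static int compare_utf16_units(const unsigned short *a, size_t na,
                               const unsigned short *b, size_t nb)
{
    size_t i, n = na < nb ? na : nb;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}

/* Sample data: U+FF5E and U+10000 in both encodings. */
static const unsigned char  utf8_ff5e[]   = { 0xEF, 0xBD, 0x9E };
static const unsigned char  utf8_10000[]  = { 0xF0, 0x90, 0x80, 0x80 };
static const unsigned short utf16_ff5e[]  = { 0xFF5E };
static const unsigned short utf16_10000[] = { 0xD800, 0xDC00 };
```

In UTF-8, compare_bytes puts U+FF5E before U+10000, which matches code point order. In UTF-16, compare_utf16_units puts U+FF5E after U+10000, because the leading surrogate 0xD800 is numerically smaller than 0xFF5E.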
>
> Engineering proposal:
>
> I think Oracle et al. should consider using, instead of UTF-16, what I
> propose to call UTF-16F (F for "fixed") in their B-trees, to maintain
> UCS binary sorting order:
>
> Conversion between UTF-16 and UTF-16F works as follows:
>
> #include <assert.h>
>
> unsigned short utf16_to_utf16f(unsigned short u)
> {
>     assert(u <= 0xffff);
>     /* shift surrogates into the top 0x800 code positions of 16-bit space */
>     if (u >= 0xe000)
>         return u - 0x800;    /* 0xE000..0xFFFF move down to 0xD800..0xF7FF */
>     if (u >= 0xd800)
>         return u + 0x2000;   /* surrogates move up to 0xF800..0xFFFF */
>     return u;
> }
>
> unsigned short utf16f_to_utf16(unsigned short u)
> {
>     assert(u <= 0xffff);
>     /* shift surrogates back into UTF-16 position */
>     if (u >= 0xf800)
>         return u - 0x2000;   /* 0xF800..0xFFFF move back to 0xD800..0xDFFF */
>     if (u >= 0xd800)
>         return u + 0x800;    /* 0xD800..0xF7FF move back to 0xE000..0xFFFF */
>     return u;
> }
>
>
> UTF-16F is not backwards compatible with UCS-2 (no need for that inside a
> B-tree anyway!), but it has the same space requirement as UTF-16 and
> conversion is absolutely trivial and extremely efficient (see above).
> And UTF-16F binary sorts like UCS-4 and UTF-8, which might make life
> simpler for B-tree hackers.
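
A short, self-contained C sketch of the two properties claimed for UTF-16F: the conversion round-trips for every 16-bit value, and comparing converted code units restores code point order. The two conversion functions restate the ones quoted earlier in this message; utf16f_roundtrip_ok, compare_as_utf16f and the sample arrays are hypothetical names added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Restated from the message: move surrogates to the top of the 16-bit space. */
static unsigned short utf16_to_utf16f(unsigned short u)
{
    if (u >= 0xe000)
        return (unsigned short)(u - 0x800);
    if (u >= 0xd800)
        return (unsigned short)(u + 0x2000);
    return u;
}

/* Restated from the message: move surrogates back to their UTF-16 position. */
static unsigned short utf16f_to_utf16(unsigned short u)
{
    if (u >= 0xf800)
        return (unsigned short)(u - 0x2000);
    if (u >= 0xd800)
        return (unsigned short)(u + 0x800);
    return u;
}

/* Returns 1 if converting to UTF-16F and back is the identity on 0..0xFFFF. */
static int utf16f_roundtrip_ok(void)
{
    unsigned long i;
    for (i = 0; i <= 0xFFFFUL; i++) {
        unsigned short u = (unsigned short)i;
        if (utf16f_to_utf16(utf16_to_utf16f(u)) != u)
            return 0;
    }
    return 1;
}

/* Compare two UTF-16 strings by their UTF-16F code units. */
static int compare_as_utf16f(const unsigned short *a, size_t na,
                             const unsigned short *b, size_t nb)
{
    size_t i, n = na < nb ? na : nb;
    for (i = 0; i < n; i++) {
        unsigned short x = utf16_to_utf16f(a[i]);
        unsigned short y = utf16_to_utf16f(b[i]);
        if (x != y)
            return x < y ? -1 : 1;
    }
    return (na > nb) - (na < nb);
}

/* Sample data: U+FF5E (BMP) and U+10000 (surrogate pair) in UTF-16. */
static const unsigned short sample_ff5e[]  = { 0xFF5E };
static const unsigned short sample_10000[] = { 0xD800, 0xDC00 };
```

With these samples, 0xFF5E maps to 0xF75E and the leading surrogate 0xD800 maps to 0xF800, so compare_as_utf16f orders U+FF5E before U+10000, the same order a UCS-4 or UTF-8 binary comparison gives.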
>
> There is no need to standardize UTF-16F, as it is (like UTF-8) an
> internal coding trick, not something you would want to use for data
> transfer. Might be useful though to document it as tutorial information
> somewhere in the next Unicode book.
>
> Markus
>
> --
> Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
> Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
>
> -
> Linux-UTF8: i18n of Linux on all levels
> Archive: http://mail.nl.linux.org/linux-utf8/
>
>


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT