Re: CESU-8 vs UTF-8

From: DougEwell2@cs.com
Date: Sat Sep 15 2001 - 13:58:34 EDT


"Carl W. Brown" <cbrown@xnetinc.com> writes:

> This is not a true statement. "It is not intended nor recommended as an
> encoding used for open information exchange." is false. Its intent is to
> lay out a format encoding between Oracle and Peoplesoft code in the hopes
> that they can get other database vendors to support it. They are really
> asking for a public standard not a private implementation.
>
> If it were only an internal protocol used internally by a single vendor they
> would not be submitting a UTR.

Exactly. If CESU-8 were intended only as an internal representation, it
would not matter whether it had any official recognition or blessing from
Unicode. I can store Unicode data internally any way I want, using UTF-17
[1] if I choose, and there is nothing non-conformant about this as long as I
treat the data as scalar values and can convert to the real UTFs for data
exchange purposes. To propose CESU-8 in a Technical Report is, as Carl said,
an attempt to make it an official, public standard.

> 1) Why ask for an IANA character set designation for "internal use within
> systems processing Unicode"? This is a definite indication that the real
> intent goes well beyond even the multi-vendor application to data base
> interfaces. It is apparent that the real intent is to use the force of
> standards not only to compel the major database developers to offer support
> for CESU-8 but to make it a public internet standard as well.

This section of the TR amazed me. In the Summary and elsewhere, CESU-8 "is
not intended nor recommended as an encoding used for open information
exchange," but by the end of the document we learn that it will be registered
with the Internet Assigned Numbers Authority. I have spelled out IANA for a
reason, to highlight that it is a body dealing with open information exchange
over the Internet. This completely refutes all of the "internal use only"
claims made in the rest of the document.

> Is there an alternative? Yes. You must use special code to compare UTF-16.
> If you use the OLD UCS-2 code it will give you the unique UTF-16 compare
> problem. However by adding two instructions to the compare that add very
> little overhead, you can provide a Unicode code point compare routine that
> sorts in exactly the same order as UTF-32 & UTF-8.

This was my solution long ago: fix the code that sorts in UCS-2 order so that
supplementary characters are sorted correctly. In case there is any
disagreement about this, sorting by UCS-2 order has been WRONG ever since
surrogates and UTF-16 were invented.
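
As a rough illustration of the kind of fix-up Carl describes, here is a
minimal sketch in Python. The function names (utf16_units, fixup,
code_point_order_key) are made up for this example, and it assumes
well-formed UTF-16 with no unpaired surrogates. The idea is to rotate the
code unit range D800..FFFF so that surrogates land above E000..FFFF before
comparing:

```python
def utf16_units(s):
    """Return the UTF-16 code units of a well-formed string as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap one UTF-16 code unit so that unit order matches code point order.

    Surrogates D800..DFFF move up to F800..FFFF, and E000..FFFF move down
    to D800..F7FF. Units below D800 are left alone.
    """
    if unit >= 0xD800:
        unit = unit - 0x800 if unit >= 0xE000 else unit + 0x2000
    return unit


def code_point_order_key(units):
    """Sort key: comparing these lists compares the text in code point order."""
    return [fixup(u) for u in units]


strings = ["\uFF5E", "\U00010000", "a", "\uE000"]
# Raw code unit (UCS-2 style) order puts U+10000 before U+E000 and U+FF5E.
ucs2_order = sorted(strings, key=utf16_units)
# With the fix-up, U+10000 sorts last, as it does in UTF-8 and UTF-32.
fixed_order = sorted(strings, key=lambda s: code_point_order_key(utf16_units(s)))
```

In a compiled comparison routine the fix-up would only be needed on the
first pair of code units that differ, which is why the added cost is so
small.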

However, the database vendors' position is that there is now data sorted in
this way, and it cannot be changed or database integrity will be compromised.
Fine, there is another alternative: sort all data in UCS-2 order, regardless
of the encoding scheme. This takes, as Carl said, about two lines of code.
You don't lose any significant processing time, and you DON'T need to invent
a new encoding scheme.
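
To make this concrete, here is one possible sketch in Python of comparing
ordinary UTF-8 in UTF-16 binary order. The name utf16_order_key is
hypothetical, and the sketch assumes valid UTF-8 input. It relies on the
fact that U+E000..U+FFFF are exactly the characters whose UTF-8 lead byte
is EE or EF, while supplementary characters use lead bytes F0..F4, so
swapping those two groups of lead bytes is enough:

```python
def utf16_order_key(utf8_bytes):
    """Sort key that orders valid UTF-8 byte strings as UTF-16 would.

    Lead bytes F0..F4 (supplementary characters) are moved down to EE..F2,
    and lead bytes EE..EF (U+E000..U+FFFF) are moved up to F3..F4. All
    other bytes, including continuation bytes, are unchanged.
    """
    out = bytearray()
    for b in utf8_bytes:
        if b >= 0xF0:
            b -= 2      # supplementary lead bytes now sort below EE/EF
        elif b >= 0xEE:
            b += 5      # U+E000..U+FFFF lead bytes now sort above them
        out.append(b)
    return bytes(out)


strings = ["\uFF5E", "\U00010000", "a", "\uE000"]
# Plain UTF-8 byte order is code point order: U+10000 sorts last.
utf8_order = sorted(strings, key=lambda s: s.encode("utf-8"))
# With the remapping, U+10000 sorts before U+E000, as in UTF-16 binary order.
utf16_like_order = sorted(strings, key=lambda s: utf16_order_key(s.encode("utf-8")))
```

As with the UTF-16 case, a real comparison routine would only need to apply
the remapping to the first pair of bytes that differ.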

> 2) The time is now to add the specification of code point order compare
> support for systems, databases and libraries offering UTF-16 support before
> Unicode systems are split into two different migration paths for future
> multi plane character support and while vendors are upgrading from UCS-2 to
> UTF-16 support.

Unicode has, understandably, avoided recommending binary code point order,
referring people instead to the Collation Algorithm for culturally correct
sorting. This is good because it alerts designers of most applications to
the real issues surrounding collation. For database applications, however,
there is a need for binary code point order that has more to do with
consistency than cultural correctness. I accept this, but still contend that
you can sort UTF-8 data in UCS-2 code point order quickly and easily, without
the need for CESU-8 at all, let alone the need to enshrine it in a TR.

There was a lot that I liked in this PDUTR. The misleading name "UTF-8S" has
been replaced, and there are all those caveats that CESU-8 is not, not, NOT
to be used in open data exchange. None of these caveats, however, can be
taken seriously as long as Section 4, "IANA Registration," is present.

I suggest, as part of the Proposed Draft stage for this document, that
Section 4 be deleted and that IANA be informed that CESU-8 is intended as an
internal encoding only and that they are explicitly requested NOT to register
it.

-Doug Ewell
 Fullerton, California

[1] UTF-17 was a *humorous* description of an exceedingly inefficient
Unicode character encoding scheme. It was not proposed seriously and does
not contribute to the proliferation of UTFs.


