CESU-8 marches on

From: DougEwell2@cs.com
Date: Sat Dec 22 2001 - 03:31:57 EST


Without any fanfare, at least on this public mailing list, the proposed
Unicode Technical Report #26 defining CESU-8 (Compatibility Encoding Scheme
for UTF-16: 8-Bit) has been upgraded in the past week from "Proposed Draft"
status to "Draft" status. That means CESU-8 is moving forward along the road
to approval by the UTC, however smooth or rocky that road may be.

So it seems like a sensible time to get back on my soapbox about CESU-8, ask
the pivotal question once again concerning the motivation for this new
scheme, and point out a lingering error in the TR while I'm at it.

CESU-8, for those who may have forgotten or repressed it, is a variation of
UTF-8 which encodes supplementary characters in six bytes instead of four
bytes. Essentially, it is UTF-8 applied to UTF-16 code units instead of
Unicode scalar values. The UTF-16 transformation is applied to each
supplementary character, breaking it into a high surrogate and a low
surrogate, and then the UTF-8 transformation is applied to the two
surrogates, so that each is encoded in three bytes.
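The two-step construction is easy to see in code. Here is a minimal Python sketch of the transformation just described (my own illustration, not code from the TR; the function name is mine):

```python
def cesu8_encode_supplementary(cp: int) -> bytes:
    """Encode a supplementary code point (U+10000..U+10FFFF) as CESU-8."""
    assert 0x10000 <= cp <= 0x10FFFF
    # Step 1: the UTF-16 transformation -- split into a surrogate pair.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high surrogate
    low = 0xDC00 | (v & 0x3FF)   # low surrogate

    # Step 2: apply the three-byte UTF-8 pattern to each surrogate.
    def three_bytes(u: int) -> bytes:
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])

    return three_bytes(high) + three_bytes(low)

# U+10400 DESERET CAPITAL LETTER LONG I:
print(cesu8_encode_supplementary(0x10400).hex())  # eda081edb080 (6 bytes)
print(chr(0x10400).encode('utf-8').hex())         # f0909080     (4 bytes)
```

Note that the six CESU-8 bytes are exactly what UTF-8 would produce for the two surrogate code points, which is precisely why such sequences are illegal in real UTF-8.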

CESU-8 was originally called UTF-8S, at least on this list, the "S"
presumably denoting the variant encoding of Surrogates. It has been promoted
by representatives of Oracle, notably Jianping Yang, and PeopleSoft, notably
Toby Phipps (the author of DUTR #26), as a way to ensure that Unicode data is
sorted consistently in UTF-16 code-point binary order.

Several people on this list, including me, have been critical of CESU-8,
claiming that UTF-16 code-point order is not a suitable collation order and
should not serve as the basis of a new (or hacked) UTF. UTFs are supposed
to be character encoding forms (cf. UTR #17, "Character Encoding Model") that
map Unicode scalar values to sequences of bytes, words, double-words, etc.
You're not supposed to piggyback a UTF on top of another UTF, the way CESU-8
sits on top of UTF-16.

The critics of CESU-8 claim its reason for existence is that the database
vendors have been ignoring the designation of the supplementary code space
and have handled "Unicode" as surrogate-unaware UCS-2. Now that
supplementary characters have become a reality (as of Unicode 3.1), the
vendors have chosen to promote this new encoding scheme instead of either (a)
fixing the sort order of existing database engines to sort supplementary
characters properly, AFTER basic characters, or (b) making a small
modification to their sort routines to sort normal UTF-8 data in the
idiosyncratic UCS-2-like order.
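The disagreement between the two binary orders is easy to demonstrate. Here is a short Python sketch (my own illustration, not any vendor's code) sorting the same two characters by their UTF-8 bytes and by their UTF-16 code units:

```python
chars = ['\uFF21',      # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A (BMP)
         '\U00010400']  # U+10400 DESERET CAPITAL LETTER LONG I (supplementary)

# UTF-8 bytes preserve code-point order: U+FF21 (EF BC A1) < U+10400 (F0 90 90 80).
utf8_order = sorted(chars, key=lambda c: c.encode('utf-8'))

# UTF-16 code units put the surrogate pair (D801 DC00) before FF21.
utf16_order = sorted(chars, key=lambda c: c.encode('utf-16-be'))

print([hex(ord(c)) for c in utf8_order])   # ['0xff21', '0x10400']
print([hex(ord(c)) for c in utf16_order])  # ['0x10400', '0xff21']
```

Option (b) above amounts to using something like the second key function on ordinary UTF-8 data, without inventing a new encoding scheme at all.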

There is also a concern that CESU-8 is really just a variation of UTF-8,
allowing (nay, requiring) sequences that are illegal in UTF-8 but otherwise
looking just like UTF-8. This could open security holes that the UTC has
worked hard to close, and is continuing to close in Unicode 3.2.

Finally, although the promoters claim that this mutant form of UTF-8 is only
for internal use within closed systems (which would make it completely
unnecessary for the Unicode Consortium to sanction, describe, or even
acknowledge it), they have not only written a Technical Report to describe it
to the public but have announced their intent to register it with the IANA, a
major step toward open interchange of CESU-8 data. (It was claimed that the
IANA registration was intended to pre-empt some other party from registering
CESU-8 with IANA, but I don't see what difference this would make or how the
pre-emptive action would help anything.)

The promoters of CESU-8 say that data in this format already exists in the
real world, and the purpose in describing it in a UTR is to codify an
existing de-facto standard. For me, there is one question that gets at the
real motivation behind CESU-8. We know that basic (BMP)
characters are encoded exactly the same in UTF-8 and CESU-8. We also know
that, although the supplementary space has been designated for many years, no
actual supplementary characters (with the exception of private use planes 15
and 16) were encoded, and thus allowed for interchange, until the publication
of Unicode 3.1 earlier this year.

Furthermore, we know what characters are currently (Unicode 3.1) encoded in
the supplementary space: the ancient Old Italic and Gothic scripts; the
Deseret script, which has not been actively promoted for 130 years; a large
set of musical and mathematical symbols; the Plane 14 language tags; and
several thousand Han characters. The Han characters are generally thought to
be less commonly used than those in the BMP; otherwise (so the story goes)
they would have been encoded in Unicode sooner. Remember that none of these
non-BMP characters could be conformantly used (e.g. stored in a database)
until the publication of Unicode 3.1.

So my question is: What supplementary characters are currently, TODAY,
stored in Oracle or PeopleSoft databases that require the creation of a new
encoding scheme to ensure they can continue to be sorted consistently?

I suspect there are none, and the real rationale behind CESU-8 is not to
guarantee consistent sorting of existing non-BMP data but to validate the
continued use of surrogate-unaware, UCS-2 mechanisms for handling "Unicode"
data. I have asked this question before, and nobody was able to cite an
example of real-world supplementary characters that require this
extraordinary handling.

Oh yes, I almost forgot: the lingering error. The original PDUTR contained
the following passage:

"The bit pattern 11110xxx is illegal in any CESU-8 byte, effectively
prohibiting the occurrence of UTF-8 four-byte surrogates in CESU-8."

Somebody, I think it was Markus Scherer, pointed out that this was wrong; the
bit pattern 1111xxxx (note fifth character 'x' instead of '0') is actually
illegal. This has been changed in the DUTR, but not to the correct bit
pattern:

"The bit pattern 11111xxx is illegal in any CESU-8 byte...."
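The correct statement can be checked mechanically. Here is a Python sketch (my own check, not code from the TR) that encodes a sample of characters through the UTF-16-then-UTF-8 construction and confirms that no CESU-8 byte ever matches 1111xxxx, so 11111xxx alone understates the illegal range:

```python
def cesu8(s: str) -> bytes:
    """Apply the 1- to 3-byte UTF-8 patterns to each UTF-16 code unit
    (surrogates included) rather than to Unicode scalar values."""
    units = s.encode('utf-16-be')
    out = bytearray()
    for i in range(0, len(units), 2):
        u = (units[i] << 8) | units[i + 1]
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:  # 0x800-0xFFFF, including each half of a surrogate pair
            out += bytes([0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)

# Sample the BMP (skipping the surrogate block) plus the extremes of the
# supplementary range.
sample = [chr(cp) for cp in range(0x20, 0xD800, 257)]
sample += [chr(cp) for cp in (0xE000, 0xFFFD, 0x10000, 0x10FFFF)]

max_byte = max(max(cesu8(c)) for c in sample)
print(hex(max_byte))  # 0xef
```

The largest lead byte a three-byte pattern can produce is 0xE0 | 0xF = 0xEF, so every byte in 0xF0-0xFF (the full pattern 1111xxxx) is impossible in CESU-8, not just 0xF8-0xFF (11111xxx).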

-Doug Ewell
 Fullerton, California


