CESU-8 vs UTF-8 (Was: PDUTR #26 posted)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Sep 15 2001 - 00:39:43 EDT


Julie,

> Proposed Draft Unicode Technical Report #26: Compatibility Encoding

Thank you for posting this.

"This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16
(CESU) that is intended as an alternate encoding to UTF-8 for internal use
within systems processing Unicode in order to provide an ASCII-compatible
8-bit encoding that preserves UTF-16 binary collation. It is not intended
nor recommended as an encoding used for open information exchange. The
Unicode Consortium does not encourage the use of CESU-8, but does recognize
the existence of data in this encoding and supplies this technical report to
clearly define the format and to distinguish it from UTF-8. This encoding
does not replace or amend the definition of UTF-8."

This is not a true statement. "It is not intended nor recommended as an
encoding used for open information exchange." is false. Its intent is to
lay out an encoding format shared between Oracle and PeopleSoft code in the
hope that they can get other database vendors to support it. They are
really asking for a public standard, not a private implementation.

If it were only an internal protocol used by a single vendor, they would
not be submitting a UTR.

The decision becomes: should the Unicode committee approve this as a public
encoding? To determine that, you have to ask three questions. Is there a
problem? Are there any negative impacts? Is there an alternative?

Is there a problem? I think that the answer is yes. Once you implement
characters outside the BMP, binary sorts of UTF-32 and UTF-8 produce a
different order from binary sorts of UTF-16. If your application's compares
must match a database's key sort, you have problems whenever you transform
the Unicode from the native database encoding. They want Oracle data stored
in UTF-8 to match data encoded by other databases in UTF-16.
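To make the mismatch concrete, here is a small sketch (mine, not from the
UTR) using U+FF21, a BMP character, and U+10400, a supplementary character
that UTF-16 encodes as a surrogate pair:

```python
bmp = "\uff21"        # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
supp = "\U00010400"   # U+10400 DESERET CAPITAL LETTER LONG I

# In code point order -- which UTF-8 and UTF-32 binary order both
# preserve -- U+FF21 sorts before U+10400.
utf8_order = bmp.encode("utf-8") < supp.encode("utf-8")          # True

# In UTF-16 binary order, U+10400 becomes the surrogate pair
# D801 DC00, and 0xD801 < 0xFF21, so the supplementary character
# sorts *before* the BMP character instead.
utf16_order = bmp.encode("utf-16-be") < supp.encode("utf-16-be") # False

print(utf8_order, utf16_order)
```

Any pair of a supplementary character and a BMP character in the range
U+E000..U+FFFF will show the same disagreement.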

Are there negative impacts? Yes. CESU-8 will almost work with most UTF-8
support libraries. This causes the worst type of errors: you need code that
either works right or breaks outright, not code that introduces subtle
errors. CESU-8 will fool most UTF-8 detection routines. It can create
security problems just like non-shortest-form encoding in UTF-8, because
the "character" is not a character but a surrogate.
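A sketch (mine, not from the UTR) of why CESU-8 fools UTF-8 validators:
CESU-8 encodes each UTF-16 surrogate as its own three-byte sequence, which
is structurally well-formed UTF-8 except that the decoded values land in
the forbidden surrogate range U+D800..U+DFFF:

```python
def cesu8_encode(s: str) -> bytes:
    """Encode a string as CESU-8 (identical to UTF-8 below U+10000)."""
    out = bytearray()
    for cp in (ord(c) for c in s):
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        elif cp < 0x10000:
            out += bytes([0xE0 | cp >> 12,
                          0x80 | cp >> 6 & 0x3F,
                          0x80 | cp & 0x3F])
        else:
            # Split into a UTF-16 surrogate pair, then encode each half
            # as if it were a BMP character -- the CESU-8 difference.
            cp -= 0x10000
            for sur in (0xD800 | cp >> 10, 0xDC00 | cp & 0x3FF):
                out += bytes([0xE0 | sur >> 12,
                              0x80 | sur >> 6 & 0x3F,
                              0x80 | sur & 0x3F])
    return bytes(out)

print(cesu8_encode("\U00010400").hex())    # six bytes:  eda081edb080
print("\U00010400".encode("utf-8").hex())  # four bytes: f0909080
```

Every byte of the six-byte CESU-8 form follows the lead-byte/trail-byte
pattern of UTF-8, so a validator that does not check for surrogate values
will pass it.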

Is there an alternative? Yes. You must use special code to compare UTF-16.
If you use the old UCS-2 code, it will give you the unique UTF-16 compare
problem. However, by adding two instructions to the compare, which add very
little overhead, you can provide a Unicode code point compare routine that
sorts in exactly the same order as UTF-32 and UTF-8.
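One possible realization of that cheap fixup (my sketch; the exact
instructions are not given above) is to rotate the 16-bit code unit ranges
so that surrogates sort above all other BMP code units, then compare the
adjusted units:

```python
def _adjust(u: int) -> int:
    # Two cheap operations per differing code unit:
    # move U+E000..U+FFFF down below where the surrogates were, and
    # move the surrogates D800..DFFF up to the top of the 16-bit range.
    if u >= 0xE000:
        return u - 0x800
    if u >= 0xD800:
        return u + 0x2000
    return u

def utf16_codepoint_compare(a, b):
    """Compare two UTF-16 code unit sequences in code point order."""
    for x, y in zip(a, b):
        if x != y:
            return _adjust(x) - _adjust(y)
    return len(a) - len(b)

print(utf16_codepoint_compare([0xFF21], [0xD801, 0xDC00]) < 0)  # True
```

With this adjustment, U+FF21 again sorts before U+10400 (lead surrogate
0xD801), matching UTF-8 and UTF-32 binary order, while plain UTF-16 binary
comparison gives the opposite answer.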

I propose that since all UCS-2 vendors will have to upgrade their code to
provide UTF-16 support, part of UTF-16 compliance should be that all UTF-16
compares default to a code point order compare. You might want to allow an
optional binary compare, but the standard compare should be in code point
order.

This provides an optimal solution to the problem for everybody. The small
extra overhead is just like the extra overhead of checking for and handling
surrogates. If that overhead is a problem, then UTF-32 is an alternate
solution.

Carl



This archive was generated by hypermail 2.1.2 : Fri Sep 14 2001 - 23:19:59 EDT