RE: CESU-8 vs UTF-8 (Was: PDUTR #26 posted

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Sep 15 2001 - 02:11:28 EDT


Sorry but I left out three points.

1) Why ask for an IANA character set designation for "internal use within
systems processing Unicode"? This is a definite indication that the real
intent goes well beyond even the multi-vendor application to database
interfaces. It is apparent that the real intent is to use the force of
standards not only to compel the major database developers to offer support
for CESU-8 but to make it a public internet standard as well.

2) The time is now to add the specification of code point order compare
support for systems, databases and libraries offering UTF-16 support before
Unicode systems are split into two different migration paths for future
multi plane character support and while vendors are upgrading from UCS-2 to
UTF-16 support.

3) We don't want to have to deal with CESU-8 in systems that do not use
UTF-16.

It will be almost impossible to develop code that supports both CESU-8 and
UTF-8 well. It will spread the sort problem from the special case to all
systems that use databases or communicate with other systems, because they
will have to support a mix of CESU-8 and UTF-8 simultaneously, and the two
are by definition required to have distinctly different sort orders.
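
To make the sort-order difference concrete, here is a minimal Python sketch.
The helper name cesu8_encode is my own, written by hand for illustration
(it is not a library call), and the two sample characters are arbitrary
picks: one near the top of the BMP and the first supplementary code point.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    Supplementary code points are split into a UTF-16 surrogate pair and
    each surrogate is written as its own 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte form of a surrogate
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uFF5E"        # U+FF5E, a BMP character above the surrogate range
supp = "\U00010000"   # U+10000, the first supplementary character

# The same two keys, sorted by the raw bytes of each encoding
utf8_order = sorted([supp, bmp], key=lambda s: s.encode("utf-8"))
cesu8_order = sorted([supp, bmp], key=cesu8_encode)

print(utf8_order == [bmp, supp])    # UTF-8 bytes: U+FF5E sorts before U+10000
print(cesu8_order == [supp, bmp])   # CESU-8 bytes: U+10000 sorts before U+FF5E
```

In UTF-8 the supplementary character starts with byte F0, above the EF that
starts U+FF5E. In CESU-8 it starts with ED, below EF, so the same two keys
come out in the opposite order.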

Let's fix the problem the right way.

Thank you, (Now stepping off the soap box)

Carl

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Carl W. Brown
> Sent: Friday, September 14, 2001 9:40 PM
> To: unicode@unicode.org
> Subject: CESU-8 vs UTF-8 (Was: PDUTR #26 posted
>
>
> Julie,
>
> > Proposed Draft Unicode Technical Report #26: Compatibility Encoding
>
> Thank you for posting this.
>
> "This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16
> (CESU) that is intended as an alternate encoding to UTF-8 for internal use
> within systems processing Unicode in order to provide an ASCII-compatible
> 8-bit encoding that preserves UTF-16 binary collation. It is not intended
> nor recommended as an encoding used for open information exchange. The
> Unicode Consortium, does not encourage the use of CESU-8, but does recognize
> the existence of data in this encoding and supplies this technical report to
> clearly define the format and to distinguish it from UTF-8. This encoding
> does not replace or amend the definition of UTF-8."
>
> This is not a true statement. "It is not intended nor recommended as an
> encoding used for open information exchange." is false. Its intent is to
> lay out an encoding format between Oracle and PeopleSoft code in the hope
> that they can get other database vendors to support it. They are really
> asking for a public standard, not a private implementation.
>
> If it were only an internal protocol used internally by a single vendor they
> would not be submitting a UTR.
>
> The decision becomes: should the Unicode committee approve this as a public
> encoding? To determine that you have to ask three questions. Is there a
> problem? Are there any negative impacts? Is there an alternative?
>
> Is there a problem? I think that the answer is yes. Once you implement
> characters outside the BMP, binary sorts of UTF-32 and UTF-8 produce a
> different order from binary sorts of UTF-16. If your application's compares
> must match a database's key sort, then you have problems if you transform
> the Unicode from the native database encoding. They want Oracle data stored
> in UTF-8 to match data encoded by other databases in UTF-16.
>
> Are there negative impacts? Yes. It will almost work with most UTF-8
> support libraries. This causes the worst type of errors. You need code that
> either works right or really breaks, not code that introduces subtle errors.
> It will fool most UTF-8 detection routines. It can create security problems
> just like non-shortest-form encoding in UTF-8 because the "character" is not
> a character but a surrogate.
>
> Is there an alternative? Yes. You must use special code to compare UTF-16.
> If you use the old UCS-2 code it will give you the unique UTF-16 compare
> problem. However, by adding two instructions to the compare that add very
> little overhead, you can provide a Unicode code point compare routine that
> sorts in exactly the same order as UTF-32 and UTF-8.
>
> I propose that, since all UCS-2 vendors will have to upgrade their code to
> provide UTF-16 support, part of UTF-16 compliance should be that all
> UTF-16 compares default to a code point order compare. You might want to
> allow an optional binary compare, but the standard compare should be in
> code point order.
>
> This provides an optimal solution to the problem for everybody. This small
> extra overhead is just like the extra overhead in checking for and handling
> surrogates. If this is a problem then UTF-32 is an alternate solution.
>
> Carl
>
>
>
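
As a rough illustration of the alternative described in the quoted message,
here is a minimal Python sketch of a UTF-16 compare in code point order. It
uses one commonly seen fix-up (rotating the surrogate range above
U+E000..U+FFFF at the first differing code unit); this is an assumption on
my part about how such a compare could look, not necessarily the exact two
instructions meant above. All function names are hypothetical.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be", "surrogatepass")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def compare_binary(a: list, b: list) -> int:
    """Old UCS-2 style compare: plain code unit order."""
    return (a > b) - (a < b)


def fixup(unit: int) -> int:
    """Move surrogates (D800..DFFF) above E000..FFFF for comparison only."""
    if unit >= 0xE000:
        return unit - 0x800     # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000    # D800..DFFF -> F800..FFFF
    return unit


def compare_code_point_order(a: list, b: list) -> int:
    """Compare UTF-16 code units so the result matches UTF-8/UTF-32 order."""
    for x, y in zip(a, b):
        if x != y:
            # The fix-up is applied only at the first difference
            return -1 if fixup(x) < fixup(y) else 1
    return (len(a) > len(b)) - (len(a) < len(b))


bmp = utf16_units("\uFF5E")        # [0xFF5E]
supp = utf16_units("\U00010000")   # [0xD800, 0xDC00]

print(compare_binary(bmp, supp))            # 1: U+FF5E sorts after U+10000
print(compare_code_point_order(bmp, supp))  # -1: U+FF5E sorts before U+10000
```

The extra work happens only at the first code unit that differs, which is why
the overhead stays small compared with the rest of the compare loop.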


