Re: CESU-8 vs UTF-8

From: DougEwell2@cs.com
Date: Sun Sep 16 2001 - 19:40:42 EDT


In a message dated 2001-09-16 13:13:38 Pacific Daylight Time,
cbrown@xnetinc.com writes:

> I think that some applications will find it easier to migrate to
> UTF-32 rather than convert to UTF-16.

I know I have. Handle everything internally as UTF-32, then read and write
UTF-8 or UTF-16 as appropriate.
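A rough sketch of that model in Python (the function names are mine, purely for illustration): decode whatever arrives into plain code points, work on those, and re-encode only at the boundary.

```python
def to_code_points(data: bytes, encoding: str) -> list[int]:
    """Decode UTF-8 or UTF-16 bytes into a list of code points (UTF-32 values)."""
    return [ord(c) for c in data.decode(encoding)]

def from_code_points(points: list[int], encoding: str) -> bytes:
    """Encode a list of code points back out as UTF-8 or UTF-16."""
    return "".join(chr(p) for p in points).encode(encoding)

# U+10400 DESERET CAPITAL LETTER LONG I, a supplementary character:
points = to_code_points("\U00010400".encode("utf-8"), "utf-8")
# internally it is just the single value 0x10400; on output it becomes
# either 4 UTF-8 bytes or a UTF-16 surrogate pair, as appropriate.
```

The point is that the internal representation never cares which transformation format the data came from.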

> CESU-8 breaks that model because it is a form of Unicode with the sole
> purpose of supporting a non-code-point sort order. Yes, I could devise
> a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order, but that is
> only a matter of some messy code. The real issue is that I must now
> handle Unicode that has, as part of its essential properties, the
> requirement that it survive transforms with two distinctly different
> sort orders.
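For what it's worth, the "messy code" need not be very messy. UTF-16 binary order differs from code point order only in that E000..FFFF sorts *after* the supplementary characters (whose surrogates begin at D800), so a sort key only has to shift that one range. A Python sketch (my own function name, not from any library):

```python
def utf16_binary_key(s: str):
    """Sort key giving UTF-16 code-unit binary order for a string of code points."""
    key = []
    for c in s:
        cp = ord(c)
        if 0xE000 <= cp <= 0xFFFF:
            # Push E000..FFFF above the supplementary planes, mirroring the
            # fact that surrogates (D800..DFFF) sort below E000 in UTF-16.
            cp += 0x800000
        key.append(cp)
    return key
```

Sorting UTF-32 (or decoded UTF-8) with this key reproduces the order a binary sort of the UTF-16 form would give.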

I was glad when Unicode began moving away from the doctrine of "treat all
characters as 16-bit code units" and toward "treat them as abstract code
points in the range 0..0x10FFFF." Make no mistake, UTF-16 can be a useful
16-bit transformation format; but it should not be considered the essence of
Unicode, especially not to the point where additional machinery needs to be
built on top of the Unicode standard solely to support UTF-16.

> With this standard approved, my applications can be compelled to use
> CESU-8 in place of UTF-8 if I were to talk to PeopleSoft or other
> packages that will insist on this sort order of Unicode data. If I use
> UTF-8 as well, then I will need two completely different sets of
> support routines.

Actually, what you will need is *one* routine that works with both UTF-8 and
CESU-8, but breaks the definition of both in doing so, by permitting either
method of handling supplementary characters, and auto-detecting the data as
UTF-8 or CESU-8 based on the method encountered.
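Such a dual-mode routine might look like this in Python (a sketch of my own, deliberately lax about both definitions): let the decoder pass encoded surrogates through, then reassemble any pairs it finds.

```python
def decode_utf8_or_cesu8(data: bytes) -> str:
    """Decode bytes that may use UTF-8 (4-byte) or CESU-8 (6-byte surrogate
    pair) sequences for supplementary characters, accepting either."""
    # 'surrogatepass' lets the UTF-8 decoder accept the three-byte encoded
    # surrogates that CESU-8 produces, instead of rejecting them.
    units = data.decode("utf-8", errors="surrogatepass")
    out, i = [], 0
    while i < len(units):
        c = ord(units[i])
        if 0xD800 <= c <= 0xDBFF and i + 1 < len(units):
            c2 = ord(units[i + 1])
            if 0xDC00 <= c2 <= 0xDFFF:
                # A CESU-8 surrogate pair: recombine into one code point.
                out.append(chr(0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)))
                i += 2
                continue
        out.append(units[i])
        i += 1
    return "".join(out)
```

Note that this routine is, by design, conformant to neither specification: a strict UTF-8 decoder must reject the CESU-8 byte sequences, and vice versa.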

> My problem is that the correct approach is for people like PeopleSoft
> to fix their code before accepting non-BMP characters.

Still unanswered, in this proposal to sanctify hitherto non-standard
representations of non-BMP characters in commercial databases, is the
question of how much non-BMP data even exists in commercial databases in the
first place. I know I personally have some (and will soon have more, now
that SC UniPad supports Deseret), but what about users of Oracle and
PeopleSoft databases? Other than the private-use planes, it was not even
allowable to use non-BMP characters until the release of Unicode 3.1 earlier
this year. Where is the great need for a compatibility encoding?

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Sun Sep 16 2001 - 18:57:04 EDT