RE: CESU-8 vs UTF-8

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sun Sep 16 2001 - 15:58:22 EDT


Marcin,

>
> We can't change the past, but I hope that at least UTF-8 processing can
> be done without treating surrogates in any special way. Surrogates are
> relevant only for UTF-16; by not using UTF-16 you should be free of
> surrogate issues, except by having a silly unused area in character
> numbers and a silly highest character number. Please don't spread
> UTF-16 madness where it doesn't belong.
>

I think that it took us UCS-2 to get Unicode started, but I suspect that
UTF-16 usage will eventually fade out. Unlike UCS-2, UTF-16 is just another
MBCS encoding and has lost the advantage of fixed-width characters that
UTF-32 still has. I think that some applications will find it easier to
migrate to UTF-32 than to convert to UTF-16.

With xIUA I demonstrate that it really does not matter much which format of
Unicode you use and that it is even trivial to process a mix of formats
in the same transaction. The Unicode processing is somewhat independent of
its format. To do so you must compare UTF-16 in code point order, which is
also a trivial thing to do.
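
To illustrate the code point order comparison, here is a minimal Python
sketch. It is not taken from xIUA; the function names are made up for this
example. It uses the well-known fix-up that remaps each UTF-16 code unit so
that surrogates sort above U+E000..U+FFFF, and it assumes well-formed input
(no lone surrogates).

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def fixup(unit: int) -> int:
    """Remap one code unit so that surrogates sort after U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x800   # U+E000..U+FFFF move down to 0xD800..0xF7FF
    if unit >= 0xD800:
        return unit + 0x2000  # surrogates move up to 0xF800..0xFFFF
    return unit


def code_point_key(text: str) -> list:
    """Sort key that orders UTF-16 data by code point (hypothetical helper)."""
    return [fixup(u) for u in utf16_units(text)]


words = ["\U00010000", "\uff5e"]                 # U+10000 and U+FF5E
binary_order = sorted(words, key=utf16_units)    # raw UTF-16 code unit order
fixed_order = sorted(words, key=code_point_key)  # code point order
```

With raw code units U+10000 sorts first, because its lead surrogate 0xD800 is
below 0xFF5E; with the fix-up key U+FF5E sorts first, as it does in UTF-8 and
UTF-32.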

CESU-8 breaks that model because it is a form of Unicode whose sole
purpose is to support a sort order other than Unicode code point order. Yes, I
could devise a way to sort UTF-32 and UTF-8 in UTF-16 binary sort order, but
that is only a matter of some messy code. The real issue is that I must now
handle Unicode data that has, as an essential property, the requirement that
it survive transforms with two distinctly different sort orders.
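
The two sort orders can be seen with a small Python sketch. Python has no
built-in CESU-8 codec, so cesu8_encode below is a hypothetical helper written
for this example: it encodes each UTF-16 code unit, surrogates included, as
if it were a character of its own. It assumes well-formed input text.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: each UTF-16 code unit becomes 1 to 3 bytes."""
    out = bytearray()
    raw = text.encode("utf-16-be")
    for i in range(0, len(raw), 2):
        unit = int.from_bytes(raw[i:i + 2], "big")
        # "surrogatepass" lets a lone surrogate through the UTF-8 encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


words = ["\uff5e", "\U00010000"]  # U+FF5E (BMP) and U+10000 (supplementary)

by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
by_cesu8 = sorted(words, key=cesu8_encode)
```

Here by_utf8 puts U+FF5E first (code point order), while by_utf16 and by_cesu8
both put U+10000 first, since its CESU-8 form starts with 0xED and the UTF-8
form of U+FF5E starts with 0xEF.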

With this standard approved, my applications can be compelled to use CESU-8
in place of UTF-8 if I were to talk to Peoplesoft or other packages that
insist on this sort order for Unicode data. If I use UTF-8 as well, then I
will need two completely different sets of support routines.

Fundamental to all MBCS string handling routines is character length
determination. To do that for CESU-8 I will have to check not only the
first byte but, in the case of three-byte sequences, also whether the value
corresponds to a surrogate. If I don't do this then it is like
processing MBCS data with SBCS routines. For example, if I use a UTF-8
strtok on CESU-8 data it will break the strings whenever either an initial
or a trailing token matches. So you need a special CESU-8 routine. The
problem will be that CESU-8 may be detected as UTF-8. Suppose I open a
socket and get a buffer of data that looks like UTF-8, so I decide to use the
UTF-8 support routines. The second buffer comes in with surrogates and
I continue to process it as UTF-8. This introduces errors of the worst
kind: the subtle errors. The program runs but the data is slightly bad.
Oops, I just put the amount in the credit field, not the asset field.
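
A rough Python sketch of the length determination described above follows.
Both function names are invented for this example, and neither routine
validates continuation bytes, so treat it as an outline of the extra
surrogate check rather than a complete decoder.

```python
def utf8_char_len(buf: bytes, i: int) -> int:
    """Length of the UTF-8 sequence at buf[i], judged by the lead byte only."""
    lead = buf[i]
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def cesu8_char_len(buf: bytes, i: int) -> int:
    """Length of the CESU-8 character at buf[i]; a surrogate pair counts as 6."""
    lead = buf[i]
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        # ED A0..AF xx is a high surrogate; ED B0..BF xx is a low surrogate.
        if (lead == 0xED and i + 5 < len(buf)
                and 0xA0 <= buf[i + 1] <= 0xAF
                and buf[i + 3] == 0xED
                and 0xB0 <= buf[i + 4] <= 0xBF):
            return 6
        return 3
    raise ValueError("four-byte forms are not used in CESU-8")


pair = b"\xed\xa0\x80\xed\xb0\x80"       # U+10000 in CESU-8
bmp_only = "abc\uff5e".encode("utf-8")   # identical bytes in UTF-8 and CESU-8
```

For pair, utf8_char_len reports 3 and would split the character in two, while
cesu8_char_len reports 6. For bmp_only the two routines agree at every
character boundary, which is why a first buffer without supplementary
characters gives no hint about which encoding is in use.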

If my application accepts both UTF-8 and CESU-8 then what sorting do I use
for my database?

My problem is that the correct approach is for people like Peoplesoft to fix
their code before accepting non-BMP characters. They should upgrade the
UCS-2 code to truly support UTF-16 properly. CESU-8 does more than
propagate the errors; it extends the problem by implementing a bad
solution. What started out as a comparatively minor problem for a few
people ends up as a major problem for everyone.

I think that the coexistence of both UTF-8 and CESU-8 is a nightmare and the
Unicode committee has to decide on one or the other, or restrict CESU-8 to
BMP characters only, which of course makes it a limited UTF-8. If people
really need matching UTF-16 sequences between systems, they can always
transform to UTF-8 and convert back on the other end into UTF-16 again.
They can also compare in any order they want. If they like to compare
UTF-16 in little endian byte order, more power to them; just don't ask me to
do the same.
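
As a small illustration of that round trip, the Python sketch below uses only
the standard codecs; the two function names are chosen for this example. It
assumes well-formed UTF-16 with no lone surrogates.

```python
def utf16_to_utf8(raw: bytes) -> bytes:
    """Sender side: convert UTF-16LE bytes to standard UTF-8 for the wire."""
    return raw.decode("utf-16-le").encode("utf-8")


def utf8_to_utf16(raw: bytes) -> bytes:
    """Receiver side: convert UTF-8 from the wire back to UTF-16LE."""
    return raw.decode("utf-8").encode("utf-16-le")


original = "A\uff5e\U00010000".encode("utf-16-le")
on_the_wire = utf16_to_utf8(original)
restored = utf8_to_utf16(on_the_wire)
```

The restored bytes equal the original UTF-16 sequence, surrogate pair
included, and the supplementary character travels as an ordinary four-byte
UTF-8 sequence.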

Carl


