RE: CESU-8 vs UTF-8

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sun Sep 16 2001 - 04:14:06 EDT


MichKa,

>
> Many people believe that any rule or law that makes no sense or cannot be
> enforced weakens all other laws. I believe that publishing an inconsistent
> document that would allow any reasonably intelligent reader to come to the
> same conclusions as you did, and the standard itself would be weakened
> thereby.
>

I am confused as to how Peoplesoft can justify having a "private" protocol
published by a standards committee, unless their intent is to have a real
public standard.

Until I read this I was of the opinion that Peoplesoft had convinced Oracle
to provide this interface but that they wanted a standard to arm twist
Microsoft, IBM and maybe others to provide this interface as well.

Now that I hear that it refers to an IANA character set registration, I am
afraid that they are trying to force systems that don't even use UTF-16 to
buy into this madness.

What this says is that Peoplesoft is trying to make the world change because
they do not want to change their software to do the right thing.

Now that you have closed up the UTF-8 security holes, CESU-8 would open them
back up. It would allow data to impersonate UTF-8, because it looks enough
like UTF-8 to be detected as UTF-8. If you do not kick out surrogate
encodings as bad UTF-8, so that CESU-8 can get through, then you must allow
data that contains non-distinct characters through your firewall. Because
it will be detected as UTF-8, you also have the dual representation problems
of non-shortest form encoding.
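
To make the overlap concrete, here is a minimal Python sketch of the two
byte forms for one supplementary character. The function name cesu8_encode
is my own for illustration, not something from the draft, and it assumes
the input contains no unpaired surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text the CESU-8 way: every UTF-16 code unit is encoded on its
    own, so a supplementary character becomes two 3-byte sequences."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair, then push each half
            # through the ordinary 3-byte UTF-8 bit pattern.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += bytes((0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)))
        else:
            # BMP characters are byte-for-byte identical in both encodings.
            # (Assumption: no lone surrogates in the input.)
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010000"
as_utf8 = ch.encode("utf-8")    # F0 90 80 80 (4 bytes)
as_cesu8 = cesu8_encode(ch)     # ED A0 80 ED B0 80 (6 bytes)

# A strict UTF-8 decoder kicks the CESU-8 bytes out as malformed.
try:
    as_cesu8.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# A lenient decoder lets them through as two surrogate code points, which
# is a second representation of the same character.
lenient = as_cesu8.decode("utf-8", "surrogatepass")
```

With the strict decoder the data is rejected; with the lenient one, the
same character now has two byte forms that both pass as UTF-8.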

Having dealt with security issues extensively in the past, I know that the
biggest security issues are mistakes and bugs. This is so close to UTF-8
that it may share common support code, which will introduce subtle bugs.
Those are the worst kind. If it can be demonstrated that there is a real
need for an encoding like CESU-8, then it should be very different from
UTF-8. How does SCSU, for example, sort?

If CESU-8 becomes an IANA standard then other systems can be compelled to
support it. Now these systems are faced with dealing with Unicode in two
sort sequences. If endorsed by the Unicode committee it will be a standard
that could be used between systems as well. Unicode describes the encoding,
not the use. By endorsing it they are endorsing it for any use.
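
The two sort sequences are easy to show. A small Python sketch, with
U+FF5E and U+10000 picked by me purely as an illustration: any BMP
character above the surrogate range against any supplementary character
gives the same result.

```python
bmp = "\uFF5E"       # a BMP character above the surrogate range
supp = "\U00010000"  # first supplementary character

# Binary order of the UTF-8 bytes follows code point order.
by_utf8 = sorted([supp, bmp], key=lambda s: s.encode("utf-8"))

# Binary order of the UTF-16 code units puts the surrogate pair
# (D800 DC00) ahead of FF5E. CESU-8 is designed to sort this way too.
by_utf16 = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))

same_order = by_utf8 == by_utf16
```

Here by_utf8 puts U+FF5E first and by_utf16 puts U+10000 first, so a
system that has to accept both encodings has to cope with both orders.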

"It is not intended nor recommended as an encoding used for open information
exchange. The Unicode Consortium, does not encourage the use of CESU-8, but
does recognize the existence of data in this encoding" says that it is an
acknowledged and supported Unicode encoding standard even though its use is
not encouraged. This says that you can use it as a publicly endorsed
Unicode standard.

> I, however, work on the
> assumption that IANA is not populated by morons and that they would be at
> least willing to hear from the UTC on the inadvisabiity of supporting any
> such encoding, no matter who presents it.
>

I hope that if the Unicode committee assumes that the IANA are not morons
and would not support such an encoding that they could also credit
themselves with the brains to reject it as well.

The problem is that if Unicode blesses this encoding, then IANA is hard
pressed to deny an endorsed Unicode encoding.

It is much like the fact that UTF-8 is recommended for intersystem
communications because, unlike UTF-16 and UTF-32, it has no endian
problems. Likewise it is permissible to send little-endian UTF-16 between
systems without a BOM.

If passed, it will say to the world that if your business partner wants to
use CESU-8 because they have a business need to do so, then they have the
blessing of the Unicode consortium.

By not endorsing CESU-8 you are telling the world that if you use this
standard you do so on your own. It is the proper way to say "It is not
intended nor recommended".

OTOH if they want to approve this standard because they don't feel that
anyone will take this standard seriously then they should approve it for use
with Unicode 1.x & Unicode 2.x data only.

____________________________________________________________________________

The bottom line: This UTR tells the world that if a large company has so
much software written to support UCS-2 that it does not want to add UTF-16
support, it can use this standard to force its smaller partner to jump
through hoops, because the partner has less to convert.

In all likelihood there are not too many places in their code where it is
critical that comparisons exactly match the database sort order. For
those I will supply the code wcscmpDB, which will invoke either wcscmp for
databases in UTF-16 order or wcscmpCP for UTF-8/UTF-32. I will even throw
in wcsncmpDB and wcsncmpCP. This will do until code point ordering is
available on all databases.
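
To show how little is involved, here is a rough Python sketch of the idea.
It is not the C code offered above: the names compare_cp_order,
compare_db_order and the db_is_utf16_order flag are made up for this
example, and it assumes well-formed UTF-16 input. It uses the common
fix-up trick of moving the surrogate range above U+E000..U+FFFF before
comparing code units.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def _fixup(unit):
    # Remap so that surrogates sort after every other BMP code unit:
    # E000..FFFF moves down to D800..F7FF, D800..DFFF moves up to F800..FFFF.
    if unit >= 0xE000:
        return unit - 0x800
    if unit >= 0xD800:
        return unit + 0x2000
    return unit


def _compare(a, b, key):
    # The first differing code unit decides; otherwise the shorter is less.
    for x, y in zip(a, b):
        if x != y:
            return -1 if key(x) < key(y) else 1
    return (len(a) > len(b)) - (len(a) < len(b))


def compare_utf16_order(a, b):
    """Plain code unit comparison (UTF-16 binary order)."""
    return _compare(a, b, lambda u: u)


def compare_cp_order(a, b):
    """Compare UTF-16 code unit lists in code point (UTF-8/UTF-32) order."""
    return _compare(a, b, _fixup)


def compare_db_order(a, b, db_is_utf16_order):
    """Dispatch on how the database sorts (hypothetical flag)."""
    if db_is_utf16_order:
        return compare_utf16_order(a, b)
    return compare_cp_order(a, b)


bmp = utf16_units("\uFF5E")
supp = utf16_units("\U00010000")
in_cp_order = compare_db_order(bmp, supp, db_is_utf16_order=False)
in_utf16_order = compare_db_order(bmp, supp, db_is_utf16_order=True)
```

The two calls disagree only when a surrogate is compared against a code
unit in U+E000..U+FFFF; for all other data the two orders are the same.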

Carl


