From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 13 2004 - 21:41:32 CST
From: "Doug Ewell" <dewell@adelphia.net>
> What is a shame is that Unicode published a definition of the defective
> CESU-8 at all.
On that point at least we agree. I wonder why CESU-8 was created, and
whether any applications actually need it.
On the other hand, Java's modified UTF-8 (which is in fact closer to CESU-8
than to UTF-8) has proven useful and is widely used, simply because it is
compatible with the standard C library functions for null-terminated
strings. It is historic and has coexisted well with Unicode, given the
earlier tolerance in legacy UTF-8 decoders. Even today it still conforms to
the Unicode rules, because Java does not claim that this is UTF-8 and does
not label the encoded data as UTF-8. It is used internally in the Java JNI
interfaces and in the Java class file format, which is not plain text; both
are part of the JVM specifications and are not intended for data interchange
between distinct hosts or applications.
But the tolerance for non-shortest forms did exist, so that the sequence
C0,80 would be safely interpreted as NUL (U+0000).
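
To make the C0,80 point concrete, here is a minimal sketch in Python. The
helper names (lenient_decode_two_bytes, strict_rejects) are made up for this
illustration and do not come from any library. The first mimics a legacy
decoder that performs no shortest-form check; the second shows that a strict
UTF-8 decoder (Python's built-in codec) refuses the same bytes.

```python
def lenient_decode_two_bytes(lead, trail):
    """Decode a two-byte sequence with no shortest-form check.

    Hypothetical helper: this is how a tolerant legacy decoder might
    behave, not how any conforming UTF-8 decoder behaves today.
    """
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte lead/trail pair")
    # 5 payload bits from the lead byte, 6 from the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def strict_rejects(data):
    """Return True if Python's strict UTF-8 codec refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# The overlong pair C0,80 carries eleven zero payload bits, i.e. U+0000.
nul_code_point = lenient_decode_two_bytes(0xC0, 0x80)
# A strict decoder treats C0 as an invalid lead byte.
rejected_by_strict_decoder = strict_rejects(b"\xc0\x80")
```

The point of the escape is that the encoded string contains no 00 byte, so
C functions such as strlen() do not stop early at an embedded NUL.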
Another way to think about Java's modified UTF-8 is as a transport encoding
syntax for CESU-8. It differs from CESU-8 mainly by escaping null bytes as
the two bytes C0,80 (the lead byte C0 is never used in CESU-8), and by
allowing isolated or unpaired surrogates, or otherwise invalid UTF-16 code
units, in the encoded string. So why would Sun change anything there?
Replacing something that works with a new API that creates
incompatibilities does not look like a good idea.
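
The relationship between the two encodings can be sketched as follows. This
is an illustrative Python sketch, not Sun's implementation; the function
names are hypothetical. Both encoders work on UTF-16 code units, so a
supplementary character becomes two three-byte sequences, and the only
difference in the encoding step is how the code unit 0000 is written.

```python
def utf16_code_units(text):
    """Split a Python string into 16-bit UTF-16 code units.

    'surrogatepass' lets unpaired surrogates through unchanged.
    """
    raw = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def encode_code_unit(unit, escape_nul):
    """Encode one 16-bit code unit using the UTF-8 bit layout."""
    if unit == 0 and escape_nul:
        return bytes([0xC0, 0x80])          # the two-byte NUL escape
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    # Everything else, surrogate code units included, takes three bytes.
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])


def cesu8_encode(text):
    """CESU-8 style: NUL stays a single 00 byte.

    CESU-8 is only defined for well-formed UTF-16; this sketch does not
    check for unpaired surrogates.
    """
    return b"".join(encode_code_unit(u, False) for u in utf16_code_units(text))


def modified_utf8_encode(text):
    """Java-style modified UTF-8: NUL becomes C0,80; any code unit is accepted."""
    return b"".join(encode_code_unit(u, True) for u in utf16_code_units(text))
```

For example, cesu8_encode("A\x00B") keeps the 00 byte, while
modified_utf8_encode("A\x00B") yields 41 C0 80 42. Both encode U+10400 as the
six bytes ED A0 81 ED B0 80, where standard UTF-8 would use four bytes.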