From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 13 2004 - 18:51:40 CST
From: "Theodore H. Smith" <delete@elfdata.com>
> http://java.sun.com/j2se/1.5.0/docs/api/java/io/DataInput.html#modified-utf-8
>
> If only people could sue for suggesting bad coding practices ;o)
It was not bad coding practice at the time when Sun designed these APIs,
because it was explicitly based on the ISO/IEC 10646 definition of UTF-8,
which was at that time the legacy version published in the RFC, where
non-shortest encodings were allowed. Sun used it simply as a convenience to
allow using standard C libraries that expect a NUL byte to terminate
strings, but still allowing String objects to contain NUL (U+0000)
characters. Also at that time, Unicode 1.0 was defined only as a 16-bit
subset of ISO/IEC 10646, and the definitions for supporting other planes
were missing.
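To make the NUL handling concrete, here is a minimal sketch that compares the bytes produced by DataOutputStream.writeUTF with those of standard UTF-8. The class name ModifiedUtfNul and the helpers modifiedUtf and hex are my own illustrative names, not part of any Sun API; the sketch also assumes a recent JDK (StandardCharsets and UncheckedIOException did not exist in 1.5). Note that writeUTF prepends a two-byte length, which the helper strips.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtfNul {
    // Hypothetical helper: encode with writeUTF, then drop the 2-byte length prefix.
    static byte[] modifiedUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        // Modified UTF-8: U+0000 becomes the two-byte sequence C0 80.
        System.out.println(hex(modifiedUtf(s)));                      // 41 C0 80 42
        // Standard UTF-8: U+0000 is a single 00 byte.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));  // 41 00 42
    }
}
```

Since the modified form never contains a 0x00 byte, the result can be handed to C functions that treat NUL as a terminator without truncating the string.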
What is a shame is that Unicode did not consider this widely used legacy
practice when it defined CESU-8 (the way supplementary characters are
encoded with the Java-modified-UTF encoding), so that it would also allow
encoding NUL (U+0000) as {0xC0,0x80}, something that is so useful to allow
interoperability with standard C libraries.
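For the supplementary-character side of this, a small sketch (again with illustrative names of my own: ModifiedUtfSupplementary, modifiedUtf, hex, and assuming a recent JDK) shows the CESU-8 style behaviour: each UTF-16 surrogate is encoded separately as a three-byte sequence, where standard UTF-8 uses a single four-byte sequence.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtfSupplementary {
    // Hypothetical helper: encode with writeUTF, then drop the 2-byte length prefix.
    static byte[] modifiedUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: format bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+10000 is stored in a Java String as the surrogate pair D800 DC00.
        String s = new String(Character.toChars(0x10000));
        // Modified UTF-8 / CESU-8: one 3-byte sequence per surrogate.
        System.out.println(hex(modifiedUtf(s)));                      // ED A0 80 ED B0 80
        // Standard UTF-8: one 4-byte sequence for the code point.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));  // F0 90 80 80
    }
}
```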
Now that CESU-8 is fixed and standardized, the Sun modified UTF encoding
should have its own encoding label registered with something less ambiguous
than the expression "modified UTF". Has Sun applied for registering its
encoding (actually an encoding scheme, because the encoding form is plain
UTF-16, even though the Sun scheme allows encoding isolated or unpaired
surrogates, or invalid code units 0xFFFE and 0xFFFF) with an IANA/MIME
charset identifier? It would then be easier for Sun to reference this
encoding with this label, if Sun published a public informative RFC, for the
IANA charset registration.
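The point about unpaired surrogates can be checked with a short round trip. This is only a sketch under my own assumptions (the class LoneSurrogateRoundTrip and its write/read helpers are hypothetical names, and a recent JDK is assumed): writeUTF and readUTF treat the string as a sequence of 16-bit code units, so an isolated surrogate survives unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class LoneSurrogateRoundTrip {
    // Hypothetical helper: serialize with writeUTF (2-byte length + modified UTF-8).
    static byte[] write(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: deserialize with readUTF.
    static String read(byte[] data) {
        try {
            return new DataInputStream(new ByteArrayInputStream(data)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A high surrogate with no following low surrogate.
        String lone = String.valueOf((char) 0xD800);
        byte[] data = write(lone);  // length prefix 00 03, then ED A0 80
        // The unpaired code unit comes back exactly as it went in.
        System.out.println(read(data).equals(lone));  // true
    }
}
```

A conforming UTF-8 encoder cannot represent such a string, which is one more reason the Sun scheme deserves a label of its own.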
Without this RFC, perhaps the informative page in the Java SDK documentation
could be used as the reference for the IANA registration. But Sun should
ensure that this page will remain accessible (that's why extracting this
page into an isolated plain text document for an informative RFC would be
helpful).
I won't support the idea of Sun suddenly removing one of its APIs (because
it would break lots of JNI extensions that need it, even though there are
APIs that can be used to pass String data directly in the native UTF-16
format supported by the Java native char datatype). Also redefining the API
with new names (without the UTF suffix) seems like overkill, and not needed
for Unicode conformance: the API is self-contained and there's no
restriction in Unicode about how an API function should be named.