From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 15 2004 - 03:04:13 CST
From: "Asmus Freytag" <asmusf@ix.netcom.com>
>>CESU-8 is the documentation of someone's internal, non-standard
>>implementation of UTF-8. Of course, the "someone" is large and
>>important and their implementation affects a lot of users. If nobody
>>else is motivated by the presence of UTR #26 to adopt this non-standard
>>version, good.
>
> There are some UTF-8/UTF-16 interoperability aspects that are addressed
> by CESU-8. These concerns are real, and affect multi-component
> architectures that must interchange data across component boundaries.
> Therefore a standard specification serves a useful purpose.
>
>>What worries me is that there might be other people in the world like
>>Philippe
>
> Phillippe doesn't worry me ;-)
I'd like to note that the Java modified UTF-8 format is not purely internal
to Java and that it is used for interchange of data in a multi-component
architecture, which is the JNI interface allowing external native libraries
to interchange data with any Java-compliant VM.
So it's not only Sun's implementation, but also part of all other VM
implementations that support JNI (the Java Native Interface).
This support also appears within the class file format, which is standardized
and is also accessible through Java's reflection mechanisms, which allow Java
programs to control how classes are loaded or created at run-time, and
interchanged with other hosts. The "component boundaries" above apply to Java
as well.
Finally, Java's ability to store and exchange valid Unicode
strings with embedded nulls (U+0000) is a feature rather than a limitation,
notably when this interchange requires using fixed-sized structures
containing variable-length strings, where these nulls serve as padding bytes
(for example in fixed-width plain-text table formats, where the introduction
of binary length prefixes would make the text file unreadable).
The NULL character is mostly used in plain-text formats as an ignorable
padding, with less ambiguity than the spaces commonly used in so many SQL
engines or in XML formats. Some text editors are broken in that they will
not correctly load a text file with embedded nulls: these editors truncate
the data they read instead of handling nulls as if they were ignorable
whitespace, because they handle the text as C strings, where a null byte
means end of string.
There are also many places in data structures used for interchange where
plain-text strings are encoded in data fields without any extra length
being specified, because the field is extremely small. Nulls are used as
required padding and must not be truncated, because otherwise these
structures would be desynchronized. Nulls are also used as filler bytes
within some communication protocols based on plain-text data.
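
To illustrate the fixed-size field case, here is a minimal Python sketch. The
record layout (an 8-byte field followed by a 4-byte field) and the names
RECORD, pack_record and unpack_record are assumptions made up for this
example, not any real format. It shows that the padding nulls can only be
dropped after the field boundaries have been applied, and that C-style
truncation at the first null loses the second field.

```python
import struct

# Hypothetical record layout: an 8-byte field followed by a 4-byte field.
RECORD = struct.Struct("8s4s")

def pack_record(name, code):
    # For "s" fields, struct pads short values with null bytes.
    return RECORD.pack(name.encode("ascii"), code.encode("ascii"))

def unpack_record(raw):
    name, code = RECORD.unpack(raw)
    # Strip the padding only once the field boundaries are known.
    return (name.rstrip(b"\x00").decode("ascii"),
            code.rstrip(b"\x00").decode("ascii"))

raw = pack_record("abc", "xy")

# What a reader treating the record as a C string would keep.
c_string_view = raw.split(b"\x00", 1)[0]
```

Here unpack_record(raw) recovers both fields, while c_string_view holds only
the bytes before the first null.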
Like it or not, nulls are part of almost all character sets, from the
oldest ones to the most recent ones (with one notable exception in GSM text
for SMS, where the null byte is a printable character, as GSM doesn't
need/want data fillers). The support of ignorable padding characters will
remain needed for a long time (or forever) in plain text, even if a
plain-text *file* does not need it (there are other uses of plain text than
just complete files). The many people who expect that a file containing any
null byte is not text but binary are restricting themselves to the use of
"text/plain" in MIME message formats.
A GSM message would embed null bytes without being considered as binary, and
would contain no data filler; it could not be interchanged with a MIME
"text/plain" datatype even with a "charset" qualifier, but it would still be
plain-text in the definition accepted at the Unicode or ISO/IEC 10646 level
(they don't care much about which encoding schemes or transport syntaxes are
used to interchange plain-text, but about the interpretation of the
*decoded* code points; lower encoding levels in the Unicode standard are
mandatory only if applications choose to implement these levels and label
their data with the corresponding charset identifiers that have been
reserved, and included in the Unicode standard).
So it's a fact that Unicode's UTF-8 format is fully compatible with Unicode
(i.e. it can encode any Unicode text, including text containing NULL
characters), but not with C and other applications that cannot rely on the
effective text length being specified out of band and instead require an
explicit and mandatory end-of-text marker. This is the place where transport syntaxes are
used in MIME, to escape reserved bytes which have special functions in the
embedding transport: hexadecimal, Base-64, quoted-printable, uuencode, COBS,
... or escaping control bytes by the 0xC0 leading byte (unused in UTF-8)
followed by the control byte with a 0x80 offset. The string definition in C
implies that nulls must be escaped if they are needed, or that string length
be encoded separately out of band (but in that case this is no longer a
standard null-terminated C string).
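
The 0xC0 escape described above can be sketched in a few lines of Python. This
is only an illustration for the NUL case: the function names escape_nulls and
unescape_nulls are made up for this example, and the input is assumed to be
well-formed standard UTF-8, in which the byte 0xC0 never occurs and the byte
0x00 only ever encodes U+0000.

```python
def escape_nulls(utf8):
    # Replace each null byte by 0xC0 followed by (0x00 + 0x80).
    return utf8.replace(b"\x00", b"\xc0\x80")

def unescape_nulls(escaped):
    # Reverse the escape; 0xC0 cannot appear in well-formed UTF-8,
    # so every C0 80 pair found here was produced by escape_nulls.
    return escaped.replace(b"\xc0\x80", b"\x00")

original = "a\x00b".encode("utf-8")
escaped = escape_nulls(original)
restored = unescape_nulls(escaped)
```

The escaped form contains no null byte, so it can be carried in a
null-terminated C string and unescaped on the other side.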
C does not mandate any escaping mechanism, and Java's "modified UTF-8" is
perfectly valid in this context as a transport syntax for CESU-8. I
don't like the term "modified UTF-8" used by Sun in its revised
documentation; it causes confusion, and it would be more exact
for Sun to call it a "modified CESU-8" (so that it matches
how Java now handles supplementary characters), and to document that
this format includes the bijective support of strings with non-character
code units '\uFFFE' and '\uFFFF', and more critically of malformed strings
with unpaired or isolated surrogates (which are normally not acceptable even
in standard CESU-8).
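
As a rough illustration of the encoding discussed here, the following Python
sketch encodes each 16-bit code unit independently: U+0000 becomes C0 80, and
every surrogate (paired or not) gets its own 3-byte sequence. The names
utf16_units and encode_units are made up for this example; this is a sketch of
the byte layout under those assumptions, not Sun's implementation.

```python
def utf16_units(text):
    # Split a Python string into 16-bit code units, keeping lone surrogates.
    raw = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def encode_units(units):
    # Encode each 16-bit code unit on its own, with no surrogate pairing.
    out = bytearray()
    for u in units:
        if not 0 <= u <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (u,))
        if 0x0001 <= u <= 0x007F:
            out.append(u)
        elif u <= 0x07FF:
            # Two bytes; this branch also maps U+0000 to C0 80.
            out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
        else:
            # Three bytes, including surrogates, U+FFFE and U+FFFF.
            out += bytes((0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)))
    return bytes(out)

supplementary = encode_units(utf16_units("\U00010400"))
standard = "\U00010400".encode("utf-8")
```

For U+10400, supplementary is the 6-byte sequence ED A0 81 ED B0 80, while
standard UTF-8 gives the 4-byte sequence F0 90 90 80.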
A better term without reference to UTF-8 or even CESU-8 would be useful
(even if information is given that will refer to other standard UTF-8 and
CESU-8 encoding schemes). As this encoding is needed for the *serialization*
(data interchange over a byte-oriented stream) of Java String objects (which
can contain malformed Unicode text with unpaired surrogates, and any valid
or invalid or reserved or unassigned 16-bit code units), why not refer to
this encoding as "Java-String-8" (or "JS-8" for short)?
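
For the reading side of such a serialization, here is a minimal Python sketch
of a decoder that returns 16-bit code units and leaves surrogates exactly as
found, paired or not. The name decode_units is made up for this example, and
the validation is deliberately light (it only checks lead and continuation
byte patterns); it is an assumption-laden sketch, not a conformant decoder.

```python
def decode_units(data):
    # Decode 1-, 2- and 3-byte sequences into 16-bit code units.
    units = []
    i = 0
    while i < len(data):
        b0 = data[i]
        if b0 < 0x80:
            size, value = 1, b0
        elif b0 & 0xE0 == 0xC0:
            size, value = 2, b0 & 0x1F
        elif b0 & 0xF0 == 0xE0:
            size, value = 3, b0 & 0x0F
        else:
            raise ValueError("unexpected lead byte at offset %d" % i)
        tail = data[i + 1:i + size]
        if len(tail) != size - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError("bad continuation at offset %d" % i)
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        units.append(value)
        i += size
    return units

pair = decode_units(bytes.fromhex("eda081edb080"))
lone = decode_units(bytes.fromhex("eda080"))
nul = decode_units(bytes.fromhex("c080"))
```

Here pair comes back as the two surrogate code units D801 and DC00, lone as
the single unpaired surrogate D800, and nul as the code unit 0000.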