From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 15 2004 - 07:06:46 CST
----- Original Message -----
From: "John Cowan" <jcowan@reutershealth.com>
To: "Doug Ewell" <dewell@adelphia.net>
Cc: "Unicode Mailing List" <unicode@unicode.org>; "Philippe Verdy"
<verdy_p@wanadoo.fr>; "Peter Kirk" <peterkirk@qaya.org>
Sent: Monday, November 15, 2004 7:05 AM
Subject: Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)
> Doug Ewell scripsit:
>
>> As soon as you can think of one, let me know. I can think of plenty of
>> *binary* protocols that require zero bytes, but no *text* protocols.
>
> Most languages other than C define a string as a sequence of characters
> rather than a sequence of non-null characters. The repertoire of
> characters
> than can exist in strings usually has a lower bound, but its full
> magnitude
> is implementation-specific. In Java, exceptionally, the repertoire is
> defined by the standard rather than the implementation, and it includes
> U+0000. In any case, I can think of no language other than C which does
> not support strings containing U+0000 in most implementations.
This is exactly the inclusion of U+0000 as a valid character in Java strings
that requires that this character be preserved in the JNI interface and in
String serializations.
Some here think this is broken behavior, but there is no other simple way to
represent this character when passing a Java String instance to and from a JNI
interface, or through serialization such as in class files.
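As a concrete illustration of that serialization (my sketch, not part of the original message): DataOutputStream.writeUTF uses the same "modified UTF-8" form that JNI uses, encoding U+0000 as the overlong two-byte sequence 0xC0 0x80, so the serialized payload never contains a zero byte that C code could mistake for a terminator.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    // Serialize a string the way DataOutputStream (and JNI's GetStringUTFChars
    // form) does: a 2-byte big-endian length followed by the encoded payload.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen with a byte-array sink
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encode("a\u0000b");
        // U+0000 is written as the overlong pair 0xC0 0x80, so no byte of the
        // payload is zero even though the String contains U+0000.
        for (byte b : bytes) System.out.printf("%02X ", b);
        System.out.println(); // prints: 00 04 61 C0 80 62
    }
}
```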
My opinion is that the Java behavior does not define a new encoding; it is
rather a transfer encoding syntax (TES) that lets String instances be
serialized effectively. (Strings are UCS-2 encoded using the 16-bit "char"
Java datatype, not merely the UTF-16 restriction of UCS-2, which requires
paired surrogates. Nor does Java make the '\uFFFF' and '\uFFFE' chars or code
units illegal: they are simply mapped to the U+FFFF and U+FFFE code points,
even though those code points are permanently assigned as noncharacters in
Unicode and ISO/IEC 10646.)
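A small sketch of that point (my illustration): a Java String accepts lone surrogates and the noncharacter code units U+FFFE/U+FFFF without complaint; well-formedness is only checked by encoders and other layers, never by String itself.

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // A lone high surrogate: not well-formed UTF-16 text, yet a legal String.
        String lone = "\uD800";
        System.out.println(lone.length());       // 1 code unit, stored as-is

        // Noncharacter code units are likewise stored without any check.
        String non = "\uFFFF\uFFFE";
        System.out.println(non.length());        // 2 code units
        System.out.println((int) non.charAt(0)); // 65535 (U+FFFF)
    }
}
```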
The internal working storage of Java Strings is not a character set (CCS or
CES), and these strings are not necessarily bound to Unicode (even though Java
provides many Unicode-based character properties and character-set conversion
libraries), since they can just as well store text in other charsets, using
encoding/decoding libraries other than those found in the java.io.* and
java.text.* packages. Once you admit that, Java String instances are just
arrays of code units, not arrays of code points; their interpretation as
encoded characters is left to other layers.
Should any successor to Unicode ever exist (or should a Chinese implementation
prefer to handle String instances internally with GB18030), with different
mappings from code units to code points and characters, the working model of
Java String instances and the "char" datatype would not be affected. This
would still conform to the Java specifications, provided the standard
java.text.*, java.io.*, and java.nio.* packages that perform the various
mappings between code units, code points, characters, and byte streams are not
modified: new alternative packages could be used without changing the String
object or the unsigned 16-bit integer "char" datatype.
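To illustrate that layering (my sketch, using the modern StandardCharsets helper for brevity): the mapping between bytes and char code units is performed by pluggable charset codecs, while String itself remains a plain sequence of 16-bit code units with no encoding knowledge of its own.

```java
import java.nio.charset.StandardCharsets;

public class CharsetLayerDemo {
    public static void main(String[] args) {
        // The byte<->char mapping is done by a Charset codec, not by String.
        byte[] utf8 = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // U+20AC in UTF-8
        String euro = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(euro.length());        // 1 code unit
        System.out.println((int) euro.charAt(0)); // 8364 (0x20AC)

        // A different codec reinterprets the same code-unit storage as
        // different bytes; the String object itself is untouched.
        byte[] back = euro.getBytes(StandardCharsets.UTF_8);
        System.out.println(back.length);          // 3 bytes again
    }
}
```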
In Java 1.5, Sun chose to support supplementary characters without changing
the char and String representations; instead, the "Character" class was
extended with static methods that represent code points as 32-bit "int" values
and map any Unicode code point in the 17 planes to and from "char" code units.
The String class was then extended to allow parsing "char"-encoded strings by
"int" code points (with automatic support and detection of surrogate pairs),
while the legacy interface was preserved.
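The 1.5 additions described here can be sketched as follows (codePointAt, codePointCount, and Character.toChars are the actual java.lang additions):

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF: a supplementary-plane character,
        // stored in a String as a surrogate pair of two "char" code units.
        String s = new String(Character.toChars(0x1D11E));

        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e

        // charAt still exposes the legacy code-unit view: a high surrogate.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```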
In ICU4J, the "UCharacter" class works with code points directly as "int". By
contrast, "Character" instances still store only a single 16-bit "char", and
code points are supported only through static methods: there is still no
"Character(int codepoint)" constructor, only "Character(char codeunit)",
because "Character" keeps its past serialization format for compatibility, and
because "Character" is bound to the 16-bit "char" datatype for object boxing
(automatic boxing only exists in Java 1.5; explicit boxing in previous and
current versions is still supported).
If Java needs further extension, it would be to include the ICU4J "UCharacter"
class, allowing 32-bit "int" code points to be stored, or a UCharacter to be
built from a "char"-coded surrogate pair of code units or from a "Character"
instance; and also to add a "UString" class internally using arrays of
"int"-coded code units, with converters between String and UString. Such an
extension would not need any change in the JVM, just new supported packages.
But even with all these extensions, the U+0000 Unicode character would remain
valid and supported, and there would still be a need to support it in JNI and
in internal JVM serializations of String instances. I really don't like the
idea, advanced by some here, of deprecating a widely used JNI interface just
because it needs this special serialization when interfacing with C code.
Also, the fact that C assigns a role to the all-bits-zero char as an
end-of-string terminator does not imply that C is unable to represent the NULL
character, given that all other "char" values have *no* required semantics or
values (for example, '\r' and '\n' are not bound to fixed Unicode characters
but to functions, and their interpretation remains compiler- and
platform-specific).
This archive was generated by hypermail 2.1.5 : Mon Nov 15 2004 - 07:13:30 CST