From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Nov 13 2004 - 20:27:03 CST
From: "A. Vine" <andrea.vine@sun.com>
>> I'm just curious about the \0 thing. What problems would having a \0 in
>> UTF-8 present, that are not presented by having \0 in ASCII? I can't see
>> any advantage there.
>
> Beats me, I wasn't there. None of the Java folks I know were there
> either.
The problem lies in how strings passed to JNI through the legacy *UTF() APIs
are accessed: they carry no indication of the string length, so if \0 were
allowed in the content of the string data, it would be impossible to know
whether a given \0 terminates the string.
The C0 80 encoding (the byte pair 0xC0 0x80) is a way to escape this
character, so that it can be passed to JNI using the legacy *UTF() APIs that
have existed since Java 1.0.
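
To make the escape concrete, here is a minimal Java sketch. It is not taken
from the JNI sources: it assumes that DataOutputStream.writeUTF, which is
documented to write modified UTF-8 after a 2-byte length prefix, is
representative of the bytes the *UTF() APIs expose. The names NulEscapeDemo,
modifiedUtf, cStringLength and hex are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NulEscapeDemo {
    // Encode with DataOutputStream.writeUTF (modified UTF-8) and strip the
    // 2-byte length prefix so that only the encoded payload remains.
    static byte[] modifiedUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Length as seen by a C-style scan: index of the first 0x00 byte,
    // or the full length if there is none.
    static int cStringLength(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) return i;
        }
        return bytes.length;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\0b";
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        byte[] modified = modifiedUtf(s);
        System.out.println(hex(standard) + " -> " + cStringLength(standard));
        System.out.println(hex(modified) + " -> " + cStringLength(modified));
    }
}
```

With standard UTF-8 the terminator scan stops after the first byte, while the
modified form contains no 0x00 byte at all, so the scan sees the whole string.
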
This encoding is also part of the Java class file format, where string
constants are encoded the same way. Note that the Java String object allows
storing ANY 16-bit code unit, including the noncharacters 0xFFFE and 0xFFFF,
as well as isolated or unpaired surrogates. So Java internally does not use
UTF-16 strictly. A plain UTF-8 representation would have prevented the class
format from supporting such string instances: unpaired surrogates are
ill-formed in Unicode, but legal in Java. CESU-8 would not work either, since
it still encodes U+0000 as a single 0x00 byte and does not formally allow
unpaired surrogates.
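
As a rough illustration of the difference, the sketch below round-trips a
String holding an unpaired surrogate through both encodings. It again uses
DataOutputStream/DataInputStream as a stand-in for the class file encoding
(an assumption; the class loader has its own decoder), and the class and
method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {
    // Round trip through modified UTF-8 (writeUTF / readUTF).
    static String roundTripModifiedUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                return in.readUTF();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Round trip through the standard UTF-8 charset, which substitutes a
    // replacement byte for anything it cannot encode.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "x\uD800y\uFFFF"; // unpaired high surrogate + noncharacter
        System.out.println(roundTripModifiedUtf(s).equals(s)); // true
        System.out.println(roundTripUtf8(s).equals(s));        // false
    }
}
```

The standard charset turns the lone surrogate into '?', so the original
String is not recoverable, whereas the modified form carries every 16-bit
code unit through unchanged.
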
There are legacy Java applications that use the String object to store
unrestricted arrays of unsigned 16-bit integers (the Java native type
"char"), with no assumption that the data represents valid characters. The
advantage of this representation is fast loading of classes that contain
large constant pools: such classes do not run lengthy class initialization
code, like the code executed when initializing an array of an integer type.
Instead they use the String constant pool directly, which native CPU code in
the JVM decodes and loads into chars, rather than interpreted bytecode that
will never be compiled. This may seem a bad programming practice, but the
Java language specification allows it, and Sun will not remove the
possibility without breaking compatibility with those programs.
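
A small sketch of that practice, with a hypothetical five-entry table (real
programs of this kind use much larger ones). The String literal is stored as
one constant pool entry, whereas javac compiles the array initializer into
element-by-element stores in the class initializer.

```java
final class StringTableDemo {
    // Each char is treated as an unsigned 16-bit value, not as text. The
    // table deliberately includes NUL, an unpaired surrogate and two
    // noncharacters, all of which a Java String may hold.
    static final String TABLE = "\u0000\u0001\uD800\uFFFE\uFFFF";

    // The equivalent array form, initialized by bytecode at class init time.
    static final int[] ARRAY = {0x0000, 0x0001, 0xD800, 0xFFFE, 0xFFFF};

    // Read entry i of the String-backed table as an int in 0..65535.
    static int valueAt(int i) {
        return TABLE.charAt(i);
    }

    public static void main(String[] args) {
        for (int i = 0; i < TABLE.length(); i++) {
            System.out.println(valueAt(i) == ARRAY[i]);
        }
    }
}
```
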
This "modified UTF" should then be regarded as a specific encoding scheme
that supports the unrestricted encoding form used by Java String instances
(extended UTF-16, or more exactly UCS-2), which, by initial design, can
represent and store *more* than just valid Unicode strings.
The newer JNI interface allows reading and returning String instance data
directly in the UCS-2 encoding form, without the specific "modified UTF"
encoding scheme: an API parameter carries the actual string
length, so the interface is binary safe. Applications can then use it to
pass any valid Unicode string, or even invalid ones (with invalid code units
or unpaired surrogates) if they wish. There's no requirement that this data
represent only true characters. Note that even Windows uses an unrestricted
UCS-2 representation in its "Unicode-enabled" Win32 APIs.
The newer UCS-2 interface is enough for JNI extensions to generate true
UTF-8 if they wish. I don't see the point of adding separate support for
true UTF-8 to JNI, given that it is trivial to implement on top of either
the null-terminated *UTF() JNI APIs or the UCS-2-based JNI APIs... In
addition, such support is not really needed for performance: the UCS-2
interface is the fastest one for JNI, since native OS APIs that accept UCS-2
can consume its output directly, so the JNI extension needs neither internal
work buffers nor extra code converters.
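
To give an idea of how little code the conversion takes, here is a sketch of
producing true UTF-8 from an array of 16-bit code units plus an explicit
length. It is written in Java rather than in native code so that it can be
run as is; a JNI extension would apply the same logic to the buffer and
length it gets from the UCS-2 interface. Replacing unpaired surrogates with
U+FFFD is my own assumption (an extension could equally reject them), and the
names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

final class TrueUtf8Demo {
    // Convert the first `length` UTF-16 code units to standard UTF-8.
    // Assumption: an unpaired surrogate is replaced by U+FFFD.
    static byte[] toUtf8(char[] units, int length) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(length * 3);
        for (int i = 0; i < length; i++) {
            int cp = units[i];
            if (Character.isHighSurrogate(units[i]) && i + 1 < length
                    && Character.isLowSurrogate(units[i + 1])) {
                // Valid pair: combine into one supplementary code point.
                cp = Character.toCodePoint(units[i], units[i + 1]);
                i++;
            } else if (Character.isSurrogate(units[i])) {
                cp = 0xFFFD;
            }
            if (cp < 0x80) {
                out.write(cp); // includes U+0000 as a plain 0x00 byte
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        char[] units = "a\0\u00E9\u20AC\uD83D\uDE00".toCharArray();
        for (byte b : toUtf8(units, units.length)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Because the length is passed explicitly, the embedded U+0000 needs no escape
here and comes out as an ordinary 0x00 byte.
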