From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Nov 15 2004 - 12:18:49 CST
From: "Doug Ewell" <dewell@adelphia.net>
> How does Java indicate the end of a string? It can't use the value
> U+0000, as C does, because the "modified UTF-8" sequence C0 80 still
> gets translated as U+0000. And if the answer is that Java uses a length
> count, and therefore doesn't care about zero bytes, then why is there a
> need to encode U+0000 specially?
You seem to assume that Java (the language) uses this sequence. In fact the
sequence is not for use by Java itself, but in its interfaces with other
languages, including C.
In 100% pure Java programming, you never see that sequence: you just work
with 16-bit UTF-16 code units when parsing "String" instances or comparing
"char" values.
And if you perform I/O using the supported "UTF-8" Charset instance, Java
properly encodes U+0000 as a single null byte. So why do people think that
UTF-8 support in Java is broken? It is not.
The "modified UTF-8" encoding is only for use in the serialization of
compiled classes that contain a constant string pool, and through the JNI
interface to C-written modules using the legacy *UTF() APIs that want to
work with C strings.
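To illustrate the scheme (a minimal sketch I wrote for this mail, not the
actual VM code): a modified UTF-8 encoder emits the overlong two-byte
sequence C0 80 for U+0000, where a standard UTF-8 encoder, such as Java's
"UTF-8" Charset, would emit the single byte 00; the output can therefore be
stored as an ordinary null-terminated C string.

    #include <stddef.h>

    /* Sketch of modified UTF-8 encoding for one 16-bit code unit.
       Returns the number of bytes written to out (at most 3).
       Note: modified UTF-8 also keeps each surrogate of a pair as
       its own 3-byte sequence instead of one 4-byte sequence. */
    static size_t encode_modified_utf8(unsigned short c, unsigned char *out)
    {
        if (c == 0x0000) {              /* U+0000: overlong 2-byte form */
            out[0] = 0xC0; out[1] = 0x80;
            return 2;
        } else if (c < 0x80) {          /* U+0001..U+007F: 1 byte */
            out[0] = (unsigned char)c;
            return 1;
        } else if (c < 0x800) {         /* U+0080..U+07FF: 2 bytes */
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        } else {                        /* U+0800..U+FFFF: 3 bytes */
            out[0] = (unsigned char)(0xE0 | (c >> 12));
            out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (c & 0x3F));
            return 3;
        }
    }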
There's no requirement to use that legacy *UTF() interface in C, because you
can also use the UTF-16 interface, which does not require the Java VM to
allocate a byte buffer to perform the conversion from the internal String
storage. The UTF-16 JNI interface is more efficient, just a bit more complex
to handle in C when you only use the standard char-based C library. If you
use the wchar_t-based C library, you don't need this legacy interface, but
support for wchar_t in standard libraries is not guaranteed on all
platforms, even if there's a wchar_t datatype in the C libraries and
headers: ANSI C allows "wchar_t" to be defined equal to "char", i.e. only 1
byte. If this is the case, the C program will need to use something other
than "wchar_t", for example "unsigned short", if it is at least 16 bits.
On Windows, Unix, Linux and Mac OS or OS X, modern C compilers support
wchar_t with at least 16 bits, so this is not a problem. It may still be a
problem if wchar_t is 32 bits, because the fastest UTF-16 based JNI
interface requires 16-bit code units: in that case you won't be able to use
wcslen() and so on...
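A trivial wcslen() substitute for 16-bit units is easy to write in that
case (a sketch; the name jcslen is mine, and it is only meaningful on a
buffer you have null-terminated yourself, since JNI does not guarantee that
the jchar array of a String is null-terminated):

    #include <jni.h>
    #include <stddef.h>

    /* wcslen() substitute for 16-bit jchar buffers. Stops at the
       first null code unit, so it has the same embedded-U+0000
       limitation as the wide-string C library. */
    static size_t jcslen(const jchar *s)
    {
        const jchar *p = s;
        while (*p != 0)
            p++;
        return (size_t)(p - s);
    }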
With the fastest JNI 16-bit interface, note that "wide-string" C libraries
assume that U+0000, coded as a single null wchar_t code unit, is an
end-of-string terminator; so if this is an issue, and your external
C-written JNI component must work with arbitrary Java String instances,
you'll instead need to use memcpy() and similar functions with a separate
length indicator to access all the characters of a Java String instance, as
in the sketch below. Such complication is not always necessary and
sometimes introduces avoidable errors.
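A sketch of that explicit-length approach (the helper name is mine): copy
the code units with GetStringRegion() and carry the length alongside the
buffer, so that embedded U+0000 code units survive:

    #include <jni.h>
    #include <stdlib.h>

    /* Copies all code units of a Java String, including any
       embedded U+0000, into a malloc'ed buffer. The length must
       travel with the buffer, since strlen()/wcslen() would stop
       at the first null code unit. Caller frees the result. */
    static jchar *copy_string_units(JNIEnv *env, jstring s, jsize *out_len)
    {
        jsize len = (*env)->GetStringLength(env, s);
        jchar *buf = (jchar *)malloc((size_t)len * sizeof(jchar));
        if (buf == NULL)
            return NULL;
        (*env)->GetStringRegion(env, s, 0, len, buf);
        *out_len = len;
        return buf;
    }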
Using the legacy *UTF() JNI interface solves this security risk and
interpretation issue, since an embedded U+0000 arrives as the sequence C0 80
and never as a null byte (see the sketch below)...
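A sketch of why (the Java_Example_utfByteLength name is invented):
GetStringUTFChars() returns a null-terminated modified UTF-8 string, so
ordinary char-based C functions such as strlen() work even when the Java
String contains U+0000:

    #include <jni.h>
    #include <string.h>

    /* Hypothetical native method: returns the byte length of the
       modified UTF-8 form of a Java String. strlen() is safe here
       because the buffer contains no embedded null byte. */
    JNIEXPORT jint JNICALL
    Java_Example_utfByteLength(JNIEnv *env, jclass cls, jstring s)
    {
        const char *utf = (*env)->GetStringUTFChars(env, s, NULL);
        if (utf == NULL)
            return -1;                  /* OutOfMemoryError is pending */
        jint n = (jint)strlen(utf);
        (*env)->ReleaseStringUTFChars(env, s, utf);
        return n;
    }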