Re: Strange UTF-8 in Java

From: John Cowan (cowan@locke.ccil.org)
Date: Sun Sep 27 1998 - 13:25:43 EDT


Elliotte Rusty Harold scripsit:

> As you may or may not know, Java's UTF-8 encodes the null character, ASCII
> 0, in two bytes rather than one as it should according to the UTF-8
> specification.

Not so fast. Java uses this encoding internally to represent Strings,
and provides the readUTF() and writeUTF() methods to export it to
binary files. But those methods are not meant for general purposes:
they are meant to provide save/restore for String objects, as is
indicated by the use of a 2-byte length (big-endian) before each
string's modified UTF-8 content.
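
To make the on-disk format concrete, here is a minimal sketch that dumps
what writeUTF() emits for a string containing U+0000. The class and
helper names (ModifiedUtf8Demo, writeUtfBytes, hex) are made up for
illustration; only DataOutputStream.writeUTF() is the real API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Hypothetical helper: returns the bytes writeUTF() produces for s.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Hypothetical helper: formats bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 'A', U+0000, 'B': length prefix 00 04, then 41, C0 80, 42.
        System.out.println(hex(writeUtfBytes("A\u0000B")));
        // prints: 00 04 41 C0 80 42
    }
}
```

The first two bytes are the length of the encoded content in bytes, and
U+0000 shows up as the pair C0 80 rather than a single 00.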

The proper Java machinery for handling character encodings, namely
InputStreamReader for input and OutputStreamWriter for output, uses
standard UTF-8 rules: these classes convert between byte streams and
character streams.
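
For comparison, a small sketch of the stream-based route. The class and
method names (StandardUtf8Demo, encode, decode) are invented for this
example, and StandardCharsets is a convenience from later Java releases;
passing the encoding name "UTF-8" as a string to the same constructors
should behave the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class StandardUtf8Demo {
    // Hypothetical helper: encodes s as standard UTF-8 via OutputStreamWriter.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Hypothetical helper: decodes standard UTF-8 via InputStreamReader.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("A\u0000B");
        // Standard UTF-8: U+0000 is the single byte 0x00, no length prefix.
        System.out.println(bytes.length);                       // prints: 3
        System.out.println(decode(bytes).equals("A\u0000B"));   // prints: true
    }
}
```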

> 1. Will using Java's UTF-8 format produce problems for any software
> anyone's aware of?

Definitely. Software assuming that U+0000 can only be encoded as 0x00
may miss "stealth" nulls encoded against the UTF-8 rules.
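
As a rough illustration of the problem, the sketch below compares a
naive scan for 0x00 with a scan for the two-byte form C0 80. Both
helpers (containsPlainNull, containsStealthNull) are hypothetical names,
and the second assumes the input is otherwise well-formed, so that 0xC0
only ever appears as a lead byte.

```java
class StealthNullScan {
    // Naive check: looks only for a literal 0x00 byte.
    static boolean containsPlainNull(byte[] data) {
        for (byte b : data) {
            if (b == 0x00) {
                return true;
            }
        }
        return false;
    }

    // Looks for U+0000 encoded as the two-byte sequence C0 80.
    static boolean containsStealthNull(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xC0 && (data[i + 1] & 0xFF) == 0x80) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 'A', U+0000 in Java's two-byte form, 'B'
        byte[] data = {0x41, (byte) 0xC0, (byte) 0x80, 0x42};
        System.out.println(containsPlainNull(data));   // prints: false
        System.out.println(containsStealthNull(data)); // prints: true
    }
}
```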

> 2. In general, is it always acceptable to encode a one-byte character in
> two or three bytes? or a two-byte character in three bytes?

No.

> 3. Does anyone know why Java does not want to encode the 0 character as a
> single byte? In other words, is there any reason why a stream should not
> contain embedded nulls?

The main point is not the use in readUTF()/writeUTF(), but in the
internal representation. For compatibility with C routines, Java
Strings are stored in a guaranteed null-free representation so that
trailing 0x00 bytes can be used as C end-of-string indicators.
Since the machinery for processing modified UTF-8 must exist in every
JVM anyway, it was natural to use it for reading and writing Strings
as well. Note that the length values allow 0x00's to appear in the
stream anyway!

-- 
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.


