Re: Strange UTF-8 in Java

From: Mark Davis (marked@best.com)
Date: Mon Sep 28 1998 - 23:06:16 EDT


I believe that John has it exactly right. The variant form of UTF-8 (call it
UTF-8n for now) is not designed for general transmission, but is a special
format for serialized Java Strings.
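
To make the difference concrete, here is a minimal sketch that serializes the same String both ways and prints the bytes. The class and helper names (Utf8nDemo, writeUtf, standardUtf8, hex) are made up for this illustration, and StandardCharsets assumes a reasonably modern JDK; only DataOutputStream.writeUTF and String.getBytes are the actual library calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class Utf8nDemo {
    // Serialize a String with DataOutputStream.writeUTF (the variant form,
    // preceded by a 2-byte big-endian length).
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Encode the same String with the standard UTF-8 charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: render bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = "A\u0000B";
        System.out.println(hex(writeUtf(s)));     // 00 04 41 C0 80 42
        System.out.println(hex(standardUtf8(s))); // 41 00 42
    }
}
```

In the first line of output, 00 04 is the length prefix and C0 80 is the two-byte form of U+0000; the standard encoding uses the single byte 00.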

Notice also, as on page A-8 of the Unicode Standard V2.0, that receiving
implementations do not have to check that the shortest form is being
used when converting. For those implementations, UTF-8n--although out of
spec--will be converted correctly.
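
As a rough sketch of such a receiving implementation, the decoder below converts 1-, 2-, and 3-byte sequences to 16-bit code units without any shortest-form check. It is an illustration written for this note, not code from any JDK; the class name LenientUtf8 is hypothetical, and it assumes otherwise well-formed, untruncated input.

```java
class LenientUtf8 {
    // Decode 1-, 2-, and 3-byte sequences into 16-bit code units WITHOUT
    // checking for shortest form. Continuation bytes are not validated and
    // truncated input will fail with an index error; this is only a sketch.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            if (b0 < 0x80) {                    // 0xxxxxxx
                out.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {   // 110xxxxx 10xxxxxx
                int b1 = in[i + 1] & 0x3F;
                out.append((char) (((b0 & 0x1F) << 6) | b1));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {   // 1110xxxx 10xxxxxx 10xxxxxx
                int b1 = in[i + 1] & 0x3F;
                int b2 = in[i + 2] & 0x3F;
                out.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
                i += 3;
            } else {
                throw new IllegalArgumentException("unsupported lead byte at " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] standard = {0x41, 0x00, 0x42};
        byte[] variant  = {0x41, (byte) 0xC0, (byte) 0x80, 0x42};
        // Both byte sequences decode to the same three code units.
        System.out.println(decode(standard).equals(decode(variant))); // true
    }
}
```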

However, any implementation that did not simply convert UTF-8 into 16-bit
Unicode, and was handed UTF-8n text purporting to be UTF-8 text, could end up
with non-uniqueness problems, since U+0000 would then have two different byte
representations.
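
A small sketch of what that can look like for code that works on the bytes directly: the two arrays below stand for the same three characters, yet a byte-wise comparison says they differ, and a scan for 0x00 finds the null in only one of them. The class name NonUnique and the helper containsNulByte are invented for this example.

```java
import java.util.Arrays;

class NonUnique {
    // Naive scan that only recognizes U+0000 as a single 0x00 byte.
    static boolean containsNulByte(byte[] data) {
        for (byte b : data) {
            if (b == 0) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Both arrays represent the text "A", U+0000, "B".
        byte[] standard = {0x41, 0x00, 0x42};                      // standard UTF-8
        byte[] variant  = {0x41, (byte) 0xC0, (byte) 0x80, 0x42};  // UTF-8n

        System.out.println(Arrays.equals(standard, variant)); // false
        System.out.println(containsNulByte(standard));        // true
        System.out.println(containsNulByte(variant));         // false
    }
}
```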

Mark

John Cowan wrote:

> Elliotte Rusty Harold scripsit:
>
> > As you may or may not know, Java's UTF-8 encodes the null character, ASCII
> > 0, in two bytes rather than one as it should according to the UTF-8
> > specification.
>
> Not so fast. Java uses this encoding internally to represent Strings,
> and provides the readUTF() and writeUTF() methods to export it to
> binary files. But those methods are not meant for general purposes:
> they are meant to provide save/restore for String objects, as is
> indicated by the use of a 2-byte length (big-endian) before each
> modified UTF-8 content.
>
> The proper Java machinery for handling character encodings uses
> standard UTF-8 rules (that is, InputStreamReader for input and
> OutputStreamWriter for output: these classes convert between
> byte streams and character streams).
>
> > 1. Will using Java's UTF-8 format produce problems for any software
> > anyone's aware of?
>
> Definitely. Software assuming that U+0000 can only be encoded as 0x00
> may miss "stealth" nulls encoded against the UTF-8 rules.
>
> > 2. In general, is it always acceptable to encode a one-byte character in
> > two or three bytes? or a two-byte character in three bytes?
>
> No.
>
> > 3. Does anyone know why Java does not want to encode the 0 character as a
> > single byte? In other words, is there any reason why a stream should not
> > contain embedded nulls?
>
> The main point is not the use in readUTF()/writeUTF(), but in the
> internal representation. For compatibility with C routines, Java
> Strings are stored in a guaranteed null-free representation so that
> trailing 0x00 bytes can be used as C end-of-string indicators.
> Since the machinery for processing mutated UTF must exist in every
> JVM anyway, it was natural to use it for reading and writing Strings
> as well. Note that the length values allow 0x00's to appear in the
> stream anyway!
>
> --
> John Cowan cowan@ccil.org
> e'osai ko sarji la lojban.

--
business: medavis2@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--


