Re: Strange UTF-8 in Java

From: Mark Davis (marked@best.com)
Date: Wed Sep 30 1998 - 11:45:06 EDT


Rick,

You apparently didn't read John's message. I think you are slamming the Java
folks unnecessarily.

1. Java does use standard UTF-8 in its character code conversions. See
http://java.sun.com/products/jdk/1.1/docs/guide/intl/encoding.doc.html for a
full list of the currently supported encodings. (However, note that Java
licensees are free to omit any of these!)

2. For serializing strings internally, they use a byte format which is the same
as UTF-8, except that null is written as two bytes (<C0, 80>). The standard
algorithm for converting UTF-8 to Unicode will convert this correctly back to a
null, unless special checks are made for shortest forms.
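
To make the difference concrete, here is a minimal sketch. It assumes a
present-day JDK (StandardCharsets did not exist in 1.1) and uses
DataOutputStream.writeUTF as a stand-in for the internal serialization format;
the class and helper names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NullEncodingDemo {
    // Standard UTF-8 conversion: U+0000 becomes the single byte 00.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // writeUTF emits a 2-byte length prefix followed by the modified form,
    // in which U+0000 is written as the pair C0 80.
    static byte[] serializedForm(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
    }

    // Hypothetical helper: renders bytes as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u0000b";
        System.out.println(hex(standardUtf8(s)));   // 61 00 62
        System.out.println(hex(serializedForm(s))); // 61 C0 80 62
    }
}
```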

I suspect the reason they do this is so that the resulting byte strings are valid
C strings. All Unicode codepoints are valid in a Java Unicode String, including
\u0000. If you didn't do something like this, then the string "a\u0000b" would be
converted into a form that would be clipped, and not round-trip back from C.
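
A rough illustration of the clipping, written in Java for convenience: cStrlen
is a hypothetical stand-in for C's strlen, and the two byte arrays are the
standard and the two-byte-null encodings of "a\u0000b" written out by hand.

```java
public class CStringClipDemo {
    // Mimics C's strlen: counts bytes up to, but not including, the first 00.
    static int cStrlen(byte[] bytes) {
        int n = 0;
        while (n < bytes.length && bytes[n] != 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "a\u0000b" in standard UTF-8: the null is a real zero byte.
        byte[] standard = {0x61, 0x00, 0x62};
        // The same string with null written as <C0, 80>: no zero byte anywhere.
        byte[] modified = {0x61, (byte) 0xC0, (byte) 0x80, 0x62};

        System.out.println(cStrlen(standard)); // 1: C code would see only "a"
        System.out.println(cStrlen(modified)); // 4: the whole string survives
    }
}
```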

This is hardly an "abomination". As long as you make it clear (which the
documentation does, after we pointed it out) that this is not standard UTF-8,
then it is a reasonable internal implementation decision.

What would *not* be proper is to ship it to some unsuspecting recipient and claim
that it was UTF-8.

What I was trying to point out was that if your implementation of UTF-8
conversion does not make special checks for the shortest form, it would still
succeed in reading this data.
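
As a sketch of what that means in practice, the decoder below is an
illustrative one of my own, not any shipping implementation. It handles only
1-, 2- and 3-byte sequences (enough for 16-bit Unicode), and the rejectOverlong
flag is a made-up name for the optional shortest-form check.

```java
public class LenientUtf8Demo {
    // Decodes 1-, 2- and 3-byte sequences into 16-bit code units.
    // If rejectOverlong is true, a value that could have been encoded in
    // fewer bytes (such as <C0, 80> for U+0000) raises an exception.
    static String decode(byte[] in, boolean rejectOverlong) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            int len, cp, min;
            if (b0 < 0x80) {
                len = 1; cp = b0; min = 0;
            } else if ((b0 & 0xE0) == 0xC0) {
                len = 2; cp = b0 & 0x1F; min = 0x80;
            } else if ((b0 & 0xF0) == 0xE0) {
                len = 3; cp = b0 & 0x0F; min = 0x800;
            } else {
                throw new IllegalArgumentException("unsupported lead byte at " + i);
            }
            if (i + len > in.length) {
                throw new IllegalArgumentException("truncated sequence at " + i);
            }
            for (int k = 1; k < len; k++) {
                int b = in[i + k] & 0xFF;
                if ((b & 0xC0) != 0x80) {
                    throw new IllegalArgumentException("bad trail byte at " + (i + k));
                }
                cp = (cp << 6) | (b & 0x3F); // append 6 payload bits
            }
            if (rejectOverlong && cp < min) {
                throw new IllegalArgumentException("non-shortest form at " + i);
            }
            out.append((char) cp);
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x61, (byte) 0xC0, (byte) 0x80, 0x62};
        // Without the shortest-form check, <C0, 80> decodes back to U+0000.
        System.out.println(decode(data, false).equals("a\u0000b"));
        try {
            decode(data, true);
        } catch (IllegalArgumentException e) {
            System.out.println("strict decoder: " + e.getMessage());
        }
    }
}
```

With the check off, the decoder reads the data without complaint; with it on,
the same bytes are refused.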

>Is there such a thing as documentation for Java?

The documentation for Java is substantial, and widely available (more available
than your own company's API documentation, I believe), since it is on the web.
The class library documentation is at

http://java.sun.com/products/jdk/1.1/docs/index.html (for the 1.1 documentation)

You may be interested in the internationalization page, at
http://java.sun.com/products/jdk/1.1/docs/guide/intl/index.html.

The Java language spec is published by Addison-Wesley and is available in
bookstores. (If you have been in a good technical bookstore lately, you will find
a large number of books available on Java.) There is also an online version of
the Java language spec at http://java.sun.com/docs/books/jls/index.html

>If so, does anyone read it?
I suspect so.

Mark

Rick McGowan wrote:

> > However, any implementation that did not just convert UTF-8 into 16-bit
> > Unicode, and was handed UTF-8n text purporting to be UTF-8 text could end up
> > with non-uniqueness problems.
>
> I thought that at one time when we first heard about this abomination that
> the Java people were perpetrating, we complained and told them how bad it
> was... Apparently they persist. Too bad they decided to unleash this
> non-conforming blankety-blank on everyone. It would be nice if they
> documented it as non-conforming with a big skull and cross-bones. Or maybe
> they do. Is there such a thing as documentation for Java? If so, does
> anyone read it?
>
> Rick

--
business: medavis2@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--