Re: Java and Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Nov 15 2000 - 18:48:25 EST


John O'Conner wrote:

> Yes. If you have been involved with Unicode for any period of time at all, you
> would know that the Unicode consortium has advertised Unicode's 16-bit
> encoding for a long, long time, even in its latest Unicode 3.0 spec. The
> Unicode 3.0 spec clearly favors the 16-bit encoding of Unicode code units, and
> the design chapter (chapter 2) never even hints at a 32-bit encoding form.

Indeed. Though, to be fair, people have been talking about UCS-4 and
then UTF-32 for quite a while now, and the UTF-32 Technical Report has been
approved for half a year.

FYI, on November 9, the Unicode Technical Committee officially voted
to make Unicode Technical Report #19 "UTF-32" a Unicode Standard Annex (UAX).
This will be effective with the rollout of the Unicode Standard, Version
3.1, and will make the 32-bit transformation format a coequal partner
with UTF-16 and UTF-8 as sanctioned Unicode encoding forms.

>
> The previous 2.0 spec (and previous specs as well) promoted this 16-bit
> encoding too...and even claimed that Unicode was a 16-bit, "fixed-width",
> coded character set. There are lots of reasons why Java's char is a 16-bit
> value...the fact that the Unicode Consortium itself has promoted and defined
> Unicode as a 16-bit coded character set for so long is probably the biggest.

It is easy to look back from the year 2000 and wonder why.

But it is also important to remember the context of 1989-1991. During
that time frame, the loudest complaints were from those who were
proclaiming that Unicode's move from 8-bit to 16-bit characters would
break all software, choke the databases, inflate all documents by
a factor of two, and generally end the world as we knew it.

As it turns out, they were wrong on all counts. But the rhetorical
structure of the Unicode Standard was initially set up to be a hard
sell for 16-bit characters *as opposed to* 8-bit characters.

The implementation world has moved on. Now we have an encoding model
for Unicode that embraces an 8-bit, a 16-bit, *and* a 32-bit encoding
form, while acknowledging that the character encoding per se is
effectively 21 bits. This is more complicated than we hoped for
originally, of course, but I think most of us agree that the incremental
complexity in encoding forms is a price we are willing to pay in order
to have a single character encoding standard that can interoperate
in 8-, 16-, and 32-bit environments.
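
To make the three encoding forms concrete in Java terms, here is a minimal
sketch. It uses Character.toChars and StandardCharsets.UTF_8, library API
that arrived in Java releases well after this message was written, and the
class name and the choice of U+1D11E (a code point outside the BMP) are
purely illustrative, not anything from the standard itself:

    import java.nio.charset.StandardCharsets;

    public class EncodingForms {
        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF: a supplementary code point,
            // i.e. one that needs the full 21-bit code space.
            int codePoint = 0x1D11E;

            // UTF-16: two 16-bit code units (a surrogate pair).
            char[] utf16 = Character.toChars(codePoint);
            System.out.printf("UTF-16 code units: %04X %04X%n",
                    (int) utf16[0], (int) utf16[1]);

            // UTF-8: four 8-bit code units for this code point.
            byte[] utf8 = new String(utf16).getBytes(StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            for (byte b : utf8) {
                sb.append(String.format("%02X ", b));
            }
            System.out.println("UTF-8 code units:  " + sb.toString().trim());

            // UTF-32: a single 32-bit code unit, which is just the
            // 21-bit scalar value itself, zero-extended.
            System.out.printf("UTF-32 code unit:  %08X%n", codePoint);
        }
    }

All three outputs (D834 DD1E, F0 9D 84 9E, and 0001D11E) name the same
abstract character; only the code unit size differs.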

--Ken
