Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Joel Rees (rees@server.mediafusion.co.jp)
Date: Thu Feb 22 2001 - 22:40:51 EST


Ken,

Thanks for the consideration. I threw my ego away years ago.

> Joel,
>
> > > Note that I am just sending a response to you, not to the list.
> >
> > I wouldn't mind this being on the list. I was making bad assumptions
about
> > Sun's and others's reasons for wanting to do perverse things with
surrogate
> > pairs, and this clears it up. I guess you want to reduce traffic on the
> > list?
>
> No, not necessarily. But I prefer not to say blunt, uncomplimentary
> things about other members of the Consortium on an open, public list.
> I just said this privately to you, so that you would realize that there
> are implementation issues here that are different from what you
> seemed to be driving at.
>
> > Now, I'm going to have to do the math and see what happens, but if I
> > get the results it sounds like I will get, then the Java char type
> > really was a bad choice, and similar engineering decisions need to be
> > avoided in the future, even to the extent of heavy evangelizing. The
> > internal representation probably does need to be 32-bit.
>
> The choice of UTF-16 was made for a whole series of reasons.
>
> Java chose a 16-bit character type because it was practical. There
> are some implementation issues with it, because they didn't fully
> allow for what UTF-16 would imply for the APIs. Many people who
> started out with 16-bit Unicode a decade ago have the same issues today
> in adapting to Unicode 3.1.
>
> But it isn't that hard to fix things, while retaining 16-bit code
> units. I've been doing that just recently for the Unicode library
> that Sybase uses. Microsoft, no doubt, has similar issues, because
> they standardized on a 16-bit unichar long ago.
>
> And while UTF-32 has certain processing advantages in some places,
> UTF-16 works just fine for most things. I know, because I've
> implemented it for all kinds of functionality. All my tables for
> properties, normalization, collation, and such are implemented in
> UTF-16 -- they're more space efficient, among other things. And
> all my string handling is UTF-16. It is only at certain unique
> points, such as in recursive functions for doing decomposition,
> where the extra overhead for dealing with UTF-16 makes UTF-32
> attractive enough that I convert locally to UTF-32 to do
> that processing, and then convert back.
>
> This stuff is not rocket science, though it may seem to be sometimes.
>
> --Ken
>
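
Doing the math I mentioned above takes about five lines. Here is a
sketch (the class name is mine, and U+10400 is just an arbitrary code
point above the BMP) of what one supplementary character costs in a
16-bit Java char:

    // Splitting a supplementary code point into a UTF-16 surrogate
    // pair by hand. Any code point in U+10000..U+10FFFF works the
    // same way.
    public class SurrogateMath {
        public static void main(String[] args) {
            int cp = 0x10400;
            int v = cp - 0x10000;                      // 20 bits left
            char high = (char) (0xD800 + (v >> 10));   // top 10 bits
            char low  = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
            String s = new String(new char[] { high, low });
            System.out.println(Integer.toHexString(high)); // d801
            System.out.println(Integer.toHexString(low));  // dc00
            System.out.println(s.length()); // 2 chars for 1 character
        }
    }

One character, two chars -- so every length and index in an API built
on char is counting code units, not characters.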

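And the local conversion you describe is easy enough to picture in
outline. This is my own sketch, not your Sybase code; I'm assuming
lone surrogates just pass through untouched:

    // Folding a UTF-16 buffer into UTF-32 for local processing,
    // then (not shown) converting back after the heavy lifting.
    public class Utf16ToUtf32 {
        static int[] toUtf32(char[] u16) {
            int[] out = new int[u16.length]; // worst case: all BMP
            int n = 0;
            for (int i = 0; i < u16.length; i++) {
                char c = u16[i];
                if (c >= 0xD800 && c <= 0xDBFF && i + 1 < u16.length
                        && u16[i + 1] >= 0xDC00 && u16[i + 1] <= 0xDFFF) {
                    // high surrogate plus low surrogate: recombine
                    out[n++] = 0x10000 + ((c - 0xD800) << 10)
                                       + (u16[++i] - 0xDC00);
                } else {
                    out[n++] = c; // BMP code point or lone surrogate
                }
            }
            int[] trimmed = new int[n];
            System.arraycopy(out, 0, trimmed, 0, n);
            return trimmed;
        }
    }
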
If you can look past my extreme opinions preferring common standards to
universal ones, I would appreciate hearing more about how you've managed
your way around the warps in the transformations. I think the folks at
Sun and Oracle might be interested, too. Have you tried sharing some of
the key elements with them, as a sort of bribe to get them away from
trying to convert surrogate pairs directly into UTF-8?
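
To spell out the perverse part for the record: the pair has to be
recombined into one code point and written as a single four-byte UTF-8
sequence, not as two three-byte sequences encoding the surrogates
themselves. Roughly (again my own sketch, not anybody's shipping code):

    // Correct UTF-8 for a supplementary character: recombine the
    // surrogate pair, then emit one 4-byte sequence.
    public class PairToUtf8 {
        static byte[] encode(char high, char low) {
            int cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
            return new byte[] {
                (byte) (0xF0 | (cp >>> 18)),
                (byte) (0x80 | ((cp >>> 12) & 0x3F)),
                (byte) (0x80 | ((cp >>> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)),
            };
        }
        public static void main(String[] args) {
            byte[] b = encode('\uD801', '\uDC00'); // U+10400
            for (int i = 0; i < b.length; i++)
                System.out.print(Integer.toHexString(b[i] & 0xFF) + " ");
            // prints: f0 90 90 80
        }
    }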

Joel


