RE: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Feb 23 2001 - 19:19:56 EST


Mike Ayers responded to John Cowan:

> > From: John Cowan [mailto:jcowan@reutershealth.com]
> >
> > Ayers, Mike wrote:
> >
> > > After all, pretty much every ceiling ever established in
> > > computing has been broken through, and there is no reason
> > > to believe that it won't happen again!
> >
> > On the contrary. There *are* reasons to believe that it won't happen
> > in the case of character encoding.
>
> As well as reasons to believe that it will, as I explain below.

I was going to step away from this mishmash of hardware analogies
for worrying about the Unicode encoding space, but...

>
> The idea that I am trying to push here is that while Unicode may be
> near complete in its current scope, there is no reason to believe that this
> scope will not change. In fact, as we watch previously banned musical
> notation entered into the repertoire, we should acknowledge that it is
> *already* changing - where she stops, nobody knows.

True enough. But the history of the encoding process that I presented
includes such change of scope. People *are* bringing new symbologies
to the table for encoding as characters that weren't considered 11 years
ago, but despite that fact, there still is a logarithmic decay in the
number of characters encoded per year -- and even if there weren't, there
is enough coding space to deal with centuries of encoding effort at
the rate we are going.

> I do not say that it
> will happen - just that it might and that it wouldn't cost much to be
> prepared.

If you build in enough engineering leeway to deal with additions
for centuries, which is far longer than it is reasonable to believe
the encoding itself will remain in use, where is the problem?

You don't need to build a foundation for a 72-story building underneath
a 1-story wooden frame house, even if you do live in earthquake country.

And in any case, it is unclear what you mean by "being prepared" when
it refers to the Unicode Standard. Be prepared for what? More characters?
Well, refer back to my earlier notes. The committees cannot process
more than a few thousand characters per year, no matter how hard you
push them, and it is getting damn hard to find thousands of valid
characters to even try to push through. Look at the Roadmap again.
If you don't think Numidian, Vai, Chalukya, and Satavahana are obscure
enough to consider for encoding, then I have to wonder what you
think might be needed. And even after 11 years of research, and even
with big hieroglyphic scripts like Egyptian and Mayan and big ideographic
scripts like Tangut and Khitan included, we still haven't found enough
to fill Plane 1 with characters.

Over 10 years ago, Joe Becker made the offhand prediction that there
were about "250,000 things" that people would eventually want character
codes for. A decade later, I still think that estimate is in the right
ballpark. As engineers, we built the Unicode Standard with a 4x safety
factor beyond its greatest conceivable load. And I think that is good
enough.
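
(In round numbers: the 0x0000..0x10FFFF space holds 1,114,112 code points,
a bit over a million usable scalar values once the 2,048 surrogate code
points are set aside, against Joe's estimate of 250,000 -- a ratio of
roughly 4.5 to 1.)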

> Since I started monitoring this list last year, the two most
> repeated topics (other than AccuSplit pedometers) have been the 16 bit issue
> and the naming of the planes. In all that I have read (I couldn't read it
> all), the finger inevitably winds up pointing firmly at the Consortium, both
> for promoting a 16 bit model, and for being so confusing once that model
> didn't fit.

As is so easy to do, people seem to keep forgetting the history here.

In 1989, the biggest issue with 16-bit characters was not whether they were
big enough (although that issue *was* raised early on by those worried
about the seemingly inexhaustible supply of Han characters), but that
16-bit characters were *too* big. You would not believe the load of
c**p that passed early on this list about how dropping 8-bit characters in
favor of 16-bit characters would mean the end of technology and life
on earth. (Well, I exaggerate a smidgen.)

The rhetorical stance of Unicode 1.0 was aimed squarely at blowing
away the baroque mess of ISO 2022 and code switching between multiple
8-bit character encodings and DBCS encodings using escape sequences,
in favor of a large, flat character encoding space.

For that we used the 16-bit character, because it looked *almost* big
enough to us (although we already knew of Joe's estimate of the scale
of the problem), and because we had a big sales job ahead of us to
convince the implementation community (and their managers, in particular)
that moving to 16-bit characters and "doubling the size of the text
storage" (God, how many times I heard that one!) would be good for them.
Moving directly to a 32-bit character would just have doubled the
noise of the opposition based on this issue.

Now, many years and many compromises later, we are actually in
pretty good shape. We still have a single character encoding,
completely in synch with the International Standard, accepted both
by the industrial implementation community and by the de jure
standards community. With UTF-8, there is an encoding form that
keeps the 8-bit API folks happy when they are upgrading. With UTF-16,
the 16-bit pioneers get to keep their own (by now legacy) implementations,
with only minor complications to access the supplementary characters.
And with UTF-32, the devotees of a large flat character space have
their nirvana in hand. And the encoding space is big enough to
add characters for centuries, if need be.
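
To make the three encoding forms concrete, here is a minimal sketch in
C (the choice of C, and of U+1D11E MUSICAL SYMBOL G CLEF as the example
supplementary character, is mine; any code point above 0xFFFF would do,
and error handling is omitted):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x1D11E;   /* MUSICAL SYMBOL G CLEF, a Plane 1 character */

        /* UTF-32: just the code point, in the flat 0x0000..0x10FFFF space */
        printf("UTF-32: %08X\n", cp);

        /* UTF-16: code points above 0xFFFF become a surrogate pair */
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
        uint16_t lo = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate  */
        printf("UTF-16: %04X %04X\n", hi, lo);

        /* UTF-8: supplementary characters take four bytes */
        unsigned char b[4];
        b[0] = (unsigned char)(0xF0 |  (cp >> 18));
        b[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        b[2] = (unsigned char)(0x80 | ((cp >>  6) & 0x3F));
        b[3] = (unsigned char)(0x80 |  (cp        & 0x3F));
        printf("UTF-8:  %02X %02X %02X %02X\n", b[0], b[1], b[2], b[3]);

        return 0;
    }

That prints 0001D11E, D834 DD1E, and F0 9D 84 9E: the same character,
three conformant representations of it.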

The model may still be confusing, but all in all, I think we are doing
dang good at this point.

The rhetorical structure of the standard's text itself has slowly evolved
over time. It will evolve further. Hopefully it can be improved to the
point where Unicode is easier for the neophyte to comprehend,
although given the scope and complexity of a universal character
encoding, it is likely never going to be an easy topic.

> In the end, what the supplemental planes (or supplementary
> planes, or suppository planes, or whatever the official term is - I can
> never remember) and the basic plane really are could be summarized as "the
> original 16 bit character set and the 31 other 16 bit character sets". The
                                        ^16
> resemblance to an Intel 8088 is disturbing.

Exactly wrong.

The encoding space is a big flat space: 0x0000..0x10FFFF, with no numerical
magic about it.

UTF-32 makes it explicit now that a flat, 32-bit encoding form for
implementation of the big flat encoding space is perfectly conformant.

It is only when trying to understand Unicode exclusively from the 16-bit
point of view that it looks like a segmented model.
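
In code terms, the conformance requirement for a 32-bit code unit
amounts to a simple range check -- a minimal sketch (the function name
is my own):

    #include <stdbool.h>
    #include <stdint.h>

    /* A 32-bit code unit is usable in UTF-32 only if it lies in the flat
     * 0x0000..0x10FFFF space and is not a surrogate code point. */
    static bool is_unicode_scalar_value(uint32_t u)
    {
        if (u > 0x10FFFF)
            return false;                 /* outside the encoding space      */
        if (u >= 0xD800 && u <= 0xDFFF)
            return false;                 /* surrogates are reserved for the
                                             UTF-16 encoding form            */
        return true;
    }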

(By the way, I served my time doing pointer thunking for 32-bit code
that had to run on 80286 hardware. And I have fully implemented UTF-16
libraries that access all the supplementary characters in Unicode 3.1.
I consider the analogy between segmented pointers and the UTF-16
character encoding form to be superficial and misleading at best.)
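
The entire "minor complication" on the UTF-16 side is the arithmetic
below -- a sketch of the standard surrogate-pair decomposition, not an
excerpt from those libraries:

    #include <stdint.h>

    /* Recover a supplementary code point from a surrogate pair. Assumes a
     * well-formed pair: hi in 0xD800..0xDBFF, lo in 0xDC00..0xDFFF. */
    static uint32_t from_surrogates(uint16_t hi, uint16_t lo)
    {
        return 0x10000
             + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    }

    /* from_surrogates(0xD834, 0xDD1E) == 0x1D11E, the G clef again. */

Unlike a segment:offset pointer, every supplementary code point has
exactly one well-formed pair, and the surrogate ranges overlap no other
characters, so the mapping is unambiguous in both directions.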

>
> Why some folk think it is so problematic just to prepare for the
> future is beyond me.

We are prepared for the future. 788 years into the future, more or less.

What more are you asking for?

--Ken

> What little such preparation has been done in the past
> has always been rewarded ('486 booster socket excepted).
>
>
> /|/|ike
>


