Re: Talk about Unicode Myths...

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Mar 20 2002 - 19:22:11 EST


Dan Kogai wrote:

> On Thursday, March 21, 2002, at 03:55 , John H. Jenkins wrote:
> > There's an issue because Ohta-san (and a few others) hate Unicode with
> > a passion. ...
> > Most Japanese disagree with them, ...

> How can you be so sure that "Most Japanese disagree"? Have you
> actually taken a poll?

Perhaps John should have qualified his statement as being "most
Japanese technologists working on system and application software
in Japan."

Of course nobody has run an opinion poll among the general populace,
and the results of such a poll would be irrelevant anyway, since
Unicode isn't a general civil policy for the population to decide,
but a technical standard relevant to software (and database) implementation
and design -- and one which primarily impacts IT specialists.

And as regards the received opinion of Unicode among Japanese IT
specialists, the Unicode Consortium *does* get a substantial amount
of valid feedback: via the Japanese board member of the consortium,
via Japanese member companies or Japanese subsidiaries of American-
or European-domiciled member companies, via significant contact with
well-placed and active Japanese standards body members, via ongoing
contact with technical publishers in Japan, and so on.

> I happen to be a Japanese and even I am not sure
> how much beloved or hated Unicode is here. Isn't this kind of attitude
> that makes people like Ohta-san angry?

Others have responded about Ohta-san. He has his own longstanding
reasons, unlikely to be affected one way or the other by people
assuming that they know the level of Unicode acceptance in Japan.

> To me Unicode Consortium has already showed a big incompetence when it
> introduced Surrogate Pair.... What was Han Unification for after all?

I *think* Dan's point here was that Han unification was intended to
keep down the total number of code points needed, since one wouldn't
need duplicate code points for each of the different sources. But
the Unicode Consortium somehow goofed and showed its incompetence
when it was surprised that 16 bits wasn't enough and it
had to hack on surrogate pairs to correct its mistake and have
enough code points to finish off the encoding.

There is a seed of truth in that perception, but it misses many
points about the history of Unicode that have often been discussed on
this list and in other forums.

The short rebuttal is:

The Unicode founders knew from the outset that there were
more than 64K "entities" that people would come knocking on
the door to have encoded as characters. Joe Becker's estimate in
1989 was "about 250,000". Joe was in a position to know, as
an architect of the Xerox Star system, the first significant
multilingual/multiscriptal text processing system, and as
a participant in the Xerox Character Code Standard, the first
real attempt at a universal character encoding and the
intellectual precursor to Unicode.

Despite this knowledge, the Unicode Standard was designed
originally as a 16-bit fixed-width character encoding for
two basic reasons:

   A. It was intended to include all the *useful* characters,
      which was originally envisioned as including all modern-use
      CJK characters -- not every headword entry in every classical
      Chinese dictionary ever compiled. By rationing the massive
      sea of CJK entities on a usefulness criterion and by
      using rational principles towards encoding the other scripts,
      a 16-bit repertoire *could* have been surprisingly complete
      for most general purposes.

   B. The architects of Unicode knew that 16-bit characters were
      already perceived as "too big", and that a jump directly
      to 32-bit characters would have hindered the acceptance of
      Unicode rather than helped it. This perception was proved
      correct when Windows NT adopted 16-bit Unicode -- the single
      most important implementation decision in the history of
      the acceptance of the Unicode Standard.

The rationale for (A) got gradually eaten away over the years,
however -- a fact that didn't surprise the insiders all that
much, actually. The merger with 10646 brought a large number
of "useless" characters into the repertoire, including nearly
a thousand Arabic ligatures. Amendment 5 to 10646 was the culmination
of the comic opera which resulted in 11,172 Hangul syllables in
the standard, despite the fact that everyone knew the set was
insufficient for Old Korean, and that combining jamo would have
to be used for Old Korean anyway. The need for interconvertibility
between Unicode and all the legacy standards brought all the
compatibility characters along -- and they added up to significant
numbers.
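
Incidentally, the figure of 11,172 is just the combinatorics of the
modern jamo, and the precomposed block is laid out so that syllables
can be composed by arithmetic. Here is a minimal Python sketch of
that composition arithmetic (my own illustration, not anything new):

    # 11,172 = 19 leading consonants x 21 vowels x 28 trailing choices
    # (including "no trailing consonant"). Precomposed syllables start
    # at U+AC00 and are laid out so composition is pure arithmetic.
    L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
    print(L_COUNT * V_COUNT * T_COUNT)          # 11172

    def compose(l, v, t=0):
        # l, v, t are indices of the leading, vowel, and trailing jamo
        return 0xAC00 + (l * V_COUNT + v) * T_COUNT + t

    print("U+%04X" % compose(0, 0))             # U+AC00, first syllable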

But most significant of all, it became clear that no one
in the CJK community was going to be satisfied with just a
"useful" set of Han characters. There is a kind of historical
inevitability to the process of cataloguing Chinese characters --
no one group or organization can stop it, and it leads to massive
lists of "things" that need numbers. As I have noted before, for
example, the Unicode Standard has now accumulated 9 characters
for the CJK "grass radical". (If you doubt me, look them up:
U+2EBE, U+2EBF, U+2EC0, U+2F8B, U+4491, U+8278, U+8279, U+FA5D,
U+FA5E.) This kind of formal redundancy in the CJK repertoire
additions massively inflates the numerosity that we have to
deal with for CJK, despite the source unification rules.
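
For the curious, here is a minimal Python sketch (my own
illustration, not consortium tooling of any kind) that lists those
nine code points with whatever names Python's unicodedata module
reports for them:

    # List the nine "grass radical" code points mentioned above, with
    # the character name where the unicodedata module supplies one.
    import unicodedata

    GRASS_RADICALS = [0x2EBE, 0x2EBF, 0x2EC0, 0x2F8B, 0x4491,
                      0x8278, 0x8279, 0xFA5D, 0xFA5E]

    for cp in GRASS_RADICALS:
        name = unicodedata.name(chr(cp), "<no name in this Python build>")
        print("U+%04X  %s" % (cp, name))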

So, yes, it became inevitable that a fixed-width 16-bit character
architecture would be insufficient, and to protect the investment
in the character encoding, UTF-16 was invented as a cheap way
to access sufficient encoding space to complete the inventory.
As an "escape" mechanism for the character encoding, UTF-16
has proven to be notably benign and implementable. It certainly
has none of the kind of processing intractability that the
2022-style system of escapes has, for example.
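
To make the mechanism concrete, here is a minimal Python sketch of
the surrogate-pair arithmetic; the code point U+20000 is just an
arbitrary supplementary-plane example:

    # Split a supplementary code point (U+10000..U+10FFFF) into a
    # UTF-16 surrogate pair: a high surrogate in D800..DBFF carrying
    # the top ten bits of (cp - 0x10000), and a low surrogate in
    # DC00..DFFF carrying the bottom ten bits.
    def to_surrogate_pair(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    high, low = to_surrogate_pair(0x20000)
    print("U+20000 -> U+%04X U+%04X" % (high, low))  # U+D840 U+DC00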

So while Dan might consider UTF-16 a "big incompetence", my own
opinion is that the worst that could be held against it is that
it amounted to a bait and switch tactic, whereby implementers
were lulled into thinking they had a simple, fixed-width 16-bit
system, only to discover belatedly that they had bought into
yet another mixed-width character encoding after all. Compared
to the alternatives, however, I think most implementers are
still happy with the new model they ended up with, even though
they had to pay a little extra for the racing stripes, detailing,
and rustproof undercoating.

--Ken


