Re: Unicode CJK Language Myth

From: Glen C. Perkins (Glen.Perkins@NativeGuide.com)
Date: Mon May 20 1996 - 18:10:49 EDT


>Ken'ichi Handa answered me:
>
>>I agree. And, I've never claimed that all unifications done by
>>Unicode are incorrect. Most of them are ok, I think.
>
>Very nice to hear that from you. In general, most of the
>people in Japan I have met and discussed Unicode with say
>they dislike or hate Unicode. Apart from those cases where
>this is due to misunderstandings, it usually turns out that
>it is just a small feature or decision they dislike, and that
>otherwise, they are okay with the rest. The problem is
>that most of them don't realize that, given all the different
>requirements from all over the world, and in particular all
>the different ways of viewing and thinking about Kanji,
>there are actually very few points in Unicode that any
>single person is unhappy about.
>
>

This is why I agree with the idea that a white paper might be in order.
It's my impression that most of the Japanese (to whom I've spoken) who
oppose Unicode do so because they think it treats Chinese, Japanese, and
Korean "Han" chars as if they were the "same," which every right-thinking
Japanese considers a slap in the face of Japanese uniqueness by foreigners
who couldn't possibly understand their real needs.

Usually this *is* a result of a misunderstanding, primarily of the term
"CJK Unification," which sounds to them as though we Unicode proponents
think their languages are all interchangeable variants of Chinese. They
know that current Japanese-only encodings are sometimes rendered using
weird fonts, but at least those are all "correct" Japanese weird fonts.
Unicode opens up the possibility that they will be forced to endure
"incorrect" (e.g. simplified Chinese-style) fonts used to render Japanese.
This is considered an abomination, and therefore they feel Unicode must be
stopped.

Frequently they don't realize that, for the average Japanese reader, all
Unicode means is that occasionally, when they happen across some Chinese
or Korean text (on the web, for example) that is not marked up at a higher
level and handled strictly in terms of fonts, the glyphs will be
"Japanized" instead of total gibberish. In other words, essentially all
rendering "problems" will result in *other* languages becoming *more*
readable, not your own language becoming *less* readable! Japanese
newspapers already render Chinese and Korean names in Japanese-style
glyphs, so this simply automates a process the Japanese have used for
years to deal with bits and pieces of Chinese and Korean in a Japanese
context.
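
To make the unification concrete: the 'zhi/choku/chik' character is a
single unified code point, and which national glyph style you see depends
entirely on the font doing the rendering, not on anything stored in the
text itself. A quick sketch (Python is merely a convenient vehicle here,
and U+76F4 as the code point in question is my assumption):

    import unicodedata

    # One code point, whether the surrounding text is Chinese, Japanese,
    # or Korean; only the font decides which glyph shape appears.
    ch = "\u76F4"  # assumed to be the 'zhi/choku/chik' char under discussion
    print(hex(ord(ch)))          # 0x76f4
    print(unicodedata.name(ch))  # CJK UNIFIED IDEOGRAPH-76F4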

The minority of people who really *do* understand all of this (such as
Handa-san), but still object on the basis of a few features or decisions,
as you are saying, need to be dealt with from the other direction, I think:
what alternative would be more acceptable overall (not on one specific
point, but overall)?

There aren't enough code points to include Chinese, Japanese, and Korean
without one of the following:

  1) CJK unification;
  2) no unification, but a drastic reduction in the number of chars
     allotted to each language;
  3) no unification and no reduction in chars per language, but an
     expansion beyond two bytes/char; or
  4) multiple, different encodings (a mixture of national single- and
     double-byte standards).

In reverse order: using multiple encodings requires markup, because the
text will be completely unreadable without marking the switch from one
encoding to another. It then requires duplicate fonts and/or tables for
mapping parts of various encodings to parts of various fonts. The
complexity usually results in systems being effectively monolingual, or at
least monoscript. If I were to send Handa-san a message asking, "what does
X mean?" where X was the Korean 'chik' char (the 'choku/zhi' char we've
been discussing) encoded in KSC and entered via my Korean input system, it
would come up as some kanji on his screen when interpreted as if it were
Shift-JIS (or EUC or whatever), but heaven only knows which one! National
standards handle this example far worse than Unicode does.
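
A minimal sketch of that failure mode (the codec names euc_kr, euc_jp, and
shift_jis, and Python itself, are just convenient stand-ins for the
national standards named above):

    # Encode the 'chik' char as a Korean system would, then decode the very
    # same bytes as if they were Japanese text. Without external markup the
    # receiver cannot tell which interpretation the sender intended.
    ch = "\u76F4"                    # the 'zhi/choku/chik' char (assumed)
    ksc_bytes = ch.encode("euc_kr")  # bytes from the Korean input system
    print(ksc_bytes.hex())
    print(ksc_bytes.decode("euc_jp", errors="replace"))     # some other kanji,
    print(ksc_bytes.decode("shift_jis", errors="replace"))  # kana, or garbage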

Going beyond two bytes per char can be considered just another form of
markup, really, with additional information attached to each char rather
than to a sequence of chars. A third byte, for example, could be used to
expand the number of code points, or it could indicate the "font family"
that a font would have to belong to for the rendering of the char encoded
in the other two bytes to be considered one of the "correct" variants (in
Handa-san's sense of the term). But most people would object far more
strongly to increasing the number of bytes for every character in every
document in every language in the world than to the remote possibility of
a rendering problem: you would have to read a completely unmarked-up
Japanese document at a Chinese user's workstation, the document would have
to consist of a single kanji with no other context, and that kanji would
have to be one of the few "problem" kanji.
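
For concreteness, here is a sketch of the kind of three-byte scheme being
described; the layout and tag values are hypothetical, invented purely for
illustration, and are not any real or proposed standard:

    import struct

    # Hypothetical tags naming the "font family" a renderer must use.
    TAG_NONE, TAG_JAPANESE, TAG_SIMPLIFIED_CHINESE = 0x00, 0x01, 0x02

    def encode3(text, tag=TAG_NONE):
        """Pack each char as one tag byte plus a big-endian two-byte code
        point (BMP chars only, as in two-byte Unicode)."""
        return b"".join(struct.pack(">BH", tag, ord(c)) for c in text)

    def decode3(data):
        """Unpack into (tag, code point) pairs."""
        return [struct.unpack(">BH", data[i:i + 3])
                for i in range(0, len(data), 3)]

    blob = encode3("\u76F4", TAG_JAPANESE)
    print(blob.hex())     # 3 bytes/char instead of 2: a 50% cost on everything
    print(decode3(blob))  # [(1, 30452)] -- tag plus code point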

You could keep it at two bytes/char if you simply told the Japanese,
Chinese and Koreans that they had to get rid of about half of the chars
they currently have available in unicode. Yeah, right.

Otherwise, you can have one encoding covering virtually all desired chars
from virtually every language, all with only two bytes/char, if you allow
CJK unification. If you only read one language, you'll never have a
problem on your own machine or on any other machine with your favorite
style of font available. If you read multiple languages (multiple
scripts), you'll want markup to handle things nicely, but you would need
that no matter what encoding you used (multiple single- and double-byte
encodings with encoding markup, more than two bytes per char, which means
each char is "marked up," or Unicode with some form of markup).
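
As a sketch of what "markup handling things nicely" might look like: a
span-level language tag, carried outside the character stream, picks the
font. The tag and font names below are illustrative assumptions, not any
real rendering API:

    # Hypothetical: map a span's language tag to a preferred font.
    FONT_FOR_LANG = {
        "ja": "a Japanese Mincho-style font",
        "zh": "a Chinese Song-style font",
        "ko": "a Korean Batang-style font",
    }

    def render(char, lang=None):
        font = FONT_FOR_LANG.get(lang, "reader's default CJK font")
        print(f"U+{ord(char):04X} rendered with {font}")

    render("\u76F4", lang="ja")  # tagged spans get "correct" national glyphs
    render("\u76F4")             # untagged text falls back to the default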

The ISO 10646 "extended Unicode" standard even allows us to begin with all
of the above advantages and then add the ability to mark up some chars,
character by character, when we really want that feature in the future.

When people aren't allowed to simply complain, "well, I can think of an
unusual circumstance where there could be an occasional problem for a few
people," and instead have to come up with an alternative that addresses
that problem, *all* of their other needs, and *everybody else's* needs, in
*all* circumstances, better overall than the current Unicode proposal, it
puts the Unicode question in a different light.

The only passable answer I've heard to this question is, "well, keep it to
two bytes, keep the CJK unification, but don't unify it quite so much.
Separate the chars (like 'zhi/choku/chik') which aren't 'correct.'"

I don't know enough to say that this is totally a bad idea. What I can say
is that a large percentage of simplified Chinese chars are likely to be
considered "wrong" by Handa-san's definition because they don't adhere to
the standard form of the traditional radicals, so they wouldn't pass his
"schoolboy" test. I think there are too many characters in this category to
disunify them all.

If I'm mistaken, and the Japanese (and others) object to only a few cases
of unified chars, then what are they suggesting we all give up in the
double-byte standard to make room for the expansion? We still have new
characters with no current code point at all trying to enter the standard.
Which matters most: including these rare chars at all, guaranteeing an
acceptable glyph under *all* circumstances for a few common chars, keeping
other Unicode features (e.g. the user-defined range), or keeping text
encoding at two bytes/char?

By this, I'm not suggesting that the idea is impossible or stupid; I'm
just wondering what it would take to satisfy Unicode's most vociferous and
influential critics and make them supporters of the overall system on
balance.

In this spirit, I would ask Handa-san and any other critic: what *overall*
solution would you support more than the current *overall* solution?

__Glen Perkins__

(What if we changed the term from "CJK Unification" to "CJK Extension" and
told each country that it was an "extension" of their national standard.
;-) That's all it would probably take for some of the people I've talked
to.)


