Re: Unicode CJK Language Myth

From: Kenichi Handa (handa@etl.go.jp)
Date: Sun May 26 1996 - 20:34:18 EDT


Glen.Perkins@NativeGuide.com (Glen C. Perkins) wrote:
   The minority of people who really *do* understand all of this (such as
   Handa-san), but still object on the basis of a few features or decisions,
   as you are saying, need to be dealt with from the other direction, I think:
   what alternative would be more acceptable overall (not on one specific
   point, but overall)?

We already have a framework: ISO 2022.

   There aren't enough code points to include Chinese, Japanese, and Korean
   without either: CJK unification; not unifying, but drastically reducing the
   number of chars allotted to each language; not unifying, not reducing the
   number of chars per language, but expanding beyond two bytes/char;

To begin with, 16 bits (65,536 code points) is too small for the Han
characters.

   or using
   multiple, different encodings (mixture of national single and double byte
   standards).

??? ISO-2022 can provide a single encoding method for a mixture of
the various national standard character sets.

   In reverse order: multiple encodings requires markup because it will be
   completely unreadable without marking the switch from one encoding to
   another.

I don't understand what you are worrying about. ISO-2022 can
effectively handle multiple character sets. When a program reads
multilingual text, it can attach some tag bits to each character
code to identify its character set. This is exactly what Mule
(Multilingual Enhancement to GNU Emacs) does.
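
For illustration, a minimal sketch using Python's standard
iso2022_jp codec (in ISO-2022-JP, per RFC 1468, ESC $ B designates
JIS X 0208 and ESC ( B returns to ASCII):

    # Every run of bytes is preceded by an escape sequence that
    # designates its character set, so the stream identifies the
    # character set of every byte in-band.
    data = "kanji: \u6f22\u5b57".encode("iso2022_jp")
    print(data)
    # b'kanji: \x1b$B4A;z\x1b(B'
    # "\x1b$B" designates JIS X 0208; "\x1b(B" returns to ASCII.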

   It then requires duplicate fonts and/or tables for mapping parts
   of various encodings to parts of various fonts. The complexity usually
   results in systems being effectively monolingual, or at least monoscript.

What we need is to provide an appropriate font for each character
set. Even with Unicode, we need multiple fonts (at least for
Japanese and Chinese, as far as I know).
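
As a sketch of that point (the font names and the table itself are
hypothetical, for illustration only): under ISO-2022 the character
set tag alone is enough to choose nationally correct glyph shapes,
while a unified code point needs extra context to decide.

    # Hypothetical mapping from ISO-2022 character set to font.
    FONT_FOR_CHARSET = {
        "jisx0208": "Mincho",   # Japanese glyph conventions
        "gb2312":   "SongTi",   # Chinese glyph conventions
        "ksx1001":  "Batang",   # Korean glyph conventions
    }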

   If I were to send Handa-san a message asking, "what does X mean?" where X
   was the Korean 'chik' char (the 'choku/zhi' char we've been discussing)
   encoded in KSC entered via my Korean input system, then it would come up as
   some kanji on his screen when interpreted as if it were shift-JIS (or EUC
   or whatever), but heaven only knows which one! National standards are much
   poorer at handling this example than Unicode is.

This problem never happens if we use ISO-2022-KR (for Korean
character sets), ISO-2022-JP (for Japanese character sets), and
ISO-2022-CN (for Chinese character sets), or a mixture of them, for
multilingual text. Actually, all Mule users are exchanging
multilingual e-mail without any difficulty (ISO-2022-CN is not yet
supported by Mule because the standard was finalized after the
release of the latest version of Mule).
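
Again a minimal sketch with Python's standard codecs: the
ISO-2022-KR stream carries its designation (ESC $ ) C for KS X 1001,
per RFC 1557) in the bytes themselves, so the receiver cannot
mistake it for shift-JIS or EUC.

    chik = "\u76f4"                   # the 'chik' character discussed above
    data = chik.encode("iso2022_kr")  # designation ESC $ ) C is in-band
    print(data)                       # e.g. b'\x1b$)C\x0e...\x0f'
    assert data.decode("iso2022_kr") == chik  # round-trips unambiguously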

So, what we really need is ISO-2022-INT.

   Going beyond two bytes per char can be considered just another form of
   markup, really, with additional information attached to each char rather
   than to a sequence of chars. A third byte, for example, could be used to
   expand the number of code points

Yes, this is what I wrote above.

   Most people would object far more
   strongly to increasing the number of bytes for every character for every

You should realize that most people don't need a true multilingual
environment; the market for such software is also very small for the
moment. So it's not surprising that, for most people, keeping the
burden small matters more than truly solving the
multilingual/international problem.

I don't claim that Unicode is useless for localized software.
Actually, by using Unicode in Japanese localized software, we get
many more characters than just the combination of JIS X 0208 and
JIS X 0212. I only oppose those people who insist on using Unicode
for internationalized software or multilingual text, especially in
the CJK area.

   The only passable answer I've heard to this question is, "well, keep it to
   two bytes, keep the CJK unification, but don't unify it quite so much.
   Separate the chars (like 'zhi/choku/chik') which aren't 'correct.'"

   I don't know enough to say that this is totally a bad idea. What I can say
   is that a large percentage of simplified Chinese chars are likely to be
   considered "wrong" by Handa-san's definition because they don't adhere to
   the standard form of the traditional radicals, so they wouldn't pass his
   "schoolboy" test. I think there are too many characters in this category to
   disunify them all.

Too many for what? For 2 bytes? Why should we start from a 2-byte
code? It is neither impossible nor hard to handle a 3-byte or 4-byte
code; a third byte alone expands the code space from 65,536 to more
than 16 million code points.
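
To make this concrete, a minimal sketch of one obvious 3-byte scheme
(the charset IDs here are hypothetical, not Mule's actual internal
values): tag each 2-byte national code point with a 1-byte character
set identifier.

    CHARSETS = {"jisx0208": 0x01, "ksx1001": 0x02, "gb2312": 0x03}

    def pack(charset, code):
        """Tag a 16-bit national code point with an 8-bit charset ID."""
        return (CHARSETS[charset] << 16) | code

    def unpack(tagged):
        """Recover (charset ID, code point) from the 3-byte value."""
        return tagged >> 16, tagged & 0xFFFF

    # The same 16-bit value names different characters in different
    # national standards; the tag keeps them distinct, with no
    # unification decisions needed at all.
    assert pack("jisx0208", 0x3441) != pack("ksx1001", 0x3441)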

   In this spirit, I would ask Handa-san and any other critic what *overall*
   solution you would support more than the current *overall* solution.

I'm just claiming that the current one (Unicode) doesn't provide an
*overall* solution to the multilingual environment, and that we
should not pretend it does. And I believe that a good solution for
multilingual text handling is ISO-2022-INT or something similar.

   (What if we changed the term from "CJK Unification" to "CJK Extension" and
   told each country that it was an "extension" of their national standard.
   ;-) That's all it would probably take for some of the people I've talked
   to.)

I do agree with your suggestion, because then it becomes clear that
Unicode is only for localization; no one would dream of using
Unicode for internationalization.

---
Ken'ichi HANDA
handa@etl.go.jp


