Re: Unicode CJK Language Myth

From: Kenichi Handa (handa@etl.go.jp)
Date: Sun May 26 1996 - 20:29:12 EDT


First of all, I am very sorry for the late reply. It was a busy week,
and our lab suffered a power cut for two days. :-(

mduerst@ifi.unizh.ch writes:
> otherwise, they feel okay with the rest. The problem is
> that most of them don't realize that given all the different
> requirements from all over the world, and in particular all
> the different ways, in particular, of viewing and thinking
> about Kanji, there are actually so few points in Unicode
> any single person is not exactly happy about.

I'm not claiming that Unicode should distinguish ALL the different
ways, but that it should have distinguished more reasonable variants.
Although the word "reasonable" is very vague, the current unification
in Unicode appears unreasonable to many Japanese. If we did not have
to worry about the possibility of, for instance, the character "choku"
being shown in an unexpected way, many more Japanese people would
accept Unicode.

I believe that a 2-byte code is too small for Han characters but a
4-byte code is too large; perhaps a 3-byte code is best, and our
computing power can easily handle 3-byte codes.
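The arithmetic behind this can be sketched as follows (a rough
illustration; the figure of roughly 100,000 Han characters is my
assumption, based on large Kanji dictionaries, not a number from the
original message):

```python
# Rough code-space arithmetic for fixed-width encodings.
# HAN_CHARS_NEEDED is an assumed ballpark for the number of Han
# characters attested in large dictionaries (not an official figure).
HAN_CHARS_NEEDED = 100_000

for n_bytes in (2, 3, 4):
    code_points = 256 ** n_bytes          # total fixed-width code space
    fits = code_points >= HAN_CHARS_NEEDED
    print(f"{n_bytes}-byte code: {code_points:>13,} code points, "
          f"{'enough' if fits else 'too small'} "
          f"for ~{HAN_CHARS_NEEDED:,} Han characters")
```

A 2-byte code tops out at 65,536 code points, while 3 bytes already
give over 16 million, which is why a 3-byte code would leave ample
room to encode variants separately instead of unifying them.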

> Okay, I'll start again. If you look at what we might call "multilingual
> typography", then you find mainly two cases, namely:

> - The case where a few words from another language are incorporated
> in the text of another language. In this cases, there is no font
> change; glyph differences that may exist between those two
> languages are eliminated in favor of glyphs in the base language.

Why do you think only the glyphs of the base language are used in such
cases? A typical case is a Chinese person's name in a Japanese
context. If we had no economic reason for using only Japanese glyphs,
we would use the correct Chinese glyphs for Chinese names.

> - The case where there is an abrupt change between different languages,
> such as in a dictionary, in a text to learn a language, or in a
> scientific paper that uses another language only in examples.
> In these cases, there is not just a glyph change, but also a
> font change to make the differences obvious.

There may or may not be a font change. If we always need a font
change, and thus higher-level information beyond just a character code
to display a correct glyph, then there is no need for an INTERNATIONAL
coded character set. In that case, what we need is to define a
protocol for sending that higher-level information. With that
information, we can keep using the existing local character sets such
as JIS, GB, CNS, KSC, etc.

Of course, I don't claim that JIS and the others are free of
unification problems, but as long as Unicode doesn't solve any of
them, there is no need for Unicode.
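The role of that higher-level information can be shown concretely (a
sketch in modern Python, using codec names that of course postdate
this message): the very same byte sequence is a valid character in
several East Asian character sets, so without an out-of-band label
naming the character set, the receiver cannot decode it.

```python
# The same two bytes are legal in EUC-JP, GB2312, and EUC-KR,
# but decode to entirely different characters in each.  The charset
# label is exactly the "higher-level information" the text describes.
data = b"\xb0\xa1"

for charset in ("euc_jp", "gb2312", "euc_kr"):
    print(f"{charset}: {data.decode(charset)}")
```

Running this prints a Japanese kanji, a Chinese hanzi, and a Korean
hangul syllable for the identical bytes, which is why local character
sets only work once a charset-tagging protocol is agreed on.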

> For structural and typographic reasons, texts where there are changing
> glyph shapes without font changes are virtually non-existent.

There exist many electronic dictionaries which do not change fonts.

Why does ASCII distinguish `a' and `A'? If we followed your logic, we
wouldn't need the distinction between lower case and upper case. (The
merit of not distinguishing lower and upper case is great for
computation. Don't you think so?) EVEN IF WE WRITE ALL TEXT IN UPPER
CASE, THERE IS NO READABILITY PROBLEM. Actually, the difference
between `k' and `K' is, in most fonts, smaller than the difference
between the two `choku's. And what Unicode is doing is something like
distinguishing `a' and `A' but not distinguishing `k' and `K'.
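The analogy can be put in code (a sketch; Python's `str.casefold`
stands in here for any higher-level equivalence operation, which is my
choice of illustration, not something from the original message):
ASCII gives `a' and `A' distinct code points, and "unifying" them is
left to software that wants a case-insensitive view. The same division
of labor could, in principle, apply to glyph variants.

```python
# ASCII encodes lower and upper case as distinct code points ...
assert ord("a") == 97 and ord("A") == 65

# ... and leaves their "unification" (case-insensitive matching) to a
# higher-level operation, applied only when the application wants it.
assert "CHOKU".casefold() == "choku".casefold()
assert "CHOKU" != "choku"   # at the code level they stay distinct
print("distinct codes; unification is a higher-level operation")
```

That is, distinguishing at the code level costs nothing for
applications that want unification, while unifying at the code level
makes the distinction unrecoverable.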

> So it is fair to conclude that a system such as Unicode, which relegates
> glyph differences to be resolved by higher-level information such
> as font information, is a very reasonable solution for multilingual
> text processing and typography.

So it is fair to conclude that a system such as Unicode can only be
used for localized text processing, and is very hard to use for
multilingual text processing.

> We are saying the same, if we assume that software is
> behaving reasonably and is not implemented by beginners.

I want to say this again: we should not have to assume any
sophisticated software just to display a correct glyph.

> Just open your eyes, and have a look at advertisement and logos
> around you. If you are more interested, have a look at some books
> on Japanese logo design and modern typography. I have some such
> books here, but giving you the ISBN number won't help you as they
> are somewhat outdated (late 80s) and won't be on sale in Japan
> anymore.

You are talking about something like calligraphy. No one reads a book
in which all the text is printed in such eccentric fonts. Look at any
Japanese magazine: even though the titles are in very eccentric
glyphs, the body text is printed in the glyphs we see in school
textbooks.

In addition, I believe that no Japanese font variation contains the
Chinese-style `choku' glyph. Have you ever seen one?

> I have mentionned "guessing" techniques and setup scenarios to get
> the best glyph shapes before. If a system is not able to conclude that
> most probably the user wants to see a Japanese glyph when using
> a Japanese input method, please don't blame it on the character set.

1) I don't want to use any "guessing" techniques.
2) Even if you insist on using the word "best", the correct word to
use here is "correct", for many Japanese.
3) I'm not talking about a single program. The display routine and
the input method driver may belong to different pieces of software.

>> difficulty especially in the case of ideograms. If culture A want
>> characters X and Y be unified but not with Z, and culture B want X and
>> Z be unified but not with Y, what kind of unification is good?
> This is indeed a potential cause of troubles. Some cases with
> this structure indeed exist, but in these cases, the differences

Perhaps my example was too general and complicated. I should have
written: if culture A uses both X and Y but wants to unify them, and
culture B uses X but never Y, should X and Y be unified or not?

This is the case of `choku'.

I think there should be two character codes, one for X of culture A
and one for X of culture B, because the two should be regarded as
different characters: the former allows the variation Y, but the
latter does not.

> Also, there is no possibility of misunderstanding. If a Japanese
> sees the Chinese glyph for "choku" without context, and does
> not recogize it, there is absolutely no danger of confusing it
> with something else. The only thing (s)he can say is
> "sorry, I don't know".

... and the receiver may conclude that the sender is not an educated
person.

If he knows that the character is Chinese, he may conclude that his
friend thinks he is Chinese.

Are these subtle problems or not? I don't know the answer.

> That will happen for the majority of the kanji characters in Unicode
> anyway.

Please don't generalize our discussion point too much. Of course,
many Japanese don't know, for instance, Arabic characters.

---
Ken'ichi HANDA
handa@etl.go.jp



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT