Re: Unicode, Cure-all or Kill-all?

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Mon Aug 12 1996 - 09:09:58 EDT


Dear Timothy,

>Thanks for your 3 very informative letters. Your point on multiple
>codepoints is well taken. Some times, even myself have this problem too.
>However, according to the scholars involved in the CCCII, these
>characters should be coded separately, because they are different
>characters. I think this is a key point in the definition of character.
>Should the character definition be shape-base or meaning-base? I
>personally don't know, but I agree with these experts. Furthermore, if
>this is the nature of the Chinese language, what else can we do except
>accepting that as a fact. On the other hand, every language is a living
>organism and hence changing all the time. This may be changed already.
>Again, I don't know.

Well, there are scholars that deny that Chinese characters have meaning,
and they do this very fervently. But I don't agree with them, and one
does not have to go that far to see that using meaning as the principal
guide to character encoding is doomed to fail. Whatever some scholars may
say, if we don't get a solution that is practically usable, we shouldn't
consider it. And distinguishing one and the same shape with different
codepoints because that shape can have different meanings is absolutely
not usable in practice. If you, as an expert very interested in character
coding, have your problems with multiple codepoints for the same
shape, how could an arbitrary user not have these problems? On the
other hand, if I send somebody an email with the correct separate
encoding of the character(s) for Taiwan, Typhoon, and Sir, what will
the recipient gain? Will (s)he ever notice?
Furthermore, there are many characters that have one and the same
historic origin, but several meanings, now or throughout history.
For example, should the character "to come" get a second codepoint
for its original historic meaning? And what about all the other
characters that have a wide field of uses and meanings? How should
one divide this field of meaning into reasonable patches?
Yet another example: what if somebody writes a text in which the
fact that the character can mean Taiwan, Typhoon, as well as Sir
is used in an artistic way? With the CCCII system, one would have
to decide for one or the other meaning, and the pun would be
lost.
Other than as an overestimation of "meaning" from an ivory tower,
I cannot understand the decision of the CCCII scholars to choose
such an impractical solution.

>Regarding to the "new character" issue -- (1) There always be new
>Chinese characters generated all the time. Unlike the alphabetical
>languages, Chinese character set is an open set. On average, since Ming
>dynasty, about every 3 days, a new character shows up.
One has to be very careful with such numbers. Most of the "new"
characters that turn up are simple mistakes.

>(2) Now, if the
>coding space is a closed (16x16 bits) or limited (such as Big-5) one,
>then, accomandating them will be difficult or impossible.
With UTF-16, Unicode has a codespace of about 1000000 codepoints.
That's enough for at least the next 500 years.
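
As a rough sketch of the arithmetic behind that figure (an illustration
only; the function names are made up for this example): UTF-16 pairs one
of 1024 high surrogates (U+D800 to U+DBFF) with one of 1024 low
surrogates (U+DC00 to U+DFFF), which addresses 1024 x 1024 codepoints
beyond the BMP.

```python
from typing import Tuple

# Surrogate ranges used by UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def surrogate_codespace() -> int:
    """Number of codepoints reachable through surrogate pairs."""
    high_count = HIGH_END - HIGH_START + 1  # 1024 high surrogates
    low_count = LOW_END - LOW_START + 1     # 1024 low surrogates
    return high_count * low_count


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a codepoint beyond the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("codepoint is not in the surrogate-addressed range")
    offset = code_point - 0x10000  # 20-bit offset past the BMP
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


print(surrogate_codespace())  # 1048576
print([hex(unit) for unit in to_surrogate_pair(0x20000)])
```

The 1048576 positions come in addition to the 65536 positions of the
BMP itself.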

>(3) And, if
>the number of characters coded is small (such as GB, Big-5, Unicode),
>then, the users are forced to use the private zone heavily. And that
>defeat the very purpose of information interchange, because each user
>will have different code assignments -- information can not be
>interchange then.
If new characters get created, a mechanism to deal with them before
they are allocated official codepoints is necessary in any case. The
problem currently is that such a mechanism is not well established
or standardized; the easiest case currently is HTML, where you can
use an inline GIF. Anyway, such mechanisms are needed, but if they
are used extremely rarely because the basic set covers almost all
cases, nobody will really be interested in developing and
implementing them. So, strange as it may sound, not having too
large a basic set can actually help to get mechanisms that make it
easy to include even very rare characters in documents.
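
To make the interchange concern concrete, here is a minimal sketch,
assuming Python and its standard unicodedata module, of how a sender
could list the private-use codepoints in a text before passing it on.
The helper name private_use_positions is hypothetical, and this is not
one of the mechanisms discussed above, only a way to spot where one
would be needed.

```python
import unicodedata


def private_use_positions(text):
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (index, "U+%04X" % ord(char))
        for index, char in enumerate(text)
        # General category "Co" marks private-use codepoints.
        if unicodedata.category(char) == "Co"
    ]


# U+E000 is the first codepoint of the BMP private-use zone.
print(private_use_positions("ab\ue000c"))  # [(2, 'U+E000')]
```

Any position reported this way has no agreed meaning for the recipient
unless both sides share the same private assignments.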

>I don't consider myself is any different from other
>average users, but from my experiences, using Big-5 character set, I
>have to make about 2~3 'new' characters per month. Especially, in the
>case of maintaining a customer mailing address database. For example, in
>my little customer database, containing only about 400 person, I am
>short of about 10 characters for their names and the addresses -- and I
>can NOT use any substitute characters for these.

I don't know much about Taiwanese names, but in Japan, it usually
turns out that most of these "missing characters" are character variants
that somebody wants to see as a character of its own based on a lack
of understanding of character history, typography, and calligraphy.

>Well, about MaJong, do you know during the past, it was considered as an
>evil gambling instrument and was banned by both Chinese governments? My
>childhood neighbor was taken by the policeman. The point I want to say
>is not the MaJong itself, or why the chess was in. The same goes for the
>emperor's names. These are not the main issues, but just the examples.
>The key point is are they "characters"? If not, why they take up so many
>precious coding spaces?
Chess and Japanese Emperors together are 17 codepoints. This is really
a marginal number. Your set of 70000+ characters won't fit in the BMP
anyway, and will fit without problems in the UTF-16 area, so there
is no problem for you.

Regards, Martin.


