Re: TC/SC mapping

From: DougEwell2@cs.com
Date: Thu Jan 24 2002 - 11:39:16 EST


Many have responded:

> Meanwhile, it is true that there are simplified characters which
> correspond to more than one traditional form.
...
> This is the kind of mess that has discouraged anybody from doing a
> systematic survey of simplifications for the Unihan database.
...
> Before converting TC to SC, one should resolve all TC variants to
> the most "common" or "standard" TC form (good luck deciding what that
> means).
...
> I think that any mapping will fail.

Thanks to everyone for your input concerning the TC/SC mapping issue. You
have confirmed what I already knew but needed concrete evidence for: namely,
that mapping between Traditional Chinese and Simplified Chinese is not a
simple 1-to-1 table lookup problem, but involves lexical analysis and even
knowledge of the author's intent.
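
To make the point concrete, here is a minimal Python sketch of why a plain
table lookup falls short. The table holds only a handful of commonly cited
pairs and is an illustration, not a complete or authoritative mapping; the
names TC_TO_SC, invert and ambiguous are made up for this example.

```python
# Illustrative TC -> SC table. A few well-known pairs only; a real
# mapping would be far larger and would still need context to apply.
TC_TO_SC = {
    "發": "发",  # to emit, to issue
    "髮": "发",  # hair
    "後": "后",  # after, behind
    "后": "后",  # empress (same form in both)
    "乾": "干",  # dry
    "幹": "干",  # to do; trunk
}


def invert(table):
    """Build the SC -> TC direction, keeping every traditional candidate."""
    inverse = {}
    for tc, sc in table.items():
        inverse.setdefault(sc, []).append(tc)
    return inverse


SC_TO_TC = invert(TC_TO_SC)


def ambiguous(sc_text):
    """Return the characters in sc_text that have more than one TC candidate."""
    return [ch for ch in sc_text if len(SC_TO_TC.get(ch, [])) > 1]
```

TC to SC is a many-to-one lookup here, so the reverse direction yields a
list of candidates rather than a single answer. Choosing among them is the
part that needs lexical analysis or knowledge of the author's intent.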

Currently on the IDN mailing list there is a big debate over this topic. It
is well known that ASCII-based domain names are matched in the DNS in a
case-insensitive manner. Many people recognize that Chinese readers who are
familiar with both TC and SC consider text written in the two sub-scripts to
be interchangeable, in roughly the same way that uppercase and lowercase
Latin are interchangeable. They would like Chinese domain names written in
TC to match the "equivalent" name written in SC, just as "UNICODE.ORG"
matches "unicode.org".

The difficulty is getting people to understand the scope of the problem. As you
have illustrated so well, TC/SC mapping is NOT, in the general case, as
simple as Latin case mapping. It requires content analysis, and possibly
some form of tagging.

Almost all of the list members whose e-mail addresses end in .cn, .tw or .hk
seem to believe that there is a willful disregard on the part of the working
group for the needs of Chinese users in this respect. We have tried to
convince them that (a) the solution is not as simple as Latin case mapping,
as many have portrayed it; (b) the problem is not with Unicode Han
unification, since TC and SC are not unified; (c) content analysis is not
feasible for domain names; and (d) the entire problem is out of scope of the
IDN WG. We have proposed that organizations register both <TC><TC><TC>.cn
and <SC><SC><SC>.cn if they want both hits to be successful. So far, not
much convincing has taken place. In the above case, they claim that all
eight (2^3) possible combinations (e.g. "<TC><SC><TC>.cn") would need to be
registered, which is overkill.
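
The arithmetic behind that objection can be sketched in a few lines of
Python. This assumes, for illustration only, that every character in the
name has exactly two distinct forms (one TC, one SC); the function name
mixed_variants and the "<TC>"/"<SC>" placeholders are hypothetical.

```python
from itertools import product


def mixed_variants(positions, tld=".cn"):
    """Enumerate every TC/SC mixture of a name.

    positions is a list of (tc_form, sc_form) pairs, one per character.
    With two forms per character the result has 2 ** len(positions) names.
    """
    return ["".join(choice) + tld for choice in product(*positions)]


# Three-character name, using placeholders instead of real characters.
name = [("<TC>", "<SC>")] * 3
variants = mixed_variants(name)
```

For the three-character case this produces the eight names mentioned above,
and the count doubles with each additional character.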

One list member has even proposed the prohibition of all CJK code points from
internationalized domain names "until the problem can be solved," and he has
the support of several others. It is obvious that this is an attempt to
hijack the entire IDN model by claiming "it does not support Chinese at all,"
which would certainly be true if Han characters were prohibited, and imposing
a locally-constructed, Chinese-specific (i.e. not universal) model later on.

Unfortunately, as an American who does not speak or read Chinese, I have been
in a poor position to argue with these people about their own written
language. So I relied on the combined expertise of the Unicode list,
including native speakers and people with doctorates in Chinese, for
background information. Thanks again for your help.

-Doug Ewell
 Fullerton, California
