On Thursday, January 24, 2002, at 09:39 AM, DougEwell2@cs.com wrote:
>
> Currently on the IDN mailing list there is a big debate over this topic.
> It is well known that ASCII-based domain names are matched in the DNS in a
> case-insensitive manner. Many people recognize that Chinese readers who
> are familiar with both TC and SC consider text written in the two
> sub-scripts to be interchangeable, in roughly the same way that uppercase
> and lowercase Latin are interchangeable. They would like Chinese domain
> names written in TC to match the "equivalent" name written in SC, just as
> "UNICODE.ORG" matches "unicode.org".
>
Actually, this is more like asking "honor" and "honour" to match.
> Almost all of the list members whose e-mail addresses end in .cn, .tw or
> .hk seem to believe that there is a willful disregard on the part of the
> working group for the needs of Chinese users in this respect. We have
> tried to convince them that (a) the solution is not as simple as Latin
> case mapping, as many have portrayed it; (b) the problem is not with
> Unicode Han unification, since TC and SC are not unified; (c) content
> analysis is not feasible for domain names; and (d) the entire problem is
> out of scope of the IDN WG. We have proposed that organizations register
> both <TC><TC><TC>.cn and <SC><SC><SC>.cn if they want both hits to be
> successful. So far, not much convincing has taken place. In the above
> case, they claim that all eight (2^3) possible combinations (e.g.
> "<TC><SC><TC>.cn") would need to be registered, which is overkill.
>
The bulk of Han ideographs don't occur in TC/SC pairs, so this is specious.
I.e., to register the equivalent of "unicode.org", you only need two
registrations, "<U+540C><U+4E00><U+78BC>.org" (TC) and
"<U+540C><U+4E00><U+7801>.org" (SC). You don't need eight registrations.
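To make the counting concrete, here is a rough Python sketch that enumerates every TC/SC spelling of a label, mixed ones included. The `VARIANTS` table and the `spellings` helper are made up for illustration and hold only the one pair from the example above; they are not real mapping data.

```python
from itertools import product

# Hypothetical, hand-picked variant table (NOT official UTC data).
# Each key maps to every form that should count as "the same" character.
VARIANTS = {
    "\u78bc": ("\u78bc", "\u7801"),  # U+78BC (TC) / U+7801 (SC)
    "\u7801": ("\u78bc", "\u7801"),
}

def spellings(label):
    """Return every TC/SC spelling of `label`, including mixed ones."""
    # A character with no entry in the table has exactly one form: itself.
    choices = [VARIANTS.get(ch, (ch,)) for ch in label]
    return sorted("".join(combo) for combo in product(*choices))

# <U+540C><U+4E00><U+78BC>: only the last character has a TC/SC pair.
labels = spellings("\u540c\u4e00\u78bc")
print(len(labels))  # 2, not 2^3
```

The count is the product of the number of forms per character, so it only reaches 2^n when every one of the n characters has a distinct SC form.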
Meanwhile, I'd like to offer a suggestion:
*If* they can live with one caveat, and *if* they can give us time to
clean up our SC/TC mapping data, we could do the following:
1) SC/TC matching on Unicode data is only to be done on the SC/TC mapping
data supplied by UTC.
2) Wherever a single SC character matches multiple TC characters, all the
characters are to be treated the same.
This means, for example, that U+53F0 (台) will be treated the same as
U+6AAF (檯), U+81FA (臺), and U+98B1 (颱). This also means, of course, that
U+6AAF, U+81FA, and U+98B1 will end up being indistinguishable even in
purely TC names.
3) This includes Unicode compatibility mappings. (Thereby reducing a lot
of turtles, if nothing else.)
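As a sketch of how points 1) to 3) might be implemented: group the characters into equivalence classes with a small union-find, then fold every name to one representative per class before comparing. The `SC_TO_TC` table below contains only the pairs mentioned in this message and merely stands in for the UTC-supplied data. The function names, the choice of the lowest code point as representative, and the use of NFKC for the compatibility mappings are all assumptions of this sketch.

```python
import unicodedata

# Illustrative SC -> TC pairs from this message only; the real table
# would be the mapping data supplied by the UTC.
SC_TO_TC = {
    "\u53f0": ["\u6aaf", "\u81fa", "\u98b1"],  # U+53F0 -> U+6AAF, U+81FA, U+98B1
    "\u7801": ["\u78bc"],                      # U+7801 -> U+78BC
}

def build_fold_table(sc_to_tc):
    """Map each character to the representative of its equivalence class."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: keep the lower code point as the representative.
            lo, hi = (ra, rb) if ra < rb else (rb, ra)
            parent[hi] = lo

    for sc, tcs in sc_to_tc.items():
        for tc in tcs:
            union(sc, tc)
    return {ch: find(ch) for ch in list(parent)}

FOLD = build_fold_table(SC_TO_TC)

def fold_name(name):
    """Apply compatibility mappings (NFKC), then SC/TC folding."""
    name = unicodedata.normalize("NFKC", name)
    return "".join(FOLD.get(ch, ch) for ch in name)

# TC and SC spellings of the same name now compare equal ...
print(fold_name("\u540c\u4e00\u78bc") == fold_name("\u540c\u4e00\u7801"))  # True
# ... and so do two distinct TC characters that share an SC form.
print(fold_name("\u6aaf") == fold_name("\u81fa"))  # True
```

The second comparison is the side effect described in 2): once a class is merged, its members cannot be told apart even in purely TC names.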
The caveat is that this must be understood to be a first-order,
computer-appropriate equivalence and is not in any way to be held to be a
generalized solution to the lexically appropriate conversion between SC
and TC. It also has to be understood that some things are going to slip
through because it is not a generalized solution to Han normalization.
Lexically inappropriate matches will take place!
(Maybe we should refer to *zhengguihua* instead of "Han normalization"…)
It also means that some desired matches won't happen, and some things can
be "spoofed" by these nasty variant issues such as came up yesterday.
U+9EBC and U+9EBD aren't likely to both match U+4E48.
However, this is already a problem in Unicode. "shuowen.org" will have to
register both "<U+8AAA><U+6587>.org" and "<U+8AAC><U+6587>.org"; Jingwa,
Inc., will need both "<U+4E3C><U+86D9>" and "<U+4E95><U+86D9>".
OK, so this is more than one caveat. It will also mean that we will no
longer be able to accept both the TC and SC form for a character as a
candidate for separate encoding in the future, and future compatibility
ideographs will be excluded from use in IDN. (Actually, you could save
yourself some grief right off by excluding Han radicals and all
compatibility ideographs.)
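A coarse way to apply that exclusion is a block-range check, sketched below. The ranges are the Unicode blocks for radicals and compatibility ideographs; `EXCLUDED_RANGES`, `is_excluded`, and `label_allowed` are hypothetical names. A block-level test is only an approximation: the U+F900..U+FAFF block, for instance, contains a handful of characters that are really unified ideographs, and a real profile would work from the character properties rather than block boundaries.

```python
# Block ranges (inclusive) to keep out of IDN labels in this sketch.
EXCLUDED_RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x2F00, 0x2FDF),    # Kangxi Radicals
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

def is_excluded(ch):
    """True if `ch` falls in a radical or compatibility-ideograph block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXCLUDED_RANGES)

def label_allowed(label):
    """True if no character of `label` is in an excluded block."""
    return not any(is_excluded(ch) for ch in label)

print(label_allowed("\u540c\u4e00\u78bc"))  # True: ordinary unified ideographs
print(label_allowed("\u2f00\u78bc"))        # False: U+2F00 is a Kangxi radical
```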
==========
John H. Jenkins
jenkins@apple.com
jenkins@mac.com
http://homepage.mac.com/jenkins/