Re: N2476 a hoax?

From: John H. Jenkins (jenkins@apple.com)
Date: Sat May 25 2002 - 12:23:50 EDT


Doug, the paper was authored by me per instructions from the UTC. To the
extent that there are typos, errors of fact, or misrepresentations of the
UTC's positions, I am naturally the sole person responsible. It was,
however, a good-faith attempt to represent the facts as I understood them
and the position of the UTC as I understood it. The paper was rushed, in
order to get it to WG2 in time for consideration at the Dublin meeting,
and although I did have it looked at by people other than myself, I'm
willing to grant that it could have used more thorough review and been
substantially improved.

Many of your points deal with an overall matter of style. The document
was chartered by the UTC but not vetted by the UTC. In an ideal world, it
would have been submitted to the UTC for official approval before sending
on to WG2. In the real world, there wasn't time for that, and if the
language of the paper or its overall style seem un-UTC-like to you, then I
apologize.

Specific points:

On Friday, May 24, 2002, at 10:02 PM, Doug Ewell wrote:
>
> The document, N2476, is titled "Variants and CJK Unified Ideographs in
> ISO/IEC 10646-1 and -2," which is quite a broad category. It turns out
> to be about inventing some sort of equivalence classes among Han
> characters so that they can be considered "the same" in certain
> contexts. Anyone can create this type of equivalence class for their
> own personal use, of course, but N2476 proposes that the IRG "be
> instructed to develop a classification scheme" that would have some
> sense of being officially sanctioned.
>
> This is at odds with what I have heard from the most prominent CJK
> experts on this list, that such equivalences are too dependent on
> context and writer's intent to belong in a character encoding standard.
>

Well, yes and no.

There are some classes of variation where this is not the case. The
classic example would be U+8AAA (說) and U+8AAC (説). I have yet to hear
anybody claim that these two are meaningfully different under any
circumstances. There are a *lot* of characters like this; we just got
one reported as an "error" in Unicode this past week.

On a *character* level, TC/SC equivalence is also relatively
straightforward. The question of whether character A is a simplified or
traditional variant of character B is well-defined and simple to answer
(in theory). The question of how to convert string A from simplified
Chinese to traditional Chinese is *not* simple to answer without
lexical analysis.
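
To make the distinction concrete, here is a rough sketch in Python; the
two-entry mapping below is purely illustrative, not official data. At
the character level, "is B a traditional form of A?" is a simple table
lookup, but the mapping is one-to-many, so strings cannot be converted
character by character:

    # Sketch only: character-level SC -> TC data, illustrative entries.
    SC_TO_TC = {
        "\u53d1": ["\u767c", "\u9aee"],  # 发 -> 發 (emit) or 髮 (hair)
        "\u8ba9": ["\u8b93"],            # 让 -> 讓 (unambiguous)
    }

    def tc_candidates(ch):
        """Possible traditional forms of one simplified character."""
        return SC_TO_TC.get(ch, [ch])

    # Character-level question: well-defined, easy to answer.
    print("\u767c" in tc_candidates("\u53d1"))  # True

    # String-level conversion: a bare 发 cannot be resolved to 發 or
    # 髮 without knowing the word it sits in -- lexical analysis.
    print(tc_candidates("\u53d1"))              # ['發', '髮']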

One of the reasons why the whole problem of Han variants is so nasty is
that there are so many different kinds of variant out there. In order to
try to bring order to this chaos, we need a model and we need data, and
the IRG is the best organization to provide that model and those data.

I should point out that at the last IRG, not only did Unicode have a paper
on variants, but the rapporteur also made a presentation on why this is a
problem, and much of the work at the meeting was done using a variant
database provided by Taiwan. The HKSAR also has a similar database.
And, of course, almost any Han dictionary has variant data in it; many
Chinese dictionaries even include TC/SC equivalences.

That Han variants exist is not at issue.

> For starters, the paper is signed "Unicode Techncial Committee." Under
> what circumstances would any member of the UTC release a paper with the
> UTC’s own name misspelled? There are other editorial mishaps:
>
>> ... end-users may want text to be treated is equivalent...
>
>> There are situations were some users...
>
> which are not at all up to the usual standards of a UTC document.
>

*sigh* To err is human.

> But enough nitpicking; it’s the content that really makes me think this
> document is a spoof. Check this justification for creating an
> equivalence class between simplified and traditional Han characters:
>
>> To give one instance which has been of some importance in early
>> 2002, most users want simplified and traditional Chinese to be
>> "the same" in internationalized domain names.
>
> "Most users" is both overstated and unsubstantiated. Several
> representatives from the Chinese, Taiwanese, and Hong Kong domain-name
> industry made this claim on the Internationalized Domain Name (IDN)
> mailing list. The topic became known simply as "TC/SC" and, for over a
> month, was more frequently and persistently discussed than any other
> topic. It got to the point where the domain-name representatives
> organized a chain-letter campaign, resulting in over 300 messages --
> many identically worded, and from previously silent contributors --
> insisting that the IDN architecture "must" implement TC/SC equivalence
> or be a complete failure.
>
>> Latin domain names, after all, are case insensitive.
>> "Www.Unicode.Org" resolves to the same address as
>> "www.unicode.org".
>
> UTC members have repeatedly stated that TC/SC equivalence is not at all
> comparable to Latin case mapping.
>
>> The inability to provide for [TC/SC equivalence] very nearly
>> prevented Chinese from being used in internationalized domain
>> names.
>
> No, it didn’t. That was a counterproposal made by the Chinese
> domain-name representatives, who claimed that prohibiting Han characters
> "for now" would give the relevant bodies more time to develop a proper
> TC/SC mapping solution (implying that the problem was solvable at all,
> an opinion disputed by many).
>

Mea culpa. I stated the facts as I understood them, and I appear to have
misunderstood them.

In any event, while I (for one) would argue that TC/SC equivalence is
not the same as English case-folding, my understanding was that there
was a body of people who argued otherwise. The existence of such a
body, and an acknowledgment of its desire, is not the same as agreement
with it.

At the same time, I *do* agree that it is possible to define, on a
purely character level, a function which provides a first-order
approximation to SC/TC equivalence. And I think it's a legitimate
concern for companies and individuals that some mechanism be in place
so that two domain names which are TC/SC "equivalents" aren't
registered by competing organizations -- Unicode's own ideal Chinese
domain name would be a case in point. Whether this is done via TC/SC
folding or via someone asking to register domain name X and being told,
"Oh, by the way, you also need to register domain names Y and Z while
you're at it" is irrelevant.
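
As a sketch of the registration-bundling idea (with a hypothetical
one-entry variant table; a real one would draw on the kind of data
discussed above):

    # Sketch: enumerate the TC/SC "equivalent" labels a registrar
    # might bundle with a requested domain label.
    from itertools import product

    VARIANTS = {
        "\u53d1": ["\u53d1", "\u767c", "\u9aee"],  # 发 / 發 / 髮
    }

    def equivalent_labels(label):
        """All labels formed by substituting known variants per character."""
        choices = [VARIANTS.get(ch, [ch]) for ch in label]
        return {"".join(combo) for combo in product(*choices)}

    # Ask to register X, be told to also take Y and Z:
    print(sorted(equivalent_labels("\u53d1\u5e03")))  # 发布, 發布, 髮布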

>> Programmers and users are being increasingly frustrated that as
>> ISO/IEC 10646 becomes more pervasive, they are increasingly
>> compelled to deal with a large number of variant characters some
>> of which are only subtly different from each other and which
>> cannot be automatically equated.
>
> The UTC would never refer to ISO/IEC 10646 as "pervasive"

Why not? Isn't it?

> or talk of
> programmers and users being "compelled" to deal with variant characters,

Why not?

> nor would it make such an emotional appeal that such variants should be
> "automatically equated."

Why not?

> Note the lack of standard UTC/WG2 terminology;
> if this were the UTC talking, you would be reading about canonical and
> compatibility equivalents and normalization.

No, if it were Ken Whistler or Mark Davis writing the document, you would
probably get this language. :-)

More seriously, why do compatibility or canonical equivalents or the UTC's
version of normalization come into it? The whole point here is that we
are dealing with a different category of "equivalent" than the standard
currently covers. The further issue of a normalized Han ("Cleanihan") is
also orthogonal.
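
One can check directly that the standard normalization forms leave such
variants alone; in Python, for the U+8AAA/U+8AAC pair mentioned
earlier:

    # Neither canonical nor compatibility normalization unifies Han
    # variants: U+8AAA and U+8AAC are distinct unified ideographs
    # with no decompositions at all.
    import unicodedata

    a, b = "\u8aaa", "\u8aac"
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        same = unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        print(form, same)  # False in every form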

> This passage also hints at
> the author’s lack of awareness that similar equivalence issues exist for
> scripts other than Han.
>

You may see the hint there; I certainly don't. In any event, I would
argue that the problem is a lot worse for Han than for any other script in
Unicode of which I'm aware.

>> What is needed, however, is something that allows at the least for
>> a first-order approximation of equivalence.... it would be up to
>> the authors of the individual application, protocol, or standard
>> to determine whether this were acceptable or not.
>
> And what if the authors decide the IRG-developed approach is not
> acceptable? What are they expected to do then?

Whatever they want.

We are repeatedly getting requests from people who are asking us how to
handle Han variants, Doug, and we currently have no answer at all beyond
pointing them to the rather limited data which is in Unihan.txt. (Indeed,
many of the requests are coming from people who ask, "How come the data
in Unihan.txt is so crappy?") We want to solve this problem. At the same
time, if Basis or Microsoft or someone else with the resources to develop
their own solution wants to use their own solution, we don't preclude them
from doing that.
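
For what it's worth, extracting the variant data Unihan.txt *does*
carry is only a few lines of work. A sketch, assuming a local copy of
the file and its tab-separated layout (code point, field name, value):

    # Sketch: collect the variant fields from a local Unihan.txt.
    # Assumed format: "U+XXXX<TAB>fieldName<TAB>value", "#" comments.
    VARIANT_FIELDS = {"kSimplifiedVariant", "kTraditionalVariant",
                      "kZVariant"}

    def read_variants(path="Unihan.txt"):
        variants = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                cp, field, value = line.rstrip("\n").split("\t", 2)
                if field in VARIANT_FIELDS:
                    variants.setdefault(cp, []).append((field, value))
        return variants

    # e.g. read_variants().get("U+8AAA") might include
    # ("kZVariant", "U+8AAC")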

> On the very same day (2002-05-08) that N2476 was published, a new
> Proposed Draft Technical Report (PDUTR #30) titled "Character Foldings"
> was also published. PDUTR #30, available on the Unicode Web site, deals
> with several different types of mappings between characters -- mappings
> that involve digraphs and trigraphs, removal of diacritical marks,
> mappings between Hiragana and Katakana, mappings between European,
> Arabic, and Indic digits, and so on. NOWHERE in this document is there
> the slightest mention of TC/SC mappings. Isn't that a bit strange?

No, not really. There is sometimes a tendency for people who work on UTC
documents to have a subconscious Han/everything-else dichotomy as they
work.

> If
> the UTC were really driving the issue of TC/SC mapping, wouldn't they
> have at least given it a brief mention in a "Character Foldings"
> proposal?
>

I would have hoped so, but evidently that didn't happen. That the UTC is
concerned about SC/TC data and other Han equivalences is, in any event,
already a part of the public record.

==========
John H. Jenkins
jenkins@apple.com
jenkins@mac.com
http://homepage.mac.com/jenkins/


