N2476 a hoax?

From: Doug Ewell (dewell@adelphia.net)
Date: Sat May 25 2002 - 00:02:28 EDT

Previous message: Doug Ewell: "Re: Language name questions"
Next in thread: John H. Jenkins: "Re: N2476 a hoax?"
Reply: John H. Jenkins: "Re: N2476 a hoax?"

A new JTC1/SC2/WG2 document, ostensibly from the Unicode Technical
Committee, was posted on the WG2 web site this past week. The URL is:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2476.pdf

This document is so far removed from the stated position of the UTC, and
so far below its normal editorial standards, that I believe it was
submitted by some other organization that signed the UTC’s name to it as
a hoax, perhaps in an attempt to lend it credibility.

I’m not normally much of a conspiracy theorist, so I’m admittedly
stepping into unfamiliar territory here.

The document, N2476, is titled "Variants and CJK Unified Ideographs in
ISO/IEC 10646-1 and -2," which is quite a broad category. It turns out
to be about inventing some sort of equivalence classes among Han
characters so that they can be considered "the same" in certain
contexts. Anyone can create this type of equivalence class for their
own personal use, of course, but N2476 proposes that the IRG "be
instructed to develop a classification scheme" that would have some
sense of being officially sanctioned.

This is at odds with what I have heard from the most prominent CJK
experts on this list, that such equivalences are too dependent on
context and writer's intent to belong in a character encoding standard.

For starters, the paper is signed "Unicode Techncial Committee." Under
what circumstances would any member of the UTC release a paper with the
UTC’s own name misspelled? There are other editorial mishaps:

> ... end-users may want text to be treated is equivalent...

> There are situations were some users...

which are not at all up to the usual standards of a UTC document.

But enough nitpicking; it’s the content that really makes me think this
document is a spoof. Check this justification for creating an
equivalence class between simplified and traditional Han characters:

> To give one instance which has been of some importance in early
> 2002, most users want simplified and traditional Chinese to be
> "the same" in internationalized domain names.

"Most users" is both overstated and unsubstantiated. Several
representatives from the Chinese, Taiwanese, and Hong Kong domain-name
industry made this claim on the Internationalized Domain Name (IDN)
mailing list. The topic became known simply as "TC/SC" and, for over a
month, was more frequently and persistently discussed than any other
topic. It got to the point where the domain-name representatives
organized a chain-letter campaign, resulting in over 300 messages --
many identically worded, and from previously silent contributors --
insisting that the IDN architecture "must" implement TC/SC equivalence
or be a complete failure.

> Latin domain names, after all, are case insensitive.
> "Www.Unicode.Org" resolves to the same address as
> "www.unicode.org".

UTC members have repeatedly stated that TC/SC equivalence is not at all
comparable to Latin case mapping.
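
To make the contrast concrete, here is a minimal Python sketch. It assumes that NFKC normalization followed by case folding is a fair stand-in for the equivalences the standard already defines; the helper name is hypothetical, and the single character pair (U+56FD and U+570B, the simplified and traditional forms of "country") is only one example.

```python
import unicodedata

def same_under_standard_folding(a, b):
    # Hypothetical helper: compare two strings after NFKC normalization
    # and Unicode case folding, both defined by the standard.
    def fold(s):
        return unicodedata.normalize("NFKC", s).casefold()
    return fold(a) == fold(b)

# Latin case differences disappear under case folding.
print(same_under_standard_folding("Www.Unicode.Org", "www.unicode.org"))  # True

# A simplified/traditional pair stays distinct: neither character has a
# decomposition or a case mapping that relates it to the other.
print(same_under_standard_folding("\u56fd", "\u570b"))  # False
```

Case folding is a fixed per-character mapping. Nothing comparable is defined for the second pair, so any TC/SC equivalence would have to come from separately supplied data.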

> The inability to provide for [TC/SC equivalence] very nearly
> prevented Chinese from being used in internationalized domain
> names.

No, it didn’t. Keeping Chinese out was a counterproposal made by the Chinese
domain-name representatives, who claimed that prohibiting Han characters
"for now" would give the relevant bodies more time to develop a proper
TC/SC mapping solution (implying that the problem was solvable at all,
an opinion disputed by many).

> Programmers and users are being increasingly frustrated that as
> ISO/IEC 10646 becomes more pervasive, they are increasingly
> compelled to deal with a large number of variant characters some
> of which are only subtly different from each other and which
> cannot be automatically equated.

The UTC would never refer to ISO/IEC 10646 as "pervasive" or talk of
programmers and users being "compelled" to deal with variant characters,
nor would it make such an emotional appeal that such variants should be
"automatically equated." Note the lack of standard UTC/WG2 terminology;
if this were the UTC talking, you would be reading about canonical and
compatibility equivalents and normalization. This passage also hints at
the author’s lack of awareness that similar equivalence issues exist for
scripts other than Han.

> It is vitally important that data be provided to allow
> developers, protocols, and other standards to deal with Han
> variants.

I have never before seen an official UTC paper that claimed it was
"vitally important" to solve a given problem. Individual submissions,
yes.

> What is needed, however, is something that allows at the least for
> a first-order approximation of equivalence.... it would be up to
> the authors of the individual application, protocol, or standard
> to determine whether this were acceptable or not.

And what if the authors decide the IRG-developed approach is not
acceptable? What are they expected to do then? Again, the reader is
invited to contrast this passage, in both form and content, with any
other that has been issued from the UTC in the past.

On the very same day (2002-05-08) that N2476 was published, a new
Proposed Draft Technical Report (PDUTR #30) titled "Character Foldings"
was also published. PDUTR #30, available on the Unicode Web site, deals
with several different types of mappings between characters -- mappings
that involve digraphs and trigraphs, removal of diacritical marks,
mappings between Hiragana and Katakana, mappings between European,
Arabic, and Indic digits, and so on. NOWHERE in this document is there
the slightest mention of TC/SC mappings. Isn't that a bit strange? If
the UTC were really driving the issue of TC/SC mapping, wouldn't they
have at least given it a brief mention in a "Character Foldings"
proposal?
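
For readers unfamiliar with the kinds of mappings listed above, here is a rough Python sketch of three of them. These are illustrative approximations built on the standard `unicodedata` module, not the definitions given in PDUTR #30; the function names are hypothetical.

```python
import unicodedata

def strip_diacritics(text):
    # Decompose to NFD, then drop nonspacing combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def katakana_to_hiragana(text):
    # Katakana U+30A1..U+30F6 sits exactly 0x60 above Hiragana U+3041..U+3096.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def fold_digits(text):
    # Replace any character with a decimal-digit value by its ASCII digit.
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)

print(strip_diacritics("r\u00e9sum\u00e9"))                 # resume
print(katakana_to_hiragana("\u30ab\u30bf\u30ab\u30ca"))     # Hiragana ka-ta-ka-na
print(fold_digits("\u0663\u096a"))                          # 34
```

Each of these is a context-free mapping that can be computed one character at a time, which is what makes it practical to tabulate as a folding.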

I am convinced that N2476 was written either as a complete spoof, by
someone who is not associated with the UTC at all, or by a UTC member who
did not consult with fellow members on content and position issues or on
mechanical issues. Either way, the UTC should find out what is going on,
because documents like this could seriously undermine the image of
stability and professionalism that the UTC has worked hard to earn.

-Doug Ewell
Fullerton, California
