Re: Japan opposes any proposals with UNICODE

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 01 2000 - 15:40:11 EDT

Next message: Peter Constable: "Re:Canonical ordering"
Previous message: Magda Danish (Unicode): "FW: EUC <-> UTF translators"
Maybe in reply to: NAOI Yasushi: "Re: Japan opposes any proposals with UNICODE"
Next in thread: NAOI Yasushi: "Re: Japan opposes any proposals with UNICODE"

Naoi-san responded in this thread:

> At 3:44 PM +0900 00.4.30, Martin J. Duerst wrote:
>
> >>I can't understand what the term "the same abstract shape" means.

The text of the standard, pp. 262..265, attempts to clarify what is
meant by "abstract shape" and "actual shape" when applied to the
unification of Han characters. Admittedly, there are gray areas when
dealing with variant shapes of Han characters -- but that is precisely
why the IRG is constituted to sort out the edge cases and make determinations
that we can rely on for the standard.

> >>If six
> >>characters in Figure 10-4 "would normally be subject to unification" then:
> >>
> >>o Do U+50C9 and U+91D1 have the same abstract shape?

No.

> >>o Do U+5202 and U+5204 have the same abstract shape?

No.

> >>
> >>I can't believe that. I agree these six Kanji characters are cognate,

This is debatable for the last character in the list of 6.

For those following along, the six characters of Figure 10-4, page 264,
are encoded, respectively, at:

5263 528D 528E 5292 5294 91F0

The first five all have the "knife" radical (cf. U+2F11) and are
pronounced jian4, meaning "sword, saber". The sixth one has the
"gold/metal" radical (cf. U+2FA6), and according to the only source
I have that lists it, is pronounced ri4, meaning "blunt". It is
conceivable, however, that in the larger dictionaries used by the IRG
U+91F0 is also listed as a (mistaken) variant of U+528D.
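
For readers who want to look at the characters themselves, here is a
minimal Python sketch that prints the six code points listed above and
the two Kangxi radicals cited. It relies only on the standard
unicodedata module. The helper names (describe, radical_ideograph) are
made up for illustration. Note that NFKC maps the knife radical U+2F11
to U+5200, the full form of "knife", not to U+5202, the side form
discussed in this thread.

```python
import unicodedata

# The six code points of Figure 10-4, as listed in the message.
FIGURE_10_4 = [0x5263, 0x528D, 0x528E, 0x5292, 0x5294, 0x91F0]

# The Kangxi radical code points cited in the message.
RADICALS = {"knife": 0x2F11, "gold/metal": 0x2FA6}


def describe(cp):
    """Return 'U+XXXX <char> <character name>' for one code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {ch} {unicodedata.name(ch, '<unnamed>')}"


def radical_ideograph(cp):
    """Fold a Kangxi radical to its unified ideograph via NFKC."""
    return ord(unicodedata.normalize("NFKC", chr(cp)))


if __name__ == "__main__":
    for cp in FIGURE_10_4:
        print(describe(cp))
    for label, cp in RADICALS.items():
        folded = radical_ideograph(cp)
        print(f"{label}: {describe(cp)} -> U+{folded:04X} {chr(folded)}")
```

Whether the glyphs display distinctly depends on the fonts installed;
the code points themselves are what the encoding distinguishes.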

> >
> >>(U+5202 and U+5204) are not cognate and
> >>not similar at all.
> >
> > I'm not familiar enough with them, and don't have
> > references handy.
>
> U+5202 and U+5204 traditionally differ from each other in both meaning and
> pronunciation. The evidence can be seen in KangXi dictionary, Morohashi
> dictionary, and Hanyu Da Zidian, which are standard IRG dictionaries.

U+5202 dao1 "knife" and U+5204 ren4 "blade" are not cognate in the
relevant sense used by the IRG for determination of unification of
characters. And on that basis alone, they would be separated in
the encoding (and are).

However, in a deeper historical sense, the two characters are clearly
related to each other etymologically -- an example of the slippery
problems the IRG has to grapple with. Most importantly, the knife
*radical* as a component of characters quite commonly shows variation
between four forms (at least), as illustrated by the right-hand
side (the radical portion) of 528D, 528E, 5292, and 5294. So, while
as independent ideographs, 5202 and 5204 *must* be distinguished, in
most instances for use as radicals for other ideographs, the variation
in form (and stroke count) is *not* taken as sufficient evidence of
distinctness of characters. This is also easy to verify with
dictionaries, even for the characters in question (528D, 528E,
5292, and 5294). Ci4hai3, for example, lists 5292 explicitly as
a variant glyph for 528D, which is used for the head entry.

>
> > At least the first case seems to suggest that of the six Kanji
> > in Fig. 10-4, the last one might be removed. There are other
> > rules that would prevent its unification, too.
>
> That's for sure.

I concur that 91F0 is problematical in this list, since even
by the cognate rule and the radical identity rule it should be
distinguished. Furthermore, it is arguably in a variant pair
relationship with 91FC, which is not shown in this list.

On the other hand, 5251 clearly *should* be added to this list,
since it is the GB simplified form of the same jian4 "sword"
character.

So I would suggest emendation of the exemplary list in Figure 10-4
to:

5251 5263 528D 528E 5292 5294

Where 5251 is the GB simplified form (most commonly seen in
dictionaries in the PRC); 5263 is the traditional Japanese
simplified form; 528D is the traditional Chinese form
(most commonly seen in dictionaries in Taiwan and Hong Kong);
and 528E, 5292, and 5294 are glyphic variants of 528D.

The source separation rules required distinguishing all 6 of
these, even though a principled unification which did not have
to live with legacy encodings surely would have unified
(528D 528E 5292 5294) into a single character.
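
As an illustration only, an application that wanted to treat the three
glyphic variants as equivalent to 528D (say, for searching) could fold
them itself. The sketch below is a hypothetical application-level
table built from the grouping suggested in this message; it is not a
mapping defined by the standard, and the names HEAD_FORM,
GLYPH_VARIANTS, and fold are invented here.

```python
# Hypothetical folding table based on the grouping in this message:
# 528E, 5292, and 5294 treated as glyphic variants of 528D.
HEAD_FORM = 0x528D
GLYPH_VARIANTS = {0x528E, 0x5292, 0x5294}


def fold(text):
    """Replace each listed glyphic variant with the head form 528D.

    5251 (GB simplified) and 5263 (Japanese simplified) are left
    alone, since the message describes them as distinct forms rather
    than glyphic variants of 528D.
    """
    return "".join(
        chr(HEAD_FORM) if ord(ch) in GLYPH_VARIANTS else ch
        for ch in text
    )


if __name__ == "__main__":
    sample = "".join(chr(cp) for cp in (0x5251, 0x5263, 0x528E, 0x5292))
    print([f"U+{ord(ch):04X}" for ch in fold(sample)])
```

A fold of this kind loses information, so it would normally be applied
to a search key and not to the stored text.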

>
> > But please note that if two characters A and B differ
> > only in components C and D, and C and D are considered
> > non-cognate or different in abstract shape, this doesn't
> > automatically mean that A and B are considered to be
> > different in abstract shape.
> >
> > There are quite some examples where a difference in a simple
> > character is important, but if that appears as a component,
> > the difference becomes less relevant. The most famous case
> > (usually explained as non-cognate, not as a difference
> > in abstract shape) is U+571F vs. U+58EB.
>
> We might think that U+571F and U+58EB have the same abstract shape
> (since they have quite similar shape), as you pointed out. On the
> contrary, U+5202 and U+5204 are not only non-cognate but also quite
> different in their shape. So, what you've mentioned is questionable for
> me. I think if components C and D are considered non-cognate AND different
> in abstract shape then Kanji A and B might be automatically considered to
> be different in abstract shape.

Martin is correct about this. As noted above, the difference
between 5202 dao1 and 5204 ren4 is significant for the independent
ideographs, but is neutralized when these forms appear as the
radical of other characters.

So I think the assessment should be that A and B under these circumstances
would be considered for distinction, but definitely not be automatically
separated in the encoding. That would depend on detailed determination
of how the traditional dictionaries and other sources treat the
variation in question.

Note also that all of these determinations have *already* been made
and standardized for the URO (4E00..9FA5) and Vertical Extension A (3400..4DB5),
and have also been completed by the IRG (and are undergoing the second round of
balloting) for Vertical Extension B for Plane 2. So while it is possible to
argue that the IRG made a mistake here or there on individual characters,
we live with the resulting decisions about unification or disunification
in particular instances, as with all other Asian character encoding
standards, including those published by JIS, and get on with the
implementations.
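
To make the ranges concrete, here is a small Python sketch that
reports which of the two ranges quoted above a code point falls in.
The bounds are exactly those given in this message; later versions of
the standard assign further code points, so this is not a current
block table. The names RANGES and range_of are invented for the
example.

```python
# Ranges exactly as quoted in the message (inclusive bounds).
RANGES = [
    ("URO", 0x4E00, 0x9FA5),
    ("Vertical Extension A", 0x3400, 0x4DB5),
]


def range_of(cp):
    """Return the label of the quoted range containing cp, else None."""
    for label, first, last in RANGES:
        if first <= cp <= last:
            return label
    return None


if __name__ == "__main__":
    for cp in (0x5202, 0x5204, 0x528D, 0x91F0, 0x3400, 0x20000):
        print(f"U+{cp:04X}: {range_of(cp)}")
```

All of the ideographs discussed in this thread fall inside the URO
range.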

> Further, I think that the meaning of "the same abstract shape" is very
> ambiguous and arbitrary. For example, I can't understand the reason why
> U+6649 and U+664B are treated as the components that have the same
> abstract shape, while U+5939 and U+593E are treated as the components that
> are different in abstract shape in The Unicode Standard.

Without citations of the characters using these as components, it is
difficult to provide an argument in detail for these 4. The difference
in treatment may result either from differences in traditional
lexical treatment of character variants in classical dictionaries,
or it may be an artifact of source separation in the URO.

--Ken Whistler

>
> --
> NAOI Yasushi
> Glamour Profession, Inc.
>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT