L2/02-044

Title: Mapping of Compatibility Ideographs
Authors: Ken Whistler & Martin Dürst
Date: 2002/01/25

Martin said:

> Dear Unicode Experts,
> 
> I very much think the following should be considered very seriously
> again, and most probably changed:
> 
> In Unicode 3.2, there are 59 new compatibility ideographs at
> U+FA30-FA6A. As far as I understand, they (or most of them) are
> from the set of variants of the Japanese Ministry of Justice.
> 
> All of them have a *canonical* mapping, which means that according
> to the Unicode Standard, nobody can expect them to be preserved.
> I propose that this be changed into a compatibility mapping.
> This is in particular relevant for the Web, where the tendency
> is to use NFC as much as possible.
> 
> This would be much more in line with how similar differences
> are handled in other scripts.
> 

Aiy, yai yai!

First of all, this is clearly not a *bug* in the BETA files, per
se, which have the mappings as intended (and also as listed in
the Source references for CJK Compatibility Ideographs in 10646).

It is, however, a debatable position, and one that the UTC will
need to consider and decide on.

My own take on this is that making such a change would introduce
yet another, inconsistent class of CJK Compatibility characters
into the standard. The CJK Compatibility characters at F900..FA2D
all have canonical mappings (and that cannot be changed at this
point) -- except for the 12 which are actually unified ideographs.
For the majority of those, no one really cares -- the KS C 5601-1987
compatibility duplicates and the Big 5 duplicates are just duplicates,
and don't carry true variant distinctions. However, among the
remainder of the IBM 32, there are specific variants that fall
within the kinds of variations ordinarily unified in the big
list of unified ideographs, but which were pulled out here
separately for roundtripping to IBM code pages. (And there are
500+ CNS compatibility characters already standardized for
Unicode 3.1, all of which have canonical mappings, and many of
which may have distinct variant implications for CNS as well.)
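The claim about F900..FA2D can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module, which reflects whatever version of the Unicode Character Database ships with the interpreter rather than the 3.2 BETA files discussed here. The function names are illustrative only.

```python
import unicodedata


def unified_in_compat_block(first=0xF900, last=0xFA2D):
    """Return the code points in [first, last] with no decomposition mapping.

    In this block those are the characters that are really unified
    ideographs rather than compatibility duplicates.
    """
    return [cp for cp in range(first, last + 1)
            if unicodedata.decomposition(chr(cp)) == ""]


def has_canonical_mapping(cp):
    """True if cp has a decomposition that carries no <tag>, i.e. a canonical one."""
    mapping = unicodedata.decomposition(chr(cp))
    return mapping != "" and not mapping.startswith("<")


if __name__ == "__main__":
    unified = unified_in_compat_block()
    print(len(unified), [f"U+{cp:04X}" for cp in unified])
```

Everything in the range that is not reported by unified_in_compat_block() should satisfy has_canonical_mapping(), which is the consistency argued for above.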

Importantly, among the IBM 32 are some of the *same* systematic
kinds of variations important to the 59 new compatibility
ideographs at U+FA30-FA6A. Cf. the Unicode 3.0 characters
FA18..FA1A with the new Unicode 3.2 characters FA4D..FA54.
These show *exactly* the same variation in the radical form,
and are maintained in the Japanese Ministry of Justice list
for exactly the same traditionalist reasons. If we introduce
an inconsistency between the way FA18..FA1A behave under normalization
and FA4D..FA54, how are we going to explain why some are preserved
and others are not? Incidentally, the 3 among the IBM 32 contain
*the* most problematical of the bunch, FA19 "kami", which probably
has more traditionalist associations in Japan than just about
any other character!
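To make the comparison concrete, here is a small sketch of what NFC does to the two groups today. It assumes Python's standard unicodedata module and the canonical mappings as currently published; nfc_report is a hypothetical helper name, not part of any proposal.

```python
import unicodedata

IBM_3 = range(0xFA18, 0xFA1B)   # FA18..FA1A, in Unicode since before 3.2
NEW_8 = range(0xFA4D, 0xFA55)   # FA4D..FA54, new in Unicode 3.2


def nfc_report(code_points):
    """For each code point, return (source, NFC result, was it replaced?)."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        nfc = unicodedata.normalize("NFC", ch)
        # Each of these characters has a single-character canonical
        # mapping, so the NFC result is always one code point.
        rows.append((cp, ord(nfc), nfc != ch))
    return rows


for cp, target, replaced in nfc_report([*IBM_3, *NEW_8]):
    print(f"U+{cp:04X} -> U+{target:04X} (replaced: {replaced})")
```

With canonical mappings on both groups, every row reports a replacement, FA19 included. Switching only FA4D..FA54 to compatibility mappings would leave those eight intact under NFC while FA18..FA1A continue to be replaced.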

In short, I don't see how introducing an inconsistency in the
way CJK Compatibility character mapping is done, just for this
new set of 59 characters from JIS X 0213, will generically
solve the problem of how to maintain *particular* glyph form
distinctions on web pages using normalized Unicode data.