From: Eric Muller (emuller@adobe.com)
Date: Tue Mar 20 2007 - 23:55:09 CST
mpsuzuki@hiroshima-u.ac.jp wrote:
> Comment 2: codepoints in CJK Compatibility Ideographs
> =====================================================
>
>
> I guess, the avoided
> codepoints are just "the out of scope" of IVD Adobe-Japan1
> (in fact, Unicode Technical Report #37 is written for CJK
> Unified Ideographs, no mention about CJK Compatibility
> Ideographs), and IVD Adobe-Japan1 does not concern the
> availability of ideographs at the avoided codepoints.
>
The explicit requirement in UTS37 that the first code point of an IVS be a
character with the Unified_Ideograph property (as opposed to "being
an ideograph", e.g. in one of the CJK ideograph blocks) is of course a
deliberate choice. The fact is that the compatibility ideographs (those
with a canonical decomposition, i.e. not including U+FA0E and its eleven
friends) are awkward, and barely serve their purpose.
On the one hand, they were introduced in Unicode to facilitate
round-tripping with other standards. For example, JIS X 0208 + JIS X
0213 encodes both 41-78 and 1-14-24, and having correspondingly U+4FAE
and U+FA30 in Unicode means that the distinction established by JIS can
a priori be preserved when going through Unicode.
On the other hand, U+4FAE and U+FA30 have been made canonically
equivalent in Unicode. This is a priori a good choice, because those are
the same abstract character from Unicode's point of view (imagine they
were encoded neither in JIS nor in Unicode, and you came to the IRG today
proposing to encode those two characters: you would get only one coded
character).
However, the canonical equivalence fundamentally negates the
round-tripping goal. Or more precisely: you can effectively round-trip
if and only if normalization is not applied to the Unicode data. With
today's larger and larger text and document processing systems, the
likelihood that none of the components will perform normalization is
getting lower and lower. So the effectiveness of the compatibility
ideographs is dubious at best.
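To make the point concrete, here is a minimal sketch using Python's standard unicodedata module. The variable names are mine and purely illustrative; the only facts relied on are that U+FA30 has a canonical decomposition to U+4FAE and that every normalization form applies it.

```python
import unicodedata

UNIFIED = "\u4fae"  # U+4FAE, the counterpart of JIS 41-78
COMPAT = "\ufa30"   # U+FA30, the counterpart of JIS 1-14-24

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Normalize the compatibility ideograph under each of the four forms.
# Its decomposition is canonical (a singleton), so even NFC replaces it.
collapsed = {form: unicodedata.normalize(form, COMPAT) for form in FORMS}

for form, result in collapsed.items():
    print(form, "U+%04X" % ord(result))
```

Once any component has done this, the two code points can no longer be told apart, so the JIS distinction is gone.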
In the IVD world, we can have our cake and eat it too: we can represent
the difference between 41-78 and 1-14-24 by having two sequences based
on U+4FAE. Those two sequences are not canonically equivalent so we are
fine on that front; and the ignorable nature of the variation selectors
means that we recognize the fundamental equivalence (from a pure Unicode
point of view) of 41-78 and 1-14-24. Thus there is no need to define
sequences using the compatibility ideographs, and we avoid the problems
of normalization.
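A small sketch of both properties, again with Python's unicodedata module. The helper names is_stable and strip_variation_selectors are hypothetical, and the stripping step is only my stand-in for "a process that ignores variation selectors"; the range U+E0100..U+E01EF is the Variation Selectors Supplement block.

```python
import unicodedata

BASE = "\u4fae"
IVS_A = BASE + "\U000e0100"  # <4FAE, E0100>
IVS_B = BASE + "\U000e0101"  # <4FAE, E0101>

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def is_stable(text):
    """True if no normalization form changes the text."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


def strip_variation_selectors(text):
    """Drop U+E0100..U+E01EF, as a process ignoring them effectively does."""
    return "".join(ch for ch in text if not 0xE0100 <= ord(ch) <= 0xE01EF)


print(is_stable(IVS_A), is_stable(IVS_B))
print(strip_variation_selectors(IVS_A) == strip_variation_selectors(IVS_B))
```

The two sequences stay distinct under normalization, yet both reduce to plain U+4FAE for any process that does not care about the variant.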
In fact, I would guess that if we had had the variation selectors
mechanism in place from the start, this mechanism would have been used
and the compatibility ideographs would not have been encoded.
> However, if we use IVD Adobe-Japan1 in ToUnicode mapping
> tables in PDF using Adobe-Japan1 CID font, it can cause
> a round-trip issue. For example, if I make a PDF from
> JIS X 0213 text, with Adobe CID font, and insert ToUnicode
> mapping tables including IVS of IVD Adobe-Japan1,
> the receiver of PDF file can retrieve JIS X 0208 (and/or
> 0212) text from the PDF, but cannot retrieve original JIS
> X 0213 text.
>
Start with the sequence of JIS code points <41-78, 1-14-24>. Turn that
into a PDF using an AJ1 CID font; the PDF contains CIDs 3552 and 13382
(and no direct trace of the JIS code points). Use the registered
sequences <4FAE, E0100> and <4FAE, E0101> in the ToUnicode map. If you
want to go to JIS, turn <4FAE, E0100> into 41-78, and <4FAE, E0101> into
1-14-24.
Compare with the current scenario: the ToUnicode map contains <4FAE> and
<FA30>; any normalization on that reduces both to <4FAE>, and certainly
you cannot recover your original JIS code points.
Granted, you need new mappings from Unicode to JIS, but they are immune
to normalization problems.
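The two scenarios can be compared side by side in a rough sketch. Everything here is illustrative: the dictionaries are toy stand-ins for the real CMap, ToUnicode and Unicode-to-JIS tables, the pairing of each CID with a sequence simply follows the order in which they are listed above, and normalization is applied per glyph rather than to a whole text run.

```python
import unicodedata

# Toy tables covering only the two code points discussed in this message.
JIS_TO_CID = {"41-78": 3552, "1-14-24": 13382}

# Proposed ToUnicode entries: registered sequences on U+4FAE.
CID_TO_IVS = {3552: "\u4fae\U000e0100", 13382: "\u4fae\U000e0101"}
IVS_TO_JIS = {"\u4fae\U000e0100": "41-78", "\u4fae\U000e0101": "1-14-24"}

# Current ToUnicode entries: unified and compatibility ideographs.
CID_TO_PLAIN = {3552: "\u4fae", 13382: "\ufa30"}
PLAIN_TO_JIS = {"\u4fae": "41-78", "\ufa30": "1-14-24"}


def round_trip(jis_codes, cid_to_unicode, unicode_to_jis, normalize=True):
    """JIS -> CID -> Unicode (optionally normalized to NFC) -> JIS."""
    recovered = []
    for code in jis_codes:
        text = cid_to_unicode[JIS_TO_CID[code]]
        if normalize:
            text = unicodedata.normalize("NFC", text)
        recovered.append(unicode_to_jis[text])
    return recovered


original = ["41-78", "1-14-24"]
print(round_trip(original, CID_TO_IVS, IVS_TO_JIS))
print(round_trip(original, CID_TO_PLAIN, PLAIN_TO_JIS))
```

With the sequences, the original pair comes back intact; with <4FAE> and <FA30>, normalization turns both entries into 41-78.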
Eric.