From: mpsuzuki@hiroshima-u.ac.jp
Date: Sun Mar 25 2007 - 19:15:14 CST
Dear Sir,
Sorry for the long delay in replying. I would like to
ask whether UTS #37 could be updated with respect to
CJK Compatibility Ideographs: an additional note in
UTS #37 stating that the IVS mechanism has nothing
to do with CJK Compatibility Ideographs, and
prohibiting conversion between compatibility
ideographs and IVS-qualified unified ideographs.
# This separates <U+6674, U+E0100> from <U+FA12> etc.,
# although both ideograph variants are displayed
# by CID+8481. CID+19071 is already used twice,
# at U+29FCE and U+29FD7, so I guess such a
# separation is not a fatal change.
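To make the two representations concrete, here is a minimal
sketch using only Python's standard unicodedata module. It
only lists the code points involved; the CID mapping itself
is not modelled, and the helper name describe() is just for
illustration.

```python
import unicodedata

# U+FA12 is a CJK Compatibility Ideograph (one code point).
compat = "\uFA12"
# An IVS is a unified ideograph followed by a variation selector.
# U+E0100 is VARIATION SELECTOR-17, the first selector in the
# supplementary range.
ivs = "\u6674\U000E0100"


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Canonical decomposition recorded in the Unicode Character Database
# for the compatibility ideograph (a hex string naming the target).
compat_decomposition = unicodedata.decomposition(compat)

if __name__ == "__main__":
    print(describe(compat))
    print(describe(ivs))
    print("U+FA12 decomposes to:", compat_decomposition)
```

The compatibility ideograph is a single code point with a
canonical decomposition, while the IVS is a two-code-point
sequence whose base is that decomposition target.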
I have therefore added Hideki Hiura, another
author of UTS #37, to the Cc list.
On Tue, 20 Mar 2007 22:55:09 -0700
Eric Muller <emuller@adobe.com> wrote:
> mpsuzuki@hiroshima-u.ac.jp wrote:
> > Comment 2: codepoints in CJK Compatibility Ideographs
> > =====================================================
> > I guess, the avoided
> > codepoints are just "the out of scope" of IVD Adobe-Japan1
> > (in fact, Unicode Technical Report #37 is written for CJK
> > Unified Ideographs, no mention about CJK Compatibility
> > Ideographs), and IVD Adobe-Japan1 does not concern the
> > availability of ideographs at the avoided codepoints.
> On the other hand, U+4FAE and U+FA30 have been made canonically
> equivalent in Unicode. This is a priori a good choice, because those are
> the same abstract character from Unicode's point of view (imagine they
> are not encoded in JIS nor in Unicode, and you come to the IRG today
> proposing to encode those two characters: you would get only one coded
> character).
I see; your reply is similar to what I expected while
I was writing my comments. Your point is RIGHT
in principle. There is a group of CJK Compatibility
Ideographs that differ from the corresponding CJK
Unified Ideographs only in small details of glyph shape.
The IBM kanji and the JIS X 0213 compatibility kanji are
such. Although it is arguable whether Han Unification
(ISO 10646 Annex S) is a well-defined rule or not, both
specifications, ISO 10646 and Unicode, seem to be against
adding technical standards that utilize compatibility
ideographs.
> In fact, I would guess that if we had had the variation selectors
> mechanism in place from the start, this mechanism would have been used
> and the compatibility ideographs would not have been encoded.
I AGREE. If the VS mechanism had existed from the start,
Han Unification could have been more systematic, and the
exceptional characters for source code separation could
have been eliminated.
> However, the canonical equivalence fundamentally negates the
> round-tripping goal. Or more precisely: you can effectively round-trip
> if and only if normalization is not applied to the Unicode data. With
> today's larger and larger text and document processing systems, the
> likelihood that none of the components will perform normalization is
> getting lower and lower. So the effectiveness of the compatibility
> ideographs is dubious at best.
I'm sure that the Adobe staff are far more familiar with
this than I am, but please let me write in detail to
explain my interest. One of the reasons why I'm sticking
to CJK Compatibility Ideographs is the clear statement of
supported charset coverage.
The following are clear statements:
* only JIS X 0208-19xx kanji are supported
* only JIS X 0208-19xx + JIS X 0212-1990 kanji are supported
* Microsoft codepage 932 kanji are supported (slightly unclear?)
The following are NOT clear statements:
* Microsoft codepage 932 kanji are supported,
  except for CJK Compatibility Ideographs
* JIS X 0213:20xx kanji are supported,
  except for CJK Compatibility Ideographs
* JIS X 0213:20xx kanji are supported,
  except for CJK Unified Ideographs Extension B
There's no 7- or 8-bit encoding method for JIS X 0213
that is interoperable with IBM or Microsoft codepage 932,
and there are no popular legacy encodings for JIS X 0213
(even if we restrict the scope to JIS X 0213 level 3)
that are widely used for information interchange in Japan.
The most popular encoding for interchanging the JIS X 0213
charset would be Unicode (including CJK Compatibility
Ideographs). So I think the seamless handling of CJK
Compatibility Ideographs is important for supporting
JIS X 0213.
If we cannot guarantee the round-trip conversion of the
CJK Compatibility Ideographs, whose Unicode representations
differ between IVS-unaware and IVS-aware systems,
we have to state the supported coverage of software
as "JIS X 0213 without CJK Compatibility Ideographs".
That is not a clear statement.
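The round-trip problem can be checked with a small sketch,
again assuming only Python's standard unicodedata module.
The helper survives_normalization() is a name made up for
this example, and the sequence <U+4FAE, U+E0100> is used
purely as an illustrative IVS; whether it is registered in
the IVD is not checked here.

```python
import unicodedata

# The four Unicode normalization forms.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives_normalization(text):
    """Map each normalization form to True if text is left unchanged."""
    return {form: unicodedata.normalize(form, text) == text for form in FORMS}


# U+FA30 is canonically equivalent to U+4FAE, so every form replaces it.
compat = "\uFA30"
# Illustrative IVS (assumption): base ideograph plus VARIATION SELECTOR-17.
ivs = "\u4FAE\U000E0100"

compat_result = survives_normalization(compat)
ivs_result = survives_normalization(ivs)

if __name__ == "__main__":
    print("U+FA30:", compat_result)
    print("<U+4FAE, U+E0100>:", ivs_result)
```

Under every normalization form U+FA30 is replaced by
U+4FAE, so a converter cannot recover it afterwards, while
the two-code-point sequence passes through unchanged.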
In my previous post, I mentioned NFD: normalization
to the JIS X 0208 + 0212 coverage, which may clarify
the coverage of supported codepoints. But I have checked
the list of characters that would have to be normalized,
and I have reconsidered. Such normalization would be
hard work.
# The JIS X 0213 compatibility ideographs, 81 kanji, are
# only a small part of the new kanji in JIS X 0213.
# The largest group is 396 kanji in CJK Unified Ideographs,
# the second largest is 303 kanji in CJK Unified Ideographs Ext. B,
# and the rest is 80 kanji in CJK Unified Ideographs Ext. A.
# Normalizing so many CJK Unified Ideographs may be a
# high-handed approach, and the normalization rule may be
# quite ad hoc and unintuitive.
Another fix might be to separate the CIDs for CJK Unified
Ideographs and CJK Compatibility Ideographs, even where
their forms are exactly the same. But that would cause
another issue: some kanji CIDs of Adobe-Japan1-6 would be
unavailable in IVS. That is another unclear statement of
glyph set coverage.
As neither fix is realistic, I wish UTS #37 were updated
with an additional note prohibiting (not deprecating) the
codepoint conversion from CJK Compatibility Ideographs to
CJK Unified Ideographs with IVS. What do you think?
Regards,
mpsuzuki