From: mpsuzuki@hiroshima-u.ac.jp
Date: Fri Nov 02 2007 - 20:14:05 CST
Hi,
It may be too late to join the discussion about component-based
encoding for CJKV ideographs, which stopped a week ago, but similar
comments promoting component encoding as a good alternative for
supporting the huge CJKV character collection may be posted in the
future. I think there are two typical problems in component-based
encoding for CJKV ideographs, but, unfortunately, I've never seen a
proposal that takes precautions against them. If anybody knows of
one, please let me know.
1. information interchange of "unified" ideograph.
--------------------------------------------------
For some ideographs, IDS is too "descriptive" to identify
an ideograph whose shape varies within the range allowed by
ISO/IEC 10646 Annex S. The Unicode Standard 5.0, pp. 429-430,
explains that multiple IDSs can describe the same ideograph and
that there's no algorithm to check whether two IDSs describe the
same character. I think one of the important policies in Unicode
is that multiple expressions for a single character are not a good
idea. Thus, using a code point is better for unambiguous
information interchange.
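As a minimal illustration of this ambiguity, the Python sketch below
parses three IDS strings that could all describe 樹 (U+6A39). The
three strings are an illustrative assumption, not taken from the
Unicode Standard or from any IDS database, and the parser only knows
the twelve IDCs U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three
operands and the others two.

```python
import unicodedata

# Ideographic Description Characters U+2FF0..U+2FFB (assumed set).
# U+2FF2 and U+2FF3 take three operands; the others take two.
TERNARY = {"\u2ff2", "\u2ff3"}


def is_idc(ch):
    return "\u2ff0" <= ch <= "\u2ffb"


def parse_ids(s):
    """Parse a prefix-notation IDS into a nested tuple tree."""
    def parse(i):
        ch = s[i]
        if not is_idc(ch):
            return ch, i + 1
        arity = 3 if ch in TERNARY else 2
        children = []
        i += 1
        for _ in range(arity):
            child, i = parse(i)
            children.append(child)
        return (ch, *children), i

    tree, end = parse(0)
    if end != len(s):
        raise ValueError("trailing characters after IDS")
    return tree


# Three plausible descriptions of one ideograph (illustrative assumption).
candidates = [
    "⿰木尌",      # two parts, left and right
    "⿰木⿰壴寸",  # right part decomposed one step further
    "⿲木壴寸",    # three parts side by side
]

trees = [parse_ids(s) for s in candidates]
# Unicode normalization does not fold these sequences together either.
normalized = {unicodedata.normalize("NFKC", s) for s in candidates}
```

Neither string comparison, nor comparison of the parse trees, nor
Unicode normalization treats the three sequences as the same
character; a receiver would need extra knowledge about the
components to do so.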
For example, when the PRC, Taiwanese, Japanese, Korean and Vietnamese
instances in the ISO/IEC 10646 five-column charts of the following
characters are expressed by IDS, the expressions won't be the same:
U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
U+5840, U+58B7, etc.
If IDS is expected to be useful for information interchange,
these ideographs should not be over-decomposed. In Kawabata-san's
database, these characters have separate IDS expressions for the
individual instances in the ISO/IEC 10646 five-column charts.
As long as there's no standard to evaluate the equality of these
multiple IDS expressions, these characters should not be
over-decomposed. But the instances in ISO/IEC 10646 are not
a complete collection of unifiable ideographs. So, again, it's
difficult to list all characters whose IDS decomposition should
be restricted. I guess Kawabata-san wants people to learn the UCS
unification rules and refrain from over-differentiation of "new"
ideographs (e.g. "this character is not coded yet, I want to
display this character, I cannot find any existing font for it").
But I doubt that the educational approach can block such
requests.
2. the quality of dynamically composed ideograph.
-------------------------------------------------
John Nightly has already pointed out: "CJKV characters are not
formed based on a cartesian system". I agree, and it's important.
Some people may think IDS is sufficient to compose a CJK ideograph
dynamically: TrueType fonts support composite glyphs built with
simple affine transformations, so a font file can reduce its
content to the essential components only. Furthermore, if the
composition is implemented outside the TrueType rasterizer,
complex glyphs can be composed dynamically, the font file doesn't
have to include the composition rules at all, and users can compose
glyphs for all possible combinations.
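The composition idea can be sketched in a few lines of Python. This
is only a toy model under stated assumptions: the outline data, the
1000-unit em square, and the names `affine` and `compose_left_right`
are hypothetical, and no real font API is used. Each component is
scaled and translated into its half of the em box, in the manner of
a composite glyph.

```python
# Hypothetical em square size in font units.
EM = 1000


def affine(points, xx, yy, dx, dy):
    """Apply a scale-and-translate transform (no rotation or shear)."""
    return [(xx * x + dx, yy * y + dy) for x, y in points]


def compose_left_right(left, right):
    """Naive layout for a left-right IDS: each part gets half the em box."""
    left_half = affine(left, 0.5, 1.0, 0, 0)
    right_half = affine(right, 0.5, 1.0, EM / 2, 0)
    return left_half + right_half


# A stand-in outline: the bounding box of a full-width component.
box = [(50, 50), (950, 50), (950, 950), (50, 950)]
glyph = compose_left_right(box, box)
```

The fixed half-and-half split is exactly the Cartesian shortcut in
question: it squeezes every stroke horizontally by the same factor
and ignores how much room each component visually needs.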
It's a popular assumption, but the quality of dynamically composed
glyphs is quite questionable. In the Japanese case, the Wadalab
fonts were produced by this strategy (composite ideographs are
generated from component radicals). You can check the quality of
the original artwork in Ken Lunde's samples of CID-keyed fonts:
ftp://ftp.oreilly.com/pub/examples/nutshell/ujip/adobe/samples/
(the WadaXXX series are based on the Wadalab PS Type1 fonts).
Some people tried to improve the Wadalab fonts with extra glyph
variants and network-oriented systems (see http://fonts.jp/kage/),
but many people didn't use these systems and switched to
free-of-charge proprietary fonts when the Japanese information
promotion agency released them, because they felt most glyphs in
the Wadalab fonts were ugly. However, I'm not sure whether such a
negative evaluation of dynamically composed glyphs is general. If
somebody knows about the situation in other countries, please let
me know.
It's also possible to make an OpenType font whose cmap + GSUB
converts an IDS sequence to a precomposed glyph index (the glyph is
not directly accessible by a character code point), but this
strategy cannot break the 65535-glyph limit and does not shrink
the size of huge CJK fonts. I guess it's not what is expected by
the people who promote IDS to prevent the inflation of the CJK
Unified Ideograph blocks.
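A simplified Python model of this cmap + GSUB strategy is sketched
below. The glyph IDs, the `cmap` and `ligatures` tables, and the
longest-match rule are all hypothetical simplifications, not the
behaviour of any particular shaping engine. The sketch only shows
that every precomposed result still occupies its own glyph ID, and
that glyph IDs are stored as unsigned 16-bit values.

```python
import struct

# Hypothetical cmap: each IDC and component maps to its own glyph ID.
cmap = {"⿰": 1, "木": 2, "尌": 3}
# Hypothetical ligature lookup: glyph ID sequence -> precomposed glyph ID.
ligatures = {(1, 2, 3): 4}


def shape(text):
    """Map characters to glyph IDs, then apply the longest matching ligature."""
    gids = [cmap[ch] for ch in text]
    out, i = [], 0
    while i < len(gids):
        for n in range(len(gids) - i, 1, -1):
            lig = ligatures.get(tuple(gids[i:i + n]))
            if lig is not None:
                out.append(lig)
                i += n
                break
        else:
            # No ligature starts here; keep the glyph as it is.
            out.append(gids[i])
            i += 1
    return out


def pack_glyph_id(gid):
    """Store a glyph ID as a big-endian unsigned 16-bit integer."""
    return struct.pack(">H", gid)
```

Here `shape("⿰木尌")` yields the single precomposed glyph ID, while
`pack_glyph_id(65536)` raises `struct.error`, so the number of
precomposed glyphs in one font stays bounded however many IDS
sequences exist.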
Regards,
mpsuzuki
On Mon, 29 Oct 2007 16:36:41 -0500
vunzndi@vfemail.net wrote:
>I assume here by current approach you mean Wenlin's CDL, which is
>based on cartesian co-ordinates. This is good for font making but bad
>for a component based model. As you say the CDL is limited because it
>gives just one representation of a character. CJKV characters are not
>formed based on a cartesian system, the component based model should
>be based on the way characters are formed, these concepts are more
>topological than cartesian.