RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: mpsuzuki@hiroshima-u.ac.jp
Date: Fri Nov 02 2007 - 20:14:05 CST

    Hi,

    It may be too late to join the discussion about component-based
    encoding for CJKV ideographs, which stopped a week ago, but similar
    comments promoting component encoding as a good alternative for
    supporting the huge CJKV character collection may be posted in the
    future. I think there are 2 typical problems in component-based
    encoding for CJKV ideographs, but, unfortunately, I have never seen
    a proposal that takes precautions against them. If anybody knows of
    one, please let me know.

    1. information interchange of "unified" ideograph.
    --------------------------------------------------
       For some ideographs, an IDS (Ideographic Description Sequence)
       is too "descriptive" to identify an ideograph whose shape varies
       within the unification allowed by ISO/IEC 10646 Annex S.
       The Unicode Standard 5.0, pp. 429-430, explains that multiple
       IDSs can describe a single ideograph and that there is no
       algorithm to check the equivalence of the characters described
       by 2 IDSs. I think one of the important policies in Unicode is
       that multiple expressions for a single character are not a good
       idea. Thus, using a code point is better for unambiguous
       information interchange.
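
       To make the equivalence problem concrete, here is a minimal
       Python sketch. The three decompositions of U+6E58 are chosen
       only for illustration (they are not taken from the standard or
       from any IDS database), and naive_equal is a hypothetical
       helper. It shows that plain text processing, even with NFC
       normalization, treats the three sequences as different strings:

```python
import unicodedata

# Three IDSs that can all be read as describing U+6E58.
# Illustrative decompositions only, not from a normative source.
CANDIDATES = [
    "\u2ff0\u6c35\u76f8",              # left-right: U+6C35 + U+76F8
    "\u2ff0\u6c35\u2ff0\u6728\u76ee",  # U+76F8 further split into U+6728 + U+76EE
    "\u2ff2\u6c35\u6728\u76ee",        # left-middle-right with three components
]


def naive_equal(a: str, b: str) -> bool:
    """Compare two IDSs the way generic text software would:
    normalize to NFC, then compare code point by code point."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Normalization does not rewrite IDSs, so all three stay distinct.
distinct = {unicodedata.normalize("NFC", s) for s in CANDIDATES}
print(len(distinct), "distinct strings for one intended ideograph")
```

       A receiver comparing such strings cannot tell that the sender
       meant the same character, which is the ambiguity a single code
       point avoids.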

       For example, when the PRC, Taiwanese, Japanese, Korean and
       Vietnamese instances shown in the ISO/IEC 10646 five-column
       charts for the following characters are each expressed by an
       IDS, the expressions won't be the same:
       U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
       U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
       U+5840, U+58B7, etc.

       If IDS is expected to be useful for information interchange,
       these ideographs should not be over-decomposed. In Kawabata-san's
       database, these characters have multiple IDS expressions,
       corresponding to their instances in the ISO/IEC 10646
       five-column charts. As long as there is no standard way to
       evaluate the equality of these multiple IDS expressions, these
       characters should not be over-decomposed. But the instances in
       ISO/IEC 10646 are not a complete collection of unifiable
       ideographs. So, again, it is difficult to list all the characters
       whose IDS decomposition should be restricted. I guess
       Kawabata-san wants people to learn the UCS unification rules and
       refrain from over-differentiating "new" ideographs (e.g. "this
       character is not coded yet, I want to display this character,
       I cannot find existing fonts for it"). But I doubt that an
       educational approach can block such requests.

    2. the quality of dynamically composed ideograph.
    -------------------------------------------------
       John Nightly has already pointed out that "CJKV characters are
       not formed based on a cartesian system". I agree, and it is an
       important point.

       Some people may think an IDS is sufficient to compose a CJK
       ideograph dynamically: the glyph description of TrueType fonts
       supports composite glyphs with simple affine transformations, so
       a font file can reduce its content to the essential components
       only. Furthermore, if the composition is implemented outside the
       TrueType rasterizer, complex glyphs can be composed dynamically,
       the font file doesn't have to include the composition rules at
       all, and users can compose glyphs for all possible combinations.
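
       The kind of composition described above can be sketched in a few
       lines of Python. This is only an illustration under simple
       assumptions: components are drawn in a unit em square, the
       left-right operator gives each component exactly half of the
       square, and the two stroke outlines and all function names are
       hypothetical. It also shows one side effect of a plain affine
       scale, namely that stroke thickness is scaled together with the
       component:

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def place(outline: List[Point], box: Box) -> List[Point]:
    """Scale and translate an outline from the unit square into box."""
    x0, y0, x1, y1 = box
    sx, sy = x1 - x0, y1 - y0
    return [(x0 + sx * x, y0 + sy * y) for x, y in outline]


def compose_left_right(left: List[Point], right: List[Point]) -> List[List[Point]]:
    """Naive layout for a left-right composition: half the em square each."""
    return [place(left, (0.0, 0.0, 0.5, 1.0)),
            place(right, (0.5, 0.0, 1.0, 1.0))]


def width(outline: List[Point]) -> float:
    xs = [x for x, _ in outline]
    return max(xs) - min(xs)


def height(outline: List[Point]) -> float:
    ys = [y for _, y in outline]
    return max(ys) - min(ys)


# Hypothetical components: one vertical and one horizontal stroke,
# both 0.1 em thick in their own unit square.
VERTICAL_STROKE = [(0.45, 0.0), (0.55, 0.0), (0.55, 1.0), (0.45, 1.0)]
HORIZONTAL_STROKE = [(0.0, 0.45), (1.0, 0.45), (1.0, 0.55), (0.0, 0.55)]

glyph = compose_left_right(VERTICAL_STROKE, HORIZONTAL_STROKE)
v_thickness = width(glyph[0])   # vertical stroke squeezed horizontally
h_thickness = height(glyph[1])  # horizontal stroke keeps its thickness
print(round(v_thickness, 3), round(h_thickness, 3))
```

       In this sketch the vertical stroke ends up half as thick as the
       horizontal one, although both components were drawn with the
       same weight.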

       It's a popular assumption, but the quality of dynamically
       composed glyphs is quite doubtful. In the Japanese case, the
       Wadalab fonts were produced by this strategy (composite
       ideographs are generated from component radicals). You can check
       the quality of the original artwork in Ken Lunde's samples of
       CID-keyed fonts:
       ftp://ftp.oreilly.com/pub/examples/nutshell/ujip/adobe/samples/
       (the WadaXXX series is based on the Wadalab PS Type1 fonts).
       Some people tried to improve the Wadalab fonts with extra glyph
       variants and network-oriented systems (see
       http://fonts.jp/kage/), but many people didn't use these systems
       and switched to free-of-charge proprietary fonts when the
       Japanese Information-technology Promotion Agency released them,
       because they felt most glyphs in the Wadalab fonts were ugly.
       However, I'm not sure whether such a negative evaluation of
       dynamically composed glyphs is general. If somebody knows about
       the situation in other countries, please let me know.

       It's also possible to make an OpenType font whose cmap + GSUB
       tables convert an IDS sequence to a precomposed glyph index (the
       glyph is not directly accessible by a character code point), but
       this strategy cannot break the barrier of the 65535-glyph limit,
       and does not shrink the size of huge CJK fonts. I guess it's not
       what is expected by the people who promote IDS to prevent the
       inflation of the CJK Unified Ideograph blocks.
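
       As a rough model of that cmap + GSUB approach, the Python sketch
       below maps each code point of an IDS to a glyph ID and then
       applies a ligature-style substitution to the glyph ID sequence.
       It does not use any real font library; the table contents,
       glyph IDs and the shape function are all made up for
       illustration:

```python
from typing import Dict, List, Tuple

MAX_GLYPH_ID = 0xFFFF  # glyph IDs are limited to 16 bits in this model

# Hypothetical cmap: code point -> glyph ID (0 stands for "not defined").
CMAP: Dict[int, int] = {0x2FF0: 10, 0x6C35: 11, 0x76F8: 12, 0x6728: 13}

# Hypothetical ligature table: glyph ID sequence -> precomposed glyph ID.
# Every ideograph reachable this way still needs its own glyph.
LIGATURES: Dict[Tuple[int, ...], int] = {(10, 11, 12): 1}


def shape(text: str) -> List[int]:
    """Map text to glyph IDs, replacing known IDS sequences (longest first)."""
    gids = [CMAP.get(ord(ch), 0) for ch in text]
    longest = max(len(seq) for seq in LIGATURES)
    out: List[int] = []
    i = 0
    while i < len(gids):
        for n in range(min(longest, len(gids) - i), 1, -1):
            target = LIGATURES.get(tuple(gids[i:i + n]))
            if target is not None:
                out.append(target)
                i += n
                break
        else:
            # No sequence starts here: keep the single glyph.
            out.append(gids[i])
            i += 1
    return out


composed = shape("\u2ff0\u6c35\u76f8")    # sequence listed in LIGATURES
uncomposed = shape("\u2ff0\u6c35\u6728")  # sequence the font does not know
print(composed, uncomposed)
```

       Only sequences listed in the table collapse into one glyph, and
       each of them occupies a precomposed glyph slot, so the glyph
       count and the font size are not reduced.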

    Regards,
    mpsuzuki

    On Mon, 29 Oct 2007 16:36:41 -0500
    vunzndi@vfemail.net wrote:
    >I assume here by current approach you mean Wenlin's CDL, which is
    >based on cartesian co-ordinates. This is good for font making but bad
    >for a component based model. As you say the CDL is limited because it
    >gives just one representation of a character. CJKV characters are not
    >formed based on a cartesian system, the component based model should
    >be based on the way characters are formed, these concepts are more
    >topological than cartesian.


