RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: mpsuzuki@hiroshima-u.ac.jp
Date: Fri Nov 02 2007 - 20:14:05 CST

Next message: James Kass: "Re: Encoding Personal Use Ideographs (was Re: Level of Unicode support required for various languages)"

Previous message: William J Poser: "Re: Does the Egyptian Hieroglyphic proposal support Budge?"
Next in thread: vunzndi@vfemail.net: "RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"
Reply: vunzndi@vfemail.net: "RE: Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)"

Hi,

It may be too late to join the discussion about component-based
encoding for CJKV ideographs, which stopped a week ago, but similar
comments promoting component encoding as a good alternative for
supporting the huge CJKV character collection may be posted in the
future. I think there are 2 typical problems in component-based
encoding for CJKV ideographs but, unfortunately, I have never seen
a proposal that takes precautions against them. If anybody knows
of one, please let me know.

1. Information interchange of "unified" ideographs
--------------------------------------------------
   For some ideographs, an IDS is too "descriptive" to identify
   an ideograph whose shape may vary under the unification rules of
   ISO/IEC 10646 Annex S. The Unicode Standard 5.0, pp. 429-430,
   explains that multiple IDSs can describe the same ideograph and
   that there is no algorithm to check whether 2 IDSs describe the
   same character. I think one of the important policies in Unicode
   is that multiple representations of a single character are not a
   good idea. Thus, using a code point is better for unambiguous
   information interchange.
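
   As a minimal sketch of this ambiguity (my own illustration, not
   taken from the Unicode Standard or from any proposal): U+6E58
   could plausibly be described either as U+2FF0 U+6C35 U+76F8 or as
   U+2FF0 U+6C35 U+2FF0 U+6728 U+76EE. The choice of U+6E58 and the
   helper name parse_ids are assumptions made for the example, and
   the operator table covers only U+2FF0..U+2FFB. The Python code
   below parses both sequences and shows that they differ both as
   strings and as trees, so a plain comparison cannot tell that they
   are meant to be the same character.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their arity.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple tree (hypothetical helper)."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            # A plain component: it is a leaf of the tree.
            return ch, pos
        operands = []
        for _ in range(IDC_ARITY[ch]):
            node, pos = parse(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


# Two plausible descriptions of U+6E58 (illustrative assumption).
flat = "\u2FF0\u6C35\u76F8"                # left-right: U+6C35, U+76F8
nested = "\u2FF0\u6C35\u2FF0\u6728\u76EE"  # U+76F8 decomposed further

same_string = flat == nested
same_tree = parse_ids(flat) == parse_ids(nested)
print(same_string, same_tree)
```

   Deciding that the two trees are "the same character" would need
   extra knowledge (here, that U+76F8 itself decomposes into U+6728
   and U+76EE), and in general also the unification rules, which is
   exactly what a simple comparison does not have.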

   For example, if the PRC, Taiwanese, Japanese, Korean and Vietnamese
   instances in the five columns of ISO/IEC 10646 for the following
   characters are each described by an IDS, the descriptions will not
   be the same:
   U+518E, U+5203, U+5205, U+5544, U+559A, U+55AD, U+55B6, U+55BA, U+55C2,
   U+5605, U+5629, U+5668, U+569D, U+56B3, U+570A, U+5832, U+5835,
   U+5840, U+58B7, etc.

   If IDS is expected to be useful for information interchange,
   these ideographs should not be over-decomposed. In Kawabata-san's
   database, these characters have multiple IDS expressions,
   corresponding to the instances in the five columns of ISO/IEC
   10646. As long as there is no standard way to evaluate the equality
   of these multiple IDS expressions, these characters should not
   be over-decomposed. But the instances in ISO/IEC 10646 are not
   a complete collection of unifiable ideographs, so, again, it is
   difficult to list all characters whose IDS decomposition should
   be restricted. I guess Kawabata-san wants people to learn the UCS
   unification rules and refrain from over-differentiating "new"
   ideographs (e.g. "this character is not coded yet, I want to
   display this character, I cannot find existing fonts").
   But I doubt that an educational approach can block such
   requests.

2. The quality of dynamically composed ideographs
-------------------------------------------------
   John Nightly has already pointed out that "CJKV characters are not
   formed based on a cartesian system". I agree; this is important.

   Some people may think an IDS is sufficient to compose a CJK
   ideograph dynamically: the TrueType glyph format supports composite
   glyphs with simple affine transformations, so a font file can
   reduce its content to the essential components only. Furthermore,
   if the composition is implemented outside the TrueType rasterizer,
   complex glyphs can be composed dynamically, the font file does not
   have to include the composition rules at all, and users can compose
   glyphs for all possible combinations.
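
   To make the idea concrete, here is a minimal sketch of such a naive
   composition in Python. It is only my illustration under simplifying
   assumptions: an outline is a flat list of (x, y) points on a 1000-unit
   em square, and the function names affine, compose_left_right and
   stem_width are hypothetical, not part of any font tool. It squeezes
   two full-size components into the left and right halves of the em
   square with a scale-and-translate transform.

```python
def affine(points, sx, sy, dx, dy):
    """Apply a scale-and-translate transform to a list of (x, y) points."""
    return [(sx * x + dx, sy * y + dy) for x, y in points]


def compose_left_right(left, right, em=1000, split=0.5):
    """Squeeze two full-em outlines into side-by-side boxes (naive model)."""
    boundary = em * split
    # The left component is scaled horizontally into [0, boundary].
    left_part = affine(left, split, 1.0, 0, 0)
    # The right component is scaled into [boundary, em] and shifted over.
    right_part = affine(right, 1 - split, 1.0, boundary, 0)
    return left_part + right_part


def stem_width(points):
    """Horizontal extent of an outline, used here as a crude stroke width."""
    xs = [x for x, _ in points]
    return max(xs) - min(xs)


# A single vertical stroke, 100 units wide and 1000 units tall (made-up data).
stem = [(450, 0), (550, 0), (550, 1000), (450, 1000)]

composed = compose_left_right(stem, stem)
left_half, right_half = composed[:4], composed[4:]

print(stem_width(stem), stem_width(left_half), stem_width(right_half))
```

   In this sketch the vertical stroke becomes half as wide after
   composition while its height is unchanged, so the stroke weight of
   a composed glyph no longer matches that of the original component.
   This may be one reason why purely Cartesian composition needs
   extra adjustment to look acceptable.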

   This is a popular assumption, but the quality of dynamically
   composed glyphs is quite doubtful. Taking the Japanese case, the
   Wadalab fonts were produced by this strategy (composite ideographs
   are generated from component radicals). You can check the quality
   of the original artwork in Ken Lunde's samples of CID-keyed fonts:
   ftp://ftp.oreilly.com/pub/examples/nutshell/ujip/adobe/samples/
   (the WadaXXX series is based on the Wadalab PS Type 1 fonts).
   Some people tried to improve the Wadalab fonts with extra glyph
   variants and network-oriented systems (see http://fonts.jp/kage/),
   but many people did not use these systems and switched to
   free-of-charge proprietary fonts when the Japanese information
   promotion agency released them, because they felt that most glyphs
   in Wadalab were ugly. However, I am not sure whether such a negative
   evaluation of dynamically composed glyphs is general. If somebody
   knows about the situation in other countries, please let me know.

   It is also possible to make an OpenType font whose cmap + GSUB
   converts an IDS sequence to a precomposed glyph index (the glyph
   is not directly accessible by a character code point), but this
   strategy cannot break the 65535-glyph limit and does not shrink
   the size of huge CJK fonts. I guess this is not what is expected
   by the people who promote IDS to prevent the inflation of the
   CJK Unified Ideograph blocks.

Regards,
mpsuzuki

On Mon, 29 Oct 2007 16:36:41 -0500
vunzndi@vfemail.net wrote:
>I assume here by current approach you mean Wenlin's CDL, which is
>based on cartesian co-ordinates. This is good for font making but bad
>for a component based model. As you say the CDL is limited because it
>gives just one representation of a character. CJKV characters are not
>formed based on a cartesian system, the component based model should
>be based on the way characters are formed, these concepts are more
>topological than cartesian.


This archive was generated by hypermail 2.1.5 : Fri Nov 02 2007 - 20:27:03 CST