RE: Composition of not included Chinese characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 19:25:33 CDT

    Thanks for pointing this out. That is exactly the kind of thing I suggested:
    a normalization form for IDS is really useful for its intended purpose,
    i.e. its use by the IRG for the identification and unification of
    ideographs.

    From your link, I found the revised principles in N1183 more useful:
    http://www.cse.cuhk.edu.hk/~irg/irg/irg25/IRGN1183RevisedIDSPrinciples.pdf
    This document explicitly says that IDS strings using triples are preferred
    over decompositions using pairs, so that the resulting IDS string is
    shorter, without changing the choice of component radicals.
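
    As a rough illustration of that preference, here is a minimal Python
    sketch that parses an IDS string and rewrites a left-to-right or
    above-to-below pair nested inside the same operator into the matching
    triple. The function names (parse, prefer_triples, serialize) are
    hypothetical, and treating both nesting orders as equivalent to the triple
    is my own simplifying assumption, not something taken from N1183. Only the
    operators U+2FF0..U+2FFB are handled.

```python
# Arity of each Ideographic Description Character (U+2FF0..U+2FFB).
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # left / middle / right
IDC_ARITY["\u2FF3"] = 3  # above / middle / below

# Assumed rewrite: a binary operator nested in itself becomes the ternary one.
TERNARY_OF = {"\u2FF0": "\u2FF2", "\u2FF1": "\u2FF3"}


def parse(ids):
    """Parse a prefix-notation IDS string into a nested tuple tree."""
    def walk(pos):
        if pos >= len(ids):
            raise ValueError("IDS ends before all operands are given")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a leaf component
        children = []
        for _ in range(IDC_ARITY[ch]):
            child, pos = walk(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = walk(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


def prefer_triples(tree):
    """Rewrite nested pairs into triples, bottom-up; components are unchanged."""
    if isinstance(tree, str):
        return tree
    op, *kids = tree
    kids = [prefer_triples(k) for k in kids]
    if op in TERNARY_OF:
        left, right = kids
        if isinstance(right, tuple) and right[0] == op:
            return (TERNARY_OF[op], left, right[1], right[2])
        if isinstance(left, tuple) and left[0] == op:
            return (TERNARY_OF[op], left[1], left[2], right)
    return (op, *kids)


def serialize(tree):
    """Turn a tree back into a prefix-notation IDS string."""
    if isinstance(tree, str):
        return tree
    return tree[0] + "".join(serialize(k) for k in tree[1:])


print(serialize(prefer_triples(parse("⿰木⿰木木"))))  # ⿲木木木
```

    The rewritten string is one character shorter and uses the same
    components in the same order, which is the property described above.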

    It also gives some principles for the decomposition:
    * it is based on glyphs, not on meaning, origin, intended use, or
    classification into traditional and simplified forms.
    * it is language-neutral.
    * it does not decompose radicals into their component strokes when those
    strokes collide or intersect in a non-trivial way: it keeps such radicals
    undecomposed, and considers the composed radical a good candidate for
    inclusion in the repertoire of base ideographs.

    These rules make sense. If we can use these principles to build a
    normative dictionary of IDS decompositions of ideographs, it will help
    authors who use dictionaries or need to locate rare ideographs: IDS
    strings become search keys, and derived IDS strings can be looked up and
    matched to find other ideographs.
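
    To make the search-key idea concrete, here is a small Python sketch of a
    component index over such a dictionary. The table SAMPLE_IDS and the
    function names are hypothetical, and the handful of decompositions are
    illustrative only; they are not taken from any normative source.

```python
from collections import defaultdict

# Illustrative decompositions only; not from any normative dictionary.
SAMPLE_IDS = {
    "明": "⿰日月",
    "林": "⿰木木",
    "相": "⿰木目",
    "森": "⿱木林",
    "想": "⿱相心",
}

# Ideographic Description Characters U+2FF0..U+2FFB (operators, not components).
IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def components(char, table, seen=None):
    """Collect every component reachable from char, expanding recursively."""
    seen = set() if seen is None else seen
    for part in table.get(char, ""):
        if part in IDCS or part in seen:
            continue
        seen.add(part)
        components(part, table, seen)  # expand derived IDS strings as well
    return seen


def build_index(table):
    """Map each component to the set of ideographs that contain it."""
    index = defaultdict(set)
    for char in table:
        for part in components(char, table):
            index[part].add(char)
    return index


def find(index, *parts):
    """Return the ideographs that contain all of the given components."""
    sets = [index.get(p, set()) for p in parts]
    return set.intersection(*sets) if sets else set()


index = build_index(SAMPLE_IDS)
print(find(index, "木", "心"))  # {'想'}
```

    Because the index expands nested decompositions, a search for 木 and 心
    finds 想 even though its own IDS string only mentions 相 and 心.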

    It could also help in implementing input methods in editors, or checkers
    that try to detect incorrect usage of ideographs and guess their intended
    meaning from usage dictionaries or repositories of common expressions.
    Detecting Chinese word boundaries would become less difficult.

    Finally, this could help in creating enhanced orthographic rules where there
    are ambiguities about the choice of radicals and the way they should be
    composed.

    IDS strings won't say anything about the final look of the composed glyph.
    The exact form of each component radical, or even of each stroke making up
    these radicals, is not specified and will vary between authors and
    traditions. Neither is the order in which strokes are drawn, which is quite
    well documented, but not completely. Stroke order strongly influences the
    final appearance, and it can cause confusion between normally distinct
    radicals, because strokes are transformed when radicals are resized and
    adjusted to fit in the composition square while still being kept readable.

    From this extensive work, the composition rules may finally be formalized,
    after studying the various ways the same pairs or triples of base
    ideographs are adjusted within many distinct composed ideographs. This would
    help font authors create more meaningful and readable ideographic fonts,
    with a richer subset of supported ideographs and a consistent style based on
    a reduced set of possible stroke forms and contextual stroke transformation
    rules (working much like hinting, with linear transforms of glyph control
    points depending on some external conditions).
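
    The simplest possible version of such a transformation rule can be
    sketched in Python as follows: each component gets a box inside the unit
    composition square, and a control point is mapped into that box by a
    scale and a translation. Everything here is an assumption made for
    illustration: the names layout and fit are hypothetical, the splits are
    naive equal divisions, and only the four left/right and above/below
    operators are handled. Real fonts adjust proportions and stroke shapes
    contextually, which this sketch does not attempt.

```python
# Naive assumption: each operator divides its box into equal parts.
SPLITS = {
    "\u2FF0": ("x", 2),  # left | right
    "\u2FF1": ("y", 2),  # above / below
    "\u2FF2": ("x", 3),  # left | middle | right
    "\u2FF3": ("y", 3),  # above / middle / below
}


def layout(ids):
    """Return [(component, (x, y, width, height)), ...] in the unit square.

    The y axis grows downward, so 'above' components get the smaller y.
    """
    boxes = []

    def place(pos, box):
        if pos >= len(ids):
            raise ValueError("IDS ends before all operands are given")
        ch = ids[pos]
        pos += 1
        if ch not in SPLITS:
            if 0x2FF0 <= ord(ch) <= 0x2FFB:
                raise NotImplementedError("surround/overlay operators not handled")
            boxes.append((ch, box))  # a leaf component keeps the whole box
            return pos
        axis, n = SPLITS[ch]
        x, y, w, h = box
        for i in range(n):
            if axis == "x":
                sub = (x + i * w / n, y, w / n, h)
            else:
                sub = (x, y + i * h / n, w, h / n)
            pos = place(pos, sub)
        return pos

    end = place(0, (0.0, 0.0, 1.0, 1.0))
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return boxes


def fit(point, box):
    """Map a control point given in the unit square into the component's box."""
    px, py = point
    x, y, w, h = box
    return (x + px * w, y + py * h)


for component, box in layout("⿱木⿰木木"):
    print(component, box)
```

    The same component 木 ends up with three different boxes here, which is
    exactly where contextual rules would be needed to keep its strokes
    readable at each size.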

    With this, we could see an end to the proliferation of ideographs, if many
    of them can be composed automatically from a set of transformation rules,
    acting like an orthography.

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of James Kass
    > Sent: Wednesday, September 26, 2007 01:10
    > To: 'Unicode Mailing List'
    > Subject: RE: Composition of not included Chinese characters
    >
    >
    > Philippe Verdy wrote about duplicate screening of CJK ideographs
    > based on IDS.
    >
    > List members interested in this topic would be well advised to
    > read Taichi Kawabata's "Algorithm for Identifying the Duplicate
    > Ideograph Characters by the IDS", for starters. The document
    > is available from this page:
    > http://www.cse.cuhk.edu.hk/~irg/irg/irg25/IRG25.htm
    > (Please see the link "N1154".)
    >
    > The page above has several related documents linked, as do other
    > pages on the web site.
    > http://www.cse.cuhk.edu.hk/~irg/index.htm



    This archive was generated by hypermail 2.1.5 : Tue Sep 25 2007 - 19:27:39 CDT