From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 19:25:33 CDT
Thanks for pointing it out. That's exactly the kind of thing I suggested:
having some normalization form for IDS is really useful for the intended
purpose, i.e. its use by the IRG for the identification and unification of
ideographs.
From your link, I found the revised principles in N1183 more useful:
http://www.cse.cuhk.edu.hk/~irg/irg/irg25/IRGN1183RevisedIDSPrinciples.pdf
This document explicitly says that IDS strings using triples are preferred
over decompositions using pairs, so that the resulting IDS string is shorter,
without changing the choice of component radicals.
It also gives some principles about the decomposition:
* it is based on glyphs, not on meaning, origin, intended use, or
classification into traditional and simplified forms.
* it is language-neutral.
* it does not attempt to decompose the radicals too far into their
component strokes when these strokes collide or intersect in a non-trivial
way: it keeps them undecomposed, and considers the composed radical
a good candidate for inclusion in the repertoire of base ideographs.
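To illustrate the preference for triples over nested pairs, here is a rough
Python sketch of one possible normalization step. The rewrite rule (a
left-right or above-below operator nested directly inside itself becomes the
corresponding three-part operator) is my own assumption about how such a
normal form could work, not a rule quoted from N1183, and the function names
are invented for the example.

```python
# The twelve IDS operators U+2FF0..U+2FFB; all are binary except
# U+2FF2 (left-middle-right) and U+2FF3 (above-middle-below).
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = 3
ARITY["\u2FF3"] = 3

# Assumed rewrite rule: a binary operator nested in itself is replaced
# by the matching ternary operator.
BINARY_TO_TERNARY = {"\u2FF0": "\u2FF2", "\u2FF1": "\u2FF3"}


def parse(ids):
    """Parse a prefix-notation IDS string into a nested tuple tree."""
    def node(pos):
        ch = ids[pos]
        pos += 1
        if ch not in ARITY:
            return ch, pos              # a component (leaf)
        children = []
        for _ in range(ARITY[ch]):
            child, pos = node(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = node(0)
    if end != len(ids):
        raise ValueError("trailing characters in IDS")
    return tree


def normalize(tree):
    """Flatten (op (op x y) z) and (op x (op y z)) into a triple."""
    if isinstance(tree, str):
        return tree
    op, *kids = tree
    kids = [normalize(k) for k in kids]
    ternary = BINARY_TO_TERNARY.get(op)
    if ternary:
        a, b = kids
        if isinstance(a, tuple) and a[0] == op:
            return (ternary, a[1], a[2], b)
        if isinstance(b, tuple) and b[0] == op:
            return (ternary, a, b[1], b[2])
    return (op, *kids)


def serialize(tree):
    """Turn a tree back into a prefix-notation IDS string."""
    if isinstance(tree, str):
        return tree
    return tree[0] + "".join(serialize(k) for k in tree[1:])


def normalize_ids(ids):
    return serialize(normalize(parse(ids)))
```

With this, the two nested spellings "⿰氵⿰古月" and "⿰⿰氵古月" both
normalize to "⿲氵古月", so a plain string comparison is enough to flag them
as candidates for the same ideograph.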
These rules make sense. Now if we can use these principles to build a
normative dictionary of IDS decompositions of ideographs, it will help
authors who use dictionaries, or who need to locate some rare ideographs, by
using IDS strings as search keys from which derived IDS strings can be
looked up and matched to find other ideographs.
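As a small illustration of using IDS strings as search keys, here is a
Python sketch. The table is a hand-written toy with a handful of common
characters, not a normative dictionary, and the names TOY_IDS, expand and
find_by_key are invented for the example. The key should be either a single
component or a complete IDS.

```python
# Hand-written toy table (NOT normative data): ideograph -> IDS.
TOY_IDS = {
    "林": "⿰木木",
    "森": "⿱木林",
    "好": "⿰女子",
    "李": "⿱木子",
    "明": "⿰日月",
}


def expand(ids, table):
    """Replace each component that has its own entry by that entry's IDS.

    Assumes the table has no circular definitions.
    """
    return "".join(
        expand(table[ch], table) if ch in table else ch
        for ch in ids
    )


def find_by_key(key, table):
    """Return the ideographs whose expanded IDS contains the expanded key.

    Because IDS is prefix notation with fixed operand counts, a complete
    IDS found as a substring always corresponds to a real sub-component.
    """
    key = expand(key, table)
    return sorted(ch for ch, ids in table.items()
                  if key in expand(ids, table))
```

Searching for "⿰木木" finds both 林 and 森, the second one only through
the derived (expanded) IDS string "⿱木⿰木木".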
It could also be helpful for the implementation of input methods in editors,
or within checkers that attempt to detect the incorrect usage of ideographs
and to guess their meaning according to some usage dictionaries or
repositories of common expressions. It would also make it less difficult to
detect Chinese word boundaries.
Finally, this could help create enhanced orthographic rules when there are
ambiguities about the choice of radicals and the way they should be
composed.
IDS strings won't say anything about the final look of the composed glyph.
The exact forms of each component radical, or even of each stroke making up
these radicals, will not be specified and will vary between authors and
traditions, as will the order in which the strokes are drawn. Stroke order is
quite well documented, but not completely, and it strongly influences the
final appearance. It also influences the possible confusions between normally
distinct radicals, which come from the transformations applied to strokes
when radicals are resized and adjusted to fit in the composition square while
trying to keep them readable.
From this extensive work, the composition rules may finally be formalized,
after studying the various ways the same pairs or triples of base
ideographs are adjusted within many distinct composed ideographs. This would
help font authors create more meaningful and readable ideographic fonts with
a richer subset of supported ideographs and a consistent style, based on a
reduced set of possible stroke forms and contextual stroke transformation
rules (working much like hinting, with linear transforms of glyph control
points depending on some external conditions).
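To make the idea of linear transforms of control points a bit more
concrete, here is a deliberately naive Python sketch. The layout table
(equal halves for the left-right and above-below operators) and the names
fit, LAYOUT and compose are pure assumptions for illustration; real
composition rules would need contextual adjustments that this does not
attempt.

```python
def fit(points, box):
    """Map control points from the unit square into box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return [(x0 + x * (x1 - x0), y0 + y * (y1 - y0)) for x, y in points]


# Hypothetical layout: one target box per operand, y axis pointing up.
LAYOUT = {
    "⿰": [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],  # left, right
    "⿱": [(0.0, 0.5, 1.0, 1.0), (0.0, 0.0, 1.0, 0.5)],  # above, below
}


def compose(op, components):
    """Scale each component's control points into its box for operator op."""
    return [fit(pts, box) for pts, box in zip(components, LAYOUT[op])]
```

For example, compose("⿰", [a, b]) squeezes the control points of a into the
left half of the square and those of b into the right half. The interesting
part of the formalization would be replacing the fixed boxes with rules that
depend on the components involved.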
With this, we could see an end to the proliferation of ideographs, if many
of them can be composed automatically from a set of transformation rules,
acting like an orthography.
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of James Kass
> Sent: Wednesday, September 26, 2007 01:10
> To: 'Unicode Mailing List'
> Subject: RE: Composition of not included Chinese characters
>
>
> Philippe Verdy wrote about duplicate screening of CJK ideographs
> based on IDS.
>
> List members interested in this topic would be well advised to
> read Taichi Kawabata's "Algorithm for Identifying the Duplicate
> Ideograph Characters by the IDS", for starters. The document
> is available from this page:
> http://www.cse.cuhk.edu.hk/~irg/irg/irg25/IRG25.htm
> (Please see the link "N1154".)
>
> The page above has several related documents linked, as do other
> pages on the web site.
> http://www.cse.cuhk.edu.hk/~irg/index.htm