RE: how to sort by stroke (not radical/stroke)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu May 15 2003 - 05:43:31 EDT

Next message: Pim Blokland: "Re: weird UTF-8 encoding in MS Exchange 2000 IM client"

Previous message: Philippe Verdy: "Computing default UCA collation tables"
Maybe in reply to: Gary P. Grosso: "how to sort by stroke (not radical/stroke)"
Next in thread: Andrew C. West: "RE: how to sort by stroke (not radical/stroke)"

John Jenkins wrote:
> There is a kTotalStrokes field in Unihan.txt, although it
> doesn't cover every character in Unihan. This would
> definitely be a good place to start.

Not so good. What Gary needs is the *sequence* of all the strokes composing
each character. Once he has that data, the total number of strokes in a
character is simply the length of its sequence.
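
To make that concrete, here is a rough Python sketch. It assumes a
hypothetical table STROKES mapping each character to its stroke sequence;
the three entries are just sample data using the romanized stroke names
from the examples further down. The ranking in STROKE_RANK, used to break
ties between characters with the same count, is purely an assumption and
not an established ordering.

```python
# Hypothetical table: character -> stroke sequence (sample data only).
STROKES = {
    "口": ["shu", "zhe", "heng"],
    "人": ["pie", "na"],
    "一": ["heng"],
}

# Assumed ranking of stroke types; only used to break ties between
# characters that have the same number of strokes.
STROKE_RANK = {"heng": 0, "shu": 1, "pie": 2, "na": 3, "zhe": 4, "shugou": 5}


def total_strokes(ch):
    """The stroke count is just the length of the stroke sequence."""
    return len(STROKES[ch])


def stroke_sort_key(ch):
    """Sort by stroke count first, then by the ranked stroke sequence."""
    seq = STROKES[ch]
    return (len(seq), [STROKE_RANK[s] for s in seq])


print(sorted("口人一", key=stroke_sort_key))
```

With real data, the same key function could be handed to any sort routine;
only the table and the ranking would need to change.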

A better starting point would be a database of IDS decompositions of CJK
ideographs. E.g.:

        (DB#1: IDS decompositions)
        喻 = ⿰ 口 ⿱ ⿱ 人一 ⿰ 月刂
        U+55BB = LeftRight(MOUTH, TopBottom(TopBottom(MAN, ONE),
                                            LeftRight(MOON, KNIFE)))

Once you have that, building a strokes database is quite trivial. First, all
the IDS operators are useless for this purpose and should be stripped off:

        (DB#2: Decompositions in atomic components)
        喻 = 口人一月刂
        U+55BB = { MOUTH, MAN, ONE, MOON, KNIFE }
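
A minimal sketch of that stripping step in Python might look like this. It
assumes the operators are exactly the Ideographic Description Characters
U+2FF0..U+2FFB, and that the components are listed in the IDS in the order
in which they are written; the function name is made up for illustration.

```python
# Ideographic Description Characters U+2FF0..U+2FFB (assumed operator set).
IDS_OPERATORS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}


def strip_ids_operators(ids):
    """Drop IDS operators and whitespace, keeping components in order."""
    return "".join(
        ch for ch in ids
        if ch not in IDS_OPERATORS and not ch.isspace()
    )


print(strip_ids_operators("⿰ 口 ⿱ ⿱ 人一 ⿰ 月刂"))
```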

Then, a database of strokes for all the atomic components is needed. This
should not be such a huge job, because only a few hundred such components
are supposed to exist:

        (DB#3: Stroke sequences of atomic components)
        口 = 丨乙一
        MOUTH = { shu, zhe, heng }

        人 = 丿丶
        MAN = { pie, na }

        一 = 一
        ONE = { heng }

        月 = 丿乙一一
        MOON = { pie, zhe, heng, heng }

        刂 = 丨亅
        KNIFE = { shu, shugou }

At this point, it is easy to automatically expand the components of DB#2 to
the corresponding stroke sequences of DB#3:

        (DB#4: CJK stroke sequences)
        喻 = 丨乙一丿丶一丿乙一一丨亅
        U+55BB = { shu, zhe, heng, pie, na, heng, pie, zhe, heng, heng,
                   shu, shugou }
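
The expansion could be sketched in Python roughly as follows. The
dictionary holds only the five sample entries of DB#3 above, and the names
COMPONENT_STROKES and expand_to_strokes are invented for this example.

```python
# Sample of DB#3: atomic component -> stroke sequence.
COMPONENT_STROKES = {
    "口": "丨乙一",
    "人": "丿丶",
    "一": "一",
    "月": "丿乙一一",
    "刂": "丨亅",
}


def expand_to_strokes(components):
    """Replace each atomic component (DB#2) with its strokes (DB#3)."""
    # A component missing from the table raises KeyError, which makes
    # gaps in DB#3 easy to spot.
    return "".join(COMPONENT_STROKES[c] for c in components)


strokes = expand_to_strokes("口人一月刂")
print(strokes, len(strokes))
```

For 喻 this yields the twelve strokes listed in DB#4.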

DB#1 would be useful for a number of purposes, but building it is a pain in
the neck! (Just to be 100% clear, I'd like to have it, but I am *not*
volunteering to do it. :-)

_ Marco


This archive was generated by hypermail 2.1.5 : Thu May 15 2003 - 06:24:11 EDT