From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu May 15 2003 - 06:50:37 EDT
On Thu, 15 May 2003 11:43:31 +0200, Marco Cimarosti wrote:
> Not so good. What Gary needs is the *sequence* of all strokes composing each
> character. Once he has that data, the total number of strokes from each
> character is simply the length of each sequence.
I'm not sure that's what he wants either. In dictionaries that give a Stroke
Order index, characters are usually sub-sorted by the stroke category of their
first one or two strokes. Whilst you can get this information from a sequence
of all strokes, that is more than is needed.
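To make the ordering concrete, here is a minimal Python sketch. It assumes the
primary sort is by total stroke count, with ties broken by the categories of the
first two strokes. The table STROKE_INFO, the function names and the five-way
category numbering are my own illustrative assumptions, not taken from any
particular dictionary or database.

```python
# Hypothetical sample data: character -> (total stroke count,
# categories of the first two strokes).
# Assumed five-way category scheme (one common convention):
# 1 = horizontal, 2 = vertical, 3 = left-falling,
# 4 = dot / right-falling, 5 = turning.
STROKE_INFO = {
    "二": (2, (1, 1)),
    "十": (2, (1, 2)),
    "人": (2, (3, 4)),
    "刀": (2, (5, 3)),
    "三": (3, (1, 1)),
    "大": (3, (1, 3)),
    "口": (3, (2, 5)),
}


def index_key(ch):
    """Sort key: stroke count first, then the first-two-stroke categories."""
    count, first_two = STROKE_INFO[ch]
    return (count, first_two)


def stroke_index_order(chars):
    """Return the characters in stroke-index order as a list."""
    return sorted(chars, key=index_key)


if __name__ == "__main__":
    print("".join(stroke_index_order("口刀大人三十二")))
```

Note that only the count and the first two categories are stored per character,
which is the point: the full stroke sequence is not required for this kind of
index.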
> A better starting point would be a database of IDS decompositions of CJK
> ideographs.
...
> DB#1 would be useful for a number of purposes, but building it is a pain in
> the neck! (Just to be 100% clear, I'd like having it, but I am *not*
> volunteering to do it. :-)
Coincidentally I've recently been in contact with someone who has spent the last
ten years creating a database of CJK ideographs, similar in scope to the Unihan
database, but (according to him) more systematic and accurate. His database does
include ideographic decompositions (as well as stroke categorization of the
first two strokes of each character). The main problem with ideographic
decompositions is that not all discrete ideographic components are [currently]
encoded within Unicode; according to this person there are about 100 unencoded
ideographic components. Of course you could get around this by breaking those
components down directly into their strokes, but this would be an inelegant
solution.
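As a rough illustration of how decompositions and stroke data could fit
together, here is a minimal Python sketch. The tables IDS and STROKES, the
sample entries and the category numbers (1 = horizontal, 2 = vertical,
3 = left-falling, 4 = dot, 5 = turning) are hypothetical and do not reflect the
layout of the database mentioned above. It also assumes that concatenating
component strokes in IDS order is acceptable; that always gives the right total
count, but not always the right stroke order (enclosing components are one
exception).

```python
# The twelve Ideographic Description Characters, U+2FF0..U+2FFB.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical decomposition table: character -> IDS string (prefix notation).
IDS = {
    "明": "⿰日月",
    "盟": "⿱明皿",
}

# Hypothetical stroke table for components that are not decomposed further.
# A character containing an unencoded component would have to be listed here
# directly, which is the inelegant fallback described above.
STROKES = {
    "日": [2, 5, 1, 1],
    "月": [3, 5, 1, 1],
    "皿": [2, 5, 2, 2, 1],
}


def stroke_sequence(ch):
    """Expand a character into stroke categories via its IDS decomposition."""
    if ch in STROKES:
        return list(STROKES[ch])
    seq = []
    # IDS is prefix notation, so the components are simply the
    # non-operator characters, in order.
    for part in IDS[ch]:  # raises KeyError if the character is unknown
        if part not in IDC:
            seq.extend(stroke_sequence(part))
    return seq


def stroke_count(ch):
    """The total stroke count is the length of the stroke sequence."""
    return len(stroke_sequence(ch))
```

With these sample tables, stroke_count("盟") is the sum of the counts for 日, 月
and 皿.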
Andrew
P.S. If anyone is interested in cooperating with this person, please contact me
off-list.