From: John Jenkins (jenkins@apple.com)
Date: Tue May 13 2003 - 10:48:42 EDT
On Tuesday, May 13, 2003, at 07:52 AM, Gary P. Grosso wrote:
>
> Our radical/stroke sort relies on the fact that unicode order is the
> same as radical/stroke order.
Actually, this is not quite true. Outside of the fact that the Han
ideographs are spread out over three blocks, there are ambiguities in
stroke-counting which can result in disagreements.
The basic order of ideographs within a block is via the four-dictionary
sorting algorithm, which closely approximates radical-stroke order but
does vary in actual stroke counts from what would be generally used for
traditional Chinese, simplified Chinese, Japanese, and Korean.
The intent of the default order within Unihan is most emphatically
*NOT* to provide an adequate or correct sort order for ideographs, but
to provide a consistent, algorithmic way of assigning code points.
Actual, real-life collation should use additional data, some of which
can be found within Unihan.txt.
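As a rough illustration of sorting from data instead of from code point order, here is a minimal Python sketch. It assumes the radical-stroke values come from the kRSUnicode field of Unihan, written as "radical.residual-strokes" with an optional apostrophe after the radical number. The names parse_rs, radical_stroke_key, radical_stroke_sort, and SAMPLE_RS are made up for this example, and the handful of sample values should be checked against the real file before anyone relies on them.

```python
def parse_rs(value):
    """Split a radical-stroke string such as "38.3" into (radical, residual)."""
    radical_part, residual_part = value.split(".")
    # A trailing apostrophe marks a simplified radical form; ignore it here.
    radical = int(radical_part.rstrip("'"))
    return radical, int(residual_part)


# Illustrative sample only (assumption): a real table would be loaded
# from the Unihan data files, not typed in by hand.
SAMPLE_RS = {
    0x3400: "1.4",   # an Extension A character under radical 1
    0x4E00: "1.0",
    0x4E01: "1.1",
    0x4E09: "1.2",
    0x597D: "38.3",
}


def radical_stroke_key(ch, rs_table=SAMPLE_RS):
    """Sort key: radical, then residual strokes, then code point."""
    cp = ord(ch)
    value = rs_table.get(cp)
    if value is None:
        # Characters without data sort last, in code point order.
        return (1, cp)
    radical, residual = parse_rs(value)
    return (0, radical, residual, cp)


def radical_stroke_sort(text, rs_table=SAMPLE_RS):
    """Return the characters of text in radical-stroke order."""
    return "".join(sorted(text, key=lambda ch: radical_stroke_key(ch, rs_table)))
```

With the sample table, U+3400 sorts after U+4E09 even though its code point is lower, which is the kind of cross-block disagreement that plain code point order cannot handle.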
> Stroke order, then, is something different. Seems like we would need
> order entries in the config data for every character, which would be
> totally unmanageable.
>
> I didn't have any luck searching the Unicode web site for information
> about sorting by stroke.
>
There is a kTotalStrokes field in Unihan.txt, although it doesn't cover
every character in Unihan. This would definitely be a good place to
start.
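A minimal Python sketch of reading that field, assuming the usual Unihan line layout of three tab-separated columns (code point as U+XXXX, field name, value). The function names and the SAMPLE lines are invented for this example; real input would be the lines of the data file itself.

```python
def parse_total_strokes(lines):
    """Map code point -> stroke count from Unihan-format lines."""
    strokes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 3 or fields[1] != "kTotalStrokes":
            continue  # some other field, or a malformed line
        code_point = int(fields[0][2:], 16)  # "U+4E00" -> 0x4E00
        # Assumption: if several space-separated counts are listed,
        # the first one is used.
        strokes[code_point] = int(fields[2].split()[0])
    return strokes


def stroke_sort_key(ch, strokes):
    """Sort by total strokes; characters without data go last."""
    cp = ord(ch)
    return (strokes.get(cp, float("inf")), cp)


# Hypothetical stand-in for a few lines of the data file.
SAMPLE = [
    "# sample data",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E09\tkTotalStrokes\t3",
    "U+4E8C\tkTotalStrokes\t2",
    "U+5341\tkTotalStrokes\t2",
    "U+4E00\tkMandarin\tyi1",
]

STROKES = parse_total_strokes(SAMPLE)
```

Something like sorted(text, key=lambda ch: stroke_sort_key(ch, STROKES)) then orders characters by stroke count, falling back to code point order for ties and putting characters the field does not cover at the end.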
Since characters with the same radical-stroke combination are (usually)
found together within a block, and since the radicals (more often than
not) have a consistent stroke count, it probably wouldn't be difficult
to use some sort of data structure to hold this information more
compactly than just using a straight table, but I haven't worked on the
problem myself.
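One possible shape for such a structure, offered only as an untested-in-practice sketch: collapse consecutive code points that share a stroke count into (start, end, count) runs and look a character up by binary search. The names build_runs and lookup_strokes and the sample counts are made up for this example, and how much space this saves on real data is not something I have measured.

```python
from bisect import bisect_right


def build_runs(strokes):
    """Collapse {code_point: count} into sorted (start, end, count) runs."""
    runs = []
    for cp in sorted(strokes):
        count = strokes[cp]
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == count:
            # Extend the current run by one code point.
            runs[-1] = (runs[-1][0], cp, count)
        else:
            runs.append((cp, cp, count))
    return runs


def lookup_strokes(runs, starts, cp):
    """Return the stroke count for cp, or None if no run covers it."""
    i = bisect_right(starts, cp) - 1
    if i >= 0 and runs[i][0] <= cp <= runs[i][1]:
        return runs[i][2]
    return None


# Illustrative counts only (assumption), not copied from Unihan.
SAMPLE_COUNTS = {
    0x4E00: 1,
    0x4E01: 2, 0x4E02: 2, 0x4E03: 2,
    0x4E07: 3, 0x4E08: 3,
}

RUNS = build_runs(SAMPLE_COUNTS)
STARTS = [run[0] for run in RUNS]  # run start points, for bisect
```

Here six table entries collapse into three runs, and a gap between runs (U+4E05 in the sample) still reports no data. Storing residual strokes per radical instead of total strokes would be a variation on the same idea.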
==========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://www.tejat.net/