From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Wed May 14 2003 - 06:01:38 EDT
On Wed, 14 May 2003 04:57:53 +0900, Dan Kogai wrote:
> For U+3400 - U+4DD5 you are roughly right but at U+4E00, "One", the
> simplest of all ideographs, rewinds the "stroke counter". So I have to
> say sorting by Unicode code point to approximate radical/stroke sorting
> is very moot.
But I did specify for the "basic CJK block [U+4E00..9FFF] only". If you include
CJK-A and/or CJK-B it all falls to pieces.
However, as I said, the vast majority of CJK data in the wild fits within
U+4E00..9FFF, and you only have to worry about CJK-A or CJK-B if you are dealing
with atypical Chinese data (such as text that includes obscure or archaic
ideographs, or ultra-simplified forms). For standard modern Chinese, whether of
the PRC or Taiwanese variety, it is reasonably safe to assume that everything
will fit into the basic CJK block (given that the basic CJK block is based on
pre-Unicode Taiwan and PRC coding standards), and a sort by codepoint will yield
acceptable results for most purposes.
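To make the suggestion concrete, here is a minimal sketch in Python of a codepoint sort restricted to the basic CJK block. The function and constant names are made up for illustration, and the block bounds are simply the U+4E00..9FFF range quoted above. It only approximates radical/stroke order and deliberately refuses anything outside the basic block, since that is where the approximation breaks down.

```python
# Bounds of the basic CJK block as quoted in this message (U+4E00..9FFF).
BASIC_CJK_FIRST = 0x4E00
BASIC_CJK_LAST = 0x9FFF


def in_basic_cjk(ch):
    """Return True if the single character ch lies in U+4E00..9FFF."""
    return BASIC_CJK_FIRST <= ord(ch) <= BASIC_CJK_LAST


def approx_radical_stroke_sort(chars):
    """Sort single ideographs by code point (hypothetical helper).

    This is only an approximation of radical/stroke order, and it is
    only meaningful inside the basic CJK block, so characters from
    CJK-A, CJK-B or anywhere else are rejected rather than sorted.
    """
    outside = [c for c in chars if not in_basic_cjk(c)]
    if outside:
        raise ValueError(
            "outside basic CJK block: "
            + ", ".join("U+%04X" % ord(c) for c in outside)
        )
    return sorted(chars, key=ord)
```

For example, passing U+5C71, U+4E00 and U+4EBA returns them in the order U+4E00, U+4EBA, U+5C71, while passing U+3400 (CJK-A) raises ValueError instead of silently sorting it ahead of U+4E00.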
As John said, there are some inconsistencies in stroke count ordering within a
radical (but these are fairly minor, of the type stroke count = ... 9, 9, 10, 9,
9, 10, 10, 9, 10 ...), and there are one or two ideographs which are located
in the wrong radical group (e.g. U+5909), but all in all it's pretty good if all
you need is an approximate radical/stroke sort.
Regards,
Andrew