From: Dan Kogai (dankogai@dan.co.jp)
Date: Tue May 13 2003 - 15:57:53 EDT
On Wednesday, May 14, 2003, at 01:23 AM, Andrew C. West wrote:
> That's certainly true, but sorting by Unicode code point will be 90% OK
> for the 99.99% of CJK data that is encoded within the basic CJK block
> (and at the radical level it'll probably be 99.9% OK). As a rough and
> ready method of sorting CJK data it's definitely the most cost effective
> way of implementing a CJK sort. Like I said, it all depends on what you
> want it for.
I wrote a small Perl script to see if that is correct.
#!/usr/local/bin/perl
use strict;
use Unicode::Unihan; # get one via CPAN

my $uh = Unicode::Unihan->new;
binmode STDOUT => ':utf8';          # the output contains the ideographs themselves

for my $ord (0..65535) {            # just check BMP
    my $chr = chr($ord);
    my $rs  = $uh->RSUnicode($chr); # radical.stroke value
    defined $rs or next;            # skip code points with no such entry
    printf "%s (U+%04x) => %s\n", $chr, $ord, $rs;
}
__END__
And here is part of what it prints.
㐀 (U+3400) => 1.4
㐁 (U+3401) => 1.5
㐂 (U+3402) => 1.5
㐃 (U+3403) => 2.2
[snip]
䶵 (U+4db5) => 214.10
一 (U+4e00) => 1.0
丁 (U+4e01) => 1.1
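For U+3400 - U+4DB5 you are roughly right, but at U+4E00 "One", the
simplest of all ideographs, rewinds the "stroke counter": each block is
roughly in radical/stroke order on its own, and the order starts over at
the block boundary. So I have to say that sorting by Unicode code point
to approximate radical/stroke sorting is of very doubtful value.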
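To make the rewind concrete, here is a minimal sketch that sorts the seven characters printed above in two ways: by code point, and by their radical.stroke values. The values are hard-coded from that output, so Unicode::Unihan is not needed; the names %rs, rs_key and by_radical_stroke are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Radical.stroke values copied from the output quoted above,
# keyed by code point.
my %rs = (
    0x3400 => '1.4',
    0x3401 => '1.5',
    0x3402 => '1.5',
    0x3403 => '2.2',
    0x4DB5 => '214.10',
    0x4E00 => '1.0',
    0x4E01 => '1.1',
);

# Split "radical.strokes" into two integers so that 214.10 is not
# mistaken for the decimal number 214.1.
sub rs_key {
    my ($cp) = @_;
    my ($radical, $strokes) = split /\./, $rs{$cp};
    return ($radical, $strokes);
}

# Order by radical, then residual strokes, then code point as a tie-breaker.
sub by_radical_stroke {
    my ($ra, $sa) = rs_key($a);
    my ($rb, $sb) = rs_key($b);
    return $ra <=> $rb || $sa <=> $sb || $a <=> $b;
}

my @by_code_point     = sort { $a <=> $b } keys %rs;
my @by_radical_stroke = sort by_radical_stroke keys %rs;

printf "code point:     %s\n", join ' ', map { sprintf 'U+%04X', $_ } @by_code_point;
printf "radical/stroke: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @by_radical_stroke;
```

In the radical/stroke line U+4E00 and U+4E01 move to the front, ahead of everything in U+3400 - U+4DB5, which is exactly where the two orders part ways.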
Sorting by code point to yield dictionary order seems a luxury only
ASCII enjoys. Even ISO-8859-1 fails miserably since all the accented
letters sit at \xC0 and above, after every unaccented letter.
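The same effect is easy to see with Latin-1 data. This is only a sketch with made-up sample words; it relies on Perl's default sort comparing strings code point by code point when no locale or collation module is in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample words; \xE9 is e-acute (U+00E9, the same value in ISO-8859-1).
my @words = ("zebra", "\xE9cole", "eagle", "echo");

# Without "use locale" or a collation module, sort compares the
# strings code point by code point.
my @sorted = sort @words;

binmode STDOUT, ':utf8';
print "$_\n" for @sorted;
```

The accented word comes out last, after "zebra". A collation-aware sort, for example with the Unicode::Collate module, would be the usual way to get it back among the other e-words.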
Dan the Unsorted Man