> We have generated graphical charts that show the sorting order for many
locales with the ICU 2.0 data:
http://oss.software.ibm.com/icu/charts/collation/
> They are intended to give an easier-to-read overview of the sorting order
than the source data (which lives in CVS, in the locale-specific
icu/data/*.txt files).
I want to mention a couple of items:
1. There are situations that aren't easy to notice without looking at charts
like these. For example, in the Danish chart
(http://oss.software.ibm.com/icu/charts/collation/da.html), characters such
as
æ (00E6) LATIN SMALL LETTER AE *
come after Z (and the accented Z's), as specified in our rules.
Unfortunately, because there are several Z's with overlays that don't
decompose, letters such as æ also come *before* characters like:
ƶ (01B6) LATIN SMALL LETTER Z WITH STROKE
The normal references for Danish don't tell us whether we should move
letters like æ (00E6) LATIN SMALL LETTER AE * down a bit further, so that
they order after all the "Z-like" characters. This is not a crucial issue,
since the z with stroke is not normally used in Danish, but it would be good
to get clarification on it.
This situation occurs in several other languages too.
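
To make the question in item 1 concrete, here is a toy sketch in Python. It
is not ICU code and does not use the real UCA table: ROOT_ORDER is a tiny
hypothetical stand-in for the default order, and tailor_after is an invented
helper that mimics a primary-level "put this letter right after that one"
rule. It only shows how anchoring æ at z leaves it in front of z with stroke,
while anchoring it at the last "Z-like" letter would not.

```python
# Toy model: a primary ordering is a list; a character's position is its
# primary weight. ROOT_ORDER is a hypothetical, heavily reduced stand-in
# for the default (UCA) order, not real data.
ROOT_ORDER = ["x", "y", "z", "\u01b6"]  # U+01B6 = LATIN SMALL LETTER Z WITH STROKE


def tailor_after(order, anchor, new_char):
    """Return a copy of `order` with `new_char` placed right after `anchor`."""
    result = [c for c in order if c != new_char]
    result.insert(result.index(anchor) + 1, new_char)
    return result


def sort_key(order):
    """Build a key function that maps a string to its list of primary weights."""
    rank = {c: i for i, c in enumerate(order)}
    return lambda s: [rank[c] for c in s]


# Behavior described in item 1: æ (U+00E6) is anchored directly after z.
current = tailor_after(ROOT_ORDER, "z", "\u00e6")

# Possible alternative: anchor æ after the last "Z-like" letter instead.
alternative = tailor_after(ROOT_ORDER, "\u01b6", "\u00e6")

words = ["\u01b6", "\u00e6", "z", "y"]
sorted_current = sorted(words, key=sort_key(current))
sorted_alternative = sorted(words, key=sort_key(alternative))
```

With the first tailoring, sorted_current places æ between z and ƶ; with the
second, sorted_alternative places æ after ƶ.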
2. You can see what is tailored by clicking in the lower pane of the window
and searching for *. (Once you have selected one of the languages, calling
up a find box in IE doesn't search the lower pane -- you have to click in
that pane first. I don't know what NN does.)
3. The charts do not show certain special features, like the sorting of the
Japanese iteration mark or length mark, that are only apparent in the right
context.
4. If you are familiar with any of the languages on the charts, or know of
people who are, feedback would be appreciated.
Mark
—————
Ὀλίγοι ἔμφονες πολλῶν ἀφρόνων φοβερώτεροι — Πλάτωνος
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com
----- Original Message -----
From: "Markus Scherer" <markus.scherer@jtcsv.com>
To: "icu list" <icu@www-124.southbury.usf.ibm.com>; "unicode"
<unicode@unicode.org>; "unicore" <unicore@unicode.org>
Cc: <icu-announce@www-124.southbury.usf.ibm.com>
Sent: Friday, November 02, 2001 16:32
Subject: ICU 2.0 Collation charts online
> Dear ICU users,
>
> We have generated graphical charts that show the sorting order for many
locales with the ICU 2.0 data:
http://oss.software.ibm.com/icu/charts/collation/
> They are intended to give an easier-to-read overview of the sorting order
than the source data (which lives in CVS, in the locale-specific
icu/data/*.txt files).
>
> Please take a look at the charts and notify us of any problems, either via
email, or, if you are sure that something is wrong, by filing a bug. Please
see our Contacts page on http://oss.software.ibm.com/icu/archives/index.html
>
> Please note the following:
>
> - Currently, many more characters are shown in each chart than are
actually used in each language. This is because we show entire scripts with
all variations. In the future, we will need to collect lists of characters
that are actually used in a language in order to show simpler charts.
> However, with the complete script charts, you may be able to see
peculiarities that might be unintended.
>
> - You need to look at the actual collation weights (fly-over text) for the
actual sorting of characters that expand (red coloring). For example, a
sharp s (ß) sorts like ss but is shown as primary different from s (just
like ss itself is different from s). We do not currently have code for the
chart generation that automatically finds that ß is similar to ss and would
show a lower-level difference between those.
>
> - All of the collation sequences are based on the Unicode Collation
Algorithm table for "sorting everything". This means that many characters of
the particular language and all of the characters of other languages follow
the UCA order. We have a link to the UCA charts on unicode.org.
>
>
> Enjoy, and thank you very much for your help,
>
> markus
>
>
>
This archive was generated by hypermail 2.1.2 : Mon Nov 05 2001 - 12:14:04 EST