Thanks for your responses so far on this thread.
It is true that to handle an index consisting of an arbitrary mixture of
languages, we would certainly need more than 255 primary sort characters.
Who would want such an index is debatable, but as the world becomes more
internationalized, people may prefer a single mixed-language index to
many separate indexes, one for each language.
I think that it is possible to index Japanese and Chinese within the 255
character limitation, since they are generally indexed by hiragana and
a romanization (such as pinyin), respectively. On the other hand, when the
user forgets, for example, to supply the hiragana equivalent for some Kanji
that they are indexing, it must be handled gracefully, and one solution is
just to index it directly as the Kanji, even though no one wants this
result in actual typographic practice.
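To make the fallback concrete, here is a minimal Python sketch of the idea, not our actual code. The function name index_sort_key and the (term, reading) pair layout are hypothetical, and it compares raw code points rather than real collation weights.

```python
# Hypothetical sketch: sort an index entry by its supplied reading,
# falling back to the term itself when no reading was given.

def index_sort_key(term, reading=None):
    """Return the string used to order this entry in the index."""
    # An empty or missing reading falls back to the raw term (e.g. Kanji).
    return reading if reading else term

entries = [
    ("漢字", "かんじ"),   # hiragana reading supplied
    ("索引", None),       # reading forgotten: indexed directly as Kanji
    ("辞書", "じしょ"),   # hiragana reading supplied
]

# Sort by code point of the chosen key; a real index would use
# collation weights instead.
ordered = sorted(entries, key=lambda e: index_sort_key(e[0], e[1]))
ordered_terms = [term for term, _ in ordered]
```

With code-point ordering, the entry lacking a reading lands after all the hiragana-keyed entries, so nothing is dropped even though the placement is not what a typographer would choose.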
Anyway, my colleague who is doing the actual coding on this area of our
project decided to give us the best mix of efficiency and usability.
From his response to Jim Agenbroad:
According to our examples, and the Unicode Standard 2.0 (page 6-62),
the "unit of collation" for Korean is the Hangul syllable block (the
molecule). However, the jamo (the phonetic atoms) can be used for a
binary sort, after the syllables are decomposed. This allows the
number of sorting weights to be far less than 255.
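As a rough illustration of the decomposition just described (a sketch, not our actual code), the following Python assumes the precomposed Hangul syllables occupy U+AC00 through U+D7A3, as in Unicode 2.0, and uses the standard arithmetic decomposition. The names jamo_weights and hangul_sort_key are made up for this example.

```python
# Sketch: decompose precomposed Hangul syllables into jamo indices
# and use those indices as sorting weights.

S_BASE = 0xAC00                   # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT       # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT       # 11172 syllables in total

def jamo_weights(ch):
    """Return (lead, vowel, trail) indices, or None if ch is not a Hangul syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    lead = s_index // N_COUNT
    vowel = (s_index % N_COUNT) // T_COUNT
    trail = s_index % T_COUNT     # 0 means no trailing consonant
    return (lead, vowel, trail)

def hangul_sort_key(word):
    """Build a sort key of jamo weights; non-Hangul characters sort after Hangul."""
    key = []
    for ch in word:
        w = jamo_weights(ch)
        if w is None:
            key.append((1, ord(ch), 0, 0))   # assumption: fall back to code point
        else:
            key.append((0,) + w)
    return key

sorted_words = sorted(["나", "가", "다"], key=hangul_sort_key)
```

Only 19 + 21 + 28 = 68 distinct weights are needed for the syllables, which is why the count stays well under 255.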
The headings seem to be the leading consonants (kiyok, niun, tigut,
I am not familiar with Amharic script, and I find no mention of it in
the Unicode Standard. My almanac tells me Amharic is spoken in
Ethiopia. I have modified our algorithm to switch automatically
between one-byte and two-byte weights, so we will be able to
accommodate whatever Amharic requires. So far, we don't seem to have
much demand for it.
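One way the automatic switch between one-byte and two-byte weights could look is sketched below in Python. This is an assumption about the general approach, not a description of our implementation; choose_width and encode_key are hypothetical names, and the width is chosen once for the whole weight table so that all keys stay comparable.

```python
# Sketch: pick a weight width for the whole table, then encode every
# sort key with that width so byte-wise comparison preserves order.

def choose_width(all_weights):
    """Return 1 if every weight fits in one byte, otherwise 2."""
    return 1 if max(all_weights, default=0) <= 0xFF else 2

def encode_key(weights, width):
    """Encode a sequence of weights as fixed-width big-endian bytes."""
    # Big-endian keeps numeric order and byte order in agreement.
    return b"".join(w.to_bytes(width, "big") for w in weights)

small_table = [1, 200, 255]
large_table = [1, 300]

small_width = choose_width(small_table)   # fits in one byte
large_width = choose_width(large_table)   # needs two bytes
large_key = encode_key(large_table, large_width)
```

A script that needs more than 255 primary weights then only costs the extra byte in indexes that actually use it.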
Responding for Gary Grosso,
Ann Arbor MI, USA
I would like to add that we find (mostly "lurking" on) this list very helpful.
Gary Grosso ArborText, Inc. Ann Arbor, MI, USA firstname.lastname@example.org