Re: indexing of various languages

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 25 1997 - 12:03:48 EDT


Gary Grosso asked:
>
> This is not strictly on the topic of Unicode, but many on this list are
> knowledgeable about editing/typography of many of the world's languages.
> Also, I would be happy to get pointers to other sources.
>
> My question is this: for reasons of streamlining our implementation, we
> would like to limit the number of primary sort characters to 255. Does
> anyone know of any language where the generally accepted indexing practice
> would have more than 255 distinct primary weights, or index groupings?

Chinese and Japanese. Probably also languages whose writing systems
are based on large syllabic scripts (Amharic, for example), though I
don't expect these will be of immediate commercial interest to you.

In my opinion, a better design point for the primary weight for multiple-
level sortkeys is to use a 16-bit value, which can contain 65536 distinct
values. The rationale for this, which would seem to be complete
overkill for the typical European language collation weighting, is that
it allows for definition of a default primary weighting for all of
Unicode (or whatever subset of it makes sense for your implementations).
Primary collation weightings for particular languages can then be
defined as minor deltas off the default weighting. This gives you the
situation where you get reasonable default collation behavior for
multilingual text while still getting culturally correct collation for
the particular target language.
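
As a rough illustration of that design, here is a minimal sketch in Python.
It is not taken from any actual implementation: the table contents, the
weight values, and the names DEFAULT_PRIMARY, SWEDISH_DELTA, UNASSIGNED and
primary_key are all assumptions made up for the example. It covers only the
primary level and only a tiny repertoire.

```python
import struct

# Hypothetical default primary weights. Base letters are spaced 0x10 apart so
# that a tailoring has room to place new weights between or after them.
DEFAULT_PRIMARY = {chr(ord("a") + i): 0x0100 + 0x10 * i for i in range(26)}

# In this made-up default table, accented letters share the primary weight of
# their base letter (they would differ only at a later level, not shown here).
DEFAULT_PRIMARY.update({
    "\u00e5": DEFAULT_PRIMARY["a"],  # a with ring above
    "\u00e4": DEFAULT_PRIMARY["a"],  # a with diaeresis
    "\u00f6": DEFAULT_PRIMARY["o"],  # o with diaeresis
})

# Assumed catch-all weight for characters missing from the table.
UNASSIGNED = 0xFFFF

# Hypothetical Swedish tailoring, stored as a delta: only the entries that
# differ from the default. Swedish sorts these three letters after z.
SWEDISH_DELTA = {
    "\u00e5": DEFAULT_PRIMARY["z"] + 1,
    "\u00e4": DEFAULT_PRIMARY["z"] + 2,
    "\u00f6": DEFAULT_PRIMARY["z"] + 3,
}


def primary_key(text, delta=None):
    """Build a primary-level sort key of big-endian 16-bit weights."""
    table = DEFAULT_PRIMARY if delta is None else {**DEFAULT_PRIMARY, **delta}
    weights = [table.get(ch, UNASSIGNED) for ch in text.lower()]
    # Big-endian packing makes byte-wise comparison agree with weight order.
    return struct.pack(f">{len(weights)}H", *weights)


words = ["zebra", "\u00e4rlig", "apa", "\u00f6l", "ost"]
default_order = sorted(words, key=primary_key)
swedish_order = sorted(words, key=lambda w: primary_key(w, SWEDISH_DELTA))
```

With the default table the words starting with the accented letters sort
among the a's and o's; with the Swedish delta applied they move after
"zebra". The tailoring itself is three entries, while the default table can
grow to cover as much of Unicode as the implementation needs.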

--Ken Whistler


