Re: sequences and stuff

From: Brendan Murray/DUB/Lotus
Date: Thu Nov 30 2000 - 08:30:54 EST

Branislav Tichy wrote:
> b) there are compound words, which have these sequences on a word border,
> and in this case, they stand for two separate graphemes and _are_ sorted
> as c+h, d+z and so forth.
> the proper collation algorithm would therefore have to realise (imho),
> whether there is one or two graphemes (whether the word is compound)!

There are similar situations in many languages. Possibly more complicated
is the use of graphemes which usually contract but don't in some cases. For
example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
as å (a-ring), after ø (o-slash), but in other situations, particularly in
names, the "aa" is really "a"+"a", and should be sorted before "b". How can
this be catered for algorithmically?
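
To make the problem concrete, here is a rough Python sketch of a sort key
that always contracts "aa" to å. It is only an illustration under stated
assumptions: the alphabet table is simplified to lowercase a-z plus æ, ø, å,
the helper name danish_key is made up, and "baal" is an assumed example of a
word whose "aa" is meant as "a"+"a". A real collation has further levels
(case, accents) that are ignored here.

```python
# Simplified Danish order: a..z, then æ, ø, å (assumption: lowercase
# letters only, single-level comparison).
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key(word):
    """Hypothetical sort key that ALWAYS treats "aa" as å."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("aa", i):
            key.append(RANK["å"])  # contraction: "aa" weighs as å
            i += 2
        else:
            # Characters outside the table sort after everything else.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return tuple(key)


words = ["gaard", "gøre", "gade", "gaza"]
print(sorted(words, key=danish_key))
# → ['gade', 'gaza', 'gøre', 'gaard']

# The unconditional contraction also fires where "aa" is meant as a+a,
# so the assumed example "baal" lands after "bo" instead of before "bb".
print(danish_key("baal") > danish_key("bo"))
# → True
```

The first result is what the traditional rule wants for "gaard"; the second
shows the same rule misplacing a word where the two letters are separate.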

My guess is that there are only two possible solutions:
   1. use an exceptions list, or
   2. break the grapheme with some marker like ZWNJ to prevent the
      contraction.

Obviously the first creates a maintenance nightmare, and the second means
the data somehow has to be tagged with the marker when it is stored. In any
case there's no simple solution.
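
For what it's worth, both options can be sketched in a few lines of Python.
This is a guess at how they might look, not how any real collation library
does it: the alphabet table is simplified (lowercase a-z plus æ, ø, å, one
comparison level), the names sort_key and EXCEPTIONS are invented, and "baal"
is an assumed example of a word whose "aa" should stay "a"+"a".

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Simplified Danish order: a..z, then æ, ø, å (assumption).
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

# Option 1: hypothetical exceptions list, maintained by hand.
EXCEPTIONS = frozenset({"baal"})


def sort_key(word, exceptions=frozenset()):
    """Contract "aa" to å unless the word is listed or ZWNJ intervenes."""
    word = word.lower()
    contract = word not in exceptions  # option 1: whole-word override
    key = []
    i = 0
    while i < len(word):
        if word[i] == ZWNJ:
            i += 1  # option 2: marker carries no weight, it only splits "aa"
        elif contract and word.startswith("aa", i):
            key.append(RANK["å"])
            i += 2
        else:
            # Characters outside the table sort after everything else.
            key.append(RANK.get(word[i], len(ALPHABET)))
            i += 1
    return tuple(key)


plain = "baal"
marked = "ba" + ZWNJ + "al"  # the stored data itself carries the marker

print(sort_key(plain))              # → (1, 28, 11)    contracted to å
print(sort_key(plain, EXCEPTIONS))  # → (1, 0, 0, 11)  option 1
print(sort_key(marked))             # → (1, 0, 0, 11)  option 2
```

The sketch shows the trade-off: option 1 needs no change to the data but the
list has to be kept up to date, while option 2 needs no list but only works
if the marker is present in every stored occurrence.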

