Re: sequences and stuff

From: Mark Davis
Date: Thu Nov 30 2000 - 12:17:45 EST

The soft hyphen is not sufficient, since in other languages the case where
two letters must be distinguished in collation may not fall on a syllable
boundary, or allow hyphenation between them.

The UTC looked at all the possible existing boundary-control characters;
none of them really work for this problem since they all have other
functions that may conflict. There was a proposal for a grapheme-break and
grapheme-join pair of additional "Cf" characters. The UTC accepted the
second one, and will be working with WG2 on it.


IMO, both are useful in different situations. The grapheme-break is more
useful in the situation you cite: marking the exceptional words having
characters that should not be considered a single grapheme in collation
(and, perhaps, in pronunciation: e.g. "Bathill").

----- Original Message -----
From: "Keld Jørn Simonsen" <>
To: "Unicode List" <>
Cc: "Unicode List" <>
Sent: Thursday, November 30, 2000 07:43
Subject: Re: sequences and stuff

> On Thu, Nov 30, 2000 at 05:18:59AM -0800, Brendan Murray/DUB/Lotus wrote:
> >
> > Branislav Tichy <> wrote:
> > > b) there are compound words, which have these sequences on a word
> > > boundary, and in this case they stand for two separate graphemes and
> > > _are_ sorted as c+h, d+z, and so forth.
> > > The proper collation algorithm would therefore have to realise
> > > whether there is one grapheme or two (whether the word is compound)!
> >
> > There are similar situations in many languages. Possibly more common
> > is the use of graphemes which usually contract but don't in some cases. For
> > example, the "aa" sequence as in "gaard" in Danish is traditionally sorted
> > as å (a-ring), after ø (o-slash), but in other situations, particularly
> > names, the "aa" is really "a"+"a", and should be sorted before "b". How can
> > this be catered for algorithmically?
> Yes, the Slovak problem may look like the Danish "aa" problem.
> Just for the record, "aa" normally means "å" in Danish names,
> e.g. Søndergaard is the last name of one of the people who
> has been responsible for SC2 matters in Danish Standards.
> "gaard" is pronounced like "gård". I have no examples off the top of my head of
> Danish names where "aa" actually means two a's, pronounced as two sounds.
> The rule from the Danish orthography book is that if the two
> a's are pronounced as two sounds, they are also sorted as two sounds, as
> two A's. If they are pronounced as one sound, then they are sorted as an "å"
> (irrespective of whether the sound is an "a" sound).
> > My guess is that there are only two possible solutions:
> > 1. use an exceptions list, or
> > 2. break the grapheme with some marker like ZWNJ to prevent the
> > contraction.
> >
> > Obviously the first creates a maintenance nightmare, and the latter has to
> > be somehow tagged to store the data correctly. In any case there's no
> > simple solution.
> >
> The two a sounds occur in combined words, like ekstraarbejde (extra work).
> The recommendation from Danish Standards is to introduce a soft hyphen (SHY)
> between the a's. This also works for ISO 8859-1.
> Keld
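
To make the "aa" case concrete, here is a rough sketch of a sort key in which
"aa" contracts to "å" unless an ignorable marker sits between the two a's. It
is a toy, not the UCA or any real library's tailoring: the alphabet table, the
sort_key name, and the choice of SOFT HYPHEN (U+00AD) as the marker are all
assumptions made for illustration. SHY is used only because it is the marker
Keld mentions; any other ignorable boundary character would behave the same
way in this sketch.

```python
# Assumption: SOFT HYPHEN stands in for the contraction-breaking marker.
SHY = "\u00ad"

# Toy alphabet: a..z followed by the Danish letters æ, ø, å (å sorts last).
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
WEIGHT = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word):
    """Return a list of weights for a word (hypothetical helper).

    "aa" is treated as a contraction with the weight of "å", unless the
    marker separates the two a's. The marker itself carries no weight.
    """
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text[i] == SHY:
            i += 1  # ignorable: contributes nothing to the key
        elif text.startswith("aa", i):
            key.append(WEIGHT["å"])  # contraction: one grapheme
            i += 2
        else:
            # Characters outside the toy alphabet sort after everything else.
            key.append(WEIGHT.get(text[i], len(ALPHABET)))
            i += 1
    return key


words = ["gaard", "gud", "gøre", "galt"]
print(sorted(words, key=sort_key))
```

With this key "gaard" and "gård" compare equal, while "ekstra" + SHY +
"arbejde" sorts with two plain a's rather than with an "å". The cost is the
one already noted in the thread: the marker has to be stored in the data.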
