RE: [OT] o-circumflex

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Sep 10 2001 - 09:58:05 EDT


> On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > Asmus Freytag wrote:
> > > But if you do this, all compound words starting with "data" and
> > > continuing with another word starting with "a" will be sorted
> > > incorrectly!
> > >
> > > To achieve this effect, you would have to mark which AAs are
> > > A-Rings and which ones are accidental adjacencies. In Danish
> > > one can use the SHY (soft hyphen) [...]
> >
> > Real-life sort orders often ignore these subtleties and are often
> > based on a small set of rules which is applied blindly, regardless
> > of the origin, meaning, or pronunciation of headwords.
> >
>
> Real-life sorts, like MS Windows sorting or Linux sorting, actually
> adheres to these Danish rules, once you have set up your machine for
> Danish.

If I understand what you mean, perhaps my point was not clear.

I know that "aa" sorts like "å", and that it should go after "z". But there
are also cases when the sequence "aa" is just two a's, adjacent to each
other by pure chance.

One of these cases could be the word "dataarkiv", which I found in a Danish
web page
(http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).

Now: if your Windows or Linux collation states (correctly!) that "aa"
should go after "z", you may have a list ordered like this:

        Order A:
                1. data
                2. Datben, Dr. Keld
                3. Datz, Mr. Marco
                4. dataarkiv
                5. Datåz, Dr. Asmus

But if "dataarkiv" were written using an invisible separator between the two
a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
like this:

        Order B:
                1. data
                2. dataarkiv
                3. Datben, Dr. Keld
                4. Datz, Mr. Marco
                5. Datåz, Dr. Asmus

Asmus was arguing that Order B would be the correct one (and this is
certainly true in, e.g., a dictionary) but, in order to obtain it, the
source text must be properly encoded with invisible separators inserted
where needed.
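
To make the two orders concrete, here is a minimal Python sketch. It is not
how Windows or Linux actually implement Danish collation; the names
danish_key, ALPHABET and RANK are hypothetical, and the rules are simplified
on purpose (case, spaces and punctuation are ignored, and "aa" is blindly
treated as "å" unless something stands between the two a's).

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # invisible separator used to mark an accidental "aa"

# Simplified Danish alphabet (assumption): a-z, followed by æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def danish_key(text):
    """Sort key that blindly treats every adjacent "aa" as "å"."""
    s = unicodedata.normalize("NFC", text).lower()
    key = []
    i = 0
    while i < len(s):
        if s.startswith("aa", i):
            # Two directly adjacent a's collate as a-ring, i.e. after "z".
            key.append(RANK["å"])
            i += 2
            continue
        if s[i] in RANK:
            key.append(RANK[s[i]])
        # Anything else (soft hyphen, spaces, punctuation) is ignored,
        # but it has already prevented the "aa" match above.
        i += 1
    return tuple(key)


names = ["Datåz, Dr. Asmus", "Datz, Mr. Marco", "Datben, Dr. Keld", "data"]

# Plain spelling: "dataarkiv" lands after "Datz" (Order A).
order_a = sorted(names + ["dataarkiv"], key=danish_key)

# Spelling with a soft hyphen between the a's: it lands after "data" (Order B).
order_b = sorted(names + ["data" + SOFT_HYPHEN + "arkiv"], key=danish_key)
```

The only difference between the two runs is the invisible character in the
source text, which is exactly why Order B depends on how the text was encoded.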

What I was saying is that the "automatic" Order A is also often used, and I
brought up the examples of the Dutch phone directories (where "Beijing" is
sorted as if it were "Beying"), and of the Italian encyclopedia (where
"Jefferson" is sorted as if it were "Iefferson").
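
This kind of blind rule can be sketched in Python as a plain substitution
applied before comparing. The function names and the sample entries other
than "Beijing" and "Jefferson" are invented for illustration, and real
directories certainly apply more rules than this.

```python
def dutch_directory_key(name):
    # Assumed rule: every "ij" is filed as if it were "y".
    return name.lower().replace("ij", "y")


def italian_encyclopedia_key(name):
    # Assumed rule: every "j" is filed as if it were "i".
    return name.lower().replace("j", "i")


# "Beijing" is filed as "beying", so it falls between "bex-" and "bez-".
dutch = sorted(["Bezem", "Beijing", "Bexter"], key=dutch_directory_key)

# "Jefferson" is filed as "iefferson", so it falls among the I entries.
italian = sorted(["Igloo", "Jefferson", "Idro"], key=italian_encyclopedia_key)
```

Neither key looks at the origin or pronunciation of the word; the rule is
applied to the letters alone.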

Michael (michka) Kaplan wrote:
> And this is the *true* answer to the whole mess of attempting
> *multilingual* sorts -- once the user chooses the sort they
> WANT, the system might handle other language strings in a
> way that might be obscure to those who know the other
> language but the person who expected Danish or whatever
> will see what they want.

And this is precisely what I was trying to say, although I was not
necessarily talking about multilingual sort ("dataarkiv" seems a purely
Danish word, although derived from Latin roots).

For some users and some usages, the "incorrect" Order A may be much more
useful than the "correct" Order B. If the rule says that "ij" goes between
"x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
and "bez-".

If someone wants Order B (as may be the case for the author of a
dictionary), then they should apply Asmus' suggestion in order to drive the
collation algorithm.

_ Marco


