On Mon, Sep 10, 2001 at 03:58:05PM +0200, Marco Cimarosti wrote:
> > On Mon, Sep 10, 2001 at 11:09:28AM +0200, Marco Cimarosti wrote:
> > > Asmus Freytag wrote:
> > > > But if you do this, all compound words starting with "data" and
> > > > continuing with another word starting with "a" will be sorted
> > > > incorrectly!
> > > >
> > > > To achieve this effect, you would have to mark which AAs are
> > > > A-Rings and which ones are accidental adjacencies. In Danish
> > > > one can use the SHY (soft hyphen) [...]
> > >
> > > Real-life sort orders often ignore these subtleties and are often
> > > based on a small set of rules which is applied blindly, regardless
> > > of the origin, meaning, or pronunciation of headwords.
> > >
> >
> > Real-life sorts, like MS Windows sorting or Linux sorting, actually
> > adhere to these Danish rules, once you have set up your machine for
> > Danish.
>
> If I understand what you mean, perhaps my point was not clear.
My point was that real-life sorts nowadays are quite sophisticated,
and the major systems have adequate sorting for Danish and other
languages with that kind of complexity.
> I know that "aa" sorts like "å", and that it should go after "z". But there
> are also cases when the sequence "aa" is just two a's, adjacent to each
> other by pure chance.
>
> One of these cases could be the word "dataarkiv", which I found in a Danish
> web page
> (http://www.riksarkivet.no/nordiskarknytt/98-nr4/institusjonen.html).
Yes, and ekstraarbejde - extra work. I know.
> Now: if your Windows or Linux collation states (correctly!) that "aa"
> should go after "z", you may have a list ordered like this:
>
> Order A:
> 1. data
> 2. Datben, Dr. Keld
> 3. Datz, Mr. Marco
> 4. dataarkiv
> 5. Datåz, Dr. Asmus
>
> But if "dataarkiv" was written using an invisible separator between the two
> a's (e.g. a soft hyphen, or a zero width non joiner), then your list would be
> like this:
>
> Order B:
> 1. data
> 2. dataarkiv
> 3. Datben, Dr. Keld
> 4. Datz, Mr. Marco
> 5. Datåz, Dr. Asmus
>
> Asmus was arguing that Order B would be the correct one (and this is
> certainly true in, e.g., a dictionary) but, in order to obtain it, the
> source text must be properly encoded with invisible separators inserted
> where needed.
Yes, that is also my advice.
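
To make the difference between Order A and Order B concrete, here is a minimal sketch in Python. It is not how Windows or Linux implement Danish collation. The names danish_key, DANISH_ALPHABET and RANK are hypothetical, and the rules are simplified assumptions: "aa" is treated exactly like "å", case is ignored, and spaces, punctuation and the soft hyphen carry no weight.

```python
SOFT_HYPHEN = "\u00ad"

# Simplified Danish alphabet order: æ, ø and å come after z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: position for position, letter in enumerate(DANISH_ALPHABET)}


def danish_key(text):
    """Return a sort key under the simplified rules described above."""
    folded = text.casefold()
    # "aa" counts as "å" only when the two a's are directly adjacent;
    # a soft hyphen between them prevents the match.
    folded = folded.replace("aa", "å")
    # Characters outside the alphabet (soft hyphen, spaces, punctuation)
    # are ignored in this sketch.
    return [RANK[ch] for ch in folded if ch in RANK]


names = ["data", "Datben, Dr. Keld", "Datz, Mr. Marco", "Datåz, Dr. Asmus"]

# Same key function both times; only the encoding of "dataarkiv" differs.
order_a = sorted(names + ["dataarkiv"], key=danish_key)
order_b = sorted(names + ["data" + SOFT_HYPHEN + "arkiv"], key=danish_key)

print(order_a)
print(order_b)
```

With the plain spelling, "dataarkiv" lands after "Datz" (Order A). With the soft hyphen it lands directly after "data" (Order B).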
> What I was saying is that the "automatic" Order A is also often used, and I
> brought the example of the Dutch phone directories (where "Beijing" is
> sorted as if it was "Beying"), and of the Italian encyclopedia (where
> "Jefferson" is sorted as if it was "Iefferson").
You have to sort according to the expectations of the user.
A Dutch book would use Dutch rules, an Italian book would use
the Italian order. You cannot mix orderings, such that some words follow
one set of rules and other words follow other rules. It all needs
to be comprehended by one human, the reader, and for that reader only
one ruleset applies.
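
As a rough illustration of "one ruleset per list", the sketch below builds one key function per ruleset and applies it to the whole list. The helper make_key and the two rulesets are hypothetical. They only encode the two conventions mentioned in this thread ("ij" filed as "y" in a Dutch phone directory, "j" filed as "i" in an Italian encyclopedia) and are not official collation rules. The sample entries other than "Beijing" and "Jefferson" are made up.

```python
def make_key(substitutions):
    """Build a sort key that applies one ruleset to every entry."""
    def key(text):
        folded = text.casefold()
        for sequence, replacement in substitutions:
            folded = folded.replace(sequence, replacement)
        # Keep letters only; spaces and punctuation are ignored here.
        return [ch for ch in folded if ch.isalpha()]
    return key


# Assumed rulesets, taken only from the examples in this thread.
dutch_directory_key = make_key([("ij", "y")])
italian_encyclopedia_key = make_key([("j", "i")])

dutch_list = sorted(["Bezem", "Beijing Restaurant", "Bex"],
                    key=dutch_directory_key)
italian_list = sorted(["Igloo", "Jefferson", "Idea"],
                      key=italian_encyclopedia_key)

print(dutch_list)
print(italian_list)
```

Each list is sorted by exactly one key function from top to bottom; the reader picks the ruleset, not the individual word.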
>
> Michael (michka) Kaplan wrote:
> > And this is the *true* answer to the whole mess of attempting
> > *multilingual* sorts -- once the user chooses the sort they
> > WANT, the system might handle other language strings in a
> > way that might be obscure to those who know the other
> > language but the person who expected Danish or whatever
> > will see what they want.
>
> And this is precisely what I was trying to say, although I was not
> necessarily talking about multilingual sort ("dataarkiv" seems a purely
> Danish word, although derived from Latin roots).
>
> For some users and some usages, the "incorrect" Order A may be much more
> useful than the "correct" Order B. If the rules say that "ij" goes between
> "x" and "z", a Dutchman should find the "Beijing Restaurant" between "bex-"
> and "bez-".
>
> If someone wants Order B (as may be the case for the author of a
> dictionary), then they should apply Asmus' suggestion in order to drive the
> collation algorithm.
I think we agree, but what you call a "simple set of rules" I call "quite complex".
I also think that the Danish rules are quite simple, as they can be formulated
in, say, 4 lines of Danish prose. But compared to ASCII sorting they are, to some
people, unbelievably complex, and I think many Danes believe that you cannot get
programs that adhere to them, although the major systems do so out of the box.
Your incorrect and correct examples use the very same sorting algorithm; the only
difference is that the data is coded differently.
But maybe you are aiming for a yet more complex sorting, one that can sort
according to multiple rules? Should Beijing then not be sorted as Beÿing?
As stated above, I think, and other sorting experts do too, that sorting
with multiple rules is a conceptual misunderstanding.
Kind regards
Keld
This archive was generated by hypermail 2.1.2 : Mon Sep 10 2001 - 11:46:28 EDT