Re: [OT] o-circumflex

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Sep 07 2001 - 14:50:44 EDT


At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
>As a practical matter, you need to take the diacritics into account when
>sorting, even in English where they (may or may not) have linguistic
>significance, otherwise you'll get nondeterministic behaviour. In other
>words, résumé and resume should fall together, but always in the same order.

Stated absolutely, this is patent but oft-repeated nonsense. For example,
it does not always make sense for a list of names. An old friend of mine, Jon
Proppe, who is an Icelandic art critic, spells his name with an accent
grave on the first o and an acute accent on the e. In a campus directory of
the US university he attended (assuming it did not strip the accents), it
would make no sense to have his name show up after all the Proppes, or all
the Jons without an accent (depending on whether it's sorted by first or
last name).

If I sort a list of single words which contains non-unique entries, a
stable sort would sort the non-unique subsets in the order of their
appearance in the input. If it's not important to distinguish between naive
and naïve (e.g. in a machine-generated index that spans multiple documents
with differences in the use of accents), it's hard to see what's gained by
splitting the list in two for this case.
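
To make the stable-sort case concrete, here is a minimal Python sketch. It
assumes that stripping combining marks after NFD decomposition is a good
enough stand-in for an accent-insensitive comparison; it is not a full
collation implementation, and the helper names strip_accents and
primary_key are made up for illustration.

```python
import unicodedata

def strip_accents(text):
    # Decompose to NFD, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def primary_key(text):
    # Hypothetical accent- and case-insensitive key; an approximation only.
    return strip_accents(text).casefold()

words = ["naïve", "zebra", "naive", "apple", "naïve"]

# sorted() is stable: entries with equal keys keep their input order.
sorted_words = sorted(words, key=primary_key)
print(sorted_words)
```

The accented and unaccented spellings end up in one run, in the order they
appeared in the input, and the result is fully determined by that input.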

On the other hand, if San Jose and San José are correctly and consistently
distinguished in my input, they should probably sort separately.
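
For this second case, one possible sketch is to add the accents back as a
tie-breaker, so that the two spellings form separate runs. As before, the
accent stripping via NFD is an assumption standing in for real collation,
and the names primary_key and two_level_key are hypothetical.

```python
import unicodedata

def primary_key(text):
    # Accent- and case-insensitive key (approximation via NFD stripping).
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def two_level_key(text):
    # Compare base letters first; use the NFC-normalized original spelling
    # only to break ties between entries with the same base letters.
    return (primary_key(text), unicodedata.normalize("NFC", text))

places = ["San José", "San Diego", "San Jose", "San José", "San Jose"]
separated = sorted(places, key=two_level_key)
print(separated)
```

Whether to use the one-level or the two-level key is then a choice made per
list, not something the sort itself can decide.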

The two cases of resume are different yet again, as noted, since one could
be a verb form.

It all depends not on whether a distinction can be made, but whether it is
meaningful in the context of the list being sorted.

A./


