RE: [OT] o-circumflex

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Fri Sep 07 2001 - 20:10:42 EDT


Asmus,

You are quite correct; that is why Unicode supports differing collation
strengths. Sometimes you only care about the actual letters, without
diacritics. But even then letters are locale sensitive. For example, the
Danish alphabet starts with A and ends with A with ring above (Å). A Dane
would look for Ålborg near the end of a list of towns. It is like having
the Spanish ch follow cz.
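
To make the Danish case concrete, here is a minimal Python sketch. The
names DANISH_ALPHABET, danish_key and accent_blind_key are made up for
illustration, and the alphabet string is a simplified assumption (it
ignores conventions such as treating "aa" like the ring letter). A real
application would use a proper locale-aware collation library instead.

```python
import unicodedata

# Assumed alphabet order for illustration: a-z, then ae, o-slash, a-ring.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
RANK = {letter: index for index, letter in enumerate(DANISH_ALPHABET)}


def danish_key(word):
    """Sort key that ranks each letter by its position in the assumed alphabet."""
    composed = unicodedata.normalize("NFC", word).casefold()
    # Characters outside the alphabet (spaces, hyphens) all rank last here.
    return [RANK.get(ch, len(RANK)) for ch in composed]


def accent_blind_key(word):
    """Sort key that drops combining marks, so the ring letter sorts as plain a."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


towns = ["Odense", "\u00c5lborg", "Esbjerg", "Vejle"]
by_danish = sorted(towns, key=danish_key)
by_base = sorted(towns, key=accent_blind_key)
```

With the Danish key the town with the ring letter lands at the end of the
list; with the accent-blind key it lands at the front, which is not where
a Dane would look.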

By providing for different types of collation one can meet the user's
expectations.
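
The idea of collation strengths can be sketched in a few lines of Python.
This is only a rough approximation under my own assumptions, not the
Unicode Collation Algorithm: collation_key and its strength parameter are
hypothetical names, and the ordering within each level is by code point
rather than by real collation weights.

```python
import unicodedata


def collation_key(text, strength=3):
    """Build a tuple key with up to three levels of difference.

    Level 1 compares base letters only, level 2 adds diacritics,
    level 3 adds case. Lower strengths ignore the later levels.
    """
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.casefold()          # letters only, no accents, no case
    secondary = decomposed.casefold()  # accents kept, case ignored
    tertiary = decomposed              # accents and case both kept
    return (primary, secondary, tertiary)[:strength]


words = ["r\u00e9sum\u00e9", "Resume", "resume"]

# At strength 1 all three words compare equal, so a stable sort
# leaves them in input order.
level1 = sorted(words, key=lambda w: collation_key(w, 1))

# At strength 3 the accented and capitalized forms are separated.
level3 = sorted(words, key=lambda w: collation_key(w, 3))
```

The caller picks the strength that matches what the user expects to be
treated as "the same word" in that particular list.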

Then of course you have search, display and sort differences. If I am
looking for Istanbul, it is probably OK, even for Turkish locales, to match
it to the Turkish spelling, which uses a dotted capital I (İstanbul).
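
A loose search match of this kind might look like the following Python
sketch. The function names search_key and loosely_matches are
hypothetical, and stripping every combining mark is an assumption that
suits searching but would be too lossy for sorting or display.

```python
import unicodedata


def search_key(text):
    """Fold text for loose matching: drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    # The dotted capital I (U+0130) decomposes to "I" plus a combining
    # dot above, so removing the marks leaves a plain "I".
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def loosely_matches(query, candidate):
    """Return True when the two strings are equal after folding."""
    return search_key(query) == search_key(candidate)


found = loosely_matches("Istanbul", "\u0130stanbul")
```

Note that the marks are removed before case folding. The dotless
lowercase i (U+0131) has no decomposition, so this sketch does not fold
it to a plain i.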

For languages that stack multiple diacritics, like Vietnamese, you have
another set of rules, and you had better have normalized data.
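
As an illustration of why normalization matters here, the Python sketch
below spells the same Vietnamese word two ways: once with a precomposed
letter and once with a base letter followed by two combining marks. The
variable names are mine; unicodedata.normalize is the standard library
call.

```python
import unicodedata

# "Viet" with circumflex and dot below on the e, precomposed (U+1EC7).
precomposed = "Vi\u1ec7t"

# The same word as e + combining circumflex + combining dot below.
stacked = "Vie\u0302\u0323t"

# The raw strings differ even though they look identical on screen.
raw_equal = precomposed == stacked

# After normalizing both to NFC they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", stacked)
)

nfc_length = len(unicodedata.normalize("NFC", stacked))
```

Comparing or sorting the raw strings would treat the two spellings as
different words, so it is safer to normalize once on input.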

In Arabic, do you include vowels or not?

I remember your discussions of Greek where there are other considerations.

Carl

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
> Behalf Of Asmus Freytag
> Sent: Friday, September 07, 2001 11:51 AM
> To: David Gallardo; Ayers, Mike; 'David Starner'; unicode@unicode.org
> Subject: Re: [OT] o-circumflex
>
>
> At 01:06 PM 9/7/01 -0400, David Gallardo wrote:
> >As a practical matter, you need to take the diacritics into account when
> >sorting, even in English where they (may or may not) have linguistic
> >significance, otherwise you'll get nondeterministic behaviour. In other
> >words, résumé and resume should fall together, but always in the same
> >order.
>
> Stated absolutely, this is patent, but oft-repeated nonsense. For example,
> it does not always make sense for a list of names. An old friend of mine,
> Jon Proppe, who is an Icelandic art critic, spells his name with an accent
> grave on the first o and an acute accent on the e. In a campus directory
> of the US university he attended (assuming it did not strip the accents),
> it would make no sense to have his name show up after all the Proppes, or
> all the Jons without an accent (depending on whether it's sorted by first
> or last name).
>
> If I sort a list of single words which contains non-unique entries, a
> stable sort would sort the non-unique subsets in the order of their
> appearance in the input. If it's not important to distinguish between
> naive and naïve (e.g. in a machine generated index that spans multiple
> documents with differences in the use of accents) it's hard to see what's
> gained in splitting the list in two for this case.
>
> On the other hand, if San Jose and San José are correctly and consistently
> distinguished in my input, they should probably sort separately.
>
> The two cases of resume are different yet again, as noted, since one could
> be a verb form.
>
> It all depends not on whether a distinction can be made, but whether it is
> meaningful in the context of the list being sorted.
>
> A./


