RE: UCD 3.1, Final Beta - Case folding

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Mar 06 2001 - 12:48:55 EST


Antoine,

One difference between upper/lower case shifting and case folding is that case folding is locale-less.

This gives the same result as an upper case shift followed by a lower case shift in a locale that has no special casing rules, such as English or French.
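To make that concrete, here is a small sketch in Python (my choice of language for illustration, not something used in this thread). Python's built-in str.casefold() takes no locale argument, and for the sample words below, which are only assumptions picked for the example, it agrees with an upper-then-lower round trip.

```python
# Locale-less case folding versus an upper-then-lower round trip.
# str.casefold() takes no locale argument at all.
samples = ["Unicode", "FRANÇAIS", "Straße"]

for word in samples:
    folded = word.casefold()
    round_trip = word.upper().lower()
    # The last column shows whether both approaches agree for this word.
    print(word, folded, round_trip, folded == round_trip)
```

This holds only for text without locale-sensitive letters; the Turkish and Lithuanian cases discussed in this message are exactly where the two can part ways.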

You cannot just remove accents, especially in a locale-less function. Sometimes the accent makes it a separate letter. Removing the ring above the A in Danish probably would not create too many mismatches, but it would mess up sorting sequences (A with ring above is the last letter in the alphabet). Your real problem languages would probably be ones like Vietnamese, which has many short words that are distinguished only by tone marks or by the use of different vowels. These vowels are represented by the same letter with different accent marks.
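A rough Python sketch of the problem follows. The helper strip_marks is a hypothetical name of my own; it decomposes the text and drops combining marks, which is one common way "accent removal" gets implemented. The six Vietnamese words differ only in tone mark, and all of them collapse into one string.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Six distinct Vietnamese words that differ only in tone mark.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
stripped = {strip_marks(w) for w in words}
print(stripped)

# Danish: "å" is a letter of its own, yet it collapses into plain "a".
print(strip_marks("å"))
```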

Yes, case shifting destroys the Turkish and Azeri ı/I and i/İ relationship.
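As an illustration only, Python's string methods happen to be locale-less, so they show what a shift without Turkish rules does to these letters. The variable names are mine.

```python
# Python's str methods apply no locale rules, so Turkish pairs are not preserved.
dotless_i = "\u0131"   # ı LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"    # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# ı -> I -> i : after the round trip the dotless letter has become dotted.
print(dotless_i.upper().lower() == "i")

# İ lowercases to "i" followed by U+0307 COMBINING DOT ABOVE, two code points.
print([hex(ord(ch)) for ch in dotted_I.lower()])
```

With Turkish rules the first round trip would give back ı, and İ would lowercase to a plain i.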

The case that I was referring to was the Lithuanian lower case dotted i followed by a COMBINING DOT ABOVE, which becomes a simple upper case I when shifted. That is, the lower case i with two dots (its own and the combining one) becomes a standard uppercase I with no dot. A round trip upper/lower case shift in the "lt" locale will therefore remove the COMBINING DOT ABOVE after the i. This is like changing the German sharp-s to "ss" so that it will match "SS" shifted to lower case.
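Here is a minimal Python sketch of that round trip. Python has no locale-aware casing built in, so lt_upper is a hypothetical, heavily simplified stand-in for the "lt" rule: it handles only a plain i followed directly by the combining dot and ignores every other Lithuanian special case.

```python
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

def lt_upper(text):
    """Hypothetical, simplified "lt" uppercasing: drop a dot above after 'i'."""
    return text.replace("i" + DOT_ABOVE, "i").upper()

original = "i" + DOT_ABOVE
round_trip = lt_upper(original).lower()
# The original has two code points; the round trip leaves only one.
print(len(original), len(round_trip))

# The German analogy: the round trip turns sharp s into "ss".
print("ß".upper().lower() == "SS".lower())
```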

Carl

 

-----Original Message-----
From: Antoine Leca [mailto:Antoine.Leca@renault.fr]
Sent: Tuesday, March 06, 2001 8:02 AM
To: Unicode List
Cc: Unicode List
Subject: Re: UCD 3.1, Final Beta - Case folding


Carl W. Brown wrote:
>
> From: Antoine Leca [mailto:Antoine.Leca@renault.fr]
>
> >Carl W. Brown wrote:
> >>
> >> The case folding is locale-less so it seems to me the it is probably
> >> better to remove the COMBINING DOT ABOVE after all 'i' / 'I'
> >> regardless of locale
> >> to make it work for Lithuanian. I doubt that this will case serious
> >> problems with caseless compares for other locales.
>
> >please consider a Turkish text, fully decomposed: there, a dot_above
> >U+0307 following an uppercase I U+0049 should certainly *not* be dropped.
>
> This works for Turkish as well. Case folding folds dotted and dotless i
> into 'i'.

This is what I do not understand.

You are saying that, for a Turkish user, the result of the caseless comparison
will be that ı/I and i/İ are fully intermixed.

My understanding was that they expect all the ı/I (regardless of case)
to come before all the i/İ. Did I miss something?

Or, viewed from another angle, I am not sure that İstanbul should match
Istanbul in a _Turkish_ caseless search.

OTOH, I am neither a Turkish expert nor an i18n expert, so perhaps caseless
comparisons should ignore all accents and the like (i.e. grouping c and č,
и and й, etc. Perhaps I am overemphasizing, but I hope you get the idea).

Antoine


