From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 09 2007 - 13:40:59 CDT
Richard Wordingham wrote:
> Philippe Verdy wrote on Tuesday, May 08, 2007 at 8:57 PM
> > Are we guaranteed to have, with existing normative Unicode definitions
> > and stability rules, for every string S in a locale L, the following
> > equalities starting at some current or past version of the Unicode
> > standard and in all future versions:
> >
> > toCaseFold(toLowerCase(S, L), L)
> > = toCaseFold(toUpperCase(S, L), L)
> > = toCaseFold(toTitleCase(S, L), L)
> >
> > Are there existing exceptions?
>
> Yes. U+0131 LATIN SMALL LETTER DOTLESS I lowercases and casefolds to
> itself, but uppercases and titlecases to U+0049 LATIN CAPITAL LETTER I,
> which then casefolds in the default casefolding to U+0069 LATIN SMALL
> LETTER I.
>
> U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE misbehaves similarly (mutatis
> mutandis) in the default simple mappings.
Hmmm. Although I remembered the effect of lowercase and uppercase/titlecase
mappings on these letters, I did not remember that this applied to the case
folding mapping.
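
The exception Richard describes can be reproduced with a short check. This is only a sketch: it assumes Python's built-in str.lower(), str.upper(), str.title() and str.casefold(), which are locale-independent and use the default full mappings, so they stand in for the locale-neutral case only. The variable names are illustrative.

```python
# U+0131 LATIN SMALL LETTER DOTLESS I
DOTLESS_I = "\u0131"

# Case-map first, then apply the default case folding.
# Assumption: Python's str methods use the default (non-Turkic) mappings.
via_lower = DOTLESS_I.lower().casefold()
via_upper = DOTLESS_I.upper().casefold()
via_title = DOTLESS_I.title().casefold()

for label, result in [("lower", via_lower), ("upper", via_upper), ("title", via_title)]:
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"fold({label}(U+0131)) = {code_points}")

# The three results are not all equal, so the proposed equality fails here.
print("equality holds:", via_lower == via_upper == via_title)
```

The lowercase path leaves the dotless i alone, while the uppercase and titlecase paths pass through U+0049 and end at U+0069.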
This is really unfortunate, because case folding should erase exactly
these differences, i.e.:
- When locale L is Turkish or Azeri, all case mappings should preserve the
difference between the dotted and dotless letters i.
- When locale L is neutral or anything other than Turkish or Azeri, the case
folding should map all four letters to the same letter, ignoring the soft dot.
- Case folding does not have to produce lowercase or uppercase; it just has to
be consistent and return one string from each equivalence class of strings
that are mapped to it (i.e. each equivalence class should contain one member
that the case folding leaves unchanged).
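
The three requirements above can be sketched as a small function. Everything here is hypothetical: fold(), TURKIC_FOLD and NEUTRAL_MERGE are illustrative names, the Turkic table follows the two entries marked with status T in CaseFolding.txt, and the neutral table implements the behaviour wished for above, which is not what the default Unicode case folding does. The sketch covers only the four precomposed letters and ignores sequences using U+0307 COMBINING DOT ABOVE.

```python
# Turkic folding: I -> dotless i, I WITH DOT ABOVE -> i (status T entries).
TURKIC_FOLD = str.maketrans({"I": "\u0131", "\u0130": "i"})

# Assumption: the neutral behaviour proposed above, merging all four letters.
# This is NOT the default Unicode folding, which leaves U+0131 unchanged.
NEUTRAL_MERGE = str.maketrans({"\u0130": "i", "\u0131": "i"})

def fold(s, turkic=False):
    """Hypothetical locale-aware case folding for the four letters i."""
    table = TURKIC_FOLD if turkic else NEUTRAL_MERGE
    return s.translate(table).casefold()

FOUR_I = ["I", "i", "\u0130", "\u0131"]

# Neutral: one equivalence class. Turkic: dotted and dotless stay apart.
neutral_classes = {fold(c) for c in FOUR_I}
turkic_classes = {fold(c, turkic=True) for c in FOUR_I}
print(len(neutral_classes), len(turkic_classes))

# Each class has a representative that folding leaves unchanged.
stable = all(
    fold(fold(c, turkic=t), turkic=t) == fold(c, turkic=t)
    for c in FOUR_I
    for t in (False, True)
)
print("idempotent:", stable)
```

Folding twice gives the same result as folding once, which is the "one unchanged member per equivalence class" condition from the last bullet.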
Note: I am not speaking here about the case mappings of individual
characters in the UCD, but about the general algorithm that works on any
Unicode text, even if such text contains "defective" sequences: this is what
nameprep for IDN needs to work on, because it handles strings (domain name
labels), not just individual characters, and it is only at this level that
the effect on canonically equivalent input strings must be guaranteed to make
the nameprep process compliant with Unicode rules.
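
As a rough illustration of the string-level concern, the sketch below folds two canonically equivalent spellings of the same text. It is not nameprep, which uses its own mapping tables and NFKC; it only assumes Python's unicodedata.normalize() and str.casefold(), and canonical_caseless_key() is a hypothetical helper name.

```python
import unicodedata

def canonical_caseless_key(s):
    """Hypothetical helper: fold case so that canonically equivalent
    inputs produce identical keys (NFD, then casefold, then NFD again)."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())

precomposed = "\u00c9"   # LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT

# Folding alone keeps the two spellings distinct at the code point level.
print(precomposed.casefold() == decomposed.casefold())

# Normalizing around the fold makes the keys identical.
print(canonical_caseless_key(precomposed) == canonical_caseless_key(decomposed))
```

Case folding applied character by character does not by itself give equal results for canonically equivalent strings, so a string-level process has to normalize as well.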