From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 15 2003 - 10:15:19 EST
I have a minor problem related to the case folding (for searches) of dotless
lowercase letters, and I don't know why there's no case mapping defined for
them, when performing full case folding (I have no problem for simple case
mappings).
We currently have these full mappings for uppercase letters:
0049; C; 0069; # LATIN CAPITAL LETTER I
-> LATIN SMALL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
-> LATIN SMALL LETTER DOTLESS I
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
-> LATIN SMALL LETTER I, COMBINING DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
-> LATIN SMALL LETTER I
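As a quick check of the locale-neutral entries above, here is a minimal sketch assuming Python's built-in str.casefold(), which applies the C and F entries but none of the T (Turkic) ones. The `samples` table and the `code_points` helper are names made up for this illustration only.

```python
# Locale-neutral full case folding via str.casefold() (C and F entries only).
samples = {
    "LATIN CAPITAL LETTER I": "\u0049",
    "LATIN CAPITAL LETTER I WITH DOT ABOVE": "\u0130",
    "LATIN SMALL LETTER DOTLESS I": "\u0131",
}

def code_points(s):
    """Render a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)

# Fold each sample character and record the resulting code points.
folded = {name: code_points(ch.casefold()) for name, ch in samples.items()}

for name, cps in folded.items():
    print(f"{name} -> {cps}")
```

The dotless i comes back unchanged, which is the missing folding this message is about.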
I would have expected to find these mappings:
0131; F; 0069; # LATIN SMALL LETTER DOTLESS I
-> LATIN SMALL LETTER I
0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
-> LATIN SMALL LETTER DOTLESS I
The rationale being that the locale-neutral mappings would not differentiate
the "standard" small letter (soft-dotted) i, and the "Turkic" small letter
dotless i, for the same reason that they do not differentiate their
uppercase versions; and that the "Turkic" mappings should maintain this
difference in both lowercase and uppercase pairs of letters.
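To make the intended behaviour concrete, here is a rough sketch of a folding function driven by the existing entries plus the two proposed ones for the dotless i (U+0131). The tables `NEUTRAL` and `TURKIC` and the function `fold` are hypothetical names for this illustration, not part of any library; every other character is assumed to fall back on Python's str.casefold().

```python
# Existing locale-neutral entries, plus the proposed F entry for U+0131.
NEUTRAL = {
    "\u0049": "\u0069",        # C: I -> i
    "\u0130": "\u0069\u0307",  # F: I WITH DOT ABOVE -> i + COMBINING DOT ABOVE
    "\u0131": "\u0069",        # F (proposed): DOTLESS I -> i
}

# Existing Turkic entries, plus the proposed no-op T entry for U+0131.
TURKIC = {
    "\u0049": "\u0131",  # T: I -> DOTLESS I
    "\u0130": "\u0069",  # T: I WITH DOT ABOVE -> i
    "\u0131": "\u0131",  # T (proposed): DOTLESS I stays, overriding NEUTRAL
}

def fold(text, turkic=False):
    """Fold text; Turkic entries take precedence when turkic=True."""
    out = []
    for ch in text:
        if turkic and ch in TURKIC:
            out.append(TURKIC[ch])
        elif ch in NEUTRAL:
            out.append(NEUTRAL[ch])
        else:
            out.append(ch.casefold())  # assumed fallback for all other characters
    return "".join(out)

# Locale-neutral: dotless and dotted i fold together.
print(fold("k\u0131r") == fold("kir"))
# Turkic: the two letters stay distinct.
print(fold("k\u0131r", turkic=True) == fold("kir", turkic=True))
```

With these tables, the neutral folding treats the two small letters alike, while the Turkic folding keeps them apart in both cases.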
This is quite irritating, because original strings that are distinct under
case folding no longer remain distinct under case folding if they are first
converted to uppercase. Of course the mapping below would be a no-op:
0131; T; 0131; # LATIN SMALL LETTER DOTLESS I
-> LATIN SMALL LETTER DOTLESS I
but it would be needed in Turkic languages to override the locale-neutral
full case mapping:
0131; F; 0069; # LATIN SMALL LETTER DOTLESS I
-> LATIN SMALL LETTER I
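The irritation described above (strings that are distinct under case folding but no longer distinct once uppercased) can be reproduced with a small sketch, assuming Python's locale-neutral str.upper() and str.casefold(). The two sample strings are arbitrary and differ only in their dotted or dotless i.

```python
a = "k\u0131r"  # contains LATIN SMALL LETTER DOTLESS I
b = "kir"       # contains LATIN SMALL LETTER I

# Folded directly, the two strings remain different.
distinct_before = a.casefold() != b.casefold()

# Uppercasing first sends both letters to LATIN CAPITAL LETTER I,
# so the difference is lost before folding even starts.
equal_after_upper = a.upper().casefold() == b.upper().casefold()

print(distinct_before, equal_after_upper)
```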
In fact there are also occurrences where small dotless i is used in
non-Turkic languages, where both versions compare equal, notably when
there is another diacritic above that soft-dotted letter.
With an above diacritic, the letters should coherently compare equal with
case folding in non-Turkic languages, but they should still compare equal
in Turkic languages in that case (the mapping proposed above will not detect
this, meaning that a lowercase letter i with a diacritic above should always
be encoded with the standard (soft-dotted) i).
No such case folding issue occurs for the lowercase German sharp S
(ess-tsett), which is correctly mapped with this full case folding:
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
-> LATIN SMALL LETTER S, LATIN SMALL LETTER S
(this is shown as a proof that case foldings can be defined for lowercase
letters, and that conforming applications that use case folding must not
rely on the "Ll" general category to see if case folding can be avoided, and
not even on the absence of a simple lowercase mapping in the main
UnicodeData.txt file).
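A short sketch of that point, assuming Python's standard unicodedata module and str.casefold(): the sharp S is in general category "Ll" and is left alone by simple lowercasing, yet full case folding still changes it.

```python
import unicodedata

sharp_s = "\u00DF"  # LATIN SMALL LETTER SHARP S

category = unicodedata.category(sharp_s)        # general category of the letter
unchanged_by_lower = sharp_s.lower() == sharp_s  # lowercasing is a no-op here
folded = sharp_s.casefold()                      # full case folding still applies

print(category, unchanged_by_lower, folded)
```

So a shortcut of the form "already lowercase, skip folding" would give the wrong result for this letter.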
The same comment applies to the difference between the standard
"soft-dotted" j and the new dotless j...