Re: Fixing Two Unicode Asymmetries in case conversion

From: John Cowan (cowan@locke.ccil.org)
Date: Thu Nov 12 1998 - 16:16:18 EST


Marco Mussini wrote:

> please consider these two cases:
>
> (A) Turkish letter "dotless I"

[details snipped]

> This leads to the following problem while implementing a case conversion
> routine:
> you have to take language into account to correctly process letter "i"
> and "I"; you must have two different behaviours for converting case for
> letter "i"/"I" in Turkish and in other languages.

Yes, you do. That is why, although "case" is a normative property,
case mappings are merely informative. Turkish has different rules,
full stop.
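
For an implementer, "different rules" can be sketched roughly as follows
in Python. The function names and the use of "tr"/"az" language tags are
assumptions made for this illustration only; the built-in str.upper() and
str.lower() know nothing about language, so the Turkic exceptions have to
be applied before the default mapping runs.

```python
# Hypothetical helpers for illustration; not part of any standard API.
# Assumption: the caller knows the language of the text and passes a tag.
TURKIC = {"tr", "az"}


def to_upper(text: str, lang: str = "") -> str:
    """Uppercase text; in Turkic languages 'i' maps to U+0130."""
    if lang in TURKIC:
        # Dotted small i becomes capital I with dot above.
        # Dotless small i (U+0131) already maps to plain 'I' by default.
        text = text.replace("i", "\u0130")
    return text.upper()


def to_lower(text: str, lang: str = "") -> str:
    """Lowercase text; in Turkic languages 'I' maps to U+0131."""
    if lang in TURKIC:
        # Capital I with dot above becomes plain 'i' first, then every
        # remaining plain capital 'I' becomes dotless small i.
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()
```

The same input therefore gives different results depending on the
language tag: to_upper("i") yields "I", while to_upper("i", "tr") yields
U+0130. The sketch ignores decomposed sequences such as "I" followed by
COMBINING DOT ABOVE.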

> - introduce a dedicated codepoint for the capital dotless letter "I"
> coming from LATIN SMALL LETTER DOTLESS I, and give it LATIN SMALL LETTER
> DOTLESS I as its lowercase correspondent.
>
> - introduce a dedicated codepoint for the lowercase dotted letter "i"
> coming from LATIN CAPITAL LETTER I WITH DOT ABOVE, and give it LATIN
> CAPITAL LETTER I WITH DOT ABOVE as its uppercase correspondent.

This has been suggested many times, notably by me. In fact, the
lowercase "i" with dot can be represented by U+0069 U+0307, normal "i"
followed by COMBINING DOT ABOVE.

The reasons it has not been done are:

        1) the immense transcoding headache of properly
        converting 8859-9 legacy data, which may be Turkish or not;

        2) a strong doubt that users will "get it right" in future either.
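
The U+0069 U+0307 representation mentioned above can be inspected with
Python's standard unicodedata module. This is only a small illustration;
the variable names are made up for the example.

```python
import unicodedata

# Lowercase "i" with an explicit dot: plain i followed by a combining dot.
dotted_lower = "\u0069\u0307"
names = [unicodedata.name(ch) for ch in dotted_lower]

# Canonical decomposition of LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130)
# gives the uppercase counterpart of the same sequence.
decomposed_capital = unicodedata.normalize("NFD", "\u0130")

for ch in decomposed_capital:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Here names holds 'LATIN SMALL LETTER I' and 'COMBINING DOT ABOVE', and
the loop prints U+0049 followed by U+0307.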
 
> (B) The second problem is about the German sharp S issue:

[snipped]

> Both of these disadvantages would be solved if Unicode could introduce a
> new dedicated codepoint for double S and put it into bidirectional
> correspondence with Sharp S.

There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should new bogus characters be introduced
for all of them, so that when you see an "fl" ligature you can upcase
it to "FL" without expanding anything? Of course not.
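
As a rough illustration of why such mappings cannot round-trip, here is a
short Python sketch. It assumes Python 3, whose str.upper() applies
one-to-many case mappings; the helper name lacks_simple_upper is invented
for this example, and the set of characters it reports depends on the
Unicode version bundled with the interpreter.

```python
import unicodedata

sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
fl_ligature = "\ufb02"  # LATIN SMALL LIGATURE FL

upper_sharp_s = sharp_s.upper()   # expands to two letters
upper_fl = fl_ligature.upper()    # expands to two letters
round_trip = upper_sharp_s.lower()  # does not come back as sharp s


def lacks_simple_upper(ch: str) -> bool:
    """True for a lowercase letter with no single-character uppercase."""
    if unicodedata.category(ch) != "Ll":
        return False
    up = ch.upper()
    # Either unchanged (no uppercase at all) or expanded to several letters.
    return up == ch or len(up) != 1
```

Uppercasing expands both characters ("SS" and "FL"), and lowercasing
"SS" gives "ss", not the sharp S, so the information is lost either way.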

Note that case conversion is inherently language-sensitive, notably in the
case of IPA, which needs to be left strictly alone even when embedded
in another language which is being case converted. The best you can
get is an approximate fit.
 

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)


