Re: Fixing Two Unicode Asymmetries in case conversion

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Nov 12 1998 - 14:03:06 EST


The ideas raised by Marco, of adding new *I and *SS upper case forms, have
been raised before by others and were rejected.

The main reason why the proposed fix is worse than the current situation
has to do with the fact that Unicode does not exist in a vacuum. A Unicode
system is likely to obtain lots of data from other systems, or system
components, which do not use Unicode, or did not use Unicode when the data
was created.

Turkish users, and German users, expect today that the upper case forms
(i.e. I and SS) are represented using the code points from ASCII wherever
non-Unicode code sets are used. Once coded this way, these letters are
indistinguishable from letters that are not the result of uppercasing
(e.g. I in non-Turkish words and SS in words where SS is the uppercase of ss).

While it would be possible to map down to such an existing code set, e.g. by
mapping *SS and SS to the same letters SS, and *I and I to the same letter I,
the reverse is not possible. Not even language information would help you
in the case of SS. Result: old data would continue to use codes for S and
I, even after converting to Unicode.
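
A minimal sketch of why the down-mapping cannot be reversed. The two
private-use code points below are placeholders I am assuming for the
proposed *SS and *I (no such assignments exist), and to_legacy is a
hypothetical helper, not part of any real converter:

```python
# Placeholders for the proposed characters (assumption: private-use code
# points chosen only for illustration).
STAR_SS = "\uE000"  # stands in for the proposed *SS
STAR_I = "\uE001"   # stands in for the proposed *I


def to_legacy(text: str) -> str:
    """Map down to a legacy code set: both proposed characters collapse
    onto the plain ASCII letters."""
    return text.replace(STAR_SS, "SS").replace(STAR_I, "I")


# Uppercased "Masse" and uppercased "Maße" (the latter using *SS) are
# distinct strings, yet they map down to the same legacy string.
from_masse = "MASSE"
from_masze = "MA" + STAR_SS + "E"
print(from_masse != from_masze)                        # True
print(to_legacy(from_masse) == to_legacy(from_masze))  # True
```

Since two different inputs give the same output, no function can recover
the original from the legacy form alone; for SS even knowing the language
is German does not say which of the two words was meant.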

Now your case mapper has to work overtime: it needs to know the new case
mapping and it still has to know the 'old' case mapping, language dependent
for Turkish, and 1->2 mapping for German.

Worse, you now have multiple code points for things that *look* the same, a
sure thing for people to get confused when entering, or searching for data.
If I see a name on a web site in all caps, I may not be able to tell whether
to search for it via *SS or SS or via I or *I.

Finally, the biggest limitation is keyboards. There is no key combination
which can generate *SS, since shift + ß = ? (the question mark) on a German
keyboard.

There are many, many other places in text processing, from invoking the
right spell checker to smart quotes to helping with font selection,
where knowledge of the language is needed. Adding the proposed character
codes does not help with these other issues. But since doing them right
means that you should design a way to have language information available
to your process, building a generic case table with only two
language-dependent case pairs (the Turkish i's) and a single, not
language-dependent, instance of ß to SS expansion should be OK.
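
As a rough sketch of such a case mapper: the generic step below relies on
Python's built-in str.upper and str.lower, which already expand ß to SS.
The function names and the lang parameter are my own assumptions for
illustration, and only the Turkish i's are treated as language dependent:

```python
def to_upper(text: str, lang: str = "") -> str:
    """Uppercase using the generic table plus the Turkish exception."""
    if lang == "tr":
        # Turkish pair 1: dotted i uppercases to dotted capital I (U+0130).
        text = text.replace("i", "\u0130")
    # Generic step: str.upper expands ß to SS (1 -> 2 mapping) and maps
    # dotless ı (U+0131) to I regardless of language.
    return text.upper()


def to_lower(text: str, lang: str = "") -> str:
    """Lowercase using the generic table plus the Turkish exception."""
    if lang == "tr":
        # Turkish pair 2: I lowercases to dotless ı (U+0131), and dotted
        # capital I (U+0130) lowercases to plain i.
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


print(to_upper("straße"))          # STRASSE
print(to_upper("istanbul", "tr"))  # İSTANBUL
print(to_lower("ISPARTA", "tr"))   # ısparta
print(to_lower("STRASSE"))         # strasse
```

The last line shows the asymmetry that remains: lowercasing SS never
yields ß, with or without language information.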

A./

 


