Re: LC_CTYPE locale category and character sets.

From: Mark Davis (marked@best.com)
Date: Thu Jul 16 1998 - 16:36:43 EDT


Michael,

Remember that, in general, casing operations are not reversible. For example,
upper(lower("Mark Davis")) = "MARK DAVIS"
lower(upper("Mark Davis")) = "mark davis".
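The same effect is easy to reproduce. The following is a small Python sketch, assuming Python's built-in str.upper() and str.lower() as stand-ins for the upper() and lower() above; the variable names are only illustrative.

```python
# Composing the two casing operations does not give back the original string.
name = "Mark Davis"
up_of_low = name.lower().upper()
low_of_up = name.upper().lower()
print(up_of_low)
print(low_of_up)

# The sharp s case from this thread: uppercasing expands U+00DF to "SS",
# and lowercasing "SS" cannot know that a sharp s was there.
word = "stra\u00dfe"
round_trip = word.upper().lower()
print(round_trip == word)
```

Neither composition returns "Mark Davis", and the sharp s does not survive the trip through uppercase.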

There are even single words like "vederLa" in Italian, which are neither upper,
lower, nor titlecase.

However, suppose you want to do it. Just don't map any character C to D if D
does not have the corresponding mapping back to C in the Unicode Character
Database.

Pseudocode example of uppercasing:

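for (int i = 0; i < sourcelength; ++i) {
    char c = source[i];
    String d = upper(c);  // uppercase mapping of c from the Unicode Character Database
    // Map c to d only if d differs from c and lowercases back to exactly c.
    if (d != c && lower(d) == c) destination.append(d);
    else destination.append(c);  // otherwise leave c unchanged
}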

Of course, this means that <omicron><final sigma> uppercases to <OMICRON><final
sigma>, which is wrong; but you get what you ask for.
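For anyone who wants to experiment, here is a minimal Python sketch of the round-trip check in the pseudocode above. It assumes Python's built-in str.upper() and str.lower() as stand-ins for the Unicode Character Database mappings (they follow whichever Unicode version the interpreter ships with), and the name reversible_upper is made up for illustration.

```python
def reversible_upper(source: str) -> str:
    """Uppercase only the characters whose mapping round-trips.

    Hypothetical helper: str.upper() and str.lower() stand in for the
    Unicode Character Database case mappings.
    """
    destination = []
    for c in source:
        d = c.upper()
        # Map c to d only if d differs from c and lowercases back to c.
        if d != c and d.lower() == c:
            destination.append(d)
        else:
            destination.append(c)
    return "".join(destination)


if __name__ == "__main__":
    print(reversible_upper("mark davis"))
    print(reversible_upper("stra\u00dfe"))   # contains LATIN SMALL LETTER SHARP S
    print(reversible_upper("\u03bf\u03c2"))  # omicron followed by final sigma
```

Under these assumptions the sharp s and the final sigma are left as they are, because "SS" lowercases to "ss" and a lone capital sigma lowercases to the non-final sigma. Note that undoing the result also has to go character by character; lowercasing the whole string at once may apply context rules such as the final sigma rule.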

Mark

Michael Everson scribbled:

> At 09:13 -0700 1998-07-16, Kenneth Whistler wrote:
>
> >Case-mappings between characters have a few well-known, culturally-specific
> >preferences that must be accounted for. But case-mappings are *relations*
> >between pairs (or triplets) of characters, and not character properties
> >per se.
>
> >> Does anyone have a good example of how to correctly handle the German
> >> LATIN SMALL LETTER SHARP S (00DF)
> >> 'to uppercase' conversion, which should give two letters: "SS"?
> >
> >Mark Davis pointed at the Unicode Standard for the full answer.
> >
> >The short answer is that the Unicode Character Database (and you
> >should be using Version 2.1.2 now) gives all the default one-to-one
> >case mappings. Some case mappings (e.g., for French and for Turkish)
> >differ from the defaults.
>
> French?
>
> >And U+00DF for German has the uppercase "SS",
> >but "SS" does not generally lowercase to U+00DF (unless you do
> >context analysis on the data).
>
> Which is especially unreliable now that the German High Court has approved
> the spelling reform.
>
> How do you do reversible conversions from lowercase to uppercase and back,
> though? Or is that "outside the scope" of coding in your view?
>
> --
> Michael Everson, Everson Gunn Teoranta ** http://www.indigo.ie/egt
> 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
> Guthán: +353 1 478-2597 ** Facsa: +353 1 478-2597 (by arrangement)
> 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire

--
business: medavis2@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--


