Re: LC_CTYPE locale category and character sets.

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 16 1998 - 12:22:22 EDT


Christophe Pierret asks:

>
> Here are some questions regarding character properties and cultural
> preferences:
>
> * Do the character properties defined in an LC_CTYPE POSIX locale
> category depend only on the character set of the locale?

This is one of the issues driving the critique of the proposed
ISO standard 14652, which attempts to expand POSIX locale constructs
to cover 10646/Unicode.

From the point of view of the universal character set (UCS), i.e.
the Unicode Standard, character properties are properties of the
characters. They are not locale-specific, but universal.
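As a minimal illustration, assuming Python and its standard unicodedata
module (neither is mentioned in the original question), a property lookup
takes a character and nothing else. There is no locale parameter to vary:

```python
import unicodedata

# General_Category is looked up by code point alone; no locale is involved.
# "\u00df" is LATIN SMALL LETTER SHARP S.
categories = {ch: unicodedata.category(ch) for ch in "A\u00df5"}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Here Lu, Ll and Nd are the uppercase-letter, lowercase-letter and
decimal-digit categories of the Unicode Character Database.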

>
> * Is it meaningful to consider that a Unicode (considered as a
> character set) LC_CTYPE locale category doesn't change with the
> cultural preferences?

Case-mappings between characters have a few well-known, culturally-specific
preferences that must be accounted for. But case-mappings are *relations*
between pairs (or triplets) of characters, and not character properties
per se. The character properties themselves should be invariant, defined
on the universal character set.

Then against the background of that set of invariant character properties,
engineers can do a better job of adjusting the kinds of behavior in
software which *should* be culturally-specific and vary by locale.
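A rough sketch of that layering, in Python. The function name
upper_for_language and its lang parameter are hypothetical, invented for
this illustration, and the only tailoring shown is the well-known Turkish
(and Azerbaijani) dotted capital I. The character properties stay fixed;
only the mapping applied on top of them varies:

```python
import unicodedata

def upper_for_language(text, lang=None):
    """Uppercase text, applying a language tailoring before the defaults.

    Hypothetical helper: only the Turkish/Azerbaijani rule is sketched.
    """
    if lang in ("tr", "az"):
        # Tailoring: small i maps to U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE.
        text = text.replace("i", "\u0130")
    # Everything else follows the default (untailored) mappings.
    return text.upper()

default_result = upper_for_language("istanbul")
turkish_result = upper_for_language("istanbul", "tr")

# The property of "i" itself is the same whichever mapping is applied.
category_of_i = unicodedata.category("i")
```

The tailoring changes the relation between characters, not the fact that
"i" is a lowercase letter.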

>
> I can't imagine that LATIN CAPITAL LETTER A is not uppercase anymore!

Nor can I. This is one of the reasons why it is meaningless to define
an isupper class in an LC_CTYPE definition.

LC_CTYPE was, in my opinion, basically a kludge to get around the fact
that different (non-universal) character sets contained different
repertoires, differently encoded. The use of LC_CTYPE enabled those
differences to be encapsulated in the equivalent of locale-specific
resource files in such a way that it basically allowed the API level
isupper(), etc., to work in a locale- and character-set-independent
way.
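To make the legacy situation concrete, here is a small sketch in Python
(the choice of byte and of the two ISO 8859 code pages is mine, purely for
illustration). The same byte value is an uppercase letter in one character
set and a lowercase letter in another, which is the kind of difference an
LC_CTYPE table had to encode:

```python
import unicodedata

byte = bytes([0xC0])

# The same byte value, interpreted under two legacy character sets.
as_latin1 = byte.decode("iso8859_1")  # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
as_greek = byte.decode("iso8859_7")   # U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

for label, ch in (("ISO 8859-1", as_latin1), ("ISO 8859-7", as_greek)):
    print(f"{label}: U+{ord(ch):04X} {unicodedata.name(ch)} upper={ch.isupper()}")
```

Once the byte has been mapped to a Unicode character, the uppercase or
lowercase answer follows from the character alone.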

But such considerations are obsolete for Unicode-based implementations.

>
> But is there any known example of an LC_CTYPE character property
> (isalpha, isupper, tolower, isdigit, isxdigit ...)
> which changes or should change from one culture to another?

None of them should.

>
> I noticed that in the Unicode Character Database 2.1.1, the line
> 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
> doesn't give an uppercase equivalent.
> In the readme file, I read the explanation:
>
> 12 Upper case equivalent mapping. If a character is part of an
> alphabet with case distinctions, and has an upper case equivalent,
> then the upper case equivalent is in this field. See the
> explanation below on case distinctions. These mappings are always
> one-to-one, not one-to-many or many-to-one. This field is
> informative.
>
> So, how can we handle one-to-many uppercase equivalents?
>
> Does anyone have a good example of how to correctly handle the German
> LATIN SMALL LETTER SHARP S (00DF)
> 'to uppercase' conversion, which should give two letters: "SS"?

Mark Davis pointed to the Unicode Standard for the full answer.

The short answer is that the Unicode Character Database (and you
should be using Version 2.1.2 now) gives all the default one-to-one
case mappings. Some case mappings (e.g., for French and for Turkish)
differ from the defaults. And U+00DF for German has the uppercase "SS",
but "SS" does not generally lowercase to U+00DF (unless you do
context analysis on the data).
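The asymmetry can be seen in a short Python sketch. This assumes Python 3,
whose str.upper() and str.lower() apply full (one-to-many) case mappings
rather than only the one-to-one fields of the character database. It is an
illustration, not a recommendation of any particular implementation:

```python
word = "Stra\u00dfe"  # "Strasse" spelled with U+00DF LATIN SMALL LETTER SHARP S

# Uppercasing expands U+00DF to two letters, so the string gets longer.
upper = word.upper()

# Lowercasing the result does not restore U+00DF; that would need
# context analysis, which the default mappings do not attempt.
round_trip = upper.lower()

# For caseless comparison, case folding treats both spellings alike.
same_when_folded = word.casefold() == upper.casefold()

print(upper, round_trip, same_when_folded)
```

Because the result can be longer than the input, any uppercasing routine
that handles U+00DF has to work on strings rather than on single
characters.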

--Ken


