Unicode character cases

From: Marco Mussini (marco.mussini@vim.tlt.alcatel.it)
Date: Fri Nov 20 1998 - 12:46:48 EST


Dear all,

thank you very much for your replies to the problem we have
raised concerning the case of some Unicode characters.
Judging by the variety of the replies, we have the impression
that we did not make the issue clear enough.
This mail tries to state the problem in a more understandable
form.

The issue is the capitalization process of text: its
difficulties and complexity.

Let's first set some ground.

A codeset, besides being a mapping for characters, has embedded
in it some knowledge about characters which can make processing
them more or less complex.
Such knowledge lies both in the codepoints themselves (i.e.
in having a unique codepoint for "similar" characters or
in having distinct ones) and in the character properties.
Specific computer processes make use of such knowledge.
E.g. the process of rendering characters with glyphs for
printing or displaying can be more or less complex depending
on how the codeset is made.
If, e.g., we had a poor codeset in which Roman letters and Greek
letters were represented by the same codepoints, the rendering
process would need to know the language of each letter to
choose the proper glyph. Clearly, it would be close to impossible
for it to tell the language by itself, and close to crazy to
tell it the language of each word.
Luckily (but on purpose), in Unicode we have distinct characters,
and thus the rendering process need not bother about languages.

Let's now turn to capitalization.

The question is whether the knowledge needed to perform
capitalization is (or has to be) embedded into the codeset.
For Unicode 2 the answer is both yes and no at the same time.
It is "yes" because the character properties give the uppercase
equivalent for the majority of characters, and "no" because the
mapping is not complete.
As a result, there is a need to know the language and to implement
a specific algorithm to perform capitalization.
This seems to us an unfortunate situation which needs to be amended.
The current state of affairs is such that, with not enough
knowledge present in the codeset, the capitalization process
is language-dependent, which means that a piece of text containing
words of different languages is very hard to capitalize properly.
Note that the Java JDK, notably one of the very few tools which
support Unicode, has the same problem and thus does not capitalize
properly.
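
To make the point concrete, here is a minimal sketch in Java
(the exact output depends on the JDK version and its locale
data): the same string uppercases differently depending on the
locale the program passes in, so the program must know the
language of the text; mixed-language text cannot be handled by
a single call.

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            String word = "istanbul";
            // The result depends on the locale the caller supplies.
            System.out.println(word.toUpperCase(Locale.ENGLISH));
            // -> ISTANBUL (Latin capital I)
            System.out.println(word.toUpperCase(new Locale("tr", "TR")));
            // -> ISTANBUL spelt with the dotted capital I (U+0130)
        }
    }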

The issue now is the following:

  "do we want to make the capitalization process simple
   and language independent by embedding in Unicode the
   necessary knowledge or not?"

The cost for having it seems to be the following:

  - introduce two distinct characters for the Turkish
    lowercase and uppercase I's, which behave differently
    from the Latin ones as far as capitalization is concerned.

Note that we are not hinting at the appropriateness, in general,
of introducing distinct codepoints; we are only considering
this as far as capitalization is concerned.

The outcome of this change would be that capitalization would
be, at least, locale-independent.
It would still need to cater, however, for some exceptions which
are not related to the language but which can keep it complex.
That is the case of the sharp s, which can be mended by introducing
an uppercase sharp s (whatever its glyph might be).
We did not check Unicode thoroughly, and thus there could be
a few other cases (perhaps some ligatures) for which there
would be a need to introduce characters for their uppercase
equivalents.

Note that with these changes the capitalization process would
become simple; a string could be capitalized character by
character, yielding one of the same length.
Such a process would not produce all the possible capitalizations
of a string (e.g. a string containing a sharp s could also be
capitalized by uppercasing it with SS), but would produce just
a good one.
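
For concreteness, here is a sketch in Java of the loop that the
proposed change would make sufficient. Today this very loop fails
on the cases discussed above: it leaves the sharp s untouched and
always maps i to the Latin capital I, precisely because the
needed characters and one-to-one properties are missing.

    // A sketch of per-character capitalization, assuming (as the
    // proposal requires) that every character had a single,
    // language-independent uppercase equivalent in its properties.
    static String capitalize(String s) {
        char[] out = new char[s.length()];  // same length, by construction
        for (int i = 0; i < s.length(); i++) {
            // Character.toUpperCase consults only the per-character
            // property table; no locale or language is involved.
            out[i] = Character.toUpperCase(s.charAt(i));
        }
        return new String(out);
    }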

Likewise, the opposite process would produce a good, correct
lowercase string (and not all the possible ones).
E.g. an SS in a German word would become two lowercase s's, and
not a lowercase sharp s, which seems to be O.K.
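
This is, in fact, what the existing per-character lowercase
mapping already does, since S has a single lowercase equivalent:

    String s = "STRASSE".toLowerCase();
    // s is "strasse": each S maps to s; no sharp s is produced.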

It is up to your judgment to tell if this change is worth its
price. In reasoning about it, please, focus only on the change
of case of characters, and leave out any other consideration
concerning the introduction or separation of codepoints which is
not strictly related to it.

As a last remark, take into account that the more regular a
codeset is, the higher the chance that you will get good
software that handles text properly. It is no accident that
even Java fails on this. By embedding character case knowledge
in the codeset we would relieve programmers of having to know the
orthographic rules of the many languages which Unicode supports,
rules which most programmers are likely not to know.

Thank You all for your comments,

Angelo Borsotti, Marco Mussini


