Re: C# character model

From: Mark Davis (markdavis@ispchannel.com)
Date: Wed Jun 28 2000 - 10:54:56 EDT


Almost all international functions (upper-, lower-, titlecasing, case folding, drawing, measuring, collation, transliteration, grapheme-, word-, linebreaks, etc.) should take *strings* in the API, NOT single code-points. Single code-point APIs almost always malfunction once you get outside of simple languages, because you need more context to get the right answer, or because you might need to generate a sequence of characters to return the right answer.

Take collation, for example. Any Unicode-compliant collation (UTR #10) must be able to handle sequences of more than one code-point, and treat that sequence as a single entity. Given that the code has to handle sequences, it makes little difference whether the string internally is a sequence of UTF-16 code units, or is a sequence of code-points ( = UTF-32 code units). This is because one of the beauties of UTF-16 and UTF-8 is that there is no overlap: the code units that start a code-point are never the same as the ones that continue it. Unlike SJIS, if I express a code-point as a sequence of code units, and search for it within a string of those code units, I will never get a mismatch: I will find a match iff there is a matching code-point at that position. In SJIS, if I search for an "a", I might find a false match as the second byte of a two-byte character -- this never happens in UTF-16 or UTF-8.
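To make the no-overlap point concrete, here is a small Python sketch. The helper name false_match is made up for illustration, and the byte pair 0x83 0x61 is simply one two-byte SJIS character whose second byte has the same value as ASCII "a"; it assumes Python's standard "shift_jis" codec.

```python
def false_match(text: str, needle: str, encoding: str) -> bool:
    """Return True if the needle's encoded form occurs inside the encoded
    text even though the needle does not occur in the text itself."""
    found_in_code_units = needle.encode(encoding) in text.encode(encoding)
    found_in_text = needle in text
    return found_in_code_units and not found_in_text


# One two-byte Shift_JIS character (a katakana letter) whose trail byte
# is 0x61, the same value as ASCII "a".
katakana = b"\x83\x61".decode("shift_jis")

sjis_result = false_match(katakana, "a", "shift_jis")  # trail byte looks like "a"
utf8_result = false_match(katakana, "a", "utf-8")      # every byte is >= 0x80

print(len(katakana), sjis_result, utf8_result)
```

The search on SJIS bytes reports a hit that is not really there; the same search on UTF-8 bytes cannot, because bytes that continue a character never look like a byte that starts one.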

If you ever tried to collate by handling one code-point at a time, you would get the wrong answer. The same will happen if you draw or measure text one code-point at a time. Because scripts like Arabic are contextual, the width of x plus the width of y is not equal to the width of xy. Once you get beyond basic typography, the same is true for English as well; because of kerning and ligatures the width of "fi" in TrueType may be different from the width of "f" plus the width of "i".
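As a rough illustration of why collation has to see sequences, here is a toy Python sort key. It is not an implementation of UTR #10: the element table, the names ELEMENTS, WEIGHT and sort_key, and the word list are all invented for the example. The only assumption taken from real practice is that "ch" is treated as one unit that sorts after "h", as in Slovak.

```python
# Toy collation elements: lowercase ASCII letters, with the two-letter
# sequence "ch" treated as one element ordered between "h" and "i".
ELEMENTS = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {element: rank for rank, element in enumerate(ELEMENTS)}
LONGEST = max(len(element) for element in ELEMENTS)


def sort_key(word: str) -> list[int]:
    """Map a word to a list of weights, matching the longest element first."""
    key = []
    i = 0
    while i < len(word):
        for length in range(LONGEST, 0, -1):
            chunk = word[i:i + length]
            if chunk in WEIGHT:
                key.append(WEIGHT[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no collation element for {word[i]!r}")
    return key


words = ["chata", "cena", "hora", "ihla", "dom"]
by_code_point = sorted(words)               # compares one code-point at a time
by_sequence = sorted(words, key=sort_key)   # sees "ch" as a single entity

print(by_code_point)
print(by_sequence)
```

The code-point sort puts "chata" right after "cena"; the sequence-aware key moves it after "hora", which is only possible because the function is handed the whole string.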

Looking at the question at hand: casing operations must return strings, not single code-points. (See http://www.unicode.org/unicode/reports/tr21/charts/). Moreover, the titlecasing operation requires strings as input, not single code-points at a time.
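A short Python sketch of both halves of this point, using only the built-in str methods (which apply the full Unicode case mappings). The helper lower_one_at_a_time is hypothetical and exists only to mimic a single code-point API; the Greek example shows context mattering for lowercasing, which is the same kind of dependency that titlecasing has.

```python
def lower_one_at_a_time(text: str) -> str:
    """Mimic a single code-point API: each character is lowercased in isolation."""
    return "".join(ch.lower() for ch in text)


# Output can be longer than input: one code-point in, two out.
upper_eszett = "\u00DF".upper()      # LATIN SMALL LETTER SHARP S
upper_ligature = "\uFB01".upper()    # LATIN SMALL LIGATURE FI

# Input needs context: a capital sigma at the end of a word lowercases
# to the final form U+03C2, elsewhere to U+03C3.
greek_word = "\u039F\u0394\u039F\u03A3"
lower_whole_string = greek_word.lower()
lower_per_code_point = lower_one_at_a_time(greek_word)

print(upper_eszett, upper_ligature)
print(lower_whole_string, lower_per_code_point)
```

An API that returns a single 16-bit value has nowhere to put the second "S", and an API that takes a single value cannot know whether the sigma ends a word.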

Remember also that whenever you store a single code-point in a struct or class instead of a string, you are excluding the use of graphemes -- a single code-point may not be sufficient to express what is required: you may need to store a sequence, such as "ch" for Slovak.

In other words, almost all APIs and struct/class fields should *not* take either a char16 or a char32, they should take a string. And if they take a string, it doesn't matter what the internal representation of the string is. The main exceptions we've found are very low-level operations such as getting character properties (e.g. General Category or Canonical Class in the UCD). For those it is handy to have interfaces that convert quickly to and from UTF-16 and UTF-32, and that allow you to iterate through strings returning UTF-32 values (even though the internal format is UTF-16).
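Such an iterator is small. The following Python sketch walks a sequence of UTF-16 code units and yields UTF-32 values, then looks up the General Category of each. The function names utf16_units and code_points are invented for this example, and an unpaired surrogate is simply passed through unchanged, which is one possible policy rather than a recommendation.

```python
import struct
import unicodedata


def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def code_points(units):
    """Iterate over UTF-16 code units, yielding one UTF-32 value per code-point."""
    i = 0
    n = len(units)
    while i < n:
        value = units[i]
        i += 1
        # A lead surrogate followed by a trail surrogate forms one code-point.
        if 0xD800 <= value <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            value = 0x10000 + ((value - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        yield value


units = utf16_units("a\U00010400")    # U+10400 needs a surrogate pair
values = list(code_points(units))
categories = [unicodedata.category(chr(v)) for v in values]

print([hex(u) for u in units], [hex(v) for v in values], categories)
```

The property lookup works on the UTF-32 value, while the string itself stays in UTF-16 throughout.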

For more information, see "Forms of Unicode" on http://www2.software.ibm.com/developer/papers.nsf/unicode-papers-bynewest

Mark

Antoine Leca wrote:

> Torsten Mohrin wrote:
> >
> > Antoine Leca <Antoine.Leca@renault.fr> wrote:
> >
> > [...]
> > >> > APIs use and return single 16-bit values.
> > >
> > >Ah, that may be a problem (what is the ToUpper return value of ß?)
> >
> > I don't know the mentioned API, but it could return 0x00DF or (to
> > indicate it as an error) 0xFFFF. I don't see a problem.
>
> The problem is that the "correct" answer is a two letter string, "SS".
>
> More generally, character manipulation API done on single 16-bit
> values tends to have a number of problems, not very problematic
> when we deal with Latin-based West European languages, but that
> are going gore when considered in a more wide context (example:
> what is the width of character U+064A Arabic yeh? if the context
> is not indicated in some way, the answer is probably wrong...)
>
> Antoine


