Re: Possible problem going forward with normalization

From: Martin J. Duerst (duerst@w3.org)
Date: Sun Dec 26 1999 - 06:02:12 EST


At 16:03 1999/12/21 -0800, John Cowan wrote:
> It occurs to me that when a future version of Unicode is released with
> new combining marks, text that mixes old and new marks on the same
> base character, or that generates incorrectly ordered new marks,
> will produce inconsistent results when passed through normalization.
>
> Consider the sequence LATIN SMALL LETTER A, COMBINING GRACKLE (a
> post-3.0 character of class 232), COMBINING ACUTE ACCENT.
> A 3.0 implementation of normalization will not recognize the
> COMBINING GRACKLE as a mark, and will not swap it with the
> acute mark. A post-3.0 implementation with updated tables
> will do so.
>
> What is the current thinking on this?

Ken has given all the details. They show that the problems
that can indeed appear can be minimized by being careful
when introducing new characters after Unicode 3.0. The
whole idea of normalization, and the exact details of each
form, in particular Normalization Form C, were carefully
considered to reduce the impact of new additions as much as
possible. Of course, everybody knew that it would not be
possible to reduce this impact to zero.
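
To make the failure mode concrete, here is a small Python
sketch of just the canonical-ordering step, run against two
combining-class tables. Only COMBINING ACUTE ACCENT and its
class 230 are real; the GRACKLE code point (parked at an
arbitrary private-use value) and both tables are made up
for the demo:

    # Simplified canonical ordering (the reordering step of
    # normalization, as described in UAX #15).
    ACUTE   = 0x0301   # COMBINING ACUTE ACCENT, combining class 230
    GRACKLE = 0xE000   # stand-in for the hypothetical post-3.0 mark

    CCC_30 = {ACUTE: 230}                # 3.0 tables: GRACKLE unknown, class 0
    CCC_40 = {ACUTE: 230, GRACKLE: 232}  # updated tables know both marks

    def canonical_order(cps, ccc):
        # Swap adjacent pairs that are both non-starters and out of
        # order; a class-0 character blocks any reordering across it.
        cps = list(cps)
        for i in range(len(cps) - 1, 0, -1):
            for j in range(i):
                a, b = ccc.get(cps[j], 0), ccc.get(cps[j + 1], 0)
                if a > b > 0:
                    cps[j], cps[j + 1] = cps[j + 1], cps[j]
        return cps

    seq = [ord('a'), GRACKLE, ACUTE]
    print([hex(c) for c in canonical_order(seq, CCC_30)])
    # ['0x61', '0xe000', '0x301'] - GRACKLE looks like a starter, no swap
    print([hex(c) for c in canonical_order(seq, CCC_40)])
    # ['0x61', '0x301', '0xe000'] - ACUTE (230) sorts before GRACKLE (232)

The two runs disagree on the same input, which is exactly
the inconsistency John describes.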

One more thing that is very important to consider is where
this normalization should be applied. The W3C character
model (http://www.w3.org/TR/charmod) says very clearly that
normalization should be applied as early as possible. There
is a strong reason for this: the closer to the 'origin' of
a character, the higher the chance that information about
that character is still available, and therefore that
normalization will be done correctly.

Translated to our example, what we should really consider
is two editors, Editix and Ny-Editix (and not two
normalization programs Normix and Ny-Normix). Editix lets
you create and edit text in Unicode 3.0, Ny-Editix in
Unicode 4.0. Both of them may use whatever representation
they like internally, but externally, Editix will use
Unicode 3.0 in Normal Form C, and Ny-Editix will use
Unicode 4.0 in Normal Form C. Editix does not allow you to
create characters that are new in Unicode 4.0, and
therefore a Unicode 3.0-based Normal Form C is all that is
needed.
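
In code, the contract is simply 'normalize at the external
boundary'. A minimal Python sketch, where the standard
unicodedata module's NFC normalizer stands in for Editix's
own Unicode 3.0 tables, and save_document and the path are
made up for illustration:

    import unicodedata

    def save_document(text, path):
        # Whatever the internal buffer looks like, everything
        # written out is in Normal Form C.
        with open(path, 'w', encoding='utf-8') as f:
            f.write(unicodedata.normalize('NFC', text))

    # 'e' + COMBINING ACUTE ACCENT leaves the editor as the
    # precomposed U+00E9, so readers never see unnormalized data.
    save_document('caf\u0065\u0301', '/tmp/note.txt')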

Of course, there is the question of what the real origin of
a character is. Rather than the editor, this may be the
keyboard driver. Where keyboard drivers generate the
relevant characters, they should also make sure those
characters are appropriately normalized.
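
Again as a sketch (the dead-key handling and the emit
function are hypothetical; only the code points and
Python's unicodedata module are real):

    import unicodedata

    DEAD_ACUTE = '\u0301'  # what an acute dead key contributes

    def emit(base, dead_key):
        # A driver that composes dead key + base letter should hand
        # the application an already-normalized string.
        return unicodedata.normalize('NFC', base + dead_key)

    print(emit('a', DEAD_ACUTE))  # U+00E1, never 'a' + U+0301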

So the general idea is not 'everybody normalizes every time
they see some data', but 'normalize early; don't let
unnormalized data show up at all'.

Regards, Martin.

#-#-# Martin J. Dürst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org


