Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Oct 25 2003 - 20:00:57 CST


From: "Peter Kirk" <peterkirk@qaya.org>

> I can see that there might be some problems in the changeover phase. But
> these are basically the same problems as are present anyway, and at
> least putting them into a changeover phase means that they go away
> gradually instead of being standardised for ever, or however long
> Unicode is planned to survive for.

I had already thought about that. But it may cause more trouble in the
future for handling languages (like modern Hebrew) in which those combining
classes are not a problem, and where the canonical ordering of combining
characters is a real bonus that would be lost if combining classes were
merged, notably for full-text searches: the number of order combinations to
search could explode, because the effective order of marks in occurrences
would become unpredictable.
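
To illustrate with a small Python sketch (using the standard unicodedata
module; the characters are arbitrary Latin examples, not Hebrew): when two
marks have distinct combining classes, normalization folds every typed
order into one canonical order, so a search needs to try only one
arrangement; when they share a class, each typed order survives and must be
searched separately.

    import unicodedata as ud

    # Distinct classes: NFD folds both typed orders into one canonical order.
    s1 = "a\u0316\u0301"   # grave below (ccc 220) then acute above (ccc 230)
    s2 = "a\u0301\u0316"   # same marks, typed in the other order
    assert ud.normalize("NFD", s1) == ud.normalize("NFD", s2)

    # Equal classes: the typed order is preserved, so the two arrangements
    # stay distinct and a search must try both of them.
    t1 = "a\u0301\u0300"   # acute then grave, both ccc 230
    t2 = "a\u0300\u0301"
    assert ud.normalize("NFD", t1) != ud.normalize("NFD", t2)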

Of course, if the combining class values were really bogus, a much simpler
way would be to deprecate some existing characters, allowing new
applications to use new replacement characters, and to slowly adapt
existing documents to the replacement characters, whose combining classes
would be more language-friendly.

This last solution may seem better, but only in the case where a unique
combining class can be assigned to these characters. As was said previously
on this list, there are languages for which such an axiom causes problems,
meaning that, under the current model, those problematic combining
characters would have to be encoded with a null combining class and linked
to the previous combining sequence using either a character property (for
their combining behavior in grapheme clusters and for rendering) or a
specific joiner control (ZWJ?) if that property is not universal for the
character.
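
As a sketch of that null-class behavior (here I use U+034F COMBINING
GRAPHEME JOINER, which already has combining class 0, merely as a stand-in
for whatever joiner would be chosen):

    import unicodedata as ud

    # A class-0 character blocks canonical reordering of the marks around
    # it, so the typed order is frozen exactly as the author wrote it.
    blocked = "a\u0301\u034F\u0316"
    assert ud.normalize("NFD", blocked) == blocked

    # Without the class-0 character, the same marks are reordered by ccc.
    free = "a\u0301\u0316"
    assert ud.normalize("NFD", free) == "a\u0316\u0301"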

> It isn't a problem for XML etc as in such cases normalisation is
> recommended but not required, thankfully.

In practice, "recommended" means that many processes will perform this
normalization as part of their internal job, so it would cause
interoperability problems when the result of this normalization is later
retrieved by an unaware client that submitted the data to a service which
was supposed to preserve the normalization identity of the string.
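
A minimal sketch of that round trip (the service and client here are
hypothetical; only the normalization call is real):

    import unicodedata as ud

    submitted = "a\u0301"                    # client sends decomposed a + acute
    stored = ud.normalize("NFC", submitted)  # service normalizes internally

    # A client comparing code point for code point no longer recognizes
    # the data it submitted, though the strings are canonically equivalent.
    assert stored == "\u00E1"
    assert stored != submitted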

Also, I have doubts about the validity of this change with respect to the
stability pact signed between Unicode and the W3C for XML.

> As for requirements that lists
> are normalised and sorted, I would consider that a process that makes
> assumptions, without checking, about data received from another process
> under separate control is a process badly implemented and asking for
> trouble.

Here the problem is that we will not always have to manage the case of
separate processes, but also the case of utility libraries: if such a
library is upgraded separately, the application using it may start
experiencing problems. For example, consider the implied sort order in SQL
databases for table indices: what would happen if the SQL server were
stopped just long enough to upgrade a standard library that implements
normalization among many other services, because a security bug such as a
buffer overrun was fixed in another API? When the SQL server restarts with
the new library implementing the new normalization, nothing would appear to
happen, but the sort order would no longer be guaranteed, and stored sorted
indices would start being "corrupted" in a way that invalidates binary
searches (meaning that some unique keys could become duplicated, or not
found, producing unpredictable results, which is critical if they are
relied on for, say, user authentication or file-existence checks).
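
A toy sketch of that failure mode in Python (the switch from NFD keys to
NFC keys is only a stand-in for "the normalization rule changed under the
upgrade", not an actual combining-class merge):

    import bisect
    import unicodedata as ud

    # Index built and sorted while the library produced NFD keys...
    rows = sorted(["\u00E1", "a\u0316\u0301", "zebra"],
                  key=lambda r: ud.normalize("NFD", r))

    # ...then probed after an upgrade that computes NFC keys instead.
    def lookup(value):
        keys = [ud.normalize("NFC", r) for r in rows]  # no longer sorted!
        i = bisect.bisect_left(keys, ud.normalize("NFC", value))
        return i < len(keys) and keys[i] == ud.normalize("NFC", value)

    print(lookup("zebra"))   # False: the row exists, but the search misses it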

Of course such an upgrade should be documented, but it would occur at a
very intimate level of a utility library incidentally used by the server.
Will all administrators and programmers be able to find and understand all
the intimate details of this change, when Unicode has told them that
normalized forms should never change? Will it be possible to scan and
rebuild the corrupted data with a check-and-repair tool, if the programmers
of the system assumed that the Unicode statement was definitive and built
optimized systems on that assumption?

When I read the stability pact, I conclude from it that any text that is
valid and normalized in one version of Unicode will remain normalized in
any version of Unicode (including previous ones), provided that the
normalized string contains only characters that were already defined in
the earlier version. This means that there is an upward _and_ backward
compatibility of encoded strings and normalizations on their common
defined subset (excluding only characters that were defined in later
versions but not assigned in previous ones).

The only thing allowed to change is the absolute value of non-zero
combining classes (and only in a limited way, as they are currently
confined to an 8-bit value range also specified in the stability pact with
the XML working group), but not their relative order: merging neighbouring
classes changes their relative order, removes requirements on the sort
order, and thus modifies the result of the normalization algorithm applied
to the same initial source strings.
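
A toy version of the canonical-ordering step makes the distinction concrete
(the class tables are invented for the demonstration; 230 and 220 match the
real classes of U+0301 and U+0316):

    # Canonical ordering: stable sort of each run of non-starter characters
    # by combining class, which is the reordering step of normalization.
    def reorder(s, ccc):
        out, run = [], []
        for ch in s:
            if ccc.get(ch, 0):
                run.append(ch)
            else:
                out += sorted(run, key=ccc.get) + [ch]
                run = []
        return "".join(out + sorted(run, key=ccc.get))

    ORIG   = {"\u0301": 230, "\u0316": 220}
    SCALED = {"\u0301": 23,  "\u0316": 22}    # new values, same relative order
    MERGED = {"\u0301": 230, "\u0316": 230}   # neighbouring classes merged

    s = "a\u0301\u0316"
    assert reorder(s, ORIG)   == "a\u0316\u0301"
    assert reorder(s, SCALED) == "a\u0316\u0301"  # result unchanged
    assert reorder(s, MERGED) == s                # stable sort: no reordering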


