From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Nov 25 2003 - 14:07:48 EST
On 25/11/2003 10:03, John Cowan wrote:
>... And as for
>canonical equivalence, the most efficient way to compare strings for
>it is to normalize both of them in some way and then do a raw
>binary compare. Since it adds efficiency to normalize only once,
>it is worthwhile to define a few normalization forms and urge
>people to produce text in one of them, so that receivers need not
>normalize but need only check for normalization, typically much cheaper.
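(To put the quoted approach in concrete terms, here is a minimal sketch in Python - my own illustration, not anything from the standard - assuming only the standard-library unicodedata module, Python 3.8 or later for is_normalized, and picking NFC arbitrarily:)

    import unicodedata

    def canonically_equal(a: str, b: str) -> bool:
        # Normalise both strings to the same form, then do a plain
        # code-point-for-code-point comparison.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    def receive(s: str) -> str:
        # The cheaper receiver path: checking is usually much less work
        # than normalising, so normalise only when the check fails.
        if unicodedata.is_normalized("NFC", s):
            return s
        return unicodedata.normalize("NFC", s)

For example, canonically_equal("\u00e9", "e\u0301") is True, although the two strings differ code point for code point.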
If receivers are expected to check for normalisation, they are
presumably also expected to normalise when the check fails; if they do
not, they are in conflict with conformance clause C9 - at least with the
"ideally" of its last paragraph, and probably with the principle that "no
process can assume that another process will make a distinction between
two different, but canonical-equivalent character sequences". The
efficiency gain comes from the expectation that the great majority of
received strings are already normalised; but the system must still be able
to cope with the small proportion that are not. So if combining classes
are changed in such a way that the normalised form of certain rare or
anomalous strings is not preserved, the system can cope with that too.
And thus the argument from normalisation stability against changing
combining classes also fails, at least where those changes are made to
rare or obscure characters, or combinations of characters, which are
little used in existing texts. One example, if Doug will forgive me, is
Hebrew points. There may well be others.
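(How directly the normalised order depends on the class values can be seen in a toy sketch; the marks "x" and "y" and their class numbers below are pure inventions for illustration:)

    # Canonical reordering is, in effect, a stable sort of combining marks
    # by combining class, so the normalised order of a rare combination
    # depends directly on the class values assigned to its marks.
    def canonical_order(marks, ccc):
        return sorted(marks, key=lambda mark: ccc[mark])

    old_ccc = {"x": 10, "y": 20}    # hypothetical original assignments
    new_ccc = {"x": 20, "y": 10}    # the same marks after a class change

    print(canonical_order(["y", "x"], old_ccc))   # ['x', 'y']
    print(canonical_order(["y", "x"], new_ccc))   # ['y', 'x']

Text normalised under the old classes is no longer in normalised form under the new ones - which is exactly the case that the receiver's fallback above is there to handle.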
So, it seems that Unicode has bound itself by its stability policy to
something which is both unnecessary and in fundamental conflict with its
own conformance clause C10. I urge reconsideration of the policy.
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/