From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 05 2008 - 18:12:57 CDT
Richard Wordingham wrote, regarding:
> > In other words, any piece of code is free to normalize.
>
> Although my reply is a bit late for immediate application, I have been
> assured that that is a fallacy for which I fell. An example of a piece of
> code that is not free to normalise is a routine performing compliant default
> upper-casing. Consider the recording of a no longer verifiable reading of a
> lower case alpha with a subscript iota.
[long complex example involving casing of alpha + iota subscript +
diacritic omitted]
> Presumably no Unicode-compliant process may assume that another process will
> perform default upper-casing compliantly!
Huh? Casing changes the interpretation of text, so it differs
significantly from canonical equivalence.
In general:

  interpretation(X) = interpretation(NFC(X)) = interpretation(NFD(X))

But in general:

  interpretation(X) != interpretation(toLowercase(X))
                    != interpretation(toUppercase(X))

Of course there are many choices of X for which one or both
of those inequalities collapse into equalities (text with no cased
letters, for instance), but in general a casing transformation
can (and often does) change the interpretation of text, in
the narrow sense of "interpretation" defined in the Unicode
conformance clauses.
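
The contrast above can be checked mechanically, as a rough sketch. It assumes Python's standard unicodedata module and uses canonical equivalence (equal NFD forms) as a stand-in for "same interpretation"; the helper name same_interpretation is hypothetical and not part of any Unicode API.

```python
import unicodedata

def same_interpretation(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are canonically equivalent.

    Canonical equivalence is used here only as a proxy for the narrow
    conformance sense of "interpretation".
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

x = "r\u00e9sum\u00e9"  # "resume" with two precomposed e-acute letters

# Normalization keeps the string canonically equivalent to the original.
keeps_nfc = same_interpretation(x, unicodedata.normalize("NFC", x))
keeps_nfd = same_interpretation(x, unicodedata.normalize("NFD", x))

# Uppercasing does not, because cased letters are replaced by different ones.
keeps_upper = same_interpretation(x, x.upper())

# For some X the inequality collapses: a string with no cased letters.
caseless = "2008-05-05"
caseless_keeps_upper = same_interpretation(caseless, caseless.upper())

print(keeps_nfc, keeps_nfd, keeps_upper, caseless_keeps_upper)
```
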
Given that, it should not be surprising that:

  interpretation(X) != interpretation(toUppercase(NFD(X)))

> Is there some subtlety here? Perhaps in what constitutes a process?
There are myriad types of text processes. Many of them do not
maintain text "interpretation" in the narrow sense -- they
are *intended* to change things.
This differs from (canonical) normalization, which by definition
does not change the "interpretation" of text. For the purposes
of conformance per se, if I hand you X and you hand me back
NFC(X) or NFD(X), then you have handed me back text intended
to have the same "interpretation". It may not be *identical*
text, of course, because the sequence of code points could
be different, and the length of the text may be different,
but its interpretation should be the same.
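
As a small illustration of "not identical, but same interpretation", the sketch below normalizes a single precomposed Greek letter with Python's unicodedata module. The choice of U+1FB3 and the helper name code_points are only for illustration.

```python
import unicodedata

def code_points(s: str) -> list[str]:
    """Hypothetical helper: render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]

x = "\u1fb3"  # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI

nfd = unicodedata.normalize("NFD", x)    # decomposed form
nfc = unicodedata.normalize("NFC", nfd)  # recomposed form

# The code point sequences and lengths differ between the two forms...
print(code_points(nfd), len(nfd))
print(code_points(nfc), len(nfc))

# ...yet either form normalizes back to the other, so they are
# canonically equivalent.
round_trips = (nfc == x) and (unicodedata.normalize("NFD", nfc) == nfd)
print(round_trips)
```
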
Once you start applying casing operations, you no longer have
that claim to same interpretation. I may recognize that you
have properly cased a string according to the default casing
rules (in which instance you can validly claim conformance to
those casing rules), or I may, with your agreement, recognize
that you have applied *other* casing rules, including whatever
conventions you want to put in effect about expanding diacritics
across 1 <--> 2 casing transforms, but what I won't see is
you handing me back text with the *same* (Unicode) interpretation under
such transforms. And any neutral third party (other implementation)
should agree with those conclusions, as well, if they have
properly implemented Unicode normalization.
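
A minimal sketch of such a 1 <--> 2 casing transform, assuming that Python's str.upper() applies the unconditional full case mappings of the default casing rules. The sample character is my own choice and is not the example omitted earlier in this thread.

```python
import unicodedata

x = "\u1fb3"  # GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI

# Uppercasing expands one code point into two: capital alpha
# followed by capital iota.
upper = x.upper()

# Uppercasing the decomposed form first gives the same result for this
# particular input.
upper_of_nfd = unicodedata.normalize("NFD", x).upper()

# The cased result is not canonically equivalent to the input, so no
# claim of "same interpretation" can be made for it.
still_equivalent = (
    unicodedata.normalize("NFD", upper) == unicodedata.normalize("NFD", x)
)

print([f"U+{ord(c):04X}" for c in upper], still_equivalent)
```
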
--Ken