Re: Unicode Search Engines

From: John Cowan (cowan@mercury.ccil.org)
Date: Wed Feb 20 2002 - 13:21:39 EST


Marco Cimarosti scripsit:

> But, if there is no precomposed character for "q with tilde", then the
> combining tilde *must* be maintained in all normalization forms.

Correct.
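
This can be checked with a minimal sketch in Python using the standard
unicodedata module. The sample strings and variable names are illustrative
assumptions, not taken from the thread.

```python
import unicodedata

# "q" followed by U+0303 COMBINING TILDE; Unicode has no precomposed form.
q_tilde = "q\u0303"

# Apply each of the four normalization forms to the same sequence.
forms = {
    form: unicodedata.normalize(form, q_tilde)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# Every form should leave the two-code-point sequence unchanged.
unchanged = all(result == q_tilde for result in forms.values())

# Contrast: "n" + combining tilde does have a precomposed form (U+00F1),
# so NFC composes it into a single code point.
n_tilde_nfc = unicodedata.normalize("NFC", "n\u0303")
```

Here `unchanged` should be true, while `n_tilde_nfc` is the single
character U+00F1.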

> Why? Isn't that what W3C asked?

No. The W3C CharMod wants receivers to check normalization and
reject unnormalized documents, *not* to normalize input. Silently
normalizing input can conceal the existence of a security-related
spoof that is NFC-equivalent to a genuine document.
It is essentially the same reason that broken HTML or broken UTF-8
should not be silently repaired.
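
A receiver following that policy might look roughly like the sketch below.
The function name require_nfc and the choice of ValueError are hypothetical;
CharMod does not prescribe any particular API.

```python
import unicodedata


def require_nfc(text: str) -> str:
    """Return text unchanged if it is already in NFC; otherwise raise.

    The input is never repaired: a non-NFC document is reported to the
    caller instead of being silently normalized.
    """
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("input is not in NFC; rejecting rather than repairing")
    return text
```

With this, require_nfc("\u00f1") returns its argument, while
require_nfc("n\u0303") raises, so the sender's mistake (or spoof) stays
visible.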

> BTW, are you sure that it is NFKC? My understanding is that it was NFC +
> some extra passages.

It is NFC, with the additional proviso that normalization (n11n) must be
done even if characters appear as character references (&#xnnnn;) rather
than as actual characters.
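
One way to picture the proviso, as a rough sketch only: expand the
references first, then test the result. The helper name
is_nfc_after_expansion is hypothetical, and html.unescape is used here as a
convenient stand-in for a real parser (it also expands HTML named entities,
which an XML processor would not).

```python
import html
import unicodedata


def is_nfc_after_expansion(markup_text: str) -> bool:
    """Check NFC on the text obtained after expanding character references."""
    # Turn references such as &#x303; into the characters they denote.
    expanded = html.unescape(markup_text)
    # The text passes only if NFC normalization would leave it unchanged.
    return unicodedata.normalize("NFC", expanded) == expanded
```

The raw string "n&#x303;" is plain ASCII and so trivially in NFC, but it
expands to "n" plus a combining tilde, which is not; the check above
reports it as unnormalized.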

-- 
John Cowan           http://www.ccil.org/~cowan              cowan@ccil.org
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
        --_The Hobbit_


