From: Doug Ewell (dewell@adelphia.net)
Date: Fri Dec 05 2003 - 11:43:16 EST
Kenneth Whistler <kenw at sybase dot com> wrote:
> Canonical equivalence is about not modifying the interpretation of the
> text. That is different from considerations about not changing the
> text, period.
>
> If some process using text is sensitive to *any* change in the text
> whatsoever (CRC-checking or any form of digital signaturing, memory
> allocation), then, of course, *any* change to the text, including any
> normalization, will make a difference.
>
> If some process using text is sensitive to the *interpretation* of the
> text, i.e. it is concerned about the content and meaning of the
> letters involved, then normalization, to forms NFC or NFD, which only
> involve canonical equivalences, will *not* make a difference.
All right. I think that is the missing piece I needed.
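To make the distinction concrete for myself, here is a small Python
sketch (my own illustration, nothing from the standard): a process that
is sensitive to *any* change, like a digest, sees a difference between
two canonically equivalent forms, while a process that only cares about
interpretation compares them after normalization and sees none.

    import hashlib
    import unicodedata

    precomposed = "\u00E9"     # e-acute as one code point (U+00E9)
    decomposed  = "e\u0301"    # e + COMBINING ACUTE ACCENT (U+0065 U+0301)

    # A digest-style process is sensitive to any change in the text:
    # the two canonically equivalent forms hash differently.
    print(hashlib.md5(precomposed.encode("utf-8")).hexdigest())
    print(hashlib.md5(decomposed.encode("utf-8")).hexdigest())

    # A process concerned only with interpretation compares after
    # normalizing (NFC or NFD) and sees no difference at all.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))    # True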
How's this:
Compression techniques may optionally replace certain sequences with
canonically equivalent sequences to improve efficiency, but *only* if
the decompressed output is not required to be codepoint-for-codepoint
identical to the original. Whether that requirement holds depends on
the user and the intended use of the text.
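(As an aside, and not part of the proposed wording: a compressor that
takes advantage of this might look roughly like the following Python
sketch, where NFC substitution is applied only when the consumer does
not demand codepoint-for-codepoint fidelity. The function names and the
normalize-to-NFC policy are just my illustration.)

    import unicodedata
    import zlib

    def compress_text(text, exact_codepoints_required):
        # Substitute the canonically equivalent NFC form only when
        # exact code point fidelity is not required.
        if not exact_codepoints_required:
            text = unicodedata.normalize("NFC", text)
        return zlib.compress(text.encode("utf-8"))

    def decompress_text(blob):
        return zlib.decompress(blob).decode("utf-8")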
Text compression techniques are generally assumed to be "lossless,"
meaning that no information -- including meta-information -- is altered
by compressing and decompressing the text. However, this is not always
the case for other types of data. In particular, video and audio
formats often incorporate some form of "lossy" compression where the
benefit of reduced size outweighs the potential degradation of the
original image or sample.
Because Unicode incorporates the notion of canonical equivalence, the
line between "lossless" and "lossy" is not as clear as with other
character encoding standards. Conformance clause C10 says (roughly)
that a process may choose any canonical-equivalent sequence for a run of
text without altering the interpretation of the text. A consumer of
compressed Unicode text may expect the cycle either to (a) preserve
only the interpretation, in which case substituting canonical
equivalents is acceptable, or (b) preserve the exact code points, in
which case it is not.
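(Again as an aside: in code, the two expectations amount to two
different round-trip checks. This is only a sketch; the helper name is
mine.)

    import unicodedata

    def round_trip_acceptable(original, decompressed, exact_required):
        if exact_required:
            # Case (b): the cycle must reproduce the exact code points.
            return decompressed == original
        # Case (a): it is enough that the interpretation is preserved,
        # i.e. the strings are canonically equivalent.
        return (unicodedata.normalize("NFD", decompressed) ==
                unicodedata.normalize("NFD", original))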
Mark indicated that a compression-decompression cycle should not only
stick to canonical-equivalent sequences, which is what C10 requires, but
should convert text only to NFC (if at all). Ken mentioned
normalization "to forms NFC or NFD," but I'm not sure this was in the
same context. (Can we find a consensus on this?)
No substitution of compatibility equivalents or other privately defined
equivalents is acceptable. A compressor can obviously convert its input
to whatever representation it likes, but it must be able to recover the
original input exactly, or "equivalently" as described above.
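(One more aside: a quick way to see why compatibility equivalents are
off-limits is that NFKC rewrites characters in a way that changes the
interpretation, not just the code point sequence. Python again,
illustrative only.)

    import unicodedata

    text = "\uFB01le is 2\u00B2 bytes"   # "file is 2^2 bytes", written
                                         # with an fi ligature and a
                                         # superscript two

    print(unicodedata.normalize("NFC", text))    # unchanged: only canonical mappings apply
    print(unicodedata.normalize("NFKC", text))   # "file is 22 bytes" -- information lost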
> Or to be more subtle about it, it might make a difference, but it is
> nonconformant to claim that a process which claims it does not make a
> difference is nonconformant.
>
> If you can parse that last sentence, then you are well on the way to
> understanding the Tao of Unicode.
I had to read it a few times, but such things are necessary along the
Path of Enlightenment.
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/