Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Fri Dec 05 2003 - 11:43:16 EST

    Kenneth Whistler <kenw at sybase dot com> wrote:

    > Canonical equivalence is about not modifying the interpretation of the
    > text. That is different from considerations about not changing the
    > text, period.
    >
    > If some process using text is sensitive to *any* change in the text
    > whatsoever (CRC-checking or any form of digital signaturing, memory
    > allocation), then, of course, *any* change to the text, including any
    > normalization, will make a difference.
    >
    > If some process using text is sensitive to the *interpretation* of the
    > text, i.e. it is concerned about the content and meaning of the
    > letters involved, then normalization, to forms NFC or NFD, which only
    > involve canonical equivalences, will *not* make a difference.

    All right. I think that is the missing piece I needed.

    How's this:

    Compression techniques may optionally replace certain sequences with
    canonically equivalent sequences to improve efficiency, but *only* if
    the decompressed output is not required to be code-point-for-code-point
    identical to the original. Whether that requirement holds depends on
    the user and the intended use of the text.

    Text compression techniques are generally assumed to be "lossless,"
    meaning that no information -- including meta-information -- is altered
    by compressing and decompressing the text. However, this is not always
    the case for other types of data. In particular, video and audio
    formats often incorporate some form of "lossy" compression where the
    benefit of reduced size outweighs the potential degradation of the
    original image or sample.
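
    As a side note, "lossless" in the byte-for-byte sense can be shown
    with a minimal Python sketch (the standard zlib module is used here
    purely as an example; any general-purpose compressor would do):

        import zlib

        # A decomposed "e" + combining acute accent, encoded as UTF-8 bytes.
        data = "e\u0301".encode("utf-8")

        # Lossless compression recovers the input byte for byte.
        assert zlib.decompress(zlib.compress(data)) == data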

    Because Unicode incorporates the notion of canonical equivalence, the
    line between "lossless" and "lossy" is not as clear as with other
    character encoding standards. Conformance clause C10 says (roughly)
    that a process may replace a run of text with any canonical-equivalent
    sequence without altering the interpretation of the text. Compression
    of Unicode text may be expected either to (a) preserve only the
    interpretation, in which case such a substitution is acceptable, or
    (b) preserve the exact code points, in which case it is not.
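
    To make the distinction concrete, here is a minimal Python sketch
    (using the standard unicodedata module; the example character is
    arbitrary) of a substitution that satisfies (a) but not (b):

        import unicodedata

        original = "e\u0301"    # U+0065 + U+0301 (e + combining acute)
        replaced = unicodedata.normalize("NFC", original)   # U+00E9

        # (b) is not preserved: the code points differ.
        assert replaced != original

        # (a) is preserved: the two strings are canonically equivalent,
        # i.e. they have the same canonical decomposition.
        assert (unicodedata.normalize("NFD", replaced)
                == unicodedata.normalize("NFD", original))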

    Mark indicated that a compression-decompression cycle should not only
    stick to canonical-equivalent sequences, which is what C10 requires, but
    should convert text only to NFC (if at all). Ken mentioned
    normalization "to forms NFC or NFD," but I'm not sure this was in the
    same context. (Can we find a consensus on this?)

    No substitution of compatibility equivalents or other privately defined
    equivalents is acceptable. A compressor can obviously convert its input
    to whatever representation it likes, but it must be able to recover
    either the original input exactly or a canonically equivalent
    sequence, as described above.
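
    In other words, the acceptance test for a compression-decompression
    cycle depends on which guarantee is claimed. A rough Python sketch
    (the function and parameter names are hypothetical, not from any
    spec):

        import unicodedata

        def round_trip_ok(original: str, decompressed: str,
                          exact: bool) -> bool:
            # If exact code points are required, nothing short of
            # identity will do.
            if exact:
                return decompressed == original
            # Otherwise only the interpretation must survive, so the two
            # strings need only be canonically equivalent (same NFD form).
            return (unicodedata.normalize("NFD", decompressed)
                    == unicodedata.normalize("NFD", original))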

    > Or to be more subtle about it, it might make a difference, but it is
    > nonconformant to claim that a process which claims it does not make a
    > difference is nonconformant.
    >
    > If you can parse that last sentence, then you are well on the way to
    > understanding the Tao of Unicode.

    I had to read it a few times, but such things are necessary along the
    Path of Enlightenment.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


