Re: Compression through normalization

From: Mark E. Shoulson (mark@kli.org)
Date: Mon Nov 24 2003 - 10:52:03 EST


    On 11/24/03 01:26, Doug Ewell wrote:

    >So the question becomes: Is it legitimate for a Unicode compression
    >engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
    >another (canonically equivalent) normalization form to improve its
    >compressibility?
    >
    OK, this *is* a fascinating question. When people hear "lossless
    compression," they take it to mean that decompress(compress(T)) = T
    for all T, no matter what. What you get out doesn't just look like
    what you put in, it IS what you put in. But C10 permits replacement
    by canonical equivalents, and I think that may be a problem. Picture
    signing an MD5 hash of a file and sending it, compressed, to a
    friend, who uncompresses it and finds it no longer hashes to the same
    value! We might require that message-hashing only be done on text in
    a particular normalization form, but that may not always be
    appropriate.
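
    To make the failure concrete, here's a minimal Python sketch (mine,
    not anything from the standard) showing that the NFC and NFD
    spellings of the same Hangul text are canonically equivalent yet
    produce different MD5 digests, so a compressor that silently
    renormalizes breaks any byte-level signature:

        import hashlib
        import unicodedata

        text = "\uD55C\uAE00"                     # "Hangul" as precomposed syllables
        nfd = unicodedata.normalize("NFD", text)  # same text spelled with conjoining jamo

        assert unicodedata.normalize("NFC", nfd) == text      # canonically equivalent...
        print(hashlib.md5(text.encode("utf-8")).hexdigest())
        print(hashlib.md5(nfd.encode("utf-8")).hexdigest())   # ...but the digests differ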

    More sinisterly, it makes for trouble with certain kinds of, say,
    steganography. Hiding data in text isn't as easy as hiding it in
    pictures or sounds, but it can happen. Say I have my S33KR1T M3SS1J
    carefully encoded in my Korean text as every prime-numbered character
    (or whatever), using jamo and precomposed syllables to get them all
    into the right places, and then along comes the compressor and
    scrambles my message! One could rightly argue that I was misusing the
    standard in the first place, but it still feels like the compressor
    is doing something it shouldn't.
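
    To show how fragile such a scheme is, here's a toy sketch (my own
    invention, purely for illustration) that hides one bit per syllable
    by choosing between the precomposed spelling and the canonically
    equivalent jamo sequence; a single pass of NFC wipes the message:

        import unicodedata

        SYLLABLES = "\uD55C\uAE00\uC790\uBAA8"   # four precomposed carrier syllables

        def hide(bits):
            # 0 keeps the precomposed form (NFC); 1 swaps in the jamo spelling (NFD)
            return "".join(unicodedata.normalize("NFD" if b else "NFC", s)
                           for b, s in zip(bits, SYLLABLES))

        def reveal(text):
            # precomposed syllables occupy U+AC00..U+D7A3; anything else here is jamo
            bits, i = [], 0
            while i < len(text):
                if "\uAC00" <= text[i] <= "\uD7A3":
                    bits.append(0); i += 1
                else:
                    bits.append(1)
                    while i < len(text) and not ("\uAC00" <= text[i] <= "\uD7A3"):
                        i += 1
            return bits

        msg = hide([1, 0, 1, 1])
        print(reveal(msg))                                # [1, 0, 1, 1]
        print(reveal(unicodedata.normalize("NFC", msg)))  # [0, 0, 0, 0]: message gone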

    I think I'd rather we had a standard that allows some way to specify
    that a process really and truly does NOTHING to the input, that the
    output is bit-for-bit identical to the input. C10 can presumably
    still say that canonical replacement is kosher for processes that
    purport "not to modify the interpretation of a valid coded character
    representation," but anything that claims "not to alter the bit-level
    encoding" has to leave every 1 and 0 alone.

    Now that I actually read the text, I note that C10 explicitly leaves
    out problems like the ones I was describing above. It specifies the
    requirements for claiming not to modify the INTERPRETATION of the
    characters. But we're not necessarily talking about interpretations
    here, and I'd say a compressor that messes about with interpretations
    is an unusual compressor: caveat emptor if you use one. Compressors
    generally don't muck about with interpretations; they compress and
    uncompress characters (well, octets, but even if you consider
    characters in the Unicode sense, we're working with *characters* and
    not their interpretations).

    I think there's room for specifying a bit-for-bit identity level of
    conformance, and most compression routines would already conform to
    it. (A command-line option to turn on NFC/NFD preprocessing might be
    handy, but it should be strictly optional.)
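
    A wrapper along these lines (the --normalize flag and everything
    about it is hypothetical, just to sketch the idea) would keep the
    bit-for-bit path as the default and make renormalization strictly
    opt-in:

        import argparse
        import sys
        import unicodedata
        import zlib

        parser = argparse.ArgumentParser(description="toy Unicode compressor")
        parser.add_argument("--normalize", choices=["NFC", "NFD"],
                            help="renormalize before compressing (changes the bits!)")
        args = parser.parse_args()

        data = sys.stdin.buffer.read()
        if args.normalize:   # opt-in preprocessing; the default leaves every bit alone
            data = unicodedata.normalize(
                args.normalize, data.decode("utf-8")).encode("utf-8")
        sys.stdout.buffer.write(zlib.compress(data))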

    ~mark


