Re: Compression through normalization

From: Mark E. Shoulson (mark@kli.org)
Date: Mon Nov 24 2003 - 10:52:03 EST


    On 11/24/03 01:26, Doug Ewell wrote:

    >So the question becomes: Is it legitimate for a Unicode compression
    >engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
    >another (canonically equivalent) normalization form to improve its
    >compressibility?
    >
    OK, this *is* a fascinating question. When people hear "lossless
    compression," they take it to mean that decompress(compress(T)) = T
    for all T, no matter what. What you get out doesn't just look like
    what you put in, it IS what you put in. But C10 permits replacement
    by canonical equivalents, and I think that may be a problem. Picture
    signing an MD5 hash of a file and sending it, compressed, to a
    friend, who uncompresses it and finds it no longer hashes to the same
    value! We might require that message-hashing only be done on text in
    a particular normalization form, but that may not always be
    appropriate.
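
    To make the failure concrete, here's a minimal Python sketch (mine,
    not anything from the standard) showing that the NFC and NFD
    spellings of the same Hangul text are canonically equivalent yet
    produce different MD5 digests, so a compressor that silently
    renormalizes breaks any byte-level signature:

        import hashlib
        import unicodedata

        text = "\uD55C\uAE00"                     # "Hangul" as precomposed syllables
        nfd = unicodedata.normalize("NFD", text)  # same text spelled with conjoining jamo

        assert unicodedata.normalize("NFC", nfd) == text      # canonically equivalent...
        print(hashlib.md5(text.encode("utf-8")).hexdigest())
        print(hashlib.md5(nfd.encode("utf-8")).hexdigest())   # ...but the digests differ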

    More sinisterly, it makes for trouble with certain kinds of, say,
    steganography. Hiding data in text isn't as easy as hiding it in
    pictures or sounds, but it can happen. Say I have my S33KR1T M3SS1J
    carefully encoded in my Korean text as every prime-numbered character
    (or whatever), using jamo and precomposed syllables to get them all
    into the right places, and then along comes the compressor and
    scrambles my message! One could rightly argue that I was misusing the
    standard in the first place, but it still feels like the compressor
    is doing something it shouldn't.
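
    To show how fragile such a scheme is, here's a toy sketch (my own
    invention, purely for illustration) that hides one bit per syllable
    by choosing between the precomposed spelling and the canonically
    equivalent jamo sequence; a single pass of NFC wipes the message:

        import unicodedata

        SYLLABLES = "\uD55C\uAE00\uC790\uBAA8"   # four precomposed carrier syllables

        def hide(bits):
            # 0 keeps the precomposed form (NFC); 1 swaps in the jamo spelling (NFD)
            return "".join(unicodedata.normalize("NFD" if b else "NFC", s)
                           for b, s in zip(bits, SYLLABLES))

        def reveal(text):
            # precomposed syllables occupy U+AC00..U+D7A3; anything else here is jamo
            bits, i = [], 0
            while i < len(text):
                if "\uAC00" <= text[i] <= "\uD7A3":
                    bits.append(0); i += 1
                else:
                    bits.append(1)
                    while i < len(text) and not ("\uAC00" <= text[i] <= "\uD7A3"):
                        i += 1
            return bits

        msg = hide([1, 0, 1, 1])
        print(reveal(msg))                                # [1, 0, 1, 1]
        print(reveal(unicodedata.normalize("NFC", msg)))  # [0, 0, 0, 0]: message gone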

    I think I'd rather we had a standard that allows some way to specify
    that a process really and truly does NOTHING to the input, that the
    output is bit-for-bit identical to the input. C10 can presumably
    still say that canonical replacement is kosher for processes that
    purport "not to modify the interpretation of a valid coded character
    representation," but anything that claims "not to alter the bit-level
    encoding" has to leave every 1 and 0 alone.

    Now that I actually read the text, I note that C10 explicitly leaves
    out problems like the ones I was describing above. It specifies the
    requirements for claiming not to modify the INTERPRETATION of the
    characters. But we're not necessarily talking about interpretations
    here, and I'd say a compressor that messes about with interpretations
    is an unusual compressor: caveat emptor if you use one. Compressors
    generally don't muck about with interpretations; they compress and
    uncompress characters (well, octets, but even if you consider
    characters in the Unicode sense, we're working with *characters* and
    not their interpretations).

    I think there's room for specifying a bit-for-bit identity level of
    conformance, and most compression routines would already conform to
    it. (A command-line option to turn on NFC/NFD preprocessing might be
    handy, but it should be strictly optional.)
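
    A wrapper along these lines (the --normalize flag and everything
    about it is hypothetical, just to sketch the idea) would keep the
    bit-for-bit path as the default and make renormalization strictly
    opt-in:

        import argparse
        import sys
        import unicodedata
        import zlib

        parser = argparse.ArgumentParser(description="toy Unicode compressor")
        parser.add_argument("--normalize", choices=["NFC", "NFD"],
                            help="renormalize before compressing (changes the bits!)")
        args = parser.parse_args()

        data = sys.stdin.buffer.read()
        if args.normalize:   # opt-in preprocessing; the default leaves every bit alone
            data = unicodedata.normalize(
                args.normalize, data.decode("utf-8")).encode("utf-8")
        sys.stdout.buffer.write(zlib.compress(data))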

    ~mark


