From: jon@hackcraft.net
Date: Wed Nov 26 2003 - 07:37:24 EST
> In the case of GIF versus JPG, which are usually regarded as "lossless"
> versus "lossy", please note that there /is/ no "original", in the sense
> of a stream of bytes. Why not? Because an image is not a stream of
> bytes. Period. What is being compressed here is a rectangular array of
> pixels, and that is what is being restored when the image is "viewed". I
> am not aware of ANY use of the GIF format to compress an arbitrary byte
> stream.
>
> So, by analogy, if the XYZ compression format (I made that up) claims to
> compress a sequence of Unicode glyphs, as opposed to an arbitrary byte
> stream, and can later reconstruct that sequence of glyphs exactly, then
> I argue that it has every right to be called "lossless", in the same
> manner that GIF is called "lossless", because /there is no original byte
> stream to preserve/.
Well there *is* a stream of bytes with GIFs, and they *are* reconstructed
perfectly on decompression. Most of the time it only matters that the image
isn't altered by the compression (PNG is perhaps a better analogy: with GIF we
might be forced to reduce the colour depth of the image to make it fit the
format, whereas with PNG we can get better compression by dropping to 256 or
fewer colours, but we don't have to). However, it could be an issue if we
performed some operation on the underlying data that treated it as bytes
(signing the image in BMP format springs to mind as a possibility).
While in practice we would generally not have such issues (we would move our
signing operation to after the PNG creation), they could arise if we had some
concept of signing an image independent of image format, implemented by
converting to a canonical format as needed. In such a case PNG, being lossless
in its treatment of the byte stream, would be usable; JPEG, being lossy, would
not.
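To illustrate the byte-level point, here is a minimal sketch (the image bytes
are a made-up stand-in, and zlib stands in for any byte-lossless codec such as
PNG's deflate): a digest taken over the original bytes still verifies after a
compress/decompress round trip, which is exactly what a lossy codec cannot
guarantee.

```python
# Sketch: a byte-lossless codec round-trips the exact byte stream,
# so a signature (here just a SHA-256 digest) computed before
# compression still verifies after decompression.
import hashlib
import zlib

original = b"BM" + bytes(range(256))  # stand-in for raw BMP image bytes
digest_before = hashlib.sha256(original).hexdigest()

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

digest_after = hashlib.sha256(restored).hexdigest()
assert restored == original           # lossless at the byte level
assert digest_after == digest_before  # so the signature survives
```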
With a similar operation on Unicode text data we have a similar problem and a
similar solution. If we need to use the underlying bytes (let's say we're
signing again), we can either move the byte-level operation to after the
compression or, if some requirement forbids that, we are forced to use a
compression scheme that is lossless at the byte level.
If our concept of signing is independent of encoding, then we can move between
encodings during the compression process and sign on a canonical encoding. XML
Signature is an example of this (it treats UTF-8 as a canonical encoding).
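A sketch of that idea (the function name is mine, and a bare SHA-256 digest
stands in for a real signature): decode whatever encoding arrived, re-encode
as UTF-8, and digest that. Two different byte streams carrying the same
character sequence then yield the same digest.

```python
# Sketch of signing on a canonical encoding, in the spirit of XML
# Signature's use of UTF-8: the digest is taken over the UTF-8 form,
# not over whatever bytes happened to arrive.
import hashlib

def canonical_digest(data: bytes, encoding: str) -> str:
    """Decode from the wire encoding, digest the canonical UTF-8 form."""
    text = data.decode(encoding)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

latin1 = "café".encode("latin-1")   # b'caf\xe9'
utf16 = "café".encode("utf-16")     # BOM + little-endian code units

# Different byte streams, same character sequence, same digest.
assert canonical_digest(latin1, "latin-1") == canonical_digest(utf16, "utf-16")
```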
If our concept of signing considers canonically equivalent sequences to be
equivalent, we can move between normalisation forms in the compression process
and sign and verify on a specified normalisation form. (Again, XML Signature
alludes to this possibility, though it doesn't use it, as it could introduce
security issues in some cases; but for applications that truly treat
canonically equivalent sequences as equivalent, this is a viable pre-processing
step to XML Signature.)
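The normalisation-form variant can be sketched the same way (again a plain
digest stands in for a signature, and NFC is an arbitrary choice of specified
form): canonically equivalent sequences that differ as code-point sequences
digest identically once normalised.

```python
# Sketch: normalise to a specified form (NFC here) before digesting,
# so canonically equivalent sequences verify against the same signature.
import hashlib
import unicodedata

def nfc_digest(text: str) -> str:
    return hashlib.sha256(
        unicodedata.normalize("NFC", text).encode("utf-8")
    ).hexdigest()

precomposed = "\u00e9"    # é as a single code point
decomposed = "e\u0301"    # e followed by combining acute accent

assert precomposed != decomposed                          # different sequences
assert nfc_digest(precomposed) == nfc_digest(decomposed)  # same signature
```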
So perhaps we should stop talking about "lossy/lossless" and talk about "what
is lost" in a given operation. The advantage gained (theoretically, at least;
does anyone have data on how significant this is?) comes from removing entropy
of a type that the compression algorithm is unlikely to be able to remove
itself. The question is whether this is truly entropy, or actually data. I'd
lean towards considering it entropy and removing it, but I'd like to be warned
in advance that this was going to happen, and to have other options available.