From: Doug Ewell (dewell@adelphia.net)
Date: Thu Sep 21 2006 - 00:01:46 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
> Relative to that stuff, I suggest compressing the character data, as
> represented by the code points, rather than any character-encoded data.
> Typically, a compression method builds a binary encoding based on a
> statistical analysis of a sequence of data units. So if applied to the
> character data, such a compression yields a character encoding.
> Conversely, any character encoding can be viewed as a compression
> method with certain statistical properties.
Different compression methods work in different ways. Certainly, a
compression method that is specifically designed for Unicode text can
take advantage of its unique properties, as compared to, say,
photographic images.
I've often suspected that a Huffman or arithmetic encoder that encoded
Unicode code points directly would perform better than a byte-based one
working with UTF-8 code units. I haven't done the math to prove it,
though.
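One rough way to test that intuition is to compare the zero-order
entropy of the same text viewed as a stream of code points versus a
stream of UTF-8 bytes. The little Python sketch below is only that, a
sketch: the sample string is made up, and a zero-order model is just a
lower bound on what a real Huffman or arithmetic coder would achieve.

    import math
    from collections import Counter

    def zero_order_bits(symbols):
        # Ideal zero-order code length in bits for the whole sequence:
        # the entropy bound that a Huffman or arithmetic coder fitted
        # to the per-symbol frequencies could approach.
        counts = Counter(symbols)
        total = len(symbols)
        return -sum(n * math.log2(n / total) for n in counts.values())

    # Made-up sample mixing ASCII and non-ASCII characters.
    text = "Это пример текста. Here is some mixed-script sample text. " * 50

    bits_cp = zero_order_bits(text)                  # one symbol per code point
    bits_u8 = zero_order_bits(text.encode("utf-8"))  # one symbol per UTF-8 byte

    print(f"code points: {bits_cp:.0f} bits ({bits_cp / len(text):.2f} per character)")
    print(f"UTF-8 bytes: {bits_u8:.0f} bits ({bits_u8 / len(text):.2f} per character)")

On text with many non-ASCII characters the code-point model usually
comes out ahead, since the byte model spends bits on the structure of
the multi-byte sequences; for pure ASCII the two collapse into the same
thing.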
> When compressing character-encoded data, one first translates it into
> character data, and compresses that. So it then does not matter which
> character encoding was originally used in the input, as the character
> data will be the same: the final compression need only include the
> additional information about the original character encoding in order
> to restore the data.
Actually, it does matter for some compression methods, such as the
well-known LZW. Burrows-Wheeler is fairly unusual in this regard.
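LZW, for example, builds its dictionary from whatever symbols it is
actually fed, so the same character data presented as code points and
as UTF-8 code units yields different token streams. A toy sketch
(nothing more than an illustration, with a made-up sample string) makes
this concrete:

    def lzw_encode(symbols):
        # Minimal LZW over an arbitrary symbol sequence: the dictionary
        # is seeded with every distinct symbol, then grows with each
        # new phrase seen.
        dictionary = {(s,): i for i, s in enumerate(sorted(set(symbols)))}
        w, out = (), []
        for s in symbols:
            ws = w + (s,)
            if ws in dictionary:
                w = ws
            else:
                out.append(dictionary[w])
                dictionary[ws] = len(dictionary)
                w = (s,)
        if w:
            out.append(dictionary[w])
        return out

    # Made-up sample: the same character data in two representations.
    text = "Компрессия: same text, two encodings. " * 40
    print(len(lzw_encode([ord(c) for c in text])), "tokens over code points")
    print(len(lzw_encode(list(text.encode("utf-8")))), "tokens over UTF-8 bytes")

The token counts, and the dictionaries behind them, will generally come
out differently, which is the sense in which the representation matters
for LZW-style methods.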
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645  *  UTN #14