From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 06:36:16 CDT
On 21 Sep 2006, at 07:01, Doug Ewell wrote:
> Different compression methods work in different ways. Certainly, a
> compression method that is specifically designed for Unicode text
> can take advantage of the unique properties of Unicode text, as
> compared to, say, photographic images.
I guess that is the simple point: the more structure one can
recognize, the better a compression method can do.
> I've often suspected that a Huffman or arithmetic encoder that
> encoded Unicode code points directly would perform better than a
> byte-based one working with UTF-8 code units. I haven't done the
> math to prove it, though.
And specifically, recognizing common words in natural languages is
something that can be done when working with Unicode code points, and
this is perhaps harder to do with a byte-oriented compression
method.
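
As a rough way to probe the suspicion quoted above, one can compare the
ideal size of a zero-order entropy coder (the limit a static Huffman or
arithmetic coder approaches) when the symbols are code points versus
UTF-8 bytes. The sketch below is only an illustration: the sample text
and the helper name order0_bits are made up for this example, and the
cost of transmitting the symbol table is ignored, which favours the
code point model on real data.

```python
import math
from collections import Counter

def order0_bits(symbols):
    """Ideal total size in bits for a zero-order entropy coder.

    Hypothetical helper: every symbol costs -log2(p) bits, where p is
    its empirical frequency in the input. Model cost is not counted.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(n * math.log2(n / total) for n in counts.values())

# Made-up sample mixing Cyrillic (2 bytes per letter in UTF-8) and ASCII.
sample = "Привет, мир! Привет, Unicode! " * 20

# Iterating a str yields code points; iterating bytes yields byte values.
code_point_bits = order0_bits(sample)
utf8_bits = order0_bits(sample.encode("utf-8"))

print("code points:", round(code_point_bits), "bits")
print("UTF-8 bytes:", round(utf8_bits), "bits")
```

With the table cost left out, the code point figure cannot exceed the
byte figure, because each code point maps to one fixed byte sequence.
How large the gap is, and whether it survives once the larger symbol
table is paid for, depends on the text.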
But it also hinges on how advanced the pattern recognition of a
byte-oriented compression method is: a code point pattern translates
into a fixed byte pattern in a given character encoding, so it might
in principle be possible for the byte-oriented compression method to
recognize it. But it then needs to be able to recognize multibyte
patterns, not only single bytes.
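
To make the last point concrete, here is a small sketch using Python's
zlib module (DEFLATE, a byte-oriented method that does match repeated
multibyte sequences). The sample word and text are made up for
illustration, and this shows only one such method, not byte-oriented
compression in general.

```python
import zlib

# Made-up sample: a Cyrillic word that recurs throughout the text.
word = "Привет"
text = (word + ", мир! ") * 50

raw = text.encode("utf-8")
word_bytes = word.encode("utf-8")

# In UTF-8 the word always becomes the same multibyte sequence, so a
# byte-level matcher has a stable pattern to find.
occurrences = raw.count(word_bytes)

# DEFLATE replaces repeated byte sequences with back-references.
compressed = zlib.compress(raw, 9)

print(len(word), "code points ->", len(word_bytes), "bytes per occurrence")
print(occurrences, "occurrences,", len(raw), "raw bytes,",
      len(compressed), "compressed bytes")
```

A coder that models single bytes in isolation would see only the
individual byte values here, not the recurring word.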
Hans Aberg
This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 06:38:05 CDT