From: Hans Aberg (haberg@math.su.se)
Date: Thu Sep 21 2006 - 08:19:28 CDT
On 21 Sep 2006, at 14:26, Doug Ewell wrote:
>> One way to do a character compression is to simply do a frequency
>> analysis, sort the characters according to that, which gives a map
>> code points -> code points. Then apply a variable width character
>> encoding which gives smaller width to smaller non-negative
>> integers, like say UTF-8, to that. Here, the compression method
>> cannot do worse than UTF-8.
>
> You mean, do Huffman encoding, but with bytes as the basic code
> unit instead of bits?
It is similar to what I indicated. It is just a general scheme that
comes to my mind. :-)
One can think of different methods, depending on what properties one
wants the encoded text to have. So UTF-like encodings have the
advantages that falling into bytes, and that multibytes boundaries
can be always resynchronized, but gives less compression than a bit
scheme. One can use a code point to bit-map encoding to get better
compression.
Another method, which enables compressing both characters (code
points) and natural language words (sequences of code points), might
be to make modified UTF-8, where the leading byte admits indicating
two categories of numbers. (Continued below.)
> Don't forget you need to store the frequency table along with the
> compressed data, so the reader can reconstruct the table. That
> could mitigate your compression somewhat.
I am basically ignoring the problem of the table size. In some
methods, the tables size will be crucial, of course.
But one might do a statistical analysis of common natural words of a
large text body, and based on that, give more common words lower
numbers. This results in a fixed translation table, suitable for
general purpose texts. Then one applies the hybrid encoding above. As
the natural word translation table will be the same for different
compressed texts, it will not matter if it is large, as it can be
implemented as a library.
Hans Aberg
This archive was generated by hypermail 2.1.5 : Thu Sep 21 2006 - 08:23:03 CDT