From: Hans Aberg (haberg@math.su.se)
Date: Wed Sep 20 2006 - 12:53:14 CDT
On 20 Sep 2006, at 04:14, Doug Ewell wrote:
> Hans Aberg <haberg at math dot su dot se> wrote:
>
>> It is probably more efficient to translate the stream into code
>> points and then use a compression technique on that, because then
>> the full character structure is taken into account. Then it does
>> not matter which character encoding is used.
>
> If you have not yet read Unicode Technical Note #14, particularly
> the sections on "general-purpose compression" and "two-layer
> compression," you might wish to do so.
Relative to that, I suggest compressing the character data, as
represented by the code points, rather than any character-encoded
data.
Typically, a compression method builds a binary encoding based on a
statistical analysis of a sequence of data units. So if it is applied
to the character data, the compression in effect yields a character
encoding. Conversely, any character encoding can be viewed as a
compression method with certain statistical properties.
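As a minimal sketch of the idea above, the following Python builds a
Huffman code whose data units are code points rather than bytes.
Huffman coding is only one possible statistical method, chosen here
for illustration; the function names huffman_code, encode and decode
are hypothetical, and the "binary encoding" is kept as a string of
'0'/'1' characters for readability.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return {character: bit string}, built from code point frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct character still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {char: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    """Concatenate the bit strings of all characters in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

The resulting table maps each code point to a variable-length bit
string, which is exactly a character encoding tuned to the statistics
of the input; frequent characters get the shorter codes.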
When compressing character-encoded data, one first translates it into
character data and compresses that. It then does not matter which
character encoding was originally used in the input, as the character
data will be the same: the final compressed form need only include
the additional information about what the original character encoding
was, in order to restore the data.
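The scheme just described could be sketched in Python as follows.
This is an illustration under assumptions, not a definitive design:
pack and unpack are hypothetical names, the container layout (one
length byte, the encoding label, then the payload) is invented here,
and zlib over a canonical UTF-8 serialization merely stands in for a
compressor that works on code points directly.

```python
import zlib


def pack(raw: bytes, encoding: str) -> bytes:
    """Decode raw to character data, compress it, and record the encoding."""
    text = raw.decode(encoding)  # character data (code points)
    # Stand-in compressor: zlib over one fixed canonical serialization.
    payload = zlib.compress(text.encode("utf-8"))
    label = encoding.encode("ascii")  # assumed shorter than 256 bytes
    return bytes([len(label)]) + label + payload


def unpack(blob: bytes):
    """Return (original bytes, original encoding name)."""
    n = blob[0]
    encoding = blob[1:1 + n].decode("ascii")
    text = zlib.decompress(blob[1 + n:]).decode("utf-8")
    # Re-encode into the encoding named in the header.
    return text.encode(encoding), encoding
```

With this layout, the same text supplied as Latin-1 or as UTF-16LE
yields an identical payload, and only the short label differs. Exact
byte-for-byte restoration assumes the original encoding has a single
byte sequence for each text.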
There is the problem of large translation tables. But that belongs to
the chapter on table compression, or alternatively, one can use a set
of character encodings that, though not providing the most efficient
compression, may admit compact translation functions. On the other
hand, a translation table of just a hundred thousand characters is
not so big anymore on today's computers.
And one can go further, doing a statistical analysis of typical text
in the different languages, identifying words and their typical
frequencies. A compressor would then identify common words, suitable
for compression, and give each of them one entry in the translation
table.
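A rough Python sketch of such a word table follows. Everything in it
is an assumption made for illustration: the names build_table,
compress_words and expand_words are hypothetical, common words are
mapped to private-use code points starting at U+E000 (so the input is
assumed to contain none), and "common" simply means occurring more
than once in a sample text.

```python
import re
from collections import Counter

PUA_START = 0xE000  # assumption: the input has no private-use characters
PUA_SIZE = 6400     # U+E000..U+F8FF


def build_table(sample, max_words=PUA_SIZE, min_len=3):
    """Map each common word in sample to a single private-use character."""
    counts = Counter(w for w in re.findall(r"\w+", sample) if len(w) >= min_len)
    common = [w for w, n in counts.most_common(max_words) if n > 1]
    return {w: chr(PUA_START + i) for i, w in enumerate(common)}


def compress_words(text, table):
    """Replace every word that has a table entry by its one-character code."""
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


def expand_words(text, table):
    """Invert compress_words() by expanding each table character."""
    inverse = {v: k for k, v in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)
```

The output is again a sequence of code points, so it could be fed to
any character-level compressor afterwards; the table itself would have
to be shared or stored alongside the data.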
Hans Aberg