Recently John Bennett said:
>> I had assumed that traditional compression algorithms looked for repeats
>> on an 8-bit basis and, hence, would fail to compress Unicode. Is this
>> assumption correct/incorrect?
>
>The compressions do work on an 8-bit basis, but looking at Unicode text as a
>sequence of bytes will still find a lot of pattern. It just doesn't do as
>good a job as it would if it dealt with 16-bit chunks.
Actually, in a single-language document (say, in an ISO Latin-1 language),
the upper byte of almost every 16-bit code would be identical. In English
for example, the upper byte would be all zeros and the lower byte would be
equivalent to ASCII. Thus I would expect current text compressors to do
an excellent job on Unicode, since that one upper-byte value would make up
about half of all the bytes in the file; both statistical modeling and
dictionary methods now in use should be able to exploit this fact. But if I
understand what John said, a Unicode-specific algorithm could do better.
I'd be most surprised if it did a whole lot better. Even in a two-language
document, the same argument applies to the upper byte frequency.
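
For anyone who wants to check this rather than take my word for it, here
is a rough sketch in Python. It uses zlib (deflate) as a stand-in for "a
traditional 8-bit compressor", and the word list, the make_sample() and
compare() helpers, and the sample size are all made up for illustration.
It encodes the same English-like text once as one byte per character and
once as big-endian 16-bit codes, then compresses both. It is not a
benchmark, and real text will give different numbers.

```python
import random
import zlib

# Hypothetical vocabulary; real prose has a far richer distribution.
WORDS = ["unicode", "compression", "byte", "pattern", "upper", "lower",
         "text", "ascii", "frequency", "dictionary", "model", "language"]

def make_sample(n_words=5000, seed=0):
    """Build pseudo-random English-like text (illustrative only)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def compare(text):
    """Compress the same text as 8-bit and as 16-bit codes with zlib."""
    raw8 = text.encode("latin-1")      # one byte per character
    raw16 = text.encode("utf-16-be")   # upper byte first, no byte-order mark
    return {
        "raw8": len(raw8),
        "raw16": len(raw16),
        "zip8": len(zlib.compress(raw8, 9)),
        "zip16": len(zlib.compress(raw16, 9)),
        # Every other byte, starting at 0, is the upper byte of a code.
        "upper_bytes": set(raw16[0::2]),
    }

sizes = compare(make_sample())
print(sizes)
```

If the argument above holds, "upper_bytes" should contain only zero and
"zip16" should come out far smaller than "raw16". How close "zip16" gets
to "zip8" is exactly the open question of how much a Unicode-specific
method could still gain.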
Wayne Pollock, pollock@acm.org