Compressing Unicode [was: A UTF-8 based News Service]

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Sat Jul 14 2001 - 15:05:34 EDT


[sorry if you receive this twice -- wee little problem with my mailer]

D> Recently I created a test file of all Unicode characters in code
D> point order (excluding the surrogates, but including all the other
D> non-characters). I will admit up front that this is a pathological
D> test case and real-world data probably won't behave anywhere near
D> the same.

This test is completely and utterly meaningless. (CP/M 100% faster
than Unix, according to Ziff-Davis.)

Flate compression (used by both the ZIP and gzip formats) is a
two-step process. First, repeated strings are replaced by
back-references using a variant of LZ77. Then, the resulting stream
is encoded with Huffman coding, using dynamically built code tables.

In the case of SCSU, your data contains the very same byte sequence
once per window. The LZ step will reduce every occurrence of this
sequence but the first to a single back-reference, which the Huffman
coding will then reduce to a handful of bits.

The UTF-8 version of your data, on the other hand, contains
essentially no repeated byte sequences, which is extremely
pathological indeed. Thus, Flate on this data degenerates to plain
Huffman coding.
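
Just to illustrate the point (this is only a stand-in for your test
file, not the file itself, and it assumes nothing fancier than
Python's zlib module, which implements the same LZ + Huffman pipeline
as ZIP and gzip):

    import zlib

    # "SCSU-like": the same short byte pattern repeated over and over,
    # much as each 128-character window repeats in the SCSU output.
    periodic = bytes(range(128)) * 1024

    # "UTF-8-like": consecutive code points (surrogates skipped), so the
    # byte stream has essentially no repeated substrings for LZ to find.
    consecutive = "".join(chr(c) for c in range(0x20, 0x20000)
                          if not 0xD800 <= c <= 0xDFFF).encode("utf-8")

    for name, data in (("periodic", periodic), ("consecutive", consecutive)):
        out = zlib.compress(data, 9)
        print("%-12s %7d -> %6d bytes (%.1f%%)"
              % (name, len(data), len(out), 100.0 * len(out) / len(data)))

The periodic stream all but vanishes; the consecutive one shrinks far
less, because only the Huffman stage has anything to work with.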

A trivial differential predictor (applied to code points, not to UTF-8
byte values) would yield much better results in this case than SCSU
(roughly 99.9% compression, I believe). Doug, are you trying to sell
us a bridge?
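
For the curious, here is a rough sketch of what I have in mind,
assuming we simply delta-encode the code points as variable-length
integers and hand the result to zlib (the varint scheme and the names
are mine, nothing standardized):

    import zlib

    def delta_encode(codepoints):
        # Replace each code point by its difference from the previous
        # one, stored as a zigzag varint so small deltas take one byte.
        out = bytearray()
        prev = 0
        for cp in codepoints:
            d = cp - prev
            prev = cp
            z = (d << 1) ^ (d >> 31)        # zigzag: small |d| -> small z
            while z >= 0x80:
                out.append((z & 0x7F) | 0x80)
                z >>= 7
            out.append(z)
        return bytes(out)

    # All code points in order, surrogates excluded: the pathological case.
    cps = [c for c in range(0x20, 0x110000) if not 0xD800 <= c <= 0xDFFF]
    raw = "".join(map(chr, cps)).encode("utf-8")
    packed = zlib.compress(delta_encode(cps), 9)
    print("UTF-8: %d bytes, delta + Flate: %d bytes"
          % (len(raw), len(packed)))

Almost every delta is 1, so the delta stream is one constant byte
repeated a million times over, and Flate eats it for breakfast.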

                                        Juliusz


