Re: A UTF-8 based News Service

From: DougEwell2@cs.com
Date: Fri Jul 13 2001 - 11:53:53 EDT


In a message dated 2001-07-13 7:00:26 Pacific Daylight Time,
unicode@abyssiniacybergateway.net writes:

> Sounds promising! How well does SCSU gzip?

If gzip works anything like PKZIP, the answer is: very well indeed. This is
because (to borrow an explanation I have heard before) SCSU retargets Unicode
text to an 8-bit model, meaning that for small alphabetic scripts (or
medium-sized syllabic scripts like Ethiopic) most characters are represented
in a single byte, so the information arrives 8 bits at a time. Many
general-purpose compression algorithms are optimized for exactly that kind of
data.
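
To make the one-byte point concrete, here is a toy sketch in Python of the
single-window idea. This is not a conformant SCSU encoder: the SD0 tag and
the window arithmetic follow UTS #6, but everything else, including the
function name, is simplified for illustration.

    SD0 = 0x18  # SCSU tag: define dynamic window 0 and make it active

    def toy_scsu_encode(text, window_base):
        """Encode text whose non-ASCII code points all fall within
        [window_base, window_base + 0x7F]."""
        assert window_base % 0x80 == 0 and 0 < window_base <= 0x3380
        out = bytearray([SD0, window_base // 0x80])  # two tag bytes, paid once
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                out.append(cp)                       # ASCII passes through as-is
            else:
                assert window_base <= cp <= window_base + 0x7F
                out.append(0x80 + cp - window_base)  # one byte per character
        return bytes(out)

    # Three Ethiopic syllables: 9 bytes in UTF-8, but only 5 bytes here
    # (2 bytes of window setup plus 3 data bytes).
    print(len(toy_scsu_encode("\u1230\u120b\u121d", 0x1200)))

After the two-byte window definition, every Ethiopic character costs one
byte, which is exactly the kind of stream a general-purpose compressor
models well.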

Recently I created a test file of all Unicode characters in code point order
(excluding the surrogate code points, but including the noncharacters). I
will admit up front that this is a pathological test case, and real-world
data probably won't behave anywhere near the same. Having said that, here are
the file sizes of this data expressed in UTF-8 and SCSU, raw and zipped
(using PKZIP 4.0):

    Raw UTF-8      4,382,592
    Zipped UTF-8   2,264,152   (52% of raw UTF-8)
    Raw SCSU       1,179,688   (27% of raw UTF-8)
    Zipped SCSU      104,316   (9% of raw SCSU, < 5% of zipped UTF-8)
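
A file like that can be generated along the following lines (a sketch in
Python; the construction beyond "every code point except the surrogates, in
order" is illustrative, and the file name is a placeholder):

    # 128*1 + 1,920*2 + 61,440*3 + 1,048,576*4 = 4,382,592 bytes of UTF-8
    codepoints = (chr(cp) for cp in range(0x110000)
                  if not 0xD800 <= cp <= 0xDFFF)    # skip the surrogates
    with open("all_chars.utf8", "wb") as f:
        f.write("".join(codepoints).encode("utf-8"))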

So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%,
but compressed the equivalent SCSU data by a whopping 91%. That's because
SCSU puts the data in an 8-bit model, which brings out the best in PKZIP.
Gzip, which is built on the same DEFLATE compression method, should behave
much the same way.
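
Anyone who wants to answer the gzip question directly can try something like
the following, using Python's zlib module (the DEFLATE engine behind gzip);
the file names are placeholders for the two raw encodings:

    import zlib

    for name in ("all_chars.utf8", "all_chars.scsu"):
        raw = open(name, "rb").read()
        packed = zlib.compress(raw, 9)   # level 9 = best compression
        print("%s: %d -> %d bytes (%.0f%% of raw)"
              % (name, len(raw), len(packed), 100.0 * len(packed) / len(raw)))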

Note that real-world data would probably be much more useful for this
comparison than my sequential-order data, which certainly favors SCSU because
it minimizes window switches and creates repetitive patterns. Also note that
SCSU compressors differ, and the same data encoded with a different
compressor might yield more or fewer than 1,179,688 bytes. I used my own
compressor.

-Doug Ewell
 Fullerton, California


