In a message dated 2001-07-13 7:00:26 Pacific Daylight Time,
unicode@abyssiniacybergateway.net writes:
> Sounds promising! How well does SCSU gzip?
If gzip works anything like PKZIP, the answer is, very well indeed. This is
because (using the explanation I have heard before) SCSU retargets Unicode
text to an 8-bit model, meaning that for small alphabetic scripts (or
medium-sized syllabic scripts like Ethiopic), most characters are represented
in one byte, so the information appears 8 bits at a time. Many
general-purpose compression algorithms are optimized for that kind of data.
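The effect is easy to sketch in Python. This is not real SCSU, just a toy single-byte window for Cyrillic (offset U+0400, my own choice for illustration), with zlib standing in for the DEFLATE engine that gzip and PKZIP use:

```python
import zlib

# Sample text whose letters all fall in one 128-character range (Cyrillic).
text = "съешь же ещё этих мягких французских булок " * 200

# UTF-8: every Cyrillic letter costs 2 bytes.
utf8 = text.encode("utf-8")

# Toy "window": subtract a fixed offset so each character fits in 1 byte,
# leaving ASCII (the spaces) alone. Real SCSU switches among such windows.
WINDOW_OFFSET = 0x0400
one_byte = bytes(
    ord(c) - WINDOW_OFFSET if ord(c) >= 0x0400 else ord(c)
    for c in text
)

print("raw:", len(utf8), "vs", len(one_byte))
print("deflated:", len(zlib.compress(utf8)), "vs", len(zlib.compress(one_byte)))
```

The single-byte form is already roughly half the size raw, and it hands the DEFLATE modeler clean 8-bit symbols instead of split multi-byte sequences.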
Recently I created a test file of all Unicode characters in code point order
(excluding the surrogate code points, but including the noncharacters). I
will admit up front that this is a pathological test case and real-world data
probably won't behave anywhere near the same. Having said that, here are the
file sizes of this data expressed in UTF-8 and SCSU, raw and zipped (using
PKZIP 4.0):
Raw UTF-8      4,382,592
Zipped UTF-8   2,264,152  (52% of raw UTF-8)
Raw SCSU       1,179,688  (27% of raw UTF-8)
Zipped SCSU      104,316  (9% of raw SCSU, < 5% of zipped UTF-8)
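For anyone who wants to recheck the percentages, they follow directly from the byte counts:

```python
# Recompute the quoted percentages from the raw byte counts above.
raw_utf8, zipped_utf8 = 4_382_592, 2_264_152
raw_scsu, zipped_scsu = 1_179_688, 104_316

print(round(100 * zipped_utf8 / raw_utf8))  # 52 (% of raw UTF-8)
print(round(100 * raw_scsu / raw_utf8))     # 27 (% of raw UTF-8)
print(round(100 * zipped_scsu / raw_scsu))  # 9  (% of raw SCSU)
print(100 * zipped_scsu / zipped_utf8 < 5)  # True: under 5% of zipped UTF-8
```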
So PKZIP compressed this particular (non-real-world) UTF-8 data by only 48%,
but compressed the equivalent SCSU data by a whopping 91%. That's because
SCSU puts the data into an 8-bit model, which brings out the best in PKZIP.
Gzip should behave much the same, since gzip and PKZIP both use the same
underlying DEFLATE algorithm.
Note that real-world data would be much more useful for this comparison than
my sequential-order data, which certainly favors SCSU: it minimizes window
switches and creates highly repetitive patterns. Also note that SCSU does not
define a unique encoding; the same data run through a different compressor
might yield more or fewer than 1,179,688 bytes. I used my own compressor.
-Doug Ewell
Fullerton, California
This archive was generated by hypermail 2.1.2 : Fri Jul 13 2001 - 13:10:28 EDT