Re: Data compression

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 06 2005 - 16:10:10 CDT

Next message: Rick McGowan: "Version 4.1 of UCA Released"

Previous message: Philippe Verdy: "Re: Announcement of Changes to the Unicode Membership structure"
In reply to: N. Ganesan: "Re: Data compression"
Next in thread: Doug Ewell: "Re: Data compression"

From: "N. Ganesan" <naa.ganesan@gmail.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Friday, May 06, 2005 7:47 PM
Subject: Re: Data compression

> Thanks for all the interesting and useful tech comments.
>
> Phillippe wrote:
>>Tamil compresses very well for example with SCSU (with nearly one encoded
>>byte per codepoint).
>
> I'm a mere structural dynamicist and collect, edit of classical Tamil
> texts.
>
> Can you tell a little more on SCSU.

SCSU is fully documented by Unicode itself, in a Technical Standard. See:

UTS 6 "A Standard Compression Scheme for Unicode"
http://www.unicode.org/reports/tr6/

It could qualify as a UTF because it preserves every code point of the
original string, without altering its normalization form (no code points
are reordered, even if the original string is not in any normalized form),
and also because it can still encode invalid code points.

Like UTF-8, SCSU generates a sequence of 8-bit code units. Unlike UTF-8,
though, most encoded texts take roughly 1 byte per code point (plus a few
special control bytes), provided that the text uses a single script and
that script is not too large (which holds for all alphabets, abjads and
abugidas). For East Asian texts, or scripts with large syllabaries, the
average is around 2 bytes per code point (instead of 3, or sometimes 4,
with UTF-8).
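
To make the "1 byte per code point" figure concrete, here is a minimal
sketch in Python. It is NOT a full UTS 6 compressor: it only handles text
made of ASCII plus one 128-code-point block below U+3400, and the function
name scsu_encode_single_window is my own invention for illustration. It
relies on the SD0 tag (0x18), which defines dynamic window 0 at
(next byte) * 0x80 and selects it.

```python
SD0 = 0x18  # single-byte-mode tag: define and select dynamic window 0

# Control characters that pass through unquoted in single-byte mode.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def scsu_encode_single_window(text: str) -> bytes:
    """Hypothetical helper: encode ASCII plus ONE 128-code-point block.

    Raises ValueError for anything this restricted sketch cannot handle.
    """
    out = bytearray()
    base = None
    non_ascii = [ord(c) for c in text if ord(c) >= 0x80]
    if non_ascii:
        base = non_ascii[0] & ~0x7F   # align the window on a 0x80 boundary
        index = base >> 7             # window offset byte (offset = index * 0x80)
        if not 0x01 <= index <= 0x67:
            raise ValueError("window outside the simple x * 0x80 range")
        out += bytes([SD0, index])    # 2 bytes of overhead, paid once
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            if cp < 0x20 and cp not in PASS_THROUGH_CONTROLS:
                raise ValueError("control character would need quoting")
            out.append(cp)            # ASCII passes through unchanged
        elif base <= cp <= base + 0x7F:
            out.append(0x80 + (cp - base))  # one byte per code point
        else:
            raise ValueError("text needs more than one window")
    return bytes(out)


# Five Tamil code points: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
utf8_bytes = sample.encode("utf-8")
scsu_bytes = scsu_encode_single_window(sample)
print(len(sample), len(utf8_bytes), len(scsu_bytes))
```

For this sample, UTF-8 needs 3 bytes per code point (15 bytes), while the
sketch emits 2 bytes to set up the Tamil window and then 1 byte per code
point (7 bytes). On a longer text the 2-byte setup cost becomes negligible.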

But unlike the standard encoding schemes UTF-8, UTF-16 and UTF-32, SCSU
does NOT guarantee a unique encoding for a given sequence of code points.
(Uniqueness also holds for UTF-EBCDIC and CESU-8, which are not recommended
but are supported and documented by Unicode, and for the "modified UTF-8"
encoding used in Java and documented by Sun, which encodes each surrogate
separately, accepts any 16-bit code unit, and encodes U+0000 as 0xC0,0x80
instead of just 0x00.)

With SCSU there are several alternative encodings, which allows SCSU
compressors to be implemented either with simple algorithms or with more
complex algorithms that compress better. The SCSU decompressor, however, is
fully deterministic: a valid SCSU compressed stream decodes to exactly one
sequence of code points.

This means that you can't check the "equality" of two encoded SCSU streams
by comparing their bytes: two streams that differ may still represent the
same text, so you must first decompress them to streams of code points.
(You can safely check encoded strings for equality with UTF-8, UTF-16,
UTF-32, UTF-EBCDIC, and CESU-8.)
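
The following Python sketch illustrates the point with three different byte
streams that all decode to the same five Tamil code points. The decoder is
a deliberately small subset of UTS 6 (single-byte mode only, handling just
the SCn, SDn and SQU tags), and the name scsu_decode_subset is hypothetical;
a real decoder has to handle the remaining tags and Unicode mode as well.

```python
# Default positions of the eight dynamic windows at the start of a stream.
INITIAL_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def scsu_decode_subset(data: bytes) -> str:
    """Hypothetical helper: decode a small subset of SCSU single-byte mode."""
    windows = list(INITIAL_WINDOWS)
    active = 0
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x18 <= b <= 0x1F:            # SDn: define and select window n
            x = data[i]
            i += 1
            if not 0x01 <= x <= 0x67:
                raise ValueError("only simple x * 0x80 offsets are handled")
            active = b - 0x18
            windows[active] = x * 0x80
        elif 0x10 <= b <= 0x17:          # SCn: select existing window n
            active = b - 0x10
        elif b == 0x0E:                  # SQU: quote one 16-bit unit, MSB first
            out.append(chr((data[i] << 8) | data[i + 1]))
            i += 2
        elif b >= 0x80:                  # byte inside the active window
            out.append(chr(windows[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))           # ASCII passes through
        else:
            raise ValueError("tag 0x%02X not handled by this sketch" % b)
    return "".join(out)


via_window_0 = bytes.fromhex("1817a4aebfb4cd")   # SD0, then 1 byte each
via_window_3 = bytes.fromhex("1b17a4aebfb4cd")   # SD3, same window position
via_quoting = bytes.fromhex("0e0ba40e0bae0e0bbf0e0bb40e0bcd")  # SQU each time

decoded = [scsu_decode_subset(s)
           for s in (via_window_0, via_window_3, via_quoting)]
print(decoded[0] == decoded[1] == decoded[2])    # same code points
print(via_window_0 == via_window_3)              # but different bytes
```

A naive compressor might produce the third form and a better one the first;
both are valid, which is why equality has to be tested on the decoded code
points rather than on the compressed bytes.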


This archive was generated by hypermail 2.1.5 : Fri May 06 2005 - 16:11:12 CDT