Re: Data compression

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 06 2005 - 16:10:10 CDT

    From: "N. Ganesan" <naa.ganesan@gmail.com>
    To: "Unicode List" <unicode@unicode.org>
    Sent: Friday, May 06, 2005 7:47 PM
    Subject: Re: Data compression

    > Thanks for all the interesting and useful tech comments.
    >
    > Phillippe wrote:
    >>Tamil compresses very well for example with SCSU (with nearly one encoded
    >>byte per codepoint).
    >
    > I'm a mere structural dynamicist and collect, edit of classical Tamil
    > texts.
    >
    > Can you tell a little more on SCSU.

    SCSU is fully documented by Unicode itself, in a Technical Standard. See:

        UTS 6 "A Standard Compression Scheme for Unicode"
                 http://www.unicode.org/reports/tr6/

    It could be a valid UTF because it preserves all code points of the original
    string, without even altering its normalization form (so no code points are
    reordered, even if the original string is not in any normalized form), and
    also because it still allows encoding invalid code points.

    Like UTF-8, SCSU generates a sequence of 8-bit code units, but unlike UTF-8,
    most encoded texts will be stored with roughly 1 byte per code point (with a
    few additional special control bytes), provided that the text uses a single
    script and the script is not too large (so this will be true for all
    alphabets, abjads and abugidas); for Far-East Asian texts, or scripts with
    large syllabaries, the average will be around 2 bytes per code point
    (instead of 3 or sometimes 4 with UTF-8).
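    To make the "roughly 1 byte per code point" figure concrete, here is a
    minimal Python sketch. It is not a full UTS #6 compressor: the function
    name scsu_encode_single_window is hypothetical, and it assumes the text
    contains only ASCII plus characters from a single 128-code-point
    half-block below U+3400. It emits one SD0 tag to position dynamic
    window 0, then one byte per character.

```python
# Minimal sketch of a single-window SCSU encoder. NOT a full UTS #6
# implementation: it assumes ASCII plus ONE 128-code-point half-block.
SD0 = 0x18  # SCSU tag: define dynamic window 0 and make it the active window


def scsu_encode_single_window(text: str) -> bytes:
    """Encode text that fits ASCII plus one dynamic window (hypothetical helper)."""
    non_ascii = [ord(c) for c in text if ord(c) >= 0x80]
    base = None
    if non_ascii:
        base = min(non_ascii) & ~0x7F  # align window start to a multiple of 0x80
        # Offsets of the form index * 0x80 are only defined for index 0x01..0x67.
        if not 0x0080 <= base <= 0x3380 or max(non_ascii) >= base + 0x80:
            raise ValueError("text does not fit one simple dynamic window")

    out = bytearray()
    if base is not None:
        out += bytes([SD0, base >> 7])  # window index = offset / 0x80
    for ch in text:
        cp = ord(ch)
        if cp >= 0x80:
            out.append(0x80 + (cp - base))  # one byte inside the active window
        elif cp >= 0x20 or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)  # these pass through unchanged in single-byte mode
        else:
            raise ValueError("other control characters need quoting (not handled)")
    return bytes(out)


# The Tamil word "Tamil": U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD (5 code points).
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
print(len(scsu_encode_single_window(sample)))  # 7 bytes: 2 for the tag, then 5
print(len(sample.encode("utf-8")))             # 15 bytes: 3 per code point
```

    The two tag bytes are paid once per window switch, so on a longer
    single-script text the average approaches 1 byte per code point.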

    But, unlike the standard UTF-8, UTF-16 and UTF-32 encoding schemes (and also
    UTF-EBCDIC and CESU-8, which are not recommended but are still supported and
    documented by Unicode; and the "modified UTF-8" encoding used in Java and
    documented by Sun, which encodes surrogates individually, accepts any 16-bit
    code unit, and encodes NULL as 0xC0,0x80 instead of just 0x00),
    SCSU does NOT guarantee a unique encoding for the same sequence of code
    points: there are several alternatives, which allows SCSU compressors to be
    implemented either with simple algorithms or with more complex algorithms
    that reach a better compression level. The SCSU decompressor, however, is
    fully deterministic: a valid SCSU compressed stream decodes to exactly one
    sequence of code points.

    This means that you can't check the "equality" of two SCSU-encoded texts by
    comparing their bytes; you first have to decompress them to streams of code
    points. (You can safely check encoded strings for equality with UTF-8,
    UTF-16, UTF-32, UTF-EBCDIC, and CESU-8.)
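
    As an illustration, the Python sketch below decodes a small subset of
    SCSU single-byte mode (pass-through bytes, the active dynamic window,
    and the SQn, SQU and SDn tags only). The function name
    scsu_decode_subset is hypothetical, and the window tables are the
    initial values as I read them in UTS #6. It shows four different byte
    streams that all decode to the single code point U+00E9.

```python
# Sketch of an SCSU decoder for a SUBSET of single-byte mode only
# (no Unicode mode, no SCn/SDX/extended windows, BMP only).
SQ0, SQU, SD0 = 0x01, 0x0E, 0x18
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
INITIAL_DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def scsu_decode_subset(data: bytes) -> str:
    """Decode the subset described above (hypothetical helper)."""
    dynamic = list(INITIAL_DYNAMIC)
    active = 0  # dynamic window 0 is active at the start of a stream
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))  # passed through unchanged
        elif SQ0 <= b <= SQ0 + 7:
            # SQn: quote one character from static or dynamic window n.
            n, q = b - SQ0, data[i]
            i += 1
            out.append(chr(STATIC[n] + q if q < 0x80 else dynamic[n] + (q - 0x80)))
        elif b == SQU:
            # SQU: quote one 16-bit code unit, big-endian.
            out.append(chr((data[i] << 8) | data[i + 1]))
            i += 2
        elif SD0 <= b <= SD0 + 7:
            # SDn: redefine dynamic window n and make it active.
            index = data[i]
            i += 1
            if not 0x01 <= index <= 0x67:
                raise ValueError("window index outside the simple range")
            active = b - SD0
            dynamic[active] = index * 0x80
        else:
            raise ValueError(f"tag 0x{b:02X} is not handled by this sketch")
    return "".join(out)


streams = [
    b"\xE9",          # byte in the initial dynamic window 0 (offset 0x0080)
    b"\x0E\x00\xE9",  # SQU followed by the code unit 0x00E9
    b"\x02\x69",      # SQ1: static window 1 (offset 0x0080) + 0x69
    b"\x18\x01\xE9",  # SD0 redefining window 0 at 0x0080, then the same byte
]
decoded = [scsu_decode_subset(s) for s in streams]
print(all(d == "\u00e9" for d in decoded))  # True
```

    The streams differ byte for byte, so only the decoded code points can
    be compared.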


