RE: Compression through normalization

From: D. Starner (shalesller@writeme.com)
Date: Wed Nov 26 2003 - 16:10:26 EST


    > Use Base64 - it is stable through all normalisation forms.

    The problem with Base64 (and, worse yet, PUA characters for bytes) is that
    it's inefficient. Base64 carries 6 bits per 8 (75%) in UTF-8 and 6 bits per 16
    (37%) in UTF-16. You can get 15 bits per 16 (93%) in UTF-16 and 15 bits per 24
    (62%) in UTF-8 with the following scheme, and the only normalization that
    affects it is Hangul, which is at least algorithmic.
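
    The Hangul remark can be checked with a short Python sketch. It assumes
    Python's standard unicodedata module, and the two sample code points are
    picked only for illustration; they are not part of the scheme itself.

```python
import unicodedata

# Sample characters, chosen for illustration only.
hangul = "\uAC01"      # a precomposed Hangul syllable
ideograph = "\u4E00"   # a CJK unified ideograph

# NFD splits the syllable into conjoining jamo; NFC puts it back together.
decomposed = unicodedata.normalize("NFD", hangul)
recomposed = unicodedata.normalize("NFC", decomposed)

# The ideograph has no decomposition, so every form leaves it unchanged.
ideograph_forms = {
    form: unicodedata.normalize(form, ideograph)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

print([hex(ord(c)) for c in decomposed])
print(recomposed == hangul)
```

    So a decoder that sees decomposed jamo can get the original syllables
    back by applying NFC before decoding.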

    You could remove the normalization and increase compression in UTF-8
    (at a cost in the SCSU case) by using low characters that don't decompose
    or compose, but then you have to carry long lists of usable characters.
    (The numbers get tricky, so I haven't run them.)
    You could remove the normalization cost by using Plane 2 instead, but every
    character on Plane 2 takes more space (4 bytes in both UTF-8 and UTF-16).
    (Assuming your binary data is uniformly distributed, the numbers are
    easy, except for SCSU. I think astral windows in SCSU have little
    effect when used on data like this.)

              Base64 / CJK15 / CJ15
    UTF-8      75%   /  62%  / 59%
    UTF-16     37%   /  93%  / 78%
    SCSU       75%   /  93%  / 78%(?)
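
    The UTF-8 and UTF-16 rows can be reproduced with a few lines of Python.
    This is only a sketch under the uniform-data assumption stated above: it
    ignores Base64 padding, skips SCSU, and the helper names are made up for
    illustration.

```python
# Number of 15-bit values that CJ15 keeps in the BMP (a < 6800h) versus
# the number it sends to Plane 2 (6800h <= a < 8000h).
TOTAL_CHUNKS = 0x8000
CJ15_BMP_CHUNKS = 0x6800
CJ15_PLANE2_CHUNKS = TOTAL_CHUNKS - CJ15_BMP_CHUNKS

def efficiency(payload_bits, encoded_bits):
    """Percentage of the encoded bits that carry payload."""
    return 100.0 * payload_bits / encoded_bits

def cj15_average_bits(bmp_bits, plane2_bits):
    """Average encoded size of one CJ15 character for uniform input."""
    total = CJ15_BMP_CHUNKS * bmp_bits + CJ15_PLANE2_CHUNKS * plane2_bits
    return total / TOTAL_CHUNKS

table = {
    "Base64": {"UTF-8": efficiency(6, 8), "UTF-16": efficiency(6, 16)},
    "CJK15": {"UTF-8": efficiency(15, 24), "UTF-16": efficiency(15, 16)},
    "CJ15": {
        "UTF-8": efficiency(15, cj15_average_bits(24, 32)),
        "UTF-16": efficiency(15, cj15_average_bits(16, 32)),
    },
}

for scheme, row in table.items():
    print(scheme, {enc: round(pct, 2) for enc, pct in row.items()})
```

    The computed values agree with the table to within a percentage point;
    the table appears to drop the fractional part in most cells.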

    CJK15:
    Break the byte stream into 15-bit chunks. Let a be a 15-bit chunk and U
    be the resulting Unicode character. Then
    if a < 1800h then U = a + 3400h
    else if a < 6800h then U = a - 1800h + 4E00h /* a + 3600h */
    else U = a - 6800h + AC00h /* a + 4400h */
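
    A minimal Python sketch of the CJK15 mapping and its inverse follows.
    The three ranges come straight from the rules above. Everything else is
    an assumption made for illustration: the function names, the big-endian
    bit order, zero-padding of the final chunk, and passing the original
    byte length to the decoder out of band.

```python
def chunk_to_char(a):
    """Map one 15-bit value to a character using the three CJK15 ranges."""
    if not 0 <= a < 0x8000:
        raise ValueError("chunk must fit in 15 bits")
    if a < 0x1800:
        return chr(a + 0x3400)            # U+3400..U+4BFF
    if a < 0x6800:
        return chr(a - 0x1800 + 0x4E00)   # U+4E00..U+9DFF
    return chr(a - 0x6800 + 0xAC00)       # U+AC00..U+C3FF

def char_to_chunk(ch):
    """Inverse of chunk_to_char."""
    u = ord(ch)
    if 0x3400 <= u < 0x4C00:
        return u - 0x3400
    if 0x4E00 <= u < 0x9E00:
        return u - 0x4E00 + 0x1800
    if 0xAC00 <= u < 0xC400:
        return u - 0xAC00 + 0x6800
    raise ValueError("character outside the CJK15 ranges")

def encode(data):
    """Encode bytes as CJK15 text (assumption: zero-pad the last chunk)."""
    nbits = 8 * len(data)
    pad = -nbits % 15
    bits = int.from_bytes(data, "big") << pad
    nchunks = (nbits + pad) // 15
    return "".join(
        chunk_to_char((bits >> (15 * i)) & 0x7FFF)
        for i in reversed(range(nchunks))
    )

def decode(text, length):
    """Decode CJK15 text back to `length` bytes (length supplied by caller)."""
    pad = 15 * len(text) - 8 * length
    if not 0 <= pad < 15:
        raise ValueError("length does not match the encoded text")
    bits = 0
    for ch in text:
        bits = (bits << 15) | char_to_chunk(ch)
    return (bits >> pad).to_bytes(length, "big")

sample = bytes(range(256))
assert decode(encode(sample), len(sample)) == sample
```

    If the text may have passed through NFD or NFKD, the Hangul syllables
    need to be recomposed (for example with NFC) before decoding.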

    CJ15:
    Same as CJK15, but replace the last else with
    else U = a - 6800h + 20000h /* a + 19800h */
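
    The CJ15 variant can be sketched the same way in Python. The function
    names are hypothetical, and the normalization check relies on Python's
    standard unicodedata module; only the three ranges are taken from the
    rules above.

```python
import unicodedata

def cj15_chunk_to_char(a):
    """Map one 15-bit value to a character using the CJ15 ranges."""
    if not 0 <= a < 0x8000:
        raise ValueError("chunk must fit in 15 bits")
    if a < 0x1800:
        return chr(a + 0x3400)             # U+3400..U+4BFF
    if a < 0x6800:
        return chr(a - 0x1800 + 0x4E00)    # U+4E00..U+9DFF
    return chr(a - 0x6800 + 0x20000)       # U+20000..U+217FF (Plane 2)

def cj15_char_to_chunk(ch):
    """Inverse of cj15_chunk_to_char."""
    u = ord(ch)
    if 0x3400 <= u < 0x4C00:
        return u - 0x3400
    if 0x4E00 <= u < 0x9E00:
        return u - 0x4E00 + 0x1800
    if 0x20000 <= u < 0x21800:
        return u - 0x20000 + 0x6800
    raise ValueError("character outside the CJ15 ranges")

# The highest chunk lands on Plane 2, which costs 4 bytes in both encodings.
top = cj15_chunk_to_char(0x7FFF)
utf8_len = len(top.encode("utf-8"))
utf16_len = len(top.encode("utf-16-be"))

# No normalization form changes the Plane 2 character.
stable = all(
    unicodedata.normalize(form, top) == top
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(hex(ord(top)), utf8_len, utf16_len, stable)
```

    This shows the trade described above: the Hangul normalization issue
    goes away, but 1800h of the 8000h possible chunks grow to 4 bytes.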



    This archive was generated by hypermail 2.1.5 : Wed Nov 26 2003 - 17:07:08 EST