Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Dec 03 2003 - 06:02:57 EST


    >----- Original Message -----
    >From: "Frank Yung-Fong Tang" <ytang0648@aol.com>
    >
    >
    > > > >> UTF-16 6,634,430 bytes
    > > > >> UTF-8 7,637,601 bytes
    > > > >> SCSU 6,414,319 bytes
    > > > >> BOCU-1 5,897,258 bytes
    > > > >> Legacy encoding (*) 5,477,432 bytes
    > > > >> (*) KS C 5601, KS X 1001, or EUC-KR
    >
    >What is the size after gzip for each of these? Just wondering:
    >gzip of UTF-16
    >gzip of UTF-8
    >gzip of SCSU
    >gzip of BOCU-1
    >gzip of Legacy encoding

    Based on the principles that underlie gzip compression, and on the fact
    that the UTF-8 encoding uses many three-byte sequences where UTF-16 /
    SCSU / BOCU-1 / the legacy encoding use two-byte sequences for the same
    characters, I expect that the *relative* sizes of the gzipped results
    will (within ignorable fluctuation) approximately track the relative
    sizes of the unzipped versions, with perhaps an extra penalty for UTF-8
    because its 24-bit sequences interact worse with the gzip architecture
    than the 16-bit ones. But that's speculation.
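
    One could check this speculation with a small script along the following
    lines (not from the original post; the input file name is just a
    placeholder, and SCSU / BOCU-1 are left out because Python's standard
    library has no codecs for them):

        # Sketch: compare raw vs. gzipped sizes of the same Korean text in
        # several encoding forms. 'korean_sample.txt' is a placeholder name.
        import gzip

        with open('korean_sample.txt', encoding='utf-8') as f:
            text = f.read()

        for name in ('utf-16-le', 'utf-8', 'euc_kr'):
            raw = text.encode(name)          # re-encode the same text
            packed = gzip.compress(raw)      # DEFLATE, as used by gzip
            print(f'{name:10}  raw={len(raw):>9,}  gzip={len(packed):>9,}')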

    From the work of Atkins et al., as reported by Doug Ewell, I would
    further expect that Burrows-Wheeler (BW) type compression would give
    practically indistinguishable results for all five cases, as BW has
    been shown to be largely insensitive to the encoding form, unlike
    Huffman or gzip, which work best with true 8-bit symbols.
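
    The same check for the BW case could use bzip2, which is built on the
    Burrows-Wheeler transform (again only a sketch with a placeholder file
    name; if BW really is insensitive to the encoding form, the compressed
    sizes should come out much closer together than with gzip):

        # Sketch: bzip2 is Burrows-Wheeler based, so the compressed sizes
        # should differ far less across encoding forms than with gzip.
        import bz2

        with open('korean_sample.txt', encoding='utf-8') as f:
            text = f.read()

        for name in ('utf-16-le', 'utf-8', 'euc_kr'):
            packed = bz2.compress(text.encode(name))
            print(f'{name:10}  bz2={len(packed):>9,}')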

    A./


