Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Dec 03 2003 - 06:02:57 EST


    >----- Original Message -----
    >From: "Frank Yung-Fong Tang" <ytang0648@aol.com>
    >
    >
    > > > >> UTF-16 6,634,430 bytes
    > > > >> UTF-8 7,637,601 bytes
    > > > >> SCSU 6,414,319 bytes
    > > > >> BOCU-1 5,897,258 bytes
    > > > >> Legacy encoding (*) 5,477,432 bytes
    > > > >> (*) KS C 5601, KS X 1001, or EUC-KR
    >
    >What is the size after gzip for each of these? Just wondering:
    >gzip of UTF-16
    >gzip of UTF-8
    >gzip of SCSU
    >gzip of BOCU-1
    >gzip of Legacy encoding

    Based on the principles that underlie gzip compression, and on the fact
    that the UTF-8 encoding uses many three-byte sequences where UTF-16 /
    SCSU / BOCU-1 / the legacy encoding use two-byte sequences for the same
    characters, I expect that the *relative* sizes of the gzipped results
    will (within ignorable fluctuation) approximately track the relative
    sizes of the unzipped versions, with perhaps an extra penalty for UTF-8
    because its 24-bit sequences interact worse with the gzip architecture
    than the 16-bit ones. But that's speculation.
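
    One could check this speculation with a small script along the following
    lines (not from the original post; the input file name is just a
    placeholder, and SCSU / BOCU-1 are left out because Python's standard
    library has no codecs for them):

        # Sketch: compare raw vs. gzipped sizes of the same Korean text in
        # several encoding forms. 'korean_sample.txt' is a placeholder name.
        import gzip

        with open('korean_sample.txt', encoding='utf-8') as f:
            text = f.read()

        for name in ('utf-16-le', 'utf-8', 'euc_kr'):
            raw = text.encode(name)          # re-encode the same text
            packed = gzip.compress(raw)      # DEFLATE, as used by gzip
            print(f'{name:10}  raw={len(raw):>9,}  gzip={len(packed):>9,}')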

    From the work of Atkins et al., as reported by Doug Ewell, I would
    further expect that Burrows-Wheeler (BW) type compression would give
    practically indistinguishable results for all five cases, as BW has
    been shown to be largely insensitive to the encoding form, unlike
    Huffman or gzip, which work best with true 8-bit symbols.
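
    The same check for the BW case could use bzip2, which is built on the
    Burrows-Wheeler transform (again only a sketch with a placeholder file
    name; if BW really is insensitive to the encoding form, the compressed
    sizes should come out much closer together than with gzip):

        # Sketch: bzip2 is Burrows-Wheeler based, so the compressed sizes
        # should differ far less across encoding forms than with gzip.
        import bz2

        with open('korean_sample.txt', encoding='utf-8') as f:
            text = f.read()

        for name in ('utf-16-le', 'utf-8', 'euc_kr'):
            packed = bz2.compress(text.encode(name))
            print(f'{name:10}  bz2={len(packed):>9,}')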

    A./


