From: Doug Ewell (dewell@adelphia.net)
Date: Sat Nov 29 2003 - 17:17:10 EST
Someone, I forgot who, questioned whether converting Unicode text to NFC
would actually improve its compressibility, and asked if any actual data
was available.
Certainly there is no guarantee that normalization would *always* result
in a smaller file. A compressor that took advantage of normalization
would have to determine whether there would be any benefit.
One extremely simple example would be text that consisted mostly of
Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from
the Letterlike Symbols block. By converting this character to its
canonical equivalent U+00C5 (a short sketch after the list illustrates
the UTF-8 case):
* UTF-8 would use 2 bytes instead of 3.
* SCSU would use 1 byte instead of 2.
* BOCU-1 would use 1 or 2 bytes instead of always using 2.
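A minimal Python sketch, using the standard unicodedata module, confirms
the NFC mapping of U+212B and the UTF-8 byte counts above (the variable
names are just illustrative):

    import unicodedata

    angstrom = "\u212B"                           # ANGSTROM SIGN
    nfc = unicodedata.normalize("NFC", angstrom)  # canonical equivalent
    print(hex(ord(nfc)))                          # 0xc5
    print(len(angstrom.encode("utf-8")))          # 3 bytes before
    print(len(nfc.encode("utf-8")))               # 2 bytes after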
A longer and more realistic case can be seen in the sample Korean file
at:
http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
This file is in EUC-KR, but can easily be converted to Unicode using
recode, SC UniPad, or another converter. It consists of 3,317,215
Unicode characters, over 96% of which are Hangul syllables, Basic Latin
spaces, full stops, and CRLFs. When broken down into jamos (i.e.,
converting from NFC to NFD), the character count increases to 6,468,728.
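The count nearly doubles because each precomposed syllable decomposes
into two or three conjoining jamos. A minimal sketch, assuming Python's
unicodedata module (the example string is just illustrative):

    import unicodedata

    syllables = "\uC544\uB9AC\uB791"                # "Arirang", 3 syllables
    jamos = unicodedata.normalize("NFD", syllables)  # conjoining jamos
    print(len(syllables))                            # 3 characters in NFC
    print(len(jamos))                                # 7 characters in NFD

Here three syllables become seven jamos; for the file as a whole, the
ratio works out to about 1.95.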
The entropy of the syllables file is 6.729 bits per character, yielding
a "Huffman bit count" of 22.3 million bits. That's the theoretical
minimum number of bits that could be used to encode this file, character
by character, assuming a Huffman or arithmetic coding scheme designed to
handle 16- or 32-bit Unicode characters. (Many general-purpose
compression algorithms can do better.) The entropy of the jamos file is
lower, 4.925 bits per character, but because that file contains nearly
twice as many characters, its Huffman bit count is 31.8 million bits,
almost 43% larger.
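A sketch of that calculation in Python (the function name is just
illustrative; the "Huffman bit count" is simply the per-character
entropy times the character count):

    import math
    from collections import Counter

    def huffman_bit_count(text):
        """Per-character Shannon entropy, and entropy * length in bits."""
        counts = Counter(text)
        n = len(text)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        return entropy, entropy * n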
When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
than the jamos file by 55%, 17%, and 32% respectively.
General-purpose algorithms tend to reduce the difference, but PKZip
(using deflate) compresses the syllables file to an output 9% smaller
than that of the jamos file. Using bzip2, the compressed syllables file
is 2% smaller.
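A rough equivalent of the general-purpose tests can be run with Python's
zlib and bz2 modules (zlib's deflate output is not byte-for-byte what
PKZip produces, but it is the same algorithm). The sketch below assumes
a local copy of the sample file, read as EUC-KR, and compares the NFC
and NFD forms encoded as UTF-8:

    import bz2, zlib, unicodedata

    with open("arirang.txt", encoding="euc_kr") as f:  # local copy
        text = f.read()

    for form in ("NFC", "NFD"):
        data = unicodedata.normalize(form, text).encode("utf-8")
        print(form, len(data),
              len(zlib.compress(data, 9)),   # deflate, as in PKZip
              len(bz2.compress(data)))       # bzip2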
So we can at least say that Korean, which can be normalized from NFD to
NFC algorithmically and without the use of long tables of equivalents or
exclusions, can consistently be compressed to a smaller size after such
normalization than before.
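The arithmetic in question is the Hangul composition mapping given in
the Unicode Standard; here is a minimal sketch of it (the function name
and the example jamos are just illustrative):

    def compose_hangul(l, v, t=0x11A7):
        """Map leading consonant l, vowel v, and optional trailing
        consonant t (as code points) to the precomposed syllable,
        by pure arithmetic with no lookup tables."""
        SBase, LBase, VBase, TBase = 0xAC00, 0x1100, 0x1161, 0x11A7
        VCount, TCount = 21, 28
        return chr(SBase + ((l - LBase) * VCount + (v - VBase)) * TCount
                   + (t - TBase))

    # U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C (syllable HAN)
    print(hex(ord(compose_hangul(0x1112, 0x1161, 0x11AB))))  # 0xd55c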
Whether a "silent" normalization to NFC can be a legitimate part of
Unicode compression remains in question. I notice the list is still
split as to whether this process "changes" the text (because checksums
will differ) or not (because C10 says processes must consider the text
to be equivalent).
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/