[Unicode]  Frequently Asked Questions Home | Site Map | Search

Compression

 

Q: I need to compress Unicode data. Is there anything special to consider?

A: Unicode has defined a Standard Compression Scheme for Unicode (SCSU). [AF]

Q: Why not use UTF-8 as compressed format?

A: UTF-8 represents only the ASCII characters in less space than needed in UTF-16, for some characters (other European and Middle Eastern) it requires the same space as UTF-16; and for all other characters (South, Southeast, and East Asian) it requires more space. [AF]

Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?

A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that's desired.

To get the same effect with one of the popular general purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues see the book Text Compression by Bell, Cleary and Witten (Prentice Hall 1990).

One of the key design points of SCSU was that it should work with small strings. Starting a new LZW for each string is not efficient, it is probably wasteful. SCSU usually does not need more than one or two bytes overhead, and often 0 bytes to start up.

 Another design point of SCSU is that it is editable (you can replace a piece in the middle, w/o having to change the stuff at the beginning or the end.)

Finally, it was not so much the smallest strings the SCSU designers wanted, but to get pretty much all types of Unicode encoded legacy data to be as compact as in the equivalent legacy encoding.

Whether or not these capabilities are important to your overall design, is a different matter, but as long as they are, SCSU is superior to generic algorithms. [AF]

Q: What about longer texts?

A: The best way to compress large unicode encoded text is via a Burrows-Wheeler type compression, which gives near identical results no matter which encoding form was used. Even using SCSU first does not significantly improve the overall compression, underscoring the encoding form insensitivity of the algorithm.

For a description of the Burrows-Wheeler compression algorithm (which is, for example, used in bzip2), see Data Compression by David Salomon (Springer 2000).

Other general purpose compression algorithms, such as Huffman, or LZ77 (found, for example, in GNU gzip) are sensitive to the encoding forms used, and can benefit from using SCSU first.

Whether to use bzip2 or SCSU + gzip depends on availability plus consideration of running time vs. final compression achieved. It may be necessary to conduct a comparison with actual data to find the true optimum. [AF]