Unicode Frequently Asked Questions

Compression

Q: I need to compress Unicode data. Is there anything special to consider?

A: Unicode has defined a Standard Compression Scheme for Unicode (SCSU). While its use is not required, it has some key characteristics that you should know about, because they make it attractive for certain tasks. [AF]

Q: Why not use UTF-8 as a compressed format?

A: UTF-8 represents only the ASCII characters in less space than UTF-16 needs; for some characters (other European and Middle Eastern scripts) it requires the same space as UTF-16, and for all other characters (South, Southeast, and East Asian scripts) it requires more space. (Neither UTF-8 nor UTF-16 ever requires more space than UTF-32.) [AF]
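
The effect is easy to check directly. A minimal Python sketch (the sample strings are illustrative choices of mine, not from the FAQ):

    samples = {
        "ASCII": "compression",
        "Greek": "συμπίεση",
        "CJK":   "圧縮アルゴリズム",
    }

    for name, text in samples.items():
        u8  = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-be"))
        u32 = len(text.encode("utf-32-be"))
        print(f"{name:5}  {len(text)} chars  "
              f"UTF-8: {u8} bytes  UTF-16: {u16} bytes  UTF-32: {u32} bytes")

For the ASCII sample, UTF-8 needs half the space of UTF-16; for the Greek sample the two are equal; for the CJK sample UTF-8 needs more.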

Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?

A: SCSU bridges the gap between an 8-bit-based LZW and 16-bit-encoded Unicode text by removing the extra redundancy that is an artifact of the encoding (sequences in which every other byte is the same), rather than redundancy in the content itself. The output of SCSU should be sent to LZW for block compression where that is desired.
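
That encoding-level redundancy is easy to see. A minimal Python illustration (the sample string is mine):

    text = "Latin text encoded as UTF-16"
    data = text.encode("utf-16-be")

    print(data[:12])  # b'\x00L\x00a\x00t\x00i\x00n\x00 '
    print(f"{data.count(0)} of {len(data)} bytes are 0x00")  # exactly half

For text in the Latin-1 range, every other byte of the UTF-16 form is zero; this is the redundancy SCSU removes before a byte-oriented compressor ever sees the data.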

To get the same effect with one of the popular general-purpose algorithms, such as Huffman coding or any of the variants of Lempel-Ziv compression, the algorithm would have to be retargeted to 16-bit units, losing effectiveness due to the larger alphabet size. It is relatively easy to work out the math for the Huffman case and show how many extra bits the compressed text would need just because the alphabet is larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues, see the book Text Compression by Bell, Cleary, and Witten (Prentice Hall, 1990).
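
Here is a rough sketch of that math in Python. The payload figure is the zero-order entropy of the symbol stream; the table figure uses a deliberately crude model of my own (a fixed five bits of code-length information per possible symbol), so only its scaling with alphabet size matters:

    from collections import Counter
    import math

    # Zero-order entropy of the sequence, in total bits: a lower bound
    # on the coded payload, ignoring the cost of sending the code itself.
    def payload_bits(symbols):
        counts = Counter(symbols)
        n = len(symbols)
        return -sum(c * math.log2(c / n) for c in counts.values())

    text = "Sixteen-bit Huffman needs a much larger code table. " * 50
    data = text.encode("utf-16-be")

    models = {
        "8-bit":  (list(data), 256),
        "16-bit": ([data[i:i + 2] for i in range(0, len(data), 2)], 65536),
    }

    for name, (syms, alphabet) in models.items():
        payload = payload_bits(syms)
        table = 5 * alphabet  # crude assumption: 5 bits per possible symbol
        print(f"{name} model: {len(set(syms))} distinct symbols, "
              f"payload ~ {payload / 8:.0f} B, table ~ {table // 8} B")

Under this model the 16-bit coder's payload is actually slightly smaller, but describing a code over a 65,536-symbol alphabet costs on the order of 40 KB, which swamps the payload for anything short of very large texts.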

Q: What are the design points for SCSU?

A: One of the key design points of SCSU was that it should work well with small strings. Restarting a general-purpose compressor such as LZW for each small string is inefficient, because the compressor incurs its start-up cost before it has had a chance to adapt to the data. SCSU usually needs no more than one or two bytes of overhead to start up, and often none.
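
The start-up cost of a general-purpose compressor is easy to observe. A small Python sketch using zlib (DEFLATE, an LZ77 variant, standing in for LZW, which the Python standard library does not include; the sample strings are mine):

    import zlib

    for s in ("Zürich", "Ça va?", "Москва"):
        raw = s.encode("utf-16-be")
        packed = zlib.compress(raw, 9)
        print(f"{s!r}: {len(raw)} bytes in UTF-16, {len(packed)} bytes after zlib")

For strings this short, the zlib output is larger than its input. SCSU, by contrast, encodes each of these at roughly one byte per character, plus at most a byte or two of window-switching overhead.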

Another design point of SCSU is that it is editable: you can replace a piece in the middle without having to change anything at the beginning or the end.

Finally, the SCSU designers' aim was not so much the smallest possible strings, but rather that essentially all types of legacy data, once converted to Unicode, remain as compact as they were in the equivalent legacy encoding.

Whether or not these capabilities are important to your overall design is a different matter, but where they are, SCSU is superior to generic algorithms. [AF]

Q: What about compressing longer texts?

A: The best way to compress large Unicode-encoded texts is with a Burrows-Wheeler type compression, which gives near-identical results no matter which encoding form is used. Even applying SCSU first does not significantly improve the overall compression, underscoring the algorithm's insensitivity to encoding form.

For a description of the Burrows-Wheeler compression algorithm (which is, for example, used in bzip2), see Data Compression by David Salomon (Springer 2000).

Other general-purpose compression algorithms, such as Huffman coding or LZ77 (found, for example, in GNU gzip), are sensitive to the encoding form used and can benefit from applying SCSU first.

Whether to use bzip2 or SCSU plus gzip depends on availability, as well as on the trade-off between running time and the final compression achieved. It may be necessary to run a comparison with actual data to find the true optimum. [AF]
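
Such a comparison is straightforward to script. A sketch in Python using the standard-library bz2 (Burrows-Wheeler based, as in bzip2) and zlib (LZ77 based, as in gzip); Python has no built-in SCSU codec, so the SCSU+gzip leg would need an external converter such as ICU's. Here sample.txt is a placeholder for your own data:

    import bz2
    import zlib
    from pathlib import Path

    text = Path("sample.txt").read_text(encoding="utf-8")

    for form in ("utf-8", "utf-16-be"):
        data = text.encode(form)
        print(f"{form}: raw={len(data)}  "
              f"zlib={len(zlib.compress(data, 9))}  "
              f"bz2={len(bz2.compress(data, 9))}")

On typical large texts the bz2 column should come out nearly the same for both encoding forms, while the zlib column varies more, matching the behavior described above.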

Q: Are there disadvantages to using SCSU?

A: Unlike strings in some other schemes, strings compressed with SCSU cannot be compared byte-for-byte to determine equality of contents. That is because the encoder has a choice of compression strategies, and different encoders may make different choices for the same string. While you could compare strings for equality if they were compressed by the same encoder, the comparison order for strings of different contents will not, in the general case, match the binary comparison order of the original strings. [AF]
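
The Python standard library has no SCSU codec, but the underlying issue is easy to demonstrate with any compressor that leaves the encoder a choice of strategies; here zlib's compression levels stand in for SCSU's encoder choices (an analogy of mine, not SCSU itself):

    import zlib

    data = ("the same string " * 20).encode("utf-16-be")

    a = zlib.compress(data, 1)  # one encoder's strategy choice
    b = zlib.compress(data, 9)  # another encoder's strategy choice

    print(a == b)                                    # False: bytes differ
    print(zlib.decompress(a) == zlib.decompress(b))  # True: contents identical

As with SCSU, equality of contents can only be established reliably after decompression.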

Q: Are there security concerns with SCSU?

A: Because identical strings can have different compressed representations, filtering compressed strings for insecure content can fail; such filtering must be applied to the decompressed text. [AF]