Unicode Text Compression
Intended Audience: Software Engineers
Session Level: Intermediate
Since the introduction of Unicode, the world of text processing has been
continually wrestling with the notion of a single unifying character set.
In the past it was common to find text files that spanned the full ranges
of many small character sets. Text processing, however, is evolving toward
a world in which small ranges of a single large character set are used.
Using such subsets of a large character set naturally invites compression;
we observe that compression is just another form of encoding textual data.
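To make that observation concrete, here is a minimal sketch (in Python,
chosen purely for illustration; the Greek sample text is an assumption, not
an example from the paper) that estimates the entropy of a text drawn from
a small subset of Unicode:

    import math
    from collections import Counter

    def entropy_bits_per_char(text: str) -> float:
        """Shannon entropy of the character distribution, in bits/char."""
        counts = Counter(text)
        total = len(text)
        return -sum((n / total) * math.log2(n / total)
                    for n in counts.values())

    # A text drawn from a small subset (here, Greek letters) of the
    # large Unicode code space.  Hypothetical sample data.
    sample = "\u03b1\u03b2\u03b3" * 100 + "\u03b4\u03b5" * 50

    print(f"distinct characters: {len(set(sample))}")
    print(f"entropy: {entropy_bits_per_char(sample):.2f} bits/char")
    # A fixed-width encoding of the whole BMP spends 16 bits/char;
    # a five-letter alphabet needs only about 2.25 of them.

At roughly two bits per character for this five-letter alphabet, most of
the 16-bit width of a UTF-16 code unit is redundancy that any reasonable
encoder can remove.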
The Unicode standard provides several algorithms, techniques, and
strategies for assigning, transmitting, and compressing Unicode
characters, such as the Standard Compression Scheme for Unicode (SCSU).
These techniques allow Unicode data to be represented concisely in a
variety of contexts. We study the compression and representation of
Unicode data in a unifying framework.
In this paper we examine several general-purpose techniques for compressing
Unicode data: gzip, bzip, pkzip, and Huffman coding. We find that these
algorithms do not perform equally well on the various Unicode
transformation formats (e.g., UTF-16BE, UTF-16LE, and UTF-8).
Nevertheless, the intrinsic information content of a Unicode file is the
same regardless of its encoding. We show that simple transliteration and
transcoding techniques achieve compression results that compare favorably
to algorithms designed for specific Unicode encodings, and we argue that
specialized, complex algorithms for compressing Unicode data are therefore
unnecessary.
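As a rough illustration of the transcoding argument, the following sketch
uses Python's standard gzip and bz2 modules as stand-ins for the gzip and
bzip tools named above (the mixed Cyrillic/Latin sample text is an
assumption, not data from the paper):

    import bz2
    import gzip

    # Hypothetical sample text; any Unicode text would do.
    text = ("\u0423\u043d\u0438\u043a\u043e\u0434 " * 200) + ("Unicode " * 200)

    for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
        raw = text.encode(encoding)          # transcode to this format
        gz = gzip.compress(raw)              # general-purpose compressors
        bz = bz2.compress(raw)
        print(f"{encoding:10s} raw={len(raw):6d} "
              f"gzip={len(gz):6d} bz2={len(bz):6d}")

The raw sizes differ by up to a factor of two across the encodings, while
the compressed sizes tend to land much closer together: the transformation
format changes the representation, not the information content, so
transcoding to a compact form before applying a general-purpose compressor
recovers most of what an encoding-specific scheme would.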