Unicode Text Compression
Intended Audience: Software Engineers
Session Level: Intermediate
Since the introduction of Unicode, the world of text processing has been
continually wrestling with the notion of a single unifying character set.
In the past it was common to find text files that spanned the full ranges
of many small character sets. Text processing, however, is evolving toward
a world in which small ranges of a single large character set are used.
Using such subsets of a large character set naturally invites compression;
we observe that compression is just another form of encoding textual data.
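To make that observation concrete, here is a minimal sketch (in Python,
chosen purely for illustration; the Greek sample text is an assumption, not
an example from the paper) that estimates the entropy of a text drawn from
a small subset of Unicode:

    import math
    from collections import Counter

    def entropy_bits_per_char(text: str) -> float:
        """Shannon entropy of the character distribution, in bits/char."""
        counts = Counter(text)
        total = len(text)
        return -sum((n / total) * math.log2(n / total)
                    for n in counts.values())

    # A text drawn from a small subset (here, Greek letters) of the
    # large Unicode code space.  Hypothetical sample data.
    sample = "\u03b1\u03b2\u03b3" * 100 + "\u03b4\u03b5" * 50

    print(f"distinct characters: {len(set(sample))}")
    print(f"entropy: {entropy_bits_per_char(sample):.2f} bits/char")
    # A fixed-width encoding of the whole BMP spends 16 bits/char;
    # a five-letter alphabet needs only about 2.25 of them.

At roughly two bits per character for this five-letter alphabet, most of
the 16-bit width of a UTF-16 code unit is redundancy that any reasonable
encoder can remove.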
The Unicode standard provides several algorithms, techniques, and
strategies for assigning, transmitting, and compressing Unicode
characters, such as the Standard Compression Scheme for Unicode (SCSU).
These techniques allow Unicode data to be represented concisely in a
variety of contexts. We study the compression and representation of
Unicode data in a unifying framework.
In this paper we examine several general-purpose techniques for compressing
Unicode data: gzip, bzip, pkzip, and Huffman coding. We find that these
algorithms do not perform equally well on the various Unicode
transformation formats (e.g., UTF-16BE, UTF-16LE, and UTF-8).
Nevertheless, the intrinsic information content of a Unicode file is the
same regardless of its encoding. We show that simple transliteration and
transcoding techniques achieve compression results that compare favorably
to algorithms designed for specific Unicode encodings, and we argue that
specialized, complex algorithms for compressing Unicode data are therefore
unnecessary.
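As a rough illustration of the transcoding argument, the following sketch
uses Python's standard gzip and bz2 modules as stand-ins for the gzip and
bzip tools named above (the mixed Cyrillic/Latin sample text is an
assumption, not data from the paper):

    import bz2
    import gzip

    # Hypothetical sample text; any Unicode text would do.
    text = ("\u0423\u043d\u0438\u043a\u043e\u0434 " * 200) + ("Unicode " * 200)

    for encoding in ("utf-8", "utf-16-le", "utf-16-be"):
        raw = text.encode(encoding)          # transcode to this format
        gz = gzip.compress(raw)              # general-purpose compressors
        bz = bz2.compress(raw)
        print(f"{encoding:10s} raw={len(raw):6d} "
              f"gzip={len(gz):6d} bz2={len(bz):6d}")

The raw sizes differ by up to a factor of two across the encodings, while
the compressed sizes tend to land much closer together: the transformation
format changes the representation, not the information content, so
transcoding to a compact form before applying a general-purpose compressor
recovers most of what an encoding-specific scheme would.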