From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 31 2003 - 12:36:23 EST
I'm pleased to announce the release of my new paper, "A survey of
Unicode compression":
http://users.adelphia.net/~dewell/compression.html
This 21-page paper is a moderately technical discussion of the various
ways in which Unicode text can be compressed for storage and
interchange. Several different approaches are examined and evaluated.
Specific topics include:
* UTF-16, UTF-8, and 8-bit legacy character sets
* the Unicode "compression formats," SCSU and BOCU-1
* general-purpose compression algorithms (RLE, Huffman, LZW)
* using multiple compression techniques together
* using canonical equivalence to improve compression
* a detailed description of a SCSU encoder
Although it assumes a basic understanding of Unicode, the paper explains
certain terms related to Unicode and information theory. No complicated
mathematical theory is included. The paper is intended for anyone
interested in the details of Unicode compression, not just programmers,
although the sample SCSU encoder will probably be of interest only to
programmers.
The paper is available in HTML format directly from the URL given above,
or can be downloaded in either Adobe PDF or Microsoft Word format (zipped
or unzipped).
Enjoy,
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/