From: Doug Ewell (dewell@adelphia.net)
Date: Sun May 06 2007 - 20:28:53 CDT
I just discovered UTN #31, "Fast Compression Algorithm for Unicode 
Text," which describes a simplified Lempel-Ziv-type approach that is 
applied directly to "16 bit Unicode characters," unlike most LZ and 
other compression algorithms, which operate on bytes.
Although "16 bit" still isn't quite right -- there's more to Unicode 
than the BMP -- the approach of compressing Unicode code points directly 
was especially interesting to me, because I've wondered for a long time 
about the statement in the Unicode FAQ on this very topic:
"To get the same effect [as SCSU] with one of the popular general 
purpose algorithms, like Huffman or any of the variants of Lempel-Ziv 
compression, it would have to be retargeted to 16-bit, losing 
effectiveness due to the larger alphabet size.  It's relatively easy to 
work out the math for the Huffman case to show how many extra bits the 
compressed text would need just because the alphabet was larger. 
Similar effects exist for LZW."
This puzzles me because in Huffman coding, the symbols (characters) that 
*don't* appear in the message to be encoded don't appear in the Huffman 
tree, and should have no impact on the encoded message.  A message of, 
say, 50 distinct characters, whether drawn from the 128-character ASCII 
range or the million-character Unicode range, should have the same 
entropy.  The representation of the character in the Huffman tree might 
be larger if the "alphabet" is larger, but this cost should be borne 
only *once*, in the tree, not every time the character is encoded.
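To make this concrete, here is a small Python sketch (not my actual 
encoder, and the two messages are just made up for illustration) that 
computes Huffman code lengths for whatever symbols occur in a message. 
The encoded bit count comes out the same whether the symbols are ASCII 
letters or code points from much further up the Unicode range, because 
the unused part of the alphabet never enters the tree:

import heapq
from collections import Counter

# Illustrative sketch only, not my actual encoder.
def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for the symbols that occur."""
    freq = Counter(symbols)
    if len(freq) == 1:                  # degenerate one-symbol message
        return {next(iter(freq)): 1}
    # Heap entries: (subtree weight, tie-breaker, {leaf symbol: depth}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Same symbol frequencies, very different "alphabet" sizes -- the
# encoded payload comes out the same (23 bits) in both cases.
for msg in ("abracadabra",     # ASCII letters
            "абракадабра"):    # Cyrillic stand-in, same letter pattern
    lengths = huffman_code_lengths(msg)
    print(repr(msg), sum(lengths[c] for c in msg), "bits")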
Apparently there was a presentation at the first Unicode Implementers' 
Workshop, back in the early '90s, where this claim was made, and it has 
been accepted ever since.
I've implemented a Huffman encoder that operates directly on Unicode 
code points, and it does a noticeably better job of compressing the data 
than if it were applied to the UTF-8 or UTF-16 code units.  Even in 
contrived cases, like using combinations of the same 4 UTF-8 code units 
to generate 16 different characters, the direct approach works better 
(except for the tree-storage overhead, which usually makes Huffman 
inappropriate for such short messages anyway).  The reason is that the 
increase in per-symbol entropy is more than offset by the smaller 
number of symbols that have to be encoded.
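For anyone who wants to try the comparison without my code, here is a 
rough sketch along the same lines (it reuses the huffman_code_lengths 
helper from the sketch above, and the sample text is arbitrary).  It 
Huffman-codes the same string three ways -- over code points, over 
UTF-8 bytes, and over UTF-16 code units -- counting only the payload 
bits, not the stored tree:

# Rough reproduction of the comparison, not my original encoder.
def payload_bits(symbols):
    # Encoded size only; ignores the cost of transmitting the tree.
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# Arbitrary mixed-script sample text.
text = "Compression example: some English words, и несколько слов по-русски."

utf8 = text.encode("utf-8")
u16 = text.encode("utf-16-le")
# Group the UTF-16 bytes back into 16-bit code units.
utf16_units = [u16[i] | (u16[i + 1] << 8) for i in range(0, len(u16), 2)]

print("code points :", payload_bits(text), "bits")
print("UTF-8 units :", payload_bits(utf8), "bits")
print("UTF-16 units:", payload_bits(utf16_units), "bits")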
I'd like to know if anyone has done similar experiments, and how they 
turned out, and if anyone is interested in more details of my 
experiment.
None of this is meant as a criticism of SCSU, in which I am still a big 
believer.
-- Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages