From: Doug Ewell (dewell@adelphia.net)
Date: Sun May 06 2007 - 20:28:53 CDT
I just discovered UTN #31, "Fast Compression Algorithm for Unicode
Text," which describes a simplified Lempel-Ziv-type approach that is
applied directly to "16 bit Unicode characters," unlike most LZ and
other compression algorithms, which operate on bytes.
Although "16 bit" still isn't quite right -- there's more to Unicode
than the BMP -- the approach of compressing Unicode code points directly
was especially interesting to me, because I've wondered for a long time
about the statement in the Unicode FAQ on this very topic:
"To get the same effect [as SCSU] with one of the popular general
purpose algorithms, like Huffman or any of the variants of Lempel-Ziv
compression, it would have to be retargeted to 16-bit, losing
effectiveness due to the larger alphabet size. It's relatively easy to
work out the math for the Huffman case to show how many extra bits the
compressed text would need just because the alphabet was larger.
Similar effects exist for LZW."
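As I read it, the FAQ's arithmetic only works if you assume a fixed-length code, where a larger alphabet really does cost more bits per symbol no matter what the message contains. A quick Python sketch of that assumption (my own illustration, not from the FAQ):

```python
# Fixed-length coding cost per symbol: ceil(log2(alphabet size)).
# This is the "larger alphabet = more bits" arithmetic, and it only
# holds when every symbol gets the same code length.
from math import ceil, log2

ascii_bits = ceil(log2(128))        # 128-character ASCII range
unicode_bits = ceil(log2(0x110000)) # full Unicode code space

print(ascii_bits)    # 7 bits/symbol
print(unicode_bits)  # 21 bits/symbol
```

Huffman coding, of course, does not assign fixed-length codes, which is exactly where I think the argument breaks down.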
This puzzles me because in Huffman coding, the symbols (characters) that
*don't* appear in the message to be encoded don't appear in the Huffman
tree, and should have no impact on the encoded message. A message of,
say, 50 discrete characters, whether drawn from the 128-character ASCII
range or the million-character Unicode range, should have the same
entropy. The representation of the character in the Huffman tree might
be larger if the "alphabet" is larger, but this cost should be borne
only *once*, in the tree, not every time the character is encoded.
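To make that concrete, here's a small Python sketch (my own, not part of any spec) showing that the entropy of a message depends only on the symbols that actually occur in it, not on the size of the alphabet they were drawn from:

```python
# Shannon entropy over the observed symbol distribution. Symbols that
# never occur contribute nothing, so alphabet size is irrelevant.
from collections import Counter
from math import log2

def entropy_bits(symbols):
    """Entropy in bits per symbol of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())

ascii_msg = "abbaabbaba"  # two symbols from the 128-character ASCII range
# Same pattern using two Hiragana letters from far up the code space:
unicode_msg = "\u3042\u3044\u3044\u3042\u3042\u3044\u3044\u3042\u3044\u3042"

print(entropy_bits(ascii_msg))    # 1.0 bit/symbol
print(entropy_bits(unicode_msg))  # 1.0 bit/symbol -- identical
```

The per-symbol cost is identical; only the one-time description of the symbols themselves (the tree) grows with the alphabet.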
Apparently there was a presentation at the first Unicode Implementers'
Workshop, back in the early '90s, where this claim was made, and it has
been accepted ever since.
I've implemented a Huffman encoder that operates directly on Unicode
code points, and it does a noticeably better job of compressing the data
than if it were applied to the UTF-8 or UTF-16 code units. Even in
contrived cases, like using combinations of the same 4 UTF-8 code units
to generate 16 different characters, the direct approach works better
(except for the tree-storage overhead, which usually makes Huffman
inappropriate for such short messages anyway). The reason is that the
increase in entropy is more than made up by the reduced number of
encoded symbols.
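For anyone curious, here's a stripped-down Python sketch along the lines of my experiment (an illustration, not my actual encoder), comparing the payload size of a Huffman code built over code points against one built over UTF-8 bytes:

```python
# Minimal Huffman code-length computation, applied once with Unicode
# code points as the symbols and once with UTF-8 bytes as the symbols.
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:  # degenerate one-symbol message
        return {next(iter(counts)): 1}
    # Heap entries: (weight, tiebreak, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encoded_bits(symbols):
    """Total payload bits (tree-storage overhead not counted)."""
    symbols = list(symbols)
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    return sum(counts[s] * lengths[s] for s in counts)

text = "\u0391\u0392\u0392\u0391\u0393" * 10  # Greek letters, 2 UTF-8 bytes each

bits_codepoints = encoded_bits(text)            # symbols = code points
bits_utf8 = encoded_bits(text.encode("utf-8"))  # symbols = UTF-8 bytes
print(bits_codepoints, bits_utf8)
```

With the byte-oriented version, the shared lead byte inflates the symbol count, so even though each byte symbol gets a shorter code on average, the total payload comes out larger, which matches what I saw.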
I'd like to know if anyone has done similar experiments and how they
turned out, and whether anyone is interested in more details of mine.
None of this is meant as a criticism of SCSU, in which I am still a big
believer.
-- Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages
This archive was generated by hypermail 2.1.5 : Sun May 06 2007 - 20:31:07 CDT