From: Doug Ewell (dewell@adelphia.net)
Date: Tue May 08 2007 - 01:20:11 CDT
Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> You seem to forget that such Huffman coding requires not only storing
> the bitstreams representing each compressed code point, but also the
> table that will be needed to decode the bit-stream.
Since I just finished implementing a Huffman encoder and decoder, and
since I wrote in my last message "except for the tree-storage overhead,
which usually makes Huffman inappropriate for such short messages
anyway," and also wrote in UTN #14 about storing the tree, it's probably
safe to assume that I didn't just forget about storing the tree.
In a conventional byte-oriented Huffman implementation, for each of N
encoded symbols (including any "pseudo-EOF" character that is needed to
tell the decoder when to stop), the tree typically requires 10N - 2 bits
(8 bits for each terminal node plus 2 for each non-terminal node, of
which there are N - 1). The total size of the tree obviously depends on
the number of discrete symbols in the message. For a short but highly
redundant message like "she sells seashells by the seashore" (entropy =
3.04 bits per character), the tree is small and the total encoded
message, tree plus
bitstream, is smaller than the plaintext. For a short pangram like "the
quick brown fox jumps over the lazy dog" (entropy = 4.38), the tree
alone is larger than the plaintext.
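To make the arithmetic concrete, here is a rough back-of-the-envelope
sketch in Python (illustrative only, not the encoder described above)
that computes the 10N - 2 tree-storage estimate and the zero-order
entropy, which lower-bounds the Huffman bit stream:

    # Rough estimate for a byte-oriented Huffman coder: 10N - 2 bits of
    # tree storage for N distinct symbols, plus an entropy lower bound
    # on the compressed bit stream itself.
    from collections import Counter
    from math import log2

    def entropy_bits_per_symbol(text):
        counts = Counter(text)
        total = len(text)
        return -sum(c / total * log2(c / total) for c in counts.values())

    def tree_storage_bits(text, leaf_bits=8):
        # N leaves at leaf_bits each, plus 2 bits for each of the
        # N - 1 internal nodes (a pseudo-EOF would add one more leaf).
        n = len(set(text))
        return leaf_bits * n + 2 * (n - 1)

    text = "she sells seashells by the seashore"
    h = entropy_bits_per_symbol(text)   # about 3.04 bits per character
    tree = tree_storage_bits(text)      # 11 distinct symbols -> 108 bits
    stream = h * len(text)              # the bit stream needs ~106 bits or more
    print(f"tree {tree} + stream >= {stream:.0f} bits, vs. plaintext "
          f"{8 * len(text)} bits")

For the seashells example this comes to roughly 108 + 106 = 214 bits
against 280 bits of plaintext, in line with the figures quoted above.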
An implementation that allows the entire Unicode range to be encoded
clearly requires more than 8 bits per code point. I used
self-delimiting numeric values [1] to represent code points up to U+007F
in 8 bits, up to U+3FFF in 16 bits, and the rest in 24 bits. This
doesn't perform so well with short examples in Ugaritic or Syloti
Nagri, but as the message gets longer, the relative overhead of the
tree shrinks, because the number of discrete characters grows more
slowly than the text itself and eventually hits a ceiling.
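A minimal sketch of such a self-delimiting value, assuming the usual
SDNV-style layout (seven payload bits per byte, high bit set on every
byte except the last); whether this matches the exact bit layout used
in the experiment is an assumption, but it reproduces the size breaks
mentioned above:

    # SDNV-style self-delimiting values: seven payload bits per byte,
    # high bit set on every byte except the last.  This yields one byte
    # up to U+007F, two bytes up to U+3FFF, and three bytes for the
    # rest of the Unicode range.
    def encode_sdnv(cp):
        out = [cp & 0x7F]
        cp >>= 7
        while cp:
            out.append(0x80 | (cp & 0x7F))
            cp >>= 7
        return bytes(reversed(out))

    def decode_sdnv(data):
        """Return (value, number of bytes consumed)."""
        value = 0
        for i, b in enumerate(data):
            value = (value << 7) | (b & 0x7F)
            if not b & 0x80:
                return value, i + 1
        raise ValueError("truncated value")

    assert len(encode_sdnv(0x007F)) == 1    # 8 bits
    assert len(encode_sdnv(0x3FFF)) == 2    # 16 bits
    assert len(encode_sdnv(0x10FFFF)) == 3  # 24 bits
    assert decode_sdnv(encode_sdnv(0x10380)) == (0x10380, 3)  # a Ugaritic letter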
It might be possible to improve on this storage scheme by using an
SCSU-like sliding window, or by encoding differences between symbols if
the order in which the tree will be read is known. In my experiments,
this improved matters slightly for texts in small alphabets above
U+4000, but made things worse for non-English Latin-based texts.
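One form such delta coding could take (hypothetical, not necessarily
the exact scheme used in those experiments): if the decoder is known to
read the leaves in ascending code point order, store the first symbol
absolutely and then only the gaps, reusing encode_sdnv from the sketch
above so that small gaps fit in a single byte:

    # Hypothetical delta coding of the tree's leaf symbols, assuming
    # the decoder reads them in ascending code point order.
    def delta_encode_leaves(code_points):
        ordered = sorted(set(code_points))
        deltas = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
        return b"".join(encode_sdnv(d) for d in deltas)

    # Six consecutive Ugaritic letters: 3 bytes each (18 total) stored
    # absolutely, but 3 + 5 = 8 bytes when delta-coded.
    ugaritic = list(range(0x10380, 0x10386))
    assert len(delta_encode_leaves(ugaritic)) == 8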
> On a large alphabet like Unicode, this conversion table will have a
> very significant size, unless you have a predetermined table, based on
> fixed distributions, but then the compression will only be effective
> on texts having the same code point frequencies as in the fixed
> distribution.
Predetermined tables are useful when the texts to be encoded are in the
same language, and ideally of the same nature. For an experimental
system such as mine, meant to show compression performance on any sort
of Unicode text, it would clearly be the wrong approach.
> Huffman coding is effective, for example, if you know the source
> language and script of most of the texts you have to compress, but if
> there are texts that have different distributions, the Huffman coding
> may be MUCH larger.
Any type of compression becomes more effective when more can be assumed
about the input.
> By contrast, the SCSU algorithm, which was adopted by Unicode with a
> formal definition for general use, does not depend on the choice of
> language or script (at least for almost all modern languages written
> only with characters from the code point ranges that existed when the
> SCSU algorithm was finalized).
You don't have to sell me on SCSU; I'm a true believer (as I wrote in my
previous message). But one-size-fits-all compression solutions are hard
to come by. Most commercial compression tools can fall back on a
secondary algorithm, or even straight storage with no compression, if
the primary algorithm is not effective. In the case of SCSU, its
Achilles' heel lies with short-alphabet text spread across multiple
128-character half-blocks, as in Vietnamese, Ethiopic, and Canadian
Syllabics.
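A quick way to see the problem (illustrative only, not an SCSU
encoder): count the 128-character half-blocks a text touches. Each
SCSU dynamic window covers a run of 128 consecutive characters, so an
alphabet spread across several half-blocks keeps forcing window
switches or a fall-back to Unicode mode:

    # Count the distinct 128-character half-blocks a text touches -- a
    # rough proxy for how many SCSU window switches the encoder may need.
    def half_blocks(text):
        return sorted({ord(c) // 128 for c in text})

    print(half_blocks("Đường"))        # [0, 2, 3, 61]: four half-blocks in one word
    print(half_blocks("hello world"))  # [0]: a single half-block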
> And it does not depend on first building a conversion table, or even
> on gathering statistics from a significant part of the text to
> encode, so the SCSU algorithm can be applied "on the fly", i.e.
> without prior knowledge of the text and in a single pass, with a very
> small encoding state (of just one code point). There may be a few
> cases where looking ahead a few more code points can optimize the
> result further, but experiments have shown that the gain is very
> modest, and not always worth the effort (in terms of code complexity
> and encoding throughput, measured in encoded megabytes per second on
> the media where the compressed byte stream will be transported or
> stored).
I assume you're telling other people. I wrote about all this in UTN #14
three years ago.
> SCSU has another advantage: it is fully deterministic
Not true. As you just said yourself, encoders that look ahead farther
can achieve modestly better (i.e. different) results.
> and no more complex to decode whether or not the encoded stream was
> optimized, and the worst case has a very small and strictly limited
> overhead compared to an uncompressed stream; this makes SCSU even
> more interesting for permanent storage of large volumes of text, to
> increase the effective throughput of the decoded stream when
> retrieving it, and to reduce the average access time, due to a higher
> hit rate in data caches (for example in large relational databases).
I agree fully about the ease of decompressing SCSU, and said so in UTN
#14. (Microsoft didn't take my advice about supporting SCSU in Internet
Explorer.)
> But now we look at this LZ-like algorithm; remember that this is just
> a technical note, for information purposes, provided mostly as a call
> for comments and improvements; it just describes some principles in a
> public document with a copyright assigned to the author and to
> Unicode, possibly as proof of prior art to protect the principle
> against the possible development of an aggressive patent claim, for
> what could later become an approved algorithm
There's obviously plenty of prior art surrounding LZ-type compression,
especially a simplified version such as this one.
> and it is made as a possible response to the possible patent claims on
> BOCU-1 or SCSU.
SCSU does not belong in this discussion. It was derived from a simpler
scheme (RCSU) developed at Reuters, and its publication as a freely
available UTR was led by two Reuters employees. It has been around for
10 years now and nobody has attempted to claim any IPR on it or restrict
its use. (BOCU-1, of course, is another matter.)
> In fact, the conclusion of this document should consider comparing the
> (still unnamed) LZ-like algorithm against SCSU and BOCU-1. I consider
> it a viable alternative, because the decompression algorithm is
> fully deterministic too, and very straightforward to implement, as
> demonstrated in the sample code.
I'd love to see it, and will probably conduct my own experiments with it
as well. I have no great criticism to make against the UTN #31
approach; as I said, I just discovered the UTN this past weekend.
[1]
ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-eddy-dtn-sdnv-02.txt
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages