From: Doug Ewell (dewell@adelphia.net)
Date: Sun Sep 24 2006 - 17:46:20 CST
John D. Burger <john at mitre dot org> wrote:
> On the notion of analyzing the words in text, sorting by frequency,
> and assigning shorter code units to higher frequency words for
> compression:
>
> This is typically not worth the effort - high-frequency words perforce
> are more likely to occur earlier in the text, and thus are given short
> code words with no such analysis needed. Moreover, not defining what
> a "word" is lets Ziv-Lempel and friends discover subwords and
> multi-word sequences automagically. They essentially do stemming
> without knowing anything about language at all.
This was a special-purpose project that I rolled myself, where
compression happens only once and decompression happens repeatedly, and
where I elected to use a simpler and lighter-weight mechanism than LZ.
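For anyone who wants to see the general idea concretely, here is a minimal Python sketch of frequency-ranked word coding. It is not the code from the project described above, whose details are not given in this message. The tokenizer, the one-or-two-byte code layout, and the names build_table, encode_rank, compress and decompress are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: alternating runs of word and non-word characters,
# so that joining the tokens reproduces the input exactly.
TOKEN_RE = re.compile(r"\w+|\W+")

def build_table(text):
    """Return the distinct tokens, most frequent first (ties broken by token)."""
    counts = Counter(TOKEN_RE.findall(text))
    return sorted(counts, key=lambda t: (-counts[t], t))

def encode_rank(rank):
    """Assumed layout: 1 byte for ranks 0..127, 2 bytes for ranks 128..32895."""
    if rank < 128:
        return bytes([rank])
    rank -= 128
    if rank >= 128 * 256:
        raise ValueError("too many distinct tokens for this sketch")
    return bytes([0x80 | (rank >> 8), rank & 0xFF])

def compress(text):
    """One-time, comparatively expensive step: count, rank, then encode."""
    table = build_table(text)
    rank_of = {tok: i for i, tok in enumerate(table)}
    data = b"".join(encode_rank(rank_of[t]) for t in TOKEN_RE.findall(text))
    return table, data

def decompress(table, data):
    """Repeated, cheap step: a table lookup per code, no analysis needed."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            rank = b
            i += 1
        else:
            rank = (((b & 0x7F) << 8) | data[i + 1]) + 128
            i += 2
        out.append(table[rank])
    return "".join(out)

if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    table, data = compress(sample)
    assert decompress(table, data) == sample
    print(len(sample), "characters ->", len(data), "code bytes (table not counted)")
```

The asymmetry is the point of the sketch: all the counting and sorting happens once in compress, while decompress is a plain table lookup. Note that the table itself has to be stored or shipped along with the data, which the byte count printed above ignores.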
> Also remember that compression ratio is not the only figure of merit -
> compression speed is also important.
Point well taken. My impression is that the approach I took, for its
limited purpose, is comparable to LZ in speed, but that's just a guess
since I haven't profiled either one.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14