From: John D. Burger (john@mitre.org)
Date: Fri Sep 22 2006 - 21:28:06 CDT
On the notion of analyzing the words in text, sorting by frequency,
and assigning shorter code units to higher frequency words for
compression:
This is typically not worth the effort. In an adaptive scheme,
high-frequency words perforce are more likely to occur early in the
text, and thus are given short code words with no such analysis
needed. Moreover, not
defining what a "word" is lets Ziv-Lempel and friends discover
subwords and multi-word sequences automagically. They essentially do
stemming without knowing anything about language at all.
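To make the point about subwords and multi-word sequences concrete, here is a minimal sketch of an LZ78-style parse, one member of the Ziv-Lempel family, in Python. The names lz78_parse and lz78_decode and the sample sentence are illustrative assumptions, not any library's API, and production compressors differ in many details. It only shows that the phrase dictionary grows to cover word fragments and then multi-word runs without ever being told where the words are.

```python
def lz78_parse(text):
    """LZ78-style parse: return (pairs, phrases).

    Each pair is (index of longest known prefix phrase, next character).
    No notion of a "word" is used; phrases are arbitrary character runs.
    """
    dictionary = {"": 0}   # phrase -> index
    phrases = [""]         # index -> phrase
    pairs = []
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the match while the phrase is already known.
            current = candidate
        else:
            # Emit the known prefix plus one new character, then learn it.
            pairs.append((dictionary[current], ch))
            dictionary[candidate] = len(phrases)
            phrases.append(candidate)
            current = ""
    if current:
        # Input ended in the middle of a known phrase: flush it.
        pairs.append((dictionary[current], ""))
    return pairs, phrases


def lz78_decode(pairs):
    """Rebuild the text from the pairs produced by lz78_parse."""
    phrases = [""]
    parts = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        parts.append(phrase)
        phrases.append(phrase)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical sample text, chosen only because it repeats itself.
    sample = "the cat and the hat and the bat " * 20
    pairs, phrases = lz78_parse(sample)
    assert lz78_decode(pairs) == sample
    print(len(sample), "characters ->", len(pairs), "pairs")
    # The longest learned phrases span several words.
    for phrase in sorted(phrases, key=len, reverse=True)[:5]:
        print(repr(phrase))
```

Running it on repetitive text shows the learned phrases starting as single characters and growing past word boundaries, which is the sense in which the dictionary is discovered rather than supplied.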
Also remember that compression ratio is not the only figure of merit:
compression speed is also important.
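As a rough illustration of weighing the two figures of merit together, the following sketch times Python's standard zlib module (a Ziv-Lempel-based compressor) at several compression levels. The helper name measure and the sample data are assumptions made for the example; timings vary by machine and input, so no particular numbers are claimed here.

```python
import time
import zlib


def measure(data, level):
    """Return (compression ratio, seconds taken) for zlib at a given level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


if __name__ == "__main__":
    # Hypothetical input: repetitive English-like text as bytes.
    data = b"high-frequency words tend to show up early and often. " * 5000
    for level in (1, 6, 9):
        ratio, seconds = measure(data, level)
        print(f"level {level}: ratio {ratio:.1f}, {seconds * 1000:.2f} ms")
```

Comparing the rows for a given input gives both numbers side by side, which is usually what the choice of compressor or setting comes down to.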
- John Burger
MITRE