From: Hans Aberg (haberg@math.su.se)
Date: Sat Sep 23 2006 - 07:28:21 CDT
On 23 Sep 2006, at 04:28, John D. Burger wrote:
> On the notion of analyzing the words in text, sorting by frequency,
> and assigning shorter code units to higher frequency words for
> compression:
>
> This is typically not worth the effort - high-frequency words
> perforce are more likely to occur earlier in the text, ...
This seems to be a description of how those on-the-fly compression
algorithms work, rather than a description of, say, typical English
texts (see link below). Why would high-frequency English words be more
likely to appear early in a typical English text?
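For reference, the word-frequency scheme under discussion could be
sketched roughly as below in Python. The variable-length index encoding
is only an illustrative assumption, and in practice the codebook itself
would also have to be stored or transmitted:

from collections import Counter

def build_codebook(text):
    # Count whitespace-separated "words" and rank them by frequency.
    freq = Counter(text.split())
    # The most frequent word gets index 0, which encodes in the fewest bytes.
    return {w: i for i, (w, _) in enumerate(freq.most_common())}

def encode_index(i):
    # Toy variable-length encoding: 7 data bits per byte,
    # high bit set means another byte follows.
    out = bytearray()
    while True:
        out.append((i & 0x7F) | (0x80 if i > 0x7F else 0))
        i >>= 7
        if i == 0:
            return bytes(out)

text = "the cat sat on the mat and the dog sat by the cat"
codebook = build_codebook(text)
encoded = b"".join(encode_index(codebook[w]) for w in text.split())
print(len(text.encode()), "bytes of text ->", len(encoded), "bytes of codes")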
> ...and thus are given short code words with no such analysis
> needed. Moreover, not defining what a "word" is lets Ziv-Lempel
> and friends discover subwords and multi-word sequences
> automagically. They essentially do stemming without knowing
> anything about language at all.
And <http://en.wikipedia.org/wiki/Ziv-Lempel-Welch> says:
The algorithm is designed to be fast to implement but not
necessarily optimal since it does not perform any analysis on the data.
So they work that way for reasons other than obtaining the most
efficient compression.
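For concreteness, a rough LZW-style encoder can be sketched in a few
lines of Python (a simplification, not necessarily the exact variant
described at the link above); the point is that the dictionary grows on
the fly from whatever byte sequences actually repeat, with no notion of
a word:

def lzw_encode(data):
    # Start with all single bytes; grow the dictionary as sequences repeat.
    dictionary = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                      # keep extending the match
        else:
            codes.append(dictionary[current])        # emit code for the known prefix
            dictionary[candidate] = len(dictionary)  # learn the new sequence
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

sample = b"to be or not to be, that is the question: to be or not to be"
print(len(sample), "input bytes ->", len(lzw_encode(sample)), "output codes")

Repeated phrases like "to be" gradually enter the dictionary as
progressively longer entries, which is the automatic "stemming" effect
mentioned above.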
> Also remember that compression ratio is not the only figure of
> merit - compression speed is also important.
Well, one type of application I have in mind is very large linguistic
databases - compressing the whole of Wikipedia was one example.
So at least in some circumstances, the main interest will be to have
a database that is fairly compact and fast to read and search.
And there isn't one compression algorithm that will fit all needs.
Hans Aberg