From: John D. Burger (john@mitre.org)
Date: Mon Sep 25 2006 - 12:58:16 CST
Hans Aberg wrote:
>> On the notion of analyzing the words in text, sorting by
>> frequency, and assigning shorter code units to higher frequency
>> words for compression:
>>
>> This is typically not worth the effort - high-frequency words
>> perforce are more likely to occur earlier in the text, ...
>
> This seems to be a description of how those on-the-fly compression
> algorithms work, rather than a description of, say, typical English
> texts (see link below). Why would high-frequency English words
> appear more frequently in a typical English text?

??? I'm assuming this tautological query was mistyped. If you meant
to ask why high-frequency English words are likely to appear
=earlier= in a typical text, well, for me this is almost tautological
as well, but ...

High-frequency words are so because they occur in many sentences, and
thus they are likely to occur in the first few sentences of a typical
text. These words include prepositions, pronouns, and other "stop
words", and it's rather difficult to produce English text without
using them. The top five most frequent words from a large corpus I
am currently using are:
the
of
and
to
in
I used all five in the first sentence of this paragraph.
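
For anyone who wants to check this on their own text, here is a minimal sketch in Python. The function names and the crude regex tokenizer are my own assumptions for illustration, and the sample is just the sentence from this message, not the large corpus mentioned above. It reports the most frequent words and the token position at which each of a given set of words first appears.

```python
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=5):
    # The n most frequent words with their counts.
    return Counter(tokenize(text)).most_common(n)


def first_positions(text, targets):
    # Map each target word to the index of its first token, or None if absent.
    seen = {}
    for i, word in enumerate(tokenize(text)):
        seen.setdefault(word, i)
    return {w: seen.get(w) for w in targets}


sample = ("High-frequency words are so because they occur in many sentences, "
          "and thus they are likely to occur in the first few sentences of a "
          "typical text.")

if __name__ == "__main__":
    print(top_words(sample))
    print(first_positions(sample, ["the", "of", "and", "to", "in"]))
```

On a real corpus one would call top_words on the whole text and then look at first_positions for those words; if the argument holds, the positions should be small relative to the length of the text.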
- John D. Burger
MITRE