Re: Unicode & space in programming & l10n

From: Hans Aberg (haberg@math.su.se)
Date: Mon Sep 25 2006 - 14:16:13 CST

    On 25 Sep 2006, at 20:58, John D. Burger wrote:

    > Hans Aberg wrote:
    >
    >>> On the notion of analyzing the words in text, sorting by
    >>> frequency, and assigning shorter code units to higher frequency
    >>> words for compression:
    >>>
    >>> This is typically not worth the effort - high-frequency words
    >>> perforce are more likely to occur earlier in the text, ...
    >>
    >> This seems to be a description of how those on-the-fly compression
    >> algorithms work, rather than a description of, say, typical English
    >> texts (see link below). Why would high-frequency English words
    >> appear more frequently in a typical English text?
    >
    > ??? I'm assuming this tautological query was mis-typed. If you
    > meant to ask why high-frequency English words are likely to appear
    > =earlier= in a typical text, well, for me this is almost
    > tautological as well, but ...
    >
    > High-frequency words are so because they occur in many sentences,
    > and thus they are likely to occur in the first few sentences of a
    > typical text.

    ??? But they appear in the later sentences as well, I would gather.

    > These words include prepositions, pronouns, and other "stop words",
    > and it's rather difficult to produce English text without using
    > them. The top five most frequent words from a large corpus I am
    > currently using are:
    >
    > the
    > of
    > and
    > to
    > in
    >
    > I used all five in my first sentence above.

    And how do you know which are the more frequent ones by merely
    looking at the first few sentences? And if one collects a
    considerable number of them, the most frequent words would not even
    fit into the first few sentences.
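
    The question can be checked empirically. What follows is only a
    minimal sketch, not anything used in this thread: the tokenizer is
    deliberately crude, and the names words, top_words and
    prefix_overlap are hypothetical. It measures how many of the k most
    frequent words of a whole text are already among the k most
    frequent words of its first few tokens.

```python
import re
from collections import Counter


def words(text):
    """Lowercase word tokens; a crude tokenizer assumed for illustration."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, k):
    """The k most frequent tokens; equal counts keep first-seen order."""
    return [w for w, _ in Counter(tokens).most_common(k)]


def prefix_overlap(text, prefix_tokens, k):
    """Fraction of the full-text top-k words that are also in the
    top-k of the first `prefix_tokens` tokens."""
    tokens = words(text)
    full = set(top_words(tokens, k))
    early = set(top_words(tokens[:prefix_tokens], k))
    if not full:
        return 0.0
    return len(full & early) / len(full)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird saw the end of the road"
    # Compare a very short prefix with a slightly longer one.
    print(prefix_overlap(sample, 4, 2))
    print(prefix_overlap(sample, 6, 2))
```

    A value near 1 for a short prefix would support the claim that the
    frequent words show up early; a low value for larger k would
    support the objection that a long list of frequent words cannot fit
    into the first few sentences.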

       Hans Aberg


