RE: Characters

From: Doug Ewell (doug@ewellic.org)
Date: Fri Feb 11 2011 - 12:14:33 CST

    <anbu at peoplestring dot com> wrote:

    > No, this is not a joke. Whenever I post something, you are making fun
    > of it. What's the problem? I seriously want to know the characters
    > present in Unicode 6 and each of their frequencies of usage.

    The characters are available from the Unicode Character Database, as
    others have said.
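
    For a quick look without downloading the UCD files, here is a minimal
    Python sketch that walks every code point and keeps the assigned ones.
    It relies on the standard unicodedata module, whose data reflects
    whatever Unicode version ships with your Python build, which is not
    necessarily Unicode 6. The helper name assigned_characters is mine, and
    treating the categories Cn, Cs, and Co (unassigned, surrogate, private
    use) as "not a character" is an assumption for this illustration.

```python
import sys
import unicodedata

# General categories skipped here: unassigned, surrogate, private use.
# Excluding them is an assumption about what "characters present" means.
SKIPPED = ("Cn", "Cs", "Co")

def assigned_characters():
    """Yield every code point this Python build's Unicode data treats as assigned."""
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) not in SKIPPED:
            yield ch

count = sum(1 for _ in assigned_characters())
print("Unicode data version:", unicodedata.unidata_version)
print("Assigned characters:", count)
```

    This lists the repertoire only. It says nothing about how often each
    character is used.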

    If you know or have read anything about text compression -- and I assume
    that is what you are trying to implement, based on this and previous
    postings -- you know that frequency of usage of text characters is
    completely, totally dependent on context.

    Letter frequencies in English text differ from those in French or
    Greek or Tamil or Japanese text. SMS messages probably have
    different frequencies compared to e-mails or scholarly works. Financial
    or statistical reports may have a higher concentration of digits. C#
    code has a high concentration of ( and ) and { and }. The beat goes on.
    There is no one frequency chart, for alphabetic letters or for all of
    Unicode, that is right for all text-compression needs; and a compression
    scheme that assumes one will fail spectacularly for text samples that
    fit a different model.
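
    To put a rough number on that mismatch, here is a small Python sketch.
    It builds a character-frequency model from one sample and measures the
    ideal average code length (cross-entropy, in bits per character) of
    another sample under that model. The two sample strings, the function
    names, and the add-one smoothing are all assumptions made up for this
    illustration; it is not a real compressor.

```python
import math
from collections import Counter

def char_model(text, alphabet, smoothing=1.0):
    """Estimate character probabilities from `text`.

    Add-one smoothing is an assumption of this sketch, so that characters
    never seen in the training sample still get a nonzero probability.
    """
    counts = Counter(text)
    total = len(text) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}

def bits_per_char(text, model):
    """Average ideal code length of `text` under `model` (cross-entropy)."""
    return -sum(math.log2(model[ch]) for ch in text) / len(text)

# Two tiny made-up samples: English prose and C#-like code.
prose = "the quick brown fox jumps over the lazy dog and then the dog sleeps"
code = "if (a) { f(b); } else { g(c); } while (x) { h(y); }"
alphabet = set(prose) | set(code)

prose_model = char_model(prose, alphabet)
code_model = char_model(code, alphabet)

matched = bits_per_char(code, code_model)      # code, model built from code
mismatched = bits_per_char(code, prose_model)  # code, model built from prose

print(f"code under its own model:  {matched:.2f} bits/char")
print(f"code under the prose model: {mismatched:.2f} bits/char")
```

    With these samples the code text costs more bits per character under
    the prose model than under its own, because ( and ) and { and } are
    rare or absent in the prose sample.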

    I'm not trying to make fun of your posts, but the simple fact is that
    your questions make me doubt whether you have enough background
    knowledge to take on a project like this. I recommend "Data Compression:
    The Complete Reference" by David Salomon for information about
    compression in general, and maybe also Unicode Technical Note #14
    (disclaimer: I wrote it) if you want to compress Unicode text.

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

