Re: Most commonly used characters not in BMP

From: Leonardo Boiko (leoboiko@gmail.com)
Date: Mon Jun 14 2010 - 11:27:54 CDT


    On Mon, Jun 14, 2010 at 13:10, John H. Jenkins <jenkins@apple.com> wrote:
    > I imagine that the best data would come from Google.

    As far as I know, Google discards punctuation and other miscellaneous
    characters during tokenization, so it would only work for the subset
    of Unicode they are willing to index (I think? I’m rusty on the
    details). I’d like just a simple, unfiltered, raw usage count per
    code point (perhaps with separate counters per language and per
    country, with the usual caveats about how hard those are to
    auto-detect).
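
    To make the idea concrete, here is a minimal sketch in Python 3 of
    what such an unfiltered tally might look like. Nothing here comes
    from Google or any real corpus: the names tally() and non_bmp(), the
    language tags and the sample strings are all made up for
    illustration, and the language tag is simply passed in by the
    caller, since detecting it automatically is the hard part.

```python
from collections import Counter, defaultdict

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def tally(counters, lang, text):
    """Add every code point in `text` to the counter for `lang`.

    Nothing is filtered: punctuation, spaces and symbols all count.
    `lang` is a caller-supplied tag (hypothetical; no auto-detection).
    """
    # Iterating a Python 3 str yields whole code points, not UTF-16 units.
    counters[lang].update(ord(ch) for ch in text)


def non_bmp(counter):
    """Return only the entries for code points above U+FFFF."""
    return Counter({cp: n for cp, n in counter.items() if cp > BMP_MAX})


# Illustrative input only; a real survey would stream a large corpus.
counters = defaultdict(Counter)
tally(counters, "en", "Hi, \U0001F600\U0001F600!")
tally(counters, "ja", "\U0002000B\u3001\U0002000B")

for lang, counter in counters.items():
    for cp, n in non_bmp(counter).most_common():
        print(f"{lang} U+{cp:04X} {n}")
```

    Keeping one Counter per tag means the per-language totals can be
    summed later into a global count, while the raw per-code-point
    numbers stay available for anyone who wants to filter differently.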


