Re: Most commonly used characters not in BMP

From: Leonardo Boiko (
Date: Mon Jun 14 2010 - 11:27:54 CDT

    On Mon, Jun 14, 2010 at 13:10, John H. Jenkins <> wrote:
    > I imagine that the best data would come from Google.

    As far as I know, Google discards punctuation and other miscellaneous
    characters during tokenization, so it would only work for the subset
    of Unicode they are willing to index (I think? I’m rusty on the
    details). I’d like just a simple, unfiltered, raw usage count per
    codepoint (perhaps with separate counters per-language and country,
    with the usual caveats of how hard it is to auto-detect those).
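
A minimal sketch of what such a raw per-codepoint counter might look like, assuming Python 3 (where iterating a `str` yields whole code points rather than UTF-16 surrogates). The function names `count_codepoints` and `non_bmp_counts` and the sample string are hypothetical, chosen only for illustration; nothing here reflects how Google or any other indexer actually tokenizes text.

```python
from collections import Counter

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def count_codepoints(text):
    """Return a Counter mapping each code point (int) to its raw frequency.

    Nothing is filtered: punctuation, whitespace and symbols are all counted.
    """
    return Counter(ord(ch) for ch in text)


def non_bmp_counts(counts):
    """Keep only the code points above U+FFFF, i.e. outside the BMP."""
    return Counter({cp: n for cp, n in counts.items() if cp > BMP_MAX})


# Hypothetical sample: ASCII, punctuation, and three non-BMP characters.
sample = "a, b! \U0001F600\U0001F600 \U00020000"
counts = count_codepoints(sample)

# List non-BMP code points, most frequent first.
for cp, n in non_bmp_counts(counts).most_common():
    print(f"U+{cp:04X}\t{n}")
```

Separate per-language or per-country tallies could be kept as one `Counter` per label in a dictionary, but assigning the labels is the hard part and is not attempted here.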

