Re: Most commonly used characters not in BMP

From: Leonardo Boiko (leoboiko@gmail.com)
Date: Mon Jun 14 2010 - 11:27:54 CDT

Next message: Asmus Freytag: "Re: Writing a proposal for an unusual script: SignWriting"

Previous message: Stephen Slevinski: "Re: Writing a proposal for an unusual script: SignWriting"
In reply to: John H. Jenkins: "Re: Most commonly used characters not in BMP"
Next in thread: Mark Davis ☕: "Re: Most commonly used characters not in BMP"

On Mon, Jun 14, 2010 at 13:10, John H. Jenkins <jenkins@apple.com> wrote:
> I imagine that the best data would come from Google.

As far as I know, Google discards punctuation and other miscellaneous
characters during tokenization, so its data would only cover the subset
of Unicode it is willing to index (I think; I’m rusty on the
details). What I’d like is a simple, unfiltered, raw usage count per
code point (perhaps with separate counters per language and per country,
with the usual caveats about how hard those are to auto-detect).
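
To make the idea concrete, here is a minimal sketch of such a raw counter in
Python. It assumes the text is already decoded to Unicode and that a language
tag comes with each text (detection is out of scope). The function names
count_codepoints, non_bmp_counts and count_by_language are made up for
illustration and are not from any existing tool.

```python
from collections import Counter

BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def count_codepoints(text):
    """Return a Counter mapping each code point (int) to its raw frequency.

    Nothing is filtered: punctuation, spaces and symbols are all counted.
    """
    return Counter(ord(ch) for ch in text)


def non_bmp_counts(counts):
    """Keep only the code points above U+FFFF (outside the BMP)."""
    return Counter({cp: n for cp, n in counts.items() if cp > BMP_MAX})


def count_by_language(tagged_texts):
    """Aggregate counts per language.

    tagged_texts is an iterable of (language_tag, text) pairs; the tags are
    assumed to be supplied by the caller, not auto-detected here.
    """
    per_lang = {}
    for lang, text in tagged_texts:
        per_lang.setdefault(lang, Counter()).update(ord(ch) for ch in text)
    return per_lang


if __name__ == "__main__":
    sample = "Hello, world! \U00020000\U00020000 \U0001F600"
    counts = count_codepoints(sample)
    # List non-BMP code points, most frequent first.
    for cp, n in non_bmp_counts(counts).most_common():
        print(f"U+{cp:04X} {n}")
```

Python 3 strings are sequences of code points, so supplementary-plane
characters are counted once each rather than as surrogate pairs. A real survey
would still need a representative corpus, which is the hard part.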

This archive was generated by hypermail 2.1.5 : Mon Jun 14 2010 - 11:29:27 CDT