From: Leonardo Boiko (leoboiko@gmail.com)
Date: Mon Jun 14 2010 - 11:27:54 CDT
On Mon, Jun 14, 2010 at 13:10, John H. Jenkins <jenkins@apple.com> wrote:
> I imagine that the best data would come from Google.
As far as I know, Google discards punctuation and other miscellaneous
characters during tokenization, so it would only work for the subset
of Unicode they are willing to index (I think? I’m rusty on the
details). I’d like just a simple, unfiltered, raw usage count per
codepoint (perhaps with separate counters per language and per country,
with the usual caveats about how hard those are to auto-detect).
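The kind of raw, unfiltered tally I have in mind could be sketched like this
(a hypothetical illustration, not anyone's actual indexing pipeline: it counts
every codepoint in a text with no tokenization, filtering, or language
detection):

```python
from collections import Counter

def codepoint_counts(text):
    """Return a Counter mapping each codepoint (as 'U+XXXX') to its count.

    No tokenization or filtering: punctuation and miscellaneous
    characters are counted just like letters.
    """
    counts = Counter()
    for ch in text:
        counts["U+%04X" % ord(ch)] += 1
    return counts

# Example: every character, including the comma and '!', gets counted.
sample = "héllo, wörld!"
counts = codepoint_counts(sample)
```

Per-language or per-country breakdowns would just be a dictionary of such
counters keyed by whatever (imperfect) detection result is available.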
This archive was generated by hypermail 2.1.5 : Mon Jun 14 2010 - 11:29:27 CDT