Re: Most commonly used characters not in BMP

From: Mark Davis ☕ (mark@macchiato.com)
Date: Mon Jun 14 2010 - 20:15:00 CDT

Next message: Tulasi: "Re: Latin Script"

Previous message: Rick McGowan: "Unicode 6.0 beta code charts - updated today"
In reply to: John H. Jenkins: "Re: Most commonly used characters not in BMP"

From a sampling of the web (about 0.7M docs), the most common supplementary
characters are, curiously, private use. Top is [?] U+FEB85. For Han, the top
few are: 𣿡, 𠀤, 𩇫, 𥑬, 𤥂, 𡛺, 𤎌, 𠜎,... There are also, oddly, some
Gothic and Shavian characters.

However, the data gets pretty noisy; it would take a bigger sample to get
more reliable data.
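
For anyone who wants to repeat this kind of tally on their own corpus, here
is a minimal sketch in Python. It is not the code used for the sample above;
the function names (plane_name, count_supplementary, report) and the three
sample strings are made up for illustration, and the plane labels are
deliberately coarse.

```python
from collections import Counter


def plane_name(cp):
    """Return a coarse label for the plane that code point cp falls in."""
    plane = cp >> 16
    if plane == 0:
        return "BMP"
    if plane == 1:
        return "SMP"
    if plane == 2:
        return "SIP"
    if plane in (15, 16):
        # Supplementary Private Use Areas A and B
        return "private use"
    return "other"


def count_supplementary(texts):
    """Count every character above U+FFFF across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ord(ch) > 0xFFFF)
    return counts


def report(counts, top=10):
    """List the most frequent characters as (code point, plane, count)."""
    return [
        (f"U+{ord(ch):04X}", plane_name(ord(ch)), n)
        for ch, n in counts.most_common(top)
    ]


if __name__ == "__main__":
    # Hypothetical stand-ins for crawled documents, not real sample data.
    docs = [
        "\U000FEB85 abc \U000FEB85",
        "\U00023D13 \U00010330",
        "\U000FEB85 \U00023D13",
    ]
    for row in report(count_supplementary(docs)):
        print(row)
```

With a real corpus the same caveat applies: the counts for rare characters
stay noisy until the sample is large.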

Mark

— Il meglio è l’inimico del bene —

On Mon, Jun 14, 2010 at 09:10, John H. Jenkins <jenkins@apple.com> wrote:

> Some characters in the SIP are more common in Chinese written in the HK SAR
> than any character in Extension A, either because they are Hong Kong
> toponyms (or the like), or are Cantonese-specific. (My own analysis of text
> on the Chinese Wikipediæ is that the most common are U+23D13, U+282E2,
> U+28B4E, and U+2A568, which occur seven times each.)
>
> I imagine that the best data would come from Google.
>
> And there are some Web sites out there in Deseret and Shavian, as well.
> (If nothing else, both Deseret and Shavian versions of xkcd are available.
> I'm not aware of any Linear B translations.)
>
> On 2010/6/14, at 8:48 AM, Frédéric Grosshans wrote:
>
> > Is there any data on the most commonly used characters which are not in
> > the BMP?
> >
> > I have the impression that SMP characters are mainly used by scholars
> > (historic scripts and math symbols). However, I have no idea whether the
> > SIP characters are mainly historical, or if they include not-so-rare
> > characters needed for names and/or Chinese dialects.
> >
> > Frédéric Grosshans


This archive was generated by hypermail 2.1.5 : Mon Jun 14 2010 - 20:18:18 CDT