Re: CJK Ideograph Fragments

From: Christoph Burgmer (cburgmer@ira.uka.de)
Date: Sun May 09 2010 - 07:02:18 CDT


    On Sunday, 9 May 2010, Mark Davis ☕ wrote:
    > FYI, I have a table of radicals at
    > https://spreadsheets.google.com/pub?key=0AqRLrRqNEKv-dHlVMzY0RFZ3MTFLZ0RldS
    > 1RNXN4Z3c&hl=en&output=html, mapping them to Unified ideographs. Not yet
    > complete (the X values are tentative, and I don't know if there are values
    > for the ones marked "#VALUE! ").

    Currently the link gives no response, and a few minutes ago it said "Thank
    you for helping Google uncover a bug".

    > I had also tried taking a look at the data at
    > http://cvs.m17n.org/viewcvs/chise/ids/?sortdir=down&pathrev=kawabata#dirlis
    > t(IDS-*), which Richard and John said was the best publicly available IDS
    > data (although it has a GPL licence, which prevents many people from using
    > it). While clearly a lot of work went into it, it is very flawed.
    >
    > - There are over 400 ill-formed IDS sequences.
    > - There are 666 (coincidence?) characters that map to themselves (where
    > you'd only expect that of "base" radicals).
    > - About 5K characters are missing data.
    > - There appears to be free variation between using CJK radicals and
    > using the corresponding Unified CJK characters.
    > - It uses many NCR components with cryptic IDs, instead of radicals or
    > Unified CJK.
    > - A cursory look shows a significant proportion of clear mistakes in the
    > data (characters stacked vertically in the wrong order, for example).
    > - Many characters cannot be recursively decomposed down to radicals.
    >

    Thanks for sharing these statistics. CHISE seems to offer the broadest range
    of decomposition data available, and I haven't seen an assessment of its
    quality before.
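
    For anyone who hasn't worked with this data: an IDS is a prefix expression
    in which every Ideographic Description Character (U+2FF0..U+2FFB) must be
    followed by exactly two operands, except U+2FF2 and U+2FF3, which take
    three. The "ill-formed" sequences above are those that break this rule. A
    minimal well-formedness check might look roughly like the following Python
    sketch (the function name and the arity sets are illustrative, not taken
    from CHISE or cjklib):

    BINARY_IDC = {'\u2ff0', '\u2ff1'} | {chr(c) for c in range(0x2ff4, 0x2ffc)}
    TERNARY_IDC = {'\u2ff2', '\u2ff3'}

    def is_well_formed_ids(ids):
        """Return True if `ids` is exactly one complete IDS expression."""
        def consume(pos):
            # Consume one sub-expression starting at `pos` and return the
            # index just past it, or None if operands are missing.
            if pos >= len(ids):
                return None
            char = ids[pos]
            arity = 3 if char in TERNARY_IDC else 2 if char in BINARY_IDC else 0
            pos += 1
            for _ in range(arity):
                pos = consume(pos)
                if pos is None:
                    return None
            return pos

        return consume(0) == len(ids)

    print(is_well_formed_ids('\u2ff0\u5973\u5b50'))  # ⿰女子 (好) -> True
    print(is_well_formed_ids('\u2ff0\u5973'))        # missing operand -> False

    Checking arity this way also rejects trailing characters after a complete
    expression, which a simple count of operators would miss.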

    The reason Uriah is asking about the encoding process, I believe, is that we
    are building a database of both IDS and stroke counts (more precisely, the
    order of abstract strokes). This database is distributed under the LGPL
    (hence also usable in commercial projects) and is used in the cjklib project
    (http://cjklib.org). The data is currently hosted under svn on Google Code,
    but we are in the process of setting up a MediaWiki instance so that
    everybody can view and correct it (http://characterdb.cjklib.org/). If this
    undertaking could one day lead to the encoding of further components, that
    would be the icing on the cake. Anyhow, if this project generates any
    interest in the Unicode world, I'll be happy for any input.
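
    To illustrate what "recursively decomposed down to radicals" means in
    practice (Mark's last point above), here is a rough Python sketch of the
    kind of lookup such a database should support. The table entries, the set
    of base components and the function name are purely illustrative; they are
    not the actual cjklib or CharacterDB schema:

    # character -> IDS, with the Ideographic Description Characters acting as
    # prefix operators; a real table would hold tens of thousands of entries.
    DECOMPOSITIONS = {
        '\u597d': '\u2ff0\u5973\u5b50',   # 好 = ⿰女子
        '\u5b57': '\u2ff1\u5b80\u5b50',   # 字 = ⿱宀子
    }
    BASE_COMPONENTS = {'\u5973', '\u5b50', '\u5b80'}   # 女, 子, 宀
    IDC = {chr(c) for c in range(0x2ff0, 0x2ffc)}

    def decompose(char, _seen=None):
        # Return the flat list of base components for `char`, or None if the
        # decomposition bottoms out in a character that is neither a base
        # component nor present in the table (the failure case counted above).
        seen = _seen or set()
        if char in BASE_COMPONENTS:
            return [char]
        if char in seen or char not in DECOMPOSITIONS:
            return None
        seen.add(char)
        parts = []
        for component in DECOMPOSITIONS[char]:
            if component in IDC:
                continue              # layout operators carry no component
            sub = decompose(component, seen)
            if sub is None:
                return None
            parts.extend(sub)
        return parts

    print(decompose('\u597d'))        # ['女', '子']

    The cycle check matters because entries that map a character to itself
    (like the 666 mentioned above) would otherwise recurse forever.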

    -Christoph


