Re: CJK Ideograph Fragments

From: Christoph Burgmer (cburgmer@ira.uka.de)
Date: Sun May 09 2010 - 07:02:18 CDT


    On Sunday, 9 May 2010, Mark Davis ☕ wrote:
    > FYI, I have a table of radicals at
    > https://spreadsheets.google.com/pub?key=0AqRLrRqNEKv-dHlVMzY0RFZ3MTFLZ0RldS
    > 1RNXN4Z3c&hl=en&output=html, mapping them to Unified ideographs. Not yet
    > complete (the X values are tentative, and I don't know if there are values
    > for the ones marked "#VALUE! ").

    Currently the link gives no response, and a few minutes ago it said "Thank
    you for helping Google uncover a bug".

    > I had also tried taking a look at the data at
    > http://cvs.m17n.org/viewcvs/chise/ids/?sortdir=down&pathrev=kawabata#dirlis
    > t(IDS-*), which Richard and John said was the best publicly available IDS
    > data (although it has a GPL licence, which prevents many people from using
    > it). While clearly a lot of work went into it, it is very flawed.
    >
    > - There are over 400 ill-formed IDS sequences.
    > - There are 666 (coincidence?) characters that map to themselves (where
    > you'd only expect that of "base" radicals).
    > - About 5K characters are missing data.
    > - There appears to be free variation between using CJK radicals and
    > using the corresponding Unified CJK characters.
    > - It uses many NCR components with cryptic IDs, instead of radicals or
    > Unified CJK.
    > - A cursory look shows a significant proportion of clear mistakes in the
    > data (characters stacked vertically in the wrong order, for example).
    > - Many characters cannot be recursively decomposed down to radicals.
    >

    Thanks for sharing these statistics. CHISE seems to offer the broadest range
    of decomposition data available, and I haven't seen an assessment of its
    quality before.
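
    For anyone who hasn't worked with this data: an IDS is a prefix expression
    in which every Ideographic Description Character (U+2FF0..U+2FFB) must be
    followed by exactly two operands, except U+2FF2 and U+2FF3, which take
    three. The "ill-formed" sequences above are those that break this rule. A
    minimal well-formedness check might look roughly like the following Python
    sketch (the function name and the arity sets are illustrative, not taken
    from CHISE or cjklib):

    BINARY_IDC = {'\u2ff0', '\u2ff1'} | {chr(c) for c in range(0x2ff4, 0x2ffc)}
    TERNARY_IDC = {'\u2ff2', '\u2ff3'}

    def is_well_formed_ids(ids):
        """Return True if `ids` is exactly one complete IDS expression."""
        def consume(pos):
            # Consume one sub-expression starting at `pos` and return the
            # index just past it, or None if operands are missing.
            if pos >= len(ids):
                return None
            char = ids[pos]
            arity = 3 if char in TERNARY_IDC else 2 if char in BINARY_IDC else 0
            pos += 1
            for _ in range(arity):
                pos = consume(pos)
                if pos is None:
                    return None
            return pos

        return consume(0) == len(ids)

    print(is_well_formed_ids('\u2ff0\u5973\u5b50'))  # ⿰女子 (好) -> True
    print(is_well_formed_ids('\u2ff0\u5973'))        # missing operand -> False

    Checking arity this way also rejects trailing characters after a complete
    expression, which a simple count of operators would miss.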

    The reason Uriah is asking about the encoding process, I believe, is that we
    are building a database of both IDS and stroke counts (more precisely, the
    order of abstract strokes). This database is distributed under the LGPL
    (hence also usable in commercial projects) and is used in the cjklib project
    (http://cjklib.org). The data is currently hosted under svn on Google Code,
    but we are in the process of setting up a MediaWiki instance so that
    everybody can view and correct it (http://characterdb.cjklib.org/). If this
    undertaking could one day lead to the encoding of further components, that
    would be the icing on the cake. Anyhow, if this project generates any
    interest in the Unicode world, I'll be happy for any input.
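
    To illustrate what "recursively decomposed down to radicals" means in
    practice (Mark's last point above), here is a rough Python sketch of the
    kind of lookup such a database should support. The table entries, the set
    of base components and the function name are purely illustrative; they are
    not the actual cjklib or CharacterDB schema:

    # character -> IDS, with the Ideographic Description Characters acting as
    # prefix operators; a real table would hold tens of thousands of entries.
    DECOMPOSITIONS = {
        '\u597d': '\u2ff0\u5973\u5b50',   # 好 = ⿰女子
        '\u5b57': '\u2ff1\u5b80\u5b50',   # 字 = ⿱宀子
    }
    BASE_COMPONENTS = {'\u5973', '\u5b50', '\u5b80'}   # 女, 子, 宀
    IDC = {chr(c) for c in range(0x2ff0, 0x2ffc)}

    def decompose(char, _seen=None):
        # Return the flat list of base components for `char`, or None if the
        # decomposition bottoms out in a character that is neither a base
        # component nor present in the table (the failure case counted above).
        seen = _seen or set()
        if char in BASE_COMPONENTS:
            return [char]
        if char in seen or char not in DECOMPOSITIONS:
            return None
        seen.add(char)
        parts = []
        for component in DECOMPOSITIONS[char]:
            if component in IDC:
                continue              # layout operators carry no component
            sub = decompose(component, seen)
            if sub is None:
                return None
            parts.extend(sub)
        return parts

    print(decompose('\u597d'))        # ['女', '子']

    The cycle check matters because entries that map a character to itself
    (like the 666 mentioned above) would otherwise recurse forever.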

    -Christoph


