Re: Are Unihan variant relations expected to be symmetrical?

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Sat Sep 04 2010 - 08:58:04 CDT

    Hi,
    I browsed through the definitions alphabetically and found a few more
    definitions which look odd:
    齐 <U+9F50> 6576656E2C20756E69666F726D2C206F6620657175616C206C656E677468
    I guessed this was the hex representation of the definition in ASCII, and
    converted it back into:
    "even, uniform, of equal length"
    蘒 <U+8612> 265:143
    萜 <U+841C> C5H8
    怸 <U+6038> cns 2-2A40 is different
    翖 <U+7FD6> ksc extension 3108
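
    For anyone who wants to repeat the check on the hex-looking definition
    above, here is a minimal Python sketch. It assumes the value really is
    hex-encoded ASCII; the function names are my own and not part of any
    Unihan tooling.

```python
import re

# Heuristic: only hex digits, an even number of them, and reasonably long.
HEX_ONLY = re.compile(r"[0-9A-Fa-f]+")


def looks_hex_encoded(value: str) -> bool:
    """Return True if a definition looks like a hex dump rather than text."""
    return (
        len(value) >= 8
        and len(value) % 2 == 0
        and HEX_ONLY.fullmatch(value) is not None
    )


def decode_hex_definition(value: str) -> str:
    """Decode a definition that was stored as hex-encoded ASCII."""
    return bytes.fromhex(value).decode("ascii")


hex_definition = "6576656E2C20756E69666F726D2C206F6620657175616C206C656E677468"
if looks_hex_encoded(hex_definition):
    decoded = decode_hex_definition(hex_definition)
    print(decoded)  # even, uniform, of equal length
```

    The heuristic deliberately ignores short values such as "C5H8", which
    contains a non-hex letter anyway.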

    And the following have typos:
    <U+3EC5> (si mplified form of 璯 <U+74AF>) jade decorated cap, used in
    person's name
    <U+34B7> (simple form of U+8208 興 <U+8208>) to prosper, to begin, to
    increase; to rise; to raise, flourishing
    <U+3B4F> (simplfied form of 椲 <U+6932>) a kind of wood ( used as a kind
    of material to make basin and bowl, etc.) (same as 楎 <U+694E>) a peg for
    handing things on, a clothes-horse
    <U+396A> (simplied form of 慺 <U+617A>) diligent; industrious; sedulous,
    to encourage; to make efforts
    <U+39DB> (simplied form of 掔 <U+6394>) thick; firm; substantial, to drag
    along; to pull, to lead
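
    Misspellings like these can be searched for mechanically. The following
    is only a sketch of one possible approach, not how I actually found them:
    it compares whatever precedes "form of" inside a parenthesis with the word
    "simplified" and reports near misses. The 0.8 threshold, the function name
    and the "control" entry are assumptions made for illustration; the other
    definitions are shortened from the entries above.

```python
import re
from difflib import SequenceMatcher

EXPECTED = "simplified"
# Capture the label in "(<label> form of ...".
FORM_OF = re.compile(r"\(([^()]*?) form of")


def suspicious_form_labels(definitions, threshold=0.8):
    """Return (code point, label) pairs whose label is close to, but not
    exactly, the expected word."""
    hits = []
    for code_point, text in definitions.items():
        for label in FORM_OF.findall(text):
            if label == EXPECTED:
                continue
            ratio = SequenceMatcher(None, label, EXPECTED).ratio()
            if ratio >= threshold:
                hits.append((code_point, label))
    return hits


sample = {
    "U+3EC5": "(si mplified form of 璯) jade decorated cap",
    "U+3B4F": "(simplfied form of 椲) a kind of wood",
    "U+396A": "(simplied form of 慺) diligent; industrious; sedulous",
    # Made-up control entry with the correct spelling.
    "control": "(simplified form of X) placeholder definition",
}

for code_point, label in suspicious_form_labels(sample):
    print(code_point, repr(label))
```

    A lower threshold would also catch "simple form of", at the cost of more
    false positives.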

    Also, the definitions of KangXi radicals aren't consistent: some of them
    mention the radical number while others don't. This information of course
    exists elsewhere, though.

    I've now added access to dictionary indices, though I don't have the
    dictionaries themselves, so I don't know if the following is meaningful.
    I found that all characters in the range U+F900..U+FA2D (compatibility
    variants) are listed with a kMorohashi index of 00000; I didn't see any
    explanation for that. Some compatibility variants are similarly listed for
    kKangXi with a page index of 0.
    U+25531 is listed with kCheungBauerIndex of 137.09, while all other
    kCheungBauerIndex values indicate pages 338..475.
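
    Both index checks are simple range tests. A rough Python sketch follows,
    assuming kCheungBauerIndex values have the form "page.position" and that
    an all-zero kMorohashi value means "no real index". The function names and
    the entries marked as made up are placeholders, not Unihan data.

```python
def cheung_bauer_page(value: str) -> int:
    """Extract the page number from a 'page.position' index value."""
    return int(value.split(".")[0])


def out_of_range_pages(index_values, first_page=338, last_page=475):
    """Return entries whose page falls outside the expected page range."""
    return {
        code_point: value
        for code_point, value in index_values.items()
        if not first_page <= cheung_bauer_page(value) <= last_page
    }


def all_zero_indices(index_values):
    """Return code points whose index consists only of zeros."""
    return [
        code_point
        for code_point, value in index_values.items()
        if value and value.strip("0") == ""
    ]


cheung_bauer = {
    "U+25531": "137.09",    # value reported above
    "made-up-1": "338.01",  # placeholder inside the expected range
    "made-up-2": "475.10",  # placeholder inside the expected range
}
morohashi = {
    "U+F900": "00000",      # as reported for U+F900..U+FA2D
    "made-up-3": "01234",   # placeholder non-zero index
}

print(out_of_range_pages(cheung_bauer))
print(all_zero_indices(morohashi))
```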

    I hope this information is helpful.
    Regards,
    Uriah Eisenstein

    On Sat, Aug 21, 2010 at 12:31 PM, Uriah Eisenstein <
    uriaheisenstein@gmail.com> wrote:

    > This is getting fun :) I've found some duplicates now in some of the
    > reading fields (I haven't processed them all yet). kJapaneseKun for 橫
    > (U+6A6B) just has all of its readings twice. kCantonese has several
    > duplications too, though I can't tell whether these should have been
    > different entries and are identical due to typos, or are just redundant.
    > The results file is attached.
    > Also, should kHangul and kKorean be related? There is a rather different
    > number of entries for these fields.
    > Uriah
    >
    > P.S. Please inform me of course if there's anywhere else I should send this
    > info.
    >
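
    The duplicate check on reading fields amounts to counting the
    space-separated readings in each value. A minimal Python sketch, assuming
    that readings are separated by whitespace; the function name and the
    placeholder readings are made up and are not the real values for any
    character.

```python
from collections import Counter


def duplicated_readings(value: str):
    """Return the readings that occur more than once in a field value."""
    counts = Counter(value.split())
    return sorted(reading for reading, n in counts.items() if n > 1)


# Placeholder readings, not actual Unihan data.
print(duplicated_readings("AAA BBB AAA BBB"))
print(duplicated_readings("AAA BBB"))
```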
    >
    > On Fri, Aug 20, 2010 at 9:19 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >
    >> I'm fleshing them out. Even when they are (technically) correct, the
    >> current definitions don't really help anybody know what they're supposed to
    >> mean. Most are either units of measure or chemical elements; if I didn't
    >> happen to recognize some of the latter, I'd be totally at sea myself.
    >>
    >> On Aug 20, 2010, at 12:15 PM, Uriah Eisenstein wrote:
    >>
    >> Interesting indeed. I did suspect that "km" might stand for "kilometre",
    >> but most others look to me like gibberish, and anyway if they are
    >> abbreviations they could be ambiguous. A full definition would probably be
    >> more useful, if one could be found.
    >>
    >> On Fri, Aug 20, 2010 at 8:04 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>
    >>> Thanks. The interesting thing is that most of these are correct, just
    >>> unobvious abbreviations. We'll see how many manage to slip in.
    >>>
    >>> On Aug 20, 2010, at 11:50 AM, Uriah Eisenstein wrote:
    >>>
    >>> Hi again,
    >>> Now I have a skeletal Java GUI application for Unihan SQL access, so I
    >>> can copy-paste the results along with the Chinese characters (doesn't seem
    >>> possible with my Windows console...). Attached is the set of characters with
    >>> kDefinition fields of 1 or 2 letters, they don't seem to make much sense.
    >>> I've also checked the 3-letter definitions but these all seem valid.
    >>> I hope this will be useful, especially as I understand that Unicode 6.0
    >>> is still in the making so maybe a few fixes could be "slipped in".
    >>> Regards,
    >>> Uriah
    >>>
    >>> 2010/8/17 Uriah Eisenstein <uriaheisenstein@gmail.com>
    >>>
    >>>> Great :) I'm attaching then the results file which made me raise the
    >>>> original question. I generated it with a Python script, actually, using the
    >>>> 3rd-party cjklib.
    >>>> Each line indicates one asymmetric relation: the character with a
    >>>> variant, the variant type (Z for Z-variant or M for semantic variant), and
    >>>> the variant which does not refer back to the original character.
    >>>> The script also checked for asymmetric Simplified/Traditional pairs, but
    >>>> didn't find any :)
    >>>> HTH,
    >>>> Uriah
    >>>>
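
    The original script relied on cjklib, but the same check can be sketched
    against the raw data without it. The Python below is only an
    approximation under stated assumptions: input lines are tab-separated as
    "U+XXXX<TAB>field<TAB>value", a value may hold several space-separated
    targets, and any suffix after a target code point is ignored. The function
    names are mine, and the U+8AAA/U+8AAC pair is included purely as an
    assumed symmetric control.

```python
import re

CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")
TYPE_CODES = {"kZVariant": "Z", "kSemanticVariant": "M"}


def parse_variants(lines):
    """Collect (source, field, target) triples from variant data lines."""
    relations = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, field, value = line.split("\t", 2)
        if field not in TYPE_CODES:
            continue
        for token in value.split():
            match = CODE_POINT.match(token)
            if match:
                relations.add((source, field, "U+" + match.group(1)))
    return relations


def asymmetric(relations):
    """Return (character, type code, variant) where the variant does not
    refer back to the character through the same field."""
    return sorted(
        (source, TYPE_CODES[field], target)
        for source, field, target in relations
        if (target, field, source) not in relations
    )


sample_lines = [
    "# comment line",
    "U+4E80\tkZVariant\tU+9F9C",
    "U+758D\tkSemanticVariant\tU+86CB",
    "U+8AAA\tkZVariant\tU+8AAC",  # assumed symmetric control pair
    "U+8AAC\tkZVariant\tU+8AAA",
]

for row in asymmetric(parse_variants(sample_lines)):
    print(*row)
```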
    >>>>
    >>>> On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>>>
    >>>>> Any help we can get in cleaning up the Unihan data is greatly
    >>>>> appreciated. It would be very, very useful.
    >>>>>
    >>>>> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
    >>>>>
    >>>>> Hi,
    >>>>> Continuing this issue - I've played a bit with SQL access to Unihan
    >>>>> data, and found also a few kDefinition fields which are only one or two
    >>>>> characters long, e.g. "c" or "lr". I suppose other seemingly erroneous
    >>>>> entries could be found.
    >>>>> My question is, would it be useful if I gather and send such data
    >>>>> (which I'd happily do), or do the Unihan maintainers have enough tools to
    >>>>> find it and just need the time and resources to act on it?
    >>>>>
    >>>>> Regards,
    >>>>> Uriah Eisenstein
    >>>>>
    >>>>> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
    >>>>> uriaheisenstein@gmail.com> wrote:
    >>>>>
    >>>>>> I see... Thanks for your answer. I suppose it should be easy enough to
    >>>>>> find some of the inconsistencies, such as asymmetrical variant relations;
    >>>>>> the real issue would be resolving them case-by-case.
    >>>>>> A specific case where resolution, too, seems as though it should be
    >>>>>> easy is when supposed Z-variants have quite a different total stroke count.
    >>>>>> This can be checked with just the Unihan data; I could do that myself (after
    >>>>>> overcoming the usual issues programming languages have with characters
    >>>>>> outside the BMP).
    >>>>>>
    >>>>>> Uriah
    >>>>>>
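
    In Python 3 the non-BMP issue largely disappears, since strings are
    sequences of code points rather than UTF-16 units. A minimal sketch of the
    stroke-count comparison follows, assuming stroke counts have already been
    read into a dictionary keyed by "U+XXXX" strings. The function names, the
    max_gap threshold and the stroke counts are illustrative stand-ins and
    were not read from Unihan.

```python
def to_char(code_point: str) -> str:
    """Convert a 'U+XXXX' string into a one-character Python string."""
    return chr(int(code_point[2:], 16))


def stroke_gaps(pairs, strokes, max_gap=2):
    """Return (a, b, gap) for pairs whose stroke counts differ by more
    than max_gap; pairs with a missing stroke count are skipped."""
    flagged = []
    for a, b in pairs:
        if a in strokes and b in strokes:
            gap = abs(strokes[a] - strokes[b])
            if gap > max_gap:
                flagged.append((a, b, gap))
    return flagged


# A character outside the BMP is still a single code point here,
# although it takes two 16-bit units (four bytes) in UTF-16.
ext_b = to_char("U+25531")
print(len(ext_b), len(ext_b.encode("utf-16-le")))

z_variant_pairs = [("U+4E80", "U+9F9C")]
illustrative_strokes = {"U+4E80": 11, "U+9F9C": 16}  # stand-in values
print(stroke_gaps(z_variant_pairs, illustrative_strokes))
```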
    >>>>>>
    >>>>>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>>>>>
    >>>>>>> The kZVariant field has bad data in it that we haven't had time to
    >>>>>>> clean up. It should, in theory, be symmetrical, and it should, in theory,
    >>>>>>> contain only unifiable forms, but as you note, it doesn't. In addition to
    >>>>>>> the use of the source separation rule, it should also cover characters which
    >>>>>>> were added to the standard in error.
    >>>>>>>
    >>>>>>> In any event, I'm afraid that right now it's probably best not to
    >>>>>>> rely on it for anything.
    >>>>>>>
    >>>>>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
    >>>>>>>
    >>>>>>> Hi,
    >>>>>>> To clarify my question with an example :) The character 亀 (U+4E80) is
    >>>>>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
    >>>>>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
    >>>>>>> but not vice versa. From the definitions of these variant types in UAX#38,
    >>>>>>> one would naturally expect them to be symmetrical, and both characters to
    >>>>>>> show each other as variants. There are quite a few other such cases,
    >>>>>>> although it does appear that in most cases the relation is symmetrical.
    >>>>>>> My reason for asking, BTW, is that I'm thinking of grouping
    >>>>>>> characters which are Z-variants of each other in some application, so I need
    >>>>>>> to understand whether Z-variants are expected to have clear "cliques" in
    >>>>>>> which each character is a Z-variant of all others.
    >>>>>>> I realize that the semantic variant relation, at least, is based on
    >>>>>>> external sources and not determined by Unicode; regarding Z-variants I'm not
    >>>>>>> clear. I'd like to know though whether the relation is expected to be
    >>>>>>> symmetrical, and the above cases are to be considered errors; or there is
    >>>>>>> some meaning to a one-directional relation; or something else.
    >>>>>>> On a side note, some Z-variants I've looked at seem to have very
    >>>>>>> different abstract shapes, in some cases looking more like
    >>>>>>> simplified/traditional pairs. As I said I don't know clearly how they are
    >>>>>>> determined. Are they supposed to be exactly those pairs which would be
    >>>>>>> unified if it were not for the Source Separation Rule?
    >>>>>>>
    >>>>>>> TIA,
    >>>>>>> Uriah
    >>>>>>>
    >>>>>>>
    >>>>>>> =====
    >>>>>>> John H. Jenkins
    >>>>>>> jenkins@apple.com
    >>> <short_definitions.txt>



    This archive was generated by hypermail 2.1.5 : Sat Sep 04 2010 - 09:04:00 CDT