From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Tue Aug 17 2010 - 13:03:07 CDT
Great :) In that case I'm attaching the results file that prompted my
original question. I generated it with a Python script, using the
third-party cjklib.
Each line indicates one asymmetric relation: the character with a variant,
the variant type (Z for Z-variant or M for semantic variant), and the
variant which does not refer back to the original character.
The script also checked for asymmetric Simplified/Traditional pairs, but
didn't find any :)
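
A minimal sketch of such a symmetry check follows. It is not the cjklib-based script itself: it assumes the variant data is available as tab-separated lines in the layout of Unihan_Variants.txt (code point, field name, value), and the names SAMPLE, parse_variants and asymmetric are hypothetical. The sample lines are illustrative only and are not copied from the attached results.

```python
import re
from collections import defaultdict

# Illustrative sample only (assumed layout: code point <TAB> field <TAB> value).
SAMPLE = (
    "U+4E80\tkZVariant\tU+9F9C\n"
    "U+758D\tkSemanticVariant\tU+86CB\n"
    "U+8AAC\tkZVariant\tU+8AAA\n"
    "U+8AAA\tkZVariant\tU+8AAC\n"
)

# Output codes as described above: Z for Z-variant, M for semantic variant.
FIELD_CODES = {"kZVariant": "Z", "kSemanticVariant": "M"}
CODEPOINT = re.compile(r"U\+([0-9A-F]{4,6})")


def parse_variants(text):
    """Map (field, character) to the set of variants listed for it."""
    relations = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in FIELD_CODES:
            continue
        source, field, value = parts
        char = chr(int(source[2:], 16))  # also works outside the BMP in Python 3
        for entry in value.split():
            # A value may carry a source suffix such as "<kMatthews"; keep
            # only the leading code point.
            match = CODEPOINT.match(entry)
            if match:
                relations[(field, char)].add(chr(int(match.group(1), 16)))
    return relations


def asymmetric(relations):
    """Return (character, type code, variant) where the variant does not refer back."""
    found = []
    for (field, char), variants in relations.items():
        for variant in variants:
            if char not in relations.get((field, variant), set()):
                found.append((char, FIELD_CODES[field], variant))
    return sorted(found)


if __name__ == "__main__":
    for char, code, variant in asymmetric(parse_variants(SAMPLE)):
        print(char, code, variant)
```

Each reported triple follows the same order as the results file: the character, the variant type, and the variant that lacks the back-reference. Simplified/Traditional pairs would need a different check, since those two fields point at each other rather than at themselves.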
HTH,
Uriah
On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <jenkins@apple.com> wrote:
> Any help we can get in cleaning up the Unihan data is greatly appreciated.
> It would be very, very useful.
>
> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
>
> Hi,
> Continuing this issue - I've played a bit with SQL access to the Unihan data,
> and also found a few kDefinition fields which are only one or two characters
> long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be
> found.
> My question is, would it be useful if I gather and send such data (which
> I'd happily do), or do the Unihan maintainers have enough tools to find it
> and just need the time and resources to act on it?
>
> Regards,
> Uriah Eisenstein
>
> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
> uriaheisenstein@gmail.com> wrote:
>
>> I see... Thanks for your answer. I suppose it should be easy enough to
>> find some of the inconsistencies, such as asymmetrical variant relations;
>> the real issue would be resolving them case-by-case.
>> A specific case where resolution, too, seems as though it should be easy
>> is when supposed Z-variants have quite a different total stroke count. This
>> can be checked with just the Unihan data; I could do that myself (after
>> overcoming the usual issues programming languages have with characters
>> outside the BMP).
>>
>> Uriah
>>
>>
>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <jenkins@apple.com> wrote:
>>
>>> The kZVariant field has bad data in it that we haven't had time to clean
>>> up. It should, in theory, be symmetrical, and it should, in theory, contain
>>> only unifiable forms, but as you note, it doesn't. In addition to the use
>>> of the source separation rule, it should also cover characters which were
>>> added to the standard in error.
>>>
>>> In any event, I'm afraid that right now it's probably best not to rely on
>>> it for anything.
>>>
>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
>>>
>>> Hi,
>>> To clarify my question with an example :) The character 亀 (U+4E80) is
>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
>>> but not vice versa. From the definitions of these variant types in UAX#38,
>>> one would naturally expect them to be symmetrical, and both characters to
>>> show each other as variants. There are quite a few other such cases,
>>> although it does appear that in most cases the relation is symmetrical.
>>> My reason for asking, BTW, is that I'm thinking of grouping characters
>>> which are Z-variants of each other in some application, so I need to
>>> understand whether Z-variants are expected to have clear "cliques" in which
>>> each character is a Z-variant of all others.
>>> I realize that the semantic variant relation, at least, is based on
>>> external sources and not determined by Unicode; about Z-variants I'm less
>>> sure. I'd like to know though whether the relation is expected to be
>>> symmetrical, and the above cases are to be considered errors; or there is
>>> some meaning to a one-directional relation; or something else.
>>> On a side note, some Z-variants I've looked at seem to have very
>>> different abstract shapes, in some cases looking more like
>>> simplified/traditional pairs. As I said I don't know clearly how they are
>>> determined. Are they supposed to be exactly those pairs which would be
>>> unified if it were not for the Source Separation Rule?
>>>
>>> TIA,
>>> Uriah
>>>
>>>
>>> =====
>>> John H. Jenkins
>>> jenkins@apple.com
>>>
>>>
>>>
>>
>
> =====
> Hoani H. Tinikini
>
> John H. Jenkins
> jenkins@apple.com
>
>
>