From: Mark Davis ☕ (mark@macchiato.com)
Date: Tue Aug 17 2010 - 13:44:32 CDT
> One would be to use kIICore, since that theoretically flags the most
> important characters.
I would not recommend using kIICore as a measure of importance. I recently
tried comparing those characters with the highest-frequency Han characters
on the web; the two sets do not match well at all.
Mark
*— Il meglio è l’inimico del bene —*
On Tue, Aug 17, 2010 at 10:26, John H. Jenkins <jenkins@apple.com> wrote:
>
> On Aug 17, 2010, at 7:58 AM, Wolfgang Schmidle wrote:
>
> > Am 29.06.10 21:36, schrieb John H. Jenkins:
> >
> >> The kZVariant field has bad data in it that we haven't had time to clean
> up. It should, in theory, be symmetrical, and it should, in theory, contain
> only unifiable forms, but as you note, it doesn't. In addition to the use
> of the source separation rule, it should also cover characters which were
> added to the standard in error.
> >>
> >> In any event, I'm afraid that right now it's probably best not to rely
> on it for anything.
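
The symmetry expectation described above can be checked mechanically. The following is a minimal sketch, assuming the variant field has already been loaded into a dict mapping each code point to the set of code points it lists; the function name and the toy data are hypothetical and are not taken from the Unihan database.

```python
def asymmetric_pairs(variants):
    """Return (a, b) pairs where a lists b as a variant but b does not list a."""
    missing = []
    for a, targets in variants.items():
        for b in targets:
            if a not in variants.get(b, set()):
                missing.append((a, b))
    return sorted(missing)


# Illustrative toy data only (an assumption, not real kZVariant content).
z_variants = {
    0x5415: {0x5442},
    0x5442: {0x5415},
    0x6B74: {0x6B77},   # one-way arrow towards the "standard" character
    0x6B77: set(),
}

for a, b in asymmetric_pairs(z_variants):
    print(f"U+{a:04X} -> U+{b:04X} has no reverse entry")
```

Any pair reported this way is a candidate for the kind of cleanup mentioned above; the check says nothing about whether the forms are actually unifiable.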
> >
> >
> > In the examples I have looked at, the Z-variants are many-to-one
> relations, with all arrows pointing towards the standard character in the
> respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say that
> Z-variants are supposed to be symmetrical, and everything else is bad data.
> How, then, does one find the standard character? Do the "kIICore" characters
> play a special role here?
> >
>
> Assuming the z-variant data were sufficiently reliable to be useful, then
> there are a couple of approaches you could use. One would be to use
> kIICore, since that theoretically flags the most important characters.
> Otherwise, if you have some z-variants and one is in the Big Five and the
> others aren't, then the one in the Big Five could be taken as standard for
> traditional Chinese. You could also use GB0 as the standard for simplified
> Chinese, or look at the z-variant on the lowest plane in CNS 11643, or
> something like that.
>
> In the end, however, which one is standard may end up being purely
> arbitrary.
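
The fallback heuristics listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `pick_standard`, the property keys (`iicore`, `big5`, `gb0`, `cns_plane`) and the sample property values are all hypothetical placeholders, not real Unihan fields or data. The `use_iicore` switch is there because the reply at the top of this message advises against relying on kIICore.

```python
def pick_standard(candidates, props, locale="zh-Hant", use_iicore=True):
    """Pick one code point from a variant class using fallback heuristics.

    Order of preference: kIICore flag (optional), membership in the legacy
    set for the locale (Big Five or GB0), lowest CNS 11643 plane, and
    finally lowest code point as an arbitrary tie-breaker.
    """
    legacy_key = "big5" if locale == "zh-Hant" else "gb0"

    def rank(cp):
        p = props.get(cp, {})
        return (
            0 if (use_iicore and p.get("iicore")) else 1,
            0 if p.get(legacy_key) else 1,
            p.get("cns_plane", 99),   # 99 = unknown plane, sorts last
            cp,
        )

    return min(candidates, key=rank)


# Placeholder properties for illustration only (assumed, not verified).
props = {
    0x6B77: {"big5": True, "cns_plane": 1},
    0x6B74: {"cns_plane": 3},
    0x5386: {"gb0": True},
}
cls = [0x6B74, 0x6B77, 0x5386]
print(hex(pick_standard(cls, props, "zh-Hant")))
print(hex(pick_standard(cls, props, "zh-Hans")))
```

The final tie-breaker on the code point makes the arbitrary case explicit rather than hiding it.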
>
> > In general, how can searching in Chinese text be formalised? It seems
> that the Chinese characters cannot easily be divided into equivalence
> classes where one character in the class should find any other character in
> this class. If I search for 歴 6B74, I also want to find the semantic variant
> 歷 6B77 (i.e. the standard character) as well as the simplified character 历
> 5386. However, if I search for 历 5386, I may want to find the semantic
> variant 厲 53B2 (which is based on Fenn, but not Lau, Matthews or
> Meyer-Wempe), but definitely not the simplified character 厉 5389. The
> difference is that there are additional Z-variant connections in the first
> case.
> >
> > Does it make sense to create equivalence classes from the Z-variants?
>
> Not with the data as it stands.
>
> > As an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历 5386
> (not counting the compatibility character 歷 F98C), and the 曆 66C6-class
would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In particular, 曆 66C6
> would not find 歷 6B77. However, both characters have the same simplified
> character equivalent. Should these classes be unified for searching? Or
> should it make a difference if I search for a traditional or a simplified
> character, i.e. searching for 历 5386 finds the 曆 66C6-class as well as the 歷
> 6B77-class?
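
The question of whether the two classes should be unified can be made concrete with a small union-find sketch. The links below are taken from the examples in this message (shape variants and simplification links); the function name and the split into two link lists are assumptions made for illustration, and nothing here reflects the actual Unihan data.

```python
def build_classes(pairs):
    """Group code points into equivalence classes from undirected links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in classes.values())


shape_links = [(0x6B74, 0x6B77), (0x66A6, 0x66C6)]
simp_links = [(0x6B77, 0x5386), (0x66C6, 0x5386)]

print(build_classes(shape_links))               # two separate classes
print(build_classes(shape_links + simp_links))  # one merged class
```

Because an equivalence class is transitive by construction, adding the simplification links merges the two traditional classes, so a search for 曆 66C6 would then also find 歷 6B77. Whether that is desirable is exactly the open question; a directed lookup (expand only from the query character) would avoid the merge.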
> >
> > Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic
> variant of 曆 66C6? Is it simply because no dictionary has declared them to
> be equivalent, even though the respective relationships are obviously the
> same?
>
> Yes. One thing that makes this whole process even more complicated than it
> would otherwise be is that different sources make different judgments as to
> when two characters are variants of each other. At the moment, this is
> restricted to data from some of the smaller dictionaries. If and when we
> can get the variant data from one of the larger dictionaries in place (such
> as the Hanyu Da Zidian or the Kangxi), then an implementer can simply say
> that they are normalizing to HYDZD or KX and ignore the remaining variant
> data.
>
> > And how can two characters such as 歴 6B74 and 歷 6B77 be Z-variants if
> they do not have the same number of strokes? All unification rules seem to
> leave the number of strokes unchanged, as far as the component is not on the
> Annex S list of unifiable characters (such as 吕 5415 and 呂 5442).
> >
>
> This is an example of bad data in the kZVariant field.
>
> > According to UAX#38, the "kSemanticVariant" relation means "two
> characters have identical meanings". Thus, technically it should be
> transitive (as opposed to "kSpecializedSemanticVariant"), but for example 厤
> 53A4 (kDefinition "to calculate; the calendar") is connected via 曆 66C6
> ("calendar, era") with 歷 6B77 ("take place, past, history"), but there is no
> direct connection. Why?
> >
>
> Our goal at this point is to define the two semantic variant fields
> strictly in terms of source dictionaries. In this particular case,
> Lau and Mathews define U+53A4 and U+66C6 as equivalent, whereas Meyer-Wempe
> defines U+66C6 and U+6B74 as equivalent. You, the implementer, have the
> option of deciding which authority you want to base your implementation on.
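
Choosing one authority can be sketched as a filter over the source tags attached to each variant. The sketch assumes the `U+XXXX<kSource,kSource` value syntax that UAX #38 describes for the semantic variant fields; the sample value is constructed from the statement above rather than copied from the Unihan database, and the function names are hypothetical.

```python
import re

# One entry: a code point, optionally followed by "<" and comma-separated tags.
ENTRY = re.compile(r"U\+([0-9A-F]{4,6})(?:<(\S+))?")


def parse_variants(value):
    """Map each variant code point in a field value to its set of source tags."""
    result = {}
    for m in ENTRY.finditer(value):
        cp = int(m.group(1), 16)
        sources = set()
        if m.group(2):
            for tag in m.group(2).split(","):
                sources.add(tag.split(":")[0])   # drop any ":X" qualifier
        result[cp] = sources
    return result


def variants_from(value, authority):
    """Keep only the variants attested by the chosen authority."""
    parsed = parse_variants(value)
    return sorted(cp for cp, srcs in parsed.items() if authority in srcs)


# Constructed for illustration from the description above.
sample = "U+53A4<kLau,kMatthews U+6B74<kMeyerWempe"
print([f"U+{cp:04X}" for cp in variants_from(sample, "kLau")])
print([f"U+{cp:04X}" for cp in variants_from(sample, "kMeyerWempe")])
```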
>
> And, unfortunately, the dictionary-makers aren't always going to be careful
> to provide transitivity (or even reflexivity) in their variant data.
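
Missing transitive links of the kind asked about can be listed with a short check. This is a sketch under the assumption that the relation is held as a dict of sets; the data mirrors the 厤 53A4 / 曆 66C6 / 歷 6B77 example from the question and is not an extract of the actual field.

```python
def transitivity_gaps(rel):
    """Return (a, c) pairs where a~b and b~c are listed but a~c is not."""
    gaps = set()
    for a, bs in rel.items():
        for b in bs:
            for c in rel.get(b, ()):
                if c != a and c not in bs:
                    gaps.add((a, c))
    return sorted(gaps)


# Example relation from the question above (illustrative only).
rel = {
    0x53A4: {0x66C6},
    0x66C6: {0x53A4, 0x6B77},
    0x6B77: {0x66C6},
}

for a, c in transitivity_gaps(rel):
    print(f"U+{a:04X} and U+{c:04X} are linked only indirectly")
```

An implementer who wants transitive behaviour would have to add these links deliberately; the source dictionaries do not guarantee them.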
>
> > And why does Apple's character palette regard 历 5386 as related to 厯
> 53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where does
> this knowledge (see e.g.
> http://dict.variants.moe.edu.tw/yitia/fra/fra02074.htm) come from?
>
>
> Information on Apple's source is proprietary. This is true in general of
> actual implementations. Unihan is rather unusual in at least trying to
> state the authority from which the data is derived.
>
> Defining equivalence or normalization for Han is, in general, a very
> difficult task, not only because of competing authorities but also because
> of competing languages; normalizing text for Japanese would result in
> something different from the same text normalized for Chinese. Given the
> huge number of characters involved, the different competing needs and
> competing authorities, there isn't a good general solution in place. The
> goal in Unihan is to provide solid data for implementers to use, but
> unfortunately we're not quite there yet.
>
> =====
> Hoani H. Tinikini
> John H. Jenkins
> jenkins@apple.com
>