From: Wolfgang Schmidle (wschmidle@mpiwg-berlin.mpg.de)
Date: Tue Aug 17 2010 - 08:58:35 CDT
On 29.06.10 21:36, John H. Jenkins wrote:
> The kZVariant field has bad data in it that we haven't had time to
> clean up. It should, in theory, be symmetrical, and it should, in
> theory, contain only unifiable forms, but as you note, it doesn't. In
> addition to the use of the source separation rule, it should also
> cover characters which were added to the standard in error.
>
> In any event, I'm afraid that right now it's probably best not to rely
> on it for anything.
In the examples I have looked at, the Z-variants are many-to-one
relations, with all arrows pointing towards the standard character in
the respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say
that Z-variants are supposed to be symmetrical, and everything else is
bad data. How, then, does one find the standard character? Do the
"kIICore" characters play a special role here?
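To make the symmetry question concrete, here is a minimal sketch in Python of how one could list the arrows that lack a reverse arrow, and how one could symmetrise the relation. The dict `z_variant` and both function names are my own illustrative assumptions. The two arrows are toy data taken from the examples in this message, with the direction assumed to point towards the standard character; they are not read from the Unihan database.

```python
# Toy data (assumption): Z-variant arrows as described in this message,
# keyed by code point. NOT loaded from the Unihan database.
z_variant = {
    "6B74": {"6B77"},  # 歴 -> 歷
    "66A6": {"66C6"},  # 暦 -> 曆
}


def asymmetric_arrows(relation):
    """Return sorted (source, target) pairs whose reverse arrow is missing."""
    missing = []
    for source, targets in relation.items():
        for target in targets:
            if source not in relation.get(target, set()):
                missing.append((source, target))
    return sorted(missing)


def symmetric_closure(relation):
    """Return a copy of the relation with every reverse arrow added."""
    closed = {source: set(targets) for source, targets in relation.items()}
    for source, targets in relation.items():
        for target in targets:
            closed.setdefault(target, set()).add(source)
    return closed


print(asymmetric_arrows(z_variant))
print(asymmetric_arrows(symmetric_closure(z_variant)))
```

Note that the symmetrised relation no longer says which member of a pair is the standard character, which is exactly the information I am asking about.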
In general, how can searching in Chinese text be formalised? It seems
that the Chinese characters cannot easily be divided into equivalence
classes in which a search for one character finds every other character
in the class. If I search for 歴 6B74, I also want to find the semantic
variant 歷 6B77 (i.e. the standard character) as well as the simplified
character 历 5386. However, if I search for 历 5386, I may want to find
the semantic variant 厲 53B2 (which is based on Fenn, but not Lau,
Matthews or Meyer-Wempe), but definitely not the simplified character 厉
5389. The difference is that there are additional Z-variant connections
in the first case.
Does it make sense to create equivalence classes from the Z-variants? As
an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历
5386 (not counting the compatibility character 歷 F98C), and the 曆
66C6-class would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In
particular, 曆 66C6 would not find 歷 6B77. However, both characters
have the same simplified character equivalent. Should these classes be
unified for searching? Or should it make a difference if I search for a
traditional or a simplified character, i.e. searching for 历 5386 finds
the 曆 66C6-class as well as the 歷 6B77-class?
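The two options can be compared with a small union-find sketch in Python. The pair lists and the function name `build_classes` are illustrative assumptions; the pairs only restate the classes named above (plus the shared simplified form 历 5386) and are not taken from the Unihan files.

```python
def build_classes(pairs):
    """Group code points into equivalence classes with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return sorted(sorted(members) for members in classes.values())


# Toy data (assumption): the classes described in this message.
z_pairs = [("6B74", "6B77"), ("5386", "6B77"), ("66A6", "66C6")]
# Assumed extra link: 曆 66C6 also simplifies to 历 5386.
simplified_pairs = [("66C6", "5386")]

print(build_classes(z_pairs))
print(build_classes(z_pairs + simplified_pairs))
```

With the Z-variant pairs alone there are two classes; adding the single simplified link merges them into one, so 曆 66C6 would then find 歷 6B77. Keeping the classes separate and expanding only a simplified query would need a directed lookup rather than equivalence classes.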
Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic
variant of 曆 66C6? Is it simply because no dictionary has declared them
to be equivalent, even though the respective relationships are obviously
the same? And how can two characters such as 歴 6B74 and 歷 6B77 be
Z-variants if they do not have the same number of strokes? All
unification rules seem to leave the number of strokes unchanged, unless
the component is on the Annex S list of unifiable characters
(such as 吕 5415 and 呂 5442).
According to UAX#38, the "kSemanticVariant" relation means "two
characters have identical meanings". Thus, technically it should be
transitive (as opposed to "kSpecializedSemanticVariant"). However, 厤
53A4 (kDefinition "to calculate; the calendar") is connected to 歷 6B77
("take place, past, history") only via 曆 66C6 ("calendar, era"); there
is no direct connection. Why?
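Such gaps could be found mechanically. The following Python sketch treats the relation as symmetric and reports every pair that is connected through intermediate characters but has no direct link. The function name and the pair list are illustrative assumptions; the two links are the ones mentioned above, not data read from the Unihan database.

```python
from itertools import combinations


def missing_transitive_links(pairs):
    """Return sorted pairs that are linked only indirectly."""
    direct = {frozenset(pair) for pair in pairs}
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    missing = set()
    seen = set()
    for start in neighbours:
        if start in seen:
            continue
        # Collect the connected component containing `start`.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        # Every pair inside a component should be directly linked.
        for a, b in combinations(sorted(component), 2):
            if frozenset((a, b)) not in direct:
                missing.add((a, b))
    return sorted(missing)


# Toy data (assumption): the two semantic-variant links named above.
semantic_pairs = [("53A4", "66C6"), ("66C6", "6B77")]
print(missing_transitive_links(semantic_pairs))
```

For the toy data this reports the pair 厤 53A4 and 歷 6B77, i.e. the direct connection I could not find.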
And why does Apple's character palette regard 历 5386 as related to 厯
53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where
does this knowledge (see e.g.
http://dict.variants.moe.edu.tw/yitia/fra/fra02074.htm) come from?
Best,
Wolfgang