L2/03-094
Re: Han Properties
From: Mark Davis, Ken Whistler
Date: 2003-03-03
When it came time to implement the following:
[92-A65] Action Item for Mark Davis, John Jenkins: For version 4, add the Numerical, Radical/Stroke, & Source Reference PropertyAliases.txt and PropertyValueAliases.txt as normative properties, and create a "provisional" property and state that all other Unihan tags are provisional. [L2/02-267R3]
we ran up against a few issues. Our recommendations are:
1. Simply incorporate the numeric values from the Unihan file without adding a new property, using the existing Numeric_Type value nt=Numeric. Most people using APIs that provide this new information simply need the numeric values and care only that they are non-decimal. This makes it easier for the data to be incorporated, and the values are then just present in the extracted files DerivedNumericType and DerivedNumericValue. (Many people will not bother to parse the Unihan.txt file if this is all the information they need from there; they will just pick up the extracted files.)
2. Ken is concerned that the numeric values for two characters in Unihan.txt (5793 gai1 and 4EAC jing1) are problematic and should be removed (see below).
3. There are issues with the Source Standards. This is really not the kind of property that we should be surfacing. Although this information is crucial to the standards development work, it is not a property anyone would use. It is even misleading, since the source standards are 'idealized', and don't represent the actual mapping values that people would necessarily use in conversion tables. So we suggest dropping them.
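As a rough illustration of recommendation 1, the sketch below folds the three Unihan numeric tags into the existing Numeric_Type and Numeric_Value properties, giving every tagged character nt=Numeric. The sample lines, the function name `fold_han_numerics`, and the tab-separated layout are assumptions made for illustration; this is not the procedure actually used to generate the extracted files.

```python
# Hypothetical sample lines in a tab-separated Unihan-style layout
# (code point, tag, value). Chosen for illustration only.
SAMPLE_UNIHAN = """\
U+4E94\tkPrimaryNumeric\t5
U+4F0D\tkAccountingNumeric\t5
U+5344\tkOtherNumeric\t20
U+4E94\tkDefinition\tfive
"""

# The three Unihan tags named in the memo as carrying numeric values.
NUMERIC_TAGS = {"kPrimaryNumeric", "kAccountingNumeric", "kOtherNumeric"}


def fold_han_numerics(text):
    """Return (numeric_type, numeric_value) dicts keyed by code point.

    Every character carrying one of the numeric tags gets nt=Numeric;
    the distinction between the three tags is deliberately dropped.
    """
    numeric_type = {}
    numeric_value = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, tag, value = line.split("\t", 2)
        if tag not in NUMERIC_TAGS:
            continue  # ignore all non-numeric tags
        cp = int(code[2:], 16)  # strip the "U+" prefix
        numeric_type[cp] = "Numeric"
        numeric_value[cp] = int(value)
    return numeric_type, numeric_value


if __name__ == "__main__":
    nt, nv = fold_han_numerics(SAMPLE_UNIHAN)
    for cp in sorted(nv):
        print(f"{cp:04X};{nt[cp]};{nv[cp]}")
```

A consumer that only needs values never sees which Unihan tag a value came from, which is the simplification the recommendation is after.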
Addendum from Ken
A. Han Numeric property issues.

The simplest way (Option A) to comply with the UTC consensus is to create a property for each of the labelled fields in Unihan.txt that we are attempting to promote to normative status:

  kPrimaryNumeric ==> Han_Primary_Numeric (value: numeric)
  kAccountingNumeric ==> Han_Accounting_Numeric (value: numeric)
  kOtherNumeric ==> Han_Other_Numeric (value: numeric)

Option B is to notice that these designations are in complementary distribution, and then to create an enumerated property for the type, and a numeric property for the value:

  Han_Numeric_Type (value: enumerated [Han_Primary, Han_Accounting, Han_Other])
  Han_Numeric_Value (value: numeric)

Option C is to notice that these designations are in complementary distribution with existing numeric types for non-Han characters, and then to merge them into the existing property by extending the enumeration:

  Numeric_Type (value: enumerated [Decimal, Digit, Numeric, None, Han_Primary, Han_Accounting, Han_Other])
  Numeric_Value (value: numeric)

In some regards, I think Option A is the simplest and least riddled with complications and side effects. Option C is the most problematical, since it changes the enumeration of an existing type. It also mixes apples and oranges, since it enumerates distinctions relevant to Han characters on the same virtual axis as distinctions which were originally put in place to try to distinguish digits from non-digits.

This leads to yet another possible solution, Option D. Designate all Han numerics as having nt=Numeric, and then create a property type for the complementary distribution of subtypes relevant only to Han:

  Numeric_Type (value: enumerated [Decimal, Digit, Numeric, None]) (unchanged, but now specify the list of Han characters with nt=Numeric)
  Numeric_Value (value: numeric) (unchanged)
  Han_Numeric_Subtype (value: enumerated [Han_Primary, Han_Accounting, Han_Other])

In my opinion, *that* approach would be the easiest for people with existing APIs to accommodate. Even simpler would be to omit the Han_Numeric_Subtype definition as well, since it can be derived from the Unihan.txt tags, and is not central to the problem of providing numeric values for Han characters.

Another fly in the ointment is that some Han characters that we encoded as (Nl) symbols, rather than as unified or compatibility CJK characters, themselves have Numeric values: 3007, 3021..3029, 3038..303A. E.g.:

  3039;HANGZHOU NUMERAL TWENTY;Nl;0;L;<compat> 5344;;;20;N;;;;;

This shows gc=Nl and nt=Numeric and nv=20 for this character. And for the Hangzhou-style numerals, these overlap with those few odds and ends which get the kOtherNumeric tag. These should be accounted for in any solution which deals with the Han numeric values.

There is an additional problem posed by two particular kPrimaryNumeric characters in Unihan.txt. The two characters in question are bizarre in the first place, and should *not* be given normative numeric values as they currently are. Quoting myself:

> kPrimaryNumeric adds two values which
> are *not* in Table 4-3 in the book -- and probably not
> there for good reason. Those are 5793 gai1 and 4EAC jing1.
> Unihan claims 4EAC is 10 quadrillion, i.e. (10000)^4 and
> that 5793 is 100 quintillion, i.e. (10000)^5. I've checked
> two dictionaries. Both claim that 5793 gai1 means 100 million.
> One doesn't list a numeric value at all for 4EAC jing1, which
> is a common character meaning 'capital', but the other lists
> it also as an "ancient numeral" meaning 10 million.
> A third, classical dictionary (Cihai) says of 4EAC jing1:
> "Name of a number. 10 zhao4 [5146] constitute a jing1,
> there are also those who aver that 10,000 zhao4 constitute
> a jing1." So by that reckoning, it could either be
> 10 trillion or 10 quadrillion. That same classic dictionary
> cites a source for gai1 which claims that "10 man4 is called
> yi4, 10 yi4 is called zhao4, 10 zhao4 is called jing1,
> 10 jing1 is called gai1" (Incidentally that jing1 is
> 7D93, *not* 4EAC, although the commentator says it is
> meant for the same thing.) And then the commentator says
> "there are also those who aver that 10,000 jing1 constitute
> a gai1". Clearly nobody *really* knows what the heck these
> numbers referred to. They probably started out as fantasy
> concepts, equivalent to bezillion and gazillion, respectively.
> One of the alternate meanings of jing1 'capital' is just 'big'.
> The rationalizations that jing1 means 10 quadrillion and
> gai1 means 100 quintillion are just that -- rationalizations
> by later commentators using the rank by 10,000's concept
> of man4 (10,000), yi4 (100,000,000) and zhao4 (1,000,000,000,000).
>
> *NOBODY* uses these two characters as numbers in China.
> It would be a disservice to our implementers and to other
> users of the standard to take these fantastic commentaries
> on imagined big numbers and reify them in API's that have
> to spit out 10^16 and 10^20 as numeric values.

One possible solution is a new Unihan tag. ;-)

  U+4EAC kFantasyNumeric bezillion (uncertain ancient large numeric quantity, variously annotated as equal to 10 million or to 10 quadrillion)
  U+5793 kFantasyNumeric gazillion (etc....)

B. IRG_Source tag issues

> > kIRG_GSource
> > kIRG_HSource
> > kIRG_JSource
> > kIRG_KSource
> > kIRG_KPSource
> > kIRG_TSource
> > kIRG_VSource

We would need the U Source as well, wouldn't we? Or is this not normative except for compatibility CJK characters?

And there is deceptive complexity hiding here as well. Once again, the simplest approach would be to just turn each of these tags into a distinct property, and then use their current string values as their values. But then collectively, they define, by implication, an enumerated IRG_Source type property, presumably enumerated as [G, H, J, K, KP, T, V, U?].

But wait, if I have a data entry:

  U+2F9F6 kIRG_TSource 5-5F5E

That corresponds to T Source: "T5 CNS 11643-1992, plane 5". And there is another implied enumerated type property for the T Source: [T1, T2, T3, T4, T5, T6, T7, TF]. And so on for the other sources. That, in fact, is the actual structure of the listing in 10646 Clause 27.1 (normative) of the CJK Unified Ideograph sources.

(Note also that in my example above, we are talking about a *compatibility* CJK character -- the compatibility CJK characters are also now all listed in Unihan.txt, and that tosses a further curve at us regarding what the status of the IRG_Source tag field meanings is here for the unified characters as opposed to the compatibility characters.)

Once again, as for the Han_Numeric, I don't think this property (or set of properties) is ready for prime time until we work through all the implications with a detailed proposal.
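
The nested structure described above for the T Source can be made concrete with a small sketch. It assumes, based only on the single example entry in this memo, that a kIRG_TSource value has the form "<plane>-<code>" and that the plane selects one of the sub-sources T1..T7 or TF. The function name `parse_t_source` is hypothetical, and the other IRG sources may use value formats this sketch does not handle.

```python
# Sub-source enumeration for the T Source, as listed in the memo.
T_SUBSOURCES = {"T1", "T2", "T3", "T4", "T5", "T6", "T7", "TF"}


def parse_t_source(value):
    """Split a kIRG_TSource value into (sub-source, code within source).

    Assumes the "<plane>-<code>" layout seen in the memo's example;
    raises ValueError for anything that does not fit that layout.
    """
    plane, sep, code = value.partition("-")
    subsource = "T" + plane
    if not sep or not code or subsource not in T_SUBSOURCES:
        raise ValueError(f"unrecognized kIRG_TSource value: {value!r}")
    return subsource, code


if __name__ == "__main__":
    # The memo's example entry: U+2F9F6 kIRG_TSource 5-5F5E
    print(parse_t_source("5-5F5E"))
```

Even this one tag yields two values (an enumerated sub-source and a code), which illustrates why a single string-valued property per tag hides structure that a normative definition would have to spell out.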