L2/02-267R2
Re: | Property file changes for UCD 3.2.1 |
From: | Mark Davis |
Date: | 2002-07-28 |
While doing some work for the IDN identifier tables, I came across a number of property issues; other ICU team members and users also ran across related property issues that are also included here. The list grew from there, as I collected other property-related issues that have come up, such as from the editorial committee.
The following are proposed for Unicode 3.2.1.
U+034F COMBINING GRAPHEME JOINER
should be in
Other_Default_Ignorable_Code_Point.
U+205F MEDIUM MATHEMATICAL SPACE
should be in White_Space
Essentially, it is that the SHY is a format character that indicates a preferred intra-word line-break position. If the line is broken at that point, just as if it is broken at some other intra-word position, then whatever mechanism appropriate for intra-word line-breaks should be invoked. Depending on the language and the word, that may involve simply inserting a hyphen, inserting a hyphen and changing spelling in the divided word parts, or perhaps not even showing any visible change and simply breaking at that point.
In line with this, the SHY and similar characters should be changed to be Cf characters and be added to Other_Default_Ignorable_Code_Point.
The idea is that if a program does not support, say, DEVANAGARI LETTER KA, it should still not ignore it in processing, especially in rendering. Displaying nothing would give the user the impression that it does not occur in the text at all. So we recommend displaying a box or a special last-resort glyph. If we were parsing for identifiers, we would not ignore an unsupported character (like KA); we would break an identifier before it -- and not include it in the identifier.
However, with characters like ZWJ, if the program does not support it, the best approach is to ignore it completely; don't display a box, since the normal display of the character is invisible -- its effects are on other characters (which we can't show anyway since we don't support the character). In the discussion with Deborah Goldsmith in the UTC meeting, it was also clarified that another way to characterize this property is the set of characters that are normally invisible.
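The distinction can be sketched in a few lines of Python. This is only an illustration: the set below is a tiny hand-picked sample (SHY, CGJ, ZWJ) rather than the full property, and `fallback_display`, `DEFAULT_IGNORABLE_SAMPLE` and the choice of U+25A1 as a stand-in last-resort glyph are assumptions made for the example.

```python
# Hypothetical sample of default-ignorable code points: SHY, CGJ, ZWJ.
# The real set would come from the property data, not from this literal.
DEFAULT_IGNORABLE_SAMPLE = {0x00AD, 0x034F, 0x200D}

# WHITE SQUARE, used here as a stand-in for a last-resort glyph.
LAST_RESORT = "\u25A1"


def fallback_display(text, supported):
    """Return what a renderer supporting only `supported` might show."""
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
        elif ord(ch) in DEFAULT_IGNORABLE_SAMPLE:
            # Normally invisible: showing a box would be misleading.
            continue
        else:
            # Unsupported but visible: the user must see that something is there.
            out.append(LAST_RESORT)
    return "".join(out)
```

With only "a" and "b" supported, an unsupported DEVANAGARI LETTER KA between them shows up as a box, while an unsupported ZWJ simply disappears.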
Ken had a good observation: Just defining a property by enumeration, and then having a two-liner patched into the documentation is evidently not enough. I think we are going to have to require some more explicit criteria put forward for new properties, so that a smart committee of people, conversant in character encoding, can reliably produce the same or similar results when asked to apply the property against some particular character, to make these kinds of determinations.
Unfortunately, Sk also contains some oddities like spacing MACRON, which should not be parts of words or identifiers. Natural languages don't use ^ as a separate character within a word, as in "ro^le"; they use "rôle". It is no more natural than "ro◇le", "ro⇕le", "ro░le" or "ro☺le". Nor do we recommend that "ro^le" even sort anything like "rôle".
I recommend that we make the following fixes:
02B9..02BA ; Sk # [2] MODIFIER LETTER PRIME..MODIFIER LETTER DOUBLE PRIME
02C2..02CF ; Sk # [14] MODIFIER LETTER LEFT ARROWHEAD..MODIFIER LETTER LOW ACUTE ACCENT
02D2..02DF ; Sk # [14] MODIFIER LETTER CENTRED RIGHT HALF RING..MODIFIER LETTER CROSS ACCENT
02E5..02ED ; Sk # [9] MODIFIER LETTER EXTRA-HIGH TONE BAR..MODIFIER LETTER UNASPIRATED
These characters are intermixed with letters as a part of words, and should be allowed in both of them; there is also no reason to exclude them from identifiers. Moving them into Lm would reflect that fact, and produce better default behavior for all processes dealing with words and identifiers.
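To see what the move would mean for a process that treats "letters" as word characters, here is a rough sketch. The four ranges are the ones listed above; the function name and the overlay approach are assumptions for illustration. Note that Python's `unicodedata` reflects whatever Unicode version the interpreter was built with, not 3.2, so the overlay is applied explicitly.

```python
import unicodedata

# The four ranges proposed above for the move from Sk to Lm.
PROPOSED_LM_RANGES = [
    (0x02B9, 0x02BA),
    (0x02C2, 0x02CF),
    (0x02D2, 0x02DF),
    (0x02E5, 0x02ED),
]


def proposed_is_letter(ch):
    """True if `ch` would count as a letter once the proposed fix is applied."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in PROPOSED_LM_RANGES):
        return True
    # Everything else keeps the category the local Unicode data gives it.
    return unicodedata.category(ch).startswith("L")
```

Under this sketch MODIFIER LETTER PRIME is a letter, while CIRCUMFLEX ACCENT and spacing MACRON stay symbols.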
Note: this leaves the following characters in Sk:
005E ; Sk # CIRCUMFLEX ACCENT
0060 ; Sk # GRAVE ACCENT
00A8 ; Sk # DIAERESIS
00AF ; Sk # MACRON
00B4 ; Sk # ACUTE ACCENT
00B8 ; Sk # CEDILLA
FF3E ; Sk # FULLWIDTH CIRCUMFLEX ACCENT
FF40 ; Sk # FULLWIDTH GRAVE ACCENT
FFE3 ; Sk # FULLWIDTH MACRON
0374..0375 ; Sk # [2] GREEK NUMERAL SIGN..GREEK LOWER NUMERAL SIGN
0384..0385 ; Sk # [2] GREEK TONOS..GREEK DIALYTIKA TONOS
1FBD ; Sk # GREEK KORONIS
1FBF..1FC1 ; Sk # [3] GREEK PSILI..GREEK DIALYTIKA AND PERISPOMENI
1FCD..1FCF ; Sk # [3] GREEK PSILI AND VARIA..GREEK PSILI AND PERISPOMENI
1FDD..1FDF ; Sk # [3] GREEK DASIA AND VARIA..GREEK DASIA AND PERISPOMENI
1FED..1FEF ; Sk # [3] GREEK DIALYTIKA AND VARIA..GREEK VARIA
1FFD..1FFE ; Sk # [2] GREEK OXIA..GREEK DASIA
309B..309C ; Sk # [2] KATAKANA-HIRAGANA VOICED SOUND MARK..SEMI-VOICED SOUND MARK
The above are all special-case spacing symbols that are not used in the interior of words or identifiers in practice, any more than other symbols are. They should be left as Symbols to reflect this.
The only reason for them to include Sk was for the characters in #2, but this also includes the unwanted characters from #1. Once the #2 characters are removed from Sk, then the word/titlecase definitions don't need them any more.
Other options (not preferred)
Move goofy (a technical term) characters like (spacing) MACRON into So. Leave the definition of titlecase, word alone (they use Sk). Add Sk to the definition of identifier*.
I thus recommend:
U+2118 # SCRIPT CAPITAL P
U+212E # ESTIMATED SYMBOL
U+309B..U+309C # KATAKANA-HIRAGANA VOICED SOUND MARK..SEMI-VOICED SOUND MARK
The only down-side that I can see with this is that we are slightly out of sync with the ISO TR 10176; but we are already out of sync since they are based on old versions of the Unicode standard (currently the one in ballot is based on 3.0, and is already 44K characters out of date!). This is a simple application of "practice what you preach", and makes it far easier for users of our standard to themselves have backwards-compatible identifiers.
One might think that extending the notion of identifier could cause
problems. But these characters are not an issue. The
only possibility of a conflict would arise if you were parsing a file,
and encountered something like:
...identifier<syntax character>...
which suddenly got treated as an identifier. However, none of the
mentioned characters are treated as syntax characters in any known
programming language, so it would not be an issue.
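A small sketch of the parsing argument follows, assuming the recommendation is to keep the characters listed above valid in identifiers. The category list is a simplification (it makes no distinction between start and continuation characters), and `EXTRA_ID_CHARS`, `is_id_char` and `scan_identifier` are names invented for the example.

```python
import unicodedata

# The characters recommended above: U+2118, U+212E, U+309B..U+309C.
EXTRA_ID_CHARS = {0x2118, 0x212E, 0x309B, 0x309C}


def is_id_char(ch):
    """Simplified identifier test: letters, marks, digits, connectors, plus extras."""
    if ord(ch) in EXTRA_ID_CHARS:
        return True
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nl", "Mn", "Mc", "Nd", "Pc")


def scan_identifier(text, start=0):
    """Return the longest run of identifier characters beginning at `start`."""
    end = start
    while end < len(text) and is_id_char(text[end]):
        end += 1
    return text[start:end]
```

A real syntax character such as `+` still ends the identifier; only text that already contained one of the added characters next to an identifier would scan differently.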
I believe the time has come to use UTF-8 consistently in all of our property data files. Currently Unihan.txt and NormalizationTest.txt are in UTF-8, a couple of files are in Latin-1, and most files are in ASCII. However, importantly:
This means that parsers that strip comments don't even need to know that the file is UTF-8 (unless they parse Unihan.txt); they can just treat it as ASCII. If we continue to follow these two principles, it makes the switchover almost unnoticeable. Initially, this would only matter in the few files that contain some Latin-1 non ASCII. Later, we could add real, readable annotations in comments to some of the files, e.g.:
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE
could become:
00DF; 00DF; 0053 0073; 0053 0053; # ß; ß; Ss; SS; LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # İ; i̇; İ; İ; LATIN CAPITAL LETTER I WITH DOT ABOVE
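The claim that comment-stripping parsers would not notice the switch can be checked with a sketch like the following. It assumes that non-ASCII text appears only after the `#`, and it works on raw bytes: in UTF-8 every byte of a multi-byte sequence is 0x80 or above, so the byte for `#` can never occur inside one. The function name is hypothetical.

```python
def parse_fields(raw_line: bytes):
    """Split one data-file line into fields without decoding the comment."""
    # Drop everything from the first '#' on; the comment is never decoded.
    data = raw_line.split(b"#", 1)[0].strip()
    if not data:
        return None  # blank line or comment-only line
    # What remains is treated as pure ASCII; decoding fails loudly if not.
    fields = [f.strip().decode("ascii") for f in data.split(b";")]
    # A terminating ';' leaves one empty trailing field; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    return fields
```

For instance, feeding it the UTF-8 bytes of the annotated SHARP S line yields `["00DF", "00DF", "0053 0073", "0053 0053"]`, exactly what the unannotated line gives.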
As a general rule, we should not have the fallback value for a property (the one that we give code points that are not explicitly mentioned) require computation; it should be a single value. Otherwise, it is too error-prone; too easy for programmers to make mistakes when processing the data files.
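When the fallback is a single value, a lookup can be as simple as the sketch below. The class and the sample ranges are illustrative assumptions, not taken from any particular data file.

```python
from bisect import bisect_right


class RangeProperty:
    """Property lookup over explicit ranges with one constant fallback value."""

    def __init__(self, ranges, default):
        # ranges: iterable of (first_cp, last_cp, value), non-overlapping.
        self._ranges = sorted(ranges)
        self._starts = [r[0] for r in self._ranges]
        self._default = default

    def get(self, cp):
        # Find the last range starting at or before cp.
        i = bisect_right(self._starts, cp) - 1
        if i >= 0:
            first, last, value = self._ranges[i]
            if cp <= last:
                return value
        # No computation needed for anything unlisted.
        return self._default


# Illustrative data only.
sample = RangeProperty([(0x1100, 0x115F, "W"), (0x0020, 0x007E, "Na")], default="N")
```

A fallback that depends on the code point (as with the BIDI class below) would force every consumer to reimplement extra range logic in the last line of `get`, which is where mistakes creep in.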
The way that the BIDI class property is handled is very error-prone. We say in UAX #9 that all unassigned code points are given the following values
Unfortunately, this is not repeated in UnicodeData-3.2.0.html (where the properties of UnicodeData.txt are documented). Nor are the relevant R and AL code points listed explicitly in DerivedBidiClass-3.2.0.txt. We should address both of these points: document the ranges in the .html file, and add the code points to DerivedBidiClass.txt.
The Joining Type T is also not explicitly listed in ArabicShaping-3.2.0.txt. While in this case, at least the formula for computing T is included in the comments in the file, it would be less error-prone if they were listed explicitly. Those values are already given in DerivedJoiningType-3.2.0.txt.
The data file says:
# - Assigned characters that are not listed explicitly are given the value "N".
It omits telling what the default is for unassigned code points. I assume they are also N, in which case this needs to be changed to:
# - All code points that are not listed explicitly are given the value "N".
If they are not all N, then the ones that aren't should be explicitly listed!
The data file says:
# - Assigned characters that are not listed explicitly are given the value
#   "AL".
# - Unassigned characters are given the value "XX".
The data file actually lists all the characters that are AL, and should. The above should be changed to:
# - All code points that are not listed explicitly are given the value "XX".
UnicodeData.html says: "This field is omitted if the titlecase is the same as field 12."
A user noted that "this is apparently not true, except for 01C5, 01C8, 01CB and 01F2." The data should consistently either omit or include the field (when the same as field 12), and the documentation should match.
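A consistency check along these lines could be run over the file. The sketch assumes the semicolon-separated UnicodeData.txt layout with zero-based field numbers (12 for the uppercase mapping, 14 for the titlecase mapping); the function name is made up, and the sample lines in the usage note are illustrative rather than quoted from a released file.

```python
from collections import Counter


def titlecase_field_status(line):
    """Classify how one UnicodeData.txt line fills its titlecase field."""
    fields = line.rstrip("\n").split(";")
    upper, title = fields[12], fields[14]
    if title == "":
        return "omitted"
    return "same" if title == upper else "different"


def summarize(lines):
    """Count the three cases so that mixed practice is easy to spot."""
    return Counter(titlecase_field_status(line) for line in lines)
```

If the data were consistent with the documentation, `summarize` would report no "same" entries at all; a mix of "same" and "omitted" is the inconsistency the user ran into.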
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr After_Soft_Dotted; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_Soft_Dotted; # COMBINING DOT ABOVE
do not match the comment (which is correct). They need to be changed to:
# AFTER_I: The last preceding base character was an uppercase I, and
# no combining character class 230 (above) has intervened.
....
# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above
0307; ; 0307; 0307; tr AFTER_I # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i # COMBINING DOT ABOVE
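The AFTER_I condition as worded in the comment can be sketched as a backwards scan. This is one reading of the wording, taking "base character" to mean a character with combining class 0; the function name and signature are assumptions.

```python
import unicodedata


def after_i(text, index):
    """True if text[index] is in the AFTER_I context described above."""
    for ch in reversed(text[:index]):
        if ch == "I":
            return True  # reached an uppercase I with nothing blocking
        # Any other base character (class 0) or a mark of class 230
        # (above) intervening means the condition fails.
        if unicodedata.combining(ch) in (0, 230):
            return False
        # Marks of other classes (e.g. below) do not interfere.
    return False
```

So the dot above in I + dot_above qualifies, as it does when a below-mark such as U+0323 intervenes, but not when an above-mark such as U+0300 sits between the I and the dot.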
Note: This will not have an effect on CaseFolding.
D1. A character C is defined to be cased if it meets any of the following criteria:
- The general category of C is
- Titlecase Letter (Lt)
- In [CoreProps], C has one of the properties
- Uppercase, or
- Lowercase
- Given D = NFD(C), then it is not the case that:
- D = UCD_lower(D) = UCD_upper(D) = UCD_title(D)
Condition #3 is now redundant, since Uppercase and Lowercase have been 'closed' for #2. It thus does not add any additional characters. Thus #3 should be omitted (although we need to maintain consistency tests to ensure that it is captured in #2).
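The consistency test mentioned above might look roughly like this. It is only an approximation: Python's `str.isupper`/`str.islower` and `lower()`/`upper()`/`title()` stand in for the [CoreProps] properties and the UCD_lower/UCD_upper/UCD_title functions, and they follow the Unicode version of the interpreter rather than 3.2. All function names are hypothetical.

```python
import unicodedata


def cased_by_1_and_2(ch):
    """Criteria #1 and #2: Titlecase Letter, or Uppercase/Lowercase property."""
    return unicodedata.category(ch) == "Lt" or ch.isupper() or ch.islower()


def cased_by_3(ch):
    """Criterion #3: some case mapping changes the NFD form of the character."""
    d = unicodedata.normalize("NFD", ch)
    return not (d == d.lower() == d.upper() == d.title())


def find_gaps(code_points):
    """Code points that #3 would add beyond #1 and #2; ideally empty."""
    return [
        cp for cp in code_points
        if cased_by_3(chr(cp)) and not cased_by_1_and_2(chr(cp))
    ]
```

Running `find_gaps` over the assigned code points of a release, and requiring an empty result, is one way to keep the omission of #3 safe.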
In TR21, there are two places the text needs to be changed to account for edge-cases with subscript-iota.
For any string X, let Q(X) = NFC(toCasefold(X)). In other words, Q is the result of casefolding X, then putting the result into NFC format...
That is, given R(X) = NFC(toCasefold(X)), there are some strings such that R(R(X)) != R(X).
to:
For any string X, let Q(X) = NFC(toCasefold(NFD(X))). In other words, Q is the result of normalizing X, then casefolding the result, then putting the result into NFC format...
That is, given R(X) = NFKC(toCasefold(NFD(X))), there are some strings such that R(R(X)) != R(X).
to:
Note: The multiple invocations of normalization in the above definitions are to catch relatively infrequent edge cases. In practice, implementations can produce optimized versions that avoid this, treating the edge cases as exceptions if they occur.
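For concreteness, the revised Q can be written directly with Python's standard library, using `str.casefold` as a stand-in for toCasefold (it follows the interpreter's Unicode version). This is the unoptimized form; as the note says, real implementations would avoid the repeated normalization.

```python
import unicodedata


def q(x):
    """Q(X) = NFC(toCasefold(NFD(X))), written out step by step."""
    decomposed = unicodedata.normalize("NFD", x)   # normalize first
    folded = decomposed.casefold()                 # then casefold
    return unicodedata.normalize("NFC", folded)    # then recompose
```

For example, `q("Straße")` and `q("STRASSE")` both give `"strasse"`, and `q("\u00C5")` gives `"\u00E5"`.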
U+09CB..U+09CC # BENGALI VOWEL SIGN O..BENGALI VOWEL SIGN AU
U+0B48 # ORIYA VOWEL SIGN AI
U+0B4B..U+0B4C # ORIYA VOWEL SIGN O..ORIYA VOWEL SIGN AU
U+0BCA..U+0BCC # TAMIL VOWEL SIGN O..TAMIL VOWEL SIGN AU
U+0CC0 # KANNADA VOWEL SIGN II
U+0CC7..U+0CC8 # KANNADA VOWEL SIGN EE..KANNADA VOWEL SIGN AI
U+0CCA..U+0CCB # KANNADA VOWEL SIGN O..KANNADA VOWEL SIGN OO
U+0D4A..U+0D4C # MALAYALAM VOWEL SIGN O..MALAYALAM VOWEL SIGN AU
U+0DDA # SINHALA VOWEL SIGN DIGA KOMBUVA
U+0DDC..U+0DDE # SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA..SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA
U+17BF..U+17C0 # KHMER VOWEL SIGN YA..KHMER VOWEL SIGN IE
U+17C4..U+17C5 # KHMER VOWEL SIGN OO..KHMER VOWEL SIGN AU
U+093F # DEVANAGARI VOWEL SIGN I
U+09BF # BENGALI VOWEL SIGN I
U+09C7..U+09C8 # BENGALI VOWEL SIGN E..BENGALI VOWEL SIGN AI
U+0A3F # GURMUKHI VOWEL SIGN I
U+0ABF # GUJARATI VOWEL SIGN I
U+0B47 # ORIYA VOWEL SIGN E
U+0BC6..U+0BC8 # TAMIL VOWEL SIGN E..TAMIL VOWEL SIGN AI
U+0D46..U+0D48 # MALAYALAM VOWEL SIGN E..MALAYALAM VOWEL SIGN AI
U+0DD9..U+0DDB # SINHALA VOWEL SIGN KOMBUVA..SINHALA VOWEL SIGN KOMBU DEKA
U+1031 # MYANMAR VOWEL SIGN E
U+17BE # KHMER VOWEL SIGN OE
U+17C1..U+17C3 # KHMER VOWEL SIGN E..KHMER VOWEL SIGN AI
U+0F90..U+0F97 # TIBETAN SUBJOINED LETTER KA..TIBETAN SUBJOINED LETTER JA
U+0F99..U+0FBC # TIBETAN SUBJOINED LETTER NYA..TIBETAN SUBJOINED LETTER FIXED-FORM RA
Formally, each stable code point CP fulfills all the following conditions:
Example: In NFC, a-breve might satisfy all but (e), but if you add an ogonek it changes to a-ogonek + breve. So it is not stable. However, a-ogonek is stable in NFC, since it does satisfy (a-e).
There are pluses and minuses to adding these properties:
Recommended to be included as Properties | |
Numeric: | Completes the other set of numeric properties in the UCD. Proposed numeric type names: Han_Primary (hp), Han_Accounting (ha), Han_Other (ho) |
Variants | For foldings and comparison. Proposed property names: Semantic_Variant (semv), Simplified_Variant (simv), Specialized_Semantic_Variant (specv), Traditional_Variant (tradv), Z_Variant (zv) |
kRSUnicode: | For indexing and sorting. Proposed property name: Unicode_Radical_Stroke (urs) |
Recommended to be excluded as Properties (e.g. left simply as tags in the Unihan file) | |
Other Radical/Stroke: | Questionable validity; incomplete data |
Character Mapping: | Logically a part of character mapping tables, not Unicode Properties |
Dictionary Position, Definition, Grade: | Applicable only to very specific programs |
Frequency, Pronunciations | Questionable validity; incomplete data |
Redundant: | derivable from the UCD |
Complete list of categorized tags from Unihan
Category | Property Name | Description from Unihan (abbreviated) |
---|---|---|
Numeric | kAccountingNumeric | The value of the character when used in the writing of accounting numerals. |
Numeric | kOtherNumeric | The numeric value for the character in certain unusual, specialized contexts. |
Numeric | kPrimaryNumeric | The value of the character when used in the writing of numbers in the standard fashion. |
Variants | kSemanticVariant | The Unicode value for a semantic variant for this character. A semantic variant is an x- or y-variant with similar or identical meaning which can generally be used in place of the indicated character. |
Variants | kSimplifiedVariant | The Unicode value for the simplified Chinese variant for this character (if any). |
Variants | kSpecializedSemanticVariant | The Unicode value for a specialized semantic variant for this character. A specialized semantic variant is an x- or y-variant with similar or identical meaning only in certain contexts (such as accountants' numerals). |
Variants | kTraditionalVariant | The Unicode value(s) for the traditional Chinese variant(s) for this character. |
Variants | kZVariant | The Unicode value(s) for known z-variants of this character |
Radical/Stroke | kRSJapanese | A Japanese radical/stroke count for this character in the form "radical.additional strokes". |
Radical/Stroke | kRSKanWa | A Morohashi radical/stroke count for this character in the form "radical.additional strokes". |
Radical/Stroke | kRSKangXi | A KangXi radical/stroke count for this character in the form "radical.additional strokes". |
Radical/Stroke | kRSKorean | A Korean radical/stroke count for this character in the form "radical.additional strokes". A ' after the radical indicates the simplified version of the given radical |
Radical/Stroke | kRSUnicode | A standard radical/stroke count for this character in the form "radical.additional strokes". A ' after the radical indicates the simplified version of the given radical |
Radical/Stroke | kTotalStrokes | The total number of strokes in the character (including the radical) |
Pronunciations | kCantonese | The Cantonese pronunciation(s) for this character |
Pronunciations | kJapaneseKun | The Japanese pronunciation(s) of this character |
Pronunciations | kJapaneseOn | The Sino-Japanese pronunciation(s) of this character |
Pronunciations | kKorean | The Korean pronunciation(s) of this character |
Pronunciations | kMandarin | The Mandarin pronunciation(s) for this character in pinyin |
Pronunciations | kTang* | The Tang dynasty pronunciation(s) of this character, derived from _T'ang Poetic Vocabulary_ |
Pronunciations | kVietnamese | The character's pronunciation(s) in Quốc ngữ |
Definition | kDefinition | An English definition for this character |
Frequency | kFrequency | A rough frequency measurement for the character based on analysis of Chinese USENET postings |
Grade | kGradeLevel* | The grade in the Hong Kong school system by which a student is expected to know the character. |
Dictionary Position | kAlternateKangXi | An alternate possible position for the character in the KangXi dictionary |
Dictionary Position | kAlternateMorohashi | An alternate possible position for the character in the Morohashi dictionary |
Dictionary Position | kCihaiT* | The position of this character in the Cihai (辭海) dictionary, single volume edition, published in Hong Kong by the Zhonghua Bookstore, 1983 (reprint of the 1947 edition), ISBN 962-231-005-2. |
Dictionary Position | kCowles* | The index of this character in Roy T. Cowles, _A Pocket Dictionary of Cantonese_, Hong Kong: University Press, 1999. |
Dictionary Position | kDaeJaweon | The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. |
Dictionary Position | kFenn* | Data on the character from _Fenn's Chinese-English Pocket Dictionary_ |
Dictionary Position | kHanYu | The position of this character in the Hanyu Da Zidian (HDZ) Chinese character dictionary (bibliographic information below). |
Dictionary Position | kHKGlyph* | The index of the character in 常用字字形表 (二零零零年修訂本), 香港: 香港教育學院, 2000, ISBN 962-949-040-4. This publication gives the "proper" shapes for characters as used in the Hong Kong school system. |
Dictionary Position | kIRGDaeJaweon | The position of this character in the Dae Jaweon (Korean) dictionary used in the four-dictionary sorting algorithm. |
Dictionary Position | kIRGDaiKanwaZiten | The index of this character in the Dae Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm. |
Dictionary Position | kIRGHanyuDaZidian | The position of this character in the Hanyu Da Zidian (PRC) dictionary used in the four-dictionary sorting algorithm. |
Dictionary Position | kIRGKangXi | The position of this character in the KangXi dictionary used in the four-dictionary sorting algorithm. |
Dictionary Position | kKangXi | The position of this character in the KangXi dictionary used in the four-dictionary sorting algorithm. |
Dictionary Position | kKarlgren* | The index of this character in _Analytic Dictionary of Chinese and Sino-Japanese_ |
Dictionary Position | kLau* | The index of this character in _A Practical Cantonese-English Dictionary_ |
Dictionary Position | kMatthews | The index of this character in _Mathews' Chinese-English Dictionary_ |
Dictionary Position | kMeyerWempe* | The index of this character in the Student's Cantonese-English Dictionary |
Dictionary Position | kMorohashi | The index of this character in the Dae Kanwa Ziten, aka Morohashi dictionary (Japanese) used in the four-dictionary sorting algorithm. |
Dictionary Position | kNelson | The index of this character in _The Modern Reader's Japanese-English Character Dictionary_ |
Dictionary Position | kPhonetic* | The phonetic index for the character from _Ten Thousand Characters: An Analytic Dictionary_ |
Dictionary Position | kSBGY | The position of this character in the Song Ben Guang Yun (SBGY) Medieval Chinese character dictionary (bibliographic and general information below). |
Dictionary Position | kCangjie* | The cangjie input code for the character. This incorporates data from the file cangjie-table.b5 by Christian Wittern |
Character Mapping | kBigFive | The Big Five mapping for this character in hex; note that this does *not* cover any of the Big Five extensions in common use, including the ETEN extensions. |
Character Mapping | kCCCII | The CCCII mapping for this character in hex |
Character Mapping | kCNS1986 | The CNS 11643-1986 mapping for this character in hex |
Character Mapping | kCNS1992 | The CNS 11643-1992 mapping for this character in hex |
Character Mapping | kEACC | The EACC mapping for this character in hex |
Character Mapping | kGB0 | The GB 2312-80 mapping for this character in ku/ten form |
Character Mapping | kGB1 | The GB 12345-90 mapping for this character in ku/ten form |
Character Mapping | kGB3 | The GB 7589-87 mapping for this character in ku/ten form |
Character Mapping | kGB5 | The GB 7590-87 mapping for this character in ku/ten form |
Character Mapping | kGB7 | The "General Use Characters for Modern Chinese" mapping for this character |
Character Mapping | kGB8 | The GB 8565-89 mapping for this character in ku/ten form |
Character Mapping | kHKSCS | Mappings to the Big Five extended code points used for the Hong Kong Supplementary Character Set |
Character Mapping | kIBMJapan | The IBM Japanese mapping for this character in hex |
Character Mapping | kIRG_GSource | The IRG "G" source mapping for this character in hex. The IRG "G" source consists of data from the following national standards, publications, and lists from the People's Republic of China and Singapore. |
Character Mapping | kIRG_HSource | The IRG "H" source mapping for this character in hex. The IRG "H" source consists of data from the Hong Kong Supplementary Character Set. |
Character Mapping | kIRG_JSource | The IRG "J" source mapping for this character in hex. The IRG "J" source consists of data from the following national standards and lists from Japan. |
Character Mapping | kIRG_KSource | The IRG "K" source mapping for this character in hex. The IRG "K" source consists of data from the following national standards and lists from the Republic of Korea (South Korea). |
Character Mapping | kIRG_KPSource | The IRG "KP" source mapping for this character in hex. The IRG "KP" source consists of data from the following national standards and lists from the Democratic People's Republic of Korea (North Korea). |
Character Mapping | kIRG_TSource | The IRG "T" source mapping for this character in hex. The IRG "T" source consists of data from the following national standards and lists from the Republic of China (Taiwan). |
Character Mapping | kIRG_VSource | The IRG "V" source mapping for this character in hex. The IRG "V" source consists of data from the following national standards and lists from Vietnam. |
Character Mapping | kJIS0213 | The JIS X 0213-2000 mapping for this character in min,ku,ten form |
Character Mapping | kJis0 | The JIS X 0208-1990 mapping for this character in ku/ten form |
Character Mapping | kJis1 | The JIS X 0212-1990 mapping for this character in ku/ten form |
Character Mapping | kKPS0 | The KP 9566-97 mapping for this character in hexadecimal form. |
Character Mapping | kKPS1 | The KPS 10721-2000 mapping for this character in hexadecimal form. |
Character Mapping | kKSC0 | The KS X 1001:1992 (KS C 5601-1989) mapping for this character in ku/ten form |
Character Mapping | kKSC1 | The KS X 1002:1991 (KS C 5657-1991) mapping for this character in ku/ten form |
Character Mapping | kMainlandTelegraph | The PRC telegraph code for this character, derived from "Kanzi denpou koudo henkan-hyou" |
Character Mapping | kPseudoGB1 | A "GB 12345-90" code point assigned this character for the purposes of including it within Unihan. |
Character Mapping | kTaiwanTelegraph | The Taiwanese telegraph code for this character, derived from "Kanzi denpou koudo henkan-hyou" |
Character Mapping | kXerox | The Xerox code for this character |
Redundant | kCompatibilityVariant* | The compatibility decomposition for this ideograph, derived from the UnicodeData.txt file. |
Background Information
The following lists each Unihan tag, the total number of characters with that tag found in Unihan.txt, the minimum and maximum lengths of the values associated with the tag, and a few sample values (separated by semicolons). Don't worry if some of the less common CJK characters appear as boxes on your machine; they are only examples.
Note: I have also run across some problems in some of the 'provisional' data; I have filed bugs directly with John on those.