L2/02-360

Re: Property file changes (remainder after August meeting)
From: Mark Davis
Date: 2002-10-25

This document contains the remaining items from L2/02-267R3 that we did not finish in the August meeting. I am not going to bother making everything pretty: the items we have already discussed are simply truncated and struck-through.

  1. Identifier Revelations
  2. UTF-8 in Property Files
  3. Fallback Properties
  4. Line Break TR
  5. SpecialCasing and TR21
  6. New Properties in PropList.txt
  7. Other Properties
  8. PropertyAliases and PropertyValueAliases
  9. Line Break Pair Tables

1. Identifier Revelations


2. UTF-8 in Property Files

I believe the time has come to use UTF-8 consistently in all of our property data files. Currently Unihan.txt and NormalizationTest.txt are in UTF-8, a couple of files are in Latin-1, and most files are in ASCII. However, importantly:

This means that parsers that strip comments don't even need to know that the file is UTF-8 (unless they parse Unihan.txt); they can just treat it as ASCII. If we continue to follow these two principles, the switchover will be almost unnoticeable. Initially, this would only matter in the few files that contain non-ASCII Latin-1 characters. Later, we could add real, readable annotations in comments to some of the files, e.g.:

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
0130; 0069 0307; 0130; 0130; # LATIN CAPITAL LETTER I WITH DOT ABOVE

could become:

00DF; 00DF; 0053 0073; 0053 0053; # ß; ß; Ss; SS; LATIN SMALL LETTER SHARP S;
0130; 0069 0307; 0130; 0130; # İ; i ̇; İ; İ; LATIN CAPITAL LETTER I WITH DOT ABOVE
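
As a rough illustration of the point about comment-stripping parsers, the sketch below cuts each line at the first "#" and decodes only what is left as ASCII. It assumes that non-ASCII bytes appear only after the "#", and it relies on the fact that the byte 0x23 never occurs inside a multi-byte UTF-8 sequence. The function name parse_fields is hypothetical, and dropping the empty field after a trailing ";" is a choice made for this sketch only.

```python
def parse_fields(raw_line: bytes):
    """Return the semicolon-separated data fields of one property-file line.

    Hypothetical helper: everything from the first '#' onward is discarded
    before decoding, so UTF-8 annotations in comments are never examined.
    """
    data = raw_line.split(b"#", 1)[0]      # safe: 0x23 is never part of a multi-byte sequence
    text = data.decode("ascii").strip()    # raises UnicodeDecodeError if non-ASCII precedes '#'
    if not text:
        return None                        # blank line or comment-only line
    fields = [field.strip() for field in text.split(";")]
    if fields[-1] == "":
        fields.pop()                       # drop the empty field after a trailing ';'
    return fields


annotated = "00DF; 00DF; 0053 0073; 0053 0053; # ß; ß; Ss; SS; LATIN SMALL LETTER SHARP S"
fields = parse_fields(annotated.encode("utf-8"))
comment_only = parse_fields("# ß is only in a comment".encode("utf-8"))
```

A parser written this way gives the same fields for the annotated line as for the original ASCII-only line, which is the sense in which the switchover would go unnoticed.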


3. Fallback Properties

As a general rule, we should not have the fallback value for a property (the one that we give code points that are not explicitly mentioned) require computation; it should be a single value. Otherwise, it is too error-prone; too easy for programmers to make mistakes when processing the data files.

  1. Bidi_Class (UnicodeData.txt)

    The way that the BIDI class property is handled is very error-prone. We say in UAX #9 that all unassigned code points are given the following values

  2. Unfortunately, this is not repeated in UnicodeData-3.2.0.html (where the properties of UnicodeData.txt are documented). Nor are the relevant R and AL code points listed explicitly in DerivedBidiClass-3.2.0.txt. We should address both of these points: document the ranges in the .html file, and add the code points to DerivedBidiClass.txt.

  3. Joining_Type (ArabicShaping.txt)

    The Joining Type T is also not explicitly listed in ArabicShaping-3.2.0.txt. While in this case, at least the formula for computing T is included in the comments in the file, it would be less error-prone if they were listed explicitly. Those values are already given in DerivedJoiningType-3.2.0.txt.

  4. EastAsianWidth.txt

    The data file says:

    # - Assigned characters that are not listed explicitly are given the value "N".

    It omits telling what the default is for unassigned code points. I assume they are also N, in which case this needs to be changed to:

    # - All code points that are not listed explicitly are given the value "N".

    If they are not all N, then the ones that aren't should be explicitly listed!

  5. LineBreak.txt

    The data file says:

    # - Assigned characters that are not listed explicitly are given the value "AL".
    # - Unassigned characters are given the value "XX".

    The data file actually lists all the characters that are AL, and should. The above should be changed to:

    # - All code points that are not listed explicitly are given the value "XX".

  6. Simple titlecase mapping (field 14 in UnicodeData.txt)

    UnicodeData.html says: "This field is omitted if the titlecase is the same as field 12."

    A user noted that "this is apparently not true, except for 01C5, 01C8, 01CB and 01F2." The data should consistently either omit or include the field (when the same as field 12), and the documentation should match.
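
The general rule at the start of this section can be sketched as follows: every explicitly listed range is looked up directly, and anything else receives one constant fallback value with no further computation. The three ranges are a toy excerpt chosen only for illustration, not the contents of LineBreak.txt, and the names EXPLICIT_RANGES and line_break_class are hypothetical. The fallback "XX" follows the wording proposed for LineBreak.txt above.

```python
from bisect import bisect_right

# Single fallback value, as proposed above for LineBreak.txt.
DEFAULT_LINE_BREAK = "XX"

# Toy excerpt (illustrative only): sorted, non-overlapping (start, end, value).
EXPLICIT_RANGES = [
    (0x0030, 0x0039, "NU"),  # DIGIT ZERO..DIGIT NINE
    (0x0041, 0x005A, "AL"),  # LATIN CAPITAL LETTER A..Z
    (0x0061, 0x007A, "AL"),  # LATIN SMALL LETTER A..Z
]
_STARTS = [start for start, _end, _value in EXPLICIT_RANGES]


def line_break_class(code_point: int) -> str:
    """Return the listed value, or the single fallback if nothing is listed."""
    index = bisect_right(_STARTS, code_point) - 1   # last range starting at or before code_point
    if index >= 0:
        _start, end, value = EXPLICIT_RANGES[index]
        if code_point <= end:
            return value
    return DEFAULT_LINE_BREAK                        # no ranges, no formulas, just one constant
```

With a per-range or formula-based fallback, the final return statement would need its own table or computation, which is exactly where implementations tend to go wrong.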


4. Line Break TR

  1. The TR has (LB 15b) HY ÷ before (LB 18) HY × NU. Since earlier rules take precedence, the second rule has no effect, which means that a sequence such as "-3" is not kept together. This needs to be fixed. The minimal change is to move LB 15b to be LB 18b.
  2. I had thought (and reported) that Line Break would incorrectly break within Hangul syllables of the form LLVT. However, on closer examination, that doesn't happen: we are covered by Rule 6. However:
  3. The TR text should reflect more clearly that WORD JOINER (and that semantic of ZWNBSP) prevents line breaks, including a break at a hyphenation point in the interior of a word.
  4. It should also reflect our position on SHY (see above).
  5. The following text is in the rules:

    The text should be clearer that it is reasonable (but not required) to use the regular expression, instead of the approximate rules. (It gives better results than the pairwise approach, and for regular-expression-based linebreak engines, is much easier to implement.)

  6. The type CB must be resolved before the LB algorithm is invoked, but there is no other obvious type that it should normally behave like! B2 doesn't work, since that is specifically quotation dash. Either the text must describe what it is to map to, or a new rule should be added:
  7. The table of rules does not account for the types: SP, BK, CR, and LF. While some of the code implicitly handles SP, it does not account for edge cases, such as a space at the start of text.
  8. The text says "(See the Unicode Standard [U3.0] for other rules regarding graphemes.)". This needs to be updated to point at TR29, and use the newer term grapheme cluster.
  9. The following definitions are supplied below the pair table:

    However, the term "here" is imprecise. The normal interpretation would be that A ^ B iff A × B and A SP* × B. In that case, "here" means just "before the B". However, there are certain cases, such as ZW CL, where ZW SP × CL but ZW does break from CL. If the table is right, then the definition needs to be changed to A ^ B iff A × B and A SP* × B and A × SP* × B. I suggest the following clarification.

  10. The table of rules is incorrect for CM. For almost all types A, A ^ CM. This is because A × CM, and A × SP, and SP × CM (ZW, oddly, is an exception). In addition, CM % PO (by rules 6/7 and 17), and CM ÷ NU and CM ÷ AL (by rules 6/7 and 20). The reason for this is "Correspondingly, if there is no base, or if the base character is SP, CM* or SP CM* behave like ID."
  11. The ordering of the following rules is incorrect:

    LB 12  Break after spaces
            SP ÷

    LB 13  Don’t break before or after NBSP or WORD JOINER
            × GL
            GL ×

    The main purpose of WORD JOINER is exactly to prevent breaks where they would otherwise occur. The minimal fix is to change the ordering of LB 13, to move it to being LB 11b.

  12. At the end of this document (section 9, Line Break Pair Tables) are two pair tables. The first represents the current TR results (with the above fixes where the table incorrectly deviates from the rules). The second has the suggested rule movements listed above.
  13. The SG section and value should be removed (see text below from the TR): Line Break should not be phrased in terms of UTF-16 code units at this point in time, and surrogate code points do not have the behavior below, nor are any of them "characters".

    SG - Surrogates (XP) - (normative)

    All characters with General Category Cs. There is no break between a high surrogate and a low surrogate....
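
The ordering problems in items 1 and 11 both come from first-match-wins evaluation. The sketch below is a deliberately reduced model, not the TR's algorithm: it looks only at the line break classes on either side of a position, implements just the four rules quoted above, and uses hypothetical function names. It shows that swapping the order of two rules is enough to change the outcome.

```python
BREAK, NO_BREAK = "break", "no break"


def lb12(before, after):
    """LB 12: SP ÷ (break after spaces)."""
    return BREAK if before == "SP" else None


def lb13(before, after):
    """LB 13: × GL and GL × (no break before or after NBSP or WORD JOINER)."""
    return NO_BREAK if "GL" in (before, after) else None


def lb15b(before, after):
    """LB 15b, as quoted in item 1: HY ÷."""
    return BREAK if before == "HY" else None


def lb18(before, after):
    """LB 18, as quoted in item 1: HY × NU."""
    return NO_BREAK if (before, after) == ("HY", "NU") else None


def resolve(rules, before, after, default=BREAK):
    """Apply the rules in order; the first rule that gives an answer wins."""
    for rule in rules:
        result = rule(before, after)
        if result is not None:
            return result
    return default


CURRENT_ORDER = [lb12, lb13, lb15b, lb18]
PROPOSED_ORDER = [lb13, lb12, lb18, lb15b]   # LB 13 -> LB 11b, LB 15b -> LB 18b

hyphen_digit_now = resolve(CURRENT_ORDER, "HY", "NU")     # LB 15b fires first
hyphen_digit_fixed = resolve(PROPOSED_ORDER, "HY", "NU")  # LB 18 fires first
space_joiner_now = resolve(CURRENT_ORDER, "SP", "GL")     # LB 12 fires first
space_joiner_fixed = resolve(PROPOSED_ORDER, "SP", "GL")  # LB 13 fires first
```

Under the current order the model breaks between HY and NU and between SP and GL; under the proposed order it does neither, which is the effect the two suggested moves are meant to have.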


5. SpecialCasing and TR21


6. New Properties in PropList.txt

  1. There are 3 headings in Section 4.2 of TUS (page 79) which are not reflected in the UCD, ...
  2. NF*_Stable. There are 4 derived properties that may be useful to add to the UCD, one for each normalization form. They are the set of code points that are always stable: never affected by the normalization process in the current version of Unicode. This property is rather useful for skipping over text that does not need to be considered at all when normalizing.

    Formally, each stable code point CP fulfills all the following conditions:

    1. CP has canonical combining class 0, and
    2. CP is (as a single character) not changed by this normalization form, and
      if NFC or NFKC, ALL of the following:
    3. CP can never compose with a previous character, and
    4. CP can never compose with a following character, and
    5. CP can never change if another character is added.

    Example: In NFC, a-breve might satisfy all but (5), but if you add an ogonek it changes to a-ogonek + breve. So it is not stable. However, a-ogonek is stable in NFC, since it does satisfy (1-5).

    There are pluses and minuses to adding these properties:

  3. In a number of cases, we have string transforms, functions that map a string onto a (perhaps) modified string. Thus we speak of NFC(x) as the normalized form of x (according to the definition of NFC). These transforms can also be used to derive useful binary properties: such as isNFC(x), where isNFC(x) is true iff NFC(x) == x. This would be useful to document somewhere.
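
The derived binary property in item 3 and the a-breve example in item 2 can be checked with Python's standard unicodedata module. This is only a sketch of the definition isNFC(x) iff NFC(x) == x; the helper names are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def is_normalized(form: str, text: str) -> bool:
    """Binary property derived from a string transform: true iff form(text) == text."""
    return unicodedata.normalize(form, text) == text


def is_nfc(text: str) -> bool:
    return is_normalized("NFC", text)


A_BREVE = "\u0103"            # LATIN SMALL LETTER A WITH BREVE
A_OGONEK = "\u0105"           # LATIN SMALL LETTER A WITH OGONEK
COMBINING_BREVE = "\u0306"
COMBINING_OGONEK = "\u0328"

# a-breve alone is already in NFC, but it changes once an ogonek follows it,
# so it cannot be in the NFC stable set.
alone_is_nfc = is_nfc(A_BREVE)
with_ogonek = unicodedata.normalize("NFC", A_BREVE + COMBINING_OGONEK)
```

Here with_ogonek is a-ogonek followed by a combining breve, matching the example in item 2: being unchanged as a single character (condition 2) is not sufficient for stability.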

7. Other Properties

  1. I tested Terminal_Punctuation (from PropList.txt) with a compatibility closure. The following items are in that set, but not in
  2. For UTR #29: Text Boundaries, the following should be added to Other_Extend. While we decided to remove combining marks
  3. For UTR #29: Text Boundaries, it would be useful to add some new properties once it is finalized (so probably in 4.0 instead of 3.2). That way people can use machine-readable properties instead of digging them out of TR text. The possibilities should be reviewed with the UTC review of the TR. Candidates include:

    1. MidLetter: Non-letters that normally can occur in the middle of words. This is an informative property, and may be
    2. MidNumber: Non-digits that normally can occur in the middle of numbers. This is an informative property, and may be
    3. Sep: Probably with a different name, it would be useful to have a property for all the characters that break lines.
    4. Ambiguous_Sentence_Punctuation: Terminal_Punctuation characters that normally have two usages: they can end
    5. Sentence_Punctuation: Terminal_Punctuation characters that normally end a sentence, and are not normally within a
    6. L, V, T, LV, LVT: It would be useful to have distinct properties for the first three, and derived properties for the latter two.
    7. There are two types of characters that need to be added when considering Katakana, characters that are not marked as being Katakana in Script.txt. We should consider adding the last to the script, and perhaps having a "shared" Katakana-Hiragana script.
      U+30FC # KATAKANA-HIRAGANA PROLONGED SOUND MARK
      U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
      U+FF9E..U+FF9F # HALFWIDTH KATAKANA SOUND MARKS
  4. The properties Grapheme_Link, Grapheme_Base, and Grapheme_Extend should be adjusted to be consistent with TR29.
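
For the LV and LVT candidates in item 3.6, one possible derivation uses the arithmetic layout of the precomposed Hangul syllable block: the block starts at U+AC00, and each leading/vowel pair is followed by its 27 trailing-consonant variants. The sketch below assumes that layout; the function name is hypothetical, and it does not attempt the L, V, and T properties, which would need explicit Jamo ranges.

```python
S_BASE = 0xAC00     # first precomposed Hangul syllable
T_COUNT = 28        # one LV syllable plus 27 LVT syllables per L+V pair
S_COUNT = 11172     # 19 leading * 21 vowel * 28 trailing slots


def hangul_syllable_type(code_point: int):
    """Return "LV" or "LVT" for a precomposed Hangul syllable, else None."""
    s_index = code_point - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    # A remainder of 0 means the trailing-consonant slot is empty.
    return "LV" if s_index % T_COUNT == 0 else "LVT"
```

Because the two values fall out of a fixed formula, they would fit naturally as derived properties rather than as explicitly maintained lists.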
 

8. PropertyAliases and PropertyValueAliases


9. Line Break Pair Tables

The tables below match the TR14 ordering, and use the notation described in the TR. The table is extended by also including SP..CB, and the values L, V, and T for Hangul Jamo. If your browser is enabled for tool-tips, then hovering over the cell reveals the Rule number that determines the breaking status in the case in question. Sometimes there are multiple rules, when a case has to be tested with and without intervening spaces. The differences between the two are marked in yellow.

This does not imply that the current layout of the table in the TR should be changed to be as large as the one below. The more complete table below is simply provided to illustrate the effects of the recommended changes.

Current:

   OP CL QU GL NS EX SY IS PR PO NU AL ID IN HY BA BB B2 ZW CM SP BK CR LF CB L  V  T
OP ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
CL _  ^  %  %  ^  ^  ^  ^  _  %  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
QU ^  ^  %  %  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  %  %  ^  ^
GL %  ^  %  %  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  %  %  ^  ^
NS _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
EX _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
SY _  ^  %  %  %  ^  ^  ^  _  _  %  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
IS _  ^  %  %  %  ^  ^  ^  _  _  %  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
PR %  ^  %  %  %  ^  ^  ^  _  _  %  %  %  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  %  ^  ^
PO _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
NU _  ^  %  %  %  ^  ^  ^  _  %  %  %  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
AL _  ^  %  %  %  ^  ^  ^  _  _  %  %  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
ID _  ^  %  %  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
IN _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
HY _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
BA _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
BB %  ^  %  %  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  %  %  ^  ^
B2 _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  ^  ^  ^  ^  ^  ^  ^  _  _  ^  ^
ZW _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  ^  _  ^  ^  ^  ^  _  _  _  _
CM _  ^  %  %  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
SP _  ^  _  _  _  ^  ^  ^  _  _  _  _  _  _  _  _  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
BK _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _
CR _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  %  _  _  _  _
LF _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _
CB _  ^  %  %  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
L  _  ^  %  %  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  %  ^  ^
V  _  ^  %  %  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^
T  _  ^  %  %  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  ^  ^

Recommended:

   OP CL QU GL NS EX SY IS PR PO NU AL ID IN HY BA BB B2 ZW CM SP BK CR LF CB L  V  T
OP ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^  ^
CL _  ^  %  ^  ^  ^  ^  ^  _  %  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
QU ^  ^  %  ^  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  %  %  %  %
GL %  ^  %  ^  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  %  %  %  %
NS _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
EX _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
SY _  ^  %  ^  %  ^  ^  ^  _  _  %  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
IS _  ^  %  ^  %  ^  ^  ^  _  _  %  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
PR %  ^  %  ^  %  ^  ^  ^  _  _  %  %  %  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  %  %  %
PO _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
NU _  ^  %  ^  %  ^  ^  ^  _  %  %  %  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
AL _  ^  %  ^  %  ^  ^  ^  _  _  %  %  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
ID _  ^  %  ^  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
IN _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
HY _  ^  %  ^  %  ^  ^  ^  _  _  %  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
BA _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
BB %  ^  %  ^  %  ^  ^  ^  %  %  %  %  %  %  %  %  %  %  ^  ^  ^  ^  ^  ^  _  %  %  %
B2 _  ^  %  ^  %  ^  ^  ^  _  _  _  _  _  _  %  %  _  ^  ^  ^  ^  ^  ^  ^  _  _  _  _
ZW _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  ^  _  ^  ^  ^  ^  _  _  _  _
CM _  ^  %  ^  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
SP _  ^  _  ^  _  ^  ^  ^  _  _  _  _  _  _  _  _  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
BK _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _
CR _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  %  _  _  _  _
LF _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _  _
CB _  ^  %  ^  _  ^  ^  ^  _  _  _  _  _  _  _  _  _  _  ^  ^  ^  ^  ^  ^  _  _  _  _
L  _  ^  %  ^  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  %  %  _
V  _  ^  %  ^  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  %  %
T  _  ^  %  ^  %  ^  ^  ^  _  %  _  _  _  %  %  %  _  _  ^  ^  ^  ^  ^  ^  _  _  _  %
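
Since the yellow highlighting is not available in a plain-text copy, the cells that differ between the two tables can be recovered mechanically. The sketch below is one way to do that, shown on the HY row only to keep it short; pasting all rows of each table into the two strings would diff the full tables. The function names are hypothetical, and the parsing assumes whitespace-separated cells in the column order of the header above.

```python
HEADER = ("OP CL QU GL NS EX SY IS PR PO NU AL ID IN "
          "HY BA BB B2 ZW CM SP BK CR LF CB L V T").split()


def parse_rows(text):
    """Map each row label to a {column label: symbol} dictionary."""
    table = {}
    for line in text.strip().splitlines():
        label, *cells = line.split()
        if len(cells) != len(HEADER):
            raise ValueError(f"row {label!r} has {len(cells)} cells, expected {len(HEADER)}")
        table[label] = dict(zip(HEADER, cells))
    return table


def diff_tables(current, recommended):
    """List (before, after, old symbol, new symbol) for every cell that changed."""
    changes = []
    for before, row in current.items():
        if before not in recommended:
            continue
        for after in HEADER:
            old, new = row[after], recommended[before][after]
            if old != new:
                changes.append((before, after, old, new))
    return changes


# HY rows copied from the Current and Recommended tables above.
CURRENT_ROWS = "HY _ ^ % % % ^ ^ ^ _ _ _ _ _ _ % % _ _ ^ ^ ^ ^ ^ ^ _ _ ^ ^"
RECOMMENDED_ROWS = "HY _ ^ % ^ % ^ ^ ^ _ _ % _ _ _ % % _ _ ^ ^ ^ ^ ^ ^ _ _ _ _"

changes = diff_tables(parse_rows(CURRENT_ROWS), parse_rows(RECOMMENDED_ROWS))
for before, after, old, new in changes:
    print(f"{before} -> {after}: {old} becomes {new}")
```

For the HY row this reports changes in the GL, NU, V, and T columns; the NU change appears to correspond to the proposed move of LB 15b in section 4.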