This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Fri Jul 6 19:19:36 CDT 2012
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Public Review Issue
Opt Subject: PRI #223
UCA does not require specific behavior for when the algorithm encounters ill-formed data (e.g., isolated surrogates in UTF-16 strings). A conformant implementation may, for example, throw an exception when it encounters ill-formed input. However, the conformance test data files include isolated surrogates in some of the test cases. In order to pass the conformance tests as written, an implementation *must* adopt a particular strategy and return particular values for ill-formed strings. (It can pass by weighting an isolated surrogate as it would an unassigned code point.) This anomaly should be documented in the test documentation, as a conformance test should not force a requirement on an implementation that the conformance requirements for the algorithm do not actually state. At least implementers using the conformance tests should be put on fair notice about this situation in the test data.
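[Editorial note] The strategy Ken describes for passing the tests (weighting an isolated surrogate as an unassigned code point) can be sketched with the UCA implicit-weight formula for unassigned code points; the function name here is illustrative, not from UTS #10:

```python
# Sketch: weight an isolated surrogate as if it were an unassigned
# code point, using the UCA implicit-weight formula for unassigned
# code points: AAAA = FBC0 + (CP >> 15), BBBB = (CP & 0x7FFF) | 0x8000.

def implicit_primary_weights(cp):
    """Return the pair of primary weights (AAAA, BBBB) the UCA derives
    for an unassigned code point."""
    aaaa = 0xFBC0 + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

# Treating the isolated surrogate U+D834 like an unassigned code point:
print([hex(w) for w in implicit_primary_weights(0xD834)])
# -> ['0xfbc1', '0xd834']
```

An implementation that adopts this weighting can pass the test cases containing isolated surrogates; one that raises an exception on ill-formed input, though equally conformant to the algorithm, cannot.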
Date/Time: Sat Jul 7 17:06:02 CDT 2012
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #223: Proposed Update UTS #10: Unicode Collation Algorithm
Section 3.1: Because D1 to D5 run contrary to normal English transformational grammar, thereby impeding understanding, there should be a warning such as, "Note that although a level 3 ignorable is ignorable at level 2, it is not a level 2 ignorable."

Section 3.3.2: In the new text in Section 3.3.2, I regret that 'completely ignorable' should be replaced by 'sufficiently ignorable'. The inserted character will give a different result from the plain characters with the contraction removed, be it only at the semistable level. The problem with the text as it stands is that CGJ maps to a quaternary element in DUCET.

Section 3.6.1: The statement "the UCA does not use this fourth level of data" is wrong. The UCA uses whatever levels are provided to it, are within the implementation's capability (C2 requires at least 3), and are not otherwise disabled. I suggest, "the UCA does not require the use of this fourth level of data". One cannot state that the values in the fourth level are "not consistent with a well-formed collation element data table" until well-formedness condition 2 is strengthened. By the definition of 6.2.0 Draft 3, DUCET is well-formed even with the level 4 weights. The statement "If the first three levels are zero, the fourth level is also set to zero" is false. The simplest repair I can think of is to substitute, "For further details, see Section 7.3, Fourth-Level Weight Assignments".

Section 3.6.2 (or revised allkeys.txt): If IgnoreSP is selected, is U+10A7F OLD SOUTH ARABIAN NUMERIC INDICATOR variable or not? It is variable under DUCET. It is ordered among the variably weighted numbers within the symbols, but its General_Category is Po. This issue applies both to Unicode 6.1.0 and to allkeys-6.2.0d2.txt with UnicodeData-6.2.0d1.txt. U+10A7F is the only character for which this quandary arises.

Section 4.2, S2.1: The draft, dated 17 May 2012, of the Minutes of the UTC 131 / L2 228 Joint Meeting (San Jose, CA, May 7-11, 2012) records no agreement to the change.
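[Editorial note] The terminology pitfall flagged for Section 3.1 above can be made concrete with a small sketch; the predicate names here are illustrative, not from UTS #10:

```python
def ignorable_at(ce, level):
    """True if the collation element has zero weights at levels 1..level."""
    return all(w == 0 for w in ce[:level])

def level_n_ignorable(ce, n):
    """A level-n ignorable: zero weights through level n, nonzero at level n+1."""
    return ignorable_at(ce, n) and ce[n] != 0

ce = (0, 0, 0, 0x21)              # a level 3 ignorable (quaternary element)
print(ignorable_at(ce, 2))        # -> True: it is ignorable at level 2
print(level_n_ignorable(ce, 2))   # -> False: it is not a level 2 ignorable
print(level_n_ignorable(ce, 3))   # -> True
```

The asymmetry between the two predicates is exactly the warning the commenter asks the text to carry.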
The two relevant paragraphs from L2/112 are:

[131-C10] Consensus: Adopt the recommendation for requiring prefix contractions as in document L2/12-131R, with a change to 2A that it only applies to contractions ending with a non-starter. For Unicode version 6.2.

[131-A34] Action Item for Mark Davis, Editorial Committee: Add text to proposed update of UTS #10 with a review note with some text from L2/12-131R.

However, Mark Davis reports receiving a substantially different version of the action item.

One of the arguments for simplifying the processing was that doing so avoided the need to start processing from a buffer of characters and then continue with the input string, which could be coming from a data stream. However, as the proposal to require prefixes for all contractions was rejected, similar processing is required even when there are no non-starters. For example, consider the processing of the string "abcdgh" when there are contractions for "ab", "abcde" and "dgh". As a simple example, sorting Russian transliterated to English conventions according to Russian sorting rules would require contractions for "sh" and "shch", but not "shc".

In real text, examples where the algorithm change would cause problems are few and far between, but there are some potential examples in the Tibetan and Tai Tham scripts. They would be occasioned by U+0F39 TIBETAN MARK TSA -PHRU, a consonant modifier which gets positioned after the vowel(s), and by U+1A60 TAI THAM SIGN SAKOT, which can be separated from the following consonant by a tone mark. I am currently seeking evidence of actual rather than merely potential problems.

Section 4.5: Well-formedness conditions 3 to 5 are not essential to the UCA; they serve only to allow certain code optimisations. To comment well on condition 2 I need a technical term, taken from ISO 14651, which if adopted could be added to the end of Section 3.1 as: "D9.
The character or sequence of characters mapped by a collation element mapping is a _collating element_." One could replace paragraph 1 "Only well-formed weights are allowed..." by the following: "The process of forming a sort key includes mapping the string into a sequence of collating elements and then into a sequence of collation elements, discarding collation elements that are ignorable at the relevant level. The process of discarding zero weights when forming the sort key threatens to break this correspondence. Well-formedness conditions 1 and 2 are the conditions necessary to preserve this correspondence."

The example given for well-formedness condition 2 is wrong: (a) By well-formedness condition 2, 'b' shall have a secondary weight that is less than the secondary weight of a non-spacing grave! (b) For the stated ordering to hold, it is necessary that the secondary weight of the non-spacing grave be less than the secondary weight of 'c'. This is the violation of well-formedness condition 2!

Well-formedness condition 4 is explained at the end of Section 3.6.2 as a storage optimisation. Well-formedness condition 3 is presumably similarly a storage optimisation. What is not explained is why such an optimisation does not cause problems. Presumably well-formedness condition 4 works because the characters to be (partially) ignored are similar in effect to having nothing there, and adding material at the end of a string makes it sort later. It might be worth adding words to this effect.

Well-formedness condition 5 should also be explained, for the restriction to contractions ending in non-starters is peculiar (and greatly weakens the benefit of the condition). I suggest adding: "Step S2.1 may be implemented by considering both a definite initial substring S1 which has a match and a longer initial substring S2 which is the initial substring of a string with a match. Characters are added to S2 while possible, with S1 becoming S2 whenever it has a match.
In the logic of steps S2.1.1 to S2.1.3, only substrings that have matches are considered. When such an implementation is used, well-formedness condition 5 allows a program to move from the logic of Step S2.1 on encountering a non-starter rather than waiting until encountering a character which cannot be added to S2."

Sections 5 (intro) and 6.5.1: We seem to need three levels of normalisation support if we are to have proper separation of the collation capability:

off: Proper behaviour requires NFD input.
FCD: Proper behaviour requires FCD input. (An implementation of collation will have to examine the collation element tables to determine what partial normalisation it needs to do to satisfy the requirement.)
on: Proper behaviour whatever the input.

Under UTS#35, implementations claiming compliant parametric normalisation tailoring are prohibited from offering such a split via the normalisation parameter!
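[Editorial note] The S1/S2 strategy quoted above can be sketched in code; this is a minimal illustration using a made-up table containing the "ab", "abcde" and "dgh" contractions from the earlier "abcdgh" example, not anything from DUCET:

```python
# Illustrative contraction table and the prefixes of its entries.
TABLE = {"a", "b", "c", "d", "g", "h", "ab", "abcde", "dgh"}
PREFIXES = {e[:k] for e in TABLE for k in range(1, len(e) + 1)}

def next_match(s, i):
    """Return (end_index, matched_text) for the longest match starting at i.

    S2 grows while it remains a prefix of some table entry; S1 records
    the longest initial substring that actually has a match, and S1
    becomes S2 whenever S2 itself has a match.
    """
    s1_end = None                           # end of S1 (longest real match)
    j = i
    while j < len(s) and s[i:j + 1] in PREFIXES:
        j += 1                              # extend S2 by one character
        if s[i:j] in TABLE:
            s1_end = j                      # S1 becomes S2
    if s1_end is None:
        return i + 1, s[i]                  # no mapping: single character
    return s1_end, s[i:s1_end]

# "abcd" is a prefix of "abcde", but "abcdg" is not, so the matcher
# must fall back from the failed longer candidate to "ab":
print(next_match("abcdgh", 0))  # -> (2, 'ab')
print(next_match("abcdgh", 3))  # -> (6, 'dgh')
```

This also shows why, even with no non-starters in sight, the matcher must be able to resume from characters it had already consumed, which is the commenter's point about the rejected prefix-contraction requirement.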
Date/Time: Thu Jul 19 14:43:18 CDT 2012
Contact: richard.wordingham@ntlworld.com
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #223: Proposed Update to UTS#35 LDML
Section 5.14.3 Numeric: It is good to have a definition of the location of the primary weights of the decimal digit sequences.

Section 5.14.3 Alternate: The new text says that "shifted" and "ignoresp" are synonymous. Is that intended? I presume the difference between UCA "shift" and "ignoresp" is to be handled by the variableTop setting. Or is "ignoresp" meant to handle discontiguous ranges of variable weights? Discontiguity arises from the ordering of the punctuation character U+10A7F OLD SOUTH ARABIAN NUMERIC INDICATOR, which in the "ducet" collation is ordered between numbers which are not 'decimal digits'.

Section 5.14.13 Case Level: At present, cases for characters are derived from the tertiary weights using the information in UTS#10 Section 7.3. Weights are treated as upper case if recorded as upper case or as normal or narrow kana. It is only in the case of contractions created by tailorings that derivation rules are missing. Thus an application can currently support the case tailorings (though not 'rules') on the basis of UnicodeData.txt and one of allkeys.txt, allkeys_CLDR.txt and FractionalUCA.txt. Fuller support of Unicode rules will now be required for the implementation of case tailorings.

If the procedure is only intended to apply to contractions created in the 'rules' (by <p>, <s>, <t>, <q> and their derivatives), then the process is clear enough, but such a restriction should be stated. In this case, it should also be stated whether it applies to <i>, or whether <i> preserves case modifications. If the procedure is to be applied more widely, then presumably it applies to all mappings, including contractions and formal expansions. Does it apply to expansions in tailorings for the expansion part, or do the characters added in the expansion retain their original case mapping properties? For example, would &c <<< k/H result in 'k' having two mixed case collation elements, or a lower case and an upper case collation element?
Would &h <<< C | hh result in Chh having two mixed case collation elements, or an upper and a lower case collation element?

The list of upper exceptions should be given in terms of code points, just as the list of lower exceptions is.

At present, U+00D8 LATIN CAPITAL LETTER O WITH STROKE is collated, to the first three levels, identically to the sequence <U+004F LATIN CAPITAL LETTER O, U+0338 COMBINING LONG SOLIDUS OVERLAY>. If the change applies to formal expansions, they will no longer be collated identically when case ordering is enabled or a case layer is inserted, for both collation elements of U+00D8 will be upper case but <U+004F, U+0338> will have one upper and one lower case collation element. This would appear to be unintended. It may be possible to fix this problem by changing the derived collation elements for secondary elements from 0.s.ct and 0.s.c.t to 0.s.1t and 0.s.1.t, but this seems very ad hoc.

If the procedure is to be applied to mappings already in the UCA collation tables, it will change the casing of circled and squared katakana, such as U+32D0 CIRCLED KATAKANA A and U+1F213 SQUARED KATAKANA DE, which are currently treated as lower case. After the change, they will be treated as upper case. Is this change intended?

The weights given for tertiary elements produce an ill-formed collation element table. Note that the normal DUCET tertiary weights cannot be applied to tertiary elements, for so doing would produce an ill-formed collation element table. (DUCET has no tertiary elements, while the CLDR root locale collation has exactly one if one believes allkeys_CLDR.txt is wrong.) The modification to 0.0.ct needs to be changed to a modification to 0.0.(c+3)t; the third weight must be greater than cu for any u that is the third weight of a primary or a secondary element. The modification to 0.0.c.t needs to be replaced for the same reason.
To motivate the replacement, I considered a tailoring &\u0000 <<< ch | h and the strings 'chan', 'chhaN', 'chhan' and 'chaN'. (The motivation for the example is that Indic CHA is sometimes transliterated as 'chh'.) If we insert a case level and allow the 0.0.c.t weight, we get the ordering 'chan', 'chhan', 'chhaN', 'chaN', typical of an ill-conditioned table. If we replace 0.0.c.t by 0.0.(c+3).t, we get 'chan', 'chaN', 'chhan', 'chhaN'. If we replace 0.0.c.t by 0.0.0.t, we get 'chan', 'chhan', 'chaN', 'chhaN'. With another tailoring having a similar effect, &h <<< c | hh, we would also get 'chan', 'chhan', 'chaN', 'chhaN'. I therefore recommend replacing 0.0.c.t by 0.0.0.t.
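[Editorial note] The Numeric option discussed at the start of this report (Section 5.14.3) weights a run of decimal digits by its numeric value. A minimal illustration follows; the key function is a stand-in of my own, not CLDR's actual implementation:

```python
import re

def numeric_key(s):
    """Split a string into text runs and digit runs; digit runs compare
    by numeric value rather than code point by code point."""
    parts = re.split(r"(\d+)", s)
    # Tag each part so text runs and digit runs never compare directly.
    return [(1, int(p)) if p.isdigit() else (0, p) for p in parts if p]

print(sorted(["file10", "file9", "file1"], key=numeric_key))
# -> ['file1', 'file9', 'file10']
```

With plain code-point comparison the same list would sort as 'file1', 'file10', 'file9'; the numeric option exists precisely to avoid that.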
Date/Time: Wed Jul 25 14:48:29 CDT 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: PRI #223, UCA 6.2: diffs DUCET-CLDR
At the end of UTS #10 (UCA) Section 3.6, DUCET, there is some text describing how the CLDR root collation differs from the DUCET. It would be cleaner to move that text elsewhere, probably into CollationAuxiliary.html.

The current draft has this text: "Note also that [CLDR] tailors general symbols to be classified with the regular groups, not the variable groups, using the IgnoreSP option. CLDR also adds tailorings of two special values: The code point U+FFFF is tailored to have a primary weight higher than all other characters. This allows the reliable specification of a range, such as "Sch" ≤ X ≤ "Sch\uFFFF", to include all strings starting with "sch" or equivalent. The code point U+FFFE produces a CE with special minimal weights on all levels, regardless of alternate handling. This allows for Merging Sort Keys within code point space. For example, when sorting names in a database, a sortable string can be formed with last_name + '\uFFFE' + first_name. These strings would sort properly, without ever comparing the last part of a last name with the first part of another first name. So as to maintain the highest and lowest status, in CLDR these values are not further tailorable, and nothing can be tailored to have the same primary weights."

We should keep the last line of Section 3.6 where it is: "For most languages, some degree of tailoring is required to match user expectations. For more information, see Section 5, Tailoring."
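[Editorial note] The two CLDR special values described in the quoted draft text can be sketched as follows, using raw code-point string comparison as a stand-in for primary-level comparison. Note one deliberate substitution: CLDR gives U+FFFE minimal weights so it compares lowest, but in raw code-point order U+FFFE compares near the top, so '\x01' stands in for it here. U+FFFF needs no substitute, since it is highest in both orders for BMP text.

```python
# Merging sort keys: last name and first name compare field by field.
SEP = "\x01"   # stand-in for U+FFFE's minimal-weight behavior
names = [("Smithson", "Al"), ("Smith", "Zoe"), ("Smith", "Al")]
merged = sorted(last + SEP + first for last, first in names)
print([m.split(SEP) for m in merged])
# -> [['Smith', 'Al'], ['Smith', 'Zoe'], ['Smithson', 'Al']]
# "Smithson" never splits the block of "Smith" entries, because the
# separator compares lower than any letter.

# Range over all strings starting with "Sch": U+FFFF sorts highest.
words = ["Schiller", "Scott", "Schmidt", "Sch"]
in_range = [w for w in words if "Sch" <= w <= "Sch\uffff"]
print(sorted(in_range))  # -> ['Sch', 'Schiller', 'Schmidt']
```

A real implementation would of course compare sort keys built from the tailored weights, not raw strings; the sketch only shows why the two endpoints are useful.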
Date/Time: Fri Jul 27 16:39:06 CDT 2012
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Public Review Issue
Opt Subject: PRI #223 CollationTest.html format issues
The script generating the UCA CollationTest data file apparently has bugs in it.

Item #1: Look, for example, at entries like: 1D1BB 0334 1D16F. The script is generating the UTF-8 for the display characters wrong, ending up with a '\uD834' entry, and also ends up with the wrong character name, a code point label: <surrogate-D834>. I'm guessing there is some bad interaction between whatever the script may be doing to eliminate the repetitive listing of second elements that get repeated a zillion times, like the question marks and exclamation points, and what is happening for the entries which aren't just of the repetitive cp 003F, cp 0334, cp 0021 type.

Item #2: Also, for the SHIFTED files, there aren't any quaternary weights in the sort key representation, which seems incorrect to me.

Item #3: CollationTest.html also doesn't show that the code point (or code point sequence, if not abbreviated) is listed in parentheses between the "#" and the character name(s), or what the conventions are for abbreviation of sequences. (This is not a bug in the script for generating the CollationTest files, but is related to the format issues.)
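[Editorial note] The expected handling of the entry cited in Item #1 can be sketched as follows: supplementary code points should be converted whole to UTF-8, never split into UTF-16 surrogate halves such as U+D834 first:

```python
# The code points from the CollationTest entry cited above.
cps = [0x1D1BB, 0x0334, 0x1D16F]

# Build the string from whole code points, not surrogate pairs.
s = "".join(chr(cp) for cp in cps)

# None of the resulting characters is an isolated surrogate.
assert all(not 0xD800 <= ord(c) <= 0xDFFF for c in s)

# Correct UTF-8: each supplementary character becomes a 4-byte sequence.
print(s.encode("utf-8").hex(" "))
# -> f0 9d 86 bb cc b4 f0 9d 85 af
```

A generator that first chops the text into UTF-16 code units and then encodes each unit separately would instead see the lead surrogate 0xD834, which is presumably how the '\uD834' entry and the <surrogate-D834> label arose.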
Date/Time: Tue Aug 21 14:15:53 CDT 2012
Contact: cdutro@twitter.com
Name: Cameron Dutro
Report Type: Other Question, Problem, or Feedback
Opt Subject: Clarifying French Backwards Accent Sorting in TR-10
The TR-10 document is written as though French backwards accent sorting applies to all French dialects, when in reality it only applies to Canadian French. Can the document be updated to mention this fact? Relevant tickets: http://unicode.org/cldr/trac/ticket/2905 and http://unicode.org/cldr/trac/ticket/2984. Thanks!