This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Tue Oct 1 20:24:36 CDT 2019
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #404: [:toNFC=Å:]
Section 2.8 “Optional Properties” describes [:toNFC=Å:] as “The set of all strings X such that toNFC(X) = "a"”. Shouldn’t that be “... = "Å"”?
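As a rough illustration of the set under discussion, the following Python sketch uses the standard `unicodedata` module to check which strings normalize to "Å" under NFC. The candidate list is an assumption made for this example and is not exhaustive.

```python
import unicodedata

TARGET = "\u00C5"  # Å, LATIN CAPITAL LETTER A WITH RING ABOVE

# Hypothetical candidate strings, chosen only for illustration.
candidates = [
    "\u00C5",   # precomposed Å
    "A\u030A",  # A followed by COMBINING RING ABOVE
    "\u212B",   # ANGSTROM SIGN (singleton canonical decomposition to Å)
    "a",        # the value given in the draft text
    "A",
]

# Keep only the strings X for which toNFC(X) equals the target.
matches = [s for s in candidates if unicodedata.normalize("NFC", s) == TARGET]
```

Under this check the plain letter "a" is not a member of the set, which is consistent with the correction suggested above.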
Date/Time: Sun Oct 6 13:33:23 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #404
UTS #18 states, "All syntax and API presented in this document is only for the purpose of illustration; there is absolutely no requirement to follow such syntax or API."
1. Consequently, the following proposed text in Section 1.2 is illegitimate: "For best compatibility, expressions involving properties with set values should be interpreted as containment, not equality." I suggest a wording such as, "Mandatorily provided expressions involving set values shall provide tests for the sets containing, or not containing, values. An implementation might choose to use a notation used for equality with other properties to denote containment instead, and likewise for inequality. Such notation is used in examples here and in the LDML."
2. What is a literal cluster? Is it a string element of a Unicode set?
3. While the proposed text says, "The syntax for Character Ranges could be extended to allow or strings, but that is not required by this specification", it would appear that RL2.2 needs some such extension. In the definition of 'character ranges', it would be helpful to say that a character range may also include a *finite* set of strings.
4. Is there a list of string properties defined by the Unicode standards and data files?
5. If what appears to be a character property is a string property, then we need a way to restrict its scope to characters as opposed to strings. Having a suggestion would be good, and it is probably necessary for CLDR.
Feedback above this line was reviewed during UTC #161 in October, 2019.
Date/Time: Thu Oct 10 17:11:17 CDT 2019
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Wrong Section Reference from UTS#18 to UTS#10
There is a wrong section reference following the definition of RL3.2. Instead of "Section 6.9" in "See Section 6.9, Handling Collation Graphemes in UTS #10", it should be "Section 9.9".
Date/Time: Sat Oct 19 08:45:43 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 404: RL3.2
Order
-----
I do not believe that it is intended to restrict this concept to collations used for ordering; collations used for searching make sense. The word 'order' needs to be removed or de-emphasised. For example, the visual order Tai scripts have contractions of preposed vowel plus consonant in DUCET, but these are removed for search collations.
Canonical Equivalence
---------------------
The mathematically-natural extension of strings of 8-bit characters is traces of Unicode strings under canonical equivalence. However, RL3.2 is completely inappropriate for this extension, for collation grapheme clusters are not closed under canonical equivalence. (Example: The common-enough misspelling รูู้ <U+0E23, U+0E39, U+0E39, U+0E49> of รู้. Under DUCET, the normalised form consists of three collation grapheme clusters, but one of the canonical equivalents consists of two collation grapheme clusters.)
Possible Solution
-----------------
It might be possible to make the removal of RL3.2 for Level 3 conformance dependent on compliance with a re-instated RL2.1 with teeth. For example, I would expect "óớ" to match "(ó)+(\u031b)+".
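The two canonical-equivalence examples in this comment can be checked with a minimal Python sketch using the standard `unicodedata` module. The reordered Thai sequence below is an assumption about which canonical equivalent is meant, and the helper name `canonically_equivalent` is introduced only for this example. The sketch does not compute collation grapheme clusters, so the cluster counts quoted above are not verified here.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The misspelling quoted above: <U+0E23, U+0E39, U+0E39, U+0E49>.
misspelling = "\u0E23\u0E39\u0E39\u0E49"
# Assumed canonical equivalent: the tone mark placed between the two vowel signs.
reordered = "\u0E23\u0E39\u0E49\u0E39"
thai_equivalent = canonically_equivalent(misspelling, reordered)

# "óớ" versus a string that the pattern (ó)+(\u031b)+ matches literally.
text = "\u00F3\u1EDB"            # ó followed by ớ (o with horn and acute)
expanded = "\u00F3\u00F3\u031B"  # ó, ó, COMBINING HORN
horn_equivalent = canonically_equivalent(text, expanded)
```

Both comparisons come out equal because canonical reordering sorts the combining marks by combining class, so a matcher that honours canonical equivalence would be expected to treat each pair alike.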
Date/Time: Sat Oct 19 19:31:51 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 404 - Update to UTS #18
For Section 2.8, \p{isNFD} is useful as a *character* property in contexts such as [\p{L}&\p{isNFD}]\p{Mn}*, where one is handling all letters, regardless of whether they're explicitly encoded in Unicode.
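Python's built-in `re` module does not support `\p{...}`, so the following is only a hand-written sketch of what the pattern above would accept, assuming that isNFD applied to a single character means "the one-character string is already in NFD". The function names are hypothetical, and `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata
from typing import Optional

def is_nfd_letter(ch: str) -> bool:
    # Emulates [\p{L}&\p{isNFD}]: a letter whose one-character string is in NFD.
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.is_normalized("NFD", ch))

def match_letter_with_marks(text: str, start: int = 0) -> Optional[int]:
    # Emulates [\p{L}&\p{isNFD}]\p{Mn}* anchored at `start`.
    # Returns the end index of the match, or None if there is no match.
    if start >= len(text) or not is_nfd_letter(text[start]):
        return None
    end = start + 1
    while end < len(text) and unicodedata.category(text[end]) == "Mn":
        end += 1
    return end

decomposed_end = match_letter_with_marks("e\u0301x")  # e + COMBINING ACUTE, then x
precomposed_end = match_letter_with_marks("\u00E9x")  # precomposed é is not in NFD
```

On NFD input the sketch consumes a base letter together with its nonspacing marks, whether or not the combination has a precomposed code point.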
Date/Time: Sun Jan 5 23:41:36 CST 2020
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: PRI 404 UTS #18
I am using a single form to submit all my comments about this PRI. I hope that's most convenient for you.
I am used to seeing the document organized by level, and retaining that is fine with me. Maybe newcomers would be better off the other way; I don't know.
Character class is a better term, so I agree with that change.
I'm fine with removing level 3 conformance.
I think the additions to 1.2 are good. Note that string matching has long been an issue under case-insensitive matching, when a single code point may case fold to a string. The most common example is the LATIN SHARP S, both upper and lower, which matches strings like 'SS', 'Ss', and \x{17f}\x{17f} caselessly.
Regarding \m vs \p, I support your option 2, to use \m in this document for the most clarity.
Perl does support symmetric difference, and will continue to support it even if you remove it. It corresponds to exclusive or. I do not know how much actual use it has gotten. An argument for keeping it is that it creates a complete set of operations on sets that correspond to non-set operators.
FYI, I originally implemented union, etc., in Perl all at the same precedence level, but user feedback that this was confusing, given that the language otherwise emulates C precedence, forced me to change it. I would expect that the precedence chosen should mirror that of the containing language.
You say "it far is better to write \u{1F44E) rather than \uD83D\uDC4E (using UTF-16) or \xF0\x9F\x91\x8E (using UTF-8)." I completely agree with that, but I think it should be phrased "it is far better ..."
Section 2.7: I would like to see an example of a useful regex that contains the newly required Equivalent_Unified_Ideograph property. Though unchanged in this release, the earlier example "Characters with names starting with "LATIN LETTER" and ending with "P":" seems to me like one that no one would ever want to use except some nerds out drinking, looking for Unicode trivia. Names are pretty arbitrary and capricious, and I don't understand what the motivation for this query would be. I could see some use for the one about names containing "variation", etc., as those names are less capricious (I hope anyway).
I have some problems with Identifier_Status and _Type. I'm fine with including them. I do wish that any required property files would be part of the UCD, instead of me having to go fish for them each new release. I also don't think these are ready for prime time. People writing regex patterns using them will really want to have abbreviated names for them and their property values, as these are quite long. If you actually put them into the UCD, I suspect you would think more about good abbreviations, and add those. And contrary to what is said in TR39, these files aren't completely in the UCD format. They lack an @missing line, for example, and the heading comments are in a different format, which would make me modify my parser to handle them. And it just seems sloppy for you to not bother to make things consistent, forcing extra work on those who would implement your standard. The new file RegexPropertyAliases.txt will help, but it has a completely different format that will force me to write code to parse it.
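The remark earlier in this comment about LATIN SHARP S under case-insensitive matching can be illustrated with Python's `str.casefold`, which applies full Unicode case folding. This is only a sketch of the folding behaviour, not of how Perl or any other regex engine implements caseless matching, and the variable names are chosen for this example.

```python
# Both forms of sharp s: U+00DF (lowercase) and U+1E9E (capital).
sharp_s_forms = ["\u00DF", "\u1E9E"]

# Strings mentioned in the comment; U+017F is LATIN SMALL LETTER LONG S.
candidates = ["SS", "Ss", "\u017F\u017F"]

# Full case folding maps each single code point and each string to "ss".
folded = {s: s.casefold() for s in sharp_s_forms + candidates}

# True when every entry folds to the same string.
all_match = len(set(folded.values())) == 1
```

Because one code point folds to a two-character string, a caseless matcher has to compare strings rather than individual code points, which is the difficulty the comment points out.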
Date/Time: Mon Jan 6 12:48:09 CST 2020
Name: Nozomu Katō
Report Type: Public Review Issue
Opt Subject: PRI 404 UTS#18
About 1.2 Properties: I do not think that it is a good idea to use the \p notation for expressing specific sequences of code points, i.e., properties of strings, in addition to a single code point. I have the impression that it is not kind to users that some \p{...} can be used in a character class while some \p{...} cannot. What are called "properties of strings" in the proposed text look like aggregate versions of Named Character Sequences. Whichever option TC39 chooses, I would not like the UTC to support or promote the use of \p for both properties of characters and strings.
Feedback above this line was reviewed during UTC #162 in January, 2020.