Accumulated Feedback on PRI #404

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Tue Oct 1 20:24:36 CDT 2019
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #404: [:toNFC=Å:]

Section 2.8 “Optional Properties” describes [:toNFC=Å:] as 
“The set of all strings X such that toNFC(X) = "a"”. 
Shouldn’t that be “... = "Å"”?
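
As a small illustration of what such a set would contain, the sketch below
uses Python's standard unicodedata module to test a few candidate strings
against toNFC(X) = "Å". The helper name and the candidate list are
illustrative assumptions, not taken from UTS #18.

```python
import unicodedata

TARGET = "\u00C5"  # Å, LATIN CAPITAL LETTER A WITH RING ABOVE

def in_to_nfc_set(s, target=TARGET):
    # True when the NFC form of s is exactly the target string.
    return unicodedata.normalize("NFC", s) == target

# Candidates: precomposed Å, A + COMBINING RING ABOVE, ANGSTROM SIGN, "a", "A".
candidates = ["\u00C5", "A\u030A", "\u212B", "a", "A"]
members = [c for c in candidates if in_to_nfc_set(c)]
```

Of these candidates, only the three spellings of Å are members; "a" is not.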

Date/Time: Sun Oct 6 13:33:23 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI #404

UTS #18 states, "All syntax and API presented in this document is only for
the purpose of illustration; there is absolutely no requirement to follow
such syntax or API."

1. Consequently, the following proposed text in Section 1.2 is illegitimate:
"For best compatibility, expressions involving properties with set values
should be interpreted as containment, not equality."  I suggest a wording
such as, "Mandatorily provided expressions involving set values shall
provide tests for the sets containing, or not containing, values.  An
implementation might choose to use a notation used for equality with other
properties to denote containment instead, and likewise for inequality.  Such
notation is used in examples here and in the LDML."   

2. What is a literal cluster?  Is it a string element of a Unicode set?

3. While the proposed text says, "The syntax for Character Ranges could be
extended to allow or strings, but that is not required by this
specification", it would appear that RL2.2 needs some such extension.  In
the definition of 'character ranges', it would be helpful to say that a
character range may also include a *finite* set of strings.

4. Is there a list of string properties defined by the Unicode standards and
data files?

5. If what appears to be a character property is a string property, then we
need a way to restrict its scope to characters as opposed to strings. 
Having a suggestion would be good, and it is probably necessary for CLDR.

Feedback above this line was reviewed during UTC #161 in October, 2019.

Date/Time: Thu Oct 10 17:11:17 CDT 2019
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Wrong Section Reference from UTS#18 to UTS#10

There is a wrong section reference following the definition of RL3.2.  Instead 
of "Section 6.9" in "See Section 6.9, Handling Collation Graphemes in UTS #10", 
it should be "Section 9.9".

Date/Time: Sat Oct 19 08:45:43 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 404: RL3.2

Order
-----

I do not believe that it is intended to restrict this concept to collations
used for ordering; collations used for searching make sense.  The word
'order' needs to be removed or de-emphasised.  For example, the visual order
Tai scripts have contractions of preposed vowel plus consonant in DUCET, but
these are removed for search collations.

Canonical Equivalence
---------------------
The mathematically-natural extension of strings of 8-bit characters is
traces of Unicode strings under canonical equivalence.  However RL3.2 is
completely inappropriate for this extension, for collation grapheme clusters
are not closed under canonical equivalence.  (Example: The common-enough
misspelling รูู้ <U+0E23, U+0E39, U+0E39, U+0E49> of รู้.  Under
DUCET, the normalised form consists of three collation grapheme clusters,
but one of the canonical equivalents consists of two collation grapheme
clusters.)
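
The canonical equivalence in this example can be checked with a short
sketch using Python's standard unicodedata module. It shows only that the
two code point orders are canonically equivalent; it does not compute
collation grapheme clusters, which would need DUCET data. The variable and
function names are illustrative assumptions.

```python
import unicodedata

# The spelling quoted above: <U+0E23, U+0E39, U+0E39, U+0E49>.
normalised = "\u0E23\u0E39\u0E39\u0E49"
# One canonical equivalent: the tone mark moved between the two vowel signs.
reordered = "\u0E23\u0E39\u0E49\u0E39"

def canonically_equivalent(a, b):
    # Two strings are canonically equivalent when their NFD forms are equal.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The marks have different non-zero combining classes, so they may reorder.
ccc_vowel = unicodedata.combining("\u0E39")
ccc_tone = unicodedata.combining("\u0E49")
```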

Possible Solution
-----------------
It might be possible to make the removal of RL3.2 for Level 3 conformance
dependent on compliance with a re-instated RL2.1 with teeth.  For example, I
would expect "óớ" to match "(ó)+(\u031b)+".
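
A minimal sketch of why this match is expected under canonical equivalence,
using Python's standard re and unicodedata modules. Python's re module
matches code points literally and is used here only to show the gap; the
variable names are illustrative assumptions.

```python
import re
import unicodedata

text = "\u00F3\u1EDB"            # "óớ" in precomposed form
pattern = "(\u00F3)+(\u031B)+"   # (ó)+ followed by one or more COMBINING HORN

# A canonically equivalent spelling: ó, ó, COMBINING HORN.
equivalent = "\u00F3\u00F3\u031B"
same_nfc = unicodedata.normalize("NFC", equivalent) == text

# Literal code point matching succeeds only on the second spelling.
literal_match = re.fullmatch(pattern, text)
equivalent_match = re.fullmatch(pattern, equivalent)
```

The two strings are canonically equivalent, yet a matcher that works on
literal code points accepts only one of them.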

Date/Time: Sat Oct 19 19:31:51 CDT 2019
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 404 - Update to UTS #18

For Section 2.8, \p{isNFD} is useful as a *character* property in contexts 
such as [\p{L}&\p{isNFD}]\p{Mn}*, where one is handling all letters, 
regardless of whether they're explicitly encoded in Unicode.
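
The expression above can be approximated by hand, as a rough sketch, in
Python, whose re module has no \p support. Both function names are
hypothetical, and "isNFD" is interpreted here as "the character is its own
NFD form", which is an assumption about the intended property.

```python
import unicodedata

def is_nfd_letter(ch):
    # Stand-in for [\p{L}&\p{isNFD}]: a letter whose NFD form is itself.
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter and unicodedata.normalize("NFD", ch) == ch

def match_letter_with_marks(s, start=0):
    # Returns the end index of a match at `start`, or -1 if there is none.
    if start >= len(s) or not is_nfd_letter(s[start]):
        return -1
    end = start + 1
    # Consume any following nonspacing marks, as \p{Mn}* would.
    while end < len(s) and unicodedata.category(s[end]) == "Mn":
        end += 1
    return end
```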

Date/Time: Sun Jan 5 23:41:36 CST 2020
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: PRI 404 UTS #18


I am using a single form to submit all my comments about this PRI.  I hope
that's most convenient for you.

I am used to seeing the document organized by level.  And retaining that is
fine with me.  Maybe newcomers would be better off the other way; I don't
know.

Character class is a better term, so I agree with that change.

I'm fine with removing level 3 conformance.

I think the additions to 1.2 are good.  Note that string matching has long
been an issue under case-insensitive matching, when a string may be case
folded to by a single code point.  The most common example is the LATIN
SHARP S, both upper and lower, which matches strings like 'SS', 'Ss',
\x{17f}\x{17f} caselessly.
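
The sharp s case can be illustrated with Python's built-in str.casefold(),
which applies full Unicode case folding. This is only a sketch of the
folding behaviour, not of how any regex engine implements caseless
matching; the variable names are illustrative assumptions.

```python
# Lower and upper LATIN SHARP S both fold to the two-character string "ss".
sharp_s_folds = ["\u00DF".casefold(), "\u1E9E".casefold()]

# Strings that fold to the same value, including two LATIN SMALL LETTER LONG S.
variants = ["SS", "Ss", "ss", "\u017F\u017F"]
variant_folds = [v.casefold() for v in variants]
```

A single code point on one side therefore has to match a two-character
string on the other.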

Regarding \m vs \p, I support your option 2, to use \m in this document for
the most clarity.

Perl does support symmetric difference, and will continue to support it even
if you remove it.   It corresponds to exclusive or.  I do not know how much
actual use it has gotten.  An argument for keeping it is that it creates a
complete set of operations on sets that correspond to non-set operators.
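
For readers unfamiliar with the operation, here is a minimal sketch using
Python's built-in set type rather than Perl or any regex syntax. The two
example sets are arbitrary assumptions chosen for illustration.

```python
latin_vowels = set("aeiou")
early_letters = set("abcde")

# Symmetric difference: members of exactly one of the two sets.
sym_diff = latin_vowels ^ early_letters

# The same result written with union, intersection and difference.
via_other_ops = (latin_vowels | early_letters) - (latin_vowels & early_letters)

# Membership is the exclusive or of membership in each operand.
xor_check = all(
    (c in sym_diff) == ((c in latin_vowels) != (c in early_letters))
    for c in "abcdeiou"
)
```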

FYI, I originally implemented union, etc., in Perl all at the same
precedence level, but user feedback that this was confusing, given that the
language otherwise emulates C precedence, forced me to change it.  I would
expect that the precedence chosen should mirror that of the containing
language.

You say "it far is better to write \u{1F44E) rather than \uD83D\uDC4E (using
UTF-16) or \xF0\x9F\x91\x8E (using UTF-8)." I completely agree with that,
but I think it should be phrased "it is far better ..."
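
The three spellings refer to the same character, which a short Python
sketch can confirm. Python writes the code point escape as \U0001F44E
rather than \u{1F44E}; the variable names are illustrative assumptions.

```python
thumbs_down = "\U0001F44E"  # U+1F44E

code_point = ord(thumbs_down)
# UTF-16 code units, big-endian, as a hex string: the surrogate pair.
utf16_units = thumbs_down.encode("utf-16-be").hex()
# UTF-8 bytes of the same character.
utf8_bytes = thumbs_down.encode("utf-8")
```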

Section 2.7:

I would like to see an example of a useful regex that contains the newly
required Equivalent_Unified_Ideograph property. 

Though unchanged in this release, the earlier example "Characters with names
starting with "LATIN LETTER" and ending with "P":" seems to me like no one
would ever want to use this except some nerds out drinking, looking for
Unicode trivia.  Names are pretty arbitrary and capricious, and I don't
understand what the motivation for this query would be.  I could see some
use for the one about names containing "variation", etc, as those names are
less capricious (I hope anyway).

I have some problems with Identifier_Status and _Type.  I'm fine with
including them.  I do wish that any required property files would be part of
the UCD, instead of me having to go fish for them each new release.  I also
don't think these are ready for prime time.  People writing regex patterns
using them will really want to have abbreviated names for them and their
property values, as these are quite long.  If you actually put them into the
UCD, I suspect you would think more about good abbreviations, and add those.
And contrary to what is said in TR39, these files aren't completely in
the UCD format.  They lack an @missing line for example, and the heading
comments are in a different format, which would make me modify my parser to
handle them.  And it just seems sloppy for you to not bother to make things
consistent, forcing extra work on those who would implement your standard. 
The new file RegexPropertyAliases.txt will help, but it has a completely
different format that will force me to write code to parse it.


Date/Time: Mon Jan 6 12:48:09 CST 2020
Name: Nozomu Katō
Report Type: Public Review Issue
Opt Subject: PRI 404 UTS#18

About 1.2 Properties:

I do not think that it is a good idea to use the \p notation for expressing
specific sequences of code points, i.e., properties of strings, in addition
to a single code point. I have the impression that it is not kind to users
that some \p{...} can be used in a character class while some \p{...}
cannot.

What are called "properties of strings" in the proposed text look like
aggregate versions of Named Character Sequences.

Whichever option TC39 chooses, I would not like the UTC to support or
promote the use of \p for both properties of characters and strings.

Feedback above this line was reviewed during UTC #162 in January, 2020.