This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Sun Aug 23 18:12:18 CDT 2015
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Word Boundary Property
In UAX#29, the paragraph before Section 4.1 refers to a 'word boundary property'. Either this should be 'word break property', or the term should be defined.
Date/Time: Sat Oct 24 22:24:23 CDT 2015
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UAX 29 issues
It makes no sense for \b{} in a regular expression pattern to match an empty string, as there is nothing to break at. But the first rule for all four boundary types is that there is a break at SOT and at EOT. I assert that this should be clarified so that there is no break if there is no text. An empty string is a common thing to match against a pattern.
I wrote an email to the unicode.org mailing list two months ago, asking for guidance from Unicode about the thinking behind the rules of \b{wb}. It is <55D8D6AE.8080008@khwilliamson.com>. The single response was from Richard Wordingham, who I find is usually astute in his answers but who does not speak, even informally, for the Consortium. I posted it there so that an answer would be public and someone in the future might see it rather than bother you again with the same question. But since there was no answer, I'll repeat the question here:
The concept of \b in a regular expression, meaning a match at the boundary between a word and a non-word, was invented by Larry Wall for the Perl programming language. This was before Unicode, and a word was defined as alphanumerics plus the underscore, which fit well with how identifiers in that computer language (and many others) were defined. Essentially, \b is defined to break between runs of word characters and runs of non-word characters. The latest version of Perl 5 (recently released) has added \b{wb}, based on Unicode's definition. The typical expectation of its programmers is that it would be a drop-in replacement for the old \b, with much better results in parsing natural languages. But it isn't such a replacement, which has created some consternation. The main reason is that, unlike \b, it treats the boundary between white space characters as a breaking opportunity, so it doesn't create runs of them. Thus, if you have two spaces after a full stop, it treats each as an individual word. My question is: "Was this intentional, and if so, why?"
TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that this is different than \b alone, which corresponds to \w and \W." And UAX29 says "adjacent spaces are collapsed to a single space" in intelligent cut and paste using the WB property.
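The behavior of the classic \b described in this report can be observed in any regex engine that follows the \w/\W convention. The following is a minimal sketch using Python's standard `re` module rather than Perl; the helper name `classic_boundaries` and the sample string are illustrative assumptions, not part of the report.

```python
import re

def classic_boundaries(text):
    # Offsets where the classic \b matches, i.e. where a \w character
    # meets a \W character or the start/end of the string.
    return [m.start() for m in re.finditer(r"\b", text)]

# Two spaces after a full stop, as in the report's example.
sample = "Hi.  There"
print(classic_boundaries(sample))  # [0, 2, 5, 10]

# An empty string has no word characters, so the classic \b never matches.
print(classic_boundaries(""))      # []
```

Offsets 3 and 4 (before and between the two spaces) are absent from the first result, which is the "runs of non-word characters" behavior that the report contrasts with \b{wb}.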
Date/Time: Wed Oct 28 10:40:12 CDT 2015
Name: Likasoft
Report Type: Error Report
Opt Subject: U+19B0 and TR29
Hello. Revision 27 of TR29 says that you removed U+19B0..U+19C9 from the exception list. This means that they should appear in GraphemeBreakProperty.txt as SpacingMark, but they do not. Could you please say which file is correct? Bye.
Feedback above this line was reviewed at the November 2015 UTC meeting.
Date/Time: Mon Apr 4 22:33:32 CDT 2016
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Changes to UAX 29
Here are your review issues, and my responses to each one I have an opinion on:
GB10 was renamed to GB999, since it is always the final rule. I agree with the renaming.
Rule GB8c could be dropped, since it is a by-product of GB999. I favor dropping it.
It would be cleaner to move the RI rules to be GB12+, to group the emoji rules together. I favor doing this.
WB13c was replaced by WB15..WB17. Fine with me.
WB14 was renamed WB999, since it will always be the final rule. I agree with the renaming.
WB17 could be dropped, since it is a by-product of WB999. I favor dropping it.
If we moved WB14..WB16 above WB4, it would disallow certain degenerate cases, such as a combining mark between two RI characters. I think it should be moved, but I have not studied it for any deleterious effects.
Should we add Any × (Format | Extend) as an explicit rule? I'm uncertain about doing this, but if this is not done, it should be mentioned in the rule description, for both WB and SB.
Should we disallow breaks between sequences of whitespace? I favor doing this, depending on how it would impact other languages besides Perl 5. If you don't do it, you should mention the possibility of tailoring to get this effect, which Perl 5 has already done.
Mark Davis asked me to submit Perl's tailored rules. We have created a new property value, Horizontal Space, which includes all characters of that ilk. For Perl 5, I tried to make our tailoring as minimally disruptive as possible, so only one rule would be affected:
WB3 Do not break between CR LF nor between Horizontal Space, nor between Horizontal Space and sequences of Extend and/or Format
This is a higher priority rule than WB4, about ignoring Extend and Format. If this change were to be done generally, my concern about making a minimal tailoring would not be a consideration, and the rules could be rewritten so as to mention Extend and Format only in WB4. However, moving WB14..WB16 above WB4 might affect this.
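The effect of the whitespace part of this tailoring can be sketched as a post-filter over break offsets. This is only an illustration under stated assumptions: `HORIZONTAL_SPACE` is a small hand-picked subset rather than Perl's actual Horizontal Space property value, the input offsets are assumed to come from some default word segmenter, and the Extend/Format part of the tailored WB3 is not modeled.

```python
# Assumption: an illustrative subset of horizontal-space characters,
# not the real Horizontal Space property data used by Perl 5.
HORIZONTAL_SPACE = {" ", "\t", "\u00a0", "\u3000"}

def suppress_space_breaks(text, breaks):
    # Drop every break offset that falls between two horizontal-space
    # characters, so that a run of spaces stays together.
    kept = []
    for pos in breaks:
        between_spaces = (
            0 < pos < len(text)
            and text[pos - 1] in HORIZONTAL_SPACE
            and text[pos] in HORIZONTAL_SPACE
        )
        if not between_spaces:
            kept.append(pos)
    return kept

# Hypothetical untailored break offsets for "Hi.  There".
default_breaks = [0, 2, 3, 4, 5, 10]
print(suppress_space_breaks("Hi.  There", default_breaks))  # [0, 2, 3, 5, 10]
```

Only offset 4, which lies between the two spaces, is removed; the break between the full stop and the first space is kept.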
Date/Time: Mon May 2 00:06:49 CDT 2016
Name: Peter Edberg
Report Type: Public Review Issue
Opt Subject: PRI #306 feedback: Add Extend* in new rule GB10
In proposed update UAX #29 9.0.0 draft 7 (2016-04-19), I believe that rule GB10 should be changed as follows (adding "Extend*", or at least "Extend?"):
GB10 (E_Base | EBG) Extend* × E_Modifier
This does not seem to be implied by the preceding rule GB9
GB9 × (Extend | ZWJ)
which is excluded by the larger context in rule GB10. The reason for this change is to treat as a grapheme an emoji modifier sequence as described by definition ED-13 in the current version of UTR #51 (2.0, 2015-11-12):
(emoji_modifier_base | emoji_base_variation_sequence) emoji_modifier
While the proposed update to UTR #51 (3.0 draft 4) does suggest removing emoji_base_variation_sequence from the above definition, there is existing data that uses such sequences (generated by systems that were new as of about a year ago and treat them as single graphemes), for example: U+270C U+FE0F U+1F3FB
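The difference between the draft rule and the proposed rule can be checked against the example sequence with a small sketch. The character sets below are tiny illustrative subsets chosen for this example (an assumption, not the full property data), EBG is omitted, and the function name `gb10_blocks_break` is hypothetical.

```python
# Illustrative subsets only (assumptions), not the full property data.
E_BASE = {"\u270c"}                                        # VICTORY HAND
EXTEND = {"\ufe0f"}                                        # VARIATION SELECTOR-16
E_MODIFIER = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}   # U+1F3FB..U+1F3FF

def gb10_blocks_break(text, pos, allow_extend):
    # True if the sketched GB10 forbids a break immediately before text[pos].
    if not (0 < pos < len(text)) or text[pos] not in E_MODIFIER:
        return False
    i = pos - 1
    if allow_extend:
        # Proposed form: skip any run of Extend characters (Extend*).
        while i >= 0 and text[i] in EXTEND:
            i -= 1
    return i >= 0 and text[i] in E_BASE

seq = "\u270c\ufe0f\U0001f3fb"  # U+270C U+FE0F U+1F3FB
print(gb10_blocks_break(seq, 2, allow_extend=False))  # False
print(gb10_blocks_break(seq, 2, allow_extend=True))   # True
```

Without Extend*, the variation selector sits between the base and the modifier, so the rule does not apply before U+1F3FB; with Extend*, it does.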
Date/Time: Sat May 7 12:00:52 CDT 2016
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Incorrect definition of Sentence Break = Sp in TR29 draft for 9.0
In SentenceBreakProperty-9.0.0d23.txt, U+202F NARROW NO-BREAK SPACE is not listed under Sp (this is a recent change related to Mongolian). However, the latest draft of TR29 (http://www.unicode.org/reports/tr29/tr29-28.html) defines the Sentence_Break property value Sp as White_Space = Yes and Sentence_Break ≠ Sep and Sentence_Break ≠ CR and Sentence_Break ≠ LF. According to this definition, Sp should include U+202F, as it has the White_Space property and is not Sep, CR or LF. I think that the definition of Sentence_Break = Sp in TR29 needs to be updated to reflect the exclusion of U+202F.