Public Review Issues

Accumulated Feedback on PRI #306

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Sun Aug 23 18:12:18 CDT 2015
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Word Boundary Property

In UAX#29, the paragraph before Section 4.1 refers to a 'word boundary property'.  
Either this should be 'word break property', or the term should be defined.

Date/Time: Sat Oct 24 22:24:23 CDT 2015
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UAX 29 issues

It makes no sense for \b{} in a regular expression pattern to match an empty
string, as there is nothing to break at.  But the first rule for all four
boundary types is that there is a break at SOT and at EOT.  I suggest
clarifying this so that there is no break if there is no text.  Matching an
empty string against a pattern is a common operation.

I wrote an email to the unicode.org mailing list 2 months ago, asking for
guidance from Unicode about their thinking about the rules of \b{wb}.  It is
<55D8D6AE.8080008@khwilliamson.com>.  The single response was from Richard
Wordingham, who, though I find him usually astute in his answers, does not
speak even informally for the Consortium.  I posted it there so that an answer would be
public so that someone in the future might see it rather than bother you again
with the same question.  But since there was no answer, I'll repeat the
question here:

The concept of \b in a regular expression meaning to match the boundary
between a word and non-word was invented by Larry Wall, for the Perl
programming language. This was before Unicode, and a word was defined as
alphanumerics plus the underscore, which fit well with how identifiers in that
computer language (and many others) were defined. Essentially \b is defined to
break between runs of word characters versus runs of non-word characters.

The latest version of Perl 5 (recently released) has added \b{w} based on
Unicode's definition. The typical expectation of its programmers is that it
would be a drop-in replacement for the old \b, with much better results in
parsing natural languages.

But it isn't such a replacement, creating some consternation, and the main
reason is that, unlike \b, it treats the boundary between white space
characters as a breaking opportunity, so that it doesn't create runs of them.
Thus if you have two spaces after a full stop, it treats each as an individual
word.

My question is: "Was this intentional, and if so, why?"

TR18 says \b{w} is a "Zero-width match at a Unicode word boundary. Note that
this is different than \b alone, which corresponds to \w and \W."

And UAX29 says "adjacent spaces are collapsed to a single space" in
intelligent cut and paste using the WB property.
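
The difference described above can be illustrated with a minimal Python sketch. It is not an implementation of UAX #29: `toy_default_segments` is a hypothetical stand-in that only keeps runs of word characters together and breaks around every other character, which is enough to show each space becoming its own segment. The sample string is also an assumption chosen for illustration.

```python
import re

def classic_b_positions(text):
    # Offsets at which the classic \b assertion matches
    # (a transition between \w and \W, or a text edge next to \w).
    return [m.start() for m in re.finditer(r"\b", text)]

def toy_default_segments(text):
    # Toy stand-in for the default word rules (an assumption, not UAX #29):
    # runs of word characters stay together, every other character
    # becomes a segment of its own.
    return re.findall(r"\w+|\W", text)

sample = "Hi.  There"  # two spaces after the full stop

print(classic_b_positions(sample))   # [0, 2, 5, 10]
print(toy_default_segments(sample))  # ['Hi', '.', ' ', ' ', 'There']

# The classic \b does not match in an empty string at all.
print(re.search(r"\b", ""))          # None
```

With the classic \b, the full stop and both spaces fall into a single non-word run (offsets 2 to 5), whereas the toy segmentation yields each space separately.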

Date/Time: Wed Oct 28 10:40:12 CDT 2015
Name: Likasoft
Report Type: Error Report
Opt Subject: U+19B0 and TR29

Hello,

Revision 27 of TR29 says that U+19B0..U+19C9 were removed from the
exception list. This means that they should appear in GraphemeBreakProperty.txt
as SpacingMark, but they do not.
Could you please say which file is correct?

Bye.

Feedback above this line was reviewed at the November 2015 UTC meeting.

Date/Time: Mon Apr 4 22:33:32 CDT 2016
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Changes to UAX 29

Here are your review issues, with my responses to each one on which I have an opinion:

GB10 was renamed to GB999, since it is always the final rule.

I agree with the renaming.

Rule GB8c could be dropped, since it is a by-product of GB999.

I favor dropping it.

It would be cleaner to move the RI rules to be GB12+, to group the emoji rules together.

I favor doing this.

WB13c was replaced by WB15..WB17

Fine with me.

WB14 was renamed WB999, since it will always be the final rule.

I agree with the renaming.

WB17 could be dropped, since it is a by-product of WB999.

I favor dropping it.

If we moved WB14..WB16 above WB4, it would disallow certain degenerate cases,
such as a combining mark between two RI characters.

I think it should be moved, but I have not studied it for any deleterious effects.

Should we add Any × (Format | Extend) as an explicit rule?

I'm uncertain about doing this, but if this is not done, it should be mentioned
in the rule description, for both WB and SB.

Should we disallow breaks between sequences of whitespace?

I favor doing this, depending on how it would impact other languages besides Perl 5.
If you don't do it, you should mention the possibility of tailoring to get this effect,
which Perl 5 has already done. Mark Davis asked me to submit Perl's tailored rules.
We have created a new property value, Horizontal Space, which includes all characters
of that ilk. For Perl 5, I tried to make our tailoring as minimally disruptive as
possible, so only one rule would be affected:

WB3 Do not break between CR LF nor between Horizontal Space, nor between Horizontal
Space and sequences of Extend and/or Format

This is a higher priority rule than WB4, about ignoring Extend and Format.
If this change were to be done generally, my concern about making a
minimal tailoring would not be a consideration, and the rules could be
rewritten so as to mention Extend and Format only in WB4. However, moving
WB14..WB16 above WB4 might affect this.
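
As a rough illustration of the effect of the tailored WB3, the following Python sketch joins adjacent segments that consist only of horizontal space. It is a post-processing approximation, not Perl's implementation: the `HORIZONTAL_SPACE` set is an assumed approximation of the new property value, the function name is hypothetical, and the part of the rule about Extend and Format sequences is left out.

```python
# Assumed approximation of the "Horizontal Space" property value.
HORIZONTAL_SPACE = set("\t \u00a0\u1680\u202f\u205f\u3000") | {
    chr(cp) for cp in range(0x2000, 0x200B)
}

def is_horizontal_space(segment):
    # True for a non-empty segment made only of horizontal space.
    return bool(segment) and all(ch in HORIZONTAL_SPACE for ch in segment)

def merge_horizontal_space(segments):
    # Join neighbouring horizontal-space segments, as if no break
    # were allowed between them (Extend/Format handling omitted).
    merged = []
    for segment in segments:
        if merged and is_horizontal_space(segment) and is_horizontal_space(merged[-1]):
            merged[-1] += segment
        else:
            merged.append(segment)
    return merged

print(merge_horizontal_space(["Hi", ".", " ", " ", "There"]))
# ['Hi', '.', '  ', 'There']
```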

Date/Time: Mon May 2 00:06:49 CDT 2016
Name: Peter Edberg
Report Type: Public Review Issue
Opt Subject: PRI #306 feedback: Add Extend* in new rule GB10

In proposed update UAX #29 9.0.0 draft 7 (2016-04-19), I believe that 
rule GB10 should be changed as follows (adding "Extend*", or at least "Extend?"):

GB10	(E_Base | EBG)  Extend*  ×  E_Modifier

This does not seem to be implied by the preceding rule GB9

GB9	×  (Extend | ZWJ)

which is excluded by the larger context in rule GB10.

The reason for this change is to treat as a grapheme an emoji modifier 
sequence as described by definition ED-13 in the current version of 
UTR #51 (2.0, 2015-11-12):

	(emoji_modifier_base | emoji_base_variation_sequence) emoji_modifier

While the proposed update to UTR #51 (3.0 draft 4) does suggest removing 
emoji_base_variation_sequence from the above definition, there is existing 
data that uses such sequences (generated by systems that were new as of 
about a year ago and treat them as single graphemes), for example:

	U+270C U+FE0F U+1F3FB
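
The effect of adding "Extend*" to GB10 on this sequence can be sketched in Python as follows. The class table is hypothetical and hand-filled for the three code points of the example only, following the classes the feedback implies; it is not read from GraphemeBreakProperty.txt, and the function name is an assumption.

```python
# Hypothetical class table covering only the example sequence.
GCB = {0x270C: "E_Base", 0xFE0F: "Extend", 0x1F3FB: "E_Modifier"}

def gb10_forbids_break(code_points, i, allow_extend):
    # True if GB10 forbids a break before code_points[i].
    # allow_extend=False: (E_Base | EBG) x E_Modifier
    # allow_extend=True:  (E_Base | EBG) Extend* x E_Modifier
    if GCB.get(code_points[i]) != "E_Modifier":
        return False
    j = i - 1
    if allow_extend:
        # Skip back over any run of Extend characters.
        while j >= 0 and GCB.get(code_points[j]) == "Extend":
            j -= 1
    return j >= 0 and GCB.get(code_points[j]) in ("E_Base", "EBG")

sequence = [0x270C, 0xFE0F, 0x1F3FB]
print(gb10_forbids_break(sequence, 2, allow_extend=False))  # False
print(gb10_forbids_break(sequence, 2, allow_extend=True))   # True
```

Without "Extend*", the variation selector sits between the base and the modifier, so GB10 does not apply and the modifier would start a new grapheme cluster.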

Date/Time: Sat May 7 12:00:52 CDT 2016
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Incorrect definition of Sentence Break = Sp in TR29 draft for 9.0

In SentenceBreakProperty-9.0.0d23.txt U+202F NARROW NO-BREAK SPACE is not 
listed under Sp (this is a recent change related to Mongolian).

However, the latest draft of TR29 (http://www.unicode.org/reports/tr29/tr29-28.html) 
specifies that Sentence_Break property Sp = White_Space = Yes and Sentence_Break ≠ Sep 
and Sentence_Break ≠ CR and Sentence_Break ≠ LF.  According to this definition Sp 
should include 202F as it has the White_Space property and is not Sep, CR or LF.

I think that the definition of Sentence Break = Sp in TR29 needs to be updated to 
reflect the exclusion of 202F.
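
The mismatch can be checked with a small Python sketch of the set arithmetic in the textual definition. The code point sets below are hand-copied assumptions about White_Space, Sep, CR and LF rather than values parsed from the data files, and the variable names are hypothetical.

```python
# Assumed White_Space code points (hand-copied, not parsed from PropList.txt).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)
SEP = {0x0085, 0x2028, 0x2029}  # assumed Sentence_Break = Sep
CR = {0x000D}
LF = {0x000A}

# Sp as worded in the draft: White_Space = Yes, minus Sep, CR and LF.
sp_by_definition = WHITE_SPACE - SEP - CR - LF
print(0x202F in sp_by_definition)  # True

# Sp with the exclusion the feedback asks the definition to reflect.
sp_with_exclusion = sp_by_definition - {0x202F}
print(0x202F in sp_with_exclusion)  # False
```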