This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Fri Mar 11 04:16:11 CST 2016
Name: Sascha Brawer
Report Type: Error Report
Opt Subject: Missing reference in TR14/TR41
Section 3.1 of TR14 has a broken link. The final paragraph of http://www.unicode.org/reports/tr14/#BreakOpportunities says: “In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi].” The [Bidi] link points to http://www.unicode.org/reports/tr41/tr41-17.html#Bidi but there is no #Bidi anchor in TR41.
Date/Time: Wed Apr 20 15:19:33 CDT 2016
Name: Andy Heninger
Report Type: Error Report
Opt Subject: UAX 14 feedback, PRI #322
The UAX-14 line breaking of numbers beginning with a decimal point can be bad. Consider the string "start .789 end". With the default rules there will only be one break, "start .789 |end". Rule LB13, "× IS", will prevent a break before the number. With the tailoring of numbers from example 7 of section 8.2 there will be an unexpected break after the full stop, yielding "start .|789 |end", because the regular expression for numbers does not allow a character of class IS to precede the first digit. How this might be fixed will require some thought. This problem was originally reported by Bernhard Fey in an ICU bug report, http://bugs.icu-project.org/trac/ticket/12017
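The mismatch described in this report can be illustrated with a small sketch. It assumes the number pattern of section 8.2, example 7 is `(PR | PO)? (OP | HY)? NU (NU | SY | IS)* (CL | CP)? (PR | PO)?` (quoted from memory, so treat it as an assumption), and it encodes each line break class as a single letter so that Python's `re` module can be applied to class sequences. The names `CODE`, `NUMBER`, `NUMBER_WITH_LEADING_IS`, and `number_spans` are hypothetical, and the second pattern is only one conceivable variant, not something the report proposes.

```python
import re

# One-letter codes for the line break classes used below (hypothetical encoding).
CODE = {"PR": "r", "PO": "o", "OP": "p", "HY": "h", "NU": "n",
        "SY": "s", "IS": "i", "CL": "c", "CP": "q", "AL": "a", "SP": "_"}

# Assumed pattern from UAX #14 section 8.2, example 7:
# (PR | PO)? (OP | HY)? NU (NU | SY | IS)* (CL | CP)? (PR | PO)?
NUMBER = re.compile(r"[ro]?[ph]?n[nsi]*[cq]?[ro]?")

# Hypothetical variant that allows a single IS before the first digit.
NUMBER_WITH_LEADING_IS = re.compile(r"[ro]?[ph]?i?n[nsi]*[cq]?[ro]?")


def number_spans(classes, pattern):
    """Return (start, end) index pairs of class runs matched by the pattern."""
    encoded = "".join(CODE[c] for c in classes)
    return [m.span() for m in pattern.finditer(encoded)]


# "start .789 end" reduced to classes, with one AL standing in for each word:
# AL SP IS NU NU NU SP AL
classes = ["AL", "SP", "IS", "NU", "NU", "NU", "SP", "AL"]

# The assumed pattern starts matching at the first NU, so the IS is left outside.
print(number_spans(classes, NUMBER))
# The variant starts at the IS, keeping the full stop attached to the digits.
print(number_spans(classes, NUMBER_WITH_LEADING_IS))
```

With the assumed pattern the match covers only the three NU positions, which is consistent with the reported break between the full stop and "789".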
Date/Time: Tue Apr 26 15:03:57 CDT 2016
Name: Andy Heninger
Report Type: Public Review Issue
Opt Subject: UAX 14 feedback
Line Break rule LB1 says that, in the absence of other criteria, unknown characters (class XX) should be treated as alphabetic (class AL). There is no break opportunity between alphabetic characters. This causes problems for emoji. Adoption of new emoji characters tends to occur extremely quickly, so line-break implementations that have not been updated see them as unknown. Treating unknown characters as class ID might give better results. Alternatively, something could be done based on blocks, treating unassigned characters from blocks for alphabetic scripts as AL and others as ID.
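The effect of the LB1 fallback can be sketched as follows. This is a minimal illustration, not an implementation of the algorithm: the class table `KNOWN` is a hypothetical stand-in for LineBreak.txt, the function names are invented, and only classes AL and ID are considered, where AL followed by AL is taken as no break and every other pair as a break opportunity.

```python
# Hypothetical, tiny class table; a real implementation would read LineBreak.txt.
KNOWN = {"a": "AL", "b": "AL", "\u4e00": "ID"}


def resolve(ch, unknown_as):
    """Return the class of ch, mapping unknown characters (XX) to unknown_as."""
    return KNOWN.get(ch, unknown_as)


def break_positions(text, unknown_as):
    """Return the indices before which a break opportunity exists.

    Simplification: only AL and ID are modelled. AL followed by AL never
    breaks; any pair involving ID is treated as a break opportunity.
    """
    classes = [resolve(ch, unknown_as) for ch in text]
    return [i for i in range(1, len(classes))
            if not (classes[i - 1] == "AL" and classes[i] == "AL")]


# U+1F600 is deliberately absent from KNOWN to simulate an outdated implementation.
text = "a\U0001F600\U0001F600b"

print(break_positions(text, "AL"))  # fallback to AL, as LB1 specifies
print(break_positions(text, "ID"))  # fallback to ID, as the report suggests
```

Under the AL fallback the whole string becomes one unbreakable run, while the ID fallback yields a break opportunity around each unknown character.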
Date/Time: Fri Apr 29 16:06:51 CDT 2016
Name: Marcin Grzegorczyk
Report Type: Public Review Issue
Opt Subject: UAX 14 feedback (PRI #322)
The new rule LB30a (as of rev. 36 draft 1) cannot be implemented with a pair table without extra processing. If the rule is to retain its extended context, then the implementation presented in chapter 7 – not just the pair table, but the sample code, too – will have to be updated to account for it. One possible implementation of LB30a is to introduce an additional, artificial line breaking class – let’s call it RI2 – and change RI into RI2 in the main loop if the previous class (taking LB9 into account) was RI (RIs previously mapped to RI2 would not count). Then LB30a can be rewritten as (RI × RI2) and (RI2 ÷ RI), which can be implemented directly in the pair table. This is part of a broader issue with regional indicators. Because there is only a single set of regional indicator symbol letters (as opposed to separate sets of leading and trailing letters), if some process accidentally breaks a string of RIs on an odd boundary (e.g. due to a limited buffer size) the entire part of that string following the break is corrupted. (This is similar to the weakness inherent in several multi-byte character sets such as EUC-CN.) The new rule LB30a provides a partial mitigation of the problem (direct break opportunities in strings of RIs significantly reduce the number of cases where an emergency line break is required), but the fundamental issue remains. However, I guess it is too late now to add a second set of trailing RI symbol letters, although it might be a good idea to re-include (perhaps in the main Standard text) the recommendation to insert ZWSP (or WJ if a break is undesirable) between pairs of RIs, since it would limit the potential damage to a single RI pair.
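The RI2 remapping suggested in this report might look roughly like the sketch below. It is an illustration under simplifying assumptions: LB9 (combining marks) is not handled, only the two pair-table entries named in the report are modelled, and the function and table names are hypothetical.

```python
def remap_regional_indicators(classes):
    """Rewrite every second RI in a run as the artificial class RI2.

    An RI becomes RI2 only if the previous class is RI. A previous RI2 does
    not count, so runs alternate RI, RI2, RI, RI2, ...
    """
    out = []
    prev = None
    for cls in classes:
        if cls == "RI" and prev == "RI":
            cls = "RI2"
        out.append(cls)
        prev = cls
    return out


# The two pair-table entries from the report: RI × RI2 and RI2 ÷ RI.
# True means a break opportunity, False means no break.
PAIR = {("RI", "RI2"): False, ("RI2", "RI"): True}


def ri_break_positions(classes):
    """Return the indices before which the RI pair entries allow a break."""
    mapped = remap_regional_indicators(classes)
    return [i for i in range(1, len(mapped))
            if PAIR.get((mapped[i - 1], mapped[i]), False)]


five = ["RI"] * 5
print(remap_regional_indicators(five))
print(ri_break_positions(five))
```

For five consecutive RIs the remapped sequence alternates RI and RI2, and break opportunities fall only after each complete pair. In a full implementation RI2 would presumably share RI's pair-table entries for all other classes.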