This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Mon Oct 14 18:46:18 CDT 2019
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI #406 (PU UAX #14)
At UTC 161, a proposed change in rule LB30 of PU UAX #14 was discussed, to introduce reference to East-Asian width properties. The alternative change would have been to create a new line-breaking property and to divide OP characters into two classes. The main rationale for the proposed change is that it requires only a small change in UAX #14 while the alternative would have required more churn. It should be noted that the proposed change requires revision in at least one other part of UAX #14 besides LB30 in section 6.1: In section 5, the text under the "Data File" sub-heading states, "The full classification of all Unicode characters by their line breaking properties is available in the file LineBreak.txt [Data14] in the Unicode Character Database [UCD]." With the proposed change, this statement would no longer be true. I haven't reviewed all of PU UAX #14 to see if there are any other parts of the text that would be impacted.
Date/Time: Tue Nov 5 18:25:35 CST 2019
Name: Elika Etemad
Report Type: Error Report
Opt Subject: UAX14 Categorization of Javanese
Overview: UAX14 and Unicode Chapter 17.4 disagree on line-breaking in Javanese. Details: Unicode Chapter 17.4 says that Javanese breaks between orthographic syllables, and defines a BNF pattern for these syllables. UAX14 says Javanese is treated as AL, which does not allow breaks between units. These requirements conflict. Proposal: In UAX14, recategorize Javanese as SA, which is defined to determine breakpoint based on lexical analysis. Links: https://www.unicode.org/versions/Unicode12.0.0/ch17.pdf http://unicode.org/reports/tr14/#AL “no line breaks are allowed between pairs” http://unicode.org/reports/tr14/#SA “require morphological analysis to determine break opportunities”
Date/Time: Wed Dec 4 15:18:53 CST 2019
Name: Markus W Scherer
Report Type: Error Report
Opt Subject: UAX #14 line break: confusing naming/behavior of lb=BA eg U+3000
Feedback on behalf of Javier Fernandez & Florian Rivoal, see https://unicode-org.atlassian.net/browse/ICU-20843 The last comment there (from Florian) is: Even though the BA class is described as “(A)” “providing a break opportunity” in section 5, if you follow the rules of section 6, you’re right that the effect is to suppress breaks before, not to introduce them after. So my comment above was wrong, and there should not be a break between ID and U+3000, nor between CJ (treated as ID or as NS) and U+3000. Which means that ICU does not have a bug after all. Sorry for the confusion. I do wonder if an editorial bug should be opened on UAX 14: the informative text of section 5 in this case is a poor indicator of the normative behavior of section 6, leading to misunderstandings like the mistake I made above.
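The behavior described in the comment above can be illustrated with a minimal sketch. It assumes a heavily reduced model containing only the "× BA" part of rule LB21 and the default "break everywhere else" rule (LB31); the function name break_allowed is hypothetical and this is not a full UAX #14 implementation.

```python
# Reduced, illustrative model of two UAX #14 rules only:
#   LB21 (part): × BA   -> do not break before a BA character
#   LB31:        ÷      -> break everywhere else
# All other rules are deliberately omitted, so results for other class
# pairs are not meaningful.
def break_allowed(before: str, after: str) -> bool:
    """Return True if a break is allowed between two adjacent classes."""
    if after == "BA":
        return False  # LB21 suppresses the break *before* BA
    return True       # nothing in this reduced model forbids the break

# U+3000 has lb=BA according to the report above.
pairs = [("ID", "BA"), ("NS", "BA"), ("BA", "ID")]
results = {pair: break_allowed(*pair) for pair in pairs}
for (before, after), allowed in results.items():
    print(f"{before} {'÷' if allowed else '×'} {after}")
```

In this sketch the break after BA is not created by any rule that mentions BA; it simply falls out of the default rule, which is why the "provides a break opportunity" wording in section 5 can be misread.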
Date/Time: Wed Jan 1 23:06:51 CST 2020
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: 406 Proposed Update UAX #14, Unicode Line Breaking Algorithm
"Review Note: The above change uses combinations of properties, as is done in UAX #29, rather than breaking out new classes OP2 and CP2, and then changing all other instances of OP and CP to be (OP|OP2) and (CP|CP2), respectively." I'm not happy about this trend. I wasn't (and still am not) happy about it in UAX #29. An implementation will have to split these classes anyway. The work must get done. Why can't you ship a UCD with that work already done, instead of having each implementer do it individually? That's more cost to society as a whole, and will delay the implementations somewhat, and there could well be bugs in the divergent implementations. And it feels to me like it is shirking your responsibility of furnishing an adequate set of data files. I could understand the need for doing this with the pictograph property used in UAX #29 this way, as the file wasn't included in the UCD. That itself was an inconvenience to your implementers, as I and others had to go digging up the Emoji data file, and shoe-horn it into our database. Now that it is in the UCD, it would be good if you fixed your data files to not require implementations to roll their own partitions. In brief, my assertion is that your data files for all enum properties should partition the entire Unicode range into equivalence classes, without the need for implementations to have to write code to do it themselves.
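As a rough illustration of the work the comment above says each implementer would have to do, the following sketch derives split classes from a line-break value combined with East Asian Width. Several things here are assumptions, not taken from the feedback: the tiny SAMPLE_LINE_BREAK table stands in for LineBreak.txt, the set of width values (F, W, H) that triggers the split is a guess at what the proposed LB30 would use, and the choice of which half gets the "2" suffix is arbitrary. Only the names OP2 and CP2 come from the quoted review note.

```python
import unicodedata
from collections import defaultdict

# Hand-picked sample standing in for LineBreak.txt (illustrative only).
SAMPLE_LINE_BREAK = {
    0x0028: "OP",  # LEFT PARENTHESIS
    0x0029: "CP",  # RIGHT PARENTHESIS
    0x0041: "AL",  # LATIN CAPITAL LETTER A
    0x005B: "OP",  # LEFT SQUARE BRACKET
    0x3008: "OP",  # LEFT ANGLE BRACKET
    0xFF08: "OP",  # FULLWIDTH LEFT PARENTHESIS
}

# Assumption: these East Asian Width values are the ones that matter.
EAST_ASIAN = {"F", "W", "H"}

def derived_class(cp: int) -> str:
    """Combine lb and ea into one hypothetical partition class."""
    lb = SAMPLE_LINE_BREAK[cp]
    ea = unicodedata.east_asian_width(chr(cp))
    if lb in ("OP", "CP") and ea in EAST_ASIAN:
        return lb + "2"  # hypothetical OP2 / CP2 from the review note
    return lb

def partition(code_points):
    """Group code points into equivalence classes by derived class."""
    classes = defaultdict(list)
    for cp in sorted(code_points):
        classes[derived_class(cp)].append(cp)
    return dict(classes)

classes = partition(SAMPLE_LINE_BREAK)
for name, members in classes.items():
    print(name, [f"U+{cp:04X}" for cp in members])
```

If the data files shipped such a partition directly, the derived_class step would reduce to a single table lookup in every implementation.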