Public Review Issues

Accumulated Feedback on PRI #406

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Mon Oct 14 18:46:18 CDT 2019
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI #406 (PU UAX #14)

At UTC 161, a proposed change in rule LB30 of PU UAX #14 was discussed, to
introduce reference to East-Asian width properties. The alternative change
would have been to create a new line-breaking property and to divide OP
characters into two classes. The main rationale for the proposed change is
that it requires only a small change in UAX #14, while the alternative would
have required more churn.

It should be noted that the proposed change requires revision in at least
one other part of UAX #14 besides LB30 in section 6.1: In section 5, the
text under the "Data File" sub-heading states, "The full classification of
all Unicode characters by their line breaking properties is available in the
file LineBreak.txt [Data14] in the Unicode Character Database [UCD]." With
the proposed change, this statement would no longer be true.

I haven't reviewed all of PU UAX #14 to see if there are any other parts of
the text that would be impacted.

Date/Time: Tue Nov 5 18:25:35 CST 2019
Name: Elika Etemad
Report Type: Error Report
Opt Subject: UAX14 Categorization of Javanese


Overview:

    UAX14 and Unicode Chapter 17.4 disagree on line-breaking in Javanese.
    
Details:
    
    Unicode Chapter 17.4 says that Javanese breaks between orthographic syllables,
    and defines a BNF pattern for these syllables.

    UAX14 says Javanese is treated as AL, which does not allow breaks between units.

    These requirements conflict.
    
Proposal:
    
    In UAX14, recategorize Javanese as SA, which is defined to determine break
    points based on lexical analysis.

Links:
    https://www.unicode.org/versions/Unicode12.0.0/ch17.pdf
    http://unicode.org/reports/tr14/#AL “no line breaks are allowed between pairs”
    http://unicode.org/reports/tr14/#SA “require morphological analysis to determine break opportunities”

Date/Time: Wed Dec 4 15:18:53 CST 2019
Name: Markus W Scherer
Report Type: Error Report
Opt Subject: UAX #14 line break: confusing naming/behavior of lb=BA eg U+3000

Feedback on behalf of Javier Fernandez & Florian Rivoal, see https://unicode-org.atlassian.net/browse/ICU-20843 

The last comment there (from Florian) is:

Even though the BA class is described as “(A)” “providing a break
opportunity” in section 5, if you follow the rules of section 6, you’re
right that the effect is to suppress breaks before, not to introduce them
after. So my comment above was wrong, and there should not be a break
between ID and U+3000, nor between CJ (treated as ID or as NS) and U+3000.
Which means that ICU does not have a bug after all. Sorry for the confusion.

I do wonder if an editorial bug should be open on UAX 14: the informative
text of section 5 in this case is a poor indicator of the normative behavior
of section 6, leading to misunderstandings like the mistake I made above.
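
The distinction Florian describes can be illustrated with a toy sketch. It
models only two things: a "do not break before BA" rule and a default that
allows a break everywhere else. The function name break_allowed is
hypothetical, and the sketch ignores every other rule in section 6, so it is
an illustration of the point above rather than an implementation of UAX #14.

```python
def break_allowed(before: str, after: str) -> bool:
    """Toy model: is a break allowed between two line-break classes?

    Assumption: only the "no break before BA" behavior and a default
    "break allowed" fallback are modeled; all other rules are omitted.
    """
    if after == "BA":
        # The rule suppresses the break *before* a BA character.
        return False
    # Nothing here adds a break after BA; it is allowed only because
    # the default fallback allows breaks that no rule has prohibited.
    return True


# U+3000 has lb=BA (per the subject line above), so ID followed by
# U+3000 gets no break opportunity between the two characters.
print(break_allowed("ID", "BA"))  # False
print(break_allowed("BA", "ID"))  # True
```

In this simplified model the break after BA comes from the default case, not
from the BA rule itself, which is the gap between the section 5 description
and the section 6 behavior.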

Date/Time: Wed Jan 1 23:06:51 CST 2020
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: 406 Proposed Update UAX #14, Unicode Line Breaking Algorithm

"Review Note: The above change uses combinations of properties, as is done
in UAX #29, rather than breaking out new classes OP2 and CP2, and then
changing all other instances of OP and CP to be (OP|OP2) and (CP|CP2),
respectively."

I'm not happy about this trend.  I wasn't (and still am not) happy about it
in UAX#29.

An implementation will have to split these classes anyway.  The work must
get done.  Why can't you ship a UCD with that work already done, instead of
having each implementer do it individually?  That's more cost to
society as a whole, and will delay the implementations somewhat, and there
could well be bugs in the divergent implementations.  And it feels to me
like it is shirking your responsibility of furnishing an adequate set of
data files.

I could understand the need for handling the pictograph property used in
UAX #29 this way, as the file wasn't included in the UCD.  That itself
was an inconvenience to your implementers, as I and others had to go digging
up the Emoji data file, and shoe-horn it into our database.  Now that it is
in the UCD, it would be good if you fixed your data files to not require
implementations to roll their own partitions.

In brief, my assertion is that your data files for all enum properties
should partition the entire Unicode range into equivalence classes, so that
implementations do not have to write code to do it themselves.
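
As a rough sketch of the work each implementer would have to repeat, the
following combines a line-break class with an East Asian Width value to
derive a single partition. The class names OP_EA and CP_EA, the helper
names, and the choice of F/W/H as the East Asian set are assumptions made
for illustration. The sample values are hand-entered and should be checked
against LineBreak.txt and EastAsianWidth.txt before being relied on.

```python
from collections import defaultdict

# Hand-entered sample of (lb, ea) values per code point; an assumption
# for illustration, not parsed from the UCD data files.
SAMPLE = {
    0x0028: ("OP", "Na"),  # LEFT PARENTHESIS
    0xFF08: ("OP", "F"),   # FULLWIDTH LEFT PARENTHESIS
    0x3008: ("OP", "W"),   # LEFT ANGLE BRACKET
    0x0029: ("CP", "Na"),  # RIGHT PARENTHESIS
    0x0041: ("AL", "Na"),  # LATIN CAPITAL LETTER A
}

# Assumed set of width values that split OP and CP into two classes.
EAST_ASIAN = {"F", "W", "H"}


def derived_class(lb: str, ea: str) -> str:
    """Map an (lb, ea) pair to one derived class name (hypothetical names)."""
    if lb in ("OP", "CP") and ea in EAST_ASIAN:
        return lb + "_EA"
    return lb


def partition(data: dict) -> dict:
    """Group code points into equivalence classes by derived class."""
    classes = defaultdict(list)
    for cp, (lb, ea) in sorted(data.items()):
        classes[derived_class(lb, ea)].append(cp)
    return dict(classes)


for name, cps in partition(SAMPLE).items():
    print(name, [f"U+{cp:04X}" for cp in cps])
```

Every implementation needs some equivalent of derived_class, and each one
has to choose its own names and verify its own results, which is the
duplicated effort described above.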