This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Tue Apr 5 07:14:53 CDT 2022
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 446
In UAX #14, the descriptions of the line breaking classes Postfix_Numeric (PO) and Prefix_Numeric (PR) don’t match the actual behaviour of the line breaking algorithm when it comes to the treatment of intervening spaces. The description of PO states: »Characters that usually follow a numerical expression may not be separated from preceding numeric characters or preceding closing characters, even if one or more space characters intervene. For example, there is no break opportunity in “(12.00) %”.« Similarly, the description of PR states: »Characters that usually precede a numerical expression may not be separated from following numeric characters or following opening characters, even if a space character intervenes. For example, there is no break opportunity in “$ (100.00)”.« However, the line breaking rules that govern these classes (LB23a, LB24, LB25, LB27) contain no special provision for intervening spaces. As a result, the strings given as examples *do* in fact contain line breaking opportunities simply due to rule LB18 (Break after spaces): before the percent sign in the former and before the opening parenthesis in the latter. This can be confirmed with Unicode’s online utility for breaks and segmentation (https://util.unicode.org/UnicodeJsps/breaks.jsp).
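The effect described above can be reproduced with a small sketch. The Python below is not a UAX #14 implementation: it assumes a hand-written class table covering only the characters in the two example strings, and it applies only simplified forms of LB7, LB13, LB14, LB18, a subset of the LB25 pairs, and LB31. The function and table names are illustrative, and the mandatory break at end of text is left out.

```python
# Toy class table: only the characters used in the two examples (an assumption).
CLASSES = {"(": "OP", ")": "CP", ".": "IS", "%": "PO", "$": "PR", " ": "SP"}

# Subset of the LB25 "do not break" pairs that matter for these examples.
NO_BREAK_PAIRS = {
    ("CP", "PO"), ("CP", "PR"), ("NU", "PO"), ("NU", "PR"),
    ("PO", "OP"), ("PO", "NU"), ("PR", "OP"), ("PR", "NU"),
    ("IS", "NU"), ("NU", "NU"),
}


def lb_class(ch):
    """Return the toy line break class of a character."""
    return "NU" if ch.isdigit() else CLASSES[ch]


def break_opportunities(text):
    """Return the indices i where a break is allowed before text[i]."""
    classes = [lb_class(c) for c in text]
    breaks = []
    for i in range(1, len(classes)):
        before, after = classes[i - 1], classes[i]
        if after == "SP":                 # LB7: no break before a space
            continue
        if after in {"CP", "IS"}:         # LB13 (simplified): no break before these
            continue
        j = i - 1                         # LB14: OP SP* x
        while j >= 0 and classes[j] == "SP":
            j -= 1
        if j >= 0 and classes[j] == "OP":
            continue
        if before == "SP":                # LB18: break after spaces
            breaks.append(i)
            continue
        if (before, after) in NO_BREAK_PAIRS:   # LB25 subset, adjacent pairs only
            continue
        breaks.append(i)                  # LB31: break everywhere else
    return breaks


print(break_opportunities("(12.00) %"))   # break before the percent sign
print(break_opportunities("$ (100.00)"))  # break before the opening parenthesis
print(break_opportunities("(12.00)%"))    # no break once the space is removed
```

Because the LB25 pairs are only tested on directly adjacent classes, the space resolves to a break under LB18 before the numeric rules are ever consulted, which is the mismatch with the prose descriptions of PO and PR.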
Date/Time: Sun Apr 10 20:12:11 CDT 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: UAX 14
Section 3.1 of UAX 14 has the following description of the South East Asian style of line breaking: “The third style is used for scripts such as Thai, which do not use spaces, but which restrict word breaks to syllable boundaries, whose determination requires knowledge of the language comparable to that required by a hyphenation algorithm. Such an algorithm is beyond the scope of the Unicode Standard.” This description is odd in not starting out with line breaking, but with word breaks, whose relevance to line breaking is not explained. The problem statement I usually hear is that Thai, Lao, Khmer, and Myanmar allow line breaks only at word boundaries, but do not mark word boundaries in any way, so that they have to be determined by higher-level algorithms, typically based on dictionaries. See, for example, the W3C layout requirements:
https://www.w3.org/International/sealreq/thai/#h_line_breaking
https://www.w3.org/International/sealreq/lao/#h_line_breaking
https://www.w3.org/International/sealreq/khmer/#h_line_breaking
The comparison with hyphenation algorithms is also questionable, as the complexity of hyphenation algorithms can vary substantially between languages. Finally, Thai does use spaces to separate phrases. I propose replacing the text quoted above with "The third style is used for scripts such as Thai, which allow line breaks only at word boundaries, but do not mark word boundaries in any way, so that the determination of line break opportunities requires language dependent text analysis. Algorithms and data for such analysis are beyond the scope of the Unicode Standard."
Date/Time: Fri Jun 3 10:22:13 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446
UAX #14 says that U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is a member of class QU, but that has not been true for many years.
Date/Time: Fri Jun 3 09:10:34 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446
The SY class is motivated by the commonness of URLs. Hebrew letters can appear in URLs. What is the rationale for LB21b? Why is Hebrew special among all scripts that can appear in URLs? Documenting the reason would help implementers decide how to tailor the algorithm. Maybe the reasoning is that, although Hebrew can appear in URLs, most URLs are still ASCII, so a slash in Hebrew is probably not a URL slash and so isn’t a break opportunity. However, if so, that reasoning applies to all non-ASCII characters; the only reason Hebrew is treated specially is that it happens to have its own line break class for an unrelated reason, not because Hebrew is actually different from other scripts. If this is the reason, there are two ways to make the algorithm more consistent. The first is to delete LB21b. The second is to expand LB21b to all non-ASCII alphabetic/symbol characters.
Date/Time: Fri Jun 3 19:49:05 CDT 2022
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: 446
L2/21-042 gives examples of U+2E55..U+2E5C within words, just like how U+0029 is used in “(s)he”. It is central to these characters’ purpose to appear within words, so it is likely that their line breaking works the same as for U+0029. The closing characters U+2E56, U+2E58, U+2E5A, and U+2E5C should therefore have Line_Break=Close_Parenthesis.
Date/Time: Mon Jul 11 20:35:10 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Anatolian hieroglyphic line breaks
The standard says that “Spaces are used in modern renditions of [Anatolian] hieroglyphic text”; accordingly, most Anatolian hieroglyphs have Line_Break=Alphabetic, such that there are no line break opportunities within words. The only exceptions are U+145CE and U+145CF. If U+145CF appears within a word, there is a line break opportunity after it. Is that really true? It seems more likely that modern renditions of Anatolian hieroglyphic text break on spaces, not within words. U+145CE and U+145CF should therefore get Line_Break=Alphabetic.
Date/Time: Tue Jul 19 11:24:29 CDT 2022
Name: Brad Andalman
Report Type: Error Report
Opt Subject: UAX#14
UAX#14 [https://unicode.org/reports/tr14/] asserts that “Line breaks can occur before and after an em dash.” It also claims that the only use for an em dash is to “set off parenthetical text.” However, that is only one of many ways that an em dash can be used in English. The Chicago Manual of Style – beginning at entry 6.85 in the 17th edition – enumerates numerous ways an em dash can be used. Entry 6.87 mentions that an em dash should be used for “sudden breaks or interruptions.” One of the examples it uses is as follows: “Well, I don’t know,” I began tentatively. “I thought I might—” “Might what?” she demanded. If that trailing em dash followed by a quotation mark were to end up on its own line, it would look terrible. This is easy to make happen on a simple web page (see my bug report to WebKit: https://bugs.webkit.org/show_bug.cgi?id=242822), and it can often be seen in Apple Books as well (e.g. when reading The Invisible Man by H.G. Wells). This is because Apple Books is based on WebKit, which faithfully implements the line-breaking behavior specified in UAX#14. The Chicago Manual of Style addresses the problem of line breaks directly (in 6.90): “In printed publications, line breaks should generally be made after an em dash but not before, in the manner of hyphens. In the case of a closing quotation mark (or any other mark of punctuation) immediately following the dash, however, the quotation mark and dash *must not be broken at the end of a line*” [emphasis mine]. It would be great if UAX#14 could be updated to reflect the varied uses of em dashes in writing. Thanks!
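One possible reading of the Chicago Manual of Style guidance quoted above is a filter applied on top of whatever break opportunities a line breaker proposes. The Python sketch below is only an illustration of that reading, not a proposed change to UAX #14 and not WebKit's behaviour: the function name `apply_cmos_em_dash_rule` and the candidate list in the example are hypothetical, and "mark of punctuation" is approximated by Unicode general categories starting with `P`.

```python
import unicodedata

EM_DASH = "\u2014"


def apply_cmos_em_dash_rule(text, candidates):
    """Drop candidate breaks that conflict with the quoted CMOS 6.90 guidance.

    `candidates` holds indices i where a break before text[i] is proposed.
    """
    kept = []
    for i in candidates:
        # "after an em dash but not before": no break directly before the dash.
        if text[i] == EM_DASH:
            continue
        # Dash followed by punctuation (e.g. a closing quote) stays together.
        if text[i - 1] == EM_DASH and unicodedata.category(text[i]).startswith("P"):
            continue
        kept.append(i)
    return kept


# Hypothetical candidate breaks: before the dash, between the dash and the
# closing quotation mark, and before the next opening quotation mark.
sample = "I might\u2014\u201d \u201cMight what?\u201d"
print(apply_cmos_em_dash_rule(sample, [7, 8, 10]))
```

In this example only the break before the second speaker's opening quotation mark survives, so the dash and its closing quotation mark stay attached to the preceding word.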