This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Fri Jan 6 18:26:42 CST 2023
Name: Marshall Stoner
Report Type: Error Report
Opt Subject: www.unicode.org/reports/tr29/
Rule WB4 should be expanded and clarified. As is, the algorithm may break an Arabic numeric heading such as U+061C U+0600 U+0664 U+0666 in the wrong place. The word break rules should lead to "U+061C ÷ U+0600 × U+0664", *not* "U+061C × U+0600 ÷ U+0664". According to the same document, the sequence "U+0600 U+0664" is a grapheme cluster that should not be broken. I think there should be a rule in addition to WB4 that clarifies the break should come *after* most 'Format', 'Extend', or 'ZWJ' code points, but 'Format' should exclude any format characters that are subtending marks. Format characters that are subtending marks should be placed in a new category, and there should then be two rules: WB4a: Any × ( Extend | Format | ZWJ ) and WB4b: Prepend × Any. Therefore, if there is a sequence [some letter] ( Extend | Format | ZWJ )* Prepend* [another letter], the break should always occur after the "( Extend | Format | ZWJ )*" string but *before* the "Prepend*" string. Prepend should consist of the characters excluded from Format.
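The proposal in this report can be illustrated with a minimal Python sketch. It models only the two proposed rules (WB4a, WB4b), plus WB8 (Numeric × Numeric) and the default break rule; it is not the UAX 29 algorithm. The class labels are assumptions made for illustration: in particular, "Prepend" stands for the new category the report proposes for subtending marks and is not an existing Word_Break value, and the function name `proposed_breaks` is hypothetical.

```python
# Toy model of the proposed rules, applied to a list of class labels.
# Assumption: the labels below are the ones the report's proposal implies,
# not values read from the Unicode Character Database.

SKIPPED = {"Extend", "Format", "ZWJ"}

def proposed_breaks(classes):
    """Return one boolean per gap between adjacent items; True means a break."""
    result = []
    for left, right in zip(classes, classes[1:]):
        if right in SKIPPED:                        # proposed WB4a: Any x (Extend | Format | ZWJ)
            result.append(False)
        elif left == "Prepend":                     # proposed WB4b: Prepend x Any
            result.append(False)
        elif left == "Numeric" and right == "Numeric":  # WB8
            result.append(False)
        else:                                       # otherwise break
            result.append(True)
    return result

# U+061C U+0600 U+0664 U+0666 under the proposed classification
heading = ["Format", "Prepend", "Numeric", "Numeric"]
print(proposed_breaks(heading))  # [True, False, False]
```

Under these assumptions the only break falls between U+061C and U+0600, which is the outcome the report asks for.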
Feedback above this line was reviewed during or prior to UTC #175 in April, 2023
Date/Time: Fri Jun 16 21:13:48 CDT 2023
ReportID: ID20230616211348
Name: Eiso Chan
Report Type: Public Review Issue
Opt Subject: 469
In Table 1c, “ri-sequence” and “RI-Sequence” are both used. Maybe all “ri-sequence” in Table 1c should be “RI-Sequence”.
Date/Time: Tue Jun 20 13:51:08 CDT 2023
ReportID: ID20230620135108
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
I’m happy to see some progress in fixing UAX 29 for Brahmic scripts, even if it’s initially only for 6 of the roughly 40 scripts that need a fix. However, in the rule that defines consonant clusters, it’s not clear at all whether the class ExtCccZwj includes or excludes the right characters. The combining class for marks in Brahmic scripts (except for viramas and, up to now, nuktas) should generally be 0, and assignments of other values were in most cases mistakes that unfortunately cannot be corrected. Trying to derive meaning from ccc values in Brahmic scripts is almost certainly a mistake. Why should variation selectors be excluded from consonant clusters? Is the exclusion of three Gujarati nuktas intentional? Is the inclusion of Vedic tone marks intentional? If combining classes are really considered the appropriate basis for selecting characters that can occur within a consonant cluster, then this should be explained. If not, then the class should be defined so as to include the right characters, independent of ccc values.
Date/Time: Wed Jun 21 10:21:17 CDT 2023
ReportID: ID20230621102117
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
UAX 29 uses the set operators “&” and “-” in several regular expressions. UTR 18 and Appendix A of The Unicode Standard have settled on “&&” and “--”. UAX 29 should follow.
Date/Time: Fri Jun 23 11:30:10 CDT 2023
ReportID: ID20230623113010
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
The proposed update of UAX 29 states twice in new text that “the default grapheme clusters are also known as extended grapheme clusters”, and that legacy grapheme clusters are defined as a profile. On the other hand, existing text talks about a “key feature of default Unicode grapheme clusters (both legacy and extended)”, notes that “default [i.e., extended] Unicode grapheme clusters were previously referred to as ‘locale-independent graphemes’” even though that note predates the invention of extended grapheme clusters, has a section “Default Grapheme Cluster Boundary Specification” that covers both legacy and extended grapheme clusters, and requires “When citing the Unicode definition of grapheme clusters, it must be clear which of the two alternatives are being specified: extended versus legacy” as if there were no default. The use of “default” and defaults with respect to grapheme clusters should be reviewed and made consistent.
Date/Time: Fri Jun 23 11:30:48 CDT 2023
ReportID: ID20230623113048
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
UAX 29 has a note claiming that “The boundary between default Unicode grapheme clusters can be determined by just the two adjacent characters”. Looking at rules GB9c, GB11, GB12, and GB13, I don’t believe this is true.
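The point about GB12 and GB13 can be made concrete with a minimal Python sketch. It models only the regional-indicator pairing of those two rules and ignores every other grapheme cluster rule; the helper names `is_ri` and `ri_break_before` are hypothetical.

```python
# Sketch: whether a break is allowed between two adjacent regional
# indicators depends on how many regional indicators precede them.

def is_ri(ch: str) -> bool:
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def ri_break_before(text: str, i: int) -> bool:
    """True if a break is allowed between text[i-1] and text[i].

    Assumes both characters are regional indicators and considers
    only the pairing behaviour of GB12/GB13.
    """
    run = 0
    j = i - 1
    while j >= 0 and is_ri(text[j]):  # count the RI run ending at text[i-1]
        run += 1
        j -= 1
    # An odd run means text[i-1] opens a pair, so no break is allowed.
    return run % 2 == 0

flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"  # RI letters D, E, F, R
print(ri_break_before(flags, 1))  # False: D and E form a pair
print(ri_break_before(flags, 2))  # True: break between the two pairs
```

Both gaps sit between two regional indicators, yet the answers differ, so the two adjacent characters alone do not decide the boundary.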
Date/Time: Fri Jun 23 11:31:50 CDT 2023
ReportID: ID20230623113150
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
The description of Table 2a states “each macro represents a repeated union of the basic Grapheme_Cluster property values”. This seems to be incorrectly adapted from the descriptions of other tables. In reality, the table uses intersection and difference rather than union, and uses several other Unicode properties besides Grapheme_Cluster_Break (the real name of “Grapheme_Cluster”). The other macro tables in UAX 29 consider “represents” clear enough without a “=” sign; I think this would work here too.
Date/Time: Fri Jun 23 11:32:29 CDT 2023
ReportID: ID20230623113229
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
When rule GB9c is rendered in a narrow view (such as a printed page), it appears as LinkingConsonant ExtCccZwj* × LinkingConsonant ConjunctLinker ExtCccZwj* which invites a reading very different from the intended one. The rendering could be improved by using “vertical-align: bottom” on the last two cells of the row.
Date/Time: Fri Jun 23 11:33:23 CDT 2023
ReportID: ID20230623113323
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
The introduction to word boundaries in UAX 29 has a paragraph on the relationship between word boundaries and line boundaries. It should be clarified that this relationship exists only in some scripts, not in others. In Chinese, Japanese, Balinese, Brahmi, etc. line breaking pays no attention to words. Also, thanks to hyphenation engines for languages where words do matter for line breaking, line breaks within words are far more common than the statement on SHY would imply. The last paragraph in the same section mentions three Line_Break property values and then states “that means that satisfactory treatment of languages like Chinese or Thai requires special handling”. Chinese uses none of the three Line_Break property values, and while word breaking for Chinese requires special handling, that has nothing to do with its line breaking.
Date/Time: Sun Jun 25 06:15:35 CDT 2023
ReportID: ID20230625061535
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 469
Section 3, “Grapheme Cluster Boundaries”, states: »Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.« This does not actually hold true for line boundaries when an emoji modifier is applied to a non-standard base character. For example, the sequence <U+1F9DF, U+1F3FB> 🧟🏻 (ZOMBIE, EMOJI MODIFIER FITZPATRICK TYPE-1-2) is a single grapheme cluster because emoji modifiers have Grapheme_Cluster_Break=Extend, but nonetheless a line break is theoretically allowed between the two characters because ZOMBIE has Emoji_Modifier_Base=False and line break rule LB30b applies only to characters with Emoji_Modifier_Base=True or unassigned code points with Extended_Pictographic=True. In fact, Chromium-based web browsers will break lines in the middle of these sequences.
Date/Time: Fri Jun 30 07:45:06 CDT 2023
ReportID: ID20230630074506
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 469
Table 1c defines the following regex pattern:
conjunctCluster := LinkingConsonant ExtCccZwj* (ConjunctLinker ExtCccZwj* LinkingConsonant)+
If we expand the “(ConjunctLinker ExtCccZwj* LinkingConsonant)+” part, we get a sequence pattern where ExtCccZwj can occur only *after* a ConjunctLinker but not *before* it:
ConjunctLinker ExtCccZwj* LinkingConsonant ConjunctLinker ExtCccZwj* LinkingConsonant ConjunctLinker ExtCccZwj* LinkingConsonant ...
This does not match rule GB9c which accounts for ExtCccZwj in both positions, which is necessary because Indic scripts make use of combining marks with CCC values both smaller and greater than 9 (Virama). Therefore I think the definition should actually be:
conjunctCluster := LinkingConsonant ExtCccZwj* (ConjunctLinker ExtCccZwj* LinkingConsonant ExtCccZwj*)+
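The difference between the two definitions can be checked with a small Python sketch. The one-letter stand-ins (L for LinkingConsonant, C for ConjunctLinker, E for ExtCccZwj) are assumptions made for illustration, so the patterns operate on strings of class labels rather than on real text.

```python
import re

# Hypothetical stand-ins: L = LinkingConsonant, C = ConjunctLinker, E = ExtCccZwj
TABLE_1C = re.compile(r"LE*(CE*L)+")     # definition as given in Table 1c
PROPOSED = re.compile(r"LE*(CE*LE*)+")   # definition suggested in this report

def matches_whole(pattern, classes):
    """True if the whole class string is one conjunct cluster under pattern."""
    return pattern.fullmatch(classes) is not None

# Three consonants, with an ExtCccZwj between the second consonant
# and the ConjunctLinker that follows it.
sample = "LCLECL"
print(matches_whole(TABLE_1C, sample))  # False
print(matches_whole(PROPOSED, sample))  # True
```

Note that the suggested pattern also accepts ExtCccZwj after the final consonant (for example "LCLE"), which the Table 1c pattern does not.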
Date/Time: Tue Jul 04 17:39:03 CDT 2023
ReportID: ID20230704173903
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
The discussion of Aksaras in UAX 29 states that “consonant cluster aksaras are not incorporated into the default rules”. That’s no longer correct; such aksaras are now incorporated for six scripts, and more will hopefully follow. The same paragraph mentions “additional prefixed consonants”. That seems to reflect a Devanagari-centric view, as in many other scripts the additional consonants are better described as “subjoined” or in other terms. I suggest removing the word “prefixed”.
Date/Time: Tue Jul 04 17:39:43 CDT 2023
ReportID: ID20230704173943
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
The proposed update of UAX 29 states “Boundaries never occur within a combining character sequence or conjoining sequence, so the boundaries within non-NFD text can be derived from corresponding boundaries in the NFD form of that text.” Unfortunately, the stated condition is not sufficient. It would also be necessary that normalization didn’t reorder characters out of character pairs that should not be broken up. As the section “Compatibility with normalization” of L2/23-141 discusses, it sometimes does, and workarounds are necessary to achieve the desired results in normalized text.
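The kind of reordering at issue can be observed with Python's standard `unicodedata` module. The sequence below is an illustration chosen for this sketch and is not taken from L2/23-141: a Devanagari nukta (ccc 7) typed after a virama (ccc 9) is moved in front of the virama by canonical reordering.

```python
import unicodedata

KA, VIRAMA, NUKTA = "\u0915", "\u094D", "\u093C"

# Combining classes that drive canonical reordering
print(unicodedata.combining(VIRAMA), unicodedata.combining(NUKTA))  # 9 7

text = KA + VIRAMA + NUKTA              # nukta follows the virama
nfd = unicodedata.normalize("NFD", text)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0915', 'U+093C', 'U+094D']
```

After normalization the virama is no longer adjacent to the consonant it followed, so any rule that keys on adjacent pairs sees a different pair in the NFD text than in the original.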
Date/Time: Tue Jul 04 17:40:02 CDT 2023
ReportID: ID20230704174002
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 469
Document L2/23-140, Setting expectations for grapheme clusters, is intended to be feedback to PRI 469.