This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Mon Nov 28 16:01:08 CST 2016
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #341 formatting
One of the example tailored grapheme clusters is ⟨kʷ⟩. This is encoded in HTML as `k<sup>w</sup>`. Why not use Unicode?
Date/Time: Tue Jan 3 19:06:13 CST 2017
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: UAX #29: Property tables should be updated for emoji sequences
The spec lists GraphemeBreakProperty.txt[1] and WordBreakProperty.txt[2] as the normative source for grapheme and word categorization respectively. However, the spec also gives non-normative definitions of these properties. In particular, it defines Glue_After_Zwj[3] as

>> Emoji characters that do not break from a previous ZWJ in a defined
>> emoji zwj sequence, and are not listed as Emoji_Modifier_Base=Yes in emoji-data.txt. See [UTR51].

Going through emoji-zwj-sequences.txt[4], there are a lot of emoji characters that satisfy this property. The kiss/heart emojis are like this, as well as every object emoji in the "Gendered Role, with object" section. However, we only count the kiss, heart, and speech bubble emoji as GAZ in the property table. The property table should include all role and gender modifiers as GAZ. Could this be updated?

[1]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt
[2]: http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt
[3]: http://www.unicode.org/reports/tr29/proposed.html#Glue_After_Zwj
[4]: http://unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt
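The check described in this report can be sketched roughly as follows. This is a minimal illustration, not the procedure used to generate the property tables: the sample lines are illustrative and only approximate the "code points ; type ; description" layout of emoji-zwj-sequences.txt, the function name `code_points_after_zwj` is hypothetical, and the Emoji_Modifier_Base filter from the quoted definition is deliberately not applied.

```python
ZWJ = 0x200D  # ZERO WIDTH JOINER

# Assumption: illustrative lines in the general shape of emoji-zwj-sequences.txt.
SAMPLE_LINES = [
    "# comment line",
    "1F468 200D 2764 FE0F 200D 1F468 ; Emoji_ZWJ_Sequence ; couple with heart: man, man",
    "1F469 200D 1F692 ; Emoji_ZWJ_Sequence ; woman firefighter",
    "1F441 200D 1F5E8 ; Emoji_ZWJ_Sequence ; eye in speech bubble",
]


def code_points_after_zwj(lines):
    """Collect every code point that directly follows a ZWJ in a sequence."""
    found = set()
    for line in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        # The first semicolon-separated field holds the hex code points.
        cps = [int(tok, 16) for tok in data.split(";", 1)[0].split()]
        for prev, cur in zip(cps, cps[1:]):
            if prev == ZWJ:
                found.add(cur)
    return found


candidates = code_points_after_zwj(SAMPLE_LINES)
print(sorted(f"U+{cp:04X}" for cp in candidates))
```

Comparing such a candidate set against the code points listed as Glue_After_Zwj in GraphemeBreakProperty.txt would show which post-ZWJ characters are missing from the table.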
Date/Time: Wed Jan 4 04:31:52 CST 2017
Name: Manish Goregaokar
Report Type: Public Review Issue
Opt Subject: UAX #29: Avoiding grapheme breaks on Indic consonant clusters
I've often noticed that grapheme clusters in Indic scripts don't span consonant clusters. For example, "ग्रा" is not a single grapheme cluster, but two: <ग्> + <रा>. There is reasoning given in the spec for this.

>> Grapheme clusters can be tailored to meet further requirements. Such
>> tailoring is permitted, but the possible rules are outside of the scope of
>> this document. One example of such a tailoring would be for the aksaras,
>> or orthographic syllables, used in many Indic scripts. Aksaras usually
>> consist of a consonant, sometimes with an inherent vowel and sometimes
>> followed by an explicit, dependent vowel whose rendering may end up on any
>> side of the consonant letter base. Extended grapheme clusters include such
>> simple combinations.
>> However, aksaras may also include one or more additional prefixed consonants,
>> typically with a virama (halant) character between each pair of consonants in
>> the sequence. Such consonant cluster aksaras are not incorporated into the
>> default rules for extended grapheme clusters, in part because not all such
>> sequences are considered to be single “characters” by users. Indic scripts
>> vary considerably in how they handle the rendering of such aksaras—in some
>> cases stacking them up into combined forms known as consonant conjuncts, and
>> in other cases stringing them out horizontally, with visible renditions of the
>> halant on each consonant in the sequence. There is even greater variability in
>> how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are
>> handled for display in combinations in aksaras. So tailorings for aksaras may
>> need to be script-, language-, font-, or context-specific to be useful.

This really boils down to "it depends on the font" and "you can use a tailoring here". I'll note that:

- Most fonts will produce a single glyph without a halant for most consonant clusters in modern use. It's only when you get to things like three-consonant clusters (rare) that it stops working, and even then for most three-consonant clusters that come up (e.g. those involving a `ra` on one end) you will have a glyph. More common are consonant clusters rendering as larger glyphs, but that shouldn't mean they get split up into separate grapheme clusters.
- As far as the language is concerned the halant and sans-halant forms are equivalent, but the sans-halant form is generally preferred. I've only seen the halant form used in complex clusters from Sanskrit and in typewriter-produced text.
- As far as text segmentation is concerned you rarely want to break a consonant cluster. If, for example, I'm selecting a segment of a word to copy-paste, I will almost always select whole clusters.
- As far as I can tell, tailoring is for ambiguous cases where it wouldn't make sense to use the tailoring as part of the default algorithm, either if you're trying for a very specific form of segmentation (e.g. backspace -- backspace usually gobbles individual combining characters, but in the case of flag emoji many input fields will delete the entire emoji -- this is not the regular algorithm for segmentation), or for shared scripts where you don't want to cause conflicts. This case seems to be mostly unambiguous, on the other hand.

Additionally, Hangul has a very similar problem, but it does have special handling for it. While modern Korean only uses choseong+jungseong+optional jongseong (LV or LVT) syllable blocks, the spec does allow for things like LLLLVTTT (e.g. <ᄀᄀᄀ각ᆨᆨ>). In this case, the whole sequence is considered a single grapheme cluster (it even selects without segmentation in Firefox and Chrome). There don't seem to be any fonts which handle anything more than LVT glyphs, however.

I think we should be consistent here, and try to match what would be expected in an Indic language. The simplest thing to do would be to define halant characters as non-breaking on either side. This does mean that if a halant character is side-by-side with something from a different script it will still form the same cluster, which is questionable (but we do that already with things like <gौ> being considered a single cluster). If that behavior is undesired, a system similar to the Hangul one can be devised, where an Indic grapheme cluster is defined as C(HC)*V* (one base consonant, possibly followed by halant-consonant pairs, followed by zero or more vowel modifiers).

Thanks!
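The C(HC)*V* shape proposed above can be sketched as a regular expression. This is only a rough illustration under stated assumptions, not a proposed rule set: the three character classes are a hand-picked Devanagari-only approximation, and the names `AKSARA` and `aksara_clusters` are hypothetical. Anything that does not start with a consonant falls through as a single-character segment.

```python
import re

# Assumption: a rough, Devanagari-only approximation of the three classes.
CONSONANT = "[\u0915-\u0939]"                 # KA..HA
HALANT = "\u094D"                             # DEVANAGARI SIGN VIRAMA
VOWEL_MOD = "[\u093E-\u094C\u0901-\u0903]"    # dependent vowel signs, candrabindu, anusvara, visarga

# C(HC)*V*, with a one-character fallback for everything else.
AKSARA = re.compile(
    f"{CONSONANT}(?:{HALANT}{CONSONANT})*{VOWEL_MOD}*|.",
    re.DOTALL,
)


def aksara_clusters(text):
    """Split text into C(HC)*V* clusters; other characters come out one by one."""
    return AKSARA.findall(text)


# GA + VIRAMA + RA + AA stays together as one cluster under this sketch.
print(aksara_clusters("\u0917\u094D\u0930\u093E"))
```

Restricting the pattern to consonants of one script is what avoids the cross-script clustering mentioned above; the "non-breaking halant" alternative would not need the consonant class at all.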
Date/Time: Sat Jan 21 17:21:11 CST 2017
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UAX29 and spans of space
I submitted a request last year suggesting that the Word Break property not consider each individual horizontal white space character in a span of them to be a separate word. I was told that this might have merit but was too late for Unicode 9.0, and that it would be put out for public comment afterwards. I did not follow up, assuming that you would. But now, I see that this isn't being asked about in the 10.0 proposed UAX29. I did read the minutes of the meetings since, and I don't believe there was any mention of this, so my guess is that this dropped through the cracks.
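The difference being requested can be illustrated with a deliberately naive segmenter. This is a sketch under stated assumptions and not the UAX #29 word break algorithm: `SPACE_CHARS` is a small illustrative subset of horizontal white space, both function names are hypothetical, and every run of non-space characters is treated as one segment.

```python
import re

# Assumption: a small illustrative subset of horizontal white space characters.
SPACE_CHARS = " \t\u00A0\u3000"


def segments_per_space(text):
    """Each white space character is its own segment."""
    return re.findall(f"[{SPACE_CHARS}]|[^{SPACE_CHARS}]+", text)


def segments_with_space_spans(text):
    """A whole run of white space characters forms a single segment (the requested behavior)."""
    return re.findall(f"[{SPACE_CHARS}]+|[^{SPACE_CHARS}]+", text)


sample = "a  b"  # two spaces between the letters
print(segments_per_space(sample))
print(segments_with_space_spans(sample))
```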
Feedback above this line was reviewed during UTC #150, January 2017.
Date/Time: Wed Mar 8 11:30:18 CST 2017
Name: Nick Wellnhofer
Report Type: Error Report
Opt Subject: RI characters in grapheme clusters
In revision 29 of UAX #29, the grapheme cluster rules were updated to break after each pair of RI characters (GB 12 and 13). But the text still contains the following paragraph (also in the draft for revision 30): "The base can be single characters, or be any sequence of Hangul Jamo characters that form a Hangul Syllable, as defined by D133 in The Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji national flag symbols corresponding to ISO country codes. Sequences of more than two RI characters should be separated by other characters, such as U+200B ZERO WIDTH SPACE (ZWSP)." I think the paragraph should be updated to reflect the new rules.
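The pairing behavior referred to here (a break after each pair of RI characters) can be sketched as below. This is a minimal illustration that handles only Regional_Indicator characters, with hypothetical function names; all other characters are emitted one at a time, which is not how the full grapheme cluster rules treat them.

```python
def is_regional_indicator(ch):
    """True for U+1F1E6..U+1F1FF (REGIONAL INDICATOR SYMBOL LETTER A..Z)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_regional_indicators(text):
    """Group RI characters into pairs, counting from the start of each RI run."""
    clusters = []
    i = 0
    while i < len(text):
        if (
            is_regional_indicator(text[i])
            and i + 1 < len(text)
            and is_regional_indicator(text[i + 1])
        ):
            clusters.append(text[i:i + 2])  # one flag: two RI characters
            i += 2
        else:
            clusters.append(text[i])        # lone RI or any other character
            i += 1
    return clusters


# Four RI characters in a row (U, S, F, R) come out as two pairs.
four = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"
print([len(c) for c in split_regional_indicators(four)])
```

Under this pairing a run of four RI characters already splits into two flags without a ZWSP, which is why the quoted paragraph no longer matches the rules.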
Date/Time: Fri Apr 7 15:09:47 CDT 2017
Name: Andy Heninger
Report Type: Error Report
Opt Subject: Full Width Digits Word Break Property
Full-width ASCII digits (U+FF10 - U+FF19) have the word break property of "Other". It should probably be Numeric. The full-width digits' existing Line_Break property of Ideographic is correct; line wrapping within a full-width number is expected. But word selection should match a multi-digit number. There was also a CLDR ticket for this problem, http://unicode.org/cldr/trac/ticket/6555, which was resolved as out of scope with a suggestion to submit feedback to Unicode. As far as I can find, this did not happen. Here is a number composed of full-width digits: １２３４. Double-click it to check browser word break behavior. Chrome, at least, treats the digits as numeric.
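As supporting context, the characters in question can be inspected with Python's standard `unicodedata` module. That module does not expose the Word_Break property, so this sketch only shows that U+FF10..U+FF19 are decimal digits (General_Category Nd) with East Asian Width F; the helper name `describe` is hypothetical.

```python
import unicodedata

FULLWIDTH_DIGITS = [chr(cp) for cp in range(0xFF10, 0xFF1A)]


def describe(ch):
    """Return (code point, General_Category, decimal value, East Asian Width)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.decimal(ch),
        unicodedata.east_asian_width(ch),
    )


for ch in FULLWIDTH_DIGITS:
    print(describe(ch))

# Compatibility normalization maps the full-width digits to ASCII digits.
print(unicodedata.normalize("NFKC", "".join(FULLWIDTH_DIGITS)))
```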