This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Mon Sep 25 10:21:46 CDT 2017
Name: Tsuyoshi Ito
Report Type: Error Report (PRI #355)
Opt Subject: The regular expressions for a grapheme cluster in Table 1b of UAX #29 do not match the rules in Section 3.1.1
Table 1b of UAX #29, “Unicode Text Segmentation” (http://www.unicode.org/reports/tr29/tr29-31.html), shows the regular expressions for a legacy and an extended grapheme cluster. Section 6.3 seems to indicate that they are supposed to be equivalent to the rules in Section 3.1.1. (However, to be honest, the wording of Section 6.3 is not very clear to me. It says “The conversion into a regular expression is fairly straightforward for the grapheme cluster boundaries of Table 2.” but Table 2 is a summary of the Grapheme_Cluster_Break property values, not the rules to determine grapheme cluster boundaries.) However, I think that they are quite different. For example:
* According to the rules in Section 3.1.1, a string of more than two regional indicator symbols is not a (legacy or extended) single grapheme cluster. However, according to the regular expressions in Table 1b, it is a single (legacy and extended) grapheme cluster.
* According to the rules in Section 3.1.1, an emoji ZWJ sequence is a single grapheme cluster. However, the regular expression for neither a legacy nor extended grapheme cluster treats ZWJ in a special way, and it puts a grapheme cluster boundary before and after ZWJ.
Please consider one of the following changes:
Option 1: The regular expressions in Table 1b (and the regular expressions in Table 1c used there) should be updated to match the rules in Section 3.1.1.
Option 2: The text of Sections 3.1.1 and 6.3 should be updated to clarify that the regular expressions in Table 1b do not necessarily match the rules in Section 3.1.1.
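The first discrepancy in the report above can be illustrated with a minimal Python sketch. It assumes, as the report implies, that the Table 1c pattern for regional indicators behaves like `RI+`. The pairwise grouping of Section 3.1.1 is modelled with a hypothetical `RI RI | RI` pattern. Neither regular expression is copied from UAX #29, and the variable names are illustrative only.

```python
import re

# Regional indicator symbols occupy U+1F1E6..U+1F1FF.
RI = "[\U0001F1E6-\U0001F1FF]"

# Assumption: a pattern of the form RI+ consumes a whole run as one cluster.
run_pattern = re.compile(f"{RI}+")
# Hypothetical pattern modelling pairwise grouping: two RIs at a time,
# with a lone trailing RI allowed.
pair_pattern = re.compile(f"{RI}{RI}|{RI}")

# Four regional indicators in a row: U, S, F, R.
text = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"

run_clusters = run_pattern.findall(text)
pair_clusters = pair_pattern.findall(text)

# The run pattern reports one cluster; the pairwise pattern reports two.
print(len(run_clusters), len(pair_clusters))
```

Under these assumptions the two patterns disagree on any run longer than two regional indicators, which is the mismatch the report describes.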
Date/Time: Tue Oct 10 10:06:06 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: PRI #355: Cursors in ligatures
“For example, the text editing framework must know if a digraph is represented as a single glyph in the font, which therefore cannot have a cursor separating its two parts.” That is not true: text editing frameworks can and do put cursors within ligature glyphs.
Date/Time: Tue Oct 10 10:15:49 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Devanagari kshi does not need tailoring
Table 1a lists ⟨क्षि⟩ as a tailored grapheme cluster, but it no longer needs tailoring.
Date/Time: Tue Oct 10 10:29:29 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: PRI #355: More LinkingConsonants
Grapheme_Cluster_Break=LinkingConsonant should be expanded to include Indic_Syllabic_Category=Vowel_Independent and Indic_Syllabic_Category=Consonant_Dead. Independent vowels may be subjoined in Khmer, and Bengali’s khanda ta may take a repha.
Date/Time: Wed Oct 11 14:18:02 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Indic clusters without virama
Some Indic consonant clusters do not use a virama. GB9c should be (StackingConsonant | Virama | ZWJ) × LinkingConsonant, where StackingConsonant is Indic_Syllabic_Category = Consonant_With_Stacker.
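A rough sketch of the suggested GB9c shape, written in Python over class labels only. The labels are plain strings taken from the comment above; no character data is included, and the function name `gb9c_no_break` is hypothetical.

```python
# Classes that, under the suggested rule, glue onto a following LinkingConsonant.
# These are labels only; mapping characters to classes is out of scope here.
JOINERS = {"StackingConsonant", "Virama", "ZWJ"}


def gb9c_no_break(prev_class: str, next_class: str) -> bool:
    """Return True when the suggested rule forbids a break between the two classes."""
    return prev_class in JOINERS and next_class == "LinkingConsonant"


# A stacking consonant followed by a linking consonant stays together,
# while an unrelated class before a linking consonant is not affected by this rule.
print(gb9c_no_break("StackingConsonant", "LinkingConsonant"))
print(gb9c_no_break("Extend", "LinkingConsonant"))
```

The sketch covers this one rule in isolation; a real segmenter would evaluate it in order with the other grapheme cluster rules.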
Date/Time: Wed Oct 11 14:33:49 CDT 2017
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: U+0BCD TAMIL SIGN VIRAMA
U+0BCD TAMIL SIGN VIRAMA generally does not create conjuncts. The exceptions are ⟨க்ஷ⟩ and ⟨ஶ்ரீ⟩. It may better match user expectations to exclude U+0BCD from GCB = Virama.
Date/Time: Thu Oct 19 17:50:25 CDT 2017
Name: Roozbeh Pournader
Report Type: Public Review Issue
Opt Subject: Virama and UAX #29
Because of the dual usage of the characters with the Indic Syllabic Category of Virama. It appears to me that a virama is more frequently just a visible killer instead of an invisible stacker. Tamil is a common example where the visible killer frequency is much higher than the invisible stacker frequency. But I expect several other scripts would have a similar situation, and even for languages such as Hindi, the frequency of visible killer usage would be too high for always disallowing a grapheme break. If breaks after this class are being forbidden, I suggest removing the Indic_Syllabic_Category = Virama class from the new virama class, and renaming the class to InvisibleStacker.
Also, forbidding breaks between ZWJ and LinkingConsonant appears incorrect. ZWJ is generally used in Indic as an invisible letter. So in that usage, it could be thought of as ending a cluster with a break allowed after it. Also note that ZWJ is used after virama in the legacy representation of Malayalam Chillus, which is still very common on the internet and in newly created content. Forbidding a cluster break between a ZWJ and a consonant would be incorrect in such usage.
Finally, note that it's not just characters of InSC=Consonant that take post-stacker forms. Independent Vowels, Consonant Placeholders, and perhaps Consonant_Deads and Consonant_With_Stacker may appear after stackers. There may even be more.
Altogether, I think the proposed rules are based on a simplified version of the Indic grapheme cluster patterns which needs much more research. They should be rewritten to only discourage breaks in these non-controversial cases:
1. Forbid grapheme breaks after all characters of InSC=Invisible_Stacker, regardless of the character that comes after. (We don't need to worry about odd cases, like when an Invisible_Stacker is followed by a space or punctuation. These are malformed text, and it's OK to go either way on malformed text.)
2. Forbid breaks before all InSC={Virama, Invisible_Stacker, Pure_Killer} (note that this is already the case, since they are currently categorized as Extend, but may be necessary if the Extend class is split).
Date/Time: Sat Dec 9 18:16:56 CST 2017
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI 355: Proposed Update UAX #29 Unicode Text Segmentation
If the second of the three notes at the end of Section 3.1.1 (starting "A tailoring for basic aksara support") is to be retained, note that it will typically be untrue for a language with both a virama or an invisible_stacker and a pure killer, e.g. U+1039 MYANMAR SIGN VIRAMA and U+103A MYANMAR SIGN ASAT, or U+0D4D MALAYALAM SIGN VIRAMA and U+0D3B MALAYALAM SIGN VERTICAL BAR VIRAMA and U+0D3C MALAYALAM SIGN CIRCULAR VIRAMA. The note also assumes that the tailoring is specific to a script.
Further to the point about cursors stepping through ligatures, this can be seen with the Latin ligature 'ffi' and the Arabic lam-alif. One can also find the cursor stepping, sometimes not visibly, through Indic aksharas composed of multiple base consonants and even the Tai Tham ligature.
Is there evidence to support the claim in Section 3, 'The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred.'? Furthermore, why should the preferences of Indians determine how Cambodians edit? The difference between the two types of cluster will become larger when extended grapheme clusters grow to be whole aksharas (with limited exemptions for Ahom, Myanmar and Tai Tham), and modifying or inserting a character requires deleting all the akshara's characters from its position onwards.
Date/Time: Mon Dec 11 12:35:36 CST 2017
Name: Otto Stolz
Report Type: Error Report
Opt Subject: Proposed Update Unicode® Standard Annex #29
http://www.unicode.org/reports/tr29/tr29-32.html#Word_Boundaries
Figure 2 does not match the pertinent text, which says: “That is done with the above boundaries by ignoring any words that do not contain a letter, as in Figure 2.” In contrast, Figure 2 includes the word “32.3”, which does not contain any letter.
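The filtering step quoted in the report can be sketched in a few lines of Python. This is an approximation under stated assumptions: `str.isalpha` stands in for "is a letter", which may not be the exact property UAX #29 intends, and the token list is hypothetical apart from “32.3”.

```python
def has_letter(token: str) -> bool:
    """Approximate 'contains a letter' with str.isalpha on each character."""
    return any(ch.isalpha() for ch in token)


# Hypothetical segments between word boundaries; only "32.3" comes from the report.
tokens = ["The", "quick", "32.3", "?", "fox"]

# Keep only the segments that contain at least one letter.
words = [t for t in tokens if has_letter(t)]
print(words)
```

With this filter “32.3” is dropped, so a figure that shows it as a selected word would not match the quoted sentence, which is the inconsistency the report points out.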
Date/Time: Tue Jan 2 08:14:48 CST 2018
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: UAX #29: Eliminating E_Modifier
Originally discussed at https://unicode.org/mail-arch/unicode-ml/y2018-m01/0000.html
In UAX 29, the GB10 rule[1] (and the WB14 rule[2]) states that we should not break before E_Modifier characters when they come after an emoji base (with optional Extend characters in between). Given that the spec is allowed to ignore degenerate cases, we should merge E_Modifier into Extend, as outlined in Mark's email[4], and eliminate GB10 entirely.
<random non-emoji, skin tone modifier> sounds very much like a degenerate case to me. <non-EBG GAZ emoji, skin tone> also feels rather degenerate. There are only three GAZes (heart (U+2764), kiss (U+1F48B), speech bubble (U+1F5E8)) and I can't see why you'd end up with a skin tone modifier on them except by accident. Additionally, the current draft[3] is eliminating the GAZ categories anyway.
Thanks, -Manish
[1]: http://www.unicode.org/reports/tr29/#GB10
[2]: http://www.unicode.org/reports/tr29/#WB14
[3]: https://www.unicode.org/reports/tr29/tr29-32.html
[4]: https://unicode.org/mail-arch/unicode-ml/y2018-m01/0004.html
Date/Time: Mon Jan 22 21:04:53 CST 2018
Name: Richard Wordingham
Report Type: Public Review Issue
Opt Subject: PRI355: Akshara Boundaries
The major principled issue I have is that UAX#29 can no longer claim to have a sound definition of the concept of a 'user-perceived character'. Perhaps it never did. Some of the claims would be better if there were evidence to back them up. For example, this evening I did a quick bit of research and asked the Korean owner of the local Korean restaurant how many letters there were in the hangul spelling of 'Gangnam'. She traced out the spelling of the word (강남) and came back with the answer '6'. UAX#29 claims it has 2 user-perceived characters. You might also argue that she has spent too long in England to be a useful informant. The following old paragraph causes grief for me: "As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. Grapheme clusters commonly behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. For example, when a grapheme cluster is represented internally by a character sequence consisting of base character + accents, then using the right arrow key would skip from the start of the base character to the end of the last accent." The problem is that many editors read this as saying that the arrow keys should move by whole characters. The result of this is that in many applications, to replace the first character of a grapheme cluster one must retype the entire grapheme cluster. With a grapheme cluster of three characters, as is common in Thai and Korean, this is irritating. With a grapheme cluster of four or five characters, as is common in Northern Thai, it is annoying. The prospect of the grapheme cluster being extended to include a whole akshara fills me with dismay. Consider the Northern Thai word ᩉ᩠ᨾᩰᩬᩫᩡ <U+1A49 HIGH HA, U+1A60 SAKOT, U+1A3E MA, U+1A70 SIGN OO, U+1A6C SIGN OA BELOW, U+1A6B SIGN O, U+1A61 SIGN A> /mɔʔ/ 'scrumptious'. 
At present, this 7-character word is split into three grapheme clusters, of lengths 2, 4 and 1. However, it is clearly a single akshara. To change the first character, I would have to also retype the other 6 characters. My first thought was that changing software that way would breach the UK's Equality Act 2010, by further restricting the ability of Northern Thai users to do character by character editing. (My wife's protected characteristic extends to me for the purposes of the Act.) However, there may be a get-out in the form of Schedule 3 Section 30 (https://www.legislation.gov.uk/ukpga/2010/15/schedule/3/paragraph/30). The supplier of the service can claim that they only supply a character by character editing facility to the ethnic groups using simple scripts, and that they are under no obligation to supply the service to members of other ethnic groups. - "If a service is generally provided only for persons who share a protected characteristic, a person (A) who normally provides the service for persons who share that characteristic does not contravene section 29(1) or (2)— (a) by insisting on providing the service in the way A normally provides it, or (b) if A reasonably thinks it is impracticable to provide the service to persons who do not share that characteristic, by refusing to provide the service." But what an embarrassing defence to offer! However, there is another reason for rejecting the extension of grapheme clusters to whole aksharas. Currently, U+1A63 TAI THAM VOWEL SIGN AA starts a grapheme cluster. However, for non-defective text, it is part of the same akshara as the preceding grapheme cluster. Now, the decision to make U+1A63 start a new grapheme cluster is intrinsically reasonable. It can have its own stack with a subscript consonant and even a vowel, and it is not difficult to find manuscripts showing a line break before it, e.g. L2/07-007 Figure 9b Leaf 2 lines 2/3, ᩈᨾᩮᩣᨴ᩠ᨴᨾ-ᩣᨶᩮᩉᩥ. 
I believe that the akshara should be a level of text above the grapheme cluster. Ideally, it would be below the level of a word, but of course in Sanskrit, word boundaries readily occur within present day grapheme clusters. (I made this recommendation in L2/17-122.) Further comments apply to the definition of akshara boundaries, regardless of whether they are to coincide with the boundaries of grapheme clusters. These rules do not work well where virama may fall back to visible virama. This is particularly the case with Tamil, where conjuncts are restricted to K.SSA and SH.RII. Johny Cibu provided an example where the title துக்ளக் is broken as [ta-u, ka-virama, lla, ka-virama]. However, as per the proposed algorithm it would be: [ta-u, ka-virama-lla, ka-virama] http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg For native intuition, I would cite the Tamil letter-counting account at https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf . What the author counts is not spacing glyphs, but vowel letters and consonant characters, with two significant modifications. Firstly, K.SSA counts as just one consonant, and SH.R.II is also counted as containing a single consonant. In other words, the Tamil virama character works as a pure killer except in those two environments. This is also the story the TUNE protagonists tell us. It will be an inelegant rule for UAX#29, but, unfortunately, reality is messy. To quote Johny Cibu further: "Malayalam could be a similar story. In case of Malayalam, it can be font specific because of the existence of traditional and reformed writing styles. A conjunct might be a ligature in traditional; and it might get displayed with explicit virama in the reformed style. For example see the poster with word ഉസ്താദ് broken as [u, sa-virama, ta-aa, da-virama] - as it is written in the reformed style. 
As per the proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. These breaks would be used by the traditional style of writing. https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg I believe there is a problem with the first two examples in Table 12-33. If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E MALAYALAM VOWEL SIGN AA> to the first two examples, yielding *പാലു്കാ and *എ്ന്നാകാ, one would have three Malayalam aksharas, not two extended grapheme clusters as the proposed rules would say.
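The two Tamil segmentations of துக்ளக் cited in the comment above can be reproduced with a deliberately simplified Python sketch. It is not the UAX #29 algorithm: it assumes that U+0B95..U+0BB9 are the Tamil consonants, attaches every other character to the preceding cluster, and uses a hypothetical `virama_links` flag to switch between "virama joins the next consonant" and "virama is a pure killer".

```python
VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA


def is_consonant(ch: str) -> bool:
    # Simplifying assumption: treat U+0B95..U+0BB9 as the Tamil consonants.
    return "\u0B95" <= ch <= "\u0BB9"


def clusters(text: str, virama_links: bool) -> list:
    """Split text into clusters; virama_links controls whether a virama glues
    the following consonant onto the current cluster."""
    out = []
    for ch in text:
        if not out:
            out.append(ch)
        elif is_consonant(ch):
            if virama_links and out[-1].endswith(VIRAMA):
                out[-1] += ch  # virama + consonant kept together
            else:
                out.append(ch)  # consonant starts a new cluster
        else:
            # Vowel signs and the virama attach to the previous cluster.
            # This is only adequate for the example word below.
            out[-1] += ch
    return out


# ta, vowel sign u, ka, virama, lla, ka, virama
word = "\u0BA4\u0BC1\u0B95\u0BCD\u0BB3\u0B95\u0BCD"

linked = clusters(word, virama_links=True)    # three clusters
killer = clusters(word, virama_links=False)   # four clusters
print(len(linked), len(killer))
```

With `virama_links=True` the sketch yields [ta-u, ka-virama-lla, ka-virama], and with `virama_links=False` it yields [ta-u, ka-virama, lla, ka-virama], matching the two breakdowns quoted in the comment.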
Feedback above this line was reviewed in the January 2018 UTC meeting.
Date/Time: Mon Feb 12 20:02:46 CST 2018
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Armenian medial punctuation
Because the Armenian punctuation marks U+055B, U+055C, and U+055E occur within words, rather than occurring finally, they should have Word_Break = ExtendNumLet.
Date/Time: Mon Feb 12 20:20:51 CST 2018
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #355: Tone letters’ Word_Break
The tone letters U+02E5..02E9 and U+A708..A716 are modifier letters used after syllables written in IPA, pinyin, or other Latin-script phonetic alphabets. For the same reason that applied to the characters in table 1 of L2/16-336, these tone letters should have Word_Break = ALetter. Another reason is that the tone bars ligate for contour tones, and it is odd to break a ligature like ⟨˥˩⟩ into two words.
Date/Time: Wed Mar 7 03:19:38 CST 2018
Name: Yichao 'Peak' Ji
Report Type: Error Report
Opt Subject: UAX29 Text Segmentation: WB11 and WB12
According to Word Boundary Rules #11 and #12 from the Technical Report "UAX29 Text Segmentation":
U+FF0C ( , ) FULLWIDTH COMMA
U+FF1B ( ; ) FULLWIDTH SEMICOLON
...
are considered as "MidNum" and should not break within numeric sequences, such as “123,345” or “1;2”. Fullwidth chars are widely used in CJK, but we hardly use fullwidth commas in numbers as separators. In fact, this rule could produce unexpected results since we don't add whitespace after fullwidth commas. For example:
今晚19:30,2014大奖赛即将开幕。 (The 2014 championship will start at 19:30)
The segmentation utility (https://unicode.org/cldr/utility/breaks.jsp) will generate the following tokens:
今晚 | 19 | : | 30,2014 | 大奖 | 赛 | 即将 | 开幕 | 。
For search engines using Unicode tokenizers, "30,2014" will be indexed as a single token while processing this page: http://sports.sina.com.cn/j/2014-07-02/21357237624.shtml
We've scanned a Chinese news corpus of 2.8 million articles, and found more than 5 thousand invalid tokens. This behavior might heavily affect search engines and analytics software. I'm wondering why the fullwidth comma and semicolon are in the list of "MidNum", and is there any chance to update tr29?
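The effect described in the report can be imitated with a small Python sketch. It is not a UAX #29 implementation: it only models the "digits, MidNum, digits" shape of WB11 and WB12 with a regular expression, handles ASCII digits only, and the helper name `number_tokens` is hypothetical. The fullwidth comma is written as the escape `\uFF0C` so that it cannot be confused with the ASCII comma.

```python
import re


def number_tokens(text: str, mid_num: str) -> list:
    """Find numeric tokens: runs of ASCII digits that may be joined by a single
    character from mid_num when digits follow on both sides."""
    cls = "".join(re.escape(c) for c in mid_num)
    pattern = re.compile(rf"[0-9]+(?:[{cls}][0-9]+)*")
    return pattern.findall(text)


# The example sentence from the report, with U+FF0C after "19:30".
text = "今晚19:30\uFF0C2014大奖赛即将开幕。"

# MidNum modelled with and without the fullwidth comma and semicolon.
with_fullwidth = number_tokens(text, ",;\uFF0C\uFF1B")
without_fullwidth = number_tokens(text, ",;")

print(with_fullwidth)
print(without_fullwidth)
```

When the fullwidth comma is in the joining set, "30" and "2014" fuse into one token; when it is left out, they are separate tokens, which is the behaviour the report asks for.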
Date/Time: Thu Apr 12 17:52:31 CDT 2018
Name: Daniel Bünzli
Report Type: Public Review Issue
Opt Subject: UAX 29 and \p{Extended_Pictographic}
Hello, UAX 29 relies on a property \p{Extended_Pictographic} that is not part of the UCD but defined in an ad-hoc text file in UTS #51. As a result the property is not part of the ucdxml (UAX #42), which complicates the life of implementers whose pipelines depend on the ucdxml to implement the standard. Would it be possible to add the data of UTS #51 to the ucdxml (possibly as separate files)? Thanks, Daniel Bünzli
Date/Time: Mon Apr 23 16:02:50 CDT 2018
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: 11.0 Uax #29
It appears that this UAX will depend on a property published only in UTS #51. That contradicts what you say about UTSes: "Conformance to the Unicode Standard does not imply conformance to any UTS" ( http://www.unicode.org/reports/about-reports.html#Types ). To solve this, you merely need to put the data for the Extended_Pictographic property in the UCD. Then you can update UTS #51 all you want, and it won't affect Unicode 11. Unicode 12 would then include a snapshot of what's in UTS #51 at that time.
Also, I support making sequences of horizontal white space a single word. I can't think of a reasonable use-case where the current behavior actually would be desirable. Perhaps you can. But if not, that's a strong argument for changing it to what's in the draft. But have you really thought through what should happen if the final character in such a sequence is succeeded by a combining mark? I claim it makes more sense for that final character to be peeled off the rest and attached to the mark.
Date/Time: Wed Apr 25 12:30:28 CDT 2018
Name: Asmus Freytag
Report Type: Public Review Issue
Opt Subject: PRI #355 / UAX #29
There's an appearance of an improper normative dependence on a UTS. It says in http://www.unicode.org/reports/about-reports.html#Types that "Conformance to the Unicode Standard does not imply conformance to any UTS". Yet the proposed UAX #29 depends on the Extended_Pictographic property published only in UTS #51. If that is correct, this situation would be improper and should be rectified at the earliest opportunity. The objection is not to the use of such a property, but if used normatively, it needs to be part of the UCD, and not in some separate UTS.
There is a good reason to consider such dependencies improper, not least the fact that (generally) any UTS has a separate maintenance and publication cycle from the UCD, creating periods where a UAX may effectively not be defined for the entire repertoire. For a very limited transition period, it may be possible to declare the dependence as not normative; in that case, there should be a clear indication of how the apparent conflict is to be resolved in future versions.
I believe this is to some extent a known issue, but it has caused concerns in some parts of the user community. Hence this request to address the issue promptly and positively.