This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Wed Jul 3 19:10:38 CDT 2019
Name: Johan Curcio Lindström
Report Type: Error Report
Opt Subject: ARABIC NUMBER SIGN Control or Prepend in UAX #29
Hello, when implementing the extended grapheme cluster segmentation algorithm, I noticed what appears to be a mistake in the specification. The ARABIC NUMBER SIGN code point (U+0600) belongs to the Format general category, which means that it will have a Grapheme_Cluster_Break of Control. Since it also has Prepended_Concatenation_Mark = Yes, it could also be considered for the Prepend value, but the specification states that Control wins because it is higher up in the table. GB4 states that we break after Control and GB9b that we don't break after Prepend. Since Control won out, we should break after ARABIC NUMBER SIGN. However, the provided test cases in GraphemeBreakTest.txt expect no break after this code point, and the tables in GraphemeBreakProperty.txt include it under Prepend and not under Control. In fact, this seems true for a range of code points in Prepend, but only U+0600 is part of the tests, which is why I mention it. Is there a part of the specification that I'm missing? It is also a bit unclear what the difference is between the GB* rules that use the Any value and those that just leave one side blank, e.g., "sot ÷ Any" vs. " × SpacingMark". Why is GB1 not written as "sot ÷ " or GB9a not as "Any × SpacingMark"?
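The precedence question raised in this report can be sketched in a few lines. This is only an illustration under stated assumptions, not the specification's algorithm: `PCM_SUBSET` is a hand-written, incomplete subset of the Prepended_Concatenation_Mark code points, the helper names `gcb` and `break_after` are hypothetical, and the following character is assumed to be an ordinary letter or digit.

```python
import unicodedata

# Illustrative, incomplete subset of Prepended_Concatenation_Mark=Yes
# code points (assumption: enough to cover U+0600 from the report).
PCM_SUBSET = set(range(0x0600, 0x0606)) | {0x06DD}


def gcb(cp, control_first):
    """Return a simplified Grapheme_Cluster_Break value for cp.

    control_first=True models "Control wins because it is higher in the
    table"; control_first=False models the data files, which list the
    code point under Prepend.
    """
    is_format = unicodedata.category(chr(cp)) == "Cf"
    is_pcm = cp in PCM_SUBSET
    checks = [("Control", is_format), ("Prepend", is_pcm)]
    if not control_first:
        checks.reverse()
    for value, matches in checks:
        if matches:
            return value
    return "Other"


def break_after(cp, control_first):
    """Is there a boundary after cp when an ordinary letter follows?"""
    value = gcb(cp, control_first)
    if value == "Control":
        return True   # GB4: break after Control
    if value == "Prepend":
        return False  # GB9b: do not break after Prepend
    return True       # GB999: otherwise break


print(break_after(0x0600, control_first=True))   # True
print(break_after(0x0600, control_first=False))  # False
```

The two calls give opposite answers for U+0600, which is the discrepancy between the table ordering and the data files that the report describes.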
Date/Time: Sat Jul 6 16:57:48 CDT 2019
Name: Charlotte Buff
Report Type: Error Report
Opt Subject: Line-Break Behaviour of Emoji Modifier Sequences
In Revision 33 of UAX #29 (Unicode Text Segmentation), the rules governing emoji modifier sequences were simplified. In particular, emoji modifiers are now considered generic extenders. This change has not carried over to the line breaking algorithm, however, which still relies on the Emoji_Modifier and Emoji_Modifier_Base properties. As a consequence, certain sequences of characters now form a single grapheme cluster, but still theoretically allow line breaks inside of them, which isn’t very sensible. This discrepancy affects existing characters such as U+1F9DF 🧟 ZOMBIE, which is available in different skin tones as part of Microsoft’s Segoe UI Emoji font despite not being an official modifier base, as well as newly released characters whose properties may not have been fully implemented yet. I propose deprecating the line break properties E_Base and E_Modifier, and merging the affected characters into Ideographic and Combining_Mark respectively. This would synchronise the behaviour between line breaking and text segmentation, and also automatically future‐proof the system for new emoji modifier bases that might be added in the future.
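The mismatch described in this report can be illustrated with a deliberately tiny model. Everything here is an assumption made for the sketch: the property values are hard-coded to follow the report's description (U+1F9DF as a non-base emoji with Line_Break ID, U+1F44B as a modifier base, U+1F3FB as a modifier), the function names are hypothetical, and each function looks at a single rule (GB9 and LB30b) while ignoring the rest of both algorithms.

```python
# Hard-coded property values, assumed for illustration only.
GCB = {0x1F9DF: "Other", 0x1F44B: "Other", 0x1F3FB: "Extend"}
LB = {0x1F9DF: "ID", 0x1F44B: "EB", 0x1F3FB: "EM"}


def grapheme_boundary(before, after):
    """GB9 only: do not break before an Extend character."""
    return GCB[after] != "Extend"


def line_break_opportunity(before, after):
    """LB30b only: do not break between E_Base and E_Modifier."""
    return not (LB[before] == "EB" and LB[after] == "EM")


ZOMBIE, WAVING_HAND, LIGHT_SKIN_TONE = 0x1F9DF, 0x1F44B, 0x1F3FB

for base in (WAVING_HAND, ZOMBIE):
    print(
        hex(base),
        grapheme_boundary(base, LIGHT_SKIN_TONE),
        line_break_opportunity(base, LIGHT_SKIN_TONE),
    )
```

In this model the modifier base and the modifier stay together under both rules, while the non-base emoji forms one grapheme cluster with the modifier yet still admits a line break before it.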
Feedback above this line was reviewed and processed during UTC #160 in July 2019.
Date/Time: Tue Jul 30 12:55:18 CDT 2019
Name: David Corbett
Report Type: Public Review Issue
Opt Subject: PRI #396: More modifier letters for Chinese tones
The examples in L2/04-107 show U+A700..A707 used the same way as the other Chinese tone letters, so U+A700..A707 should have Word_Break = ALetter, like the rest of the Chinese tone letters that are proposed to be ALetter. Is the reason they weren’t proposed along with the others that they are used with Han characters, which have different Word_Break values than Latin? If so, page 7 shows some of them with Latin letters too.
Date/Time: Fri Aug 23 14:17:34 CDT 2019
Name: Theodore Beers
Report Type: Other Question, Problem, or Feedback
Opt Subject: ZWNJ in Annex 29
I think it would help if a sentence or two were added to clarify the exclusion of ZWNJ (U+200C) as a grapheme boundary. In Persian, this character is used to prevent letters from being connected to one another, mainly at points where prefixes and suffixes are attached to words. This (arguably) does not generate a new "user-perceived character," but rather dictates which of the standard letter forms may be set at the point in question.[0] Somewhat similar is the use of ZWNJ in German, to prevent a ligature across the stems of a compound word (e.g., no U+FB02 in Auflage). Specialists have indicated that they agree with the current rule, i.e., that ZWNJ by default should *not* be treated as a grapheme boundary, and thus that it should be grouped with the preceding cluster. I'm not entirely convinced… but more importantly, the annex might benefit from further detail on this point. Where the ZWNJ is mentioned, it is in reference to Indic languages, in which there are in fact unique user-perceived characters that rely on the ZWNJ to be composed correctly. So those are cases where it is obvious that ZWNJ cannot be a grapheme boundary. I think the proper treatment of this character in a language like Persian is less self-evident. There are, to be sure, advantages to the rule as it stands, e.g., facilitating cursor positioning and the counting of user-perceived characters (if not their definition). Still, my feeling is that the annex could make the rationale a bit clearer to non-cognoscenti.
[0] It seems relevant to me to note that ZWNJ has not always been readily available for typing/typesetting/rendering in Persian (it's much easier these days), and it remains the case that many people are unaware of its availability or have not learned to use it. So it is still common to see a full space entered where there ought to be a ZWNJ. This is not ideal, of course: the result is breaking a compound word into two words. (There are also people who hypercorrect, using ZWNJ before a suffix even in cases where allowing the letters to connect would produce no ambiguity.) The story of this character in Persian is extremely messy. There are words where you might find ZWNJ, or a full space, or connected letters. Somehow this exacerbates my confusion when it comes to the rule for segmenting graphemes.
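The effect of the current rule on a Persian word can be sketched as follows. This is a minimal illustration, not an implementation of UAX #29: the hypothetical helper `clusters` applies only the "do not break before ZWNJ" behaviour and treats every other code point as its own cluster, which is an assumption that holds for the sample word (written with escapes, a common spelling of "I want") but not for text with combining marks.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def clusters(text):
    """Split text into clusters, attaching ZWNJ to the preceding cluster.

    Assumption: every other code point starts a new cluster, which is
    only adequate for simple unmarked letters.
    """
    result = []
    for ch in text:
        if ch == ZWNJ and result:
            result[-1] += ch  # no boundary before ZWNJ
        else:
            result.append(ch)
    return result


# meem, farsi yeh, ZWNJ, khah, waw, alef, heh, meem
word = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"

print(len(word))            # 8 code points
print(len(clusters(word)))  # 7 clusters
```

The ZWNJ adds a code point but no cluster, so a cursor or character counter built on this rule sees seven units, one per letter.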
Date/Time: Fri Sep 20 15:07:12 CDT 2019
Name: Yichao 'Peak' Ji
Report Type: Public Review Issue
Opt Subject: UAX #29: Full width comma in WB11 and WB12
(Note: Re-open for PRI #396, as discussed on the Unicore mailing list.) I submitted this issue before in PRI #355, but today I found a Review Note about the exact same thing, so please allow me to elaborate. In English and other languages, we use commas in long numbers for better readability, and use a comma plus whitespace to separate clauses. But in Chinese, we use full width commas without any whitespace to separate clauses, and never use commas in numbers. Treating U+FF0C FULLWIDTH COMMA and U+FF1B FULLWIDTH SEMICOLON as MidNum prevents implementations from breaking between a Chinese clause that ends with digits and a following clause that starts with digits. For example, in “今晚19:30，2014大奖赛即将开幕” (The 2014 championship will start at 19:30), a weird “30，2014” token will be generated. This behavior affects a wide range of Chinese news articles: as I mentioned in the earlier report, we found more than 5k invalid tokens like these in a corpus of 2.8 million articles. AFAIK, Chinese is the only language using the full width variant. Japanese uses U+3001 (、) and Korean uses the ASCII comma. So I’d say it’s safe to remove U+FF0C and U+FF1B from MidNum.
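The tokenization effect described in this report can be reproduced with a small sketch. It is not a word-break implementation: the hypothetical helper `numeric_tokens` models only the idea behind WB11 and WB12 (digits stay joined across a MidNum character that sits between digits), and the two sets below are small assumed subsets of MidNum, with and without U+FF0C and U+FF1B.

```python
import re


def numeric_tokens(text, mid_num):
    """Find digit runs, joined across any mid_num character between digits.

    Simplified model of WB11/WB12; only ASCII digits are treated as
    Numeric here.
    """
    char_class = "".join(re.escape(c) for c in mid_num)
    pattern = re.compile(r"[0-9]+(?:[" + char_class + r"][0-9]+)*")
    return pattern.findall(text)


# The example sentence from the report, with U+FF0C written as an escape.
TEXT = "今晚19:30\uFF0C2014大奖赛即将开幕"

# Assumed subsets of MidNum, for illustration only.
WITH_FULLWIDTH = {",", ";", "\uFF0C", "\uFF1B"}
WITHOUT_FULLWIDTH = {",", ";"}

print(numeric_tokens(TEXT, WITH_FULLWIDTH))     # ['19', '30，2014']
print(numeric_tokens(TEXT, WITHOUT_FULLWIDTH))  # ['19', '30', '2014']
```

With the fullwidth comma in the set, the time and the year fuse into the single token the report calls out; without it, they separate at the clause boundary.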