This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Fri Aug 26 18:42:20 CDT 2011
Contact: umavs@ca.ibm.com
Name: V.S. Umamaheswaran
Report Type: Public Review Issue
Opt Subject: PRI 193 ... UAX 29
Forwarding a comment from our Thai expert: Nattapong Sirilappanich (email id: natta@th.ibm.com) ...
My inputs are:
1. Keep all grapheme properties for Thai characters. This is my input for the section 3 question: [Editorial Note: The text has been modified to not favor extended grapheme clusters, given that legacy grapheme clusters are preferred for Thai, Lao, and Tai Viet characters. An alternative approach would be to remove the characters (U+0E30, U+0E32, U+0E33, U+0E40-U+0E45, U+0EB0, U+0EB2, U+0EB3, U+0EC0-U+0EC4, U+AAB5, U+AAB6, U+AAB9, U+AABB, U+AABC) from Extend and Spacing_Mark. Feedback on this issue would be appreciated.]
2. To prevent confusion, let's make it clear: the Thai language uses legacy grapheme clusters for cursor movement and editing behavior, and extended grapheme clusters for rendering purposes. So this line in section 3 should be updated: "However, for Southeast Asian scripts such as Thai and Lao, the legacy grapheme clusters are generally preferred"
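As a reading aid, the code point list in the editorial note quoted above can be expanded and grouped by script. The sketch below is illustrative only: the names `expand` and `script_of` are hypothetical, and scripts are approximated by Unicode block ranges (Thai U+0E00-U+0E7F, Lao U+0E80-U+0EFF, Tai Viet U+AA80-U+AADF) rather than by the Script property.

```python
from collections import Counter

# Code point list copied from the editorial note quoted above.
NOTE_LIST = ("U+0E30, U+0E32, U+0E33, U+0E40-U+0E45, U+0EB0, U+0EB2, U+0EB3, "
             "U+0EC0-U+0EC4, U+AAB5, U+AAB6, U+AAB9, U+AABB, U+AABC")

# Assumption: block ranges stand in for the Script property.
BLOCKS = {
    "Thai": range(0x0E00, 0x0E80),
    "Lao": range(0x0E80, 0x0F00),
    "Tai Viet": range(0xAA80, 0xAAE0),
}

def expand(spec):
    """Expand 'U+XXXX' items and 'U+XXXX-U+YYYY' ranges into code points."""
    code_points = []
    for item in spec.split(","):
        bounds = [int(part.strip()[2:], 16) for part in item.split("-")]
        code_points.extend(range(bounds[0], bounds[-1] + 1))
    return code_points

def script_of(cp):
    """Return the name of the block containing cp, or 'Other'."""
    for name, block in BLOCKS.items():
        if cp in block:
            return name
    return "Other"

# Number of listed characters per script.
counts = Counter(script_of(cp) for cp in expand(NOTE_LIST))
print(dict(counts))
```

The per-script counts make it easy to see how many characters in each script would be affected if the alternative approach in the editorial note were adopted.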
Date/Time: Mon Oct 24 08:22:15 CDT 2011
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI #193: Proposed Update UAX #29: Unicode Text Segmentation
A minor editorial comment on section 8, Hangul Syllable Boundary Determination (http://www.unicode.org/reports/tr29/tr29-18.html#Hangul_Syllable_Boundary_Determination): under the subtitle “Transforming into Standard Korean Syllables”, in the line "[^L] V → [^L] Lf V", the f is neither subscripted nor italicized as it should be.
Date/Time: Thu Nov 10 02:33:11 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: word breaks with hiragana and voiced marks
I'd like to renew some old feedback I gave about word breaks with hiragana and voiced marks in a UAX#29 PRI in... 2007. Nobody seems to have replied to that feedback, and some characters that are used in both hiragana and katakana are visibly still not treated as consistently as they should be (for example, with differences between normal and halfwidth variants). See http://unicode.org/mail-arch/unicode-ml/y2007-m08/0091.html
Quoting the message:
This UAX treats KATAKANA specially, to avoid breaking between two Katakana letters, but still breaks between hiragana. However, this is probably not correct in every case, notably in the sequence of a Hiragana letter and a voiced/semi-voiced mark:
U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
and possibly other characters currently listed in the Katakana value in table 3:
U+3031 (〱) VERTICAL KANA REPEAT MARK
U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF
U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF
U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
Do word breaks really occur between Hiragana letters and these marks coded after them (note that Hiragana letters are excluded from "ALetter" in table 3)? If not, then:
(1) the characters listed above should be given a separate value (say "ExtendKana") and removed from Katakana in table 3.
(2) a new value "Hiragana" should be created for Hiragana letters in table 3, like this:
Katakana script="KATAKANA" (rewritten first row in table 3)
Hiragana script="HIRAGANA" (new inserted row in table 3)
ExtendKana (the list of characters above) (new row in table 3)
(3) the existing rule WB13 (Katakana × Katakana) should be rewritten equivalently as:
WB13. (Katakana | ExtendKana) × (Katakana | ExtendKana)
(4) the following subrules WB13a and WB13b should be rewritten equivalently as:
WB13a. (ALetter | Numeric | Katakana | ExtendKana | ExtendNumLet) × ExtendNumLet
WB13b. ExtendNumLet × (ALetter | Numeric | Katakana | ExtendKana)
(5) another subrule should be added:
WB13c. (Hiragana | ExtendKana) × ExtendKana
No other change is needed, because a word break will still occur either between two Hiragana letters, or after an ExtendKana and before a Hiragana letter, in the next rule:
WB14. Any ÷ Any
Or am I missing something?
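The effect of the proposed WB13 and WB13c can be illustrated with a small sketch. It is not an implementation of UAX#29: only the two kana rules are modeled, WB13a/WB13b and all other rules are ignored (every other pair falls through to WB14), and the function names and the fixed code point ranges used for Hiragana and Katakana letters are assumptions made for the example, not real property data.

```python
# Marks listed in the comment above (the proposed "ExtendKana" value).
EXTEND_KANA = {0x309B, 0x309C, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035,
               0x30A0, 0x30FC, 0xFF70, 0xFF9E, 0xFF9F}

def kana_class(ch):
    """Simplified class of one character (assumed ranges, not UCD data)."""
    cp = ord(ch)
    if cp in EXTEND_KANA:
        return "ExtendKana"
    if 0x3041 <= cp <= 0x3096:
        return "Hiragana"
    if 0x30A1 <= cp <= 0x30FA or 0xFF66 <= cp <= 0xFF9D:
        return "Katakana"
    return "Other"

def no_break(left, right):
    """True if the proposed WB13 or WB13c forbids a break between classes."""
    kata = ("Katakana", "ExtendKana")
    # Proposed WB13: (Katakana | ExtendKana) x (Katakana | ExtendKana)
    if left in kata and right in kata:
        return True
    # Proposed WB13c: (Hiragana | ExtendKana) x ExtendKana
    if left in ("Hiragana", "ExtendKana") and right == "ExtendKana":
        return True
    # Everything else falls through to WB14: Any / Any (break).
    return False

def split_kana(text):
    """Split text at every position where neither rule prevents a break."""
    segments = []
    for ch in text:
        if segments and no_break(kana_class(segments[-1][-1]), kana_class(ch)):
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments

print(split_kana("ラーメン"))
print(split_kana("らーめん"))
```

Under these assumptions the katakana word stays in one piece, while in the hiragana word only the prolonged sound mark stays attached to the letter before it, which is the behavior the proposal describes.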
Date/Time: Thu Nov 10 03:26:08 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: property values used in [Charts29]
UAX#29 makes an informative reference to a sample table-based implementation shown in [Charts29]: http://www.unicode.org/Public/6.0.0/ucd/auxiliary/WordBreakTest.html
However, this HTML page still contains row and column headers with old non-standard property values that do not match the Word_Break property values enumerated and described in UAX#29 and assigned to characters in the normative data file [Data29]. Why are "_FE" suffixes added to a couple of property values in the chart table and in all tooltips that appear when hovering over characters in the sample strings?
It seems that this [Charts29] page has not been updated for a long time. This is also visible in the numeric mapping of rule names (which is also used in the test data file), which:
- forgets to assign the number 3.2 to the rule named WB3b (insert a word break before an explicit line break, "÷ (Newline | CR | LF)");
- still incorrectly defines the rule named WB4 and numbered 4.0 as the outdated contextual rule "[^ Newline CR LF ] × [Format Extend]", instead of the current rewriting rule "X (Extend | Format)* → X";
- gives the wrong definition for the last contextual rule, named WB14 and numbered 999.0, displaying "÷ Any" instead of "Any ÷ Any".
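The three discrepancies reported above can be summarized as a comparison between the rule table as the comment says it should read and the table as the comment describes it in the chart. The sketch below is only a restatement of the comment: the dictionary and function names are hypothetical, and the "reported" entries are transcribed from the comment, not read from the chart itself.

```python
# Rule number -> (rule name, rule text), as the comment says they should read.
EXPECTED = {
    "3.2": ("WB3b", "÷ (Newline | CR | LF)"),
    "4.0": ("WB4", "X (Extend | Format)* → X"),
    "999.0": ("WB14", "Any ÷ Any"),
}

# Rule number -> rule text, as the comment reports the chart showing them.
# Number 3.2 is absent because the comment says it is not assigned.
REPORTED = {
    "4.0": "[^ Newline CR LF ] × [Format Extend]",
    "999.0": "÷ Any",
}

def discrepancies(expected, reported):
    """List (rule name, issue) pairs where the reported table deviates."""
    issues = []
    for number, (name, text) in expected.items():
        if number not in reported:
            issues.append((name, "missing"))
        elif reported[number] != text:
            issues.append((name, "differs"))
    return issues

for name, issue in discrepancies(EXPECTED, REPORTED):
    print(name, issue)
```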