This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Wed Dec 10 10:11:49 CST 2014
Name: Itamar Syn-Hershko
Report Type: Error Report
Opt Subject: Swedish "connect" in UAX#29
In Swedish, "connect" is a way of abbreviating written words. So the "C:a" mentioned in UAX#29 is in fact "cirka", which means "approximately". I guess it can be thought of as similar to English acronyms, but after talking to several Swedish-speaking people, it is apparently far less commonly used in Swedish (my source says "very very seldomly used; old style and not used in modern Swedish at all"). Not only is it hardly used, it is apparently only legal in 3-letter combinations (c:a but not c:ka). Its effects are also quite severe at the moment: two words with a colon between them and no space will be output as one token, even though this is certainly not applicable to Swedish, since each word has more than 2 characters. So I move to fix UAX#29 by either:
1. No longer considering a colon (:) as a MidLetter, like the change made between 6.2 and 6.3. The reasoning is that it is not used in modern Swedish, and when it is used, it is only in very specific cases. If required, exceptions could be added on a case-by-case basis by lexers and other implementations.
2. If this rule is kept, limiting it to real-world usage, so that only 3-letter words are legal with a colon as the middle character and no other usages are. So c:a is a legal Unicode word but word:word isn't (but note that word:word currently _is_ a legal word by UAX#29).
This request grew out of real-world scenarios where lexers implementing this standard produced problematic tokens because of this very rarely used rule. Thank you for your consideration.
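The behavior described in this report can be illustrated with a minimal sketch. It is not the UAX #29 algorithm: it assumes letters are whatever `str.isalpha()` accepts, treats only the characters passed in `mid_letters` as MidLetter, and ignores every other word break rule. The function name `split_words` and its parameter are hypothetical, chosen for illustration only.

```python
def split_words(text, mid_letters=frozenset(":")):
    """Return letter tokens, joining letters across a single MidLetter.

    Loosely modeled on WB5-WB7: no break between letters, and no break
    around a MidLetter that has a letter on both sides. Everything else
    in UAX #29 is deliberately left out of this sketch.
    """
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
        if ch.isalpha():
            current += ch
        elif ch in mid_letters and current and next_is_letter:
            # Letter x MidLetter x Letter: the token continues.
            current += ch
        else:
            # Any other character ends the current token.
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


# With colon as MidLetter, both the abbreviation and two unrelated
# words joined by a colon come out as single tokens.
print(split_words("c:a"))
print(split_words("word:word"))
# Option 1 from the report: colon is no longer a MidLetter.
print(split_words("word:word", mid_letters=frozenset()))
```

Under these assumptions, dropping the colon from `mid_letters` splits word:word into two tokens, but it also splits c:a, which is the trade-off the two proposed options weigh.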
Feedback above this line was considered at the February 2015 UTC meeting.
Date/Time: Sun Feb 8 12:13:28 CST 2015
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: UAX #29
First, a formatting issue, present in both the draft and released versions. In 6.2, in the passage beginning "An alternate expression that resolves to a single character is treated as a whole. For example:", the *not* shows up in my Firefox browser on the line below where it should be, so it looks like
→ (STerm (Extend | Format)* | ATerm (Extend | Format)*)
not
Second, it is not obvious to me that the rules "Ignore Format and Extend characters" imply Any × (Format | Extend), even if it is obvious to others. I think there should be an explicit rule to that effect.
Third, the statement in SB, "Ignore Format and Extend characters, except after sot, Sep, CR, or LF.", seems to me to be better than the corresponding statement in WB, "Ignore Format and Extend characters, except when they appear at the beginning of a region of text." I propose changing the WB wording.
Finally, rules WB6 and WB7 are sloppily worded. See <54C48C81.3080405@khwilliamson.com> and follow-on messages on the unicode.org mailing list. Philippe Verdy has proposed some changes to tighten that up. I haven't inspected those to see if I agree, but my claim is that the rules should be constructed so that once a situation matches a rule, later rules cannot override it. So, if nothing else changed, WB7a should come before the current WB6, as both have rules for the same combination and end up with different dispositions.
Date/Time: Thu Apr 30 15:34:08 CDT 2015
Name: Karl Williamson
Report Type: Error Report
Opt Subject: Sentence Break property doesn't follow TUS 7.0 recommendations
Section 5.8 of TUS 7.0 includes this text:
R2c In parsing, choose the safest interpretation.
For example, in recommendation R2c an implementer dealing with sentence break heuristics would reason in the following way that it is safer to interpret any NLF as LS:
• Suppose an NLF were interpreted as LS, when it was meant to be PS. Because most paragraphs are terminated with punctuation anyway, this would cause misidentification of sentence boundaries in only a few cases.
• Suppose an NLF were interpreted as PS, when it was meant to be LS. In this case, line breaks would cause sentence breaks, which would result in significant problems with the sentence break heuristics.
However, UAX #29 disregards this, treating the NLF as a PS. (It defines ParaSep to be (Sep | CR | LF), so rule SB4, "Break after paragraph separators" (ParaSep ÷), leads to a break after any NLF.) This leads in practice to exactly what TUS says it will: "line breaks would cause sentence breaks, which would result in significant problems with the sentence break heuristics."
There is no discussion in UAX #29 as to why the TUS recommendations aren't followed. I think it would be best if the SB property were revised to follow the TUS recommendations. If for whatever reason that is not possible, there should be significant discussion in UAX #29 as to why not.
Note that I posted this discrepancy on the unicode.org mailing list in February of 2015, <54E8D816.9010606@khwilliamson.com>. No response was made. Therefore, I'm formally submitting it, to force consideration.
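The difference between the two interpretations can be seen in a toy sketch. It is not the UAX #29 sentence break algorithm: it assumes a sentence ends after ".", "!" or "?" plus any following spaces and at most one newline, and the hypothetical flag `break_after_newline` switches on an unconditional break after every newline, in the spirit of SB4.

```python
import re


def split_sentences(text, break_after_newline):
    """Split text into sentences with a deliberately crude heuristic.

    A sentence ends after a terminator (., ! or ?) together with any
    spaces and at most one newline that follow it. If break_after_newline
    is true, every newline also ends a sentence.
    """
    pattern = r"[.!?] *\n?"
    if break_after_newline:
        pattern += r"|\n"
    pieces = []
    start = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[start:match.end()])
        start = match.end()
    if start < len(text):
        # Trailing text without a terminator is kept as a final piece.
        pieces.append(text[start:])
    return pieces


# Hard-wrapped text: the first newline is only a line break.
wrapped = "The rules are long\nand hard to read. They need work.\n"

# Newline always breaks: the first sentence is cut at the line break.
print(split_sentences(wrapped, break_after_newline=True))
# Newline treated as plain whitespace: two sentences, as intended.
print(split_sentences(wrapped, break_after_newline=False))
```

With these assumptions, the hard-wrapped input yields three pieces when every newline breaks and two when it does not, which matches the failure mode quoted from TUS.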