Public Review Issues

Accumulated Feedback on PRI #290

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Wed Dec 10 10:11:49 CST 2014
Name: Itamar Syn-Hershko
Report Type: Error Report
Opt Subject: Swedish "connect" in UAX#29

In Swedish, "connect" is a way to abbreviate written words. So the "C:a"
mentioned in UAX#29 is in fact "cirka", which means "approximately". I guess it
can be thought of as similar to English acronyms, but after talking to
several Swedish speakers, it is apparently far less commonly used in Swedish (my
source says "very very seldomly used; old style and not used in modern Swedish
at all").

Not only is it hardly used, it is apparently only legal in 3-letter
combinations (c:a but not c:ka). The effects it has are also quite severe
at the moment: 2 words with a colon in between and no space will be
output as one token, even though this is certainly not applicable to Swedish,
since each word has more than 2 characters.

So I move to fix UAX#29 by either:

1. Not considering a colon (:) as a MidLetter any more, like the change made
between 6.2 and 6.3. The reasoning is that it is not used in modern Swedish, and
when it is used it is only used in very specific cases. If required,
exceptions could be added on a case-by-case basis by lexers and other
implementations.

2. If keeping this rule, please make sure to limit it to real-world usage so
only 3-letter words are legal with a colon as the middle character and not other
usages. So c:a is a legal Unicode word but word:word isn't (but note word:word
currently _is_ a legal word by UAX#29).
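
The behaviour described above can be sketched as follows. This is a minimal illustration, not an implementation of UAX #29: it models only simplified forms of rules WB5, WB6 and WB7, treats any alphabetic character as a letter, and assumes the colon is the only MidLetter. The names split_words and split_words_restricted are hypothetical, and the second function is just one possible reading of option 2.

```python
import re

MID_LETTER = {":"}  # assumption: only the colon is modelled as MidLetter


def is_letter(ch):
    # Assumption: any alphabetic character stands in for ALetter.
    return ch.isalpha()


def split_words(text):
    """Split text using simplified WB5/WB6/WB7; break everywhere else."""
    tokens = []
    start = 0
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if is_letter(prev) and is_letter(cur):
            join = True  # WB5: letter x letter
        elif (is_letter(prev) and cur in MID_LETTER
              and i + 1 < len(text) and is_letter(text[i + 1])):
            join = True  # WB6: letter x MidLetter letter
        elif (prev in MID_LETTER and is_letter(cur)
              and i >= 2 and is_letter(text[i - 2])):
            join = True  # WB7: letter MidLetter x letter
        else:
            join = False
        if not join:
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    # Drop empty and whitespace-only pieces for readability.
    return [t for t in tokens if t and not t.isspace()]


def split_words_restricted(text):
    """Hypothetical reading of option 2: keep only letter:letter tokens joined."""
    result = []
    for token in split_words(text):
        if ":" in token and not (len(token) == 3 and token[1] == ":"):
            # Split around each colon, keeping the colon as its own token.
            result.extend(p for p in re.split(r"(:)", token) if p)
        else:
            result.append(token)
    return result
```

With the simplified rules, split_words keeps both "c:a" and "word:word" as single tokens, while split_words_restricted keeps "c:a" intact and splits "word:word" at the colon.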

This request grew out of real-world scenarios where lexers implementing
this standard produced problematic tokens because of this very rarely used
rule.

Thank you for your consideration.

Feedback above this line was considered at the February 2015 UTC meeting.

Date/Time: Sun Feb 8 12:13:28 CST 2015
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: UAX #29

First, a formatting issue, present in the draft and released versions.  In 6.2,
in the passage beginning "An alternate expression that resolves to a single
character is treated as a whole. For example:", the *not* shows up in my Firefox
browser on the line below where it should be, so it looks like

         → 	(STerm (Extend | Format)* | ATerm (Extend | Format)*)
   not

Second, it is not obvious to me that the rules "Ignore Format and Extend
characters" imply

   Any  ×  (Format | Extend)

even if it is obvious to others.  I think there should be an explicit 
rule to that effect.
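
A minimal sketch of what such an explicit rule would mean in practice: no break is allowed immediately before a Format or Extend character. The property test below is only an approximation based on general categories (Mn, Me, Cf); the real Extend and Format classes come from the UAX #29 property data, and the function names are hypothetical.

```python
import unicodedata


def is_extend_or_format(ch):
    # Approximation only: general categories Mn/Me stand in for Extend,
    # and Cf stands in for Format.
    return unicodedata.category(ch) in {"Mn", "Me", "Cf"}


def candidate_breaks(text):
    """Offsets not ruled out by 'Any x (Format | Extend)'.

    Other rules may still forbid a break at these offsets; this only
    removes the positions directly before a Format or Extend character.
    """
    return [i for i in range(1, len(text)) if not is_extend_or_format(text[i])]
```

For example, in "e" followed by U+0301 COMBINING ACUTE ACCENT and then "a", offset 1 (before the combining mark) is excluded and only offset 2 remains a candidate.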

Third, the statement in SB "Ignore Format and Extend characters, except after
sot, Sep, CR, or LF. " seems to me to be better than the corresponding
statement in WB "Ignore Format and Extend characters, except when they appear
at the beginning of a region of text."  I propose changing the WB wording.

Finally, the WB rules WB6 and WB7 are sloppily worded.  See
<54C48C81.3080405@khwilliamson.com> and follow-on messages to the unicode.org
mailing list.  Philippe Verdy has proposed some changes to tighten that up.  I
haven't inspected those to see if I agree, but my claim is that the rules
should be constructed so that once a situation matches a rule, later rules
shouldn't overrule that.  So, if nothing else changed, WB7a should come before
the current WB6, as both have rules for the same combination, and end up with
different dispositions.

Date/Time: Thu Apr 30 15:34:08 CDT 2015
Name: Karl Williamson
Report Type: Error Report
Opt Subject: Sentence Break property doesn't follow TUS 7.0 recommendations

5.8 of TUS 7.0 includes this text:

R2c In parsing, choose the safest interpretation.
For example, in recommendation R2c an implementer dealing with sentence break heuristics
would reason in the following way that it is safer to interpret any NLF as LS:
• Suppose an NLF were interpreted as LS, when it was meant to be PS. Because
most paragraphs are terminated with punctuation anyway, this would cause
misidentification of sentence boundaries in only a few cases.
• Suppose an NLF were interpreted as PS, when it was meant to be LS. In this
case, line breaks would cause sentence breaks, which would result in significant
problems with the sentence break heuristics.

However, UAX #29 disregards this, treating the NLF as a PS.
(It defines ParaSep to be (Sep | CR | LF), so rule SB4,
"Break after paragraph separators: ParaSep ÷",
leads to a break after any NLF.)

This leads in practice to exactly what TUS says it will: "line breaks would cause sentence breaks,
which would result in significant problems with the sentence break heuristics."
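
The effect can be sketched with a few lines of Python. This is a minimal illustration that models only SB3 (no break between CR and LF) and SB4 (break after ParaSep), assuming Sep covers U+0085, U+2028 and U+2029; it ignores every other sentence break rule, and the function name split_after_parasep is hypothetical.

```python
SEP = {"\u0085", "\u2028", "\u2029"}  # assumed members of Sep
PARA_SEP = SEP | {"\r", "\n"}         # ParaSep = (Sep | CR | LF)


def split_after_parasep(text):
    """Break after every ParaSep (SB4), but never between CR and LF (SB3)."""
    pieces = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in PARA_SEP:
            continue
        # SB3: keep CR LF together; the break comes after the LF instead.
        if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
            continue
        pieces.append(text[start:i + 1])
        start = i + 1
    if start < len(text):
        pieces.append(text[start:])
    return pieces
```

Applied to a hard-wrapped sentence such as "This sentence is\nhard-wrapped. Next one.", the sketch breaks right after the line feed, in the middle of the first sentence.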

There is no discussion in UAX 29 as to why the TUS recommendations aren't followed.

I think it would be best if the SB property were revised to follow the TUS recommendations. If for whatever
reason that is not possible, there should be significant discussion in UAX 29 as to why not.

Note that I posted this discrepancy on the unicode.org mailing list in February
of 2015, <54E8D816.9010606@khwilliamson.com>. No response was made.
Therefore, I'm formally submitting it, to force consideration.