This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Tue Nov 20 08:37:04 CST 2012
Contact: ritt.ks@gmail.com
Name: Konstantin
Report Type: Error Report
Opt Subject: An issue with breaking sentences and words separated with dot-alike characters
As of Unicode 5.1, the MidNumLet Word_Break property value (apostrophe-alike and dot-alike characters) causes sequences <(ALetter)+ MidNumLet (ALetter)+> to be treated as a single word. While this seems to be an improvement in handling words with apostrophes like "can't" or "aujourd'hui", it also causes a regression in handling words separated by dot-alike characters: for example, domain names (see http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/63311), missing space(s) in the user's text ("hi.there"), or navigating through code ("struct.member"; yes, I know this is out of scope of the default breaking algorithm, but still), and so on. Worst of all, the default algorithm now specifies a sentence break in the middle of a word. For example, in "mr.Hamster" there are two sentences due to rule SB8 (http://www.unicode.org/reports/tr29/#SB8) but still a single word due to rules WB6-WB7 (http://www.unicode.org/reports/tr29/#WB6).
A simple possible solution is to map some or all of those dot-alike characters (FULL STOP, ONE DOT LEADER, SMALL FULL STOP, and FULLWIDTH FULL STOP) back to the MidNum Word_Break property value (this is how CLDR tailors the default algorithm for en_US_POSIX).
Another possible solution I see is to split ALetter into Upper, Lower, and OLetter, to map those dot-alike characters to some new Term Word_Break property value (like the corresponding Sentence_Break property values), and to extend the word breaking rules so that no breaks are allowed within sequences like <Upper x Term x Upper (Term)?> and <Lower x Term x Lower (Term)?> surrounded by <(!(Upper | Lower | OLetter))*>. Rules WB6-WB7 could then probably be replaced with ones that specify a word break for the sequence <Lower (MidLetter | MidNumLet) Upper> and maybe <OLetter (MidLetter | MidNumLet) Upper>.
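The effect described in this report, and the first proposed fix, can be illustrated with a minimal sketch. It is not the UAX #29 algorithm: the class assignment below is a hypothetical stand-in for the real Word_Break property data (letters via `str.isalpha`, digits via `str.isdigit`, apostrophe as MidNumLet), only rules WB5 through WB12 are modeled, and the names `word_break_class` and `split_words` are invented for illustration.

```python
# Simplified sketch of rules WB5-WB12. The class table is an illustrative
# assumption, not the real Word_Break property data.

def word_break_class(ch, dot_class):
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    if ch == "'":
        return "MidNumLet"
    if ch == ".":
        # "MidNumLet" models the default; "MidNum" models the tailoring
        # suggested in the report.
        return dot_class
    return "Other"


def split_words(text, dot_class="MidNumLet"):
    classes = [word_break_class(c, dot_class) for c in text]
    n = len(text)
    pieces, start = [], 0
    for i in range(1, n):
        prev2 = classes[i - 2] if i >= 2 else None
        prev, cur = classes[i - 1], classes[i]
        nxt = classes[i + 1] if i + 1 < n else None
        mid_word = ("MidLetter", "MidNumLet")
        mid_num = ("MidNum", "MidNumLet")
        no_break = (
            (prev == "ALetter" and cur == "ALetter")                              # WB5
            or (prev == "ALetter" and cur in mid_word and nxt == "ALetter")       # WB6
            or (prev2 == "ALetter" and prev in mid_word and cur == "ALetter")     # WB7
            or (prev == "Numeric" and cur == "Numeric")                           # WB8
            or (prev == "ALetter" and cur == "Numeric")                           # WB9
            or (prev == "Numeric" and cur == "ALetter")                           # WB10
            or (prev2 == "Numeric" and prev in mid_num and cur == "Numeric")      # WB11
            or (prev == "Numeric" and cur in mid_num and nxt == "Numeric")        # WB12
        )
        if not no_break:
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces


print(split_words("hi.there"))                     # dot as MidNumLet
print(split_words("hi.there", dot_class="MidNum")) # dot tailored to MidNum
```

Under these assumptions, the MidNum mapping splits "hi.there" and "mr.Hamster" at the dot while leaving "can't" and "3.14" intact, since the apostrophe stays MidNumLet and WB11-WB12 accept MidNum between digits.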
Added from mail archive per request from author:
From: Konstantin Ritt <ritt.ks_at_gmail.com>
It seems like there is an inconsistency between what the default grapheme clusters specification says and what the test results are expected to be.
UAX#29 says:
> Another key feature (of default Unicode grapheme clusters) is that <b>default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default line, word, and sentence boundaries</b>.
This is also mentioned in UAX#14:
> Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, “Unicode Text Segmentation” [UAX29], as a first stage. <b>Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters</b>; therefore such a tailoring would be expected to produce results that are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring.
However, <U+0020 (SP), U+0308 (CM)> in the line breaking algorithm is handled by rules LB10+LB18 and produces a break opportunity, while GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2, "Legacy Support for Space Character as Base for Combining Marks", in UAX#29 clarifies why a line break occurs there, but the fact remains that the statements above are false and introduce some ambiguity.
If the space character is no longer a grapheme base, the grapheme cluster breaking rules need to be updated.
Kind regards,
Konstantin
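The inconsistency reported above can be reproduced with a small sketch that only looks at a pair of adjacent characters where the second is a combining mark. It rests on loud assumptions: general categories Mn/Me are used as an approximation of both Grapheme_Cluster_Break=Extend and Line_Break=CM, the preceding character is assumed to be either SPACE or an ordinary base character, and the function names are hypothetical.

```python
import unicodedata


def is_combining_mark(ch):
    # Assumption: Mn/Me approximates Grapheme_Cluster_Break=Extend
    # and Line_Break=CM for the purposes of this sketch.
    return unicodedata.category(ch) in ("Mn", "Me")


def grapheme_break_allowed(prev, cur):
    """GB9: do not break before Extend, whatever precedes it."""
    if is_combining_mark(cur):
        return False
    return None  # other pairs are out of scope for this sketch


def line_break_allowed(prev, cur):
    """LB9/LB10/LB18 for a combining mark after SP or an ordinary base."""
    if not is_combining_mark(cur):
        return None  # out of scope for this sketch
    if prev == " ":
        # LB9 does not attach CM to SP, so LB10 treats the mark as AL,
        # and LB18 then allows a break after the space.
        return True
    # LB9: the mark is absorbed into the preceding base character.
    return False


pair = (" ", "\u0308")  # <U+0020 SPACE, U+0308 COMBINING DIAERESIS>
print(grapheme_break_allowed(*pair), line_break_allowed(*pair))
```

For this pair the sketch reports no grapheme cluster boundary but a line break opportunity, which is the mismatch with the "atomic units" statement that the message describes.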
Date/Time: Mon Mar 25 16:47:56 CDT 2013
Contact: wellnhofer@aevum.de
Name: Nick Wellnhofer
Report Type: Public Review Issue
Opt Subject: New word boundary rules in UAX #29, Unicode 6.3.0 (draft 2)
The new rule WB7c in UAX #29, Unicode 6.3.0 (draft 2) can be simplified to read:
Hebrew_Letter Double_Quote × Hebrew_Letter
The single quote case is already handled in rule WB7.
Date/Time: Wed May 1 17:42:13 CDT 2013
Contact: andy.heninger@gmail.com
Name: Andy Heninger
Report Type: Public Review Issue
Opt Subject: UAX 29 proposed word break rules
In the UAX 29 proposed update (draft 2), there is a redundancy in the word break rules. From the draft we have:
WB7. (ALetter | Hebrew_Letter) (MidLetter | MidNumLet | Single_Quote) × (ALetter | Hebrew_Letter)
WB7c. Hebrew_Letter (Single_Quote | Double_Quote) × Hebrew_Letter
The "Single_Quote" term in WB7c is redundant: the same sequence, Hebrew_Letter Single_Quote × Hebrew_Letter, is also covered by WB7. So WB7c could be simplified to:
Hebrew_Letter Double_Quote × Hebrew_Letter
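The redundancy can be checked mechanically with a short sketch that enumerates every triple of the property values named in the two rules. It assumes rule ordering does not matter here, which seems reasonable because both rules are no-break (×) rules over the same three-position context; the function names are hypothetical.

```python
from itertools import product

LETTERS = {"ALetter", "Hebrew_Letter"}
CLASSES = ["ALetter", "Hebrew_Letter", "MidLetter", "MidNumLet",
           "Single_Quote", "Double_Quote"]


def wb7(a, b, c):
    # WB7 as quoted from draft 2.
    return (a in LETTERS
            and b in {"MidLetter", "MidNumLet", "Single_Quote"}
            and c in LETTERS)


def wb7c_draft(a, b, c):
    # WB7c as quoted from draft 2.
    return (a == "Hebrew_Letter"
            and b in {"Single_Quote", "Double_Quote"}
            and c == "Hebrew_Letter")


def wb7c_simplified(a, b, c):
    # WB7c with the Single_Quote alternative removed.
    return (a == "Hebrew_Letter"
            and b == "Double_Quote"
            and c == "Hebrew_Letter")


triples = list(product(CLASSES, repeat=3))
draft = {t for t in triples if wb7(*t) or wb7c_draft(*t)}
simplified = {t for t in triples if wb7(*t) or wb7c_simplified(*t)}
only_wb7c = {t for t in triples if wb7c_draft(*t) and not wb7(*t)}

print(draft == simplified)
print(only_wb7c)
```

The only context that WB7c contributes beyond WB7 is Hebrew_Letter Double_Quote Hebrew_Letter, so the draft and simplified rule sets protect the same sequences.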
Date/Time: Fri May 3 05:12:20 CDT 2013
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: objection to changes based on L2/12-282
I object to the change done according to http://www.unicode.org/L2/L2012/12282-colon.html As I noted several months ago, in http://unicode.org/cldr/trac/ticket/3987, when the same issue was raised in CLDR:
First: "... because Swedish uses it in the middle of a word"; well, it is used in a few particular abbreviations, when the middle of the word is abbreviated away. There are very few such abbreviations in general use: "c:a" (for "cirka"), "k:a" (for "kyrka", church), "s:t" (for "sankt"), "g:a" (for "gamla", old). (B.t.w., Danish and Norwegian use the abbreviation "ca." for "cirka".)
Colon is also used when adding inflections to abbreviated names, e.g. "tv:n" (this seems to be used for Finnish as well), "USA:s" (this seems to be used, at least sometimes, also in Norwegian and (Northern?) Sami), "UFO:t", "UFO:na", or to numbers, e.g. "3:e", "3:ans". Colon is also used between digits, in e.g. currency values (like "12:50") and time values (as it is for many languages), and some other cases. So even if this use may be more prominent for Swedish, I would not limit it to just Swedish; and indeed the limitation to "letter colon letter" is too limiting.
[Some other examples, Swedish: "Björn J:son Lindh" (abbreviation), "AIK:are" (inflection), "Gustav III:s" (inflection). Finnish: Examples taken from http://fi.wikipedia.org/wiki/Kaksoispiste: "EU:n" (inflection), "v:sta" (abbreviation), "20:nnelle (inflection of number)", "STTK:lainen" (inflection), "H:ki" (abbreviation), "t:mi" (abbreviation).]
And then: I would suggest updating the following rules in UAX 29:
WB6. ALetter × (MidLetter | MidNumLet) ALetter
WB7. ALetter (MidLetter | MidNumLet) × ALetter
to
WB6. (Numeric | ALetter) × (MidLetter | MidNumLet) ALetter
WB7. (Numeric | ALetter) (MidLetter | MidNumLet) × ALetter
in order to handle number inflections (like 3:e (for tredje), 3:ans (for treans)).
And change (first one editorial):
U+003A ( : ) COLON (used in Swedish)
to
U+003A ( : ) COLON
and move the colon-like characters from MidLetter to MidNumLet (to handle numerals like "3:50" as one "word").
UAX 29 text changes (editorial): Change:
Certain cases such as colons in words (c:a) are included in the default even though they may be specific to relatively small user communities (Swedish) because they do not occur otherwise, in normal text, and so do not cause a problem for other languages.
to
Certain cases such as colons in abbreviated words (e.g., "c:a") and inflections (e.g., "3:ans", "tv:n") are included in the default even though they may be specific to relatively small user communities (Swedish and other languages) because they do not occur otherwise, in normal text, and so do not cause a problem for languages that do not use this convention.
and
It includes characters that may not be appropriate for identifiers, and some that would not be parts of words. It also permits some characters that may be part of words in a broad sense, but not part of names, such as in "c:a" in Swedish, or hyphenation points used in dictionary words.
to
It includes characters that may not be appropriate for identifiers, and some that would not be parts of words. It also permits some characters that may be part of words in a broad sense, but not part of names, such as in some abbreviations like "c:a" and some inflections like "USA:s" and "3:e" in Swedish, or hyphenation points used in dictionary words.
[Consider adding some of the Finnish examples too.]
======================
I would also like to point out that colon is also used as a fallback for modifier letter triangular colon. And this may be used in phonetic notation for many languages.
======================
Jonathan Kew pointed out in an email recently: It has also been used in other orthographies to represent tone; for an example, see "Table 7: Old Tone Orthography for Etung (Cameroon)" in [1].
I'm sure that wouldn't be the only example. ISTM that a "mid-word" colon should be treated similarly to a hyphen or apostrophe in the same position.
=======================
Regarding the suggestion to tailor the in-word behaviour of colon for certain languages (Swedish, Finnish, ...), in particular in CLDR: Firstly, it does not help when ":" is used as a fallback for triangular colon (a modifier letter). Secondly, most text is not language tagged. Even though a colon inside words may be "unexpected" in some languages, it appears that allowing it inside a word would only be noticeable for mistypings, e.g. "Participants:George, ..." (space after colon is missing). It could possibly be an issue in languages where space is not used between words and "western" punctuation is used. Maybe Thai. On the other hand, those languages need special handling (like dictionary lookup) for finding word boundaries anyway. So I wonder which languages are actually hurt by allowing ":" inside words. None have been exemplified in the 3987 CLDR ticket, nor in L2/12-282.
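The rule change proposed in this feedback can be tried out with a minimal sketch. It is only an approximation: character classes come from `str.isalpha` and `str.isdigit` rather than the real Word_Break data, only WB5 through WB12 are modeled, and the `colon_class` and `left` parameters are hypothetical knobs for switching between the existing left context (ALetter only) and the proposed one (Numeric | ALetter).

```python
MID_WORD = ("MidLetter", "MidNumLet")
MID_NUM = ("MidNum", "MidNumLet")


def classify(ch, colon_class):
    # Illustrative class table, not the real Word_Break property data.
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    if ch == ":":
        return colon_class
    return "Other"


def split_words(text, colon_class="MidNumLet", left=("Numeric", "ALetter")):
    cls = [classify(c, colon_class) for c in text]
    words, start = [], 0
    for i in range(1, len(text)):
        p2 = cls[i - 2] if i >= 2 else None
        p, c = cls[i - 1], cls[i]
        n = cls[i + 1] if i + 1 < len(text) else None
        keep = (
            # WB5, WB8, WB9, WB10: runs of letters and digits
            (p in ("ALetter", "Numeric") and c in ("ALetter", "Numeric"))
            # WB6 with a configurable left context
            or (p in left and c in MID_WORD and n == "ALetter")
            # WB7 with a configurable left context
            or (p2 in left and p in MID_WORD and c == "ALetter")
            # WB11
            or (p2 == "Numeric" and p in MID_NUM and c == "Numeric")
            # WB12
            or (p == "Numeric" and c in MID_NUM and n == "Numeric")
        )
        if not keep:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


for sample in ("3:e", "USA:s", "12:50", "Participants:George"):
    print(sample, split_words(sample), split_words(sample, left=("ALetter",)))
```

With the colon as MidNumLet and the widened left context, "3:e", "USA:s", and "12:50" each stay whole; the mistyping case "Participants:George" also stays whole, which is the trade-off the feedback acknowledges.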
Date/Time: Fri May 3 15:48:28 CDT 2013
Contact: asmus@unicode.org
Name: asmus
Report Type: Public Review Issue
Opt Subject: objection to changes based on L2/12-282
I second the objections brought by Kent Karlsson under this subject header. I would like to further point out that colon is used in legal personal names in Sweden (and possibly in entity names in a wider context). The problem with names is that they must be supported in databases and other systems where data do not form single-language "documents" and where language tagging and language-sensitive processing are not performed on a per-field basis. In the European context, databases with names from multiple countries are a common use case. Database reports, including mail merge, would easily insert a "Swedish" name into a document that is otherwise not Swedish. I feel that having the default algorithm fail on lists of names or documents that contain names would be suboptimal, and that pushing this off to tailoring on the basis of language is a non-starter.
Ed Note: The linebreak properties of U+3035 have been changed in 6.3
Date/Time: Fri May 10 21:36:28 CDT 2013
Contact: fantasai@inkedblade.net
Name: fantasai
Report Type: Error Report
Opt Subject: Split kana repeat mark grapheme cluster
Hi! The CSSWG has received an issue report about disallowing letter-spacing/justification between U+3033 and U+3034/U+3035. I believe this is actually an error in the Unicode spec--the pair should form a single grapheme cluster. See http://lists.w3.org/Archives/Public/www-style/2013Jan/0071.html and http://lists.w3.org/Archives/Public/www-style/2013May/0282.html
See also feedback from frommail@badral.net regarding NARROW_NO_BREAK_SPACE (202F) on the 6.3 beta PRI feedback page.