A question about the default grapheme cluster boundaries with U+0020 as the grapheme base

From: Konstantin Ritt <ritt.ks_at_gmail.com>
Date: Sat, 2 Jun 2012 07:22:01 +0300

It seems like there is an inconsistency between what the default
grapheme clusters specification says and what the test results are
expected to be:

UAX#29 says:
> Another key feature (of default Unicode grapheme clusters) is that <b>default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default line, word, and sentence boundaries</b>.
This is also mentioned in UAX#14:
> Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, “Unicode Text Segmentation” [UAX29], as a first stage. <b>Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters</b>; therefore such a tailoring would be expected to produce results that are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring.

However, <U+0020 (SP), U+0308 (CM)> in the line breaking algorithm is
handled by rules LB10+LB18 and produces a break opportunity, while
GB9 prohibits a break between <U+0020 (Other), U+0308 (Extend)>.
Section 9.2 "Legacy Support for Space Character as Base for Combining
Marks" in UAX#14 explains why a line break occurs there, but the fact
remains that the statements quoted above are false as written and
introduce some ambiguity.
If the space character is no longer a grapheme base, the grapheme
cluster breaking rules need to be updated.
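
To make the mismatch concrete, here is a minimal Python sketch of just
the rules named above. It is not a conforming implementation of either
annex: the helper names (is_mark, grapheme_break_allowed,
line_break_allowed) are made up for illustration, and approximating
both Grapheme_Cluster_Break=Extend and Line_Break=CM by the general
categories Mn/Me is an assumption that happens to hold for U+0308.

```python
import unicodedata


def is_mark(ch):
    # Assumption: approximate GCB=Extend and LB=CM by general category
    # Mn/Me. The real properties come from the UCD data files.
    return unicodedata.category(ch) in ("Mn", "Me")


def grapheme_break_allowed(prev, cur):
    # GB9: do not break before Extend characters, whatever precedes them.
    # Only the case where cur is a combining mark is modelled here.
    if not is_mark(cur):
        raise ValueError("sketch only covers <X, combining mark> pairs")
    return False


def line_break_allowed(prev, cur):
    # Only the case where cur is a combining mark is modelled here, and
    # prev is assumed not to be BK, CR, LF, NL or ZW.
    if not is_mark(cur):
        raise ValueError("sketch only covers <X, combining mark> pairs")
    if prev == " ":
        # LB9 does not apply after SP, so LB10 treats the mark as AL,
        # and LB18 then allows a break after the space.
        return True
    # LB9: the mark attaches to its base, so no break before it.
    return False


for base in ("a", " "):
    pair = (base, "\u0308")
    print(repr(base),
          "grapheme break:", grapheme_break_allowed(*pair),
          "line break:", line_break_allowed(*pair))
```

For <U+0061, U+0308> both functions agree that there is no boundary;
for <U+0020, U+0308> the line breaking side allows a break inside what
the grapheme side treats as a single cluster.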

Kind regards,
Konstantin
Received on Fri Jun 01 2012 - 23:27:01 CDT
