From: Andy Heninger (andyh@jtcsv.com)
Date: Mon Sep 13 2004 - 17:39:18 CDT
In looking at how the proposed changes to the TR 29 word boundary rules
would be implemented in the ICU library, I came across an odd situation
in the rules.
My question actually has nothing to do with the new proposed changes to
TR29, but is something that has been in the word boundary rules all along.
Looking at the pertinent parts of TR 29, we have
ALetter = Alphabetic = TRUE
and some other stuff
Rule 3: Treat a grapheme cluster as if it were a single
character, the first char of the cluster.
Rules 5, 6, 7, 9: Don't break between most letters
and a variety of other constructs.
ALetter x [whatever]
The issue is that there are characters that have both the Alphabetic
property = TRUE and the Grapheme Extend property = TRUE. These will
normally be part of a grapheme cluster, and thus will not directly
participate in determining word boundaries. But if you do get an
alphabetic combining char by itself, either at the start of the text, or
following a character that does not accept combining characters, the
Alphabetic property comes into play and the combining char can join with
following letters in forming a "word".
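As a concrete illustration, here is a small Python check on one character that I believe falls into this overlap, U+0345 COMBINING GREEK YPOGEGRAMMENI. The choice of character is my own example, not something taken from TR29. Python's unicodedata module does not expose Alphabetic or Grapheme Extend directly, so the sketch only checks the General Category and the combining class; the Alphabetic value is an assumption stated in a comment rather than computed.

```python
import unicodedata

# U+0345 COMBINING GREEK YPOGEGRAMMENI, picked as an assumed example.
ch = "\u0345"

# General Category Mn (nonspacing mark) implies Grapheme Extend = TRUE,
# since Grapheme Extend is derived as Me + Mn + Other_Grapheme_Extend.
is_nonspacing_mark = unicodedata.category(ch) == "Mn"

# A nonzero canonical combining class marks it as a combining character.
is_combining = unicodedata.combining(ch) != 0

# Assumption (not computed here): the UCD lists this character under
# Other_Alphabetic, which would make Alphabetic = TRUE.
assumed_alphabetic = True

print(unicodedata.name(ch), is_nonspacing_mark, is_combining)
```

If the assumption about Alphabetic holds, this character alone at the start of a text would be classified as ALetter.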
If the orphaned alphabetic combining character were preceded by some
sort of space, which is what you would normally do if you wanted to
display it without a base character, the space would determine the
word breaking and the combining char would not join with subsequent
letters to form a word. I suspect that a combining character with no
base would display in more or less the same way as one with a space as a
base.
What I think would make sense is to modify the definition of ALetter in
TR29 to look like this:
ALetter = Alphabetic = TRUE
[plus all the other stuff already in TR29]
AND NOT Grapheme Extend = TRUE
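The difference between the two definitions can be sketched as a pair of predicates. This is only an illustration: Props is a hypothetical record type invented for the example, and the "other stuff" in the real ALetter definition is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Props:
    """Hypothetical per-character property record (not an ICU or TR29 type)."""
    alphabetic: bool
    grapheme_extend: bool

def is_aletter_current(p: Props) -> bool:
    # Simplified current definition: Alphabetic = TRUE
    # (the other terms of the real definition are omitted).
    return p.alphabetic

def is_aletter_proposed(p: Props) -> bool:
    # Proposed definition: additionally require Grapheme Extend = FALSE.
    return p.alphabetic and not p.grapheme_extend

plain_letter = Props(alphabetic=True, grapheme_extend=False)
alphabetic_mark = Props(alphabetic=True, grapheme_extend=True)

for name, p in [("plain letter", plain_letter), ("alphabetic mark", alphabetic_mark)]:
    print(name, is_aletter_current(p), is_aletter_proposed(p))
```

Only characters with both properties set change classification; ordinary letters are unaffected.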
And how did I come to be looking at this particular case? The ICU
library word breaker uses a regular-expression-based set of rules and a
DFA-style matching engine that is very clever in always finding the
longest match, which is to say, the longest possible word.
Consider a sequence of four characters like this:
[ALetter] [MidLetter] [ALetter with GraphemeExtend] [Numeric]
In the correct interpretation of the TR29 word rules, the middle two
characters would combine to form a grapheme cluster, and the word break
rules would then be applied to the reduced sequence
[ALetter] [MidLetter] [Numeric]
which matches none of the word rules.
The DFA Regex implementation, however, finds a longer word by
considering the [ALetter + Grapheme Extend] character as just an ALetter:
[ALetter] [MidLetter] [ALetter] [Numeric]
which all sticks together through application of rules 6 and 9 from the TR.
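The two interpretations can be compared with a toy model in Python. This is a sketch, not the ICU implementation: the classes are symbolic labels, the helper names are made up for the example, and no_break_before encodes only my simplified reading of rules 5, 6, 7 and 9.

```python
# Symbolic stand-ins: (word-break class, has Grapheme Extend).
A = ("ALetter", False)
MID = ("MidLetter", False)
AX = ("ALetter", True)    # alphabetic combining character
NUM = ("Numeric", False)

def collapse_clusters(chars):
    """Rule 3: fold each Grapheme Extend character into the preceding
    character, keeping only the class of the cluster's first character.
    An extender with nothing before it is kept as its own class."""
    out = []
    for word_class, extend in chars:
        if extend and out:
            continue  # absorbed into the preceding cluster
        out.append(word_class)
    return out

def no_break_before(classes, i):
    """True if the simplified rules forbid a break before index i."""
    prev2 = classes[i - 2] if i >= 2 else None
    prev, cur = classes[i - 1], classes[i]
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    return (
        (prev == "ALetter" and cur == "ALetter")                               # rule 5
        or (prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter")     # rule 6
        or (prev2 == "ALetter" and prev == "MidLetter" and cur == "ALetter")   # rule 7
        or (prev == "ALetter" and cur == "Numeric")                            # rule 9
    )

def count_segments(classes):
    """Number of pieces the sequence is split into by the simplified rules."""
    if not classes:
        return 0
    breaks = sum(1 for i in range(1, len(classes)) if not no_break_before(classes, i))
    return breaks + 1

seq = [A, MID, AX, NUM]
clustered = count_segments(collapse_clusters(seq))        # clusters first
flat = count_segments([word_class for word_class, _ in seq])  # extender as plain ALetter
print(clustered, flat)  # prints: 3 1
```

With clusters collapsed first, the sequence splits into three pieces; with the extender treated as a plain ALetter, it stays together as one.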
While thinking about what to do about this, it struck me that it would
probably be more consistent all the way around to remove the Grapheme
Extend characters from the ALetter set. The only effect of this change
would be on the breaking behavior of combining characters with no base
character.
Any thoughts?
-- Andy Heninger, heninger@us.ibm.com