From: Andy Heninger (andyh@jtcsv.com)
Date: Mon Sep 13 2004 - 17:39:18 CDT
In looking at how the proposed changes to the TR 29 word boundary rules 
would be implemented in the ICU library, I came across an odd situation 
in the rules.
My question actually has nothing to do with the new proposed changes to 
TR29, but is something that has been in the word boundary rules all along.
Looking at the pertinent parts of TR 29, we have
    ALetter = Alphabetic = TRUE
                and some other stuff
     Rule 3:  Treat a grapheme cluster as if it were a single
              character, the first char of the cluster.
     Rules 5, 6, 7, 9:  Don't break between most letters
                       and a variety of other constructs.
                          ALetter x [whatever]
The issue is that there are characters that have both the Alphabetic 
property = TRUE and the Grapheme Extend property = TRUE.  These will 
normally be part of a grapheme cluster, and thus will not directly 
participate in determining word boundaries.  But if you do get an 
alphabetic combining char by itself, either at the start of the text, or 
following a character that does not accept combining characters, the 
Alphabetic property comes into play and the combining char can join with 
following letters in forming a "word".
If the orphaned alphabetic combining character were preceded by some 
sort of space, which is what you would normally do if you wanted to 
display it without a base character, the space would determine the 
word breaking and the combining char would not join with subsequent 
letters to form a word.  I suspect that a combining character with no 
base would display in more or less the same way as one with a space as a 
base.
What I think would make sense is to modify the definition of ALetter in 
TR29 to look like this:
    ALetter  =  Alphabetic = TRUE
                  [plus all the other stuff already in TR29]
                   AND NOT Grapheme Extend = TRUE
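
As a rough illustration of the proposed definition, here is a minimal Python sketch. The property table is a tiny hand-entered sample (my assumption; a real implementation would read the values from the Unicode Character Database), the function names are made up for this example, and the "other stuff" in the ALetter definition is left out entirely.

```python
# Hand-entered sample of property values, for illustration only.
SAMPLE = {
    "\u0061": {"alphabetic": True,  "grapheme_extend": False},  # LATIN SMALL LETTER A
    "\u0345": {"alphabetic": True,  "grapheme_extend": True},   # COMBINING GREEK YPOGEGRAMMENI
    "\u0301": {"alphabetic": False, "grapheme_extend": True},   # COMBINING ACUTE ACCENT
}

def is_aletter_current(ch):
    """ALetter as summarized above: Alphabetic = TRUE (other terms omitted)."""
    return SAMPLE[ch]["alphabetic"]

def is_aletter_proposed(ch):
    """Proposed ALetter: Alphabetic = TRUE and not Grapheme Extend = TRUE."""
    props = SAMPLE[ch]
    return props["alphabetic"] and not props["grapheme_extend"]

for ch in SAMPLE:
    print(f"U+{ord(ch):04X}", is_aletter_current(ch), is_aletter_proposed(ch))
```

Only the character that has both properties (U+0345 in this sample) changes class under the proposal; ordinary letters and non-alphabetic combining marks are unaffected.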
And how did I come to be looking at this particular case?  The ICU 
library word breaker uses a regular-expression based set of rules and a 
DFA style matching engine that is very clever in always finding the 
longest match, which is to say, the longest possible word.
Consider a sequence of four characters like this:
[ALetter] [MidLetter]  [ALetter with GraphemeExtend] [Numeric]
In the correct interpretation of the TR29 word rules, the middle two 
characters would combine to form a grapheme cluster, and the word break 
rules would then be applied to the reduced sequence
        [ALetter] [MidLetter] [Numeric]
which matches none of the word rules.
The DFA Regex implementation, however, finds a longer word by 
considering the [ALetter + Grapheme Extend] character as just an ALetter:
   [ALetter]  [MidLetter] [ALetter] [Numeric]
which all sticks together through application of rules 6, 7 and 9 from the TR.
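
The difference between the two interpretations can be sketched with a toy model in Python. This is not the ICU implementation or the full TR29 rule set: each character is reduced to a hypothetical (word class, Grapheme Extend flag) pair, and only the letter rules summarized above (ALetter x ALetter, ALetter x MidLetter ALetter, ALetter MidLetter x ALetter, ALetter x Numeric) are modeled. All names are invented for the example.

```python
A, MID, NUM = "ALetter", "MidLetter", "Numeric"

def cluster_first(chars):
    """Rule 3: fold each Grapheme Extend char into the preceding cluster,
    which keeps the word class of its first character."""
    clusters = []
    for word_class, extend in chars:
        if extend and clusters:
            continue  # absorbed by the preceding cluster
        clusters.append(word_class)
    return clusters

def per_character(chars):
    """Ignore clustering and classify every character on its own."""
    return [word_class for word_class, _ in chars]

def no_break(classes, i):
    """True if the modeled rules forbid a break between positions i and i+1."""
    left, right = classes[i], classes[i + 1]
    before = classes[i - 1] if i > 0 else None
    after = classes[i + 2] if i + 2 < len(classes) else None
    if left == A and right == A:
        return True                      # ALetter x ALetter
    if left == A and right == MID and after == A:
        return True                      # ALetter x MidLetter ALetter
    if before == A and left == MID and right == A:
        return True                      # ALetter MidLetter x ALetter
    if left == A and right == NUM:
        return True                      # ALetter x Numeric
    return False

def words(classes):
    """Split a class sequence into words using no_break()."""
    if not classes:
        return []
    result = [[classes[0]]]
    for i in range(len(classes) - 1):
        if no_break(classes, i):
            result[-1].append(classes[i + 1])
        else:
            result.append([classes[i + 1]])
    return result

# [ALetter] [MidLetter] [ALetter with GraphemeExtend] [Numeric]
seq = [(A, False), (MID, False), (A, True), (NUM, False)]
print(words(cluster_first(seq)))   # clusters formed first
print(words(per_character(seq)))   # every character classified separately
```

With clusters formed first, the sequence reduces to ALetter, MidLetter, Numeric and breaks at every position; classifying each character separately yields a single four-character word.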
While thinking about what to do about this, it struck me that it would 
probably be more consistent all the way around to remove the Grapheme 
Extend characters from the ALetter set.  The only effect of this change 
would be on the breaking behavior of combining characters with no base 
character.
Any thoughts?
-- 
   -- Andy Heninger
      heninger@us.ibm.com