TR29 Word Break awkwardness

From: Andy Heninger (andyh@jtcsv.com)
Date: Mon Sep 13 2004 - 17:39:18 CDT


    In looking at how the proposed changes to the TR 29 word boundary rules
    would be implemented in the ICU library, I came across an odd situation
    in the rules.

    My question actually has nothing to do with the new proposed changes to
    TR29, but is something that has been in the word boundary rules all along.

    Looking at the pertinent parts of TR 29, we have

        ALetter = Alphabetic = TRUE
                    and some other stuff

         Rule 3: Treat a grapheme cluster as if it were a single
                  character, the first char of the cluster.

         Rules 5, 6, 7, 9: Don't break between most letters
                           and a variety of other constructs.
                              ALetter x [whatever]

    The issue is that there are characters that have both the Alphabetic
    property = TRUE and the Grapheme Extend property = TRUE. These will
    normally be part of a grapheme cluster, and thus will not directly
    participate in determining word boundaries. But if you do get an
    alphabetic combining char by itself, either at the start of the text, or
    following a character that does not accept combining characters, the
    Alphabetic property comes into play and the combining char can join with
    following letters in forming a "word".

    If the orphaned alphabetic combining character were preceded by some
    sort of space, which is what you would normally do if you wanted to
    display it without a base character, the space would determine the
    word breaking and the combining char would not join with subsequent
    letters to form a word. I suspect that a combining character with no
    base would display in more or less the same way as one with a space as a
    base.

    What I think would make sense is to modify the definition of ALetter in
    TR29 to look like this:

        ALetter = Alphabetic = TRUE
                      [plus all the other stuff already in TR29]
                       AND NOT Grapheme Extend = TRUE

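    As a rough sketch of what this change would mean (this is neither ICU
    code nor TR29 text), the two definitions can be written as predicates
    over a pair of boolean properties. The `Char` record and the function
    names are made up for illustration, the "other stuff" in the real
    ALetter definition is left out, and only rule 5 (ALetter x ALetter) is
    modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """Hypothetical stand-in for the two properties that matter here."""
    name: str
    alphabetic: bool
    grapheme_extend: bool


def is_aletter_current(c: Char) -> bool:
    # Current definition, reduced to the Alphabetic test only.
    return c.alphabetic


def is_aletter_proposed(c: Char) -> bool:
    # Proposed definition: as above, minus Grapheme Extend characters.
    return c.alphabetic and not c.grapheme_extend


letter = Char("plain letter", alphabetic=True, grapheme_extend=False)
orphan = Char("alphabetic combining mark", alphabetic=True, grapheme_extend=True)


def joins_following_letter(first: Char, is_aletter) -> bool:
    # Rule 5 (ALetter x ALetter) applied to `first` followed by a plain
    # letter, where `first` starts the text and so has no base to attach to.
    return is_aletter(first) and is_aletter(letter)


print(joins_following_letter(orphan, is_aletter_current))   # orphan joins the word
print(joins_following_letter(orphan, is_aletter_proposed))  # orphan stands alone
```

    Under the current definition the orphaned mark joins the following
    letter; under the proposed one it does not, while ordinary letters
    behave the same either way.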
    And how did I come to be looking at this particular case? The ICU
    library word breaker uses a regular-expression based set of rules and a
    DFA style matching engine that is very clever in always finding the
    longest match, which is to say, the longest possible word.

    Consider a sequence of four characters like this:
    [ALetter] [MidLetter] [ALetter with GraphemeExtend] [Numeric]

    In the correct interpretation of the TR29 word rules, the middle two
    characters would combine to form a grapheme cluster, and the word break
    rules would then be applied to the reduced sequence
            [ALetter] [MidLetter] [Numeric]
    which matches none of the rules that prevent a break, so the three
    pieces stay separate.

    The DFA Regex implementation, however, finds a longer word by
    considering the [ALetter + Grapheme Extend] character as just an ALetter:

       [ALetter] [MidLetter] [ALetter] [Numeric]

    which all sticks together through application of rules 6, 7 and 9 from
    the TR.
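
    The two readings of the four-character sequence can be compared with a
    small model. This is only a sketch under simplifying assumptions: it is
    not the ICU engine and does not build a DFA, it covers just rules 3, 5,
    6, 7 and 9 as summarized earlier in this message, and the function
    names are invented for the example.

```python
A, MID, NUM = "ALetter", "MidLetter", "Numeric"


def cluster(chars):
    """Rule 3: fold each Grapheme Extend char into the preceding char,
    keeping only the word class of the cluster's first char."""
    out = []
    for word_class, extend in chars:
        if extend and out:
            continue  # absorbed by the preceding cluster
        out.append(word_class)
    return out


def no_break_before(classes, i):
    """True if rule 5, 6, 7 or 9 forbids a break between i-1 and i."""
    prev, cur = classes[i - 1], classes[i]
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    prev2 = classes[i - 2] if i >= 2 else None
    return (
        (prev == A and cur == A)                      # rule 5
        or (prev == A and cur == MID and nxt == A)    # rule 6
        or (prev2 == A and prev == MID and cur == A)  # rule 7
        or (prev == A and cur == NUM)                 # rule 9
    )


def words(classes):
    """Split a non-empty list of classes into words using the rules above."""
    result, current = [], [classes[0]]
    for i in range(1, len(classes)):
        if no_break_before(classes, i):
            current.append(classes[i])
        else:
            result.append(current)
            current = [classes[i]]
    result.append(current)
    return result


# (word class, has Grapheme Extend) for each of the four characters.
seq = [(A, False), (MID, False), (A, True), (NUM, False)]

by_spec = words(cluster(seq))         # clusters first, then word rules
by_dfa = words([c for c, _ in seq])   # Extend char treated as a plain ALetter

print(by_spec)
print(by_dfa)
```

    With clustering applied first, the model yields three separate pieces;
    without it, all four characters end up in a single word, which is the
    discrepancy described above.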

    While thinking about what to do about this, it struck me that it would
    probably be more consistent all the way around to remove the Grapheme
    Extend characters from the ALetter set. The only effect of this change
    would be on the breaking behavior of combining characters with no base
    character.

    Any thoughts?

    -- 
       -- Andy Heninger
          heninger@us.ibm.com
    


    This archive was generated by hypermail 2.1.5 : Mon Sep 13 2004 - 17:43:10 CDT