Re: TR29 Word Break awkwardness

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Sep 14 2004 - 17:41:24 CDT


    On 14/09/2004 22:44, Andy Heninger wrote:

    > Peter Kirk wrote:
    > > I have in mind certain situations found in Hebrew (Ketiv/Qere blended
    > > forms) in which anomalous (but quite frequently found) word forms
    > > begin with a spacing combining character. The currently specified way
    > > of supporting this situation is to use SPACE or NBSP followed by the
    > > combining character (as these combining characters do not have
    > > non-spacing clones). It would be highly undesirable to make a change
    > > here which would allow word breaks, line breaks etc. after the
    > > combining character but before the rest of the word.
    >
    > The proposed change to word boundaries would have no effect on the
    > case you describe, but word boundaries may already not be doing what
    > you want. If you have a SPACE or NBSP preceding the combining
    > character, the grapheme cluster composed of the space plus the
    > combining char will behave as just a space, and be split off from the
    > remainder of the word.
    >
    > I found 16 Hebrew characters that would be affected by the change,
    > \u05B0 HEBREW POINT SHEVA through
    > \u05C2 HEBREW POINT SIN DOT
    > with a couple of holes in the middle of the range.
    >
    > To have these characters attach to a following word, an alphabetic
    > base character is needed.
    >
    These are the Hebrew characters I had in mind. But then wouldn't the
    Hebrew accents U+0591 through U+05AF also be affected in the same way?
    If these don't have Grapheme_Extend = true, why not?
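
    One rough way to compare the two ranges is sketched below. This is an
    illustration only: Python's standard unicodedata module does not expose
    Grapheme_Extend, so the sketch assumes General_Category = Mn as a
    stand-in for it, and the helper name nonspacing_marks is hypothetical.
    The results reflect whatever Unicode version the interpreter ships
    with, so the count need not match the 16 characters quoted above.

```python
import unicodedata


def nonspacing_marks(first, last):
    """Return code points in [first, last] whose General_Category is Mn.

    Assumption: Mn is used here only as a proxy for Grapheme_Extend,
    which the standard library does not expose directly.
    """
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) == "Mn"]


# Hebrew points (the range quoted above) and Hebrew accents.
points = nonspacing_marks(0x05B0, 0x05C2)
accents = nonspacing_marks(0x0591, 0x05AF)

for cp in points + accents:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

    The "holes" in the points range appear as code points missing from the
    first list, for example U+05BE HEBREW PUNCTUATION MAQAF, which is
    punctuation rather than a mark.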

    Well, all of this rather surprises me, because we have been through this
    one on this list before and others have assured me that there is a
    special rule by which spaces with combining marks are treated specially.
    But I see that this rule is in TR14 under line breaking, not in TR29 under word
    breaking: "If U+0020 SPACE is used as a base character, it is treated
    as ID instead of SP." Well, it is perhaps more critical that there
    should be no line break in these situations than that there should be no
    word break. I must say I am confused as to why line breaking and word
    breaking are considered such different issues that they are dealt with
    entirely separately, when at least in the scripts I am familiar with the
    rules should be almost identical.

    But the fact that SPACE or even NBSP with a combining character is
    treated as not part of a word for word boundary calculation is another
    strong argument that INVISIBLE LETTER is necessary; cf. Public Review
    Issue #41.
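
    To make the sequences under discussion concrete, here is a minimal
    sketch that locates SPACE or NBSP used as a base for combining marks.
    It does not implement the TR29 or TR14 rules; it only finds the
    sequences whose treatment is in question. The function name
    space_based_marks is hypothetical, and treating categories Mn and Me
    as "combining" is an assumption made for this illustration.

```python
import unicodedata

# SPACE and NO-BREAK SPACE, the two base characters discussed above.
SPACE_BASES = {"\u0020", "\u00A0"}


def space_based_marks(text):
    """Return (index, sequence) for each SPACE/NBSP followed by marks."""
    found = []
    i = 0
    while i < len(text):
        if text[i] in SPACE_BASES:
            j = i + 1
            # Consume every combining mark that follows the space.
            while j < len(text) and unicodedata.category(text[j]) in ("Mn", "Me"):
                j += 1
            if j > i + 1:
                found.append((i, text[i:j]))
                i = j
                continue
        i += 1
    return found


# ALEF, SPACE + HIRIQ, BET: the point sits on the space, not on a letter.
sample = "\u05D0 \u05B4\u05D1"
hits = space_based_marks(sample)
```

    A word-boundary implementation that treats each reported sequence as
    an ordinary space would split it from the letters that follow, which
    is the behaviour described above.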

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

