Re: Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Sep 02 2005 - 21:18:37 CDT


    Ken is completely correct. There is no 'one way' to do word breaking,
    and UAX#29 specifically points out that tailorings are needed for
    different environments and locales.

    Related to this topic, there is a proposal we've been working on in CLDR
    to allow tailorings of word boundaries to be added to the repository.
    These would be accessed based on locale -- including the option of
    variants, much as CLDR has two different collation sequences for German,
    and many for Chinese. The proposal is based on the UAX#29 format, but made
    machine-readable and inheritable. For more information, see

    http://dev.icu-project.org/cgi-bin/locale-bugs?findid=276

    (I just added a more concrete example of what the rules would look like
    as Reply 2.)

    Mark
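
[Editorial sketch] To make "accessed based on locale" and "inheritable" concrete, here is a minimal Python sketch of a locale fallback lookup. Everything in it is an assumption made for illustration: the TAILORINGS table, its contents, and the function names are hypothetical and are not the format from the CLDR proposal. The root entry deliberately breaks around COLON, and the "sv" entry overrides it, reflecting the Swedish use of colon inside words.

```python
# Hypothetical per-locale overrides: code point -> Word_Break class name.
# The data is invented for illustration and is not taken from CLDR.
TAILORINGS = {
    "root": {0x0027: "MidLetter", 0x003A: "Other"},
    "sv": {0x003A: "MidLetter"},  # colon may occur inside Swedish words
    "sv_FI": {},                  # nothing of its own; inherits from "sv"
}


def fallback_chain(locale):
    """Return the lookup order, e.g. 'sv_FI' -> ['sv_FI', 'sv', 'root']."""
    chain = []
    while locale and locale != "root":
        chain.append(locale)
        # Drop the last subtag; 'sv' becomes '' and ends the loop.
        locale = locale.rpartition("_")[0]
    chain.append("root")
    return chain


def resolve(locale):
    """Merge overrides from root down to the most specific locale."""
    merged = {}
    for name in reversed(fallback_chain(locale)):
        merged.update(TAILORINGS.get(name, {}))
    return merged


if __name__ == "__main__":
    print(fallback_chain("sv_FI"))
    print(resolve("sv_FI"))
```

A locale with no entry of its own (for example "en" here) simply resolves to the root values, which is the property that makes a small set of tailorings sufficient.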

    Kenneth Whistler wrote:
    > Kent Spielmann wrote:
    >
    >
    >>When double clicking a word, I would want the whole
    >>word to be selected, not broken up at one of these "modifiers". This is not
    >>the case in most word processing programs. There is no standard behavior.
    >
    >
    > Correct. But your expectation that there should be one runs somewhat afoul
    > of the nature of the problem.
    >
    > There is no universal definition of "a word" in the first place,
    > that could be defined purely on the basis of a character encoding,
    > independent of considerations of particular languages and particular
    > orthographic conventions.
    >
    > [ Experimental data excised ]
    >
    >
    >>Note that no two pieces of software behave the same. It seems a standard
    >>behavior should be made clear in the Unicode standard
    >
    >
    > Well, I disagree in part with this assessment. How word processors choose
    > to implement double-click behavior is their concern, and may involve
    > a lot of factors and opinions regarding what is "right" and what
    > is "wrong" default selection behavior. It is not the place of
    > the Unicode Standard to dictate that, particularly in the
    > absence of any way of knowing what constraints implementations
    > may be operating under or what requirements their customers may
    > have.
    >
    > The Unicode Standard *does*, however, supply a specification of a default
    > word boundary detection algorithm (in UAX #29), which can be
    > used, but it is expected that implementations will, in most cases,
    > choose to tailor it in one way or another, or in other cases
    > simply implement their own word selection.
    >
    > If you work through that specification and apply it to the
    > particular characters you have chosen, you'll end up with
    > the following determinations:
    >
    > Class ALetter: 02B0, 02BC, 02C6, 02D0, 207F
    >
    > Class MidLetter: 0027, 003A
    >
    > Class Numeric: 0031
    >
    > Class Other: 00B9, 02C2, 02E9
    >
    > And the default word break determinations are as follows,
    > where "x" means don't break here, and "÷" means break here.
    >
    > ALetter x ALetter x ALetter
    > ALetter x MidLetter x ALetter
    > ALetter x Numeric x ALetter
    > ALetter ÷ Other ÷ ALetter
    >
    > which means that 00B9 (superscript 1), 02C2 (left
    > arrowhead), and 02E9 (extra-low tone bar) would not be
    > judged "letterlike" enough to be counted within the "word"
    > (an "L" in your chart), whereas the other characters would
    > be included within the "word" (a "W" in your chart).
    >
    > I actually think that is a pretty good default, as superscript
    > numerals, tone letters, and IPA non-letterlike diacritics
    > such as the left arrowhead are not common in actual, practical
    > orthographies. They occur occasionally, of course, and do
    > occur in transcriptional material, but I consider those to
    > be edge cases that I wouldn't expect generic software to have
    > to deal with. I don't expect a general purpose word processor
    > to allow me to double-click in the middle of a close
    > IPA transcription and correctly determine a "word" boundary
    > in such material, any more than I would expect it to be
    > able to parse out a mathematical expression or a particular
    > formal language construct. A special-purpose word processor
    > could, of course -- the way programming editors parse and
    > highlight C or Java constructs automatically. But that's
    > way beyond the requirements for something like Notepad.
    >
    > Except for U+003A COLON, the UAX #29 specification apparently matches
    > the actual behavior of OpenOffice Writer in your chart, from which
    > I surmise that it probably bases its word selection on
    > a word-break iterator class from ICU, based on an implementation
    > of UAX #29 word boundary detection. And COLON is a true
    > edge case -- for most purposes it is probably better to break
    > around it, but it does get used in some languages, including
    > Swedish, as part of words.
    >
    > WorldPad is similar, but doesn't show the later UAX #29 changes
    > for U+0027 and U+0031, so it might have been based on earlier
    > published word boundary detection suggestions from Unicode 3.0.
    >
    > --Ken
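
[Editorial sketch] The class assignments and the four break determinations listed in the message above can be reproduced with a small Python sketch. It implements only a simplified subset of the UAX #29 word rules (letter and digit runs, plus MidLetter between two letters), uses the class table exactly as given in the message, and falls back on str.isalpha() and str.isdigit() for all other characters, which is an assumption and not the real Word_Break property. The function names are hypothetical.

```python
# Word_Break classes for the code points discussed in the message, as listed
# there. Later Unicode versions may assign some of these differently.
CLASSES = {
    0x02B0: "ALetter", 0x02BC: "ALetter", 0x02C6: "ALetter",
    0x02D0: "ALetter", 0x207F: "ALetter",
    0x0027: "MidLetter", 0x003A: "MidLetter",
    0x0031: "Numeric",
    0x00B9: "Other", 0x02C2: "Other", 0x02E9: "Other",
}


def word_break_class(ch):
    """Look up the class; the fallback below is a rough approximation."""
    cls = CLASSES.get(ord(ch))
    if cls is not None:
        return cls
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    return "Other"


def is_break(cls, i):
    """True if there is a boundary between positions i-1 and i."""
    prev, cur = cls[i - 1], cls[i]
    letterish = {"ALetter", "Numeric"}
    # ALetter x ALetter, ALetter x Numeric, Numeric x ALetter, Numeric x Numeric
    if prev in letterish and cur in letterish:
        return False
    # ALetter x MidLetter ALetter
    if (prev == "ALetter" and cur == "MidLetter"
            and i + 1 < len(cls) and cls[i + 1] == "ALetter"):
        return False
    # ALetter MidLetter x ALetter
    if (prev == "MidLetter" and cur == "ALetter"
            and i >= 2 and cls[i - 2] == "ALetter"):
        return False
    return True


def split_words(text):
    """Split text at every boundary found by the simplified rules."""
    cls = [word_break_class(c) for c in text]
    segments, start = [], 0
    for i in range(1, len(text)):
        if is_break(cls, i):
            segments.append(text[start:i])
            start = i
    if text:
        segments.append(text[start:])
    return segments


if __name__ == "__main__":
    for sample in ["a\u02BCb", "a'b", "a1b", "a\u00B9b"]:
        print(repr(sample), split_words(sample))
```

A double-click selection would then pick the segment containing the click position; a tailoring amounts to changing entries in the class table or adding rules to is_break.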


