Re: Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Sep 02 2005 - 21:18:37 CDT

Next message: N. Ganesan: "Re: [indic] Unicode Processing Requirements for Tamil (was: 28th IUC paper - Tamil Unicode New)"

Previous message: James Kass: "Re: [indic] Unicode Processing Requirements for Tamil (was: 28th IUC paper - Tamil Unicode New)"
In reply to: Kenneth Whistler: "Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)"

Ken is completely correct. There is no 'one way' to do word breaking,
and UAX#29 specifically points out that tailorings are needed for
different environments and locales.

Related to this topic, there is a proposal we've been working on in CLDR
to allow tailorings of word boundaries to be added to the repository.
These would be accessed based on locale -- including the option of
variants, much as CLDR has two different collation sequences for German,
and many for Chinese. The proposal is based on UAX#29 format, but made
machine-readable and inheritable. For more information, see

http://dev.icu-project.org/cgi-bin/locale-bugs?findid=276

(I just added a more concrete example of what the rules would look like
as Reply 2.)
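
To make the "accessed based on locale" and "inheritable" part concrete,
here is a rough Python sketch of how a lookup with fallback might work.
It is only an illustration, not the format in the proposal: the
TAILORINGS table, the underscore-separated locale IDs, and the "en"
override are all hypothetical, and only the root classes for COLON and
APOSTROPHE come from Ken's table quoted below.

```python
# Hypothetical repository: locale ID -> overrides of word-break classes.
# Only the "root" values follow the defaults quoted in this thread; the
# "en" entry and the ID syntax are invented for illustration and are
# not real CLDR data.
TAILORINGS = {
    "root": {":": "MidLetter", "'": "MidLetter"},
    "en": {":": "Other"},   # assumed tailoring: break around COLON
    "sv": {},               # inherits everything from root
}


def fallback_chain(locale):
    """Return the lookup order, e.g. 'sv_FI' -> ['sv_FI', 'sv', 'root']."""
    chain = []
    while locale and locale != "root":
        chain.append(locale)
        # Drop the last underscore-separated subtag; '' ends the loop.
        locale = locale.rpartition("_")[0]
    chain.append("root")
    return chain


def resolve(locale):
    """Merge tailorings from root down to the most specific locale."""
    merged = {}
    for loc in reversed(fallback_chain(locale)):
        merged.update(TAILORINGS.get(loc, {}))
    return merged
```

With this table, resolve("sv_FI") keeps COLON as MidLetter through
inheritance from root, while resolve("en_US") picks up the assumed "en"
override and leaves APOSTROPHE untouched.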

Mark

Kenneth Whistler wrote:
> Kent Spielmann wrote:
>
>
>>When double clicking a word, I would want the whole
>>word to be selected, not broken up at one of these "modifiers". This is not
>>the case in most word processing programs. There is no standard behavior.
>
>
> Correct. But your expectation that there should be one runs somewhat
> afoul of the nature of the problem.
>
> There is no universal definition of "a word" in the first place,
> that could be defined purely on the basis of a character encoding,
> independent of considerations of particular languages and particular
> orthographic conventions.
>
> [ Experimental data excised ]
>
>
>>Note that no two pieces of software behave the same. It seems a standard
>>behavior should be made clear in the Unicode standard.
>
>
> Well, I disagree in part about this assessment. How word processors choose
> to implement double-click behavior is their concern, and may involve
> a lot of factors and opinions regarding what is "right" and what
> is "wrong" default selection behavior. It is not the place of
> the Unicode Standard to dictate that, particularly in the
> absence of any way of knowing what constraints implementations
> may be operating under or what requirements their customers may
> have.
>
> The Unicode Standard *does*, however, supply a specification of a default
> word boundary detection algorithm (in UAX #29), which can be
> used, but it is expected that implementations will, in most cases,
> choose to tailor it in one way or another, or in other cases
> simply implement their own word selection.
>
> If you work through that specification and apply it to the
> particular characters you have chosen, you'll end up with
> the following determinations:
>
> Class ALetter: 02B0, 02BC, 02C6, 02D0, 207F
>
> Class MidLetter: 0027, 003A
>
> Class Numeric: 0031
>
> Class Other: 00B9, 02C2, 02E9
>
> And the default word break determinations are as follows,
> where "x" means don't break here, and "÷" means break here.
>
> ALetter x ALetter x ALetter
> ALetter x MidLetter x ALetter
> ALetter x Numeric x ALetter
> ALetter ÷ Other ÷ ALetter
>
> which means by your chart, 00B9 (superscript 1), 02C2 (left
> arrowhead), and 02E9 (extra-low tone bar) would not be
> judged "letterlike" enough to be counted within the "word",
> (gets an "L" in your chart) whereas the other characters would
> be included within the "word" (gets a "W" in your chart).
>
> I actually think that is a pretty good default, as superscript
> numerals, tone letters, and IPA non-letterlike diacritics
> such as the left arrowhead are not common in actual, practical
> orthographies. They occur occasionally, of course, and do
> occur in transcriptional material, but I consider those to
> be edge cases that I wouldn't expect generic software to have
> to deal with. I don't expect a general purpose word processor
> to allow me to double-click in the middle of a close
> IPA transcription and correctly determine a "word" boundary
> in such material, any more than I would expect it to be
> able to parse out a mathematical expression or a particular
> formal language construct. A special-purpose word processor
> could, of course -- the way programming editors parse and
> highlight C or Java constructs automatically. But that's
> way beyond the requirements for something like Notepad.
>
> Except for U+003A COLON the UAX #29 specification matches, apparently,
> the actual behavior of OpenOffice Writer, from your chart, from which
> I surmise that it probably bases its word selection on
> a WordBreak iterator class from ICU, based on implementation
> of UAX #29 word boundary detection. And COLON is a true
> edge case -- for most purposes it is probably better to break
> around it, but it does get used in some languages, including
> Swedish, as parts of words.
>
> WorldPad is similar, but doesn't show the later UAX #29 changes
> for U+0027 and U+0031, so it might have been based on earlier
> published word boundary detection suggestions from Unicode 3.0.
>
> --Ken
>
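
The four default determinations in Ken's message can be reproduced with
a small Python sketch. This is a simplified illustration, not the full
UAX #29 algorithm: it covers only the letter, digit, and MidLetter
rules, ignores Extend, Format, MidNum, Katakana, and ExtendNumLet, and
takes its character classes from the table above. The function names
and the fallback that treats ASCII letters as ALetter are assumptions
made for the example.

```python
# Classes copied from the table in the quoted message. Everything else
# (function names, the ASCII-letter fallback) is illustrative only.
WORD_BREAK_CLASS = {
    0x02B0: "ALetter", 0x02BC: "ALetter", 0x02C6: "ALetter",
    0x02D0: "ALetter", 0x207F: "ALetter",
    0x0027: "MidLetter", 0x003A: "MidLetter",
    0x0031: "Numeric",
    0x00B9: "Other", 0x02C2: "Other", 0x02E9: "Other",
}


def wb_class(ch):
    """Look up the class; assume ASCII letters are ALetter, rest Other."""
    cp = ord(ch)
    if cp in WORD_BREAK_CLASS:
        return WORD_BREAK_CLASS[cp]
    if ch.isascii() and ch.isalpha():
        return "ALetter"
    return "Other"


def no_break_between(classes, i):
    """True if the simplified rules forbid a break before position i."""
    prev, cur = classes[i - 1], classes[i]
    nxt = classes[i + 1] if i + 1 < len(classes) else None
    prev2 = classes[i - 2] if i >= 2 else None
    letters_digits = {"ALetter", "Numeric"}
    # Letters and digits stick together in any combination.
    if prev in letters_digits and cur in letters_digits:
        return True
    # ALetter x MidLetter ALetter: look one character ahead.
    if prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter":
        return True
    # ALetter MidLetter x ALetter: look one character behind.
    if prev2 == "ALetter" and prev == "MidLetter" and cur == "ALetter":
        return True
    return False


def split_words(text):
    """Split text at every position where the rules allow a break."""
    if not text:
        return []
    classes = [wb_class(c) for c in text]
    pieces, start = [], 0
    for i in range(1, len(text)):
        if not no_break_between(classes, i):
            pieces.append(text[start:i])
            start = i
    pieces.append(text[start:])
    return pieces
```

For example, split_words("a\u02B0b") and split_words("a'b") each return
a single piece, while split_words("a\u00B9b") returns three pieces,
matching the "ALetter ÷ Other ÷ ALetter" line. A MidLetter with no
letter after it, as in "a:", is split off.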

This archive was generated by hypermail 2.1.5 : Fri Sep 02 2005 - 21:21:34 CDT