L2/02-047

To:	UTC
Re:	Default Word Boundary Definition
From:	Mark Davis
Date:	2001-01-29

The XML Query group is interested in using the Unicode default specification for word boundaries (Table 5-4 Word Boundaries) in their full text search work. However, this specification has not received nearly the attention -- and refinement -- of the default line boundary specification (UAX #14: Line Breaking Properties). The Query group is requesting that we review this specification and fix any problems so that they can use it as a default specification. (They would allow tailored word boundaries to be used as well, so that language-specific engines could do a better job; that's consistent with what we expect of default specifications.)

Background. The word boundaries are related to the line boundaries, but are distinct. Here is an example of word boundaries.

The | quick | brown | fox | can't | jump | 32.3 | feet | right

There is a boundary, for example, on either side of the word "brown".

The Query group's particular requirement is proximity: determining whether, for example, "monster" is within 3 words of "truck". That is done with the above boundaries by disregarding any sequence that does not contain a letter. Whether or not a sequence of digits counts as a word is left up to the implementation using the word boundaries. Thus for proximity the above is treated as if it were the following, so "fox" is within three words of "quick".

The | quick | brown | fox | can't | jump | 32.3 | feet | right
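The proximity test described above can be sketched as follows. This is only an illustration, not part of the proposal: the function name within_proximity, the keep_numeric flag, and the sample segment list are assumptions, and "within N words" is read here as a difference of at most N in word position.

```python
def within_proximity(segments, a, b, max_distance, keep_numeric=False):
    """Return True if word `a` occurs within `max_distance` words of word `b`.

    `segments` is the output of some word-boundary segmentation (assumed
    here to be a list of strings, including spaces and punctuation).
    """
    def counts_as_word(segment):
        # Disregard any segment that does not contain a letter. Whether
        # digit sequences count is an implementation choice (keep_numeric).
        if any(ch.isalpha() for ch in segment):
            return True
        return keep_numeric and any(ch.isdigit() for ch in segment)

    words = [s for s in segments if counts_as_word(s)]
    positions_a = [i for i, w in enumerate(words) if w == a]
    positions_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_distance
               for i in positions_a for j in positions_b)


# Hypothetical segmentation of the example sentence, spaces included.
SEGMENTS = ["The", " ", "quick", " ", "brown", " ", "fox", " ", "can't", " ",
            "jump", " ", "32.3", " ", "feet", " ", "right"]
```

With this list, within_proximity(SEGMENTS, "fox", "quick", 3) holds, while "The" and "right" are not within 3 words of each other. Setting keep_numeric=True makes "32.3" count as a word, which pushes "jump" and "feet" one position further apart.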

The current definitions in Table 5-4 Word Boundaries basically break between letters and non-letters, with combining marks considered part of the letter. Clusters of CJK characters or katakana are considered single words (including trailing sequences of hiragana).

There are some problems here.

What constitutes a letter is not well defined.
Certain punctuation can bridge a word, as in "can't".
Numbers don't have word breaks on either side.
Sequences of CJK characters count as a single word.

While the default definition can't do anything sophisticated with CJK sequences (such as dictionary lookup), it would be better to have breaks around each CJK character than to treat (potentially) a whole paragraph as a single word.

Proposal. I propose that we address these issues with a revised specification, leveraging the definitions and character properties that we use in line-break where possible.

Note: As we do this, we must remember that we are supplying a default specification. As with our other default specifications, implementations are free to override (tailor) the results to meet the requirements of different environments or particular languages.

Here is a strawman proposal that we could use for the basis of further discussion.

Table 5-4. Default Word Boundaries

Character Classes

Any of the Linebreak properties, plus:

Hiragana	Letter where script = HIRAGANA
Katakana	Letter where script = KATAKANA
Letter	(General Category = L* or Sk) AND ¬ (ID OR Hiragana OR Katakana)
MidLetter	U+0027 apostrophe, U+2019 curly apostrophe, U+003A colon (used in Swedish), U+002E period
Ignorable	Join_Controls, Bidi_Controls, Word_Joiner, ZWNBSP, CGJ, All combining marks (General Category = M*)
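As a rough illustration of the additional classes, here is a minimal sketch using Python's unicodedata module. Several points are assumptions rather than part of the proposal: the script and ID tests are approximated from character names because the module does not expose the Script or Line_Break properties, NU is approximated as General Category Nd, the control code points are hand-picked, and "Other" stands in for all remaining line break classes.

```python
import unicodedata

# Apostrophe, curly apostrophe, colon, period.
MID_LETTER = {"\u0027", "\u2019", "\u003A", "\u002E"}

# Hand-picked code points for the non-mark Ignorable members (an assumption).
IGNORABLE_CONTROLS = {
    "\u200C", "\u200D",                      # join controls (ZWNJ, ZWJ)
    "\u200E", "\u200F",                      # bidi controls (LRM, RLM)
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # bidi embeddings/overrides
    "\u2060",                                # word joiner
    "\uFEFF",                                # ZWNBSP
    "\u034F",                                # CGJ
}


def word_class(ch):
    """Return an approximate class name for a single character."""
    cat = unicodedata.category(ch)
    name = unicodedata.name(ch, "")
    if ch in IGNORABLE_CONTROLS or cat.startswith("M"):
        return "Ignorable"
    if ch in MID_LETTER:
        return "MidLetter"
    if cat.startswith("L") or cat == "Sk":
        # Script and ID membership are guessed from the character name.
        if name.startswith("HIRAGANA LETTER"):
            return "Hiragana"
        if name.startswith(("KATAKANA LETTER", "HALFWIDTH KATAKANA LETTER")):
            return "Katakana"
        if name.startswith(("CJK UNIFIED IDEOGRAPH",
                            "CJK COMPATIBILITY IDEOGRAPH")):
            return "ID"
        return "Letter"
    if cat == "Nd":
        return "NU"      # stand-in for the line break class NU
    return "Other"       # stand-in for every other line break class
```

A production implementation would take these values from the Unicode Character Database files rather than from character names.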

Rules

[ED NOTE: I am liberally copying from the LineBreak rules, and have not tried to make the result pretty. The items marked (LB) are exactly from Linebreak.]

Break at the start and end of text:

÷ sot

eot ÷

Each Ignorable is treated as if it had the type of the preceding character.

X Ignorable => X X
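One way to read this rule is as a preprocessing pass over the sequence of classes, run before any of the other rules. The sketch below assumes classes are plain strings and that an Ignorable at the very start of the text (where there is no preceding character) keeps its own class; the rule itself does not say what happens in that case.

```python
def resolve_ignorables(classes):
    """Apply 'X Ignorable => X X' from left to right.

    `classes` is a list of class names, one per character. A run of
    Ignorables inherits the class of the character before the run.
    """
    resolved = []
    for cls in classes:
        if cls == "Ignorable" and resolved:
            # Copy the (already resolved) class of the preceding character.
            resolved.append(resolved[-1])
        else:
            # Leading Ignorables are left unchanged (an assumption).
            resolved.append(cls)
    return resolved
```

For example, ["Letter", "Ignorable", "Ignorable", "NU"] becomes ["Letter", "Letter", "Letter", "NU"], so a letter followed by combining marks behaves as a single run of Letter.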

Don’t break words across certain punctuation

Letter × MidLetter Letter

Letter MidLetter × Letter

(LB) Don’t break within ‘a9’, ‘3a’, or ‘H%’

ID × PO

AL × NU

NU × AL

Numbers are of the form PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?

Examples: $(12.35) 2,1234 (12)¢ 12.54¢

This is approximated with the following rules. (Some cases already handled above, like ‘9,’, ‘[9’.)

(LB) Don’t break between the following pairs of classes.

CL × PO

HY × NU

IS × NU

NU × NU

NU × PO

PR × AL

PR × HY

PR × ID

PR × NU

PR × OP

SY × NU

Example pairs: ‘$9’, ‘$[’, ‘$-’, ‘-9’, ‘/9’, ‘99’, ‘,9’, ‘9%’, ‘]%’
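For comparison with the pair rules, the full number form quoted earlier (PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?) can be checked directly with a regular expression over class names. This is a sketch only: the LB_CLASS table is a hand-written assumption covering just the characters in the examples, and "XX" is a hypothetical placeholder for any class not listed.

```python
import re

# Illustrative line break classes for the characters used in the examples.
# A real implementation would read these from the UAX #14 data.
LB_CLASS = {
    "$": "PR", "(": "OP", ")": "CL", "-": "HY",
    ".": "IS", ",": "IS", "\u00A2": "PO", "%": "PO",
}

# PR ? ( OP | HY ) ? NU (NU | IS) * CL ? PO ?  over space-separated class names.
NUMBER = re.compile(r"(PR )?((OP|HY) )?NU( (NU|IS))*( CL)?( PO)?")


def lb_classes(text):
    """Map each character to a class name; digits are NU, unknowns are XX."""
    return [LB_CLASS.get(ch, "NU" if ch.isdigit() else "XX") for ch in text]


def is_number(text):
    """Return True if the whole of `text` matches the number form."""
    return NUMBER.fullmatch(" ".join(lb_classes(text))) is not None
```

All four examples given with the number form, such as is_number("$(12.35)") and is_number("2,1234"), match under this table. The pair rules accept more than this pattern does, which is why the text calls them an approximation.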

Break at a transition between Numbers and anything else

NU ÷ ¬ NU

¬ NU ÷ NU

Break at a transition between Letters and anything else

Letter ÷ ¬ Letter

¬ Letter ÷ Letter

Break at a transition between Hiragana and anything else

Hiragana ÷ ¬ Hiragana

¬ Hiragana ÷ Hiragana

Break at a transition between Katakana and anything else

Katakana ÷ ¬ Katakana

¬ Katakana ÷ Katakana

Break around ideographs

ID ÷

÷ ID

Otherwise, don't break

Any × Any
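Taken together, the rules can be sketched as an ordered series of checks on each adjacent pair, where the first matching rule wins. The following is an assumption-laden illustration, not a reference implementation: each character is represented as a set of labels supplied by the caller (its line break class plus any of the additional classes, e.g. {"AL", "Letter"}), Ignorables are assumed to have been resolved already, and the names is_break and boundaries are hypothetical.

```python
# (LB) pairs between which no break is allowed.
NO_BREAK_PAIRS = {
    ("ID", "PO"), ("AL", "NU"), ("NU", "AL"),
    ("CL", "PO"), ("HY", "NU"), ("IS", "NU"), ("NU", "NU"), ("NU", "PO"),
    ("PR", "AL"), ("PR", "HY"), ("PR", "ID"), ("PR", "NU"), ("PR", "OP"),
    ("SY", "NU"),
}

# Classes with a break at any transition into or out of them.
BREAK_AT_TRANSITION = ("NU", "Letter", "Hiragana", "Katakana")


def is_break(items, i):
    """Decide whether there is a boundary between items[i-1] and items[i]."""
    before, after = items[i - 1], items[i]
    # Letter x MidLetter Letter
    if ("Letter" in before and "MidLetter" in after
            and i + 1 < len(items) and "Letter" in items[i + 1]):
        return False
    # Letter MidLetter x Letter
    if ("MidLetter" in before and "Letter" in after
            and i >= 2 and "Letter" in items[i - 2]):
        return False
    # (LB) no-break pairs
    if any((a, b) in NO_BREAK_PAIRS for a in before for b in after):
        return False
    # Break at transitions for numbers, letters, hiragana, katakana
    for cls in BREAK_AT_TRANSITION:
        if (cls in before) != (cls in after):
            return True
    # Break around ideographs
    if "ID" in before or "ID" in after:
        return True
    # Otherwise, don't break
    return False


def boundaries(items):
    """Return boundary offsets, always including start and end of text."""
    if not items:
        return [0]
    inner = [i for i in range(1, len(items)) if is_break(items, i)]
    return [0] + inner + [len(items)]
```

As a usage example, the text "can't 9" with labels {"AL", "Letter"} for each letter, {"QU", "MidLetter"} for the apostrophe, {"SP"} for the space, and {"NU"} for the digit yields boundaries [0, 5, 6, 7]: "can't" stays together, and the space and the digit are separate segments.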

Note: Thai is a case where, as in LineBreak, a good implementation cannot depend on the default word boundary specification alone, but should use a more sophisticated mechanism. We do have to choose some default, however, in the absence of such a mechanism. The above would treat any sequence of Thai letters as a single word, relying on the (logical or physical) insertion of ZWSP to break up the words. The alternative is to treat them as ID, breaking everywhere.