L2/02-047R2
To: | UTC |
Re: | Default Word Boundary Definition |
From: | Mark Davis |
Date: | 2001-01-29 |
The XML Query group is interested in the possibility of using the Unicode default specifications (Table 5-4 Word Boundaries) for word boundaries in their full text search work. However, this specification has not received nearly the attention -- and refinement -- of the default line boundary specification (UAX #14: Line Breaking Properties). The Query group is requesting that we review this specification and fix any problems so that it could be utilized by them as a default specification. (They would allow tailored word boundaries to be used as well, so that language-specific engines could do a better job; that's consistent with what we expect of default specifications.)
Background. The word boundaries are related to the line boundaries, but are distinct. Here is an example of word boundaries.
The | quick | (" | brown | ") | fox | can't | jump | 32.3 | feet | , | right | ? |
There is a boundary, for example, on either side of the word "brown". These are the boundaries that users would expect, for example, if they chose "Whole Word Search".
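A minimal sketch of how a "Whole Word Search" could use such boundaries: a match counts only if a boundary falls at both its start and its end. The segment list is written by hand for illustration (it assumes spaces and adjacent punctuation fall into the same segment), and the helper names `boundaries` and `whole_word_find` are hypothetical.

```python
from itertools import accumulate

def boundaries(segments):
    """Offsets of every word boundary, including start and end of text."""
    return {0, *accumulate(len(s) for s in segments)}

def whole_word_find(segments, needle):
    """Return start offsets where needle occurs with a boundary on both sides."""
    text = "".join(segments)
    marks = boundaries(segments)
    hits = []
    start = text.find(needle)
    while start != -1:
        if start in marks and start + len(needle) in marks:
            hits.append(start)
        start = text.find(needle, start + 1)
    return hits

# Hand-written segmentation of: The quick ("brown") fox
segments = ["The", " ", "quick", ' ("', "brown", '") ', "fox"]
print(whole_word_find(segments, "brown"))  # [12]
print(whole_word_find(segments, "row"))    # []
```

The substring "row" occurs inside "brown" but is rejected, because no boundary falls on either side of it.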
The particular requirement that the Query group has is for proximity; seeing whether, for example, "monster" is within 3 words of "truck". That is done with the above boundaries by extracting any words that contain a letter or digit (whether or not digits are included would be left up to the implementation). Thus for proximity we get the following, so "fox" is within three words of "quick".
The | quick | brown | fox | can't | jump | 32.3 | feet | right |
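The proximity test described above can be sketched as follows. The input is assumed to be an already segmented text (the list below is written by hand), the function names are hypothetical, and digits are counted as word content, which the text leaves up to the implementation.

```python
def proximity_words(segments):
    """Keep only the segments that contain at least one letter or digit."""
    return [s for s in segments if any(ch.isalnum() for ch in s)]

def within(words, a, b, n):
    """True if some occurrence of a is at most n word positions from b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

# Hand-written segmentation of the example sentence.
segments = ["The", " ", "quick", ' ("', "brown", '") ', "fox", " ", "can't",
            " ", "jump", " ", "32.3", " ", "feet", ", ", "right", "?"]
words = proximity_words(segments)
print(within(words, "fox", "quick", 3))  # True
print(within(words, "The", "feet", 3))   # False
```

Punctuation-only and space-only segments drop out, so "fox" and "quick" end up two positions apart.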
The current definitions in Table 5-4 Word Boundaries basically break between letters and non-letters, with combining marks considered part of the letter. Clusters of CJK characters or katakana are considered single words (including trailing sequences of hiragana).
There are some problems here.
While the default definition can't do anything sophisticated with CJK text (such as dictionary lookup), it would be better to break around each single CJK character than to treat a whole paragraph (potentially) as a single word.
Proposal. We should address these issues with a revised default specification, leveraging the definitions and character properties that we use in line-break where possible.
Note: As we do this, we must remember that we are supplying a default specification. As with our other default specifications, implementations are free to override (tailor) the results to meet the requirements of different environments or particular languages.
Here is a strawman proposal that we could use for the basis of further discussion.
Table 5-4. Default Word Boundaries
sot | Start of Text |
eot | End of Text |
Hiragana | General_Category = Letter AND Script = HIRAGANA |
Katakana | General_Category = Letter AND Script = KATAKANA |
Letter | (General_Category = Letter OR General_Category = Modifier_Symbol) AND ¬ (Line_Break = Ideographic OR Hiragana OR Katakana) |
MidLetter | U+0027 (') apostrophe, U+2019 (’) curly apostrophe, U+003A (:) colon (used in Swedish), U+002E (.) period, U+00AD () soft hyphen, U+05F3 (׳) geresh, U+05F4 (״) gershayim |
Ignorable | Join_Controls, Bidi_Controls, Word_Joiner, ZWNBSP, CGJ, OR (General_Category = Mark) |
other | Other categories are from Line_Break (using the long names from PropertyAliases) |
Break at the start and end of text: |
|||
÷ | sot | ||
eot | ÷ | ||
Each Ignorable is treated as if it were the type of the preceding character. |
|||
X Ignorable => X X |
|||
Don’t break words across certain punctuation |
|||
Letter | × | MidLetter Letter | |
Letter MidLetter | × | Letter | |
Don’t break within ‘a9’, ‘3a’ |
|||
Alphabetic | × | Numeric | |
Numeric | × | Alphabetic | |
Don’t break within '-3.2' |
|||
Hyphen | × | Numeric | |
Numeric Infix_Numeric | × | Numeric | |
Numeric | × | Infix_Numeric Numeric | |
Prefix_Numeric | × | Numeric | |
Numeric | × | Postfix_Numeric | |
Break at a transition between numbers and anything else |
|||
Numeric | ÷ | ¬ Numeric | |
¬ Numeric | ÷ | Numeric | |
Break at a transition between letters and anything else |
|||
Letter | ÷ | ¬ Letter | |
¬ Letter | ÷ | Letter | |
Break at a transition between Hiragana and anything else |
|||
Hiragana | ÷ | ¬ Hiragana | |
¬ Hiragana | ÷ | Hiragana | |
Break at a transition between Katakana and anything else |
|||
Katakana | ÷ | ¬ Katakana | |
¬ Katakana | ÷ | Katakana | |
Break around ideographs |
|||
Ideographic | ÷ | ||
÷ | Ideographic | ||
Otherwise, don't break |
|||
Any | × | Any |
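The strawman rules can be tried out with a rough sketch like the one below. It is an approximation under several assumptions, not a reference implementation: Python's `unicodedata` has no Script or Line_Break property, so scripts are guessed from character names, the Hyphen, Infix_Numeric, Prefix_Numeric and Postfix_Numeric sets are tiny illustrative subsets, Alphabetic is treated the same as Letter, and an Ignorable is simply attached to the character before it. All function and constant names are hypothetical.

```python
import unicodedata

MID_LETTER = set("'\u2019:.\u00AD\u05F3\u05F4")
HYPHEN = set("-")             # illustrative subset
INFIX_NUMERIC = set(",.:;")   # illustrative subset
PREFIX_NUMERIC = set("$")     # illustrative subset
POSTFIX_NUMERIC = set("%")    # illustrative subset
EXTRA_IGNORABLE = set("\u200C\u200D\u2060\uFEFF")  # joiners, WJ, ZWNBSP

def classify(ch):
    """Approximate the property classes of the table for one character."""
    cat = unicodedata.category(ch)
    name = unicodedata.name(ch, "")
    if cat.startswith("M") or ch in EXTRA_IGNORABLE:
        return "Ignorable"
    if cat.startswith("L"):
        if name.startswith("HIRAGANA LETTER"):
            return "Hiragana"
        if name.startswith("KATAKANA LETTER"):
            return "Katakana"
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            return "Ideographic"
        return "Letter"
    if cat == "Sk":           # Modifier_Symbol counts as Letter in the table
        return "Letter"
    if cat == "Nd":
        return "Numeric"
    return "Other"

def units(text):
    """Group each character with the Ignorables that follow it."""
    out = []                  # entries: [class, base character, substring]
    for ch in text:
        cls = classify(ch)
        if cls == "Ignorable" and out:
            out[-1][2] += ch
        else:
            out.append(["Other" if cls == "Ignorable" else cls, ch, ch])
    return out

def is_break(u, i):
    """Decide whether a boundary falls between units i-1 and i."""
    pc, pch = u[i - 1][0], u[i - 1][1]
    cc, cch = u[i][0], u[i][1]
    prv = u[i - 2][0] if i >= 2 else None
    nxt = u[i + 1][0] if i + 1 < len(u) else None
    keep = (
        (pc == "Letter" and cch in MID_LETTER and nxt == "Letter")
        or (prv == "Letter" and pch in MID_LETTER and cc == "Letter")
        or {pc, cc} == {"Letter", "Numeric"}
        or (pch in HYPHEN and cc == "Numeric")
        or (prv == "Numeric" and pch in INFIX_NUMERIC and cc == "Numeric")
        or (pc == "Numeric" and cch in INFIX_NUMERIC and nxt == "Numeric")
        or (pch in PREFIX_NUMERIC and cc == "Numeric")
        or (pc == "Numeric" and cch in POSTFIX_NUMERIC)
    )
    if keep:
        return False
    if "Ideographic" in (pc, cc):
        return True
    kinds = ("Numeric", "Letter", "Hiragana", "Katakana")
    return any((pc == k) != (cc == k) for k in kinds)

def segment(text):
    """Split text at the default word boundaries (sketch)."""
    u = units(text)
    segments, current = [], ""
    for i, (_cls, _ch, s) in enumerate(u):
        if i > 0 and is_break(u, i):
            segments.append(current)
            current = ""
        current += s
    if current:
        segments.append(current)
    return segments

print(segment("can't jump 32.3 feet, right?"))
```

With these assumptions, "can't" and "32.3" each stay in one segment, and the comma and the following space end up together in one segment, since the final rule does not break between two characters that are neither letters, digits, kana nor ideographs.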
Notes:
Thai is a case where, as in LineBreak, a good implementation should not just depend on the default word boundary specification, but should use a more sophisticated mechanism. We do have to choose some default, however, in the absence of such a mechanism. The above would treat any sequence of Thai letters as a single word, depending on the (logical or physical) insertion of ZWSP to break up the words. The alternative is to treat them as ID, breaking everywhere.
The hard hyphen is a tricky case. It is quite common for separate words to be connected with a hyphen: out-of-the-box, under-the-table, Anglo-American, etc. A significant number are hyphenated names: Smith-Hawkins, etc. When people do a "Whole Word" search or query, they expect to find the word within those hyphens. While there are some cases where the hyphenated parts are not separate words, usually to resolve some ambiguity such as re-sort vs. resort, I think overall it's better to keep the hard hyphen out of the default definition.
Apostrophe is another one. Usually considered part of one word ("can't", "aujourd'hui"), it may also be considered to separate two ("l'objectif"). Also, one cannot easily distinguish the cases where it is used as a quotation mark from those where it is used as an apostrophe, so one should not include leading or trailing apostrophes.
Unfortunately we cannot resolve all of the issues across languages (or even within a language, since there are ambiguities). The goal is to have as workable a default as we can; tailored engines can be more sophisticated about these matters.
An alternative to the above would be to break within non-letters, such as:
The | quick | ( | " | brown | " | ) | fox | can't | jump | 32.3 | feet | , | right | ? |
There are pros and cons to this approach. The main advantage is that if someone did search for the literal string "feet," with a Whole Word search, it would return the right answer, since there would be a word boundary after the comma.
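A small sketch of that difference, assuming that under the main proposal the comma and the following space fall into one segment (a consequence of the final "Otherwise, don't break" rule). The helper names are hypothetical and the segment list is written by hand.

```python
def split_non_words(segments):
    """Break every segment without a letter or digit into single characters."""
    out = []
    for seg in segments:
        if any(ch.isalnum() for ch in seg):
            out.append(seg)
        else:
            out.extend(seg)   # one segment per character
    return out

def boundary_offsets(segments):
    """Offsets of all boundaries, including start and end of text."""
    offsets, pos = {0}, 0
    for seg in segments:
        pos += len(seg)
        offsets.add(pos)
    return offsets

coarse = ["feet", ", ", "right", "?"]    # main proposal (assumed)
fine = split_non_words(coarse)           # alternative: break within non-letters
end_of_match = len("feet,")              # offset just after the comma

print(end_of_match in boundary_offsets(coarse))  # False
print(end_of_match in boundary_offsets(fine))    # True
```

Only the finer segmentation has a boundary immediately after the comma, which is what a Whole Word search for "feet," needs.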