author: Marco Cimarosti
date: August 14, 2002
Rationale - The existing word-boundary rules of UTR#29 are designed to capture the role of apostrophes in English (and many other languages), where an apostrophe normally sits inside a word, as in "don't" or "Marco's". Apostrophes behave quite differently in Italian and French (and in other languages, e.g. Esperanto), where an apostrophe normally marks the deletion of the last vowel of a word that occurs before a word starting with a vowel, e.g. "d'Unicode" (d' from de = "of"), or "l'Angleterre" (l' from la = "the"). The two words are graphically joined (no space before or after the apostrophe), but the apostrophe is part of the first word, and an implicit word break comes after it. Implementing this behavior in the default definition of UTR#29 is important to accommodate the needs of the large French- and Italian-speaking communities, as well as those of people writing in other languages, who often use loanwords or quotations from these popular languages.
Proposed heuristic - The present proposal is based on the observation that French-style "splitting" apostrophes are always followed by a vowel, whereas English-style "joining" apostrophes are normally followed by a consonant. The issue is complicated by the fact that both French and Italian have mute H's, which can interfere with the algorithm.
The proposal defines three new character classes: LatinVowel (containing all vowels meaningful in French, Italian, and Esperanto), LatinH (containing only the letter H, in upper and lower case), and Apostrophe (containing the two characters used as apostrophes). The characters contained in the new classes are removed from the classes where they used to belong (ALetter and MidLetter). The new classes are used to define two new rules (before current rule 6) for French-style apostrophes, which cover the cases "
Open issues - Although this proposal might enhance the handling of some common cases in two common languages, many edge cases remain that can only be solved by tailoring the algorithm. For instance, the "c'h" trigraph of the Breton language would be unduly split by the default definition.
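The heuristic described above can be sketched as a single predicate on one apostrophe. This is only an illustration: the vowel set below is a hypothetical stand-in for the LatinVowel class (whose actual membership is not reproduced here), the function name is invented for this sketch, and "preceded by a letter" is approximated with Python's str.isalpha().

```python
# Hypothetical stand-ins for the proposal's classes; the real LatinVowel
# list is not given here, so this small set is an assumption.
LATIN_VOWELS = set("aeiouAEIOU") | set("àâéèêëîïôùûüìòÀÂÉÈÊËÎÏÔÙÛÜÌÒ")
LATIN_H = set("hH")
APOSTROPHES = {"\u0027", "\u2019"}  # apostrophe and curly apostrophe


def splits_after_apostrophe(text: str, i: int) -> bool:
    """Return True if text[i] looks like a French-style "splitting" apostrophe.

    The test is: a letter before the apostrophe, and after it either a
    vowel or a (presumably mute) H followed by a vowel.
    """
    if text[i] not in APOSTROPHES:
        return False
    if i == 0 or not text[i - 1].isalpha():
        return False
    following = text[i + 1:i + 3]  # up to two characters after the apostrophe
    if following[:1] in LATIN_VOWELS:
        return True
    return (
        len(following) == 2
        and following[0] in LATIN_H
        and following[1] in LATIN_VOWELS
    )
```

Under these assumptions the predicate is true for the apostrophes in "d'Unicode", "l'Angleterre" and "l'homme", and false for those in "don't" and "Marco's". It is also true for a Breton word such as "c'hoari", which illustrates the open issue with the "c'h" trigraph.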
Note - The proposed changes are concentrated in Table 2 (Default Word Boundaries). Proposed additions are colored in green, proposed deletions are colored in red, and existing text remains in black.
...
Table 2. Default Word Boundaries
Format | General_Category = Format (Cf) |
Katakana | Script = KATAKANA, or Any of the following: U+30FC # KATAKANA-HIRAGANA PROLONGED SOUND MARK U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK U+FF9E..U+FF9F # HALFWIDTH KATAKANA SOUND MARKS |
LatinVowel | Any of the following: |
LatinH | Any of the following: |
ALetter | Alphabetic = true, or and not listed in LatinVowel |
Apostrophe | U+0027 (') apostrophe U+2019 (’) curly apostrophe |
MidLetter | Any of the following: U+0027 (') apostrophe U+00AD () soft hyphen U+05F4 (״) gershayim U+2019 (’) curly apostrophe |
MidNumLet | Any of the following: U+002E (.) period U+003A (:) colon (used in Swedish) |
MidNum | Line_Break = Infix_Numeric and not MidNumLet = true |
other | Other categories are from Line_Break (using the long names from PropertyAliases) |
Break at the start and end of text. |
|||
sot | ÷ | (1) | |
÷ | eot | (2) | |
Treat a grapheme cluster as if it were a single character: the first character of the cluster. |
|||
GC | → | FB | (3) |
Ignore interior Format characters. That is, ignore Format characters in all subsequent rules (except the last rule). |
|||
X Format* | → | X | (4) |
Do not break between most letters. |
|||
ALetter | × | ALetter | (5) |
Break after an apostrophe preceding a Latin vowel (possibly preceded by a mute H). |
|||
ALetter Apostrophe | ÷ | LatinVowel | (5.a) |
ALetter Apostrophe | ÷ | LatinH LatinVowel | (5.b) |
Do not break letters across certain punctuation. |
|||
(ALetter | LatinVowel | LatinH) | × | (MidLetter | MidNumLet | Apostrophe) (ALetter | LatinH) | (6) |
(ALetter | LatinVowel | LatinH) (MidLetter | MidNumLet | Apostrophe ) | × | (ALetter | LatinVowel | LatinH) | (7) |
Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3'). |
|||
Numeric | × | Numeric | (8) |
(ALetter | LatinVowel | LatinH) | × | Numeric | (9) |
Numeric | × | (ALetter | LatinVowel | LatinH) | (10) |
Do not break within sequences like: ‘3.2’ or '3,456.789'. |
|||
Numeric (MidNum | MidNumLet) | × | Numeric | (11) |
Numeric | × | (MidNum | MidNumLet) Numeric | (12) |
Do not break between Katakana. |
|||
Katakana | × | Katakana | (13) |
Otherwise, break everywhere (including around ideographs). |
|||
Any | ÷ | Any | (14) |
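As a rough illustration of how the ordered rules interact, the following sketch applies rules 5 to 10 and 14 to a plain string. It rests on several assumptions that are not part of the proposal: the class membership is hypothetical (vowels are taken to be a, e, i, o, u with any accents stripped), and all three letter classes are accepted wherever a letter context appears, including rule 5, the left context of rules 5.a and 5.b, and the right context of rule 6, so that the apostrophe stays attached to the first word as the rationale describes. Grapheme clusters, Format characters, MidLetter, MidNum and Katakana are left out, and the function names are invented for this sketch.

```python
import unicodedata

VOWELS = set("aeiou")            # hypothetical LatinVowel base letters
APOSTROPHES = {"'", "\u2019"}
LETTERS = {"ALetter", "LatinVowel", "LatinH"}


def char_class(ch: str) -> str:
    """Map a character to a simplified word-break class (assumed membership)."""
    if ch in APOSTROPHES:
        return "Apostrophe"
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        # Strip accents so that e.g. "é" is classified like "e".
        base = unicodedata.normalize("NFD", ch)[0].lower()
        if base in VOWELS:
            return "LatinVowel"
        if base == "h":
            return "LatinH"
        return "ALetter"
    return "Other"


def segments(text: str) -> list[str]:
    """Split text at the boundaries chosen by the simplified rule set."""
    if not text:
        return []
    cls = [char_class(c) for c in text]

    def at(k: int):
        return cls[k] if 0 <= k < len(cls) else None

    breaks = [0]
    for i in range(1, len(text)):
        a2, a1, b1, b2 = at(i - 2), at(i - 1), at(i), at(i + 1)
        if a1 in LETTERS and b1 in LETTERS:                      # rule 5
            continue
        if a2 in LETTERS and a1 == "Apostrophe":
            if b1 == "LatinVowel" or (b1 == "LatinH" and b2 == "LatinVowel"):
                breaks.append(i)                                 # rules 5.a, 5.b
                continue
            if b1 in LETTERS:                                    # rule 7
                continue
        if a1 in LETTERS and b1 == "Apostrophe" and b2 in LETTERS:
            continue                                             # rule 6
        if a1 == "Numeric" and b1 == "Numeric":                  # rule 8
            continue
        if (a1 in LETTERS and b1 == "Numeric") or (
            a1 == "Numeric" and b1 in LETTERS
        ):                                                       # rules 9, 10
            continue
        breaks.append(i)                                         # rule 14
    breaks.append(len(text))
    return [text[s:e] for s, e in zip(breaks, breaks[1:])]
```

With these assumptions, segments("l'Angleterre") gives ["l'", "Angleterre"] and segments("l'homme") gives ["l'", "homme"], while "don't" and "Marco's" each stay in one piece. Because rules 5.a and 5.b are tested before rules 6 and 7, the break after the apostrophe takes priority over the "do not break" rules that follow.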