L2/02-303

Proposal to accommodate French and Italian elision rules in Unicode's DUTR#29

author: Marco Cimarosti
date: August 20, 2002
version: 4

Rationale - The existing word-boundary rules of DUTR#29 (version 2) are designed to capture the behavior of apostrophes in English (and many other languages), where an apostrophe normally sits inside a word, as in "don't" or "Marco's". Apostrophes behave quite differently in Italian and French, where an apostrophe normally marks elision, i.e. the deletion of the last vowel of a word that occurs before another word starting with a vowel. E.g. "d'Unicode" (d' is the elision of de = "of"), "l'Angleterre" (l' from la = "the"), or "d'un'altr'annata" (elision of di una altra annata: "of a past year"). The two (or more) words are graphically joined, with no space before or after the apostrophe. The apostrophe is part of the word that precedes it, and an implicit word break comes after it. Implementing this behavior in the default definition of DUTR#29 is important to accommodate the large French- and Italian-speaking communities, as well as people writing in other languages, who often use loanwords or quotations from these popular languages.

Proposed heuristic - The present proposal is based on the observation that elision apostrophes are always followed by a vowel, whereas English-style "joining" apostrophes are normally followed by a consonant. The issue is complicated by the fact that both French and Italian have mute H's, which can interfere with the algorithm. The proposal defines three new character classes: ElisionVowel (containing all the meaningful vowels in French and Italian), ElisionMute (containing only the letter H in upper and lower case), and ElisionApostrophe (containing the characters used as apostrophe). The characters contained in the new classes are removed from the classes where they used to be (ALetter and MidLetter). The new classes are used to define two new rules (before current rule 6) for elision apostrophes, which cover the cases C'V and C'hV (where C is a consonant and V is a vowel). Several rules are slightly changed because the former classes ALetter and MidLetter are now split into two or more classes.
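
As a rough illustration of the heuristic, the two cases C'V and C'hV can be approximated with a regular expression. This is only a minimal Python sketch under stated assumptions, not part of the proposal: the names VOWELS, ELISION and elision_breaks are hypothetical, "consonant" is approximated as any letter outside the ElisionVowel list, and grapheme clusters and Format characters are ignored.

```python
import re

# Code points listed under ElisionVowel in Table 2 (assumed complete here).
VOWELS = "AaEeIiOoUuYy\u00C6\u00E6\u0152\u0153"

# An apostrophe (U+0027 or U+2019) preceded by a letter that is not a
# listed vowel, and followed by a vowel or by H plus a vowel.
ELISION = re.compile(
    "(?<=[^\\W\\d_])(?<![" + VOWELS + "])['\u2019](?=[Hh]?[" + VOWELS + "])"
)


def elision_breaks(text):
    """Return the offsets just after each apostrophe matching C'V or C'hV."""
    return [m.end() for m in ELISION.finditer(text)]


print(elision_breaks("d'un'altr'annata"))  # [2, 5, 10]
print(elision_breaks("l'homme"))           # [2]
print(elision_breaks("don't"))             # []
```

The English-style apostrophe in "don't" is left alone because it is followed by a consonant, while each French or Italian elision yields a break position right after the apostrophe.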

Open issues - Although this proposal improves the handling of some common cases in two widely used languages, many edge cases remain that can only be solved by tailoring the algorithm for specific languages. For instance, the "c'h" trigraph of the Breton language, or the "g'" digraph of Uzbek, would be wrongly split by the default definition when followed by a vowel.

Discussion - This proposal has been discussed publicly on the Unicode Public E-mail List. I thank all the people who took part in the discussion. All the criticism I received was very valuable, and most of it has been incorporated in this version, in one form or another.

Note - The proposed changes are concentrated in Table 2 (Default Word Boundaries). Proposed additions are colored in green and underlined, proposed deletions are colored in red and struck through, and existing text remains in black.


...

Table 2. Default Word Boundaries

Character Classes
Format General_Category = Format (Cf)
Katakana Script = KATAKANA, or
Any of the following:
U+30FC # KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 # HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E..U+FF9F # HALFWIDTH KATAKANA SOUND MARKS
ElisionVowel Any of the following:
U+0041, U+0061 # LATIN CAPITAL/SMALL LETTER A
U+0045, U+0065 # LATIN CAPITAL/SMALL LETTER E
U+0049, U+0069 # LATIN CAPITAL/SMALL LETTER I
U+004F, U+006F # LATIN CAPITAL/SMALL LETTER O
U+0055, U+0075 # LATIN CAPITAL/SMALL LETTER U
U+0059, U+0079 # LATIN CAPITAL/SMALL LETTER Y
U+00C6, U+00E6 # LATIN CAPITAL/SMALL LETTER AE
U+0152, U+0153 # LATIN CAPITAL/SMALL LIGATURE OE
ElisionMute Any of the following:
U+0048, U+0068 # LATIN CAPITAL/SMALL LETTER H
ALetter Alphabetic = true, or
Any of the following modifier letters:
U+02B9..U+02BA # PRIME..DOUBLE PRIME
U+02C2..U+02CF # LEFT ARROWHEAD..LOW ACUTE ACCENT
U+02D2..U+02DF # CENTRED RIGHT HALF RING..CROSS ACCENT
U+02E5..U+02ED # EXTRA-HIGH TONE BAR..UNASPIRATED
U+05F3 (׳) geresh

and not Ideographic = true
and not Katakana = true
and not Script = Thai
and not Script = Lao
and not Script = Hiragana

and not listed in ElisionVowel
and not listed in ElisionMute
ElisionApostrophe Any of the following:
U+0027 (') apostrophe
U+2019 (’) curly apostrophe
MidLetter Any of the following:
U+0027 (') apostrophe [proposed deletion: moved to ElisionApostrophe]
U+00AD (­) soft hyphen
U+05F4 (״) gershayim
U+2019 (’) curly apostrophe [proposed deletion: moved to ElisionApostrophe]
MidNumLet Any of the following:
U+002E (.) period
U+003A (:) colon (used in Swedish)
MidNum Line_Break = Infix_Numeric
and not MidNumLet = true
other Other categories are from Line_Break
(using the long names from PropertyAliases)
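
The class definitions above can be sketched as a lookup function. The following Python is a hedged approximation, not a conformant implementation: word_class and the constant names are hypothetical, Alphabetic is approximated with str.isalpha(), Numeric with General_Category Nd, and the script and Ideographic exclusions with character-name prefixes, because the standard library exposes neither Script nor Line_Break. MidNum is therefore not modelled and falls through to "Other". Explicit lists are checked before Format so that U+00AD (which has General_Category Cf) lands in MidLetter; that ordering is an assumption of this sketch.

```python
import unicodedata

ELISION_VOWEL = set("AaEeIiOoUuYy\u00C6\u00E6\u0152\u0153")
ELISION_MUTE = set("Hh")
ELISION_APOSTROPHE = {"\u0027", "\u2019"}
MID_LETTER = {"\u00AD", "\u05F4"}  # apostrophes moved to ElisionApostrophe
MID_NUM_LET = {"\u002E", "\u003A"}
# Modifier letters and geresh explicitly added to ALetter in Table 2.
EXTRA_ALETTER = [(0x02B9, 0x02BA), (0x02C2, 0x02CF), (0x02D2, 0x02DF),
                 (0x02E5, 0x02ED), (0x05F3, 0x05F3)]
# Crude stand-ins for "not Ideographic / Thai / Lao / Hiragana".
EXCLUDED_PREFIXES = ("CJK", "HIRAGANA", "THAI", "LAO")


def word_class(ch):
    """Map one character to an (approximate) word-boundary class name."""
    if ch in ELISION_VOWEL:
        return "ElisionVowel"
    if ch in ELISION_MUTE:
        return "ElisionMute"
    if ch in ELISION_APOSTROPHE:
        return "ElisionApostrophe"
    if ch in MID_LETTER:
        return "MidLetter"
    if ch in MID_NUM_LET:
        return "MidNumLet"
    category = unicodedata.category(ch)
    if category == "Cf":
        return "Format"
    name = unicodedata.name(ch, "")
    if name.startswith(("KATAKANA", "HALFWIDTH KATAKANA")):
        return "Katakana"
    if category == "Nd":
        return "Numeric"
    if any(lo <= ord(ch) <= hi for lo, hi in EXTRA_ALETTER):
        return "ALetter"
    if ch.isalpha() and not name.startswith(EXCLUDED_PREFIXES):
        return "ALetter"
    return "Other"


print([word_class(c) for c in "l'ami"])
```

For "l'ami" this yields ALetter, ElisionApostrophe, ElisionVowel, ALetter, ElisionVowel, which is the C'V pattern the new rules look for.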

...

Rules
Break at the start and end of text.
sot ÷ (1)
÷ eot (2)
Treat a grapheme cluster as if it were a single character: the first character of the cluster.
GC → FB (3)
Ignore interior Format characters. That is, ignore Format characters in all subsequent rules (except the last rule).
X Format* → X (4)
Do not break between most letters.
(ALetter | ElisionVowel | ElisionMute) × (ALetter | ElisionVowel | ElisionMute) (5)
Break after an apostrophe following a consonant and preceding a vowel (possibly preceded by a mute H).
(ALetter | ElisionMute) ElisionApostrophe ÷ ElisionVowel (5.a)
(ALetter | ElisionMute) ElisionApostrophe ÷ ElisionMute ElisionVowel (5.b)
Do not break letters across certain punctuation.
(ALetter | ElisionVowel | ElisionMute) × (MidLetter | MidNumLet | ElisionApostrophe) (ALetter | ElisionVowel | ElisionMute) (6)
(ALetter | ElisionVowel | ElisionMute) (MidLetter | MidNumLet | ElisionApostrophe) × (ALetter | ElisionMute) (7)
Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3').
Numeric × Numeric (8)
(ALetter | ElisionVowel | ElisionMute) × Numeric (9)
Numeric × (ALetter | ElisionVowel | ElisionMute) (10)
Do not break within sequences like: '3.2' or '3,456.789'.
Numeric (MidNum | MidNumLet) × Numeric (11)
Numeric × (MidNum | MidNumLet) Numeric (12)
Do not break between Katakana.
Katakana × Katakana (13)
Otherwise, break everywhere (including around ideographs).
Any ÷ Any (14)
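
To see how the rules interact, they can be applied in order to a sequence of class names, with the first matching rule deciding each position. The Python below is a simplified sketch under stated assumptions: is_break and boundaries are hypothetical names, the input is assumed to be already reduced to one class per grapheme cluster with Format characters removed (so rules 3 and 4 are not modelled), and class names are passed in as plain strings.

```python
LETTER = {"ALetter", "ElisionVowel", "ElisionMute"}
CONSONANT = {"ALetter", "ElisionMute"}
MID_ANY = {"MidLetter", "MidNumLet", "ElisionApostrophe"}
MID_NUMERIC = {"MidNum", "MidNumLet"}


def is_break(c, i):
    """True if there is a break between c[i-1] and c[i], for 0 < i < len(c)."""
    prev2 = c[i - 2] if i >= 2 else None
    prev, cur = c[i - 1], c[i]
    nxt = c[i + 1] if i + 1 < len(c) else None
    after_elision = prev2 in CONSONANT and prev == "ElisionApostrophe"

    if prev in LETTER and cur in LETTER:                          # rule 5
        return False
    if after_elision and cur == "ElisionVowel":                   # rule 5.a
        return True
    if after_elision and cur == "ElisionMute" and nxt == "ElisionVowel":
        return True                                               # rule 5.b
    if prev in LETTER and cur in MID_ANY and nxt in LETTER:       # rule 6
        return False
    if prev2 in LETTER and prev in MID_ANY and cur in CONSONANT:  # rule 7
        return False
    if prev == "Numeric" and cur == "Numeric":                    # rule 8
        return False
    if prev in LETTER and cur == "Numeric":                       # rule 9
        return False
    if prev == "Numeric" and cur in LETTER:                       # rule 10
        return False
    if prev2 == "Numeric" and prev in MID_NUMERIC and cur == "Numeric":
        return False                                              # rule 11
    if prev == "Numeric" and cur in MID_NUMERIC and nxt == "Numeric":
        return False                                              # rule 12
    if prev == "Katakana" and cur == "Katakana":                  # rule 13
        return False
    return True                                                   # rule 14


def boundaries(classes):
    """Return break offsets, including start and end of text (rules 1, 2)."""
    n = len(classes)
    inner = [i for i in range(1, n) if is_break(classes, i)]
    return [0] + inner + ([n] if n else [])


A, V, H, Q = "ALetter", "ElisionVowel", "ElisionMute", "ElisionApostrophe"
print(boundaries([A, Q, V, A, V]))  # l'ami -> [0, 2, 5]
print(boundaries([A, V, A, Q, A]))  # don't -> [0, 5]
print(boundaries([A, Q, H, V]))     # c'h + vowel -> [0, 2, 4]
```

The last call has the shape of the Breton "c'h" trigraph followed by a vowel: rule 5.b splits it after the apostrophe, which is the kind of case that still needs language-specific tailoring.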

...