CLDR Ticket #6665 (accepted data)
Apostrophes as word-breakers and rule WB5a
|Reported by:||clementr@…||Owned by:||andy|
|Component:||segmentation||Data Locale:||fr, it|
Currently, ICU's BreakIterator does not consider apostrophes as word-breakers. This is a problem in French and Italian, where some short function words (le, ne, me) are written with apostrophes (l', n', m') if they are followed by a word starting with a vowel. So, for example, ICU correctly treats "le restaurant" as two words, but it treats "l'hotel" as a single word.
I came across this document, in which I found:
The use of the apostrophe is ambiguous. It is usually considered part of one word (“can’t” or “aujourd’hui”) but it may also be considered as part of two words (“l’objectif”). A further complication is the use of the same character as an apostrophe and as a quotation mark. Therefore leading or trailing apostrophes are best excluded from the default definition of a word. In some languages, such as French and Italian, tailoring to break words when the character after the apostrophe is a vowel may yield better results in more cases. This can be done by adding a rule WB5a.
Break between apostrophe and vowels (French, Italian).
WB5a. apostrophe ÷ vowels
and defining appropriate property values for apostrophe and vowels. Apostrophe includes U+0027 (') apostrophe and U+2019 (’) right single quotation mark (curly apostrophe). Finally, in some transliteration schemes, apostrophe is used at the beginning of words, requiring special tailoring.
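To make the quoted rule concrete, here is a minimal Python sketch of what a WB5a-style tailoring would do. It is not ICU's BreakIterator and not the actual rule syntax; the names `words`, `WORD`, `WB5A` and the vowel list are assumptions made for illustration only. It first finds default-style words (letters joined by internal apostrophes), then breaks after an apostrophe when the next character is a vowel.

```python
import re

# U+0027 and U+2019, the two apostrophe characters named in the quoted text.
APOSTROPHES = "'\u2019"
# Assumed vowel set for French and Italian (illustrative, not an official property).
VOWELS = "aeiouyàâäéèêëîïôöùûüœæìíòóú"

# Default-style word: runs of letters, optionally joined by single internal apostrophes.
WORD = re.compile(rf"[^\W\d_]+(?:[{APOSTROPHES}][^\W\d_]+)*")
# WB5a-style boundary: the position after an apostrophe and before a vowel.
WB5A = re.compile(rf"(?<=[{APOSTROPHES}])(?=[{VOWELS}])", re.IGNORECASE)

def words(text, tailored=True):
    """Return the words of text, optionally applying the WB5a-style break."""
    result = []
    for match in WORD.finditer(text):
        word = match.group()
        # Splitting on a zero-width match requires Python 3.7 or later.
        result.extend(WB5A.split(word) if tailored else [word])
    return result
```

With this sketch, "l'objectif" becomes two words while "can't" and "aujourd'hui" stay whole. Note that "l'hotel" also stays whole, because "h" is not in the assumed vowel set.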
I don't see a reason why this rule isn't always on for French and Italian.
I think that systematically treating apostrophes as word breakers in French and Italian would give better results, even without checking whether the next letter is a vowel. The word aujourd'hui is known to be the only exception where an apostrophe is used in a word, and technically, I don't think it would be wrong to treat aujourd'hui as a set of two words.
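As a rough illustration of this proposal, the following Python sketch always breaks after an apostrophe that sits between two letters, without checking for a vowel. It is only a toy model of the suggested behavior, not ICU code; `TOKEN` and `words_always_break` are hypothetical names.

```python
import re

# U+0027 and U+2019 are both treated as apostrophes.
APOSTROPHES = "'\u2019"

# A run of letters, plus a following apostrophe only when another letter comes
# right after it, so quotation-mark uses at word edges are left out.
TOKEN = re.compile(rf"[^\W\d_]+(?:[{APOSTROPHES}](?=[^\W\d_]))?")

def words_always_break(text):
    """Split text into words, always breaking after an internal apostrophe."""
    return TOKEN.findall(text)
```

Under this model "l'hotel" splits into "l'" and "hotel", and "aujourd'hui" splits into "aujourd'" and "hui", which is the exception discussed above.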
- Owner changed from anybody to andy
- Priority changed from assess to medium
- Type changed from unknown to enhancement
- Status changed from new to assigned
- Milestone changed from UNSCH to 25rc
- Phase changed from final to rc
- Milestone changed from 27 to 28