
CLDR Ticket #6665 (accepted data)

Opened 4 years ago

Last modified 19 months ago

Apostrophes as word-breakers and rule WB5a

Reported by: clementr@…
Owned by: andy
Component: segmentation
Data Locale: fr, it
Phase: rc
Review:
Weeks:
Data Xpath:



Currently, ICU's BreakIterator does not treat apostrophes as word breakers. This is a problem in French and Italian, where some short function words (le, ne, me) are elided and written with an apostrophe (l', n', m') when the following word starts with a vowel or a mute h. For example, ICU correctly treats "le restaurant" as two words, but treats "l'hotel" as a single word.

I came across this document, in which I found:

The use of the apostrophe is ambiguous. It is usually considered part of one word (“can’t” or “aujourd’hui”) but it may also be considered as part of two words (“l’objectif”). A further complication is the use of the same character as an apostrophe and as a quotation mark. Therefore leading or trailing apostrophes are best excluded from the default definition of a word. In some languages, such as French and Italian, tailoring to break words when the character after the apostrophe is a vowel may yield better results in more cases. This can be done by adding a rule WB5a.

Break between apostrophe and vowels (French, Italian).

WB5a. apostrophe ÷ vowels

and defining appropriate property values for apostrophe and vowels. Apostrophe includes U+0027 (') apostrophe and U+2019 (’) right single quotation mark (curly apostrophe). Finally, in some transliteration schemes, apostrophe is used at the beginning of words, requiring special tailoring.
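As a rough illustration of the WB5a tailoring quoted above, here is a minimal Python sketch. It is not ICU code and does not use ICU rule syntax. The function name split_words, the tailor_wb5a flag, and the vowel set (unaccented Latin vowels only) are assumptions made for this example.

```python
import re

APOSTROPHES = "'\u2019"   # U+0027 and U+2019, as listed in the quoted text
VOWELS = "aeiouAEIOU"     # assumption: unaccented Latin vowels only

# Simplified default word: runs of letters, optionally joined by one apostrophe.
_WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def split_words(text, tailor_wb5a=False):
    """Return the words in text; optionally break after apostrophe + vowel."""
    words = []
    for match in _WORD.finditer(text):
        word = match.group()
        if not tailor_wb5a:
            words.append(word)
            continue
        start = 0
        for i in range(1, len(word)):
            # WB5a: apostrophe ÷ vowel (the apostrophe stays with the left part)
            if word[i - 1] in APOSTROPHES and word[i] in VOWELS:
                words.append(word[start:i])
                start = i
        words.append(word[start:])
    return words
```

Under these assumptions, split_words("l'objectif", tailor_wb5a=True) gives ["l'", "objectif"], while "can't" and "aujourd'hui" stay whole. "l'hotel" also stays whole, because h is not in the assumed vowel set.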

I don't see a reason why this rule isn't always on for French and Italian.

I think that systematically treating apostrophes as word breakers in French and Italian would give better results, even without checking whether the next letter is a vowel. The word aujourd'hui is the best-known exception in which an apostrophe is used inside a single word (a few others, such as prud'homme, exist), and technically, I don't think it would be wrong to treat aujourd'hui as a set of two words.
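To make the unconditional variant concrete, here is a minimal Python sketch that breaks after every apostrophe that sits between two letters, without looking at which letter follows. It is an illustration only, not ICU behavior. The name split_words_always is hypothetical.

```python
import re

# A run of letters, plus a following apostrophe (U+0027 or U+2019) only when
# another letter comes right after it. Leading and trailing apostrophes are
# therefore never part of a word.
_TOKEN = re.compile(r"[^\W\d_]+(?:['\u2019](?=[^\W\d_]))?")


def split_words_always(text):
    """Return words, always breaking after a word-internal apostrophe."""
    return _TOKEN.findall(text)
```

With this sketch, "l'hotel" becomes ["l'", "hotel"] and "aujourd'hui" becomes ["aujourd'", "hui"], which is the two-word treatment described above.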


Change History

comment:1 Changed 4 years ago by markus

Seems useful and not difficult. Caveats:

  • Maintaining more near-duplicates of break iterators has a cost.
  • The language of text is often not known, or specified incorrectly, which makes language-specific rules a bit problematic, and less useful than one might wish.

comment:2 Changed 4 years ago by emmons

  • Owner changed from anybody to andy
  • Priority changed from assess to medium
  • Type changed from unknown to enhancement
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 25rc

comment:3 Changed 3 years ago by srl

  • Xref set to 1033

comment:4 Changed 3 years ago by andy

The proposed WB5a rule will not be the best way to implement this for ICU. A more likely approach is to modify WB6 and WB7 so that they no longer trigger, and therefore no longer suppress a break, at the positions we care about. In general, rules specifying a break (÷) are more complex to implement in ICU than those indicating no break (×).
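The idea of narrowing a no-break rule instead of adding a break rule can be sketched in a toy model. The Python below is not ICU rule syntax and not the actual change. It assumes a default of "break everywhere" with simplified WB5, WB6, and WB7 look-alikes that use isalpha() instead of the real property classes. The names no_break and segments, the vowel set, and the choice to narrow only the WB7-like rule are all assumptions for this example.

```python
APOSTROPHES = "'\u2019"
VOWELS = "aeiouAEIOU"  # assumption: unaccented Latin vowels only


def no_break(text, i, tailored):
    """True if a no-break (×) rule suppresses the boundary before text[i]."""
    prev, cur = text[i - 1], text[i]
    nxt = text[i + 1] if i + 1 < len(text) else ""
    prev2 = text[i - 2] if i >= 2 else ""
    # WB5-like: letter × letter
    if prev.isalpha() and cur.isalpha():
        return True
    # WB6-like: letter × apostrophe letter
    if prev.isalpha() and cur in APOSTROPHES and nxt.isalpha():
        return True
    # WB7-like: letter apostrophe × letter
    if prev2.isalpha() and prev in APOSTROPHES and cur.isalpha():
        # Tailoring: the rule does not fire before a vowel, so the default
        # "break" applies at that position.
        return not (tailored and cur in VOWELS)
    return False


def segments(text, tailored=False):
    """Split text at every position not covered by a no-break rule."""
    if not text:
        return []
    out, start = [], 0
    for i in range(1, len(text)):
        if not no_break(text, i, tailored):
            out.append(text[start:i])
            start = i
    out.append(text[start:])
    return out
```

In this model, segments("l'objectif", tailored=True) yields ["l'", "objectif"] without any explicit break rule: the break appears only because the WB7-like rule no longer suppresses it.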

comment:5 Changed 3 years ago by andy

  • Milestone changed from 25rc to 26rc

Moving this ticket to the next release. I started to do the corresponding ICU rules, and they are indeed a little tricky.

The existing ALetter class includes non-Latin scripts. Behavior of these around apostrophes needs to be thought about.

If the right side of the apostrophe rules is limited to Latin vowels, the normalization form of the input becomes another complication.
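One way to read the normalization concern, sketched in Python with the standard unicodedata module: a vowel test against a fixed set of base letters gives different answers for the same word depending on whether accented vowels are precomposed (NFC) or decomposed (NFD). The helper names and the vowel set are assumptions for this example, and the base-letter check is only one possible workaround.

```python
import unicodedata

BASE_VOWELS = set("aeiouAEIOU")  # assumption: unaccented Latin vowels only


def is_vowel_naive(ch):
    """Membership test on the raw code point."""
    return ch in BASE_VOWELS


def is_vowel_normalized(ch):
    """Test the base letter after canonical decomposition (single character)."""
    base = unicodedata.normalize("NFD", ch)[0]
    return base in BASE_VOWELS


nfc = unicodedata.normalize("NFC", "l'\u00e9t\u00e9")  # precomposed accented e
nfd = unicodedata.normalize("NFD", nfc)                # e + U+0301 combining acute

# Character right after the apostrophe in each form.
after_nfc, after_nfd = nfc[2], nfd[2]
```

Here is_vowel_naive(after_nfc) is False but is_vowel_naive(after_nfd) is True, so a rule built on the naive set would segment the two forms of the same word differently. The decomposing check returns True for both.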

What about h?

The corresponding ICU ticket is 10510

comment:6 Changed 3 years ago by andy

  • Milestone changed from 26rc to 27

comment:7 Changed 3 years ago by markus

  • Phase set to final

comment:8 Changed 2 years ago by emmons

  • Phase changed from final to rc
  • Milestone changed from 27 to 28

comment:9 Changed 2 years ago by markus

  • Type changed from enhancement to data

comment:10 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:11 Changed 19 months ago by emmons

  • Milestone changed from 28 to 28roll

Moving all outstanding 28 tickets to 28roll. We will discuss disposition of these at the next CLDR TC.

