Common Locale Data Repository: Bug Tracking

CLDR Ticket #8771 (accepted data)

Opened 21 months ago

Last modified 18 months ago

English sentence break suppressions missing "Dr."

Reported by: pedberg
Owned by: srl
Component: segmentation
Data Locale: en
Phase: rc

Description

This is probably also a bug for ULI. The English sentence break suppressions data is missing "Dr.", which seems like an important and necessary addition.
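To illustrate the effect being reported, here is a minimal sketch of suppression-based sentence splitting. It is a deliberately simplified stand-in, not the UAX #29 rules and not ICU's implementation; the function name `split_sentences` and the abbreviated `BASE_SUPPRESSIONS` set are assumptions made for this example and do not reproduce the actual CLDR English data.

```python
import re

# Hypothetical, abbreviated suppression list; NOT the actual CLDR "en" data.
BASE_SUPPRESSIONS = {"Mr.", "Mrs.", "Prof."}


def split_sentences(text, suppressions):
    """Break after '.', '?' or '!' followed by whitespace, unless the
    token ending at that point is listed in `suppressions`."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]\s+", text):
        # Tokens of the candidate sentence, including the terminator.
        tokens = text[start:match.start() + 1].split()
        if tokens and tokens[-1] in suppressions:
            continue  # suppressed: do not break after this token
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Please see Dr. Smith today. Mr. Jones is away."

# Without "Dr." in the set, a spurious break appears after "Dr.".
without_dr = split_sentences(text, BASE_SUPPRESSIONS)

# With "Dr." added, the title stays attached to the name.
with_dr = split_sentences(text, BASE_SUPPRESSIONS | {"Dr."})
```

In this sketch, `without_dr` splits the first sentence in two at "Dr.", while `with_dr` yields the two sentences a reader would expect.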

Change History

comment:1 Changed 21 months ago by pedberg

  • Cc pedberg added
  • Owner changed from anybody to srl
  • Priority changed from assess to medium
  • Status changed from new to accepted
  • Milestone changed from UNSCH to 28

The issue is that "Dr." could either be "doctor" (should suppress break) or "drive" (as part of an address, should not suppress break).

Perhaps we need different suppression sets to handle this.

comment:2 follow-up: ↓ 3 Changed 21 months ago by pedberg

Usually "Dr." as part of an address would either have a line break after it (in which case there will be a sentence break regardless), or it will have a comma and continue with the rest of the address, in which case it should not have a break. So it would seem default suppression of this break would be better for both the doctor and drive cases.
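The cases in this comment can be sketched with a toy splitter. This is an illustration under stated assumptions, not the real segmentation algorithm: `split_sentences` and `SUPPRESSIONS` are hypothetical names, a line break is treated as an unconditional sentence break, and a break candidate is only a terminator followed by a space.

```python
import re

# Hypothetical suppression set containing only the entry under discussion.
SUPPRESSIONS = {"Dr."}


def split_sentences(text, suppressions):
    """Toy splitter: hard line breaks always end a sentence; within a
    line, break after '.', '?' or '!' plus spaces unless suppressed."""
    sentences = []
    for line in text.splitlines():
        start = 0
        for match in re.finditer(r"[.?!] +", line):
            tokens = line[start:match.start() + 1].split()
            if tokens and tokens[-1] in suppressions:
                continue  # suppressed: no break after this token
            sentences.append(line[start:match.end()].strip())
            start = match.end()
        tail = line[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences


# Address with a line break after "Dr.": breaks regardless of suppression.
multi_line = split_sentences("123 Main Dr.\nSpringfield", SUPPRESSIONS)

# Address continuing after a comma: no break candidate at all.
one_line = split_sentences("123 Main Dr., Springfield", SUPPRESSIONS)

# "Doctor" usage: suppression keeps the title with the name.
doctor = split_sentences("Ask Dr. Lee. She knows.", SUPPRESSIONS)

# Remaining trade-off: "drive" ending a sentence in running text.
drive = split_sentences("Turn onto Elm Dr. It is quiet.", SUPPRESSIONS)
```

Under these assumptions, the two address layouts behave the same with or without the suppression, and the "doctor" case is fixed by it. The only case that loses a break is `drive`, where "Dr." meaning "drive" ends a sentence in running text.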

comment:3 in reply to: ↑ 2 Changed 21 months ago by shervin

Replying to pedberg:

> Usually "Dr." as part of an address would either have a line break after it (in which case there will be a sentence break regardless), or it will have a comma and continue with the rest of the address, in which case it should not have a break. So it would seem default suppression of this break would be better for both the doctor and drive cases.

I agree. Also, the USPS convention for mailing addresses is to avoid most punctuation altogether; see https://goo.gl/Gkl3ZJ

I think the data should be tuned for generic text, not for specialized textual data such as addresses, which need their own handling and parsing rules anyway. It's a safe guess that anyone handling mailing addresses knows not to segment them using generic rules.

comment:4 Changed 18 months ago by emmons

  • Milestone changed from 28 to 28roll

Moving all outstanding 28 tickets to 28roll. We will discuss disposition of these at the next CLDR TC.
