CLDR Ticket #3987 (accepted data)

Opened 7 years ago

Last modified 3 years ago

Consider removing 'colon' from WordBreak=MidLetter in root and tailoring Swedish ?

Reported by: jungshik
Owned by: jungshik
Component: main
Data Locale:
Phase:
Review:
Weeks: 0.2
Data Xpath:
Xref:

Description

According to UAX #29, ':' has the property Word_Break = MidLetter ( http://goo.gl/ozcFq ). So there is no word break at a colon when it is surrounded by letters (WB6 and WB7).

Apparently, ':' (colon) is included in the set because Swedish uses it in the middle of a word.

I wonder if it's better to do the following instead:

  1. In root, break words at a colon.
  2. In the sv tailoring, do what the current root does.
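
The two steps above can be sketched roughly as follows. This is only a toy approximation, not the UAX #29 algorithm or an ICU/CLDR API: the `MID_LETTER` table and the `words` helper are hypothetical names, only the colon is modelled as MidLetter, and a Unicode-letter regex stands in for ALetter.

```python
import re

# One or more Unicode letters; a rough stand-in for a run of ALetter.
LETTERS = r"[^\W\d_]+"

# Hypothetical per-locale MidLetter sets (assumption: only the colon is modelled).
MID_LETTER = {"root": "", "sv": ":"}

def words(text, locale="root"):
    """Return letter-only words, keeping 'letter MidLetter letter' together (WB6/WB7)."""
    mid = MID_LETTER.get(locale, MID_LETTER["root"])
    if mid:
        pattern = LETTERS + "(?:[" + re.escape(mid) + "]" + LETTERS + ")*"
    else:
        pattern = LETTERS
    return re.findall(pattern, text)

print(words("USA:s och c:a", "sv"))
print(words("USA:s och c:a", "root"))
```

With this sketch, the sv entry keeps "USA:s" and "c:a" whole, while the root entry splits them at the colon.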

This is the opposite of what I asked for in a couple of other tickets (3974 and 3975), where what's good for ja and fi is applicable to all languages as well.

Change History

comment:1 Changed 7 years ago by jungshik

  • Status changed from new to closed
  • Resolution set to duplicate

comment:2 Changed 6 years ago by jungshik

  • Status changed from closed to reopened
  • Resolution duplicate deleted

I don't remember why I closed this as a duplicate. I couldn't find any duplicate of this bug.

Anyway, I guess this has to be brought up at UTC (if it's not already taken up there).

comment:3 Changed 6 years ago by emmons

  • Owner changed from somebody to jungshik
  • Status changed from reopened to assigned
  • Milestone changed from UNSCH to 22

comment:4 Changed 6 years ago by jungshik

  • Summary changed from Consider removing 'colon' from WordBreak=MidLetter ? to Consider removing 'colon' from WordBreak=MidLetter in root and tailoring Swedish ?

comment:5 Changed 6 years ago by kent.karlsson14@…

"... because Swedish uses it in the middle of a word"; well, it is used in a few particular abbreviations, when the middle of the word is abbreviated away. There are very few such abbreviations in general use: "c:a" (for "cirka"), "k:a" (for "kyrka", church), "s:t" (for "sankt"), "g:a" (for "gamla", old). (B.t.w., Danish and Norwegian use the abbreviation "ca." for "cirka".)

Colon is also used when adding inflections to abbreviated names, e.g. "tv:n" (this seems to be used for Finnish as well), "USA:s" (this seems to be used, at least sometimes, also in Norwegian and (Northern?) Sami), "UFO:t", "UFO:na", or to numbers, e.g. "3:e", "3:ans". Colon is also used between digits, in e.g. currency values (like "12:50") and time values (as it is for many languages), and some other cases.

So even if this use may be more prominent in Swedish, I would not limit it to just Swedish; indeed, the restriction to "letter colon letter" is too narrow.

comment:6 Changed 6 years ago by kent.karlsson14@…

I would suggest updating the following rules in UAX 29:

WB6. ALetter × (MidLetter | MidNumLet) ALetter
WB7. ALetter (MidLetter | MidNumLet) × ALetter

to

WB6. (Numeric | ALetter) × (MidLetter | MidNumLet) ALetter
WB7. (Numeric | ALetter) (MidLetter | MidNumLet) × ALetter

in order to handle number inflections (like 3:e (for tredje), 3:ans (for treans)).
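
As a rough illustration of the difference, the sketch below compares the current and the proposed WB6/WB7 using regular expressions. It is an approximation under stated assumptions, not the UAX #29 algorithm: `CURRENT` and `PROPOSED` are hypothetical names, the colon is the only MidLetter character modelled, and Unicode letters and digits stand in for ALetter and Numeric.

```python
import re

# [^\W_] matches one letter or digit; [^\W\d_] matches one letter.

# Current rules: a colon joins only when it has a letter on both sides.
CURRENT = re.compile(r"[^\W_]+(?:(?<=[^\W\d_]):(?=[^\W\d_])[^\W_]+)*")

# Proposed rules: a letter or digit may precede the colon; a letter must follow.
PROPOSED = re.compile(r"[^\W_]+(?::(?=[^\W\d_])[^\W_]+)*")

for sample in ["3:e", "3:ans", "tv:n", "12:50"]:
    print(sample, CURRENT.findall(sample), PROPOSED.findall(sample))
```

In this sketch "3:e" and "3:ans" stay whole only under the proposed rules, "tv:n" stays whole under both, and "12:50" is split under both because a digit follows the colon.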

And change (first one editorial):

U+003A ( : ) COLON (used in Swedish)

to

U+003A ( : ) COLON

and move the colon-like characters from MidLetter to MidNumLet (to handle numerals like "3:50" as one "word").
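
The effect of the class change can be sketched as below. This is a simplified model, assuming that MidLetter joins only letters (WB6/WB7), MidNum joins only digits (WB11/WB12), and MidNumLet does both; `colon_joins` is a hypothetical helper, and `str.isalpha` / `str.isdigit` are rough stand-ins for ALetter and Numeric.

```python
def colon_joins(before, after, colon_class):
    """Return True if no word break occurs around a colon between two characters."""
    letter_letter = before.isalpha() and after.isalpha()  # WB6/WB7
    digit_digit = before.isdigit() and after.isdigit()    # WB11/WB12
    if colon_class == "MidLetter":
        return letter_letter
    if colon_class == "MidNum":
        return digit_digit
    if colon_class == "MidNumLet":
        return letter_letter or digit_digit
    raise ValueError("unknown Word_Break class: " + colon_class)

# "3:50": split while the colon is MidLetter, kept whole once it is MidNumLet.
print(colon_joins("3", "5", "MidLetter"))
print(colon_joins("3", "5", "MidNumLet"))
```

Note that in this model "3:e" (digit, colon, letter) still breaks under MidNumLet alone; it relies on the WB6/WB7 change suggested earlier in this comment.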

UAX 29 text changes (editorial):

Change:

Certain cases such as colons in words (c:a) are included in the default even though they may be specific to relatively small user communities (Swedish) because they do not occur otherwise, in normal text, and so do not cause a problem for other languages.

to

Certain cases such as colons in abbreviated words (e.g., "c:a") and inflections (e.g., "3:ans", "tv:n") are included in the default even though they may be specific to relatively small user communities (Swedish and other languages) because they do not occur otherwise, in normal text, and so do not cause a problem for languages that do not use this convention.

and

It includes characters that may not be appropriate for identifiers, and some that would not be parts of words. It also permits some characters that may be part of words in a broad sense, but not part of names, such as in "c:a" in Swedish, or hyphenation points used in dictionary words.

to

It includes characters that may not be appropriate for identifiers, and some that would not be parts of words. It also permits some characters that may be part of words in a broad sense, but not part of names, such as in some abbreviations like "c:a" and some inflections like "USA:s" and "3:e" in Swedish, or hyphenation points used in dictionary words.

comment:7 Changed 6 years ago by mark

  • Keywords google added

comment:8 Changed 6 years ago by mark

  • Weeks set to 0.2

I still think that we should limit this to the languages that use it (according to the above, Swedish, Finnish, maybe Norwegian?), since the behavior is unexpected in other languages. We can broaden it out to other languages if needed.

comment:9 Changed 4 years ago by emmons

  • Cc changed from mark,andy to mark, andy
  • Milestone changed from future to UNSCH

Merging future and UNSCH

comment:10 Changed 3 years ago by markus

  • Type changed from defect to data

comment:11 Changed 3 years ago by srl

  • Status changed from assigned to accepted