CLDR Ticket #8248 (closed: fixed)

Opened 3 years ago

Last modified 3 years ago

Danish collation tailoring may need all case variants of aa

Reported by: pedberg
Owned by: markus
Component: xxx-spec
Data Locale: da
Phase: final
Review: pedberg
Weeks:
Data Xpath:
Xref:

Description

This is based on investigation by Markus into http://bugs.icu-project.org/trac/ticket/11423

The Danish standard collation tailoring needs to specify collation for all case variants of "aa":

  • Currently it has "...å<<<Å<<<aa<<<Aa<<<AA"
  • It should have "...å<<<Å<<<aa<<<aA<<<Aa<<<AA"

The latter is in the draft="unconfirmed" alt="proposed" type="standard" version (which also has one other change).
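As a rough illustration of what "all case variants" means for the two-letter contraction above, the following sketch enumerates them and joins them with the tertiary operator. The helper name `case_variants` is hypothetical, and the snippet only builds a rule fragment as text; it does not touch CLDR data or run any collator.

```python
from itertools import product

def case_variants(contraction):
    """Return every upper/lower case combination of `contraction`, lowercase first."""
    # One (lower, upper) pair per character of the contraction.
    choices = [(ch.lower(), ch.upper()) for ch in contraction]
    variants = []
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate not in variants:  # caseless characters would otherwise repeat
            variants.append(candidate)
    return variants

variants = case_variants("aa")
rule_tail = "<<<".join(variants)
print(rule_tail)  # aa<<<aA<<<Aa<<<AA
```

The only variant missing from the current rule is "aA".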

Markus also asks: "I don't know whether CLDR consistently has tailoring data for all case variants of contractions. Should it?" Probably.

Change History

comment:1 follow-up: ↓ 2 Changed 3 years ago by markus

Note that I did not conclude that it “needs to specify collation for all case variants of "aa"”. I suggested “either "workingasdesigned" or "moved-to-CLDR"”.

We need to discuss & decide whether we care to support strangely-cased words like "aÅron" in the ICU ticket's test cases.

comment:2 in reply to: ↑ 1 Changed 3 years ago by pedberg

Replying to markus:

> Note that I did not conclude that it “needs to specify collation for all case variants of "aa"”. I suggested “either "workingasdesigned" or "moved-to-CLDR"”.

> We need to discuss & decide whether we care to support strangely-cased words like "aÅron" in the ICU ticket's test cases.

Right, I should have said: We should consider having the Danish standard collation tailoring specify collation for all case variants of "aa"

(Sorry for the overly hasty disposition)

comment:3 Changed 3 years ago by pedberg

  • Summary changed from Danish collation tailoring needs all case variants of aa to Danish collation tailoring may need all case variants of aa

comment:4 follow-up: ↓ 5 Changed 3 years ago by kent.karlsson14@…

First, aÅron does not have two "a" (of any case) in a row anyway (except formally, if decomposed, but that must be regarded as irrelevant, also in collation). Secondly, Aaron (written casewise as it usually is), is pronounced as <long a>ron (and should be collated under A... Also in Danish/Norwegian; ZWNBSP, WJ...). Nowhere near Åron (which to an English ear would sound almost like Oron, or even Ooron).

And no, I don't see any point in regarding aA as collatable as an Å.

comment:5 in reply to: ↑ 4 Changed 3 years ago by markus

Replying to kent.karlsson14@…:

> First, aÅron does not have two "a" (of any case) in a row anyway (except formally, if decomposed, but that must be regarded as irrelevant, also in collation).

This might be your opinion, but the Unicode Collation Algorithm (UCA) is defined in terms of NFD.
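To make the NFD point concrete, here is a small sketch using Python's standard `unicodedata` module. It only shows the decomposed code points of "aÅron"; how a particular collator then matches contractions against that sequence is not demonstrated here.

```python
import unicodedata

word = "a\u00C5ron"  # "aÅron", with Å as the precomposed U+00C5
nfd = unicodedata.normalize("NFD", word)

# After decomposition, Å becomes A (U+0041) followed by COMBINING RING ABOVE (U+030A),
# so the letters "a" and "A" end up adjacent in the normalized string.
code_points = [f"U+{ord(ch):04X}" for ch in nfd]
print(code_points)
# ['U+0061', 'U+0041', 'U+030A', 'U+0072', 'U+006F', 'U+006E']
```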

> Secondly, Aaron (written casewise as it usually is), is pronounced as <long a>ron (and should be collated under A... Also in Danish/Norwegian; ZWNBSP, WJ...). Nowhere near Åron (which to an English ear would sound almost like Oron, or even Ooron).

Collation by algorithm works on sequences of letters, not on their human interpretation.
If someone wants to prevent "aa" in Danish from sorting as a contraction, they need to use a CGJ: http://www.unicode.org/reports/tr10/#Combining_Grapheme_Joiner
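A minimal sketch of that workaround, assuming the caller knows which words should not be treated as containing the contraction: insert U+034F COMBINING GRAPHEME JOINER between the two letters. The helper name `block_aa` is hypothetical, and the snippet only edits the string; it does not invoke a collator to confirm the resulting sort order.

```python
import re

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def block_aa(text):
    """Insert a CGJ after every 'a'/'A' that is directly followed by another 'a'/'A'."""
    return re.sub(r"(?i)a(?=a)", lambda m: m.group(0) + CGJ, text)

blocked = block_aa("Aaron")
print([f"U+{ord(ch):04X}" for ch in blocked])
# ['U+0041', 'U+034F', 'U+0061', 'U+0072', 'U+006F', 'U+006E']
```

The two letters are no longer adjacent in the stored string, which is what prevents the contraction from matching.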

> And no, I don't see any point in regarding aA as collatable as an Å.

I am not sure we need it either.

comment:6 Changed 3 years ago by emmons

  • Status changed from new to assigned
  • Component changed from data-collation to spec
  • Priority changed from assess to minor
  • Phase changed from rc to final
  • Milestone changed from UNSCH to 27
  • Owner changed from anybody to markus

comment:7 Changed 3 years ago by markus

  • Cc mark, pedberg, emmons, yoshito added
  • Keywords collation added
  • Status changed from assigned to reviewing
  • Review set to pedberg

At the very end of the CLDR collation guidelines we had

Case Combinations. Normally all combinations of case need to be supplied for contractions. That is, if ch is a contraction, then you would have the rules ... ch < cH < Ch < CH. The reason for this is so that all case variants sort at the same primary level: thus lowercasing a string will not affect its primary order. Cases such as McHugh are handled like other instances where contractions should be blocked.

In the 2015mar11 CLDR team meeting, there was consensus that lower-upper contractions like aA and cH are not desirable because they are unlikely to represent the contraction. I changed the guidelines text to

Case Combinations. The lowercase, titlecase, and uppercase variants of contractions need to be supplied, with tertiary differences in that order (regardless of the caseFirst setting). That is, if ch is a contraction, then you would have the rules ... ch <<< Ch <<< CH. Other case variants such as cH are excluded because they are unlikely to represent the contraction, for example in McHugh. (Therefore, mchugh and McHugh will be primary different if ch adds a primary difference.)
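The revised guideline can be sketched as a small helper that produces only the lowercase, titlecase, and uppercase variants of a contraction, in that order. The function name `contraction_rule` is hypothetical, and the titlecasing here is a simplification (first character uppercased, the rest lowercased), which is an assumption that holds for plain ASCII contractions like ch or aa.

```python
def contraction_rule(contraction):
    """Build the tertiary-difference rule fragment for one contraction."""
    lower = contraction.lower()
    # Simplified titlecase: uppercase the first character only.
    title = lower[0].upper() + lower[1:]
    upper = contraction.upper()
    return " <<< ".join([lower, title, upper])

print(contraction_rule("ch"))  # ch <<< Ch <<< CH
print(contraction_rule("aa"))  # aa <<< Aa <<< AA
```

Mixed variants such as cH or aA are never generated, matching the exclusion described in the guideline text.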

I added a link from the collation tailorings spec to the guidelines.

comment:8 Changed 3 years ago by pedberg

  • Status changed from reviewing to closed
  • Resolution set to fixed