Common Locale Data Repository: Bug Tracking

CLDR Ticket #6275 (closed defect: fixed)

Opened 2 years ago

Last modified 22 months ago

AddLikelySubtags as defined in TR35 appears inconsistent.

Reported by: mpvl@…
Owned by: mark
Component: xxx-spec
Review: markus

Description (last modified by markus)

I would expect that if addLikelySubtags(und_XX) produces xx_Scrp_XX, then addLikelySubtags(und_Scrp_XX) produces xx_Scrp_XX as well. However, consider the following data in likelySubtags.xml:

		<likelySubtag from="und_BJ" to="fr_Latn_BJ"/>

There is no entry with from="und_Latn_BJ". According to step 2 of the algorithm in TR35, we would first check und_Latn_BJ, which does not match. Then, in step 2.2, we find a match for und_Latn. Based on this, we would expect an output of en_Latn_BJ, which breaks the principle stated above.
Instead, I would expect fr_Latn_BJ. ICU, btw, does "the right thing" and returns fr_Latn_BJ.


Change History

comment:1 Changed 2 years ago by emmons

  • Owner changed from anybody to mark
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 24rc

comment:2 Changed 2 years ago by mark

  • Milestone changed from 24rc to 24final

comment:3 Changed 22 months ago by mark

  • Review set to markus

Added clarifications. However, 'und_Latn' shouldn't have an entry in the CLDR table, because it gets removed in construction. The reason is that 'und' alone maps to 'en_Latn_US'.

We might be able to construct a different problem case by using a script other than Latn, but I can't think of one right now.

comment:4 Changed 22 months ago by mpvl@…

Good point. However, the problem is still there. Below I included what I believe is an exhaustive list of the cases where the inconsistency appears. In these cases, ICU does NOT give a consistent answer!

und_AF -> fa_Arab_AF; und_Arab_AF hits und_Arab -> ar_Arab_EG; fa != ar
und_BG -> bg_Cyrl_BG; und_Cyrl_BG hits und_Cyrl -> ru_Cyrl_RU; bg != ru
und_BT -> dz_Tibt_BT; und_Tibt_BT hits und_Tibt -> bo_Tibt_CN; dz != bo
und_BY -> be_Cyrl_BY; und_Cyrl_BY hits und_Cyrl -> ru_Cyrl_RU; be != ru
und_ER -> ti_Ethi_ER; und_Ethi_ER hits und_Ethi -> am_Ethi_ET; ti != am
und_IR -> fa_Arab_IR; und_Arab_IR hits und_Arab -> ar_Arab_EG; fa != ar
und_KG -> ky_Cyrl_KG; und_Cyrl_KG hits und_Cyrl -> ru_Cyrl_RU; ky != ru
und_MK -> mk_Cyrl_MK; und_Cyrl_MK hits und_Cyrl -> ru_Cyrl_RU; mk != ru
und_MN -> mn_Cyrl_MN; und_Cyrl_MN hits und_Cyrl -> ru_Cyrl_RU; mn != ru
und_NP -> ne_Deva_NP; und_Deva_NP hits und_Deva -> hi_Deva_IN; ne != hi
und_RS -> sr_Cyrl_RS; und_Cyrl_RS hits und_Cyrl -> ru_Cyrl_RU; sr != ru
und_TJ -> tg_Cyrl_TJ; und_Cyrl_TJ hits und_Cyrl -> ru_Cyrl_RU; tg != ru
und_UA -> uk_Cyrl_UA; und_Cyrl_UA hits und_Cyrl -> ru_Cyrl_RU; uk != ru
und_UZ -> uz_Cyrl_UZ; und_Cyrl_UZ hits und_Cyrl -> ru_Cyrl_RU; uz != ru
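
To make the list above easier to reproduce, here is a minimal Go sketch of the lookup as this ticket describes it: full tag first, then language+script, then language+region, then language alone. It is not the ICU or TR35 reference code. The table holds only four entries quoted above, and the names `likely`, `parse`, `join`, `addLikelySubtags` and `inconsistent` are made up for this illustration.

```go
package main

import "strings"

// likely is a toy excerpt of likelySubtags.xml, limited to entries quoted in this ticket.
var likely = map[string]string{
	"und_BG":   "bg_Cyrl_BG",
	"und_Cyrl": "ru_Cyrl_RU",
	"und_AF":   "fa_Arab_AF",
	"und_Arab": "ar_Arab_EG",
}

// tag holds the three subtags; an empty string means the subtag is absent.
type tag struct{ lang, script, region string }

// parse splits a tag of the form lang[_Scrp][_RG]; a 4-letter part is taken as the script.
func parse(s string) tag {
	var t tag
	for i, p := range strings.Split(s, "_") {
		switch {
		case i == 0:
			t.lang = p
		case len(p) == 4:
			t.script = p
		default:
			t.region = p
		}
	}
	return t
}

// join concatenates the non-empty parts with underscores.
func join(parts ...string) string {
	var out []string
	for _, p := range parts {
		if p != "" {
			out = append(out, p)
		}
	}
	return strings.Join(out, "_")
}

// addLikelySubtags tries the keys in the order described in this ticket
// (assumed here: full tag, language+script, language+region, language)
// and fills the missing subtags of the input from the first match.
func addLikelySubtags(s string) string {
	t := parse(s)
	keys := []string{
		join(t.lang, t.script, t.region),
		join(t.lang, t.script),
		join(t.lang, t.region),
		t.lang,
	}
	for _, k := range keys {
		m, ok := likely[k]
		if !ok {
			continue
		}
		r := parse(m)
		if t.lang != "und" {
			r.lang = t.lang
		}
		if t.script != "" {
			r.script = t.script
		}
		if t.region != "" {
			r.region = t.region
		}
		return join(r.lang, r.script, r.region)
	}
	return s // no match: return the input unchanged
}

// inconsistent reports whether und_<region> and und_<script>_<region>
// maximize to different tags, where <script> is taken from the first result.
func inconsistent(region string) bool {
	base := addLikelySubtags("und_" + region)
	script := parse(base).script
	return addLikelySubtags(join("und", script, region)) != base
}
```

With this toy table, `addLikelySubtags("und_BG")` gives bg_Cyrl_BG while `addLikelySubtags("und_Cyrl_BG")` gives ru_Cyrl_BG, so `inconsistent("BG")` is true. The same holds for AF.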

comment:5 Changed 22 months ago by markus

  • Description modified

The changes in r9343 look reasonable to me, but I see bad cross-links among LDML spec parts; I submitted ticket:6683 for that issue.

About comment:4, I don't know if this should be fixed with CLDR data changes or with ICU code changes. Mark, please take a look.

comment:6 Changed 22 months ago by mpvl@…

I would suggest fixing it in code. It is a very easy algorithmic fix. I use it in the Go library. If it were to be fixed in data, I would actually remove it again in my table builder and keep the code as is. For my implementation this would be faster and result in smaller tables. In fact, for the data structures I use in the Go implementation, the adjusted algorithm is slightly simpler than the original!
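
The comment above does not spell out the adjusted algorithm. One plausible reading, and it is only an assumption, is to try the language+region key before the language+script key. The Go sketch below shows that reordering on a two-entry toy table taken from comment:4. The names `tag`, `likely` and `maximize` are hypothetical and are not taken from the Go library mentioned above. As comment:7 points out, a code change alone may not cover every case.

```go
package main

// tag holds language, script and region subtags; "" means the subtag is absent.
type tag struct{ lang, script, region string }

// likely is a toy table with two entries quoted in comment:4.
var likely = map[tag]tag{
	{"und", "", "BG"}:   {"bg", "Cyrl", "BG"},
	{"und", "Cyrl", ""}: {"ru", "Cyrl", "RU"},
}

// maximize fills in the missing subtags of in from the first matching key.
// With regionFirst set, the language+region key is tried before the
// language+script key (the assumed adjustment); otherwise the order is
// the one described in this ticket.
func maximize(in tag, regionFirst bool) tag {
	byScript := tag{in.lang, in.script, ""}
	byRegion := tag{in.lang, "", in.region}
	keys := []tag{in, byScript, byRegion, {in.lang, "", ""}}
	if regionFirst {
		keys[1], keys[2] = byRegion, byScript
	}
	for _, k := range keys {
		m, ok := likely[k]
		if !ok {
			continue
		}
		// Subtags present in the input win over those of the match.
		if in.lang != "und" {
			m.lang = in.lang
		}
		if in.script != "" {
			m.script = in.script
		}
		if in.region != "" {
			m.region = in.region
		}
		return m
	}
	return in // no match: return the input unchanged
}
```

On this table, `maximize(tag{"und", "Cyrl", "BG"}, false)` returns ru/Cyrl/BG, while passing `true` returns bg/Cyrl/BG, which agrees with the result for und_BG.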

comment:7 Changed 22 months ago by mark

As it turns out, we'd have to add 5-6 lines of data; it can't be completely fixed in code (without some hacks that would be equivalent to adding the data).

I wrote a comprehensive test, and tweaked the code and spec to minimize the failures. We're too late to change the data, but the (small number of) exceptions are at least documented (we should be able to fix in the next dot release by adding a few items of data).


Last edited 22 months ago by mark

comment:8 Changed 22 months ago by markus

  • Status changed from assigned to closed
  • Resolution set to fixed
