CLDR Ticket #8169 (closed: fixed)

Opened 4 years ago

Last modified 4 years ago

Update the languageMatching element with parentLocales

Reported by: mark
Owned by: mark
Component: xxx-tools
Data Locale:
Phase: rc
Review: emmons
Weeks:
Data Xpath:



We've modified the parentLocales data, and should make sure that those changes are reflected in the languageMatching element (which is in the file called languageInfo.xml). Also, check over the matching data, since some values are (incorrectly) the inverse of the distance.

In particular I suggest we do it this way:

  1. parent relation costs 1 unit: eg distance(es-AR, es-419) = 1
  2. en-GB is treated like en-001, so distance(en-GB, en-AU) = 1
  3. other siblings (same parent) are 2: distance(es-AR, es-CO) = distance(es-AR, es-419) + distance(es-419, es-CO)
  4. other regional differences are 4.
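
The four rules above could be sketched roughly as follows. This is only an illustration: the PARENTS and STAND_IN tables are hypothetical stand-ins rather than the real parentLocales data, the function name region_distance is made up, and the sketch assumes both locales share a language and that en-AU's parent is en-001.

```python
# Hypothetical parent table for illustration only; not the real parentLocales data.
PARENTS = {
    "es-AR": "es-419",
    "es-CO": "es-419",
    "es-419": "es",
    "es-ES": "es",
    "en-AU": "en-001",
    "en-GB": "en-001",
    "en-001": "en",
    "en-US": "en",
}

# Rule 2 (assumed reading): en-GB stands in for en-001 when compared to en-001's children.
STAND_IN = {"en-GB": "en-001"}


def region_distance(a: str, b: str) -> int:
    """Distance between two locales of the same language, per rules 1-4."""
    if a == b:
        return 0
    for x, y in ((a, b), (b, a)):
        parent_of_y = PARENTS.get(y)
        if parent_of_y is None:
            continue
        # Rule 1: x is the parent of y.
        # Rule 2: x is treated like the parent of y.
        if parent_of_y == x or parent_of_y == STAND_IN.get(x):
            return 1
    # Rule 3: siblings with a common parent cost 1 + 1.
    parent_a, parent_b = PARENTS.get(a), PARENTS.get(b)
    if parent_a is not None and parent_a == parent_b:
        return 2
    # Rule 4: any other regional difference.
    return 4


if __name__ == "__main__":
    for pair in [("es-AR", "es-419"), ("en-GB", "en-AU"),
                 ("es-AR", "es-CO"), ("es-AR", "es-ES")]:
        print(pair, region_distance(*pair))
```

A test along the lines suggested in the ticket could then walk the parent table and check that each languageMatching entry agrees with the value this function returns.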

The current data is old, and only has a few of the parent relations.

We should also set up a test to make sure that the data properly matches.


Change History

comment:1 Changed 4 years ago by emmons

  • Status changed from new to assigned
  • Component changed from unknown to tools
  • Priority changed from assess to medium
  • Phase changed from dsub to rc
  • Milestone changed from UNSCH to 27
  • Owner changed from anybody to mark

comment:2 Changed 4 years ago by mark

  • Keywords working added

comment:3 Changed 4 years ago by mark

  • Xref set to 7092

comment:4 Changed 4 years ago by mark

  • Status changed from assigned to reviewing
  • Review set to emmons

This partially fixes the problem, but it couldn't be completed. Filed another bug for that.


comment:5 Changed 4 years ago by emmons

  • Status changed from reviewing to accepted

I put in http://unicode.org/cldr/trac/changeset/11292 to get the build running again by dodging the failing unit test until you get a chance to figure out what is going on.

comment:6 Changed 4 years ago by mark

  • Status changed from accepted to reviewing

Figured out what it was; a problem in the test. Fixed and committed.

comment:7 Changed 4 years ago by mark

When cleaning up the language matching info, I found that the weights were a bit of a muddle, following two different paradigms.

  • "distance" (0 is exact match, 100 is worst)
  • "match" (100 is exact match, 0 is worst).

As far as code goes, the "distance" metric is easier to deal with, so I went with that. It is also much easier to work with when changing the data file, since most of the values are small. It is easy to see that a distance of 2 is 3x closer than a distance of 6; not as clear with 98 and 94.

The code is also simpler: distances add directly as X + Y (pinned to 0..100), whereas matches, taken as fractions of 1, combine as 1 - ((1-X) + (1-Y)) = 1 - (2-X-Y) = X+Y-1. On the 0..100 scale that is X + Y - 100, so matchPlus(98, 94) = 92.
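
To make the arithmetic concrete, here is a minimal sketch of the two ways of combining values on the 0..100 scale. The function names (add_distance, add_match, to_match) are invented for illustration and are not the names used in the CLDR or ICU code.

```python
def add_distance(x: int, y: int) -> int:
    """Combine two distances (0 = exact match), pinned to 0..100."""
    return min(100, max(0, x + y))


def add_match(x: int, y: int) -> int:
    """Combine two match scores (100 = exact match), pinned to 0..100."""
    return min(100, max(0, x + y - 100))


def to_match(distance: int) -> int:
    """Convert a distance into the equivalent match score (and vice versa)."""
    return 100 - distance


if __name__ == "__main__":
    # Distances 2 and 6 correspond to match scores 98 and 94.
    print(add_distance(2, 6))            # 8
    print(add_match(98, 94))             # 92
    print(to_match(add_distance(2, 6)))  # 92, same result either way
```

Both routes give the same answer, so the choice between them is about readability of the code and data rather than behavior.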

(For ICU this is a hidden internal.)

However, what we document is the "match" style. So, much as I would like to see a distance instead, I think I should take the inverse (100 - x) for all of the data (and change the tests appropriately; the test change is centralized).


comment:8 Changed 4 years ago by mark

Committee agreed to change to "match" style.

comment:9 Changed 4 years ago by emmons

  • Status changed from reviewing to closed
  • Resolution set to fixed
