Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #6783 (accepted data)

Opened 4 years ago

Last modified 4 days ago

Add digits to exemplars

Reported by: mark
Owned by: mark
Component: main
Data Locale:
Phase: dsub
Review:
Weeks:
Data Xpath:
Xref:

Description

It would be useful to also know which digits are in customary modern use in each locale. Right now that information can only be approximated from other data, and the process is clumsy. Moreover, we don't use that approximation in our tests, so we can't check for mismatches.

Here is an approximation:

  1. Get the defaultNumberingSystem from the locale.
  2. Look up that numbering system in the supplemental data.
  3. If it has digits, add them to the main exemplars.
  4. Get all the <symbols numberSystem="xxx"> elements.
  5. For each xxx that is not the default, get its digits as above and add them to the aux exemplars.

This is only approximate, because some of the non-default numbering systems may be in common customary use, and so belong in main instead of aux.
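
The approximation steps above can be sketched roughly as follows. This is only an illustration, not CLDR tooling: the function name, the `NUMBERING_SYSTEM_DIGITS` table, and the way locale data is passed in are all assumptions made for the example, and a real modify pass would read these values from the CLDR XML files.

```python
# Hypothetical in-memory stand-in for the supplemental numbering system data.
# Only systems with explicit digits are listed; algorithmic systems have none.
NUMBERING_SYSTEM_DIGITS = {
    "latn": "0123456789",
    "arab": "".join(chr(0x0660 + i) for i in range(10)),     # Arabic-Indic digits
    "arabext": "".join(chr(0x06F0 + i) for i in range(10)),  # Extended Arabic-Indic digits
}


def approximate_digit_exemplars(default_system, symbol_systems, digits_by_system):
    """Return (main_digits, aux_digits) as sets, following steps 1-5.

    default_system   -- the locale's defaultNumberingSystem (step 1)
    symbol_systems   -- numbering systems that have a <symbols> element (step 4)
    digits_by_system -- mapping from numbering system to its digit string (step 2)
    """
    # Steps 2-3: digits of the default system (if it has any) go to main.
    main = set(digits_by_system.get(default_system, ""))

    # Steps 4-5: digits of every other system with symbols go to aux.
    aux = set()
    for system in symbol_systems:
        if system != default_system:
            aux |= set(digits_by_system.get(system, ""))

    # A digit already in main does not need to be repeated in aux.
    return main, aux - main
```

For a locale whose default system is `arab` and which also has symbols for `latn`, this puts the Arabic-Indic digits in main and the ASCII digits in aux.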

I suggest writing a CLDR modify pass to do the above, adding the digits to the exemplars as specified. Then, in the next Survey Tool phase, ask translators to look at the digits in the main and aux exemplars, and file a ticket if changes need to be made.

FYI, based on current data, the following digits would go in main. BTW, it was surprising to me that Urdu in India uses extended Arabic-Indic digits (arabext), while in Pakistan it uses Latin digits (latn)...

locale	exemplars	numbering system
ar_DZ	[0-9]	latn
ar_EH	[0-9]	latn
ar_LY	[0-9]	latn
ar_MA	[0-9]	latn
ar_TN	[0-9]	latn
ur	[0-9]	latn
ar	[٠-٩]	arab
fa	[۰-۹]	arabext
ks	[۰-۹]	arabext
pa_Arab	[۰-۹]	arabext
ps	[۰-۹]	arabext
ur_IN	[۰-۹]	arabext
uz_Arab	[۰-۹]	arabext
as	[০-৯]	beng
bn	[০-৯]	beng
mr	[०-९]	deva
ne	[०-९]	deva
my	[၀-၉]	mymr
dz	[༠-༩]	tibt
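
The exemplar column above uses simple two-endpoint ranges. As a rough sanity check, such a range can be expanded and verified to contain exactly the ten decimal digits 0 through 9 of one script. The sketch below handles only the literal `[a-b]` form shown in the table, not full UnicodeSet syntax, and both function names are made up for this example.

```python
import unicodedata


def expand_digit_range(pattern):
    """Expand a simple '[a-b]' pattern into the list of characters from a to b."""
    if (len(pattern) != 5 or pattern[0] != "[" or pattern[2] != "-"
            or pattern[4] != "]"):
        raise ValueError(f"unsupported pattern: {pattern!r}")
    first, last = ord(pattern[1]), ord(pattern[3])
    return [chr(cp) for cp in range(first, last + 1)]


def is_decimal_digit_run(chars):
    """True if chars are ten characters whose decimal values are 0..9 in order."""
    # unicodedata.decimal returns the default (-1) for non-digit characters.
    return [unicodedata.decimal(c, -1) for c in chars] == list(range(10))
```

For example, `is_decimal_digit_run(expand_digit_range("[0-9]"))` evaluates to `True`.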

We can then augment our tests to check that:
a) the digits in the text of a locale are in main or aux;
b) the digits in main are in the default numbering system;
c) the digits in all other numbering systems are in either main or aux.
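
A minimal sketch of what checks a) to c) might look like, under the assumption that the exemplar sets and numbering system digits are already available as plain Python sets. The function name and parameters are hypothetical, and b) is read here as "every digit in main belongs to the default numbering system", which is only one possible interpretation.

```python
import unicodedata


def check_digit_exemplars(text, main, aux, default_digits, other_system_digits):
    """Return a list of problem descriptions; an empty list means all checks pass.

    text                -- sample text from the locale's data
    main, aux           -- sets of characters in the main and aux exemplars
    default_digits      -- set of digits of the default numbering system
    other_system_digits -- dict: non-default numbering system -> set of its digits
    """
    problems = []
    allowed = main | aux

    # (a) every decimal digit (category Nd) used in the text is in main or aux
    for ch in sorted(set(text)):
        if unicodedata.category(ch) == "Nd" and ch not in allowed:
            problems.append(f"(a) digit {ch!r} in text is not in main or aux")

    # (b) every decimal digit in main belongs to the default numbering system
    for ch in sorted(main):
        if unicodedata.category(ch) == "Nd" and ch not in default_digits:
            problems.append(f"(b) digit {ch!r} in main is not in the default system")

    # (c) the digits of every other numbering system are in main or aux
    for system, digits in sorted(other_system_digits.items()):
        missing = set(digits) - allowed
        if missing:
            problems.append(f"(c) {len(missing)} digit(s) of {system} not in main or aux")

    return problems
```

Returning a list of messages rather than raising on the first failure makes it possible to report every mismatch for a locale in one test run.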

Attachments

TestExemplarSet.java (1.2 KB) - added by mihnita@… 4 days ago.

Change History

comment:1 Changed 4 years ago by mark

Agreed to add, but with a different type: type="numbers".

A test is required.

A spec change is needed to describe the new type, its usage, and its relation to numberSystems.

Last edited 4 years ago by mark

comment:2 Changed 4 years ago by mark

  • Status changed from new to assigned
  • Component changed from unknown to data
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 25rc
  • Owner changed from anybody to mark
  • Type changed from unknown to enhancement

comment:3 Changed 3 years ago by mark

  • Milestone changed from 25rc to 26dsub

This needs to be added before a data submission phase, so people can double-check.

comment:4 Changed 3 years ago by mark

  • Milestone changed from 26dsub to 27

comment:5 Changed 3 years ago by markus

  • Phase set to final

comment:6 Changed 2 years ago by mark

  • Phase changed from final to dsub
  • Milestone changed from 27 to 28

comment:7 Changed 2 years ago by markus

  • Type changed from enhancement to data

comment:8 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:9 Changed 21 months ago by mark

  • Milestone changed from 28 to 29

comment:10 Changed 20 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming

comment:11 Changed 20 months ago by mark

  • Milestone changed from upcoming to 29

comment:12 Changed 16 months ago by mark

  • Milestone changed from 29 to 30

comment:13 Changed 13 months ago by mark

  • Phase changed from dsub to rc

comment:14 Changed 9 months ago by mark

  • Milestone changed from 30 to 31

comment:15 Changed 4 months ago by mark

  • Phase changed from rc to dsub
  • Priority changed from medium to major
  • Milestone changed from 31 to 32

Bumping again. Requires new structure, and we're past the deadline for that.

Priority is higher, since more people are using the data (e.g., for caption checking).

comment:16 Changed 4 days ago by mihnita@…

When using the current API for "en-US" to get all possible exemplar characters, the following are left out, even from the ASCII range (code attached):

" $%+0123456789<=>\_`{|}~\u007F"

In general, I would expect that if there is an API to get the characters used by a locale, it would really include all characters used by that locale, including space, digits, %, its own currency symbol, etc.
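
The kind of gap check described above can be sketched as a simple set difference over the ASCII range. This is not the attached TestExemplarSet.java and does not call the real exemplar API; the function name is made up, and the caller is assumed to supply the union of whatever exemplar sets the API returns.

```python
def uncovered_ascii(covered):
    """Return the characters U+0020..U+007F that are not in `covered`.

    covered -- any container of single characters, for example the union of
               all exemplar sets reported for a locale
    The result is a string in code point order.
    """
    return "".join(chr(cp) for cp in range(0x20, 0x80) if chr(cp) not in covered)
```

Passing in a set that contains only the ASCII letters, for instance, would report space, the digits, and all ASCII punctuation and symbols as uncovered.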
