
CLDR Ticket #8223 (accepted datatest)

Opened 2 years ago

Last modified 22 months ago

Clean up CheckForExemplars

Reported by: mark
Owned by: mark
Component: main
Data Locale:
Phase: rc
Review:
Weeks:
Data Xpath:
Xref: ticket:7953

Description

Broken off from ticket:7953

The current check is overly broad: it allows essentially any non-letter in fields. Now that we have the punctuation exemplars, we can make the check much stricter, which will help us catch problems.

We should change the code to:

  1. Add the punctuation characters.
  2. Add the default numbering system's digits.
  3. Change AlwaysOK to just be [\u0020 \u00A0] (space and no-break space).
  4. Replace the ¤ in currency patterns by the STAND_IN.
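
The four steps above can be sketched roughly as follows. This is an illustration in Python, not the actual CheckForExemplars code: the function names are hypothetical, the value chosen for STAND_IN is an assumption (the ticket does not say what it is), and case folding is ignored for brevity.

```python
# Step 3: the only characters that are unconditionally allowed.
ALWAYS_OK = {"\u0020", "\u00A0"}  # space and no-break space

# Assumption: any character already in ALWAYS_OK can act as the placeholder.
STAND_IN = "\u0020"


def allowed_set(main_exemplars, punctuation_exemplars, digits):
    """Steps 1-3: main exemplars + punctuation exemplars + digits + ALWAYS_OK."""
    return (set(main_exemplars) | set(punctuation_exemplars)
            | set(digits) | ALWAYS_OK)


def missing_characters(value, allowed, is_currency_pattern=False):
    """Return the characters of a field value that are not in the allowed set."""
    if is_currency_pattern:
        # Step 4: the currency sign is a pattern placeholder, not real text.
        value = value.replace("\u00A4", STAND_IN)
    return {ch for ch in value if ch not in allowed}


allowed = allowed_set("abc", ".,", "0123456789")
print(missing_characters("a+b", allowed))  # the plus sign is reported
```

With a check of this shape, anything outside the exemplars, the digits, and the two space characters is reported, rather than every non-letter being let through.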

After making these changes, here are some sample results:

# ka [Georgian] Summary modern Total missing from general exemplars: [’ % ‰ + × ∞]
# ne [Nepali] Summary modern Total missing from general exemplars: [– … ‘ ’ “ ” / % ‰ + × ∞ 0 1 6 8 a i n-p s]
# pl [Polish] Summary modern Total missing from general exemplars: [’ ‰ + × ∞ á ã í ú]
# zh_Hant [Traditional Chinese] Summary modern Total missing from general exemplars: [+ × ∞ c-e g i n o 穀 綽 蟄 霜]

We can reduce these by adding the characters in the following paths (for the locale) to the punctuation characters. Presumably these have been carefully vetted already.

Numbers|Symbols|Symbols_|plusSign◂ 〈+〉
Numbers|Symbols|Symbols_|percentSign◂ 〈%〉
Numbers|Symbols|Symbols_|perMille◂ 〈‰〉 【】 〈‰〉
Numbers|Symbols|Symbols_|superscriptingExponent◂ 〈×〉
Numbers|Symbols|Symbols_|infinity◂ 〈∞〉 【】 〈∞〉
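
As a rough illustration of that reduction, the sketch below folds the locale's number symbols into the allowed set and recomputes what is still missing for the Georgian (ka) summary above. The dictionary layout and the function name are hypothetical; only the five symbol names and their values come from the paths listed here.

```python
# Symbol names taken from the paths listed above.
NUMBER_SYMBOL_KEYS = (
    "plusSign",
    "percentSign",
    "perMille",
    "superscriptingExponent",
    "infinity",
)


def extend_with_number_symbols(allowed, symbols):
    """Return the allowed set plus every character of the listed number symbols."""
    extra = set()
    for key in NUMBER_SYMBOL_KEYS:
        # A symbol value may be more than one character; add each one.
        extra.update(symbols.get(key, ""))
    return set(allowed) | extra


# Hypothetical in-memory view of the locale's symbol values.
symbols = {
    "plusSign": "+",
    "percentSign": "%",
    "perMille": "\u2030",
    "superscriptingExponent": "\u00D7",
    "infinity": "\u221E",
}

# Characters reported as missing for ka in the summary above.
ka_missing = set("\u2019%\u2030+\u00D7\u221E")
remaining = ka_missing - extend_with_number_symbols(set(), symbols)
print(remaining)  # only the right single quotation mark is left
```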

We might change missing letters in the same script to be errors instead of warnings. That would catch characters like 穀 in zh_Hant: either those should be in the exemplars, or they should be replaced in the fields where they occur.
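
One way to picture the error-versus-warning split is sketched below. Python's standard library has no script property, so the first word of the Unicode character name is used here as a crude proxy for the script; that proxy, and the function names, are assumptions made for illustration. A real implementation would rely on a proper script property such as the one ICU provides.

```python
import unicodedata


def crude_script(ch):
    """Rough script proxy: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_letter(ch):
    """True for any character in a Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def severity(ch, exemplars):
    """Error for a missing letter in a script the exemplars use, else warning."""
    exemplar_scripts = {crude_script(e) for e in exemplars if is_letter(e)}
    if is_letter(ch) and crude_script(ch) in exemplar_scripts:
        return "error"
    return "warning"


han_exemplars = "\u4e2d\u6587"  # tiny sample of Han exemplar characters
print(severity("\u7a40", han_exemplars))  # Han letter missing from a Han locale
print(severity("c", han_exemplars))       # Latin letter in a Han locale
```

Under this scheme a character like 穀 would be escalated for zh_Hant, while stray Latin letters and symbols would stay at warning level.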

The remaining characters listed above need to be carefully vetted by native speakers.

For example, either ã should be in the Polish exemplars, or it needs to be replaced by 'a' in the field. (It occurs in Timezones|Africa|Cities_and_Regions|Sao_Tome◂ 〈São Tomé〉)

Same with … in Nepali, etc.

For some, we might want to special-case certain path values, e.g.:
zh_Hant Locale_Display_Names|Keys|collation|collation-big5han◂ 〈繁體中文排序 - Big5〉

Or change them to use parentheses, since we already have logic for that, e.g. 〈繁體中文排序 (Big5)〉

Note: many warnings will show up at the comprehensive level, since there are ASCII characters in root: Japanese era names, "last quarter", cyclic fields, etc. I don't think we need to worry about these.

Change History

comment:1 Changed 2 years ago by mark

  • Owner changed from anybody to mark
  • Status changed from new to assigned

comment:2 Changed 2 years ago by mark

  • Component changed from data-main to test

comment:3 Changed 2 years ago by markus

  • Type set to unittest
  • Component changed from test to unknown

comment:4 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:5 Changed 2 years ago by emmons

  • Phase changed from dsub to rc
  • Type changed from unittest to datatest
  • Component changed from unknown to main

comment:6 Changed 2 years ago by mark

  • Milestone changed from 28 to 29

comment:7 Changed 22 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
