CLDR Ticket #8223 (accepted datatest)
Clean up CheckForExemplars
|Reported by:||mark||Owned by:||mark|
Broken off from ticket:7953
The check we do is overly broad, and allows essentially any non-letter in fields. Now that we have the punctuation exemplars, we can have a much cleaner check, which will help us catch problems.
We should change the code to:
- Add the punctuation characters.
- Add the default numbering system's digits.
- Change AlwaysOK to just be [
- Replace the ¤ in currency patterns by the STAND_IN.
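The steps above could be sketched roughly as follows. This is a minimal illustration using plain Python sets, not the actual CheckForExemplars code (which is Java and works on ICU UnicodeSets). The names `build_allowed`, `missing_from_exemplars`, and `stand_in` are hypothetical. The AlwaysOK value is cut off in this ticket, so it is left as a caller-supplied parameter; the space used in the example is only an assumption for the demo.

```python
CURRENCY_SIGN = "\u00A4"  # the ¤ placeholder used in currency patterns


def build_allowed(main_exemplars, punctuation_exemplars, digits, always_ok=""):
    """Union of everything a field value may contain.

    always_ok is caller-supplied: the ticket's intended value is truncated,
    so nothing is assumed about it here.
    """
    return (set(main_exemplars) | set(punctuation_exemplars)
            | set(digits) | set(always_ok))


def missing_from_exemplars(value, allowed, stand_in=None):
    """Return the characters of value that are not in the allowed set.

    If stand_in is given (hypothetical: some character already allowed),
    every ¤ is replaced by it before checking, as the ticket proposes for
    currency patterns.
    """
    if stand_in is not None:
        value = value.replace(CURRENCY_SIGN, stand_in)
    return {ch for ch in value if ch not in allowed}


# Toy data, not real locale exemplars. The space in always_ok is illustrative.
allowed = build_allowed("abc", ".,", "0123456789", always_ok=" ")
print(missing_from_exemplars("ab, 12 ×", allowed))
print(missing_from_exemplars("¤ 12", allowed, stand_in="a"))
```

With the broad "any non-letter is fine" rule gone, a character such as × is reported unless it appears in one of the contributing sets.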
After making these changes, the check reports results such as the following:
# ka [Georgian] Summary modern Total missing from general exemplars: [’ % ‰ + × ∞]
# ne [Nepali] Summary modern Total missing from general exemplars: [– … ‘ ’ “ ” / % ‰ + × ∞ 0 1 6 8 a i n-p s]
# pl [Polish] Summary modern Total missing from general exemplars: [’ ‰ + × ∞ á ã í ú]
# zh_Hant [Traditional Chinese] Summary modern Total missing from general exemplars: [+ × ∞ c-e g i n o 穀 綽 蟄 霜]
We can reduce these by adding the characters in the following paths (for the locale) to the punctuation characters. Presumably these have been carefully vetted already.
Numbers|Symbols|Symbols_|perMille◂ 〈‰〉 【】 〈‰〉
Numbers|Symbols|Symbols_|infinity◂ 〈∞〉 【】 〈∞〉
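As a rough sketch of that reduction, assuming the locale's symbol values are available as a simple mapping (the `symbols` dict and the function name `extend_punctuation` are hypothetical, and plain Python sets stand in for the real data structures):

```python
def extend_punctuation(punctuation, symbol_values):
    """Return punctuation plus every character used in the symbol values."""
    extended = set(punctuation)
    for value in symbol_values.values():
        extended.update(value)  # adds each character of the value
    return extended


# Values taken from the two paths listed above.
symbols = {"perMille": "‰", "infinity": "∞"}

# The characters reported for ka [Georgian] in the summary above.
reported = set("’%‰+×∞")

remaining = reported - extend_punctuation(set(), symbols)
print(sorted(remaining))
```

For the Georgian example this leaves ’ % + × still to be accounted for.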
We might change missing letters in the same script to be errors instead of warnings. That would catch characters like 穀 in zh_Hant: either those should be in the exemplars, or they should be replaced in the fields where they occur.
The use of the remaining characters listed above needs to be carefully vetted by native speakers.
For example, either ã should be in the Polish exemplars, or it needs to be replaced by 'a' in the field. (It occurs in Timezones|Africa|Cities_and_Regions|Sao_Tome◂ 〈São Tomé〉)
Same with … in Nepali, etc.
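The proposed error/warning split for missing letters might look something like the sketch below. It is only an approximation: Python's standard library does not expose the Unicode Script property, so the first word of the character name is used as a crude proxy (an assumption of this sketch; the Java tooling could use a real script lookup instead). The function names are hypothetical.

```python
import unicodedata


def script_proxy(ch):
    """Crude stand-in for the Script property: first word of the Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def is_letter(ch):
    """True for characters in any Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def severity(ch, exemplars):
    """'error' for a missing letter in the exemplars' script, else 'warning'."""
    if not is_letter(ch):
        return "warning"
    exemplar_scripts = {script_proxy(e) for e in exemplars if is_letter(e)}
    return "error" if script_proxy(ch) in exemplar_scripts else "warning"


# Toy exemplar strings, not the real locale data.
print(severity("ã", "aąbcć"))  # Latin letter vs. Latin exemplars
print(severity("a", "कख"))     # Latin letter vs. Devanagari exemplars
print(severity("穀", "繁體"))   # CJK ideograph vs. CJK exemplars
```

Under this rule, ã in Polish and 穀 in zh_Hant would become errors, while stray Latin letters in Nepali data would stay warnings.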
For some, we might want to special-case certain path values, e.g.:
zh_Hant Locale_Display_Names|Keys|collation|collation-big5han◂ 〈繁體中文排序 - Big5〉
Or change them to the parenthesized form, since we already have that logic, e.g. 〈繁體中文排序 (Big5)〉
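To illustrate the rewrite itself (this is not the existing parenthesizing logic mentioned above, just a hypothetical regex-based sketch with an invented function name):

```python
import re


def parenthesize_suffix(value):
    """Rewrite a trailing ' - X' qualifier as ' (X)'; leave other values alone."""
    return re.sub(r"\s+-\s+(.+)$", r" (\1)", value)


print(parenthesize_suffix("繁體中文排序 - Big5"))
```

The hyphen then no longer appears in the value, so it does not need to be special-cased, provided the parentheses themselves are allowed by the check.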
Note: many warnings will show up at the comprehensive level, since there are ASCII characters in root: Japanese era names, "last quarter", cyclic fields, etc. I don't think we need to worry about these.
- Owner changed from anybody to mark
- Status changed from new to assigned
- Phase changed from dsub to rc
- Type changed from unittest to datatest
- Component changed from unknown to main