[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #8407 (accepted data)

Opened 2 years ago

Last modified 5 weeks ago

Improve readability and maintainability of coverageLevels.xml

Reported by: mark
Owned by: mark
Component: supplemental
Data Locale:
Phase: dsub
Review:
Weeks:
Data Xpath:
Xref:

Description

  1. Long lists of items are hard to compare: only minute inspection shows whether two of them are exactly the same. Variables are much better for maintenance. In approvalRequirement, for example, replace strings like the following with variables.

"ar ca cs da de el es fi fr he hi hr hu it ja ko nb nl pl pt pt_PT ro ru sk sl sr sv th tr uk vi zh zh_Hant"


  2. Regular expressions that are purely lists of items should be expressed as lists rather than hand-optimized. Otherwise they are quite difficult to parse, review, and change without error. See:

value="(length-(picometer|light-year)|pressure-(hectopascal|inch-hg|millibar)|acceleration-g-force|angle-(degree|minute|second)|area-(acre|hectare|square-(foot|kilometer|meter|mile))|power-(horsepower|kilowatt|watt)|speed-meter-per-second|volume-cubic-(mile|kilometer))"

If we need to optimize them, we should have a separate internal method to do that. I'd recommend an alternate attribute like the following, which takes a space-delimited list.

list="length-picometer length-light-year..."

When the variable is encountered, the list can internally be optimized as a regex (if necessary). Such optimization can do a much better job than hand-optimization.
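
A minimal sketch of the internal step described above, assuming the proposed list attribute holds whitespace-separated literal items. The class and method names (ListToRegex, parse, toPattern) are hypothetical and not part of the CLDR tooling; the sketch builds only a plain alternation and leaves any further optimization to a separate step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the CLDR tooling.
final class ListToRegex {
    /** Splits a space-delimited list into its items, ignoring extra whitespace. */
    static List<String> parse(String list) {
        return Arrays.stream(list.trim().split("\\s+"))
                .filter(item -> !item.isEmpty())
                .collect(Collectors.toList());
    }

    /** Builds a plain alternation; each item is quoted so it is matched literally. */
    static Pattern toPattern(String list) {
        String alternation = parse(list).stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "(", ")"));
        return Pattern.compile(alternation);
    }
}
```

With this shape, a reviewer reads and edits only the flat list, and the regex form never appears in the data file.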


  3. The lists can be more self-documenting if we introduce some pre-set variables. For example, the list of all cldr organization languages can be fetched at startup instead of being written out in a variable (which can fall out of date).

I suggest %% syntax for those. Example:

%%cldr-modern

internally gets set at startup to:
StandardCodes.make().getLocaleCoverageLocales(Organization.cldr.name(), EnumSet.of(Level.MODERN));

We can do this for many cases. For example:

%%scripts
to get the non-private-use, non-deprecated script codes.

That would let us replace many of the variables used for modern coverage by a full, always-up-to-date list, such as replacing

<coverageVariable key="%script100" value="(Afak|Aghb|Ahom|Armi|Avst|Bali|Bamu|Bass|Batk|Blis|Brah|Bugi|Buhd|Cakm|Cans|Cari|Cham|Cher|Cirt|Copt|Cprt|Cyrs|Dsrt|Dupl|Egy[dhp]|Elba|Geok|Glag|Goth|Gran|Hatr|Hano|Hluw|Hmng|Hrkt|Hung|Inds|Ital|Java|Jurc|Kali|Khar|Khoj|Kpel|Kthi|Lana|Lat[fg]|Lepc|Limb|Lin[ab]|Lisu|Loma|Ly[cd]i|Mahj|Man[di]|Maya|Mend|Mer[co]|Modi|Moon|Mroo|Mtei|Mult|Narb|Nbat|Nkgb|Nkoo|Nshu|Ogam|Olck|Orkh|Osma|Palm|Pauc|Perm|Phag|Phl[ipv]|Phnx|Plrd|Prti|Rjng|Roro|Runr|Samr|Sar[ab]|Saur|Sgnw|Shaw|Shrd|Sidd|Sind|Sora|Sund|Sylo|Syr[cejn]|Tagb|Takr|Tal[eu]|Tang|Tavt|Teng|Tfng|Tglg|Tirh|Ugar|Vaii|Visp|Wara|Wole|Xpeo|Xsux|Yiii|Zinh|Zmth)"/>

by

<coverageVariable key="%script100" value="%%scripts"/>
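
One way the %% lookup could work is a small registry filled at startup, sketched below. Everything here is an assumption for illustration: the PresetVariables class, its methods, and the rule that a value is resolved only when it is exactly a %%name reference. The suppliers stand in for real data sources such as the StandardCodes call quoted above, which this sketch does not invoke.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry for "%%name" variables; not part of the CLDR tooling.
final class PresetVariables {
    private final Map<String, Supplier<List<String>>> suppliers = new LinkedHashMap<>();

    /** Registers a supplier that produces the current items for a %%name. */
    void register(String name, Supplier<List<String>> supplier) {
        suppliers.put(name, supplier);
    }

    /** Expands a value that is exactly a registered %%name; other values pass through. */
    String resolve(String value) {
        Supplier<List<String>> supplier = suppliers.get(value);
        if (supplier != null) {
            return "(" + String.join("|", supplier.get()) + ")";
        }
        if (value.startsWith("%%")) {
            // Fail loudly rather than silently matching the literal text.
            throw new IllegalArgumentException("Unknown preset variable: " + value);
        }
        return value;
    }
}
```

Failing on an unknown %%name is a design choice in this sketch: a misspelled reference would otherwise become a literal that matches nothing.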


  4. This is probably for later on, but for many attributes we know exactly what the possible values are from the DtdData or supplementalMetadata.xml, so we can populate variables with those values. We could automatically set variables like:

%%ldml_day_type

instead of manually setting

<coverageVariable key="%dayTypes" value="(sun|mon|tue|wed|thu|fri|sat)"/>
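
A rough sketch of how such variables could be generated. The naming pattern (DTD type, element, attribute joined by underscores) is inferred from the single example %%ldml_day_type and is an assumption, as are the DtdVariables class and its methods. The allowed values are passed in as a plain list; reading them from DtdData or supplementalMetadata.xml is left out.

```java
import java.util.List;

// Hypothetical helper, not part of the CLDR tooling.
final class DtdVariables {
    /** Assumed naming pattern: %%<dtdType>_<element>_<attribute>. */
    static String name(String dtdType, String element, String attribute) {
        return "%%" + dtdType + "_" + element + "_" + attribute;
    }

    /** Formats the allowed attribute values as a coverage-variable alternation. */
    static String value(List<String> allowedValues) {
        return "(" + String.join("|", allowedValues) + ")";
    }
}
```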

Change History

comment:1 Changed 2 years ago by emmons

  • Status changed from new to accepted
  • Component changed from unknown to supplemental
  • Priority changed from assess to medium
  • Milestone changed from UNSCH to 29
  • Owner changed from anybody to mark
  • Type changed from unknown to data

comment:2 Changed 2 years ago by emmons

  • Milestone changed from 29 to upcoming

comment:3 Changed 5 weeks ago by mark

Note: the current formulation for the coverage variables is also not optimal:

Current

aa|ace|ad[ay]|ain|al[et]|anp?|arp|ast|av|awa|ay|ba[ns]|bho|bin?|bla|bug|byn|ceb|ch[kmoy]?|crs|cu|da[kr]|dgr|dv|dzg|efi|eka|ewo|ff|fon|fur|gaa|gd|gez|gil|gor|gwi|hil|hmn|hup|hz|ia|ib[ab]|ilo|inh|io|jbo|ka[cj]|kbd|kcg|kfo|kha|kj|kkj|kmb|kpe|kr[clu]?|ksh|kum|kv|lad|lez|li|loz|lu[ans]|ma[dgik]|mdf|men|mh|mi[cn]|mni|mos|mu[ls]|mwl|myv

Unflattened

aa|ace|ada|ady|ain|ale|alt|an|anp|arp|ast|av|awa|ay|ban|bas|bho|bi|bin|bla|bug|byn|ceb|ch|chk|chm|cho|chy|crs|cu|dak|dar|dgr|dv|dzg|efi|eka|ewo|ff|fon|fur|gaa|gd|gez|gil|gor|gwi|hil|hmn|hup|hz|ia|iba|ibb|ilo|inh|io|jbo|kac|kaj|kbd|kcg|kfo|kha|kj|kkj|kmb|kpe|kr|krc|krl|kru|ksh|kum|kv|lad|lez|li|loz|lua|lun|lus|mad|mag|mai|mak|mdf|men|mh|mic|min|mni|mos|mul|mus|mwl|myv
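
The step from the Current form to the Unflattened form can be sketched as below. This assumes the expression uses only what appears in the Current example: top-level alternation, literal characters, simple character classes such as [ay], and an optional marker ? after a character or class. The RegexUnflattener class is hypothetical; groups, ranges, and escapes are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the CLDR tooling.
final class RegexUnflattener {
    /** Expands every alternative into the full list of strings it matches. */
    static List<String> unflatten(String regex) {
        List<String> result = new ArrayList<>();
        for (String alternative : regex.split("\\|")) {
            List<String> partial = new ArrayList<>();
            partial.add("");
            int i = 0;
            while (i < alternative.length()) {
                // Read one atom: a character class or a single literal character.
                String choices;
                if (alternative.charAt(i) == '[') {
                    int close = alternative.indexOf(']', i);
                    if (close < 0) {
                        throw new IllegalArgumentException("Unclosed class in: " + alternative);
                    }
                    choices = alternative.substring(i + 1, close);
                    i = close + 1;
                } else {
                    choices = String.valueOf(alternative.charAt(i));
                    i++;
                }
                boolean optional = i < alternative.length() && alternative.charAt(i) == '?';
                if (optional) {
                    i++;
                }
                // Extend every string built so far with each choice for this atom.
                List<String> next = new ArrayList<>();
                for (String prefix : partial) {
                    if (optional) {
                        next.add(prefix);
                    }
                    for (char c : choices.toCharArray()) {
                        next.add(prefix + c);
                    }
                }
                partial = next;
            }
            result.addAll(partial);
        }
        return result;
    }
}
```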

Optimized

a([avy]|ce|d[ay]|in|l[et]|np?|rp|st|wa)|b(a[ns]|ho|in?|la|ug|yn)|c(eb|h[kmoy]?|rs|u)|d(a[kr]|gr|v|zg)|e(fi|ka|wo)|f(f|on|ur)|g(aa|d|ez|il|or|wi)|h(il|mn|up|z)|i([ao]|b[ab]|lo|nh)|jbo|k([jv]|a[cj]|bd|cg|fo|ha|kj|mb|pe|r[clu]?|sh|um)|l(ad|ez|i|oz|u[ans])|m(a[dgik]|df|en|h|i[cn]|ni|os|u[ls]|wl|yv)

The optimized expression groups as much as possible, while needing no backup (backtracking): matching only goes forwards through the alternations.
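
A sketch of one way to produce such an expression automatically from the unflattened list: build a prefix tree of the items and emit one alternative per distinct next character, so no two alternatives at the same level can start the same way. This is an assumed technique, not necessarily how the Optimized expression above was generated, and AlternationOptimizer is a hypothetical name. Unlike the expression above, this sketch emits non-capturing groups and does not merge single characters into classes such as [avy]. It assumes non-empty items made of simple characters.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the CLDR tooling.
final class AlternationOptimizer {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean terminal; // true if an item ends at this node
    }

    /** Builds a prefix tree from the items and renders it as a regex. */
    static String optimize(List<String> items) {
        Node root = new Node();
        for (String item : items) {
            Node node = root;
            for (char c : item.toCharArray()) {
                node = node.children.computeIfAbsent(c, key -> new Node());
            }
            node.terminal = true;
        }
        return emit(root);
    }

    /** Renders the suffixes below a node; sibling alternatives start with distinct characters. */
    private static String emit(Node node) {
        if (node.children.isEmpty()) {
            return "";
        }
        StringJoiner alternatives = new StringJoiner("|");
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            alternatives.add(quote(entry.getKey()) + emit(entry.getValue()));
        }
        if (node.children.size() == 1 && !node.terminal) {
            return alternatives.toString(); // a single required continuation needs no group
        }
        // An item ending here makes the continuation optional.
        return "(?:" + alternatives + ")" + (node.terminal ? "?" : "");
    }

    private static String quote(char c) {
        return Character.isLetterOrDigit(c) ? String.valueOf(c) : Pattern.quote(String.valueOf(c));
    }
}
```

Because the output is generated, it can be checked mechanically against the flat list instead of being reviewed by eye.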
