CLDR Ticket #8879 (accepted data)

Opened 2 years ago

Last modified 22 months ago

Consider a wildcard syntax for subdivision validity

Reported by: doug@…
Owned by: mark
Component: supplemental
Data Locale: en
Phase: rc
Review:
Weeks:
Data Xpath:
Xref:

Description

The region validity file uses a convenient "range" syntax to enumerate valid regions:

<id type='region' idStatus='regular'>
	AC~G AI AL~M AO AQ~U AW~X AZ
	BA~B BD~J BL~O BQ~T BV~W BY~Z
	CA CC~D CF~I CK~P CR CU~Z
	...

But in the subdivision validity file, the "special" codes (ZZZZ, unknown or invalid subdivision) can only be expressed by repeating each valid region code explicitly:

<id type='subdivision' idStatus='special'>
	AC-ZZZZ AD-ZZZZ AE-ZZZZ AF-ZZZZ AG-ZZZZ AI-ZZZZ AL-ZZZZ AM-ZZZZ AO-ZZZZ AQ-ZZZZ AR-ZZZZ AS-ZZZZ AT-ZZZZ AU-ZZZZ AW-ZZZZ AX-ZZZZ AZ-ZZZZ
	BA-ZZZZ BB-ZZZZ BD-ZZZZ BE-ZZZZ BF-ZZZZ BG-ZZZZ BH-ZZZZ BI-ZZZZ BJ-ZZZZ BL-ZZZZ BM-ZZZZ BN-ZZZZ BO-ZZZZ BQ-ZZZZ BR-ZZZZ BS-ZZZZ BT-ZZZZ BV-ZZZZ BW-ZZZZ BY-ZZZZ BZ-ZZZZ
	CA-ZZZZ CC-ZZZZ CD-ZZZZ CF-ZZZZ CG-ZZZZ CH-ZZZZ CI-ZZZZ CK-ZZZZ CL-ZZZZ CM-ZZZZ CN-ZZZZ CO-ZZZZ CP-ZZZZ CR-ZZZZ CU-ZZZZ CV-ZZZZ CW-ZZZZ CX-ZZZZ CY-ZZZZ CZ-ZZZZ
	...

This defeats the purpose of the concise range syntax, and introduces a risk of inconsistent data if either file is modified such that the two no longer match.
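The mismatch risk can be made concrete with a small check. This is only a sketch, assuming both lists have already been expanded into plain Python lists; the function name `region_mismatches` and the `AJ-ZZZZ` entry are hypothetical and exist only to show a mismatch being reported.

```python
def region_mismatches(valid_regions, special_subdivisions):
    """Compare the region list against the region prefixes of the ZZZZ codes.

    Returns two sorted lists: regions with no XX-ZZZZ entry, and
    XX-ZZZZ prefixes that are not in the region list.
    """
    regions = set(valid_regions)
    prefixes = {code.split("-", 1)[0] for code in special_subdivisions}
    return sorted(regions - prefixes), sorted(prefixes - regions)


# Sample subset of the data quoted above; AJ-ZZZZ is a made-up entry
# standing in for a hand-editing mistake.
regions = ["AC", "AD", "AE", "AF", "AG", "AI"]
special = ["AC-ZZZZ", "AD-ZZZZ", "AE-ZZZZ", "AF-ZZZZ", "AG-ZZZZ", "AJ-ZZZZ"]

missing, extra = region_mismatches(regions, special)
print("regions without a ZZZZ code:", missing)
print("ZZZZ codes without a region:", extra)
```

With the two lists kept separately, a check like this would have to run whenever either file changes.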

A modified range syntax involving a wildcard might solve this problem:

<id type='subdivision' idStatus='special'>
	*-ZZZZ
</id>

where * is defined to represent any valid region (so ZZ-ZZZZ would still not be included).
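A minimal sketch of how a reader of the file might expand the proposed wildcard, assuming the list of valid regions has already been loaded from the region validity file. The wildcard itself is only a proposal in this ticket, and `expand_wildcard` is a hypothetical helper, not an existing CLDR function.

```python
def expand_wildcard(token, valid_regions):
    """Expand '*-SUFFIX' to one code per valid region.

    Tokens that do not start with '*-' are returned unchanged.
    """
    if token.startswith("*-"):
        suffix = token[1:]  # keeps the leading hyphen, e.g. "-ZZZZ"
        return [region + suffix for region in valid_regions]
    return [token]


# Sample subset of the regular regions quoted above. ZZ is not in the
# list, so ZZ-ZZZZ is never generated.
regular_regions = ["AC", "AD", "AE", "AF", "AG", "AI"]

expanded = expand_wildcard("*-ZZZZ", regular_regions)
print(expanded)
```

Because the expansion is driven by the region list, the subdivision file would no longer need to repeat it.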

Change History

comment:1 Changed 2 years ago by mark

I agree that it would be nicer to have a shorter set. The syntax permits ranges at any level, so we could collapse the list as in the following example. That avoids the need for special syntax for *.

AC-ZZZZ AD-ZZZZ AE-ZZZZ AF-ZZZZ AG-ZZZZ
to
AC-ZZZZ~G-ZZZZ

comment:2 Changed 2 years ago by doug@…

I'm not sure about the suggested syntax. I understand what AC-ZZZZ~G-ZZZZ is meant to imply, but it looks like e.g. AG-AAAA is included within the range. What would it mean if the subdivision parts were different? (e.g. AC-YYYY~G-ZZZZ) Alternatively, if the two can't be different, why state them twice?

I'd suggest AC~G-ZZZZ instead, to mean "regions AC through AG, subdivision ZZZZ only."

This still doesn't solve the problem of maintaining the valid-regions data in both the region and subdivision files.

comment:3 Changed 2 years ago by emmons

  • Status changed from new to accepted
  • Component changed from unknown to supplemental
  • Priority changed from assess to medium
  • Phase changed from dsub to rc
  • Milestone changed from UNSCH to 29
  • Owner changed from anybody to mark
  • Type changed from unknown to data

comment:4 Changed 2 years ago by mark

We defined http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#String_Range to have no special knowledge of the type of data involved.
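For readers following the thread, here is a sketch of that type-agnostic interpretation as I understand it from the linked section: for X~Y, split X into a prefix P and a suffix S of the same length as Y, then combine the per-position ranges from each code point of S to the matching code point of Y. The function name `expand_string_range` is hypothetical, and rejecting a descending position is my assumption rather than something stated in this ticket.

```python
from itertools import product


def expand_string_range(token, sep="~"):
    """Expand one 'X~Y' token into the list of strings it denotes."""
    if sep not in token:
        return [token]
    x, y = token.split(sep, 1)
    if not (len(x) >= len(y) > 0):
        raise ValueError("need len(X) >= len(Y) > 0: %r" % token)
    split_at = len(x) - len(y)
    prefix, suffix = x[:split_at], x[split_at:]
    choices = []
    for start, end in zip(suffix, y):
        if ord(start) > ord(end):
            # Assumption: treat a descending position as an error.
            raise ValueError("descending range in %r" % token)
        choices.append([chr(c) for c in range(ord(start), ord(end) + 1)])
    # Every combination of the per-position choices, in order.
    return [prefix + "".join(combo) for combo in product(*choices)]


print(expand_string_range("AC~G"))
print(expand_string_range("AC-ZZZZ~G-ZZZZ"))
print(len(expand_string_range("AC-YYYY~G-ZZZZ")))
```

Under this reading, AC-ZZZZ~G-ZZZZ yields only the five ZZZZ codes, because every position after the region letter has the same start and end. A range such as AC-YYYY~G-ZZZZ would vary each Y..Z position independently.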

comment:5 Changed 2 years ago by doug@…

OK, I understand the syntax now. Please disregard comment 2, except for the last sentence about maintaining the same data in two places. (And disregard that too if you don't foresee a problem.)

comment:6 Changed 2 years ago by mark

Code has been added to StringRange to do the further compaction. However, since we have a release candidate, I am not changing the current generation to use it. That can be done at the start of v29.

Sample output; this reduces the size from 2047 characters to 1302 characters.

AC-ZZZZ~G-ZZZZ AI-ZZZZ AL-ZZZZ~M-ZZZZ AO-ZZZZ AQ-ZZZZ~U-ZZZZ AW-ZZZZ~X-ZZZZ AZ-ZZZZ BA-ZZZZ~B-ZZZZ BD-ZZZZ~J-ZZZZ BL-ZZZZ~O-ZZZZ BQ-ZZZZ~T-ZZZZ BV-ZZZZ~W-ZZZZ BY-ZZZZ~Z-ZZZZ CA-ZZZZ CC-ZZZZ~D-ZZZZ CF-ZZZZ~I-ZZZZ CK-ZZZZ~P-ZZZZ CR-ZZZZ CU-ZZZZ~Z-ZZZZ DE-ZZZZ DG-ZZZZ DJ-ZZZZ~K-ZZZZ DM-ZZZZ DO-ZZZZ DZ-ZZZZ EA-ZZZZ EC-ZZZZ EE-ZZZZ EG-ZZZZ~H-ZZZZ ER-ZZZZ~T-ZZZZ FI-ZZZZ~K-ZZZZ FM-ZZZZ FO-ZZZZ FR-ZZZZ GA-ZZZZ~B-ZZZZ GD-ZZZZ~I-ZZZZ GL-ZZZZ~N-ZZZZ GP-ZZZZ~U-ZZZZ GW-ZZZZ GY-ZZZZ HK-ZZZZ HM-ZZZZ~N-ZZZZ HR-ZZZZ HT-ZZZZ~U-ZZZZ IC-ZZZZ~E-ZZZZ IL-ZZZZ~O-ZZZZ IQ-ZZZZ~T-ZZZZ JE-ZZZZ JM-ZZZZ JO-ZZZZ~P-ZZZZ KE-ZZZZ KG-ZZZZ~I-ZZZZ KM-ZZZZ~N-ZZZZ KP-ZZZZ KR-ZZZZ KW-ZZZZ KY-ZZZZ~Z-ZZZZ LA-ZZZZ~C-ZZZZ LI-ZZZZ LK-ZZZZ LR-ZZZZ~V-ZZZZ LY-ZZZZ MA-ZZZZ MC-ZZZZ~H-ZZZZ MK-ZZZZ~Z-ZZZZ NA-ZZZZ NC-ZZZZ NE-ZZZZ~G-ZZZZ NI-ZZZZ NL-ZZZZ NO-ZZZZ~P-ZZZZ NR-ZZZZ NU-ZZZZ NZ-ZZZZ OM-ZZZZ PA-ZZZZ PE-ZZZZ~H-ZZZZ PK-ZZZZ~N-ZZZZ PR-ZZZZ~T-ZZZZ PW-ZZZZ PY-ZZZZ QA-ZZZZ RE-ZZZZ RO-ZZZZ RS-ZZZZ RU-ZZZZ RW-ZZZZ SA-ZZZZ~E-ZZZZ SG-ZZZZ~O-ZZZZ SR-ZZZZ~T-ZZZZ SV-ZZZZ SX-ZZZZ~Z-ZZZZ TA-ZZZZ TC-ZZZZ~D-ZZZZ TF-ZZZZ~H-ZZZZ TJ-ZZZZ~O-ZZZZ TR-ZZZZ TT-ZZZZ TV-ZZZZ~W-ZZZZ TZ-ZZZZ UA-ZZZZ UG-ZZZZ UM-ZZZZ US-ZZZZ UY-ZZZZ~Z-ZZZZ VA-ZZZZ VC-ZZZZ VE-ZZZZ VG-ZZZZ VI-ZZZZ VN-ZZZZ VU-ZZZZ WF-ZZZZ WS-ZZZZ XK-ZZZZ YE-ZZZZ YT-ZZZZ ZA-ZZZZ ZM-ZZZZ ZW-ZZZZ
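To illustrate the kind of compaction shown in the sample, here is a simplified sketch. It is not the StringRange code mentioned above: it only merges runs of equal-length strings that differ by one step at a single position, and the names `compact` and `_single_step_position` are hypothetical.

```python
def _single_step_position(a, b):
    """Return the index where b is exactly one code point after a, else None."""
    if len(a) != len(b):
        return None
    diffs = [k for k, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diffs) == 1 and ord(b[diffs[0]]) - ord(a[diffs[0]]) == 1:
        return diffs[0]
    return None


def compact(strings, sep="~"):
    """Collapse consecutive strings into 'start~end-suffix' tokens."""
    items = sorted(set(strings))
    out = []
    i = 0
    while i < len(items):
        start = end = items[i]
        pos = None  # the one position allowed to vary within this run
        j = i + 1
        while j < len(items):
            step = _single_step_position(end, items[j])
            if step is None or (pos is not None and step != pos):
                break
            pos, end = step, items[j]
            j += 1
        # Emit the end string only from the varying position onward.
        out.append(start if end == start else start + sep + end[pos:])
        i = j
    return out


codes = ["AC-ZZZZ", "AD-ZZZZ", "AE-ZZZZ", "AF-ZZZZ", "AG-ZZZZ",
         "AI-ZZZZ", "AL-ZZZZ", "AM-ZZZZ"]
print(" ".join(compact(codes)))
```

For this subset the result matches the first three tokens of the sample.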

comment:7 Changed 22 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
