CLDR Ticket #11451(closed: fixed)

Opened 2 months ago

Last modified 7 weeks ago

Incorrect description of valueType & related fixes

Reported by: mark Owned by: mark
Component: locale-codes-names Data Locale:
Phase: final Review: markus
Weeks: Data Xpath:
Xref:

Description

The description of valueType is incorrect.

single Only a single type value is allowed. This is the default if no valueType attribute is present.

It should be:

single Either exactly one type value, or no type value (but only if the value of "true" would be valid). This is the default if no valueType attribute is present.

Note: the canonical form has "true" removed: see http://unicode.org/reports/tr35/#u_Extension
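
A minimal sketch of the corrected rule, for illustration only. The function name and its parameters are hypothetical and not taken from any CLDR tooling; the sketch assumes the set of valid type names for the key is already known.

```python
def canonicalize_single_value(key, values, valid_types):
    """Validate and canonicalize the value of a key with valueType="single".

    key         -- the -u- extension key, e.g. "kn" (used only in messages)
    values      -- list of type subtags that follow the key (may be empty)
    valid_types -- set of type names defined for the key
    """
    if len(values) > 1:
        raise ValueError(f"{key}: at most one type value is allowed")
    if not values:
        # An omitted value stands for "true", so it is valid only if "true" is.
        if "true" not in valid_types:
            raise ValueError(f"{key}: a type value is required")
        return []
    if values[0] not in valid_types:
        raise ValueError(f"{key}: unknown type value {values[0]!r}")
    # The canonical form has an explicit "true" removed.
    return [] if values[0] == "true" else list(values)
```

With the types "true" and "false" defined for kn, both `["true"]` and `[]` canonicalize to `[]`, while `["false"]` is kept as is. A key that does not define "true" rejects an empty value.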

Change History

comment:1 Changed 2 months ago by mark

FYI: the following has valueType="single" (implicitly, "single" is the default).

<key name="kn" description="Collation parameter key for numeric handling" alias="colNumeric">

<type name="true" description="A sequence of decimal digits is sorted at primary level with its numeric value" alias="yes"/>
<type name="false" description="No special handling for numeric ordering" alias="no"/>

</key>
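
As a rough illustration of how the implicit default could be read from such a key element, the sketch below parses the snippet above with Python's standard xml.etree.ElementTree. The helper name describe_key is hypothetical; the only rules it encodes are the ones stated in this ticket (a missing valueType means "single", and the value may be omitted only if "true" is a valid type).

```python
import xml.etree.ElementTree as ET

KEY_XML = """
<key name="kn" description="Collation parameter key for numeric handling" alias="colNumeric">
<type name="true" description="A sequence of decimal digits is sorted at primary level with its numeric value" alias="yes"/>
<type name="false" description="No special handling for numeric ordering" alias="no"/>
</key>
"""

def describe_key(xml_text):
    """Return (name, valueType, type names, whether the value may be omitted)."""
    key = ET.fromstring(xml_text.strip())
    # "single" is the default when no valueType attribute is present.
    value_type = key.get("valueType", "single")
    type_names = {t.get("name") for t in key.findall("type")}
    # The value may be omitted only when "true" would be a valid value.
    may_omit_value = value_type == "single" and "true" in type_names
    return key.get("name"), value_type, type_names, may_omit_value
```

For the kn element this reports valueType "single" and allows the value to be omitted, so both en-u-kn-true and en-u-kn are acceptable inputs.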

comment:2 Changed 2 months ago by mark

Also noticed a couple of areas where http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#BCP_47_Conformance needs to be cleaned up to make clear which features are not BCP47-compatible.

comment:3 Changed 2 months ago by mark

  • Status changed from new to reviewing
  • Cc markus added
  • Priority changed from assess to major
  • Milestone changed from UNSCH to 34
  • Owner changed from anybody to mark
  • type changed from unknown to spec
  • Review set to pedberg

comment:5 Changed 2 months ago by mark

  • Summary changed from Incorrect description of valueType to Incorrect description of valueType & related fixes

comment:6 Changed 2 months ago by mark

While looking at the spec, I realized two other things.

http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Unknown_or_Invalid_Identifiers

  1. The value given for the unknown subdivision is incorrect; it should be <region>zzzz.
  2. We should make sure that we document clearly each of the special values (we do almost, but not all of them).

lstrType     Validity Status  Code    LSTR
language     special          mis     {Description=Uncoded languages, Added=2005-10-16, Scope=special}
language     special          mul     {Description=Multiple languages, Added=2005-10-16, Scope=special}
language     special          zxx     {Description=No linguistic content▪Not applicable, Added=2006-03-08, Scope=special}
script       special          Qaag    {Description=Private use, Added=2005-10-16}
script       special          Zmth    {Description=Mathematical notation, Added=2007-12-05}
script       special          Zsye    {Description=Symbols (Emoji variant), Added=2016-01-04}
script       special          Zsym    {Description=Symbols, Added=2007-12-05}
script       special          Zxxx    {Description=Code for unwritten documents, Added=2005-10-16}
region       special          XA      {Description=Private use, Added=2005-10-16}
region       special          XB      {Description=Private use, Added=2005-10-16}
language     unknown          und     {Description=Undetermined, Added=2005-10-16, Scope=special}
script       unknown          Zzzz    {Description=Code for uncoded script, Added=2005-10-16}
region       unknown          ZZ      {Description=Private use, Added=2005-10-16}
currency     unknown          XXX     <unavailable>
subdivision  unknown          .*zzzz  <unavailable>
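
To make the "unknown" rows concrete, here is a small sketch. The names UNKNOWN_CODES, unknown_subdivision and is_unknown_subdivision are hypothetical, and lowercasing the region is an assumption based on the lowercase form of subdivision codes such as "gbsct". The codes and the .*zzzz pattern come from the table above.

```python
import re

# "unknown" codes from the table above, keyed by lstrType.
UNKNOWN_CODES = {
    "language": "und",
    "script": "Zzzz",
    "region": "ZZ",
    "currency": "XXX",
}

# Pattern given for unknown subdivisions in the table above.
UNKNOWN_SUBDIVISION = re.compile(r".*zzzz")

def unknown_subdivision(region):
    """Build the unknown subdivision code <region>zzzz for a region code."""
    return region.lower() + "zzzz"

def is_unknown_subdivision(code):
    """Check a subdivision code against the .*zzzz pattern."""
    return UNKNOWN_SUBDIVISION.fullmatch(code.lower()) is not None
```

Under these assumptions, the unknown subdivision for region US would be "uszzzz", and "gbsct" is not an unknown subdivision.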

comment:7 Changed 2 months ago by mark

Note: the scope of the BCP 47 changes grew on the basis of feedback from Markus and Addison.

Reordered and added 2 defined terms to
http://www.unicode.org/reports/tr35/proposed.html#BCP_47_Conformance

Reordered and tightened up the language in the following to make it much clearer what the various relationships are. Also regularized the format of the examples, and split out the conversion for the compatibility form.
http://www.unicode.org/reports/tr35/proposed.html#BCP_47_Language_Tag_Conversion

comment:8 Changed 2 months ago by mark

The changes from comment:7 are much easier to view in the browser than in the diffs.

comment:9 Changed 2 months ago by mark

  • Cc addison@… added

comment:10 Changed 2 months ago by pedberg

  • Review changed from pedberg to markus

comment:11 follow-up: ↓ 12 Changed 2 months ago by markus

  • Cc pedberg added
  • Phase changed from dsub to final
  • Status changed from reviewing to reviewfeedback
  • Component changed from unknown to bcp47

3.2

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions (U and T). It has the following structure. The semantics of the U and T extensions are explained in Section 3.6 Unicode BCP 47 U Extension and Section 3.7 Unicode BCP 47 T Extension.

--> Out of date now that the syntax includes all types of extensions.
--> Suggestion: Remove "(U and T)". Add something about other extensions.

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Section 3.6 Unicode BCP 47 U Extension and Section 3.7 Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through without specific structure.


3.4

The private use subtags from XA..XZ will normally never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

--> CLDR defines XA & XB for pseudolocales.
--> Change to "The private use subtags from XC..XZ will ..."


3.4

The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".

I know this is old, but I don't quite understand how one is supposed to even recognize "eng-USA" when canonicalizing a locale ID: The "USA" subtag has the syntactic shape of an extlang, not of a region subtag, so I would expect extlang conversion to toss it out, rather than region canonicalization to map it to "US".
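
The point about syntactic shape can be sketched as follows. The function name is hypothetical; the shapes used are those of the BCP 47 syntax (extlang is three letters, script is four letters, region is two letters or three digits), applied to the subtag that directly follows the primary language subtag.

```python
def shape_after_language(subtag):
    """Classify a subtag that directly follows the language subtag by shape only."""
    if len(subtag) == 3 and subtag.isalpha():
        return "extlang"  # three letters, e.g. "USA" in "eng-USA"
    if len(subtag) == 4 and subtag.isalpha():
        return "script"   # four letters
    if (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
        return "region"   # two letters or three digits, e.g. "US" or "840"
    return "other"
```

By shape alone, "840" and "US" read as regions, but "USA" reads as an extlang, so a parser that accepts extlang never hands "USA" to region canonicalization.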


3.6.5.1 Validity

en-CA-u-sd-gbsct is invalid — the region "CA" doesn't not match the first part of "gbsct".

--> Change "doesn't not" to "does not".
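
For reference, the validity condition in that sentence can be sketched as a prefix check. This is a simplified illustration and not the full validity algorithm: the function name is hypothetical, the region is taken to be the first two-letter or three-digit subtag after the language, and private use sequences are not handled.

```python
def sd_matches_region(locale_id):
    """Return False if the locale has an -u-sd- value that does not start with its region."""
    subtags = locale_id.lower().split("-")

    # Find the region: first 2-letter or 3-digit subtag before any singleton.
    region = None
    for s in subtags[1:]:
        if len(s) == 1:
            break
        if (len(s) == 2 and s.isalpha()) or (len(s) == 3 and s.isdigit()):
            region = s
            break

    if "u" not in subtags:
        return True  # no -u- extension, nothing to check
    ext = []
    for s in subtags[subtags.index("u") + 1:]:
        if len(s) == 1:
            break  # the next singleton ends the -u- extension
        ext.append(s)
    if "sd" not in ext:
        return True  # no subdivision key, nothing to check

    i = ext.index("sd")
    value = ext[i + 1] if i + 1 < len(ext) else None
    if region is None or value is None:
        return False
    # The first part of the subdivision code must be the region.
    return value.startswith(region)
```

Here en-CA-u-sd-gbsct fails the check because "gbsct" does not start with "ca", while en-GB-u-sd-gbsct passes.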

comment:12 in reply to: ↑ 11 ; follow-up: ↓ 13 Changed 2 months ago by mark

Replying to markus:

3.2

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions (U and T). It has the following structure. The semantics of the U and T extensions are explained in Section 3.6 Unicode BCP 47 U Extension and Section 3.7 Unicode BCP 47 T Extension.

--> Out of date now that the syntax includes all types of extensions.
--> Suggestion: Remove "(U and T)". Add something about other extensions.

A Unicode locale identifier is composed of a Unicode language identifier plus (optional) locale extensions. It has the following structure. The semantics of the U and T extensions are explained in Section 3.6 Unicode BCP 47 U Extension and Section 3.7 Unicode BCP 47 T Extension. Other extensions and private use extensions are supported for pass-through without specific structure.

Done, except omitted "without specific structure".


3.4

The private use subtags from XA..XZ will normally never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by other applications.

--> CLDR defines XA & XB for pseudolocales.
--> Change to "The private use subtags from XC..XZ will ..."

Changed to the following formulation (in language and script subtag cells also):

The private use codes listed in Section 3.5.3 Private Use Codes


3.4

The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".

I know this is old, but I don't quite understand how one is supposed to even recognize "eng-USA" when canonicalizing a locale ID: The "USA" subtag has the syntactic shape of an extlang, not of a region subtag, so I would expect extlang conversion to toss it out, rather than region canonicalization to map it to "US".

It would only be useful for implementations that don't accept extlang. Might revisit this later, since overlong region codes are an unusual case; either that or add a note.


3.6.5.1 Validity

en-CA-u-sd-gbsct is invalid — the region "CA" doesn't not match the first part of "gbsct".

--> Change "doesn't not" to "does not".

whoa!
done

comment:13 in reply to: ↑ 12 Changed 2 months ago by markus

  • Status changed from reviewfeedback to closed
  • Resolution set to fixed

Replying to mark:

3.4

The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".

I know this is old, but I don't quite understand how one is supposed to even recognize "eng-USA" when canonicalizing a locale ID: The "USA" subtag has the syntactic shape of an extlang, not of a region subtag, so I would expect extlang conversion to toss it out, rather than region canonicalization to map it to "US".

It would only be useful for implementations that don't accept extlang. Might revisit this later, since overlong region codes are an unusual case; either that or add a note.

I submitted ticket:11473

comment:14 Changed 7 weeks ago by mark

  • Component changed from bcp47 to locale-codes