Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #6745 (accepted data)

Opened 4 years ago

Last modified 2 years ago

fix UCA_Rules.txt

Reported by: markus
Owned by: markus
Component: collation
Data Locale: root
Phase: rc
Review:
Weeks: 1
Data Xpath:

Description (last modified by markus)

There are several problems with UCA_Rules.txt:

  1. It tailors code points like U+FFFD, U+FFFE, and U+FFFF, which are not allowed to be tailored.
  2. It tailors primary-after ignorables, which no longer works because there is now a CE with the first possible primary weight, for the start-of-spaces boundary. (There is now a first-primary boundary for the start of each reordering group and for each script.) Instead, UCA_Rules.txt should place spaces after the last space in the root collation.
  3. It modifies [variable top], using syntax that we plan to deprecate as soon as we add the maxVariable setting to the spec. Instead, it should place punctuation after [last variable].
  4. It has rules with extension strings whose mappings are changed later, so the earlier mappings do not match as expected. This used to work with ICU when its builder evaluated extensions after other CEs had been assigned, but we confirmed and documented that each rule should be affected by all of the preceding rules and none of the following ones.

For example:

 <	 '!'
   <<<	 !
   <<<	 ‼ / '!'
   <<<	 ⁉ / '?'
   <<<	 ﹗
   <<<	 ︕
 <	 ¡
 <	 ՜
 <	 ߹
 <	 ᥄
 <	 '?'

When the rule <<< ⁉ / '?' is processed with a conformant builder, '?' still has its root collator mapping, and that mapping is copied into the second CE for ⁉. A few rules later, '?' is modified. As a result, ⁉ becomes primary-less-than '!?' rather than tertiary-greater-than '!?'. Therefore, a collator built from UCA_Rules.txt will not pass the conformance tests.
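
The ordering effect can be reproduced with a toy model. The sketch below is not ICU's builder: the weights, the `build`, `ces_for` and `sort_key` helpers, and the `defer_extensions` flag are all invented for illustration. It assumes that tailored primaries are allocated above the root primaries of '!' and '?'. It only shows why evaluating an extension at the point of its rule, rather than after all rules, changes how ⁉ compares with '!?'.

```python
# Toy model only: a CE is a (primary, secondary, tertiary) tuple with made-up weights.
ROOT = {"!": [(100, 0, 0)], "?": [(110, 0, 0)]}  # hypothetical root mappings

# Simplified rules: (strength, string, extension or None)
RULES = [
    ("<", "!", None),
    ("<<<", "\u2049", "?"),  # <<< ⁉ / '?'
    ("<", "\u00a1", None),   # < ¡
    ("<", "?", None),        # '?' is re-tailored after the extension rule
]


def ces_for(table, text):
    """Concatenate the current CEs of each character in text."""
    return [ce for ch in text for ce in table[ch]]


def build(rules, root, defer_extensions):
    """Process rules in order and return a string -> CE list table.

    defer_extensions=False: an extension copies the mapping that exists when
    its rule is processed (each rule sees only the preceding rules).
    defer_extensions=True: extensions are resolved after all other rules,
    mimicking the older behavior described in this ticket.
    """
    table = {k: list(v) for k, v in root.items()}
    primary = 1000  # assumption: tailored primaries sit above the root ones
    tertiary = 0
    pending = []
    for strength, text, ext in rules:
        if strength == "<":
            primary += 1
            tertiary = 0
        else:  # "<<<"
            tertiary += 1
        head = (primary, 0, tertiary)
        if ext is None:
            table[text] = [head]
        elif defer_extensions:
            pending.append((text, head, ext))
        else:
            table[text] = [head] + ces_for(table, ext)
    for text, head, ext in pending:
        table[text] = [head] + ces_for(table, ext)
    return table


def sort_key(table, text):
    """Level-by-level key: all primaries, then secondaries, then tertiaries."""
    ces = ces_for(table, text)
    return ([p for p, _, _ in ces], [s for _, s, _ in ces], [t for _, _, t in ces])


conformant = build(RULES, ROOT, defer_extensions=False)
deferred = build(RULES, ROOT, defer_extensions=True)

# Conformant order: the second CE of ⁉ carries the root primary of '?',
# so ⁉ differs from '!?' at the primary level and sorts before it.
print(sort_key(conformant, "\u2049") < sort_key(conformant, "!?"))
# Deferred evaluation: primaries match and only the tertiary level differs.
print(sort_key(deferred, "\u2049") > sort_key(deferred, "!?"))
```

In this model the only difference between the two runs is the moment at which the mapping of '?' is read, which is the behavior change described above.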

This problem might be tricky to resolve. We might need to postpone any rules that contain extensions to a later section, like we postpone Thai and Lao reordering mappings. It might also be better to use normal resets for expansions, for example &'!?'<<<⁉ -- they are easier to understand anyway.
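
One possible shape for such a postponement is sketched here. The function name `postpone_extensions` and the tuple representation of rules are hypothetical, and the sketch assumes that the most recent string introduced with `<` is the intended base of every following extension rule. The real UCA_Rules.txt generator may need a different anchor, and the relative tertiary order among the variants is not preserved by this simple rewrite.

```python
# Simplified rules: (strength, string, extension or None); illustrative only.
RULES = [
    ("<", "!", None),
    ("<<<", "\u203c", "!"),  # <<< ‼ / '!'
    ("<<<", "\u2049", "?"),  # <<< ⁉ / '?'
    ("<", "?", None),
]


def postpone_extensions(rules):
    """Split rules into a main chain and postponed reset rules.

    Each rule with an extension is removed from the chain and rewritten as a
    reset on base + extension, for example &'!?'<<<⁉, to be emitted in a later
    section after all plain rules have been applied.
    Assumption: the base is the latest string introduced with "<".
    Quoting is naive: strings containing an apostrophe are not handled.
    """
    main, postponed = [], []
    base = None
    for strength, text, ext in rules:
        if ext is None:
            main.append((strength, text, None))
            if strength == "<":
                base = text
        else:
            if base is None:
                raise ValueError("extension rule without a preceding base")
            postponed.append(f"&'{base}{ext}'{strength}{text}")
    return main, postponed


main_rules, postponed_rules = postpone_extensions(RULES)
for rule in postponed_rules:
    print(rule)
```

Because the postponed resets are processed after '?' has received its tailored mapping, each expansion is built from the final mappings of its parts.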

Note that the "UCA rules" are only an approximation of the root collation (see IcuBug:9512 and IcuBug:9589). If we did not have one known user of the "UCA rules", it would be easiest to stop generating and testing them...

We could also agree not to fix some of these problems (make it build but don't fix the expansions), affirm that the file provides only an approximation, and in ICU we would stop testing it with the conformance test files. We already test it only with the "non-ignorable" test file, not with "shifted", due to long-standing problems.


Change History

comment:1 Changed 4 years ago by markus

  • Description modified

comment:2 Changed 4 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 25rc

comment:3 Changed 4 years ago by markus

  • Milestone changed from 25rc to 26rc

comment:4 Changed 4 years ago by markus

  • Milestone changed from 26rc to 27rc

comment:5 Changed 4 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:6 Changed 3 years ago by markus

  • Milestone changed from 27 to 28

comment:7 Changed 3 years ago by markus

  • Data Locale set to root
  • Type changed from defect to data
  • Component changed from uca to collation

comment:8 Changed 3 years ago by srl

  • Status changed from assigned to accepted

comment:9 Changed 3 years ago by markus

  • Milestone changed from 28 to 29

comment:10 Changed 2 years ago by emmons

  • Milestone changed from 29 to upcoming

Automatic move of all 29 -> upcoming

