CLDR Ticket #6745 (accepted data)
|Reported by:||markus||Owned by:||markus|
Description (last modified by markus)
There are several problems with UCA_Rules.txt:
- It tailors code points like U+FFFD, U+FFFE, U+FFFF which are not allowed to be tailored.
- It tailors primary-after ignorables, which does not work any more because there is now a CE with the first possible primary weight, for the start-of-spaces boundary. (There is now a first-primary boundary for the start of each reordering group and for each script.) Instead, UCA_Rules.txt should place spaces after the last space in the root collation.
- It modifies [variable top], using syntax that we plan to deprecate as soon as we add the maxVariable setting to the spec. Instead, it should place punctuation after [last variable].
- It has rules with extension strings whose mappings are changed later, so the earlier mappings do not match as expected. This used to work with ICU when its builder evaluated extensions after other CEs had been assigned, but we confirmed and documented that each rule should be affected by all of the preceding rules and none of the following ones. For example:
< '!' <<< ！ <<< ‼ / '!' <<< ⁉ / '?' <<< ﹗ <<< ︕ < ¡ < ՜ < ߹ < ᥄ < '?'
When the rule <<< ⁉ / '?' is processed with a conformant builder, '?' still has its root collator mapping, and that is copied into the second CE for ⁉, but a few rules later '?' is modified. As a result, ⁉ becomes primary-less-than '!?' rather than tertiary-greater-than '!?'. Therefore, a collator built from UCA_Rules.txt will not pass the conformance tests.
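The effect described above can be reproduced with a toy model. This is only a sketch, not ICU's builder: a CE is reduced to a (primary, tertiary) pair, all weights are invented, the chain is shortened to < '!' <<< ⁉ / '?' < ¡ < '?', and the names `build` and `compare` are hypothetical.

```python
# Toy model: a collation element (CE) is a (primary, tertiary) pair.
# All weights are invented for illustration; they are not real root weights.
ROOT = {"!": [(10, 0)], "?": [(20, 0)]}

# Shortened chain: < '!' <<< ⁉ / '?' < ¡ < '?'
# Each rule is (strength, string, extension or None).
RULES = [("<", "!", None), ("<<<", "⁉", "?"), ("<", "¡", None), ("<", "?", None)]


def build(rules, defer_extensions):
    """Assign CEs rule by rule.

    defer_extensions=False: the extension copies the mapping that exists
    when the rule is processed (only preceding rules affect it).
    defer_extensions=True: the extension is resolved after all rules have
    assigned their CEs (the old behavior the ticket describes).
    """
    mapping = {k: list(v) for k, v in ROOT.items()}
    primary, tertiary = 100, 0
    deferred = []
    for strength, s, ext in rules:
        if strength == "<":
            primary, tertiary = primary + 1, 0
        else:  # "<<<"
            tertiary += 1
        ces = [(primary, tertiary)]
        if ext is not None:
            if defer_extensions:
                deferred.append((s, ext))
            else:
                ces += mapping[ext]
        mapping[s] = ces
    for s, ext in deferred:
        mapping[s] = mapping[s] + mapping[ext]
    return mapping


def compare(mapping, a, b):
    """Name the first level at which a and b differ, from a's point of view."""
    ca = [ce for ch in a for ce in mapping[ch]]
    cb = [ce for ch in b for ce in mapping[ch]]
    for level, name in ((0, "primary"), (1, "tertiary")):
        ka = [ce[level] for ce in ca]
        kb = [ce[level] for ce in cb]
        if ka != kb:
            return name + ("-less-than" if ka < kb else "-greater-than")
    return "equal"


print(compare(build(RULES, defer_extensions=False), "⁉", "!?"))
print(compare(build(RULES, defer_extensions=True), "⁉", "!?"))
```

In this model the rule-by-rule build gives ⁉ the stale root CE for '?' as its second CE, so the difference from '!?' shows up at the primary level. The deferred build picks up the final mapping of '?' and leaves only a tertiary difference.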
This problem might be tricky to resolve. We might need to postpone any rules that contain extensions to a later section, like we postpone Thai and Lao reordering mappings. It might also be better to use normal resets for expansions, for example &'!?'<<<⁉ -- they are easier to understand anyway.
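One way the postponement could look is sketched here. It is not the generator's actual code. The function `postpone_extensions` and the helper `quote` are hypothetical, only the strengths < and <<< are handled, and the reset string is assumed to be the most recent primary-level item followed by the extension, which matches the &'!?'<<<⁉ example above.

```python
def quote(s):
    # Hypothetical helper: quote all-ASCII strings the way the ticket's rules do.
    return "'" + s + "'" if all(ord(c) < 0x80 for c in s) else s


def postpone_extensions(chain):
    """Split a chain into a main chain and postponed reset rules.

    chain: list of (strength, string, extension or None), with strength
    "<" or "<<<". Assumption: the reset for an extension rule is the most
    recent primary-level ("<") item followed by the extension string.
    """
    main, postponed = [], []
    base = None
    for strength, s, ext in chain:
        if ext is None:
            main.append(f"{strength} {quote(s)}")
            if strength == "<":
                base = s
        else:
            postponed.append(f"&{quote(base + ext)}{strength}{s}")
    return " ".join(main), postponed


# Shortened version of the chain quoted in the ticket.
chain = [
    ("<", "!", None), ("<<<", "！", None), ("<<<", "‼", "!"),
    ("<<<", "⁉", "?"), ("<", "¡", None), ("<", "?", None),
]
main, postponed = postpone_extensions(chain)
print(main)
for rule in postponed:
    print(rule)
```

Because the postponed resets would be emitted after the main chain, they would see the final mapping of '?'. One caveat of this sketch is that the position of ‼ and ⁉ relative to the other tertiary variants of '!' is no longer expressed by the chain, so the result would still need to be checked against the root collation.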
Note that the "UCA rules" are only an approximation of the root collation (see IcuBug:9512 and IcuBug:9589). If we did not have one known user of the "UCA rules", it would be easiest to stop generating and testing them...
We could also agree not to fix some of these problems (make it build but don't fix the expansions), affirm that the file provides only an approximation, and in ICU we would stop testing it with the conformance test files. We already test it only with the "non-ignorable" test file, not with "shifted", due to long-standing problems.
- Owner changed from anybody to markus
- Status changed from new to assigned
- Milestone changed from UNSCH to 25rc
- Data Locale set to root
- Type changed from defect to data
- Component changed from uca to collation