[Unicode]   Common Locale Data Repository : Bug Tracking

CLDR Ticket #5192 (closed enhancement: fixed)

Opened 3 years ago

Last modified 23 months ago

improve DUCET tertiary weights

Reported by: markus Owned by: markus
Component: uca Data Locale:
Phase: Review: mark
Weeks: Data Xpath:
Xref:

Description

DUCET tertiary weights are constructed with conflicting goals, which leads to issues that are currently being worked around. It looks like we could improve the weights by simplifying the initial DUCET generation and then adding a further processing step.

DUCET tertiary weights are derived from character properties and collation-modified decomposition mappings. For example, weight 04 is used for general compatibility decompositions, weight 08 for normal uppercase, etc. See http://www.unicode.org/reports/tr10/#Tertiary_Weight_Table

Without adjustment, this can lead to some cases where strings a & b sort differently but ab=ba. For example, we might get:

002E  ; [*0273.0020.0002] # FULL STOP
2024  ; [*0273.0020.0004] # ONE DOT LEADER
2025  ; [*0273.0020.0004][*0273.0020.0004] # TWO DOT LEADER

\u2024 sorts less than \u2025, but \u2024\u2025=\u2025\u2024
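
The effect can be reproduced with a small sketch. It assumes a simplified three-level sort key (all primaries, then all secondaries, then all tertiaries, each level closed by a zero separator) and ignores variable weighting. The `CES` table and the `sort_key` helper are hypothetical names used only for this illustration; the weights are the three entries listed above.

```python
# Simplified three-level sort key; the variable marker '*' is ignored here.
CES = {
    "\u002E": [(0x0273, 0x0020, 0x0002)],                            # FULL STOP
    "\u2024": [(0x0273, 0x0020, 0x0004)],                            # ONE DOT LEADER
    "\u2025": [(0x0273, 0x0020, 0x0004), (0x0273, 0x0020, 0x0004)],  # TWO DOT LEADER
}

def sort_key(s):
    """Concatenate the CEs of s and emit the weights level by level."""
    ces = [ce for ch in s for ce in CES[ch]]
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in ces if ce[level] != 0)
        key.append(0)  # level separator
    return tuple(key)

a, b = "\u2024", "\u2025"
assert sort_key(a) < sort_key(b)            # a sorts before b ...
assert sort_key(a + b) == sort_key(b + a)   # ... but ab and ba are equal
```

Both concatenations expand to three identical CEs, so no level can tell them apart.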

Therefore, for several UCA versions, the tertiary weight of the trailing collation element has been set to 1F, introducing an additional tertiary distinction. This adjustment used to be buggy but is being fixed in UCA 6.2.

When this adjustment is applied for all such expansions, then some distinctions are erased. For example:

01F1  ; [.1617.0020.000A][.187B.0020.001F] # LATIN CAPITAL LETTER DZ
01F2  ; [.1617.0020.000A][.187B.0020.001F] # LATIN CAPITAL LETTER D WITH SMALL LETTER Z

By setting the last CE's tertiary to 1F, the distinction between the small and capital z is lost and we get \u01F1=\u01F2 which is wrong. There are several cases like this which are being fixed for UCA 6.2.
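
A minimal sketch of how the distinction disappears. The unadjusted tertiary weights of the second CE (0004 for the small z, 000A for the capital Z) are an assumption, chosen by analogy with the DUCET 6.1.0 data quoted in comment:1; `set_trailing_max` is a hypothetical helper, not part of any UCA tool.

```python
MAX_TERTIARY = 0x1F

def set_trailing_max(ces):
    """Return a copy of an expansion with the last CE's tertiary set to 1F."""
    if len(ces) < 2:
        return list(ces)  # a single CE is not an expansion
    primary, secondary, _ = ces[-1]
    return list(ces[:-1]) + [(primary, secondary, MAX_TERTIARY)]

# Assumed unadjusted values: capital Z = 000A, small z = 0004.
dz_upper = [(0x1617, 0x0020, 0x000A), (0x187B, 0x0020, 0x000A)]  # U+01F1
dz_title = [(0x1617, 0x0020, 0x000A), (0x187B, 0x0020, 0x0004)]  # U+01F2

assert dz_upper != dz_title                                       # distinct before
assert set_trailing_max(dz_upper) == set_trailing_max(dz_title)  # equal after
```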

Ideally, we would define tertiary differences manually (as in \u002E <<< \u2024 <<< \u2025 | \u002E), but that would be a large change to the process, and require a lot more work to define the sort order for new characters.

It seems like we could revert the initial DUCET generation to the tertiary weight assignment without setting 1F weights, and add another processing step to recompute all tertiary weights. A particular weight would not be assigned for a particular character property any more, but distinctions between same-property strings can then be made as needed.

Here is one quick idea for what such processing might look like:

  • parse allkeys.txt
  • sort all mappings by their CE sequences
  • for all adjacent CE sequences x & y:
    • if x is a prefix of y then fractionally increment y's first CE's tertiary weight

and then

  • for all CEs anywhere:
    • per primary+secondary combination:
      • list all tertiary weights in sorted order
      • remap: lowest=02, increment from there
      • change all CEs everywhere using this tertiary remapping
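
The two passes above could be sketched roughly as follows. This is only an illustration of the idea, not an existing tool: the function names are hypothetical, the variable marker ('.' vs '*') is dropped, the prefix check is a single pass over the originally sorted sequences, the fractional step of 1/2 is arbitrary, and CEs with a zero tertiary weight are left alone (an assumption the outline does not state).

```python
from fractions import Fraction
import re

CE_RE = re.compile(r"\[[.*]([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)")

def parse_allkeys(lines):
    """Map each code point field to a list of [primary, secondary, tertiary]."""
    mappings = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("@"):
            continue
        chars, ces = line.split(";", 1)
        mappings[chars.strip()] = [
            [int(p, 16), int(s, 16), Fraction(int(t, 16))]
            for p, s, t in CE_RE.findall(ces)
        ]
    return mappings

def bump_prefix_extensions(mappings, step=Fraction(1, 2)):
    """If x is a proper prefix of its sorted neighbor y, bump y's first tertiary."""
    ordered = sorted(mappings.values())
    snapshot = [[tuple(ce) for ce in seq] for seq in ordered]  # compare originals
    for i in range(1, len(ordered)):
        x, y = snapshot[i - 1], snapshot[i]
        if len(x) < len(y) and y[:len(x)] == x:
            ordered[i][0][2] += step

def remap_tertiaries(mappings):
    """Per primary+secondary, renumber the sorted tertiaries as 02, 03, ..."""
    seen = {}
    for ces in mappings.values():
        for p, s, t in ces:
            if t != 0:
                seen.setdefault((p, s), set()).add(t)
    remap = {(p, s, t): 2 + i
             for (p, s), ts in seen.items()
             for i, t in enumerate(sorted(ts))}
    for ces in mappings.values():
        for ce in ces:
            if ce[2] != 0:
                ce[2] = remap[(ce[0], ce[1], ce[2])]

sample = [
    "002E  ; [*0273.0020.0002] # FULL STOP",
    "2024  ; [*0273.0020.0004] # ONE DOT LEADER",
    "2025  ; [*0273.0020.0004][*0273.0020.0004] # TWO DOT LEADER",
]
m = parse_allkeys(sample)
bump_prefix_extensions(m)
remap_tertiaries(m)
assert m["2024"] + m["2025"] != m["2025"] + m["2024"]
```

On the three-line sample, U+2025 ends up with a first tertiary one step above U+2024, so the two concatenations no longer produce the same CE sequence.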

Change History

comment:1 Changed 3 years ago by richard.wordingham@…

The algebraic characterisation of misbehaviour looks nice, but I would remark that "s" sorts before "ss" but concat("s", "ss") and concat("ss", "s") sort the same!

I challenge, 'Therefore, for several UCA versions, the tertiary weight of the trailing collation element has been set to 1F, introducing an additional tertiary distinction.' The rule in UCA 6.1.0 is that the 3rd and subsequent non-null CEs get the tertiary weight 1F.
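
A minimal sketch of the rule as stated, assuming that "non-null" means a CE whose three weights are not all zero. The helper name `apply_third_ce_rule` is hypothetical, and the input tertiary of the third CE (0004) is an assumption, since the unadjusted value cannot be read from the published data.

```python
MAX_TERTIARY = 0x1F

def apply_third_ce_rule(ces):
    """Give the 3rd and every later non-null CE the tertiary weight 1F."""
    result = []
    non_null_seen = 0
    for primary, secondary, tertiary in ces:
        if (primary, secondary, tertiary) != (0, 0, 0):
            non_null_seen += 1
            if non_null_seen >= 3:
                tertiary = MAX_TERTIARY
        result.append((primary, secondary, tertiary))
    return result

# U+01F2 has two CEs and is left untouched.
two_ces = [(0x1616, 0x0020, 0x000A), (0x187A, 0x0020, 0x0004)]
assert apply_third_ce_rule(two_ces) == two_ces

# U+01C5 has three CEs; only the third one changes (input 0004 is assumed).
three_ces = two_ces + [(0x0000, 0x0041, 0x0004)]
assert apply_third_ce_rule(three_ces)[2] == (0x0000, 0x0041, 0x001F)
```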

I can see four technical issues with what would otherwise be an improvement:

1) DUCET tertiary weights are used to record precise casing information. Can we guarantee that casing information for DUCET can be extracted from FractionalUCA.txt? Does the ability need to be made a 'stability' straitjacket?

2) Tertiary weights are also used for asymmetric matching (UCA 6.1.0 Section 8.2). However, it has been agreed that the rule that 'minimum value means unmarked' needs modification so that normal hiragana, not small hiragana, is treated as unmarked amongst hiragana and katakana.

3) Non-NFSD* entries would need to be removed before reweighting, and then updated from their NFSD forms. (This may seem obvious, but this updating, even for NFD, is not always the final step in processing.)

4) Sorting would probably have to be character by character rather than level by level, and some careful handling may be necessary for variable weightings - use shifted with only three levels for comparison? For example, we have the following sequence in DUCET 6.1.0 when the old 4th level is removed:

216E ; [.1616.0020.000A] # ROMAN NUMERAL FIVE HUNDRED
1F113 ; [*02FB.0020.0004][.1616.0020.000A][*02FC.0020.001F] # PARENTHESIZED LATIN CAPITAL LETTER D
1F1E9 ; [.1616.0020.000A] # REGIONAL INDICATOR SYMBOL LETTER D
00D0 ; [.1616.0020.000A][.0000.0139.0004] # LATIN CAPITAL LETTER ETH
A779 ; [.1616.0020.000A][.0000.013A.0004] # LATIN CAPITAL LETTER INSULAR D
01F2 ; [.1616.0020.000A][.187A.0020.0004] # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
01F1 ; [.1616.0020.000A][.187A.0020.000A] # LATIN CAPITAL LETTER DZ
01C5 ; [.1616.0020.000A][.187A.0020.0004][.0000.0041.001F] # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
01C4 ; [.1616.0020.000A][.187A.0020.000A][.0000.0041.001F] # LATIN CAPITAL LETTER DZ WI

*NFSD = 'Normal Form Sorting-Decomposed'. It's like NFD, but uses the additional or alternative non-compatibility decompositions implied by current DUCET behaviour, such as U+00F8 LATIN SMALL LETTER O WITH STROKE to <U+006F, U+0338>.

comment:2 follow-up: ↓ 5 Changed 3 years ago by markus

"s" sorts before "ss" but concat("s", "ss") and concat("ss", "s") sort the same!

This makes sense to me. Maybe the right thing to do is not to worry about cases like \u2024\u2025=\u2025\u2024.

It does seem surprising that a dz/Dz/DZ ligature might share CEs with other unusual versions of 'd' and 'z'. Maybe we should just review the mapping from decomposition type to tertiary weights, or maybe we decide that the sort order is ok?

Consider the following data for "dz".
(This is from the final? UCA 6.2 but I am omitting the obsolete 4th weights here for clarity.)

01F3  ; [.1631.0020.0004][.1895.0020.0004] # LATIN SMALL LETTER DZ
02A3  ; [.1631.0020.0004][.1895.0020.0004] # LATIN SMALL LETTER DZ DIGRAPH
01C6  ; [.1631.0020.0004][.1895.0020.0004][.0000.0041.0004] # LATIN SMALL LETTER DZ WITH CARON

vs.

0369  ; [.1631.0020.0004] # COMBINING LATIN SMALL LETTER D
217E  ; [.1631.0020.0004] # SMALL ROMAN NUMERAL FIVE HUNDRED

1DE6  ; [.1895.0020.0004] # COMBINING LATIN SMALL LETTER Z

There are also the following versions of 'd' but they add secondary or variable-primary differences. (Similar for 'z'.)

249F  ; [*02FB.0020.0004][.1631.0020.0004][*02FC.0020.001F] # PARENTHESIZED LATIN SMALL LETTER D
00F0  ; [.1631.0020.0004][.0000.0139.0004] # LATIN SMALL LETTER ETH
1DD9  ; [.1631.0020.0004][.0000.0139.0004] # COMBINING LATIN SMALL LETTER ETH
1DD8  ; [.1631.0020.0004][.0000.013A.0004] # COMBINING LATIN SMALL LETTER INSULAR D
A77A  ; [.1631.0020.0004][.0000.013A.0004] # LATIN SMALL LETTER INSULAR D

So in UCA 6.2, the following sort the same:

  • dz LATIN SMALL LETTER DZ
  • ʣ LATIN SMALL LETTER DZ DIGRAPH
  • ͩᷦ COMBINING LATIN SMALL LETTER D + COMBINING LATIN SMALL LETTER Z
  • ⅾᷦ SMALL ROMAN NUMERAL FIVE HUNDRED + COMBINING LATIN SMALL LETTER Z
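
The equality can be checked mechanically from the data above. This sketch only concatenates the listed CEs per character; it ignores normalization and contractions, and `CES` and `ce_sequence` are hypothetical names for this illustration.

```python
D = (0x1631, 0x0020, 0x0004)
Z = (0x1895, 0x0020, 0x0004)

CES = {
    "\u01F3": [D, Z],  # LATIN SMALL LETTER DZ
    "\u02A3": [D, Z],  # LATIN SMALL LETTER DZ DIGRAPH
    "\u0369": [D],     # COMBINING LATIN SMALL LETTER D
    "\u217E": [D],     # SMALL ROMAN NUMERAL FIVE HUNDRED
    "\u1DE6": [Z],     # COMBINING LATIN SMALL LETTER Z
}

def ce_sequence(s):
    """Concatenate the CEs of each character in s."""
    return tuple(ce for ch in s for ce in CES[ch])

strings = ["\u01F3", "\u02A3", "\u0369\u1DE6", "\u217E\u1DE6"]
# All four strings yield one and the same CE sequence.
assert len({ce_sequence(s) for s in strings}) == 1
```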

The combining small letters don't look right here. Should they have secondary CEs, or secondary differences, rather than twiddling with the tertiary weights?

We should also say in http://www.unicode.org/reports/tr10/#Tertiary_Weight_Table that the <sort> type in decomps.txt behaves like <compat>.

comment:3 follow-up: ↓ 4 Changed 3 years ago by markus

I just found that UCA 6.2's reduced application of max=1F tertiary weights has another benefit: For the Thai/Lao order-reversing contractions we now get consonant+prevowel == prevowel+consonant, as we did when the reversal was done in code, whereas the contractions in UCA 6.1 and earlier had max=1F tertiary differences from consonant+prevowel.

comment:4 in reply to: ↑ 3 Changed 3 years ago by Richard Wordingham <richard.wordingham@…>

Replying to markus:

For the Thai/Lao order-reversing contractions we now get consonant+prevowel == prevowel+consonant like we did when the reversal was done in code, while the contractions UCA 6.1 and earlier had max=1F tertiary differences from consonant+prevowel.

I had assumed that the now-erased non-identity level distinction between consonant+soft-hyphen+preposed-vowel+space (ก­เ ) and preposed-vowel+consonant+space (เก ) in Thai, Lao and Tai Viet was intentional.

comment:5 in reply to: ↑ 2 Changed 3 years ago by Richard Wordingham <richard.wordingham@…>

Replying to markus:

The combining small letters don't look right here. Should they have secondary CEs, or secondary differences, rather than twiddling with the tertiary weights?

Combining small letters are just normal letters placed in a funny position. The difference is no more (and perhaps less) significant than subscripting or superscripting, so I fear they should be tertiary. It's a shame, for using an additional secondary element to say 'combining' allows the full use of the tertiary subtleties, such as for case.

Unfortunately, we need many more tertiary weights - several different <font> tertiaries, and extra general purpose <compat> tertiaries. You're probably going to say there isn't a spare bit for packed weights, though again the characters meriting special treatment are generally rare. In some cases, the ambiguity may not matter. For example, I was going to point out that U+217C SMALL ROMAN NUMERAL FIFTY has weight [.1711.0020.0004] while U+1EFB LATIN SMALL LETTER MIDDLE-WELSH LL has weight [.1711.0020.0004][.1711.0020.0004], and this combination has not been fixed. However, the former is deprecated (or so I've been told) and does not occur doubled in normal text. (I suppose it might occur doubled in a discussion of addition using Roman numerals.)

Again, on the topic of compatibility characters for Roman numerals, you've reshuffled things so that, expressing things using the compatibility decompositions, ii no longer sorts as i and i, but iii now sorts as i and ii, whereas before it didn't!

comment:6 Changed 3 years ago by emmons

  • Owner changed from anybody to markus
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 22.1

comment:7 Changed 3 years ago by markus

  • Milestone changed from 22.1 to 23

comment:8 Changed 2 years ago by markus

  • Milestone changed from 23 to 24

Review this ticket for UCA 6.3/CLDR 24.

UCA 6.3 is dropping the max=1F tertiary weights, and the preliminary FractionalUCA.txt has the tertiary weights distributed per primary+secondary combination.

comment:9 Changed 2 years ago by markus

  • Cc mark, yoshito, pedberg, emmons added
  • Status changed from assigned to accepted
  • Review set to emmons

New data: see ticket:5568 r9301, http://unicode.org/repos/cldr/trunk/common/uca/FractionalUCA.txt
with changes noted in comment:8.

The DUCET allkeys.txt is unchanged.

comment:10 Changed 23 months ago by emmons

  • Review changed from emmons to mark

comment:11 Changed 23 months ago by mark

  • Status changed from accepted to closed
  • Resolution set to fixed