CLDR Ticket #5546 (accepted data)
follow DUCET with other numbers among symbols
Reported by: | markus | Owned by: | markus |
---|---|---|---|
Component: | collation | Data Locale: | root |
Phase: | | Review: | |
Weeks: | 0.1 | Data Xpath: | |
Xref: | | | |
Description
I propose that we follow the DUCET in how we order "other numbers". That is, I propose we stop reordering them from the symbol group into the digit group.
Details:
Compared with the DUCET, "CLDR groups the numbers together after currency symbols, instead of splitting them with some before and some after." (see the LDML spec).
There are about 200 "other number" characters that CLDR modifies, for example U+0BF0 ௰ Tamil Number Ten and U+2180 ↀ Roman Numeral One Thousand C D. On the DUCET symbol chart they are the characters from U+09F4 to U+1D371.
CLDR sorts all of these in the "digit" reordering group, just before digit 0. They do not sort in the order of numeric values, they are not digits, and they do not decompose to digits.
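To make the "not digits, no decomposition to digits" point concrete, here is a minimal sketch using Python's standard unicodedata module to inspect the two example characters. The helper name describe is made up for this illustration, and the property values come from whatever Unicode version the local Python ships with, not from the UCA data itself.

```python
import unicodedata

def describe(ch):
    """Collect the properties relevant to the 'other number' discussion.

    'describe' is a hypothetical helper for this illustration only.
    """
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),          # Nd = decimal digit, Nl/No = other numbers
        "decimal": unicodedata.decimal(ch, None),      # None when the character is not a decimal digit
        "numeric": unicodedata.numeric(ch, None),      # numeric value, if any
        "decomposition": unicodedata.decomposition(ch) # empty string when there is no decomposition
    }

# U+0BF0 Tamil Number Ten, U+2180 Roman Numeral One Thousand C D, and ASCII 5 for contrast.
for ch in "\u0BF0\u2180" + "5":
    print(f"U+{ord(ch):04X}", describe(ch))
```

Both "other number" characters carry a numeric value but have no decimal-digit value and no decomposition, which is why they can only be placed by explicit reordering rather than by digit handling.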
With numeric sorting on, and with computed primary weights for numeric sorting at the beginning of the digit group as we defined in LDML 22, the "other number" characters sort between the digits-as-numbers and the compatibility digits.
The current reordering puts together all of the characters that have General_Category=Number, but I do not see that this order is better, in any practical sense, than their DUCET order.
I think it is desirable to reduce the difference between the DUCET and the CLDR root, to reduce surprises for users and to reduce our tooling and documentation burden.
Attachments
Change History
Changed 5 years ago by markus
- Attachment sort-with-digit-1.txt added
characters in UCA 6.3 generated allkeys_DUCET.txt that have the same primary weight as ASCII digit 1
comment:2 Changed 5 years ago by markus
I agree there's cruft in the DUCET; I am just not sure it's worth reordering it, or so much of it.
We could collect numeric cruft at the end of the digit group, or we could move numeric cruft from the digit group next to the numeric cruft that the DUCET has in the symbol group; or leave the numeric cruft where it is.
I attached a file with all of the characters that sort with "1" in the UCA 6.3 DUCET. (There are none with a primary weight between "1" and "2".)
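A list like the attached one could be produced roughly as follows. This is only a sketch, not the tool that generated the attachment: the function names are made up, and the SAMPLE lines use placeholder weights in the general shape of allkeys.txt entries rather than real UCA 6.3 values.

```python
import re

# Captures the primary weight of one collation element, e.g. [.1000.0020.0002.0031].
CE_PRIMARY = re.compile(r"\[[.*]([0-9A-F]{4,5})\.")

def parse_allkeys(lines):
    """Map each string in an allkeys-style listing to its list of primary weights."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line or line.startswith("@"):   # skip blanks and @version-style lines
            continue
        code_points, _, elements = line.partition(";")
        text = "".join(chr(int(cp, 16)) for cp in code_points.split())
        table[text] = [int(p, 16) for p in CE_PRIMARY.findall(elements)]
    return table

def sharing_primary(table, ref):
    """Return the strings whose collation elements include ref's first primary weight."""
    target = table[ref][0]
    return sorted(s for s, primaries in table.items() if target in primaries)

# Placeholder data: the weights below are invented for illustration only.
SAMPLE = """\
@version 0.0.0
0031 ; [.1000.0020.0002.0031] # DIGIT ONE
2460 ; [.1000.0020.0006.2460] # CIRCLED DIGIT ONE
00BD ; [.1000.0020.001E.00BD][*0300.0020.001E.00BD][.1001.0020.001E.00BD] # VULGAR FRACTION ONE HALF
0032 ; [.1001.0020.0002.0032] # DIGIT TWO
""".splitlines()

print(sharing_primary(parse_allkeys(SAMPLE), "1"))
```

With the placeholder data, the digit, the circled digit, and the fraction are reported, while "2" is not, because none of its collation elements carries the primary weight of "1".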
comment:3 Changed 5 years ago by markus
- Owner changed from anybody to markus
- Status changed from new to assigned
Need to review together with other diffs between DUCET & CLDR root.
I'm sympathetic, but have some concerns. Currently the UCA has the following order:
☺ general symbols
ↀ some strange non-decimal numbers
€ currency signs
0 digits
⓪ variants of digits
1 digits
⓵ variants
𒐴 other strange non-decimal numbers
½ fractions
① ② sequences
12 ...
⑫ ...
2 ...
A letters
L Nl values interleaved with digits.
I think the least surprising order for numeric sorting would put all of the items that can be interpreted as decimal numbers in one group, sorted as decimal numbers, and all the other numbers (other than Nl) in another group:
☺ general symbols
€ currency signs
the following, in numeric order:
0 digits
½ fractions
⓪ variants of digits
1 digits
⓵ variants
2 ...
① ② sequences
12 ...
⑫ ...
ↀ some strange non-decimal numbers
𒐴 other strange non-decimal numbers
A letters
L Nl values interleaved with digits.
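As a toy illustration of the grouping idea in the list above, the sketch below sorts a few single characters by numeric value and pushes a hand-picked set of outliers into a separate group afterwards. The OUTLIERS set and the proposed_key function are assumptions made for this example: the ticket does not define which characters count as "strange non-decimal numbers", and a real collator would work with collation weights rather than Python's unicodedata values.

```python
import unicodedata

# Hand-picked for this illustration only; the ticket does not define this set.
OUTLIERS = {"\u2180", "\U00012434"}  # ↀ and 𒐴

def proposed_key(ch):
    """Hypothetical sort key: (group, numeric value, code point)."""
    if ch in OUTLIERS:
        # Outliers form their own group after the decimal-interpretable items.
        return (1, 0.0, ord(ch))
    # Everything else is ordered by numeric value, with code point as tie-breaker.
    return (0, unicodedata.numeric(ch), ord(ch))

sample = ["ↀ", "2", "①", "½", "⑫", "0", "𒐴", "1"]
ordered = sorted(sample, key=proposed_key)
print(ordered)
```

The code-point tie-breaker is an arbitrary choice for the sketch; it only makes the output deterministic for items with equal numeric value, such as 1 and ①.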
The UCA interleaves some questionable items in with digits, like ½ between 1 and 2. For example, the characters that have or contain the same primary weight as "1" in http://www.unicode.org/Public/UCA/6.3.0/allkeys-6.3.0d1.txt include the following:
[⑴ ⑽-⒆ 1１𝟏𝟙𝟣𝟭𝟷①⓵❶➀➊¹ ₁١۱𐹠߁፩𐒡१১੧૧୧௧౧౹౼೧൧꯱꣑᥇ ᧑᧚᪁᪑๑໑༡༪᱁꤁၁႑𑄷១៱꩑᭑꧑᮱᠑᱑ ꘡𑃱𐄇𐅂𐅘-𐅚𐌠𐏑𒐕𒐞𒐬𒐴𒑏𒑘𐩽𐤖𐡘𐭘𐭸𑇑 𑛁𑁧𑁒𐩀𝍠 🄂 ⒈ ⅟ ⅒ ½ ⅓ ¼ ⅕ ⅙ ⅐ ⅛ ⅑ ⑩⓾❿➉➓㉈ ⒑ ㏩ ㋉ ㍢ ⑪⓫ ⒒ ㏪ ㋊ ㍣ ⑫⓬ ⒓ ㏫ ㋋ ㍤ ⑬⓭ ⒔ ㏬ ㍥ ⑭⓮ ⒕ ㏭ ㍦ ⑮ ⓯ ⒖ ㏮ ㍧ ⑯⓰ ⒗ ㏯ ㍨ ⑰⓱ ⒘ ㏰ ㍩ ⑱⓲ ⒙ ㏱ ㍪ ⑲⓳ ⒚ ㏲ ㍫ ㏠ ㋀ ㍙ ㉑ ㏴ ㍭ ㉛ ㏾ ㊶ 〡]
But maybe we just don't care much about the outlying items, like:
ↀ some strange non-decimal numbers
𒐴 other strange non-decimal numbers