
CLDR Ticket #7382 (closed enhancement: fixed)

Opened 3 years ago

Last modified 3 years ago

collation: reorder single scripts

Reported by: markus
Owned by: markus
Component: uca
Data Locale:
Phase: rc
Review: mark
Weeks: 0.4
Data Xpath:
Xref:

Description

I propose that we enable single-script reordering, rather than reordering scripts in the current groups. This would solve a few problems, at minimal cost.

None of this changes anything about the space, punct, symbol, currency, digit, Latn, Hani reordering groups.

We currently take the Unicode scripts (alphabets etc.) in DUCET order, declare each Recommended Script as a sort of "anchor script", and create groups of scripts such that each group starts with such an anchor script. We give each group one primary-weight lead byte. We document script reordering as building a permutation of primary lead bytes.

We group scripts together because there are too many Unicode scripts to give each one a whole lead byte, and a lead byte permutation is very simple.
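To make the current mechanism concrete, here is a minimal sketch of group reordering as a permutation of primary lead bytes. It assumes 32-bit primary weights with the lead byte in the top 8 bits. The group names and byte values are hypothetical and are not taken from FractionalUCA.txt, and the function names are made up for illustration.

```python
def build_lead_byte_permutation(groups, new_order):
    """Build a 256-entry table mapping old lead byte -> new lead byte.

    groups: list of (name, first_lead_byte, last_lead_byte) in default order,
            assumed contiguous (hypothetical values, for illustration only).
    new_order: the same group names in the desired order.
    """
    table = list(range(256))  # bytes outside the groups map to themselves
    by_name = {name: (first, last) for name, first, last in groups}
    next_byte = groups[0][1]  # reordered groups reuse the same overall byte range
    for name in new_order:
        first, last = by_name[name]
        for b in range(first, last + 1):
            table[b] = next_byte
            next_byte += 1
    return table


def reorder_primary(primary, table):
    """Permute the lead byte of a 32-bit primary; keep the lower 24 bits."""
    lead = primary >> 24
    return (table[lead] << 24) | (primary & 0x00FFFFFF)


# Hypothetical groups: "A" spans three lead bytes, "B" and "C" one each.
groups = [("A", 0x2A, 0x2C), ("B", 0x2D, 0x2D), ("C", 0x2E, 0x2E)]
table = build_lead_byte_permutation(groups, ["B", "A", "C"])
```

With this table, every primary in group "B" sorts before every primary in group "A", and all scripts inside one group necessarily move together, which is the limitation this ticket is about.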

Issues:

  • The DUCET script order, together with the set of Recommended Scripts, causes very imbalanced groups of scripts (see the "top_byte" table in FractionalUCA.txt). The largest ones tend to "overflow", requiring splits at not-Recommended Scripts (we added Cherokee as an anchor script in CLDR 24), or smaller gaps between primaries than we would like.
  • More scripts will be added to Unicode, so we will have to revisit this again.
  • Several Recommended Scripts get a whole lead byte but have only a small number of primary weights.
  • Some of the groups contain unrelated scripts.
  • We like to group related scripts together, so that they move together; on the other hand, one might prefer a different order among related scripts (e.g., among the Philippine scripts), which is not currently possible.
  • It is difficult to come up with a script order that is much "better", because relationships between scripts are complicated, and the Recommended Scripts are not the best anchors from a relatedness perspective.

If we reorder single scripts, then we do not need to justify the groups, we can freely allocate appropriate portions of the primary weight space, we do not need to keep "related" scripts next to each other (or figure out what "related" means), and we do not need to care about the default order of scripts. Usability and documentation would be simpler.

In FractionalUCA.txt, I propose that we use whole bytes for a few very common scripts, and allocate one or more sixteenths of a lead byte for each of the other scripts. Script reordering would index by the top 12 primary bits. (This can be a small table by using a single offset value for whole lead bytes, and 16 values only for split bytes that do not all move by the same offset.)
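The 12-bit indexing could look roughly like the following sketch. It assumes 32-bit primaries, so the top 12 bits are the lead byte plus the next 4 bits (one "sixteenth"). The table layout, the names, and the sample lead bytes are hypothetical: one signed offset per lead byte, plus a 16-entry list only for lead bytes whose sixteenths move by different amounts. All offsets are expressed in units of sixteenths.

```python
def reorder_primary_12(primary, byte_offsets, split_offsets):
    """Reorder a 32-bit primary by adding an offset to its top 12 bits.

    byte_offsets: 256 signed offsets, one per lead byte (units: sixteenths).
    split_offsets: dict lead_byte -> 16 signed offsets, only for split bytes.
    """
    index12 = primary >> 20          # lead byte plus the next 4 bits
    lead = index12 >> 4
    sixteenth = index12 & 0xF
    if lead in split_offsets:
        offset = split_offsets[lead][sixteenth]
    else:
        offset = byte_offsets[lead]  # the same offset for the whole lead byte
    return (((index12 + offset) & 0xFFF) << 20) | (primary & 0xFFFFF)


# Hypothetical layout: two scripts share lead byte 0x60, one in sixteenths
# 0..7 and one in sixteenths 8..15, and we swap them. Lead bytes 0x61 and
# 0x62 hold one whole-byte script each and are swapped as well.
byte_offsets = [0] * 256
byte_offsets[0x61] = 16    # move up by one whole lead byte
byte_offsets[0x62] = -16   # move down by one whole lead byte
split_offsets = {0x60: [8] * 8 + [-8] * 8}
```

Only the split lead bytes need the 16-entry lists, so the table stays small as long as most scripts move by whole bytes.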

For an implementation (like ICU) that writes sort keys as byte sequences, the reordering offset needs to be by whole bytes to avoid problems (with single-byte primaries, primary compression, and sort key byte validity). Reordering partial-byte scripts can be done by splitting the scripts that share such lead bytes, for which a small number of lead bytes would be reserved. Reorderings could not be completely arbitrary in that case, but it would be much more flexible than reordering whole groups.

Some scripts that currently use less than a sixteenth of a lead byte would use more space, but that is balanced by reducing some small scripts from whole bytes to a few sixteenths. (We would continue to use two-byte primary weights for almost all of the Recommended Scripts that use them now.)

Attachments

Change History

comment:1 Changed 3 years ago by emmons

  • Owner changed from anybody to markus
  • Priority changed from assess to medium
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 26rc

comment:2 Changed 3 years ago by markus

  • Milestone changed from 26rc to 27rc

comment:3 Changed 3 years ago by markus

  • Phase set to rc
  • Milestone changed from 27rc to 27

comment:5 Changed 3 years ago by markus

  • Status changed from assigned to reviewing
  • Review set to mark

Scripts can start on any two-byte boundary. High-frequency scripts use whole lead bytes, for fast lead byte permutation. ICU will support split lead bytes via a list of primary-weight ranges.
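As an illustration only (this is not ICU's actual data structure or code), reordering via a list of primary-weight ranges with whole-byte offsets might be sketched as below. It assumes 32-bit primaries. The range limits, the offsets, and the reserved lead byte 0x62 are hypothetical. Each range gets a signed offset counted in whole lead bytes, so the lower bytes of a primary are never modified.

```python
import bisect


def reorder_by_ranges(primary, limits, byte_offsets):
    """Shift a 32-bit primary by the whole-lead-byte offset of its range.

    limits: ascending exclusive upper bounds of the primary ranges.
    byte_offsets: byte_offsets[i] is the signed lead-byte offset for range i.
    Primaries at or above the last limit are returned unchanged.
    """
    i = bisect.bisect_right(limits, primary)
    if i == len(limits):
        return primary
    return primary + (byte_offsets[i] << 24)


# Hypothetical layout: scripts A and B share lead byte 0x60 (A in the lower
# half, B in the upper half), script C owns lead byte 0x61, and lead byte
# 0x62 is assumed to be reserved for splitting. Desired order: A, C, B.
limits = [0x60000000, 0x60800000, 0x61000000, 0x62000000]
byte_offsets = [0, 0, 2, 0]  # below A, A, B (moved up into 0x62), C

a = reorder_by_ranges(0x60123456, limits, byte_offsets)
b = reorder_by_ranges(0x60923456, limits, byte_offsets)
c = reorder_by_ranges(0x61123456, limits, byte_offsets)
```

Script B is split off from the byte it shared with A by moving it into the reserved lead byte, so the reordered primaries sort as A, C, B while every offset remains a whole number of lead bytes.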

comment:6 Changed 3 years ago by markus

Note: Before Unicode 5.2, FractionalUCA.txt always used a whole primary lead byte per script. If script reordering had been specified at that time, it would have naturally reordered single scripts rather than groups.

comment:7 Changed 3 years ago by markus

For further notes on script allocation and implementation, see IcuBug:11449.

comment:8 Changed 3 years ago by mark

  • Status changed from reviewing to closed
  • Resolution set to fixed