Re: Size of Weights in Unicode Collation Algorithm

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Sat, 16 Mar 2013 09:29:07 -0700

On Sat, Mar 16, 2013 at 4:09 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> Please give an example of how the low/high split would fail. With the
> primary collation weights 20, 21, 21 80 and 22 I get the following
> primary collation weight sequences for one and two collating elements,
> marking boundaries of collating elements with commas:
>

The problem is that if you have 21 and 21 80, and another primary starts
with 80, you can't distinguish the sequence 21 | 80 from the one weight 21
80.
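
To make the ambiguity concrete, here is a minimal sketch that lists every way
a byte string can be split into known primary weights. It assumes the weights
are hex byte values, and the lone 0x80 primary is a hypothetical stand-in for
"another primary starts with 80"; the names WEIGHTS and segmentations are
made up for this illustration.

```python
# Hypothetical weight table: byte values are read as hex, and the lone 0x80
# primary is invented to stand for "another primary starts with 80".
WEIGHTS = frozenset({
    bytes([0x20]),
    bytes([0x21]),
    bytes([0x21, 0x80]),
    bytes([0x22]),
    bytes([0x80]),
})


def segmentations(key, weights=WEIGHTS):
    """Return every way to split `key` into a sequence of known weights."""
    if not key:
        return [[]]  # one way to split the empty string: no weights at all
    results = []
    for w in sorted(weights):
        if key.startswith(w):
            for rest in segmentations(key[len(w):], weights):
                results.append([w] + rest)
    return results


# The byte string 21 80 parses both as 21 | 80 and as the one weight 21 80.
ambiguous = segmentations(bytes([0x21, 0x80]))

# Without a primary that starts with 80, only one reading remains.
unambiguous = segmentations(bytes([0x21, 0x80]), WEIGHTS - {bytes([0x80])})
```

With the 0x80 primary present, `ambiguous` holds two segmentations; once it
is removed, `unambiguous` holds one.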

> For most uses, in particular, those in DUCET, the trailing units must
> not be mistakable for variable primary collation elements.

You have to know which unit is a trailing unit. I suppose you could do it
via ranges like in UTF-8, but that leaves fewer byte values available per
position, which yields longer weights and longer sort keys. It is more
efficient to get leading vs. trailing information from the data structure.
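
A rough counting model of that cost, as a sketch only: it counts distinct
two-byte weights when every usable byte value may appear in either position,
versus when the values are partitioned into disjoint leading and trailing
ranges. The figure of 254 usable byte values and the function name
two_byte_capacity are assumptions made for this illustration, not numbers
from any implementation.

```python
def two_byte_capacity(usable, lead=None):
    """Count distinct two-byte weights.

    lead=None: every usable value is allowed in both positions.
    lead=k:    k values are reserved for the leading position and the
               remaining usable - k values for the trailing position.
    """
    if lead is None:
        return usable * usable
    return lead * (usable - lead)


USABLE = 254  # assumed number of usable byte values per position

# Leading/trailing status comes from elsewhere, so no value is reserved.
unrestricted = two_byte_capacity(USABLE)

# Best possible disjoint split of the same values into lead and trail ranges.
best_split = max(two_byte_capacity(USABLE, k) for k in range(USABLE + 1))
```

In this model even the best range split (an even one) allows only a quarter
as many two-byte weights, so more weights have to grow to three bytes.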

markus
Received on Sat Mar 16 2013 - 11:35:07 CDT
