Re: PRI #203: Proposed Update UTS #10: Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 7 Sep 2011 05:14:51 +0200

I've just tried this idea, and this is working magically as I wanted.
It generates extremely compact collation keys, notably if I no longer
restrict collation levels so that they use a limited range: instead I
explicitly use a level separator, to which I assign the final range of
rationals [7/16 .. 16/16], that I can represent as 4 bits (1111).

Then each level can use the collation weights in the half-open range
[0/16 .. 7/16[, which can be splitted in as many elements as needed. I
have also used the [0/16 .. 1/16[ half-open range for the weight
representing ignorable elements: this will use 4 bits(0000) in the
collation keys. This leaves the half-open range [1/16 .. 7/16[ for
representing all non-ignorable weights for this level.

For the primary level, I can predefined fixed ranges of rationals for
representing variable elements, whitespaces, symbols, punctuations,
numbers, and then letters for each script (also with fixed ranges for
each script).

Loading tailorings becomes extremely easy when using rationals, the
algorithms become MUCH simpler.

And finally, once you have assigned rationals to all collation
elements for a specific tailoring (including for the DUCET itself), it
is an extremely easy transform to infer integer weights. This means
that you no longer need to use gaps in the sequences of default
weights, or to given them special meanings. But my opinion is that
this representation using integers is much more difficult to compress
in collation keys.

In fact, in a specific locale, where collation elements have some good
estimation of the frequency of use of collation elements, it is easy
to assign them an associated fixed rational, and to place all other
collation elements by spliting the intervals as needed to fit. And you
get extremely compact collation keys represented eigher by Huffman
encoding (or by arithmetic codding if you have a licence for it from
IBM, but this does not reduces much the length of collation keys).

Philippe.

2011/9/7 Philippe Verdy <verdy_p_at_wanadoo.fr>:
> 2011/8/31  <announcements_at_unicode.org>:
>> PRI #203    Proposed Update for UTS #10: Unicode Collation Algorithm
>> http://www.unicode.org/review/pri203/
>
> I just have had an idea, which may be interesting. Currently, the
> collation weights defined in the DUCET for each level are maintained
> as a list of integers only.
>
> The main problem is that at each revision of the UCS (when characters
> get added), or for the need to correct the relative weights of
> characters, the collation weights need to be renumbered (notably for
> the primary level, but in fact as well for the subsequent levels, if
> there's a need to make more numeric space available for higher
> collation levels that normally have lower collation weight values).
>
> Why not using a representation not based on integers, but instead on
> *rationals* ? We can indefinitely split a rational interval without
> having to renumber all other nearby collation elements at the same or
> lower collation levels.
>
> And when designing a tailoring, instead of complex renumbering
> strategies, we can as well subdivide intervals into as many  rational
> subintervals as we want (just choose the denominator of the inserted
> rationals as the product of the denominator D for the existing level
> you want to split, by the number N of (element to insert, plus one),
> and then you can insert between the rational weights (x)/(D) and
> (X+1)/(D), whose value does not have to be changed, the (N-1) new
> distinct rational collation weights (XN+1)/(ND) to (XN+N-1)/(ND).
>
> Let's just remember that the absolute value of collation weights does
> not matter, only their relative order in the sequence for any specific
> collation level (as well, if collation weights are allocated so that
> you don't need to insert a level separator in the generated collation
> keys, the set of collation weights for level N+1 just has to be fully
> lower than the set of collation weights for level N. And rational
> numbers perfectly fit this need.
>
> Once you have computed all rationals for all levels needed for a
> specific tailoring (or even for the DUCET) a very basic integer
> counting of the ordered sequence of rationals can convert them into
> compact integers that are then easy to insert in collation keys.
>
> Using rationals would also have some benefits: it allows compression
> according to statistics of use: you can start by definining an
> interval of integers (i.e. rationals with denominator 1) from 0 to
> 255: you'll split the sequence of all collation weights according to
> their relative frequency. For more collation weights between two
> elements, count the number you need to insert, compute the new
> denominator, and insert them there.
>
> More formally, you can just start by the rational inclusive interval
> [0/1,1/1] which contains the full sequence of all possible collation
> weights. The position at which you'll split this interval is
> arbitrary, so you can use technics similar to the Arithmetic coding
> (used for data compression), or to the Huffmann coding, to split it
> conveniently with a varying number of bits for each subdivision, based
> on the estimated statistics of use.
>
> With this concept, you'll be able to build extremely compact collation
> keys that will use variable numbers of bits): just append those
> sequences of bits, level by level for all collation elements, then add
> arbitrary padding null bits to fill the collation key into an integer
> number of bytes if you want.
>
> In addition, you'll be able to define *very stable* collation weights
> in the DUCET, now represented as rationals instead of just integers.
>
> Notably, you'll be able to assign *permanent* collation weights for
> the pseudo-collation elements that represent the script separation.
>
> What do you think of this marvelous idea ?
>
> --Philippe.
>
Received on Tue Sep 06 2011 - 22:17:59 CDT

This archive was generated by hypermail 2.2.0 : Tue Sep 06 2011 - 22:17:59 CDT