Re: Size of Weights in Unicode Collation Algorithm

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 17 Mar 2013 03:14:41 +0100

2013/3/16 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> If you start with my start = low, continuation = high scheme, you can
> convert it in an order-preserving manner to a no-prefix scheme by
> the following simple transform:
>
> If a simple weight precedes a continuation weight, add 0·8 ('·' is
> serving as the hexadecimal point) to it.
>
> Thus 21, 21 81 and 21 81 81 become 21, 21·8 81 and 21·8 81·8 81. If
> you don't like semi-integral values, discard the high bit and double,
> yielding 42, 43 02 and 43 03 02. You may recognise your non-final v.
> final scheme! (I replaced '80' by '81' to avoid confusing zeroes.)
>
> If you're still not convinced, please show me what goes wrong.

OK, so you're adding one extra bit, i.e. you're changing your
start/continuation encoding scheme into a different encoding that
suppresses codes that are prefixes of other codes.
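The quoted transform can be sketched as follows (a minimal sketch; `make_prefix_free` is a hypothetical name, and I assume the convention from the quoted text: start weights are below 0x80, continuation weights are 0x80 or above, and the "add 0·8, discard the high bit and double" steps collapse into one shift-and-flag operation per weight):

```python
def make_prefix_free(weights):
    """Convert a start=low/continuation=high weight sequence into a
    prefix-free, order-preserving one: discard the high bit, double,
    and set the new low bit on any weight that is followed by a
    continuation weight (the '0.8' half-step of the quoted scheme)."""
    out = []
    for i, w in enumerate(weights):
        followed = i + 1 < len(weights) and weights[i + 1] >= 0x80
        out.append(((w & 0x7F) << 1) | (1 if followed else 0))
    return out
```

This reproduces the worked example from the quote: `[0x21]` maps to `[0x42]`, `[0x21, 0x81]` to `[0x43, 0x02]`, and `[0x21, 0x81, 0x81]` to `[0x43, 0x03, 0x02]`.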

Exactly what I described (you use a variable number of bits), except
that your scheme is highly suboptimal compared to Huffman coding or
optimal arithmetic coding. For those, you can compute frequency
statistics (precomputed from an initial text corpus in the language
for which the collation will be used, and filled with extra
low-frequency entries representing the weights of all other collation
elements not found in the corpus) in order to generate the dictionary
that maps ordered weights to bit sequences with the same relative
order.
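An order-preserving prefix code built from such frequency statistics can be sketched like this (a sketch under assumptions: `alphabetic_code` is a hypothetical name, and the balanced-split construction below is a simple Shannon-Fano-style approximation of the optimal order-preserving construction, which is the Hu-Tucker algorithm, not Huffman coding proper, since plain Huffman does not preserve symbol order):

```python
def alphabetic_code(freqs):
    """Build an order-preserving (alphabetic) prefix code.

    freqs: list of (symbol, frequency) pairs, already sorted in the
    desired collation-weight order. The range is split where the
    cumulative frequency is most balanced; the lower half gets a '0'
    bit and the upper half a '1' bit, so code order follows symbol
    order by construction."""
    codes = {}

    def build(symbols, prefix):
        if len(symbols) == 1:
            codes[symbols[0][0]] = prefix or "0"
            return
        total = sum(f for _, f in symbols)
        acc, best_i, best_diff = 0, 1, float("inf")
        for i in range(1, len(symbols)):
            acc += symbols[i - 1][1]
            diff = abs(total - 2 * acc)
            if diff < best_diff:
                best_i, best_diff = i, diff
        build(symbols[:best_i], prefix + "0")
        build(symbols[best_i:], prefix + "1")

    build(list(freqs), "")
    return codes
```

Frequent weights get short bit patterns, rare ones get longer patterns, and no code is a prefix of another, so the resulting bit sequences sort in the same relative order as the original weights.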

With this dictionary (statically generated for the collator, so it is
not a problem when computing individual collation elements or
comparing texts) you can then directly convert all weights into codes
without having to check for possible codes which may be prefixes of
another.

Note that Huffman and arithmetic coding generate a dictionary of
patterns with a variable number of bits. If you're not generating
collation keys, but just comparing two texts, the bit stuffing would
not be needed: you would just compare the values of the codes coming
from the dictionary. These codes should then be stored in the
dictionary as left-aligned integers, padded with zero bits to fill the
code unit. The dictionary will also indicate how many bits are needed
when generating collation keys, but this information can be safely
ignored when just comparing two plain texts.
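The left-aligned storage can be sketched as follows (a minimal sketch; `left_aligned` is a hypothetical name and the 16-bit code unit is an assumption for illustration):

```python
def left_aligned(code_bits, unit_bits=16):
    """Pack a variable-length bit pattern (a '0'/'1' string) into a
    fixed-size code unit, left-aligned and padded with zero bits.
    Because the codes are prefix-free, comparing the padded units as
    plain integers gives the same result as comparing the raw
    variable-length codes bit by bit."""
    assert len(code_bits) <= unit_bits
    return int(code_bits, 2) << (unit_bits - len(code_bits))
```

For example, the prefix-free codes "00" < "010" < "011" < "1" become the units 0x0000, 0x4000, 0x6000 and 0x8000, which sort identically as integers; two weight streams can therefore be compared unit by unit, and the bit length stored in the dictionary matters only when concatenating bits into actual collation keys.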

I've used this technique to generate extremely small collation keys,
which are of the same order of storage size as the original plain text
(UTF-8 encoded), but in fact almost always shorter than the plain
text!

If you use this encoding with the collation level set to its maximum
(which preserves all code point distinctions), this encoding becomes
another encoding of plain text, fully reversible to UTF-8, except that
it is binary sortable. This means that the encoding can also be used
to store the plain text itself, without needing to duplicate it in the
store alongside the collation keys (provided that the static
dictionary is preserved): storing the collation keys is enough. The
scheme becomes a valid UTF, but one that is more compact than UTF-8
and binary-sortable.
Received on Sat Mar 16 2013 - 21:20:43 CDT

This archive was generated by hypermail 2.2.0 : Sat Mar 16 2013 - 21:20:45 CDT