Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 14 Mar 2013 23:09:11 +0000

On Thu, 14 Mar 2013 14:49:18 -0700
Markus Scherer <markus.icu_at_gmail.com> wrote:

> However, it does not make a lot of sense to set the variable top to
> something above the currency symbols range -- it's basically an
> option for an "ignore punctuation" mode, and you wouldn't want to
> ignore nearly every assigned character in Unicode.

There are a lot of characters in the SIP! While variableTop="u2FD5"
would probably be a mistake or a mischievous experiment, some might be
tempted to blot out all non-Han characters! I don't think there is a
real problem yet, but it is an annoying fact that there can be a
difference depending on whether one uses 16- or 32-bit weights. The
good news is that there is a solution, namely to introduce fractional
weights to the allkeys format under the headings of 'large weights' and
'escape hatch'.
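
For anyone following along, the effect of the variable top setting can be sketched roughly as below. This is only an illustration: the function name and the weights are made up, not DUCET values, and it assumes the usual rule that a collation element is variable when its primary weight is non-zero and no greater than the variable top.

```python
# Hypothetical primary weights, chosen only for illustration.
PRIMARY_PUNCTUATION = 0x0300
PRIMARY_CURRENCY = 0x0500
PRIMARY_LATIN_LETTER = 0x1500


def is_variable(primary, variable_top):
    """Treat an element as variable if 0 < primary <= variable_top."""
    return 0 < primary <= variable_top


# With the variable top just above punctuation, only punctuation is variable.
top = PRIMARY_PUNCTUATION
print(is_variable(PRIMARY_PUNCTUATION, top))   # punctuation
print(is_variable(PRIMARY_LATIN_LETTER, top))  # a letter
```

Raising the variable top past the letters would make them variable as well, which is the "ignore nearly everything" outcome being warned about.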

The issue first occurred to me when I realised a minor threat to the
large weight scheme, for which there's a notional 50,000-character
example in UTS#10. The threat is the rising number of variable codepoints
(stretching terminology) encoded in Unicode. The highest variable
weight in DUCET has risen even faster than the number of variable
codepoints. The implicit weights steer well clear of this problem by
only taking blocks of 32,768 characters for each initial primary
weight.
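
To make the 32,768-character blocks concrete, here is a minimal sketch of the implicit weight derivation as I understand it from UTS#10: the first primary is a base plus the code point shifted right by 15 bits, and the second carries the low 15 bits with the top bit set. The function and constant names are my own, and choosing the base needs property data (Unified_Ideograph and block), so the sketch leaves that to the caller.

```python
# Bases for the first implicit primary weight, per UTS #10.
# Which one applies depends on character properties not modelled here.
BASE_CORE_HAN = 0xFB40   # Unified_Ideograph in the main or compatibility block
BASE_OTHER_HAN = 0xFB80  # other Unified_Ideograph (extensions)
BASE_OTHER = 0xFBC0      # any other code point


def implicit_primaries(cp, base):
    """Return the two 16-bit primary weights (AAAA, BBBB) for a code point."""
    aaaa = base + (cp >> 15)       # one value per block of 32,768 code points
    bbbb = (cp & 0x7FFF) | 0x8000  # low 15 bits, forced non-zero via the top bit
    return aaaa, bbbb


for cp, base in [(0x4E00, BASE_CORE_HAN), (0x20000, BASE_OTHER_HAN)]:
    aaaa, bbbb = implicit_primaries(cp, base)
    print(f"U+{cp:04X} -> [{aaaa:04X}] [{bbbb:04X}]")
```

Since the largest code point, U+10FFFF, shifts down to 0x21, each base needs only 34 consecutive first weights, all of them up in the FBxx range.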

> However, we have agreed to replace the
> hard-to-use variableTop attribute with an easy-to-use maxVariable
> attribute, so this whole discussion will become moot at that point:
> http://unicode.org/cldr/trac/ticket/5016

Actually, you've only proposed deprecating it.

Richard.
Received on Thu Mar 14 2013 - 18:15:03 CDT
