RE: Size of Weights in Unicode Collation Algorithm

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Thu, 14 Mar 2013 21:01:10 +0000

Richard Wordingham wrote:
 
> Actually, there is a subtle and nasty difference, but probably one that
> will very rarely strike practical use. Its most obvious manifestation
> is in the application of the UCA parametric tailoring
> topVariable="u2FD5". U+2FD5 KANGXI RADICAL FLUTE is the last symbol in
> UnicodeData.txt by collating order and has a compatibility
> decomposition to U+9FA0 and therefore the same primary weights.

I really have no idea what you are on about here.

The parametric tailoring in question is "variableTop", not "topVariable", and it would be
expressed "u00u2FD5", not "u2FD5". But U+2FD5 has *never* been a variable in
the UCA DUCET tables, anyway. Furthermore, since UCA 6.2.0 was published,
the variableTop parameter documentation was moved into LDML, because it
is only used in CLDR, and isn't a part of UCA per se at all.

> Although I can't find a clear official definition of the semantics of
> 'topVariable',

"variableTop" is now defined in the LDML spec. See the proposed update for UTS #35.
 
> I do remember being told that it simply uses the first
> positive primary in the collation key as the maximum variable weight.

No, it isn't.

The default value derived for variableTop from DUCET would be "u01uD371", because
U+1D371 COUNTING ROD TENS DIGIT NINE has the highest variable primary
weight (*15A7) in DUCET for UCA 6.2.0. (The first *non*-variable primary weight
is 15A8 for U+02D0 MODIFIER LETTER TRIANGULAR COLON.)

For the CLDR root collation, that value should be reset to "u01u11C7" for U+111C7
SHARADA ABBREVIATION SIGN, which has the highest variable primary weight for
a *punctuation* mark (*040E). The first variable primary weight for *symbols*
is *040F for U+0060 GRAVE ACCENT.
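
The three variableTop strings quoted in this message ("u00u2FD5", "u01uD371",
"u01u11C7") all fit one pattern: the code point's bits above the low 16 as two
hex digits, then the low 16 bits as four hex digits, each prefixed with "u".
The following is a minimal sketch of that pattern only. The function name
variable_top_string is hypothetical, and the format is inferred from these
three examples rather than taken from the LDML spec, which remains the
authority on the actual syntax.

```python
def variable_top_string(code_point: int) -> str:
    """Format a code point the way the examples in this message are written.

    Assumption: the pattern is "u" + high part (2 hex digits) + "u" + low
    16 bits (4 hex digits). This is inferred from three examples only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Split into the bits above the low 16 and the low 16 bits themselves.
    high, low = divmod(code_point, 0x10000)
    return f"u{high:02X}u{low:04X}"


if __name__ == "__main__":
    for cp in (0x2FD5, 0x1D371, 0x111C7):
        print(f"U+{cp:04X} -> {variable_top_string(cp)}")
```

The sketch reproduces the three strings above and makes no claim about any
other input.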

Meaningful tailorings for variableTop might move it somewhat higher to treat
more symbols with variable weights like punctuation. But it wouldn't make any
sense at all to try to set it to some value for a character with a non-variable
primary weight. An implementation that supports a variableTop parametric
tailoring would, I presume, either raise an exception when asked to process
such a setting, or simply fall back to the character with the highest actual
variable primary weight as variableTop.
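
The two behaviors presumed above can be sketched as follows. This is an
illustration under stated assumptions, not how any particular implementation
works: the Entry type, the SAMPLE_TABLE dictionary, and the
resolve_variable_top function with its strict flag are all hypothetical. The
table holds only the two UCA 6.2.0 weights quoted earlier in this message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One simplified table row: a code point and its first primary weight."""
    code_point: int
    primary: int
    variable: bool  # True when the row is marked '*' (variable)


# Hypothetical two-row table using the values quoted in this message.
SAMPLE_TABLE = {
    0x1D371: Entry(0x1D371, 0x15A7, True),   # highest variable primary
    0x02D0: Entry(0x02D0, 0x15A8, False),    # first non-variable primary
}


def resolve_variable_top(code_point, table, strict=True):
    """Return the primary weight to use as variableTop.

    If the requested character is a variable entry, its primary is used.
    Otherwise either raise (strict=True) or fall back to the highest
    variable primary actually present in the table (strict=False).
    """
    entry = table.get(code_point)
    if entry is not None and entry.variable:
        return entry.primary
    if strict:
        raise ValueError(
            f"U+{code_point:04X} is not a variable entry in this table"
        )
    # Fallback: the highest primary among entries marked variable.
    return max(e.primary for e in table.values() if e.variable)


if __name__ == "__main__":
    print(hex(resolve_variable_top(0x1D371, SAMPLE_TABLE)))
    print(hex(resolve_variable_top(0x02D0, SAMPLE_TABLE, strict=False)))
```

Either branch avoids silently accepting a character whose primary weight is
not variable, which is the case described above as making no sense.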

> Now in allkeys.txt, U+2FD5 expands to two collation elements. However,
> in FractionalUCA.txt, which specifies 32-bit (fractional) weights, it
> has a single collation element. Consequently, the effect of this
> tailoring will be different depending on how the collation elements are
> expressed!

It is a meaningless tailoring in the first place.

--Ken

> Richard.
Received on Thu Mar 14 2013 - 16:05:43 CDT
