Re: Size of Weights in Unicode Collation Algorithm

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 14 Mar 2013 23:58:14 +0000

On Thu, 14 Mar 2013 21:01:10 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> Richard Wordingham wrote:

> > ...UCA parametric tailoring topVariable="u2FD5" ...
 
> The parametric tailoring in question is "variableTop", not
> "topVariable",

Sorry.

> and it would be expressed "u00u2FD5", not "u2FD5".

No - though your being confused merits feedback. The example given
specifies variableTop by means of a *string* - the 'string value' for
the variable top. The equivalent basic syntax for variableTop =
"uXXuYYYY" is:

 & \u00XX\uYYYY < [variable top]

which is clear enough if <U+00XX, U+YYYY> has a single collation
element.
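
To make the correspondence concrete, here is a minimal Python sketch of
that reading. It assumes, as described above, that the attribute value is
a run of 'u' + hex-digit groups, each naming one code point of the string.
The helper names parse_variable_top and to_rule are hypothetical and do
not come from any CLDR or ICU API.

```python
import re

def parse_variable_top(value):
    """Split a value such as 'u21u3F' into a list of code points.

    Assumption: the value is one or more groups of 'u' followed by
    hexadecimal digits, each group naming one code point.
    """
    if not re.fullmatch(r"(?:u[0-9A-Fa-f]+)+", value):
        raise ValueError("not a run of u+hex groups: %r" % value)
    return [int(h, 16) for h in re.findall(r"u([0-9A-Fa-f]+)", value)]

def to_rule(value):
    """Render the equivalent basic-syntax rule for the same string."""
    escaped = "".join(
        # Use the long escape for supplementary code points.
        f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"
        for cp in parse_variable_top(value)
    )
    return f"& {escaped} < [variable top]"

print(to_rule("u2FD5"))      # a one-character string
print(to_rule("u00u2FD5"))   # a two-character string starting with U+0000
```

On this reading "u00u2FD5" names the two-character string <U+0000, U+2FD5>,
which is not the same thing as "u2FD5".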

> Furthermore, since UCA 6.2.0 was published, the variableTop parameter
> documentation was moved into LDML, because it is only used in CLDR,
> and isn't a part of UCA per se at all.

It is part of the 'standard UCA parametric tailoring'.

> > Although I can't find a clear official definition of the semantics
> > of 'topVariable',
 
> "variableTop" is now defined in the LDML spec. See the proposed
> update for UTS #37.

I take it you mean UTS#35.

> > I do remember being told that it simply uses the first
> > positive primary in the collation key as the maximum variable
> > weight.

> No, it isn't.

> The default value derived for variableTop from DUCET would be
> "u01uD371", because U+1D371 COUNTING ROD TENS DIGIT NINE has the
> highest variable primary weight (*15A7) in DUCET for UCA 6.2.0. (The
> first *non*-variable primary weight is 15A8 for U+02D0 MODIFIER
> LETTER TRIANGULAR COLON.)

I think you're not understanding me. Given a tailoring of DUCET or
CLDR root by variableTop="u2049" (U+2049 is EXCLAMATION QUESTION MARK)
or variableTop="u21u3F", the maximum variable primary weight would
be the primary weight of U+0021 EXCLAMATION MARK, and not the higher
primary weight of U+003F QUESTION MARK.
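
A minimal Python sketch of the rule I was told, namely that the first
positive primary weight in the string's collation elements becomes the
maximum variable weight. The table TOY_PRIMARIES and its weights are
illustrative placeholders, not values copied from DUCET or CLDR root, and
the function names are hypothetical.

```python
# Toy table: each character maps to the primary weights of its
# collation elements. Placeholder values, NOT taken from DUCET.
TOY_PRIMARIES = {
    "\u0021": [0x0260],          # EXCLAMATION MARK
    "\u003F": [0x0265],          # QUESTION MARK
    "\u2049": [0x0260, 0x0265],  # EXCLAMATION QUESTION MARK: two elements
}

def primaries(text):
    """Concatenate the primary weights for every character in text."""
    out = []
    for ch in text:
        out.extend(TOY_PRIMARIES[ch])
    return out

def variable_top_primary(string_value):
    """Return the first positive primary weight of the string value."""
    for p in primaries(string_value):
        if p > 0:
            return p
    raise ValueError("string value has no positive primary weight")

def is_variable(primary, top):
    """A primary is variable if it is non-zero and not above the top."""
    return 0 < primary <= top

top = variable_top_primary("\u2049")
print(hex(top))
print(is_variable(TOY_PRIMARIES["\u0021"][0], top))
print(is_variable(TOY_PRIMARIES["\u003F"][0], top))
```

Under this rule the one-character string U+2049 and the two-character
string <U+0021, U+003F> give the same top, and the primary of U+003F
stays non-variable.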

Let us look at the 'definition' in the proposed table in the Collation
page of UTS#35.

'Sets the string value for the variable top. All the code points with
primary strengths less than or equal to that string will be considered
variable, and thus affected by the alternate handling.'

With alternate="non-ignore":

U+3220 PARENTHESIZED IDEOGRAPH ONE <[1] U+1D371
<[1] U+2488 DIGIT ONE FULL STOP

Neither U+3220 nor U+2488 has a canonical decomposition, yet with
DUCET the primary weights of both are changed by selecting variable
weighting, though they are not reduced to zero.

I'm afraid the definition is not clear. One has to guess or ask.

> Meaningful tailorings for variableTop might move it somewhat higher
> to treat more symbols with variable weights like punctuation. But it
> wouldn't make any sense at all to try to set it to some value for a
> character with a non-variable primary weight.

Note that FractionalUCA does not define the concept of non-variable
primary weight - it assigns variable weights to fewer characters
than does DUCET, but that assignment may be overridden by tailoring.
Ignoring special collation elements, there is no expression of any
limit on which weights may be made variable.

> An implementation that
> supports a variableTop parametric tailoring would, I presume, either
> raise an exception in trying to process such an attempt, or would
> simply default back to the character with the highest actual variable
> primary weight for variableTop.

I see no basis for that. Markus reports that ICU can't accept certain
primary weights, but this depends on how many bytes are used to express
the weight, not how high it is.

> It is a meaningless tailoring in the first place.

Fortunately, it is not a *useful* tailoring.

Richard.
Received on Thu Mar 14 2013 - 19:02:29 CDT