From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 02 2010 - 18:43:58 CDT
Philippe Verdy said:
> Implicit weights for unassigned code points and other characters that
> are NOT ill-formed are suboptimal, as noted in the proposed update.
To follow up on Mark's response on this thread...
>
> It should take into account their existing default properties, notably :
[ long list snipped: includes surrogates, noncharacters, and
unassigned characters in various ranges ]
> Currently, if the Unicode scalar value (or invalid code unit) is NNNN
> (unsigned 32-bit value), then they are treated as expansions to
> ignorable collation elements:
> [.0000.0000.0000.NNNN]
That statement is incorrect. The UCA currently specifies that
ill-formed code unit sequences and *noncharacters* are mapped
to [.0000.0000.0000.], but unassigned code points are not.
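To make that distinction concrete, here is a minimal Python sketch. The function names are hypothetical and appear in no specification or library. The noncharacter set is the 66 code points fixed by the Unicode Standard. Treating a lone surrogate as ill-formed assumes UTF-16 input, and the helper only applies to code points with no explicit entry in the table.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points fixed by the Unicode Standard."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def default_treatment(cp: int) -> str:
    """Hypothetical helper: default treatment, as described in this message,
    of a code point that has no explicit entry in the table."""
    if 0xD800 <= cp <= 0xDFFF:
        # An unpaired surrogate is an ill-formed code unit sequence.
        return "completely ignorable"
    if is_noncharacter(cp):
        return "completely ignorable"  # [.0000.0000.0000.]
    # Unassigned code points are NOT ignorable; they get implicit weights.
    return "implicit weights"
```

For example, `default_treatment(0xFFFE)` falls in the ignorable branch, while an unassigned code point such as U+0378 falls through to the implicit-weight branch.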
> If we want to be smarter, we should not treat ALL the cases above as
> fully ignorable at the first three levels, and should get primary
> weights notably:
Hmmm, if we want to be smarter, we should read what the actual
specification says.
> > 5. Unassigned code points that are in allocated blocks for
> > non-Sinographs, non-Special, and with default RTL directionality
> > (in the BMP or SMP).
> > 6. Unassigned code points that are in allocated blocks for
> > non-Sinographs, non-Special, and with default RTL directionality
> > (in the BMP or SMP).
> >> When they will be allocated, most of them will NOT be fully ignorable,
> >> and it's probably best to give them appropriate implicit primary weights
They already are, but...
> so that they with primary weights lower than those used for
> characters in the same block, but still higher than encoded characters
> from other blocks have that lower primary weights than assigned
> characters in the block. Gaps should be provided in the DUCET at
> the beginning of ranges for these blocks so that they can all fit
> in them. The benefit being also that other blocks after them will
> keep their collation elements stable and won't be affected by the
> new allocations in one block.
That particular way of assigning implicit weights for unassigned
characters would be a complete mess to implement for the default
table.
A. It would substantially increase the size of the default table
for *all* users, because it would assign primary weights for
all unassigned code points inside blocks -- code points which
now simply get implicit weights assigned by rule.
B. The assumptions about better default behavior are erroneous,
because they presuppose things which are not necessarily true. In
particular, the main goal appears to be to assure well-behavedness
for future additions on a per-script basis, since primary weight
orders are relevant to scripts. However, several of the most important
scripts are now, for historical reasons, encoded in multiple
blocks. A rule which assigns default primary weights on a per
block basis for unassigned characters would serve no valid purpose
in such cases.
C. In addition to randomizing primary weight assignments for
scripts in the case of multiple-block scripts, such a rule would
also introduce *more* unpredictability in cases of the punctuation
and symbols which are scattered around among many, many blocks,
as well.
In general this proposal fails to understand that the default
weights for DUCET (as expressed in allkeys.txt) have absolutely
nothing whatsoever to do with block identities or block
ranges in the standard. The weighting algorithm knows absolutely
nothing about block values.
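As a sketch of what "assigned by rule" means, the derivation in UTS #10 computes two primary weights directly from the code point value. The function name below is hypothetical. The BASE value 0xFBC0 is the one the specification gives for code points that are not Han ideographs (Han ideographs use 0xFB40 or 0xFB80 instead). The collation element shapes in the comments follow the notation used in this thread.

```python
def implicit_primary_weights(cp: int, base: int = 0xFBC0) -> tuple[int, int]:
    """Derive the pair of implicit primary weights (AAAA, BBBB) for a code point.

    Hypothetical helper name; the arithmetic is the rule from UTS #10:
        AAAA = BASE + (CP >> 15)
        BBBB = (CP & 0x7FFF) | 0x8000
    giving the elements [.AAAA.0020.0002.][.BBBB.0000.0000.].
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


# An unassigned code point such as U+0378 needs no table entry at all:
aaaa, bbbb = implicit_primary_weights(0x0378)
print(f"[.{aaaa:04X}.0020.0002.][.{bbbb:04X}.0000.0000.]")
```

Nothing in the calculation looks up a block; for non-Han code points the only input is the numeric code point, which is why such code points cost nothing in the size of the table.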
> The other categories above (for code units exceeding the range of
> valid scalar values if they are not treated as errors, or for code
> points with valid scalar values and assigned to non-characters if they
> are not treated as errors, or for code points with valid scalar values
> assigned or reserved in the special supplementary plane) can be kept
> as fully ignorable, using null weights on the (fully ignorable) first
> three levels, and the implicit (last level) weights for scalar value
> or code unit binary weights.
Except that such treatment is not optimal for the noncharacters.
As noted in the review note in the proposed update for UTS #10,
noncharacters should probably be given implicit weights, rather
than being treated as ignorables by default. That is a proposed
change to the specification.
> Note that valid PUAs are not concerned here: they are not in the
> DUCET, even if they are subject to possible private tailorings to make
> them fully ignorable or use any other weights (including with
> contractions or expansions). Without such known private convention,
> they should still be treated as fully ignorable (using the implicit
> weights for the last level sorting by scalar values).
No, they should not.
> But the UCA
> algorithm completely forgets to speak about them, so it treats them
> with BASE=0xFBC0, giving non-zero primary weights and making them sort
> after all Sinographs and before 'Trailing weights'...
Correct. But that is by design -- not because the algorithm completely
forgets to speak about them.
Although I agree that it would be a good idea to call the PUA
out explicitly as subject to the implicit weighting, so that
people are not unclear about this.
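A small sketch of the resulting order, for anyone who wants to check it. The names are hypothetical. The BASE values 0xFB40 and 0xFBC0 are those given in UTS #10. The value 0xFC00 for the start of the trailing weights is an assumption based on my reading of the weight ranges in the specification.

```python
IMPLICIT_BASE_HAN = 0xFB40    # Han ideographs in the core CJK blocks
IMPLICIT_BASE_OTHER = 0xFBC0  # any other code point without an explicit entry
TRAILING_START = 0xFC00       # assumed first trailing primary weight


def implicit_primary(cp: int, base: int) -> tuple[int, int]:
    """Implicit primary weight pair (AAAA, BBBB), per the UTS #10 rule."""
    return base + (cp >> 15), (cp & 0x7FFF) | 0x8000


han = implicit_primary(0x4E00, IMPLICIT_BASE_HAN)          # a Han ideograph
pua = implicit_primary(0xE000, IMPLICIT_BASE_OTHER)        # first BMP PUA code point
last_pua = implicit_primary(0x10FFFD, IMPLICIT_BASE_OTHER) # last Plane 16 PUA code point

# Tuples compare element by element, mirroring a primary-level comparison.
print(han < pua)                     # the PUA code point sorts after the ideograph
print(last_pua[0] < TRAILING_START)  # and still below the trailing weights
```

Even the highest private-use code point gets a first weight of 0xFBE1, so the whole implicit range stays below 0xFC00.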
--Ken