From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Aug 01 2010 - 13:54:49 CDT
Implicit weights for unassigned code points and other characters that
are NOT ill-formed are suboptimal, as noted in the proposed update.
They should take into account their existing default properties, notably:
> 1. Code units exceeding the valid range for code points with scalar values (such as 0x110000 or 0xFFFFFFFF, when handling invalid UTF-32), if they are not handled as errors for collation.
> 2. Code points with valid scalar values that are permanently assigned to non-characters, if they are not handled as errors for collation:
> > 2.1. surrogates; or:
> > 2.2. other scalar values (such as U+FFFF).
> 3. Their presence in a block or plane assigned to Sinographs ("Unified Ideographs"), either:
> > 3.1. Unassigned code points that are in allocated "Core" Sinographs blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP), or:
> > 3.2. Unassigned code points that are in allocated "Other" Sinographs blocks or planes (currently, "CJK Unified Ideographs Extension A" in the BMP, and all reserved code points in the SIP).
> 4. Unassigned code points in ranges allocated to "Special" characters (notably in the Supplementary Special-purpose Plane (SSP), starting at U+E0000).
> 5. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default RTL directionality (in the BMP or SMP).
> 6. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default LTR directionality (in the BMP or SMP).
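Categories 1 and 2 in the list above can be detected by arithmetic alone, without any block or property data. A minimal sketch in Python, where the function name `classify` and the returned labels are illustrative assumptions and not part of any existing API; categories 3 to 6 need block and directionality data and are deliberately left out:

```python
def classify(value: int) -> str:
    """Classify an unsigned 32-bit code unit into categories 1, 2.1 or 2.2.

    Hypothetical helper: the labels only mirror the numbering used above.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if value > 0x10FFFF:
        # Beyond the last code point, e.g. 0x110000 in invalid UTF-32.
        return "1: out of range"
    if 0xD800 <= value <= 0xDFFF:
        return "2.1: surrogate"
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points of
    # every plane (low 16 bits equal to FFFE or FFFF).
    if 0xFDD0 <= value <= 0xFDEF or (value & 0xFFFE) == 0xFFFE:
        return "2.2: noncharacter"
    return "other: needs block data (categories 3 to 6)"


if __name__ == "__main__":
    for v in (0x110000, 0xD800, 0xFFFF, 0x4E00):
        print(f"{v:#x} -> {classify(v)}")
```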
Currently, if the Unicode scalar value (or invalid code unit) is NNNN
(an unsigned 32-bit value), it is treated as an expansion to an
ignorable collation element:
[.0000.0000.0000.NNNN]
This means that they will always be ignored, except at the final
implicit level comparing scalar values in binary order. However, this
is not reasonable for many of them.
Note that the weight for the last implicit binary level is included in
the DUCET, but it exceeds the 16-bit capacity for weights, so this
level is probably split into several successive collation elements,
using a mechanism similar to surrogates (except that surrogates don't
have the correct binary order). As this has to allow for code units
that exceed the range of valid scalar values, accepting any unsigned
32-bit code unit, it could simply be:
> if (NNNN in 0x0000..0xFFFF), then only one collation element is needed: [.0000.0000.0000.NNNN]; otherwise
> if (NNNN > 0xFFFF), use three collation elements: [.0000.0000.0000.FFFF][.0000.0000.0000.HHHH][.0000.0000.0000.LLLL], where HHHH=(NNNN>>16) and LLLL=(NNNN&0xFFFF).
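The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `last_level_elements` and the tuple representation of a collation element are hypothetical, and values up to 0xFFFF are taken to use the single-element form, with everything larger using the three-element form.

```python
from typing import List, Tuple

# A collation element as (level 1, level 2, level 3, level 4) weights.
Element = Tuple[int, int, int, int]


def last_level_elements(nnnn: int) -> List[Element]:
    """Split an unsigned 32-bit value into 16-bit fourth-level weights."""
    if not 0 <= nnnn <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if nnnn <= 0xFFFF:
        # Fits in one weight: [.0000.0000.0000.NNNN]
        return [(0, 0, 0, nnnn)]
    hhhh = nnnn >> 16      # high 16 bits
    llll = nnnn & 0xFFFF   # low 16 bits
    # 0xFFFF acts as a lead marker, followed by the two halves.
    return [(0, 0, 0, 0xFFFF), (0, 0, 0, hhhh), (0, 0, 0, llll)]


if __name__ == "__main__":
    values = [0xFFFFFFFF, 0x10000, 0xFFFF, 0x41, 0x110000]
    # For isolated values, comparing the element lists reproduces
    # the binary order of the values themselves.
    print(sorted(values, key=last_level_elements) == sorted(values))
```

For isolated values the element lists compare in the same order as the values. In a longer string, a lone 0xFFFF followed by other weights could be confused with the lead marker, so an implementation would have to settle how 0xFFFF itself is encoded.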
Note that for this fourth (last implicit) collation level, run-length
compression does not apply, as this level is also present for every
validly encoded character, contraction or expansion, and it will be
used as well for all the cases above (treating them as ignorable on
the first three levels).
If we want to be smarter, we should not treat ALL the cases above as
fully ignorable at the first three levels; some of them should get
primary weights, notably:
> 3.1. Unassigned code points that are in allocated "Core" Sinographs blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP).
> > Once they are allocated, they will sort using the implicit weights; in my opinion, they should already use that mechanism:
> > [.AAAA.0020.0002.][.BBBB.0000.0000.]
> > where AAAA=0xFB40+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
> > There's no reason to keep their collation elements unstable across Unicode versions when we can already predict what those elements will be.
> 3.2. Unassigned code points that are in allocated "Other" Sinographs blocks or planes (currently, "CJK Unified Ideographs Extension A" in the BMP, and all reserved code points in the SIP).
> > Once they are allocated, they will sort using the implicit weights; in my opinion, they should already use that mechanism:
> > [.AAAA.0020.0002.][.BBBB.0000.0000.]
> > where AAAA=0xFB80+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
> > There's no reason to keep their collation elements unstable across Unicode versions when we can already predict what those elements will be.
> 5. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default RTL directionality (in the BMP or SMP).
> 6. Unassigned code points that are in allocated blocks for non-Sinographs, non-Special, and with default LTR directionality (in the BMP or SMP).
> > Once they are allocated, most of them will NOT be fully ignorable, so it is probably best to give them implicit primary weights that are lower than those of the characters already assigned in the same block, but still higher than those of encoded characters from the other blocks that sort before it. Gaps should be provided in the DUCET at the beginning of the ranges for these blocks so that they can all fit there. A further benefit is that the blocks after them keep stable collation elements and are not affected by new allocations in one block.
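The formula proposed in items 3.1 and 3.2 above can be sketched as below. The names `implicit_elements`, `fmt`, `CORE_BASE` and `OTHER_BASE` are illustrative assumptions; the two sample values are only arithmetic examples and say nothing about the assignment status of those code points.

```python
from typing import List, Tuple

CORE_BASE = 0xFB40   # base used in item 3.1 ("Core" Sinograph blocks)
OTHER_BASE = 0xFB80  # base used in item 3.2 ("Other" Sinograph blocks)


def implicit_elements(nnnn: int, base: int) -> List[Tuple[int, int, int]]:
    """Return [.AAAA.0020.0002.][.BBBB.0000.0000.] as two weight triples."""
    aaaa = base + (nnnn >> 15)         # base plus the bits above the low 15
    bbbb = 0x8000 + (nnnn & 0x7FFF)    # low 15 bits, high bit forced on
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


def fmt(elements: List[Tuple[int, int, int]]) -> str:
    """Render the elements in the bracketed notation used in this message."""
    return "".join("[.%04X.%04X.%04X.]" % e for e in elements)


if __name__ == "__main__":
    print(fmt(implicit_elements(0x9FFF, CORE_BASE)))
    # [.FB41.0020.0002.][.9FFF.0000.0000.]
    print(fmt(implicit_elements(0x2F000, OTHER_BASE)))
    # [.FB85.0020.0002.][.F000.0000.0000.]
```

Because both weights depend only on NNNN and the chosen base, the result is the same before and after the code point is assigned, which is the stability argued for above.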
The other categories above (for code units exceeding the range of
valid scalar values if they are not treated as errors, or for code
points with valid scalar values and assigned to non-characters if they
are not treated as errors, or for code points with valid scalar values
assigned or reserved in the special supplementary plane) can be kept
as fully ignorable, using null weights on the first three levels and
the implicit last-level weight that gives the binary order of the
scalar value or code unit.
Note that valid PUAs are not concerned here: they are not in the
DUCET, even if they are subject to possible private tailorings to make
them fully ignorable or to use any other weights (including with
contractions or expansions). Without such a known private convention,
they should still be treated as fully ignorable (using the implicit
weights for the last level, sorting by scalar values). But the UCA
algorithm never mentions them, so it treats them with BASE=0xFBC0,
giving them non-zero primary weights and making them sort after all
Sinographs and before 'Trailing weights'...
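The ordering claimed in the last sentence can be checked with a little arithmetic. A rough sketch, assuming that trailing weights begin at primary weight 0xFC00 and that the SIP ends at 0x2FFFF; the names `lead_primary`, `PUA_BASE` and `TRAILING_START` are hypothetical:

```python
PUA_BASE = 0xFBC0        # base applied to private-use code points, per above
OTHER_BASE = 0xFB80      # base for "Other" Sinograph blocks and planes
TRAILING_START = 0xFC00  # assumption: first primary weight of the trailing range


def lead_primary(nnnn: int, base: int) -> int:
    """First primary weight (AAAA) of the implicit collation elements."""
    return base + (nnnn >> 15)


if __name__ == "__main__":
    # Highest lead weight reachable by a Sinograph in the SIP.
    last_sinograph = lead_primary(0x2FFFF, OTHER_BASE)
    for pua in (0xE000, 0xF0000, 0x10FFFD):
        lead = lead_primary(pua, PUA_BASE)
        # True when the PUA code point sorts after all Sinographs
        # and before the trailing weights.
        print(f"U+{pua:04X}: {lead:04X}", last_sinograph < lead < TRAILING_START)
```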
Philippe.