Re: UTS#10 (UCA) 7.1.3 Implicit Weights, Unassigned and Other Code Points

From: Mark Davis ☕ (mark@macchiato.com)
Date: Sun Aug 01 2010 - 18:49:26 CDT

    The issue of noncharacter code points is called out on
    http://www.unicode.org/reports/tr10/proposed.html#Unassigned_And_Other. If
    you have something to say on that subject, please submit a response on
    http://www.unicode.org/reporting.html

    As to most of the rest of what you
    say, I don't see where it is coming from: it does not seem to be in
    alignment with the specification on implicit weights:
    http://www.unicode.org/reports/tr10/proposed.html#Implicit_Weights . For
    example, PUA characters *ARE* covered by UCA, as described in those
    sections. (To clarify, it might be helpful to move the last
    sentence of 7.1.2 into the first part of 7.1.3.)

    You might think that it would be better not to ignore ill-formed sequences
    (when they are not treated as errors, 7.1.1). If so, you should also submit
    a response on that topic.

    Please submit all of your responses on http://www.unicode.org/reporting.html
    and keep each one short, concise, and on a single topic. Otherwise the UTC
    is not likely to take the time to decipher them.

    Mark

    *— Il meglio è l’inimico del bene —*

    On Sun, Aug 1, 2010 at 11:54, Philippe Verdy <verdy_p@wanadoo.fr> wrote:

    > Implicit weights for unassigned code points and other characters that
    > are NOT ill-formed are suboptimal, as noted in the proposed update.
    >
    > They should take into account their existing default properties, notably:
    >
    > > 1. Code units exceeding the valid range for code points with scalar
    > values (such as 0x110000 or 0xFFFFFFFF, when handling invalid UTF-32), if
    > they are not handled as errors for collation.
    > > 2. Code points with valid scalar values that are permanently assigned to
    > non-characters, if they are not handled as errors for collation:
    > > > 2.1. surrogates; or:
    > > > 2.2. other scalar values (such as U+FFFF).
    > > 3. Their presence in a block or plane assigned to Sinographs ("Unified
    > Ideographs"), either:
    > > > 3.1. Unassigned code points that are in allocated "Core" Sinographs
    > blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP),
    > or:
    > > > 3.2. Unassigned code points that are in allocated "Other" Sinographs
    > blocks or planes (currently, "CJK Unified Ideographs Extension A" in the
    > BMP, and all reserved code points in the SIP).
    > > 4. Unassigned code points that are in areas assigned to "Special"
    > characters (notably in the supplementary special plane (SSP) starting at
    > U+E0000).
    > > 5. Unassigned code points that are in allocated blocks for
    > non-Sinographs, non-Special, and with default RTL directionality (in the BMP
    > or SMP).
    > > 6. Unassigned code points that are in allocated blocks for
    > non-Sinographs, non-Special, and with default LTR directionality (in the BMP
    > or SMP).
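
As an illustration of the first two categories in the list above, here is a minimal Python sketch that labels a 32-bit value as case 1, 2.1 or 2.2. The function name and the labels are hypothetical. The ranges are the standard surrogate range (U+D800..U+DFFF) and the noncharacter code points (U+FDD0..U+FDEF plus the last two code points of every plane). Categories 3 to 6 depend on block and property data and are not attempted here.

```python
# Sketch only: classify_special and its labels are made-up names that
# mirror the numbering of the list in the message.
MAX_CODE_POINT = 0x10FFFF


def classify_special(value: int) -> str:
    """Label cases 1, 2.1 and 2.2; everything else is reported as 'other'."""
    if value < 0 or value > MAX_CODE_POINT:
        # Case 1: code unit outside the code point range (invalid UTF-32).
        return "1: out of range"
    if 0xD800 <= value <= 0xDFFF:
        # Case 2.1: surrogate code points.
        return "2.1: surrogate"
    if 0xFDD0 <= value <= 0xFDEF or (value & 0xFFFE) == 0xFFFE:
        # Case 2.2: U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in every plane.
        return "2.2: noncharacter"
    return "other"


if __name__ == "__main__":
    for sample in (0x41, 0xD800, 0xFFFF, 0x110000, 0xFFFFFFFF):
        print(f"{sample:#x}: {classify_special(sample)}")
```

Whether such values are reported as errors or passed on to collation is a separate policy decision; the sketch only identifies them.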
    >
    > Currently, if the Unicode scalar value (or invalid code unit) is NNNN
    > (unsigned 32-bit value), then they are treated as expansions to
    > ignorable collation elements:
    > [.0000.0000.0000.NNNN]
    >
    > This means that they will always be ignored, except at the final
    > implicit level comparing scalar values in binary order. However this
    > is not reasonable for many of them.
    >
    > Note that the weight for last implicit binary level is included in the
    > DUCET, but it exceeds the 16-bit capacity for weights, and this level
    > is probably split in several successive collation elements, using a
    > mechanism similar to surrogates (except that surrogates don't have the
    > correct binary order); as this has to take into account the
    > possibility of code units exceeding the capacity of valid scalar
    > values but accepting any unsigned 32-bit code unit, this could be
    > simply:
    > > if (NNNN in 0x0000..0xFFFF), then only one collation element is needed:
    > [.0000.0000.0000.NNNN]; otherwise
    > > if (NNNN >= 0xFFFF), use three collation elements:
    > [.0000.0000.0000.FFFF][.0000.0000.0000.HHHH][.0000.0000.0000.LLLL], where
    > HHHH=(NNNN>>16) and LLLL=(NNNN&0xFFFF).
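
The split just described can be sketched in Python as follows. This is only an illustration of the scheme in the message, not of any existing implementation. The function names are hypothetical, and because the two stated ranges overlap at 0xFFFF, the sketch assumes the first test wins, so 0xFFFF itself stays a single element.

```python
from typing import List, Tuple

CollationElement = Tuple[int, int, int, int]


def fourth_level_elements(nnnn: int) -> List[CollationElement]:
    """Build fully ignorable elements carrying NNNN on the fourth level."""
    if not 0 <= nnnn <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if nnnn <= 0xFFFF:
        # Assumption: 0xFFFF is covered by this branch, not by the escape.
        return [(0, 0, 0, nnnn)]
    hhhh = nnnn >> 16      # high 16 bits
    llll = nnnn & 0xFFFF   # low 16 bits
    # 0xFFFF acts as an escape marker followed by the two halves.
    return [(0, 0, 0, 0xFFFF), (0, 0, 0, hhhh), (0, 0, 0, llll)]


def format_elements(elements: List[CollationElement]) -> str:
    """Render elements in the [.pppp.ssss.tttt.qqqq] notation."""
    return "".join("[.%04X.%04X.%04X.%04X]" % ce for ce in elements)


if __name__ == "__main__":
    for sample in (0x0041, 0x10FFFF, 0x110000):
        print(format_elements(fourth_level_elements(sample)))
```

Comparing the resulting fourth-level weights element by element gives the same order as comparing the original 32-bit values, since every single-element sequence is either smaller than the 0xFFFF marker or a prefix of an escaped sequence.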
    >
    > Note that for this fourth (last implicit) collation level, run-length
    > compression does not apply, as it is present as well for all valid
    > encoded character, contractions or expansions, and will be used as
    > well for all the cases above (treating them as ignorables on the first
    > 3 levels).
    >
    > If we want to be smarter, we should not treat ALL the cases above as
    > fully ignorable at the first three levels; some of them should get
    > primary weights, notably:
    >
    > > 3.1. Unassigned code points that are in allocated "Core" Sinographs
    > blocks (currently, "CJK compatibility" or "CJK unified", all in the BMP).
    > > > When they are allocated, they will sort using the implicit weights;
    > in my opinion they should already use the mechanism expressed as:
    > > > [.AAAA.0020.0002.][.BBBB.0000.0000.]
    > > > where AAAA=0xFB40+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
    > > > There's no reason to maintain their unstable collation elements
    > depending on Unicode versions, when we can already predict what will be
    > their collation elements.
    >
    > > 3.2. Unassigned code points that are in allocated "Other" Sinographs
    > blocks or planes (currently, "CJK Unified Ideographs Extension A" in the
    > BMP, and all reserved code points in the SIP).
    > > > When they are allocated, they will sort using the implicit weights;
    > in my opinion they should already use the mechanism expressed as:
    > > > [.AAAA.0020.0002.][.BBBB.0000.0000.]
    > > > where AAAA=0xFB80+(NNNN>>15) and BBBB=0x8000+(NNNN&0x7FFF);
    > > > There's no reason to maintain their unstable collation elements
    > depending on Unicode versions, when we can already predict what will be
    > their collation elements.
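
For reference, the AAAA/BBBB arithmetic quoted in 3.1 and 3.2 can be written out as a short Python sketch. The constant and function names are hypothetical. The bases 0xFB40 and 0xFB80 are the ones given above, and 0xFBC0 is the base the message later cites for other code points. Which base applies to a given code point is left to the caller, since that depends on block data not modelled here.

```python
from typing import List, Tuple

# Bases as quoted in the message; the names are illustrative only.
BASE_CORE_SINOGRAPHS = 0xFB40
BASE_OTHER_SINOGRAPHS = 0xFB80
BASE_OTHER = 0xFBC0


def implicit_elements(nnnn: int, base: int) -> List[Tuple[int, int, int]]:
    """Return [.AAAA.0020.0002.][.BBBB.0000.0000.] as two weight triples."""
    aaaa = base + (nnnn >> 15)        # top bits of the code point
    bbbb = 0x8000 + (nnnn & 0x7FFF)   # low 15 bits, forced non-zero
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]


if __name__ == "__main__":
    samples = [
        (0x4E00, BASE_CORE_SINOGRAPHS),
        (0x2A6D7, BASE_OTHER_SINOGRAPHS),
        (0xE000, BASE_OTHER),
    ]
    for code_point, base in samples:
        first, second = implicit_elements(code_point, base)
        print("U+%04X -> [.%04X.%04X.%04X.][.%04X.%04X.%04X.]"
              % ((code_point,) + first + second))
```

Because NNNN >> 15 is at most 0x21 for any code point up to U+10FFFF, the AAAA values derived from one base never reach the next base, so the three ranges stay disjoint and sort in base order.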
    >
    > > 5. Unassigned code points that are in allocated blocks for
    > non-Sinographs, non-Special, and with default RTL directionality (in the BMP
    > or SMP).
    > > 6. Unassigned code points that are in allocated blocks for
    > non-Sinographs, non-Special, and with default LTR directionality (in the BMP
    > or SMP).
    > >> When they are allocated, most of them will NOT be fully ignorable,
    > and it is probably best to give them appropriate implicit primary weights:
    > lower than those used for the assigned characters in the same block, but
    > still higher than those of encoded characters from other blocks that have
    > lower primary weights than the assigned characters in the block. Gaps
    > should be provided in the DUCET at the beginning of the ranges for these
    > blocks so that they can all fit in them. A further benefit is that other
    > blocks after them will keep their collation elements stable and won't be
    > affected by new allocations in one block.
    >
    > The other categories above (for code units exceeding the range of
    > valid scalar values if they are not treated as errors, or for code
    > points with valid scalar values and assigned to non-characters if they
    > are not treated as errors, or for code points with valid scalar values
    > assigned or reserved in the special supplementary plane) can be kept
    > as fully ignorable, using null weights on the (fully ignorable) first
    > three levels, and the implicit (last level) weights for scalar value
    > or code unit binary weights.
    >
    > Note that valid PUAs are not concerned here: they are not in the
    > DUCET, even if they are subject to possible private tailorings to make
    > them fully ignorable or use any other weights (including with
    > contractions or expansions). Without such a known private convention,
    > they should still be treated as fully ignorable (using the implicit
    > weights for the last level, sorting by scalar values). But the UCA
    > algorithm completely forgets to speak about them, so it treats them
    > with BASE=0xFBC0, giving non-zero primary weights and making them sort
    > after all Sinographs and before 'Trailing weights'...
    >
    > Philippe.
    >
    >
    >


