From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Mon Mar 26 2007 - 17:48:30 CST
Doug Ewell wrote on Monday, March 26, 2007 2:55 PM
> Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
>> It would be wrong for an application implicitly claiming not to change
>> the text to strip variation selectors out of ideographic selectors
>> without any by your leave. (By contrast, normalisation does not change
>> the text for Unicode-compliant processes - some round-tripping is
>> inherently not Unicode-compliant.)
> This doesn't sound right to me. Normalization is all about changing one
> character or sequence to another.
It boils down to the interpretation of conformance clauses C6 and C7:
'C6: A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.'
'C7: When a process purports not to modify the interpretation of a valid
coded character sequence, it shall make no change to that character sequence
other than the possible replacement of character sequences by their
canonical-equivalent sequences or the deletion of noncharacter code points.'
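For concreteness, here is a minimal Python sketch (using the standard
unicodedata module) of the sort of replacement C7 permits: U+00E9 and the
sequence <U+0065, U+0301> are canonically equivalent, so converting one to the
other is not, in the standard's terms, a modification of the interpretation.

    import unicodedata

    precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

    # The two spellings are canonically equivalent; NFC and NFD map them together.
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True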
There was an inconclusive discussion about it in late 2003 (referred to in
UTN 14, back when the clauses were numbered C9 and C10) on whether
converting text to NFC for compression constituted a change to the text.
A significant argument was that Unicode-encoded text would often be used by
processes that were not 'Unicode-compliant' - more precisely, not C6-compliant.
(And Unicode-compliant default upper-casing - Clause C20 - is not quite
compliant with Clause C6, though the default upper-casing seems to be wrong
anyway in every case of discrepancy to which I can assign a plausible
meaning.)
> -- especially if compatibility normalization (NFKC or NFKD) is involved.
A red herring. The explanation of C7 states, 'Replacement of a character
sequence by a compatibility-equivalent sequence _does_ modify the
interpretation of the text.'
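For example (another small Python sketch - any character with only a
compatibility decomposition would serve): NFC leaves U+FB01 LATIN SMALL
LIGATURE FI alone, whereas NFKC replaces it with 'fi', which is precisely the
kind of change C7 says does modify the interpretation.

    import unicodedata

    ligature = "\uFB01"   # LATIN SMALL LIGATURE FI - compatibility decomposition only
    print(unicodedata.normalize("NFC", ligature))    # unchanged: still U+FB01
    print(unicodedata.normalize("NFKC", ligature))   # 'fi' - the interpretation has changed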
A key point is that C6-compliant processes cannot care whether the data has
been transformed in a manner preserving canonical equivalence with the
original. Round-trip conversion is not a C6-compliant process if it relies
on compatibility characters with canonical decompositions - and neither is a
renderer that respects the differences between CJK compatibility ideographs
and their singleton decompositions. CJK compatibility ideographs serve no
useful purpose if they are only interpreted by Unicode-compliant processes!
This immediately and unfortunately implies that if:
1) Round-trip conversion from a 'legacy' character set required CJK
compatibility ideographs before the advent of IVS;
2) One does not use mark-up to preserve the distinctions lost in normalised
Unicode; and
3) One intends to display the text using Unicode-compliant processes,
then IVS is the only way to preserve the graphic distinctions.
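A small Python sketch of the contrast (the particular selector sequence is
merely illustrative - whether it is registered in the IVD is beside the point
here): the compatibility ideograph's identity evaporates under NFC, because it
has a singleton canonical decomposition, while a base ideograph followed by a
variation selector passes through normalisation untouched.

    import unicodedata

    compat = "\uF900"            # CJK COMPATIBILITY IDEOGRAPH-F900, decomposes to U+8C48
    ivs = "\u8C48\U000E0100"     # U+8C48 followed by VARIATION SELECTOR-17

    print(unicodedata.normalize("NFC", compat) == "\u8C48")  # True - the distinction is lost
    print(unicodedata.normalize("NFC", ivs) == ivs)          # True - the sequence survives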
>> For a file consisting mostly of CJK text, appending U+E0100 to every
>> unified ideograph would bloat the UTF-16 storage requirement from
>> typically one code unit per character to typically three code units per
>> character! Doug Ewell's survey of Unicode compression (
>> http://www.unicode.org/notes/tn14/ ) rather suggests that many standard
>> compression techniques would not counteract such bloat effectively.
> This is true for compression techniques that operate on one code point at
> a time, such as SCSU and BOCU and Huffman coding. It may not be true for
> dictionary-based techniques like LZ.
LZ77 performs about 20% better on SCSU-compressed text from small alphabets
than on the same text in UTF-16. I will agree that compressors using the
Burrows-Wheeler algorithm will probably counteract the bloat very
effectively.
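To put figures on the 'bloat' in the quoted paragraph above (a quick Python
check; the repeated ideograph is arbitrary): a BMP ideograph occupies one
UTF-16 code unit, and U+E0100 occupies two (a surrogate pair), so every
base-plus-selector pair costs three code units.

    ideographs = "\u4E00" * 1000                               # 1000 BMP ideographs
    with_vs = "".join(c + "\U000E0100" for c in ideographs)    # append VS17 to each

    print(len(ideographs.encode("utf-16-le")) // 2)   # 1000 code units
    print(len(with_vs.encode("utf-16-le")) // 2)      # 3000 code units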
> The question of how desirable it is to append a variation selector to
> every character in the first place is perhaps more generally interesting.
Which is why I chose the evaluative term 'bloat'.
Richard.