From: Doug Ewell (dewell@adelphia.net)
Date: Mon Mar 26 2007 - 07:55:50 CST
Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:
> It would be wrong for an application implicitly claiming not to change
> the text to strip variation selectors out of ideographic selectors
> without any by your leave. (By contrast, normalisation does not
> change the text for Unicode-compliant processes - some round-tripping
> is inherently not Unicode-compliant.)
This doesn't sound right to me. Normalization is all about changing one
character or sequence to another. A Unicode-compliant process is not
supposed to assume that two canonically equivalent sequences will be
treated differently, but that is not the same as saying the text has not
changed -- especially if compatibility normalization (NFKC or NFKD) is
involved.
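As a small illustration of that distinction, here is a minimal Python sketch using the standard unicodedata module. The sample strings and the helper name code_points are chosen only for illustration; they are not taken from the thread.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Canonically equivalent, yet the stored code points differ.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility normalization goes further and discards distinctions.
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI
folded = unicodedata.normalize("NFKC", ligature)

print(code_points(decomposed), "->", code_points(composed))
print(code_points(ligature), "->", code_points(folded))
```

The two spellings of the accented letter are interchangeable to a conformant process, but a byte-for-byte comparison before and after normalization would still report a change. In the NFKC case the ligature is replaced by two ordinary letters, so the original distinction cannot be recovered from the normalized text.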
> On the other hand, it might not be unreasonable for an application to
> compress such text by transferring the information in the variation
> selectors to a 'higher level protocol'. For a file consisting mostly
> of CJK text, appending U+E0100 to every unified ideograph would bloat
> the UTF-16 storage requirement from typically one code unit per
> character to typically three code units per character! Doug Ewell's
> survey of Unicode compression ( http://www.unicode.org/notes/tn14/ )
> rather suggests that many standard compression techniques would not
> counteract such bloat effectively.
This is true for compression techniques that operate on one code point
at a time, such as SCSU, BOCU-1, and Huffman coding. It may not be true
for dictionary-based techniques like LZ, which can replace a frequently
repeated sequence with a short reference. The question of how desirable
it is to append a variation selector to every character in the first
place is perhaps more generally interesting.
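The storage arithmetic and the dictionary-based case can be checked with a rough Python sketch. Several things here are assumptions made for illustration: zlib (DEFLATE, which combines LZ77 with Huffman coding) stands in for "LZ" in general, the helper names utf16_units and add_selectors are hypothetical, only the U+4E00..U+9FFF block is treated as unified ideographs, and the sample text is pseudo-random rather than real CJK prose, so the compressed sizes it prints say little about real documents.

```python
import random
import zlib

VS17 = "\U000E0100"  # VARIATION SELECTOR-17, a supplementary-plane character

def utf16_units(text):
    """Number of UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2

def add_selectors(text):
    """Append VS17 to every character in U+4E00..U+9FFF (an assumed simplification)."""
    return "".join(ch + VS17 if "\u4e00" <= ch <= "\u9fff" else ch for ch in text)

# Hypothetical sample: 2000 pseudo-random ideographs, seeded for repeatability.
rng = random.Random(0)
plain = "".join(chr(rng.randint(0x4E00, 0x9FFF)) for _ in range(2000))
bloated = add_selectors(plain)

raw_plain = plain.encode("utf-16-le")
raw_bloated = bloated.encode("utf-16-le")

print("code units:", utf16_units(plain), "->", utf16_units(bloated))
print("raw bytes: ", len(raw_plain), "->", len(raw_bloated))
print("zlib bytes:", len(zlib.compress(raw_plain)), "->", len(zlib.compress(raw_bloated)))
```

Each ideograph in that block takes one UTF-16 code unit and U+E0100 takes a surrogate pair, which is where the one-to-three figure comes from. Comparing the two zlib sizes gives a rough idea of how much of the added bulk a dictionary-based compressor can absorb on this particular input.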
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages