From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Mar 23 2007 - 17:05:40 CST
John Knightley <vunzndi@vfemail.net> wrote on Wednesday, March 21, 2007 3:29 PM:
> Quoting Eric Muller <emuller@adobe.com>:
>> Not "normalization" proper, but rather "removal of default ignorable".
>> That second operation is vastly more unlikely than normalization. For
>> example, the W3C recommends the (early) normalization of XML documents
>> but they certainly don't advocate that default ignorable be removed.
> Since these are only recommendations this could happen in either case,
> and still be 100% Unicode compliant. Which means one still cannot have
> one's cake and eat it.
Blanket removal of default ignorable characters is a transformation of the
text, as it would strip out CGJ, ZWJ, ZWNJ, WJ, ZWSP and bidi controls, and
is 'Unicode compliant' in the same way as case folding can be. (Normalising
to NFD and then replacing every base character by 'x' and removing the rest
is also a Unicode-compliant process.) Being 'default ignorable' means that
in rendering the character can be ignored if the application does not
support it; it does not mean that it can be dropped when text is
transformed. It would be wrong for an application that implicitly claims not
to change the text to strip variation selectors out of ideographic variation
sequences without so much as a by-your-leave. (By contrast, normalisation does
not change the text as far as Unicode-compliant processes are concerned; some
round-tripping is inherently not Unicode-compliant.)
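To illustrate the distinction, here is a minimal Python sketch. The helper names and the code point table are my own for illustration: the table covers only the characters named above plus the variation selector ranges, not the full Default_Ignorable_Code_Point property. The sample is an arbitrary ideograph followed by U+E0100, then Latin letters separated by ZWNJ.

```python
import unicodedata

# Illustrative subset only: the characters named in the message, plus the
# variation selector ranges. NOT the full Default_Ignorable_Code_Point set.
SINGLES = {0x034F, 0x200B, 0x200C, 0x200D, 0x2060}  # CGJ, ZWSP, ZWNJ, ZWJ, WJ
RANGES = [
    (0x200E, 0x200F),    # LRM, RLM
    (0x202A, 0x202E),    # bidi embedding and override controls
    (0xFE00, 0xFE0F),    # VS1..VS16
    (0xE0100, 0xE01EF),  # VS17..VS256
]

def is_listed_ignorable(ch):
    """True if ch is in the illustrative table above (hypothetical helper)."""
    cp = ord(ch)
    return cp in SINGLES or any(lo <= cp <= hi for lo, hi in RANGES)

def strip_listed_ignorables(text):
    """Blanket removal: drops every listed character (hypothetical helper)."""
    return "".join(ch for ch in text if not is_listed_ignorable(ch))

# An ideograph + U+E0100, a space, then "a" ZWNJ "b".
sample = "\u845B\U000E0100 a\u200Cb"

normalized = unicodedata.normalize("NFC", sample)
stripped = strip_listed_ignorables(sample)

print(normalized == sample)  # normalisation keeps the selector and the ZWNJ
print(stripped == sample)    # blanket removal does not
```

Normalisation leaves the variation selector and the ZWNJ in place, whereas the blanket removal yields a different code point sequence, which is the sense in which it is a transformation of the text.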
On the other hand, it might not be unreasonable for an application to
compress such text by transferring the information in the variation
selectors to a 'higher level protocol'. For a file consisting mostly of CJK
text, appending U+E0100 to every unified ideograph would bloat the UTF-16
storage requirement from typically one code unit per character to typically
three code units per character! Doug Ewell's survey of Unicode compression
( http://www.unicode.org/notes/tn14/ ) rather suggests that many standard
compression techniques would not counteract such bloat effectively.
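The arithmetic behind the threefold figure can be checked with a short Python sketch. The function name is hypothetical and the two ideographs are arbitrary BMP examples. U+E0100 lies outside the BMP, so it needs a surrogate pair (two code units) in UTF-16, while a BMP ideograph needs one.

```python
def utf16_code_units(text):
    """Count UTF-16 code units: encoded length in bytes divided by two."""
    return len(text.encode("utf-16-le")) // 2

plain = "\u6F22\u5B57"  # two BMP unified ideographs, one code unit each
with_vs = "".join(ch + "\U000E0100" for ch in plain)  # append U+E0100 to each

print(utf16_code_units(plain))    # code units without selectors
print(utf16_code_units(with_vs))  # code units with a selector on every ideograph
```

The "typically" matters: an ideograph outside the BMP already takes two code units, so with a selector it would take four rather than three.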
Richard.