From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Mar 23 2007 - 17:05:40 CST
John Knightley <vunzndi@vfemail.net> wrote on Wednesday, March 21, 2007 3:29 PM:
> Quoting Eric Muller <emuller@adobe.com>:
>> Not "normalization" proper, but rather "removal of default ignorable".
>> That second operation is vastly more unlikely than normalization. For
>> example, the W3C recommends the (early) normalization of XML documents
>> but they certainly don't advocate that default ignorable be removed.
> Since these are only recommendations this could happen in either case, 
> and still be 100% Unicode compliant. Which means one still cannot have 
> one's cake and eat it.
Blanket removal of default ignorable characters is a transformation of the 
text, as it would strip out CGJ, ZWJ, ZWNJ, WJ, ZWSP and bidi controls, and 
is 'Unicode compliant' in the same way as case folding can be.  (Normalising 
to NFD and then replacing every base character by 'x' and removing the rest 
is also a Unicode-compliant process.)  Being 'default ignorable' means that 
the character can be ignored in rendering if the application does not 
support it; it does not mean that it can be dropped when text is 
transformed.  It would be wrong for an application that implicitly claims 
not to change the text to strip variation selectors out of ideographic 
variation sequences without so much as a by-your-leave.  (By contrast, 
normalisation does not change the 
text for Unicode-compliant processes - some round-tripping is inherently not 
Unicode-compliant.)
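
As a minimal sketch of that distinction, in Python: the set below is only a
hand-picked sample of default ignorable code points (the ones named above,
plus the variation selectors), not the full Default_Ignorable_Code_Point
property, and the helper names are made up for illustration.  Canonical
equivalence is tested by comparing NFD forms.

```python
import unicodedata

# Hand-picked SAMPLE of default ignorable code points (an assumption for
# illustration; the authoritative list is the Default_Ignorable_Code_Point
# property in the Unicode Character Database).
SAMPLE_IGNORABLE_RANGES = (
    (0x034F, 0x034F),    # CGJ
    (0x200B, 0x200F),    # ZWSP, ZWNJ, ZWJ, LRM, RLM
    (0x202A, 0x202E),    # bidi embedding and override controls
    (0x2060, 0x2060),    # WJ
    (0xFE00, 0xFE0F),    # variation selectors VS1..VS16
    (0xE0100, 0xE01EF),  # variation selectors VS17..VS256
)


def is_sample_ignorable(ch):
    """True if ch falls in one of the sample ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in SAMPLE_IGNORABLE_RANGES)


def strip_sample_ignorables(text):
    """Blanket removal of the sample default ignorables (a lossy transform)."""
    return "".join(ch for ch in text if not is_sample_ignorable(ch))


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Normalisation: 'e' + combining acute versus precomposed e-acute.
decomposed = "e\u0301"
precomposed = "\u00E9"

# An ideograph followed by VS17, and two letters separated by ZWNJ.
with_selector = "\u8FBB\U000E0100"
with_zwnj = "a\u200Cb"

print(canonically_equivalent(decomposed, precomposed))
print(canonically_equivalent(with_selector,
                             strip_sample_ignorables(with_selector)))
print(canonically_equivalent(with_zwnj, strip_sample_ignorables(with_zwnj)))
```

The first comparison holds and the other two do not: normalising keeps the
text within its canonical equivalence class, whereas stripping yields a
string that is not canonically equivalent to the original.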
On the other hand, it might not be unreasonable for an application to 
compress such text by transferring the information in the variation 
selectors to a 'higher level protocol'.  For a file consisting mostly of CJK 
text, appending U+E0100 to every unified ideograph would bloat the UTF-16 
storage requirement from typically one code unit per character to typically 
three code units per character!  Doug Ewell's survey of Unicode compression 
( http://www.unicode.org/notes/tn14/ ) rather suggests that many standard 
compression techniques would not counteract such bloat effectively.
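
The arithmetic can be checked with a short Python sketch.  It counts UTF-16
code units for a few ideographs with and without U+E0100 appended, and then
moves the selectors into a side table as one conceivable stand-in for a
'higher level protocol'.  The function names and the side-table layout are
assumptions for illustration only, not any established format.

```python
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF))


def utf16_code_units(text):
    """Number of UTF-16 code units needed to store text (no BOM)."""
    return len(text.encode("utf-16-le")) // 2


def is_variation_selector(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VS_RANGES)


def split_selectors(text):
    """Separate text into (base_text, side_table).

    side_table maps the index of a character in base_text to the single
    variation selector that immediately followed it.  A selector with no
    preceding character, or a second selector in a row, stays in base_text
    so that the split remains reversible.
    """
    base = []
    side = {}
    for ch in text:
        prev = len(base) - 1
        if is_variation_selector(ch) and prev >= 0 and prev not in side:
            side[prev] = ch
        else:
            base.append(ch)
    return "".join(base), side


def merge_selectors(base_text, side_table):
    """Inverse of split_selectors: put each selector back after its base."""
    out = []
    for i, ch in enumerate(base_text):
        out.append(ch)
        if i in side_table:
            out.append(side_table[i])
    return "".join(out)


plain = "\u4E00\u4E8C\u4E09"                           # three BMP ideographs
with_vs = "".join(ch + "\U000E0100" for ch in plain)   # each followed by VS17

print(utf16_code_units(plain))     # 3
print(utf16_code_units(with_vs))   # 9

base, side = split_selectors(with_vs)
print(utf16_code_units(base))                   # 3
print(merge_selectors(base, side) == with_vs)   # True
```

Each BMP ideograph takes one code unit and U+E0100 takes a surrogate pair,
hence three code units per character.  The side table still has to be stored
somewhere, of course; the sketch shows only that the text stream itself
returns to one code unit per character and can be restored exactly.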
Richard.