Re: Comment on PRI 98: IVD Adobe-Japan1 (pt.2)

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Mar 23 2007 - 17:05:40 CST


    John Knightley <vunzndi@vfemail.net> wrote on Wednesday, March 21, 2007
    3:29 PM

    > Quoting Eric Muller <emuller@adobe.com>:

    >> Not "normalization" proper, but rather "removal of default ignorable".
    >> That second operation is vastly more unlikely than normalization. For
    >> example, the W3C recommends the (early) normalization of XML documents
    >> but they certainly don't advocate that default ignorable be removed.

    > Since these are only recommendations, this could happen in either case
    > and still be 100% Unicode compliant, which means one still cannot have
    > one's cake and eat it.

    Blanket removal of default ignorable characters is a transformation of
    the text, as it would strip out CGJ, ZWJ, ZWNJ, WJ, ZWSP and the bidi
    controls, and is 'Unicode compliant' only in the same way that case
    folding can be. (Normalising to NFD and then replacing every base
    character by 'x' and removing the rest is also a Unicode-compliant
    process.) Being 'default ignorable' means that the character can be
    ignored in rendering if the application does not support it; it does not
    mean that it can be dropped when the text is transformed. It would be
    wrong for an application implicitly claiming not to change the text to
    strip the variation selectors out of ideographic variation sequences
    without so much as a by-your-leave. (By contrast, normalisation does not
    change the text as far as Unicode-compliant processes are concerned -
    some round-tripping is inherently not Unicode-compliant.)
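
    To make that concrete, here is a minimal Python sketch. (The set below
    is only a small sample of the Default_Ignorable_Code_Point characters,
    and U+9089 with VS17 serves purely as an illustrative variation
    sequence.)

        # Small sample of default ignorable code points: CGJ, ZWSP, ZWNJ,
        # ZWJ, WJ, plus the variation selectors VS1-VS16 and VS17-VS256.
        def is_default_ignorable(ch: str) -> bool:
            cp = ord(ch)
            return (cp in (0x034F, 0x200B, 0x200C, 0x200D, 0x2060)
                    or 0xFE00 <= cp <= 0xFE0F
                    or 0xE0100 <= cp <= 0xE01EF)

        def strip_default_ignorable(text: str) -> str:
            return ''.join(ch for ch in text if not is_default_ignorable(ch))

        seq = "\u9089\U000E0100"  # U+9089 followed by VS17: a variation sequence
        assert strip_default_ignorable(seq) == "\u9089"  # the selector is silently lost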

    On the other hand, it might not be unreasonable for an application to
    compress such text by transferring the information in the variation
    selectors to a 'higher level protocol'. For a file consisting mostly of CJK
    text, appending U+E0100 to every unified ideograph would bloat the UTF-16
    storage requirement from typically one code unit per character to typically
    three code units per character! Doug Ewell's survey of Unicode compression
    ( http://www.unicode.org/notes/tn14/ ) rather suggests that many standard
    compression techniques would not counteract such bloat effectively.
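
    The factor of three is easy to verify: a BMP ideograph is one UTF-16
    code unit, while U+E0100 lies in a supplementary plane and costs a
    surrogate pair. A minimal Python sketch (U+6F22 stands in for any BMP
    ideograph):

        base = "\u6F22"            # U+6F22, a BMP ideograph: 1 UTF-16 code unit
        seq = base + "\U000E0100"  # append VS17 (U+E0100), a supplementary character

        def utf16_units(s: str) -> int:
            # UTF-16LE uses two bytes per code unit.
            return len(s.encode('utf-16-le')) // 2

        print(utf16_units(base))  # 1
        print(utf16_units(seq))   # 3 = 1 + 2 (surrogate pair for the selector)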

    Richard.


