Re: UTF-7 - I'm not really smarter

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Tue Mar 28 2006 - 11:14:37 CST


    Hello,

    Kornkreismuster@web.de wrote:
    > Reading this [RFC 2152], I got the feeling it only encodes UTF-16 encoded Texts,
    > but I think that's not true.

    The description in RFC 2152, chapter 4, is probably misleading to
    the uninitiated. The key to understanding it is that all UTFs are
    equivalent: they encode the same character set, namely the whole of
    Unicode, and any string encoded in one UTF can easily be transformed
    into any other.

    So, all references in chapter 4 of RFC 2152 to UTF-16, and to
    16-bit code elements, are only meant to facilitate the description
    of the algorithm. You can describe the UTF-7 encoding algorithm
    (with a grain of salt) thusly:
    1. encode the source string in UTF-16 (regardless of its previous
        encoding);
    2. convert every three UTF-16 code units (48 bits) into 8 bytes using
        a modified base-64 algorithm (hence, every output byte encodes 6 bits);
    3. enclose the result between a plus and a minus sign.
    Alternatively, runs of "harmless" characters may be encoded in ASCII,
    instead of applying steps 1..3, above.
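
    Steps 1..3 above can be sketched in Python roughly as follows. This
    is only an illustration, not the reference algorithm from RFC 2152:
    the function name utf7_shifted is made up for this example, it always
    uses the base-64 form (never the ASCII alternative), and it assumes
    that "modified" base-64 simply means dropping the trailing "="
    padding.

```python
import base64


def utf7_shifted(text: str) -> str:
    """Encode text as one UTF-7 shifted sequence (hypothetical helper)."""
    if not text:
        # "+-" would be read as a literal plus sign, so emit nothing instead.
        return ""
    # Step 1: UTF-16, big-endian, without a byte-order mark.
    utf16 = text.encode("utf-16-be")
    # Step 2: base-64, 6 bits per output byte; drop the "=" padding
    # (assumed here to be the "modification" in question).
    b64 = base64.b64encode(utf16).decode("ascii").rstrip("=")
    # Step 3: enclose the result between a plus and a minus sign.
    return "+" + b64 + "-"


print(utf7_shifted("ä"))   # +AOQ-
print(utf7_shifted("Hi"))  # +AEgAaQ-
```

    Python's own "utf-7" codec should decode such a sequence back to the
    original string, which is a quick way to check the sketch.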

    The latter alternative makes UTF-7 ambiguous: unlike UTF-8, UTF-16,
    and UTF-32, it allows the same character string to be encoded in
    several ways, cf. my example in
    <http://www.systems.uni-konstanz.de/Otto/Vortrag/Charset/Unicode-Grundlagen.html#UU-7>.
    I guess this is the main reason why UTF-7 is not part of the Unicode
    standard.
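
    The ambiguity can be seen with Python's built-in "utf-7" codec. The
    three byte strings below are my own illustration (not the example
    from the page linked above); as far as I can tell, all of them are
    acceptable UTF-7 for the same two-character string:

```python
# Three different UTF-7 byte sequences for the same text "Hi".
candidates = [
    b"Hi",        # both characters written directly in ASCII
    b"+AEgAaQ-",  # both characters in one base-64 shifted sequence
    b"H+AGk-",    # "H" direct, "i" in a shifted sequence
]

decoded = [raw.decode("utf-7") for raw in candidates]
print(decoded)
```

    By contrast, a given string has exactly one UTF-8 encoding, so
    comparing encoded bytes is enough to compare the text.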

    Regards,
       Otto Stolz


