Re: [icu-support] complete binary/utf mapping

From: Doug Ewell (dewell@roadrunner.com)
Date: Sat Sep 08 2007 - 00:43:51 CDT


    Mark E. Shoulson <mark at kli dot org> wrote:

    >> Somebody wanted to build that capability into an extension to UTF-8,
    >> so it could faithfully represent invalid garbage. We were never able
    >> to get him to work through what he wanted to do with the garbage thus
    >> preserved.
    >
    > Is there an obvious reason we couldn't just treat the garbage UTF-8 as
    > a string of 8-bit characters (might be part of a binary file or
    > something) and base-64 encode them? That'll definitely preserve
    > round-trippedness.

    Not quite; you would no longer be able to tell the garbage UTF-8 from
    the base-64 characters used to encode it.

    Consider the following sequences of bytes (with annotations in
    parentheses):

    41 41 41
       (three valid UTF-8 characters)

    C3 80 C3 81 C3 82
       (three more valid UTF-8 characters)

    C0 C1 C2
       (above three in ISO 8859-1; invalid UTF-8)

    41 4D 41 41 77 51 44 43
       (above three values, as 16-bit big-endian units, encoded in base64)

    The last eight bytes could just as easily be valid ASCII (i.e. UTF-8)
    text on their own. Indeed, some base64 sequences do spell out
    natural-language words (but this is left as an exercise for the reader).
    The point is that ASCII-as-base64 could not be distinguished from
    ASCII-as-real-text.
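
    As a rough illustration of the ambiguity described above, here is a
    minimal Python sketch. The helper name is_valid_utf8 is made up for
    this example. It checks that the eight bytes are perfectly good
    UTF-8 text and, at the same time, a perfectly good base64 payload
    (which decodes to 00 C0 00 C1 00 C2), so a reader of the stream has
    nothing to tell the two readings apart.

```python
import base64

# The three bytes from the message: A-grave, A-acute, A-circumflex in
# ISO 8859-1, but not a valid UTF-8 sequence.
garbage = bytes([0xC0, 0xC1, 0xC2])


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The eight bytes 41 4D 41 41 77 51 44 43 from the message.
eight = bytes([0x41, 0x4D, 0x41, 0x41, 0x77, 0x51, 0x44, 0x43])

# Reading 1: ordinary ASCII (and therefore UTF-8) text.
as_text = eight.decode("utf-8")

# Reading 2: a base64 payload carrying binary data.
as_payload = base64.b64decode(eight)

# Encoding the raw garbage bytes directly also yields plain ASCII.
encoded_garbage = base64.b64encode(garbage)
```

    Both readings succeed without error, which is exactly the problem:
    nothing in the byte stream marks where real text ends and base64
    begins.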

    The original requirement was to be able to represent any valid UTF-8
    sequence unambiguously, and *also* to preserve invalid UTF-8 sequences.
    One way of doing this would be by UTF-8-encoding the bytes from 0x80 to
    0xFF as if they were characters in the range U+110000 to U+11007F.
    Other possibilities exist. Note that although I describe the
    requirement, for the sake of discussion, I don't in any way agree that
    UTF-8 should be extended or modified to satisfy it.
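
    Purely for the sake of discussion, the escaping idea above might be
    sketched as follows in Python. Everything here is an assumption made
    for illustration: the names escape and unescape, the choice to
    escape one offending byte at a time, and the use of the four-byte
    UTF-8 bit pattern for values past U+10FFFF. The output is
    deliberately NOT valid UTF-8, and this is not a proposal.

```python
ESCAPE_BASE = 0x110000  # first value past the Unicode code space


def _encode_4byte(cp: int) -> bytes:
    """Apply the 4-byte UTF-8 bit pattern to cp, even past U+10FFFF."""
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def escape(data: bytes) -> bytes:
    """Copy valid UTF-8 unchanged; escape each offending byte."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            out += data[pos:]          # the rest is valid; done
            break
        except UnicodeDecodeError as err:
            bad = pos + err.start      # err.start is relative to the slice
            out += data[pos:bad]       # valid prefix, copied verbatim
            # The offending byte is always in 0x80..0xFF, because bytes
            # below 0x80 are valid UTF-8 on their own.
            out += _encode_4byte(ESCAPE_BASE + (data[bad] - 0x80))
            pos = bad + 1
    return bytes(out)


def unescape(encoded: bytes) -> bytes:
    """Inverse of escape(); assumes the input was produced by escape()."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        chunk = encoded[i:i + 4]
        # F4 90 never occurs in valid UTF-8, so it marks an escape.
        if len(chunk) == 4 and chunk[0] == 0xF4 and chunk[1] == 0x90:
            cp = ((chunk[0] & 0x07) << 18) | ((chunk[1] & 0x3F) << 12) \
                | ((chunk[2] & 0x3F) << 6) | (chunk[3] & 0x3F)
            out.append(0x80 + (cp - ESCAPE_BASE))
            i += 4
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)
```

    Under these assumptions, escape(bytes([0xC0, 0xC1, 0xC2])) yields
    F4 90 81 80 F4 90 81 81 F4 90 81 82, valid text such as 41 41 41
    passes through untouched, and unescape() restores the original bytes
    in both cases. Unlike the base64 approach, the escaped bytes cannot
    be confused with real text, because F4 90 never begins a valid UTF-8
    sequence.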

    --
    Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

