Re: Roundtripping Solved

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 16 2004 - 16:08:04 CST

  • Next message: Philippe Verdy: "Re: Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)"

    From: "Arcane Jill" <arcanejill@ramonsky.com>
    > Lars's current implementation of this scheme is that his "f" "escapes" the
    > binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte
    > x becomes the character U+EE00 + x). He is unhappy with this because
    > characters in the range U+EE80 to U+EEFF might be found in real text. So
    > you and I have, between us, suggested three alternative escaping
    > functions, in an attempt to find an escape sequence with a vanishingly
    > small probability of being found in real text. I'm not quite sure why Lars
    > isn't happy with these suggestions - maybe his goal has still not been
    > clearly stated - but either way, since nobody is proposing an amendment to
    > UTFs, it surely isn't the business of the UTC.

    What Lars wants has a name: it's a "transfer-encoding-syntax", which
    allows transporting arbitrary code unit sequences through a more
    restricted environment. This is not a new thing, but it is not specified
    by Unicode.

    It is specified in specific interfaces or APIs, as part of a protocol
    accepted by two compliant parties. Such Transfer-Encoding-Syntaxes are used:
    - in MIME, for transporting non-plain-text documents: Base64, UUEncoding,
    Hex, Quoted-Printable...
    - in programming languages: the special "\" prefix used to escape some
    characters (including '\' itself) with a sequence whose meaning is specified
    by the language itself, or doubling occurrences of single-quotes in quoted
    SQL string constants.
    - in many protocols: notably COBS (which allows escaping any restricted
    byte, such as 0x00); many variations of the COBS technique are in use.
    - in HTML: for example "&quot;" to escape the double-quote character.

    Remember that all this is a notation. What makes it a
    Transfer-Encoding-Syntax is that the notation is published and easily
    implementable by various processes, because the specification is well
    known and can easily be agreed upon by two distinct processes that accept
    the notation under a well-defined name.

    A Transfer-Encoding-Syntax does not alter the meaning or encoding of the
    original document, and it is by necessity completely bijective: given an
    arbitrary code unit sequence x in a value set F, it transforms it into a
    valid code unit sequence y=f(x) in a value set G, and is reversible back
    to x with a second "decoding" function g, so that x=g(y)=g(f(x)).

    A Transfer-Encoding-Syntax is fully bijective between the two definition
    domains of f() and g(): any valid code unit sequence y in G (the definition
    domain of g) MUST be decodable without error to F (the definition domain of
    f), so that y=f(g(y)) for any valid y (in G).

    Note that F and G are almost always distinct, even if, often but not
    always, F includes G (F will not include G, for example, if f() transforms
    any sequence of bytes F="[\x00-\xFF]*" into a sequence of valid UTF-32 code
    units G="[\U00000000-\U0010FFFF]*").

    There's a way to create such a pair of functions f() and g():
    - G must be the complete valid value range of Unicode codepoints, as
    indicated above.
    - F must be the complete valid value range of bytes, as indicated above.
    - f() transforms each invalid byte '\xnn' into the codepoint U+EEnn (note
    that since all bytes \x00-\x7F are valid, only U+EE80 to U+EEFF will be
    used).
    - f() MUST also transform any valid byte sequence that normally encodes a
    codepoint in U+EE80 to U+EEFF, mapping each of those VALID bytes '\xnn' to
    the codepoint U+EEnn.
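    As a sketch, the escaping function f() described above could look like the
    following (Python, with function names of my own invention; the proposal
    itself specifies no API). Invalid bytes, and the bytes of any valid
    encoding of U+EE80..U+EEFF, are each mapped to the codepoint U+EE00 + byte
    value:

```python
# Sketch of f(): escape an arbitrary byte string to a valid Unicode string.
# Names are hypothetical, not part of the proposal.

def escape_to_text(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        decoded, length = None, 0
        for n in (1, 2, 3, 4):              # UTF-8 sequences are 1..4 bytes
            try:
                decoded, length = data[i:i + n].decode('utf-8'), n
                break
            except UnicodeDecodeError:
                pass
        if decoded is not None and not (0xEE80 <= ord(decoded) <= 0xEEFF):
            out.append(decoded)             # valid UTF-8: pass it through
            i += length
        else:
            # Invalid byte, or a byte of a valid U+EE80..U+EEFF sequence:
            # escape this single byte as U+EE00 + byte value.
            out.append(chr(0xEE00 + data[i]))
            i += 1
    return ''.join(out)
```

    Note that a valid three-byte sequence for U+EE80..U+EEFF is escaped one
    byte at a time: once its lead byte is escaped, the orphaned continuation
    bytes are themselves invalid and get escaped on the following iterations.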

    Note that the UTF-8 encoding of U+EE80 to U+EEFF is:
    source bits: 1110 11101b bbbbbb
    UTF-8 bits: 11101110 1011101b 10bbbbbb
    UTF-8 bytes: [\xEE][\xBA-\xBB][\x80-\xBF]
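    The byte pattern above can be checked directly with any UTF-8 codec; for
    example, with Python's built-in one:

```python
# Endpoints of the escape range, serialized to UTF-8:
assert chr(0xEE80).encode('utf-8') == b'\xee\xba\x80'
assert chr(0xEEFF).encode('utf-8') == b'\xee\xbb\xbf'
```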

    For example, consider this NOT-UTF-8 sequence of bytes:
        \x20\xC0\x80\x21\xEE\xBA\x80\x22

    You want to escape it to valid UTF-8. It decomposes as:
    - \x20 : valid UTF-8,
        no change, code as \x20 (which encodes U+0020 in UTF-8)
    - \xC0\x80 : not UTF-8, escape it as:
        \xC0 becomes \xEE\xBB\x80 (which encodes U+EEC0 in UTF-8)
        \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
    - \x21: valid UTF-8,
        no change, code as \x21 (which encodes U+0021 in UTF-8)
    - \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
        \xEE becomes \xEE\xBB\xAE (which encodes U+EEEE in UTF-8)
        \xBA becomes \xEE\xBA\xBA (which encodes U+EEBA in UTF-8)
        \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
    - \x22: valid UTF-8,
        no change, code as \x22 (which encodes U+0022 in UTF-8)

    The generated sequence is 10 bytes longer, but it is now all valid UTF-8.
    To get back the original NON-UTF-8 sequence, you just need to convert each
    occurrence of [\xEE][\xBA-\xBB][\x80-\xBF] back to a single byte in
    [\x80-\xFF].
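    The reverse function g() can be sketched in the same way (Python again;
    the regular expression is the byte pattern just quoted, and the names are
    mine). Each escaped occurrence decodes to one codepoint in U+EE80..U+EEFF,
    whose low 8 bits are the original byte:

```python
import re

# g(): fold every escaped sequence [\xEE][\xBA-\xBB][\x80-\xBF] back to the
# single byte it stands for; all other bytes are kept as-is.
ESCAPE_SEQ = re.compile(rb'\xee[\xba\xbb][\x80-\xbf]')

def unescape_to_bytes(escaped: bytes) -> bytes:
    def to_byte(m):
        cp = ord(m.group(0).decode('utf-8'))   # one char in U+EE80..U+EEFF
        return bytes([cp - 0xEE00])            # keep only the low 8 bits
    return ESCAPE_SEQ.sub(to_byte, escaped)
```

    Running it over the 18-byte escaped sequence of the example restores the
    original 8 NON-UTF-8 bytes, as the bijectivity requirement demands.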

    You could as well have generated valid UTF-16 or UTF-32 with the SAME
    algorithm:

    * Escaping to UTF-16:
    - \x20 : valid UTF-8,
        no change, code as \u0020 (which encodes U+0020 in UTF-16)
    - \xC0\x80 : not UTF-8, escape it as:
        \xC0 becomes \uEEC0 (which encodes U+EEC0 in UTF-16)
        \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
    - \x21: valid UTF-8,
        no change, code as \u0021 (which encodes U+0021 in UTF-16)
    - \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
        \xEE becomes \uEEEE (which encodes U+EEEE in UTF-16)
        \xBA becomes \uEEBA (which encodes U+EEBA in UTF-16)
        \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
    - \x22: valid UTF-8,
        no change, code as \u0022 (which encodes U+0022 in UTF-16)

    * Escaping to valid UTF-32:
    - Just replace all occurrences of "\u" and "UTF-16" in the previous
    paragraph by "\U0000" and "UTF-32".
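    This works because the escaping step only picks codepoints; which encoding
    form serializes them is an independent choice. A small illustration with
    Python's built-in codecs, using the escaped codepoints of the worked
    example above:

```python
# The same escaped codepoint sequence, in the three encoding forms:
escaped = '\u0020\uEEC0\uEE80\u0021\uEEEE\uEEBA\uEE80\u0022'
utf8  = escaped.encode('utf-8')      # 18 bytes
utf16 = escaped.encode('utf-16-be')  # 8 code units = 16 bytes (all BMP)
utf32 = escaped.encode('utf-32-be')  # 8 code units = 32 bytes
```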

    But note that any occurrence of U+EE80 to U+EEFF in the original NON-UTF-8
    "text" is escaped, even though these are valid Unicode. However, choosing
    U+EE80 to U+EEFF is not a problem, because these PUA codepoints are very
    unlikely to be present in valid source texts in the absence of a prior
    PUA agreement.

    Remember that this is only a Transfer-Encoding-Syntax, not a new encoding.
    It does not require ANY new codepoint allocation by Unicode!



    This archive was generated by hypermail 2.1.5 : Fri Dec 17 2004 - 11:12:35 CST