Re: Roundtripping Solved

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 16 2004 - 16:08:04 CST

  • Next message: Philippe Verdy: "Re: Is it roundtripping or transfer-encoding (was: RE: Roundtripping Solved)"

    From: "Arcane Jill" <arcanejill@ramonsky.com>
    > Lars's current implementation of this scheme is that his "f" "escapes" the
    > binary octet 1bbbbbbb to 11101110 1011101b 10bbbbbb (or equivalently, byte
    > x becomes the character U+EE00 + x). He is unhappy with this because
    > characters in the range U+EE80 to U+EEFF might be found in real text. So
    > you and I have, between us, suggested three alternative escaping
    > functions, in an attempt to find an escape sequence with a vanishingly
    > small probability of being found in real text. I'm not quite sure why Lars
    > isn't happy with these suggestions - maybe his goal has still not been
    > clearly stated - but either way, since nobody is proposing an amendment to
    > UTFs, it surely isn't the business of the UTC.

    What Lars wants has a name: it's a "transfer-encoding-syntax", which
    allows transporting arbitrary code unit sequences through a more
    restricted environment. This is not a new thing, but it is not specified
    by Unicode.

    It is specified in specific interfaces or APIs, as part of a protocol
    accepted by two compliant parties. Such Transfer-Encoding-Syntaxes are used:
    - in MIME, for transporting non-plain-text documents: Base64, UUEncoding,
    Hex, Quoted-Printable...
    - in programming languages: the special "\" prefix used to escape some
    characters (including '\' itself) with a sequence whose meaning is specified
    by the language itself, or doubling occurrences of single-quotes in quoted
    SQL string constants.
    - in many protocols: notably COBS (which allows escaping any restricted
    byte, such as 0x00); many variations of the COBS technique are in use.
    - in HTML: for example "&quot;" to escape the double-quote character.

    Remember that all this is a notation. What makes it a
    Transfer-Encoding-Syntax is that the notation is published and easily
    implementable by various processes, because the specification is well
    known and can easily be agreed upon by two distinct processes that accept
    the notation under a well-defined name.

    A Transfer-Encoding-Syntax does not alter the meaning or encoding of the
    original document, and it is by necessity completely bijective: given an
    arbitrary code unit sequence x in a value set F, it transforms it into a
    valid code unit sequence y=f(x) in a value set G, and is reversible back
    to x with a second "decoding" function g, so that x=g(y)=g(f(x)).

    A Transfer-Encoding-Syntax is fully bijective between the two definition
    domains of f() and g(): any valid code unit sequence y in G (the definition
    domain of g) MUST be decodable without error to F (the definition domain of
    f), so that y=f(g(y)) for any valid y (in G).

    Note that F and G are almost always distinct, even if, often but not
    always, F includes G (F will not include G, for example, if f() transforms
    any sequence of bytes F="[\x00-\xFF]*" into a sequence of valid UTF-32 code
    units G="[\U00000000-\U0010FFFF]*").

    There's a way to create such a pair of functions f() and g():
    - G must be the complete valid value range of Unicode codepoints, as
    indicated above.
    - F must be the complete valid value range of bytes, as indicated above.
    - f() transforms each invalid byte '\xnn' into the codepoint U+EEnn (note
    that since all bytes \x00-\x7F are valid, only U+EE80 to U+EEFF will be
    used).
    - f() MUST also transform any valid byte sequence that normally encodes a
    codepoint in U+EE80 to U+EEFF, mapping each of those VALID bytes '\xnn' to
    the codepoint U+EEnn.
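    As a sketch, the escaping function f() described above could look like the
    following (Python, with function names of my own invention; the proposal
    itself specifies no API). Invalid bytes, and the bytes of any valid
    encoding of U+EE80..U+EEFF, are each mapped to the codepoint U+EE00 + byte
    value:

```python
# Sketch of f(): escape an arbitrary byte string to a valid Unicode string.
# Names are hypothetical, not part of the proposal.

def escape_to_text(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        decoded, length = None, 0
        for n in (1, 2, 3, 4):              # UTF-8 sequences are 1..4 bytes
            try:
                decoded, length = data[i:i + n].decode('utf-8'), n
                break
            except UnicodeDecodeError:
                pass
        if decoded is not None and not (0xEE80 <= ord(decoded) <= 0xEEFF):
            out.append(decoded)             # valid UTF-8: pass it through
            i += length
        else:
            # Invalid byte, or a byte of a valid U+EE80..U+EEFF sequence:
            # escape this single byte as U+EE00 + byte value.
            out.append(chr(0xEE00 + data[i]))
            i += 1
    return ''.join(out)
```

    Note that a valid three-byte sequence for U+EE80..U+EEFF is escaped one
    byte at a time: once its lead byte is escaped, the orphaned continuation
    bytes are themselves invalid and get escaped on the following iterations.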

    Note that the UTF-8 encoding of U+EE80 to U+EEFF is:
    source bits: 1110 11101b bbbbbb
    UTF-8 bits: 11101110 1011101b 10bbbbbb
    UTF-8 bytes: [\xEE][\xBA-\xBB][\x80-\xBF]
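    The byte pattern above can be checked directly with any UTF-8 codec; for
    example, with Python's built-in one:

```python
# Endpoints of the escape range, serialized to UTF-8:
assert chr(0xEE80).encode('utf-8') == b'\xee\xba\x80'
assert chr(0xEEFF).encode('utf-8') == b'\xee\xbb\xbf'
```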

    For example, consider this NOT-UTF-8 sequence of bytes:
        \x20\xC0\x80\x21\xEE\xBA\x80\x22

    You want to escape it to valid UTF-8. It decomposes as:
    - \x20 : valid UTF-8,
        no change, code as \x20 (which encodes U+0020 in UTF-8)
    - \xC0\x80 : not UTF-8, escape it as:
        \xC0 becomes \xEE\xBB\x80 (which encodes U+EEC0 in UTF-8)
        \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
    - \x21: valid UTF-8,
        no change, code as \x21 (which encodes U+0021 in UTF-8)
    - \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
        \xEE becomes \xEE\xBB\xAE (which encodes U+EEEE in UTF-8)
        \xBA becomes \xEE\xBA\xBA (which encodes U+EEBA in UTF-8)
        \x80 becomes \xEE\xBA\x80 (which encodes U+EE80 in UTF-8)
    - \x22: valid UTF-8,
        no change, code as \x22 (which encodes U+0022 in UTF-8)

    The generated sequence is 10 bytes longer, but it is now all valid UTF-8.
    To get back the original NON-UTF-8 sequence, you just need to convert each
    occurrence of [\xEE][\xBA-\xBB][\x80-\xBF] back to a single byte in
    [\x80-\xFF].
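    The reverse function g() can be sketched in the same way (Python again;
    the regular expression is the byte pattern just quoted, and the names are
    mine). Each escaped occurrence decodes to one codepoint in U+EE80..U+EEFF,
    whose low 8 bits are the original byte:

```python
import re

# g(): fold every escaped sequence [\xEE][\xBA-\xBB][\x80-\xBF] back to the
# single byte it stands for; all other bytes are kept as-is.
ESCAPE_SEQ = re.compile(rb'\xee[\xba\xbb][\x80-\xbf]')

def unescape_to_bytes(escaped: bytes) -> bytes:
    def to_byte(m):
        cp = ord(m.group(0).decode('utf-8'))   # one char in U+EE80..U+EEFF
        return bytes([cp - 0xEE00])            # keep only the low 8 bits
    return ESCAPE_SEQ.sub(to_byte, escaped)
```

    Running it over the 18-byte escaped sequence of the example restores the
    original 8 NON-UTF-8 bytes, as the bijectivity requirement demands.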

    You could as well have generated valid UTF-16 or UTF-32 with the SAME
    algorithm:

    * Escaping to UTF-16:
    - \x20 : valid UTF-8,
        no change, code as \u0020 (which encodes U+0020 in UTF-16)
    - \xC0\x80 : not UTF-8, escape it as:
        \xC0 becomes \uEEC0 (which encodes U+EEC0 in UTF-16)
        \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
    - \x21: valid UTF-8,
        no change, code as \u0021 (which encodes U+0021 in UTF-16)
    - \xEE\xBA\x80: valid UTF-8, but it would encode U+EE80, escape it:
        \xEE becomes \uEEEE (which encodes U+EEEE in UTF-16)
        \xBA becomes \uEEBA (which encodes U+EEBA in UTF-16)
        \x80 becomes \uEE80 (which encodes U+EE80 in UTF-16)
    - \x22: valid UTF-8,
        no change, code as \u0022 (which encodes U+0022 in UTF-16)

    * Escaping to valid UTF-32:
    - Just replace all occurrences of "\u" and "UTF-16" in the previous
    paragraph by "\U0000" and "UTF-32".
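    This works because the escaping step only picks codepoints; which encoding
    form serializes them is an independent choice. A small illustration with
    Python's built-in codecs, using the escaped codepoints of the worked
    example above:

```python
# The same escaped codepoint sequence, in the three encoding forms:
escaped = '\u0020\uEEC0\uEE80\u0021\uEEEE\uEEBA\uEE80\u0022'
utf8  = escaped.encode('utf-8')      # 18 bytes
utf16 = escaped.encode('utf-16-be')  # 8 code units = 16 bytes (all BMP)
utf32 = escaped.encode('utf-32-be')  # 8 code units = 32 bytes
```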

    But note that any occurrence of U+EE80 to U+EEFF in the original NON-UTF-8
    "text" is escaped, even though these are valid Unicode. However, choosing
    U+EE80 to U+EEFF is not a problem, because these PUA codepoints are very
    unlikely to be present in valid source texts in the absence of a prior
    PUA agreement.

    Remember that this is only a Transfer-Encoding-Syntax, not a new encoding.
    It does not require ANY new codepoint allocation by Unicode!



    This archive was generated by hypermail 2.1.5 : Fri Dec 17 2004 - 11:12:35 CST