From: Doug Ewell (dewell@roadrunner.com)
Date: Sat Sep 08 2007 - 00:43:51 CDT
Mark E. Shoulson <mark at kli dot org> wrote:
>> Somebody wanted to build that capability into an extension to UTF-8,
>> so it could faithfully represent invalid garbage. We were never able
>> to get him to work through what he wanted to do with the garbage thus
>> preserved.
>
> Is there an obvious reason we couldn't just treat the garbage UTF-8 as
> a string of 8-bit characters (might be part of a binary file or
> something) and base-64 encode them? That'll definitely preserve
> round-trippedness.
Not quite; you would no longer be able to tell the garbage UTF-8 from
the base-64 characters used to encode it.
Consider the following sequences of bytes (with annotations in
parentheses):
41 41 41
(three valid UTF-8 characters)
C3 80 C3 81 C3 82
(three more valid UTF-8 characters: U+00C0, U+00C1, U+00C2)
C0 C1 C2
(the same three characters in ISO 8859-1; invalid as UTF-8)
41 4D 41 41 77 51 44 43
(the above three values, taken as the 16-bit units 00 C0 00 C1 00 C2,
encoded in base64)
The last eight bytes could just as easily be valid ASCII (i.e. UTF-8)
text on their own. Indeed, some base64 sequences do spell out
natural-language words (but this is left as an exercise for the reader).
The point is that ASCII-as-base64 could not be distinguished from
ASCII-as-real-text.
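As a rough check of the byte sequences above, here is a small Python
sketch. The helper name is_valid_utf8 is made up for illustration. Note
that the eight bytes listed above base64-decode to 00 C0 00 C1 00 C2,
that is, the three values as 16-bit units; encoding the three raw bytes
C0 C1 C2 directly gives the four characters "wMHC" instead. Either way
the result is plain ASCII, which is the whole problem.

```python
import base64

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

garbage = bytes.fromhex("C0 C1 C2")                # invalid as UTF-8
listed = bytes.fromhex("41 4D 41 41 77 51 44 43")  # the eight bytes above

print(is_valid_utf8(garbage))           # the garbage is rejected
print(is_valid_utf8(listed))            # the base64 form is accepted...
print(listed.decode("ascii"))           # ...because it is ordinary text
print(base64.b64decode(listed).hex())   # what the eight bytes stand for
print(base64.b64encode(garbage))        # base64 of the three raw bytes
```

Nothing in the encoded bytes marks them as base64, so a receiver cannot
tell whether the sender meant the letters themselves or the bytes they
encode.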
The original requirement was to be able to represent any valid UTF-8
sequence unambiguously, and *also* to preserve invalid UTF-8 sequences.
One way of doing this would be by UTF-8-encoding the bytes from 0x80 to
0xFF as if they were characters in the range U+110000 to U+11007F.
Other possibilities exist. Note that although I describe the
requirement, for the sake of discussion, I don't in any way agree that
UTF-8 should be extended or modified to satisfy it.
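For the sake of discussion only, the U+110000 idea might be sketched in
Python as follows. The function names escape and unescape are invented
here, and the sketch assumes the ordinary 4-byte UTF-8 bit pattern is
simply carried past U+10FFFF, so each stray byte becomes F4 90 8x xx.
That output is deliberately NOT valid UTF-8, and a conforming decoder
will reject it.

```python
ESCAPE_BASE = 0x110000  # first code point past the Unicode range

def _encode_escape(byte: int) -> bytes:
    """Encode one stray byte (0x80..0xFF) with the 4-byte UTF-8 pattern."""
    cp = ESCAPE_BASE + (byte - 0x80)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def escape(data: bytes) -> bytes:
    """Pass valid UTF-8 through; escape each invalid byte individually."""
    out = bytearray()
    i = 0
    while i < len(data):
        try:
            data[i:].decode("utf-8")
            out += data[i:]              # the rest is valid; copy it
            break
        except UnicodeDecodeError as err:
            bad = i + err.start          # index of the first invalid byte
            out += data[i:bad]           # valid prefix, unchanged
            assert data[bad] >= 0x80     # ASCII bytes are never invalid
            out += _encode_escape(data[bad])
            i = bad + 1
    return bytes(out)

def unescape(data: bytes) -> bytes:
    """Turn F4 90 (80|81) xx groups back into the original stray bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i:i + 4]
        if (len(c) == 4 and c[0] == 0xF4 and c[1] == 0x90
                and c[2] in (0x80, 0x81) and 0x80 <= c[3] <= 0xBF):
            cp = ((c[0] & 0x07) << 18) | ((c[1] & 0x3F) << 12) \
                 | ((c[2] & 0x3F) << 6) | (c[3] & 0x3F)
            out.append(cp - ESCAPE_BASE + 0x80)
            i += 4
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

mixed = b"AAA" + bytes.fromhex("C3 80") + bytes.fromhex("C0 C1 C2")
print(escape(mixed).hex())
print(unescape(escape(mixed)) == mixed)
```

Unlike base64, the escaped form cannot collide with real text, because
valid UTF-8 never contains F4 followed by 90. Input that already
contains such bytes is itself invalid, so it is escaped byte by byte and
still round-trips.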
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages