Re: [icu-support] complete binary/utf mapping

From: Doug Ewell (dewell@roadrunner.com)
Date: Sat Sep 08 2007 - 00:43:51 CDT

Next message: Mahesh T. Pai: "Re: [indic] Re: Feedback on PR-104"

Previous message: James Kass: "Re: [indic] Re: Feedback on PR-104"
In reply to: Mark E. Shoulson: "Re: [icu-support] complete binary/utf mapping"
Next in thread: Philippe Verdy: "RE: [icu-support] complete binary/utf mapping"

Mark E. Shoulson <mark at kli dot org> wrote:

>> Somebody wanted to build that capability into an extension to UTF-8,
>> so it could faithfully represent invalid garbage. We were never able
>> to get him to work through what he wanted to do with the garbage thus
>> preserved.
>
> Is there an obvious reason we couldn't just treat the garbage UTF-8 as
> a string of 8-bit characters (might be part of a binary file or
> something) and base-64 encode them? That'll definitely preserve
> round-trippedness.

Not quite; you would no longer be able to tell the garbage UTF-8 from
the base-64 characters used to encode it.

Consider the following sequences of bytes (with annotations in
parentheses):

41 41 41
(three valid UTF-8 characters)

C3 80 C3 81 C3 82
(three more valid UTF-8 characters)

C0 C1 C2
(above three in ISO 8859-1; invalid UTF-8)

41 4D 41 41 77 51 44 43
(above three values as 16-bit big-endian units, 00 C0 00 C1 00 C2,
encoded in base64; as ASCII text these bytes read "AMAAwQDC")

The last eight bytes could just as easily be valid ASCII (i.e. UTF-8)
text on their own. Indeed, some base64 sequences do spell out
natural-language words (but this is left as an exercise for the reader).
The point is that ASCII-as-base64 could not be distinguished from
ASCII-as-real-text.
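
As a rough illustration of the four byte sequences above, here is a
minimal Python sketch. The helper name is_valid_utf8 and the variable
names are made up for this example. It assumes Python's strict "utf-8"
codec as the validity test, and it assumes the eight-byte example was
produced by base64-encoding the three values as UTF-16BE code units,
which is what those bytes decode to.

```python
import base64

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

ascii_text = bytes.fromhex("414141")            # "AAA"
two_byte   = bytes.fromhex("C380C381C382")      # U+00C0 U+00C1 U+00C2 in UTF-8
latin1     = bytes.fromhex("C0C1C2")            # same characters in ISO 8859-1
encoded    = bytes.fromhex("414D414177514443")  # the eight bytes from the example

# base64 of the three raw bytes, and of the same values as UTF-16BE units
raw_b64   = base64.b64encode(latin1)
utf16_b64 = base64.b64encode(two_byte.decode("utf-8").encode("utf-16-be"))

for name, seq in [("ascii_text", ascii_text), ("two_byte", two_byte),
                  ("latin1", latin1), ("encoded", encoded)]:
    print(name, seq.hex(" ").upper(), "valid UTF-8:", is_valid_utf8(seq))

print(raw_b64, utf16_b64, utf16_b64 == encoded)
```

Either base64 form is plain ASCII, so a strict UTF-8 decoder accepts it
without complaint, and nothing in the bytes themselves marks it as
encoded data rather than ordinary text.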

The original requirement was to be able to represent any valid UTF-8
sequence unambiguously, and *also* to preserve invalid UTF-8 sequences.
One way of doing this would be by UTF-8-encoding the bytes from 0x80 to
0xFF as if they were characters in the range U+110000 to U+11007F, which
lies just past the end of the Unicode code space at U+10FFFF.
Other possibilities exist. Note that although I describe the
requirement, for the sake of discussion, I don't in any way agree that
UTF-8 should be extended or modified to satisfy it.
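
For the sake of discussion only, the escaping idea described above could
be sketched in Python roughly as follows. Everything here is an
assumption made for illustration: the function names (encode_extended,
escape_invalid, unescape), the choice to escape one offending byte at a
time, and the use of the ordinary 4-byte UTF-8 bit pattern for values
past U+10FFFF. It is not part of any standard and is not a proposal.

```python
ESCAPE_BASE = 0x110000  # first value past the Unicode code space

def encode_extended(value: int) -> bytes:
    """Apply the 4-byte UTF-8 bit pattern without the U+10FFFF limit."""
    if not 0x10000 <= value <= 0x1FFFFF:
        raise ValueError("value does not fit the 4-byte pattern")
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])

def escape_invalid(data: bytes) -> bytes:
    """Copy valid UTF-8 through; escape each offending byte (always >= 0x80)."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            out += data[pos:]          # the rest is valid, copy it unchanged
            break
        except UnicodeDecodeError as err:
            bad_at = pos + err.start   # err.start is relative to data[pos:]
            out += data[pos:bad_at]
            out += encode_extended(ESCAPE_BASE + (data[bad_at] - 0x80))
            pos = bad_at + 1
    return bytes(out)

def unescape(data: bytes) -> bytes:
    """Reverse escape_invalid: turn F4 90 80..81 xx back into one byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        chunk = data[i:i + 4]
        if (len(chunk) == 4 and chunk[0] == 0xF4 and chunk[1] == 0x90
                and chunk[2] in (0x80, 0x81) and 0x80 <= chunk[3] <= 0xBF):
            offset = ((chunk[2] & 0x3F) << 6) | (chunk[3] & 0x3F)
            out.append(0x80 + offset)
            i += 4
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

garbage = bytes.fromhex("C0C1C2")
escaped = escape_invalid(garbage)
print(escaped.hex(" ").upper())        # F4 90 81 80 F4 90 81 81 F4 90 81 82
print(unescape(escaped) == garbage)    # True
```

Unlike the base64 approach, the escaped bytes cannot be mistaken for
valid text, because a conforming UTF-8 decoder rejects any sequence
beginning F4 90. That same property is why the result is no longer
UTF-8.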

--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages


This archive was generated by hypermail 2.1.5 : Sat Sep 08 2007 - 00:46:23 CDT