Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Tue, 16 May 2017 14:21:53 +0100

On Tue, 16 May 2017 14:44:44 +0200
Hans Åberg via Unicode <unicode_at_unicode.org> wrote:

> > On 15 May 2017, at 12:21, Henri Sivonen via Unicode
> > <unicode_at_unicode.org> wrote:
> ...
> > I think Unicode should not adopt the proposed change.
>
> It would be useful, for use with filesystems, to have Unicode
> codepoint markers that indicate how UTF-8, including non-valid
> sequences, is translated into UTF-32 in a way that the original octet
> sequence can be restored.

Escape sequences for the inappropriate bytes are the natural technique.
Your problem is smoothly transitioning so that the escape character is
always escaped when it means itself. Strictly, it can't be done.
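
As a rough illustration of the escaping idea, here is a minimal Python
sketch. It is not a proposal for any standard. The choice of U+001B as the
escape character and the names decode_escaped and encode_escaped are
assumptions made for the example; Python's built-in "surrogateescape"
error handler is used only as a convenient way to locate the bytes that
are not part of well-formed UTF-8. Each such byte becomes the escape
character followed by U+0080..U+00FF, and a literal escape character is
doubled.

```python
ESC = "\u001b"  # assumed escape character; any agreed code point would do


def decode_escaped(data: bytes) -> str:
    """Decode UTF-8, escaping ill-formed bytes so they can be restored."""
    out = []
    # surrogateescape maps each undecodable byte b to U+DC00 + b.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append(ESC + chr(cp - 0xDC00))  # ESC + U+0080..U+00FF
        elif ch == ESC:
            out.append(ESC + ESC)  # the escape character escapes itself
        else:
            out.append(ch)
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped: restore the original octets."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling escape character")
        nxt = text[i + 1]
        if nxt == ESC:
            out += ESC.encode("utf-8")  # doubled ESC means a literal ESC
        elif 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))  # escaped raw byte
        else:
            raise ValueError("unknown escape sequence")
        i += 2
    return bytes(out)


raw = b"caf\xe9 \x1b ok"  # Latin-1 byte plus a literal ESC
assert encode_escaped(decode_escaped(raw)) == raw
```

The round trip only works for text that went through decode_escaped in
the first place. Existing text that already contains the escape character
unescaped is exactly where the transition problem shows up.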

Of course, some sequences of escaped characters should be prohibited.
Checking could be fiddly.
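
One reading of the prohibition, sketched in Python under assumptions of
my own: if bytes are escaped as an escape character followed by
U+0080..U+00FF (U+001B is assumed as the escape character here, doubled
when it means itself), then a run of escaped bytes that contains a
well-formed UTF-8 sequence should be rejected, since a decoder would have
produced the character itself and the octets would otherwise have two
spellings. The function names are hypothetical, and Python's
"surrogateescape" error handler is used only to test whether a run holds
anything decodable.

```python
ESC = "\u001b"  # assumed escape character for this sketch


def _run_is_canonical(run: bytes) -> bool:
    """True if no part of the escaped run decodes as well-formed UTF-8."""
    decoded = run.decode("utf-8", errors="surrogateescape")
    # Undecodable bytes come back as U+DC80..U+DCFF; anything else means
    # some of the escaped bytes formed a real character.
    return all(0xDC80 <= ord(c) <= 0xDCFF for c in decoded)


def escapes_are_canonical(text: str) -> bool:
    """Check every maximal run of escaped bytes in text."""
    run = bytearray()
    i = 0
    while i < len(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if text[i] == ESC and nxt and 0x80 <= ord(nxt) <= 0xFF:
            run.append(ord(nxt))  # extend the current run of escaped bytes
            i += 2
            continue
        # Anything else ends the current run, so check it now.
        if not _run_is_canonical(bytes(run)):
            return False
        run.clear()
        if text[i] == ESC:
            if nxt != ESC:
                return False  # dangling or unknown escape
            i += 2  # doubled ESC is a literal escape character
        else:
            i += 1
    return _run_is_canonical(bytes(run))


assert escapes_are_canonical("caf\x1b\xe9")           # lone 0xE9 is fine
assert not escapes_are_canonical("\x1b\xc3\x1b\xa9")  # C3 A9 is U+00E9
```

Note that the check has to look at whole runs rather than single escapes,
which is part of why it is fiddly.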
 
Richard.
Received on Tue May 16 2017 - 08:22:37 CDT
