Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Alastair Houghton via Unicode unicode at
Wed May 17 03:07:25 CDT 2017

> On 16 May 2017, at 20:43, Richard Wordingham via Unicode <unicode at> wrote:
> On Tue, 16 May 2017 11:36:39 -0700
> Markus Scherer via Unicode <unicode at> wrote:
>> Why do we care how we carve up an illegal sequence into subsequences?
>> Only for debugging and visual inspection. Maybe some process is using
>> illegal, overlong sequences to encode something special (à la Java
>> string serialization, "modified UTF-8"), and for that it might be
>> convenient too to treat overlong sequences as single errors.
> I think that's not quite true.  If we are moving back and forth through
> a buffer containing corrupt text, we need to make sure that moving three
> characters forward and then three characters back leaves us where we
> started.  That requires internal consistency.

That’s very true.  But the proposed change doesn’t affect that: whichever way an ill-formed sequence is carved up, you can still identify the boundaries correctly in both directions.
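
To make the round-trip point concrete, here is a rough sketch in Python.  It assumes the "maximal subpart" carving described in the Unicode Standard (each well-formed sequence, or each maximal ill-formed prefix of one, is a single unit), and the function names step_forward and step_backward are made up for illustration; they are not from any library or from the proposal.  A different carving would only change the forward rule, as long as a unit is never longer than four bytes and never swallows a second non-continuation byte.

```python
def step_forward(buf: bytes, i: int) -> int:
    """Return the index just past the unit that starts at buf[i].

    A unit is a well-formed UTF-8 sequence, or else a maximal subpart of an
    ill-formed one (the piece a decoder would replace with a single U+FFFD).
    """
    b0 = buf[i]
    if b0 < 0x80:
        return i + 1
    # Number of trail bytes and the allowed range of the FIRST trail byte,
    # following Table 3-7 of the Unicode Standard.
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif b0 == 0xE0:
        need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
    elif 0xE1 <= b0 <= 0xEC or 0xEE <= b0 <= 0xEF:
        need, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xED:
        need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates
    elif b0 == 0xF0:
        need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
    elif 0xF1 <= b0 <= 0xF3:
        need, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF4:
        need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
    else:
        # Stray continuation byte, or C0, C1, F5..FF: a one-byte error.
        return i + 1
    j = i + 1
    for k in range(need):
        if j >= len(buf):
            break                      # truncated at end of buffer
        k_lo, k_hi = (lo, hi) if k == 0 else (0x80, 0xBF)
        if not k_lo <= buf[j] <= k_hi:
            break                      # unit ends before the offending byte
        j += 1
    return j


def step_backward(buf: bytes, p: int) -> int:
    """Return the start of the unit that ends at boundary p (p > 0).

    A unit is at most four bytes long, so it is enough to try the four
    candidate starts before p and keep the earliest one whose forward
    step lands exactly on p.
    """
    for start in range(max(0, p - 4), p):
        if step_forward(buf, start) == p:
            return start
    raise ValueError("p is not a boundary after the start of the buffer")


# A deliberately corrupt buffer: a truncated 3-byte sequence, a valid
# 4-byte sequence, a stray continuation byte, and an overlong C0 AF.
buf = b"a\xe2\x82b\xf0\x9f\x98\x80\x80\xc0\xafz"

pos = 1
for _ in range(3):
    pos = step_forward(buf, pos)
for _ in range(3):
    pos = step_backward(buf, pos)
assert pos == 1  # three forward, three back: we are where we started
```

The backward step relies only on the forward rule being deterministic and local, which is why the choice of carving does not matter for Richard's consistency requirement.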

Kind regards,


