Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Wed May 17 03:07:25 CDT 2017
> On 16 May 2017, at 20:43, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> On Tue, 16 May 2017 11:36:39 -0700
> Markus Scherer via Unicode <unicode at unicode.org> wrote:
>> Why do we care how we carve up an illegal sequence into subsequences?
>> Only for debugging and visual inspection. Maybe some process is using
>> illegal, overlong sequences to encode something special (à la Java
>> string serialization, "modified UTF-8"), and for that it might be
>> convenient too to treat overlong sequences as single errors.
> I think that's not quite true. If we are moving back and forth through
> a buffer containing corrupt text, we need to make sure that moving three
> characters forward and then three characters back leaves us where we
> started. That requires internal consistency.
That’s very true. But the proposed change doesn’t actually affect that; it’s still the case that you can correctly identify boundaries in both directions.
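To make the point concrete, here is a minimal sketch of finding boundaries in both directions through a buffer containing ill-formed UTF-8. It is not taken from the proposal or from any real implementation. It assumes one particular policy: each well-formed sequence, each maximal prefix of a well-formed sequence, and each remaining stray byte counts as one unit. The names `next_boundary`, `prev_boundary` and `move` are hypothetical.

```python
def _tail_ranges(lead):
    """Allowed (low, high) ranges for the bytes that may follow `lead`."""
    cont = (0x80, 0xBF)
    if 0xC2 <= lead <= 0xDF:
        return [cont]
    if 0xE0 <= lead <= 0xEF:
        # E0 and ED restrict the second byte (no overlongs, no surrogates).
        first = (0xA0, 0xBF) if lead == 0xE0 else (0x80, 0x9F) if lead == 0xED else cont
        return [first, cont]
    if 0xF0 <= lead <= 0xF4:
        # F0 and F4 restrict the second byte (no overlongs, nothing above U+10FFFF).
        first = (0x90, 0xBF) if lead == 0xF0 else (0x80, 0x8F) if lead == 0xF4 else cont
        return [first, cont, cont]
    return []  # ASCII, a stray continuation byte, or a byte that never starts a sequence


def next_boundary(buf, i):
    """Boundary after the unit that starts at boundary i (requires i < len(buf))."""
    j = i + 1
    for low, high in _tail_ranges(buf[i]):
        if j >= len(buf) or not low <= buf[j] <= high:
            break  # sequence cut short: the unit ends at the maximal valid prefix
        j += 1
    return j


def prev_boundary(buf, j):
    """Boundary before boundary j (requires j > 0), looking back at most four bytes."""
    # A unit longer than one byte must start with a lead byte, and a lead byte
    # is never absorbed into an earlier unit, so any match here is a boundary.
    for k in range(max(0, j - 4), j - 1):
        if next_boundary(buf, k) == j:
            return k
    return j - 1  # otherwise the previous unit is a single byte


def move(buf, pos, count):
    """Move `count` units forward (positive) or backward (negative) from a boundary."""
    for _ in range(abs(count)):
        if count > 0 and pos < len(buf):
            pos = next_boundary(buf, pos)
        elif count < 0 and pos > 0:
            pos = prev_boundary(buf, pos)
    return pos


# 'a', U+20AC, a truncated 4-byte sequence, an overlong form, a surrogate, 'z'
buf = b"a\xe2\x82\xac\xf0\x9f\x98\xc0\xaf\xed\xa0\x80z"
start = 4
assert move(buf, move(buf, start, 3), -3) == start
```

The backward step only re-runs the forward rule from a few candidate positions, so both directions agree by construction. A different policy for carving up the ill-formed bytes would need its own `next_boundary`, and its own check that the same look-back still recovers the boundaries.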