Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at
Fri May 26 08:22:54 CDT 2017

> Citing directly from the PRI:
> >>>>
> The term "maximal subpart of the ill-formed subsequence" refers to the
> longest potentially valid initial subsequence or, if none, then to the next
> single code unit.
> >>>>

The way I understand it, C0 80 will have TWO maximal subparts: C0 cannot
begin any valid initial subsequence, so only the next single code unit (C0)
is considered. The following byte 80 cannot begin a valid initial
subsequence either, so again only the next single code unit (80) is
considered. You get a U+FFFD replacement emitted twice. This handles all
cases of "overlong" sequences that were allowed by the old UTF-8
definition in the first RFC.

For E3 80 20, there will be only ONE maximal subpart, because E3 80 is a
valid initial subsequence of a three-byte sequence that is then cut short,
so a single U+FFFD replacement will be emitted, followed by the valid
UTF-8 sequence (20), which correctly decodes as U+0020.
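
To make the two cases concrete, here is a minimal Python sketch of a decoder
that emits one U+FFFD per maximal subpart, as I read the definition quoted
above. The names second_byte_range and decode_maximal_subparts are mine and
purely illustrative; the byte ranges are the well-formed UTF-8 ranges from
the Unicode Standard. It is not meant as a reference implementation.

```python
REPLACEMENT = "\ufffd"

def second_byte_range(lead):
    """Return (length, lo, hi) for a lead byte: the sequence length and the
    allowed range of the second byte. Return None if the byte is ASCII or
    can never start a well-formed multi-byte sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF   # excludes overlong three-byte forms
    if lead == 0xED:
        return 3, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF   # excludes overlong four-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F   # excludes values above U+10FFFF
    return None

def decode_maximal_subparts(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if byte < 0x80:                      # ASCII decodes to itself
            out.append(chr(byte))
            i += 1
            continue
        spec = second_byte_range(byte)
        if spec is None:                     # C0, C1, F5..FF, or a stray 80..BF
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = spec
        j = i + 1
        # Extend the candidate while each following byte is still acceptable.
        while j < i + length and j < n:
            lo_j, hi_j = (lo, hi) if j == i + 1 else (0x80, 0xBF)
            if not lo_j <= data[j] <= hi_j:
                break
            j += 1
        if j == i + length:                  # complete well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:                                # maximal subpart: one U+FFFD
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

print(ascii(decode_maximal_subparts(b"\xc0\x80")))      # '\ufffd\ufffd'
print(ascii(decode_maximal_subparts(b"\xe3\x80\x20")))  # '\ufffd '
```

C0 80 yields two replacements, while a truncated three-byte sequence (E3 80)
followed by 20 yields one replacement and then U+0020. As far as I know,
CPython's own bytes.decode("utf-8", errors="replace") gives the same results
for these inputs.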

Good! This means that this proposal makes sense and is compatible with
random access within the encoded text, without having to look backward
over an indefinite number of code units, and we never have to handle any
case with a possibly infinite number of code units mapped to the same U+FFFD.
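
The bounded look-back can be sketched as follows, again in Python and again
with hypothetical names (second_byte_range, unit_end, unit_span). Under the
maximal-subpart reading, a byte outside 80..BF always begins a new unit, so
from any offset it is enough to inspect at most three preceding bytes to
find the unit (well-formed sequence or maximal subpart) that contains it.
This is only an illustration under that assumption.

```python
def second_byte_range(lead):
    """(length, lo, hi) for a byte that can start a multi-byte sequence,
    else None (ASCII, C0, C1, F5..FF, or a continuation byte)."""
    table = [
        (0xC2, 0xDF, 2, 0x80, 0xBF),
        (0xE0, 0xE0, 3, 0xA0, 0xBF),
        (0xE1, 0xEC, 3, 0x80, 0xBF),
        (0xED, 0xED, 3, 0x80, 0x9F),
        (0xEE, 0xEF, 3, 0x80, 0xBF),
        (0xF0, 0xF0, 4, 0x90, 0xBF),
        (0xF1, 0xF3, 4, 0x80, 0xBF),
        (0xF4, 0xF4, 4, 0x80, 0x8F),
    ]
    for first, last, length, lo, hi in table:
        if first <= lead <= last:
            return length, lo, hi
    return None

def unit_end(data, start):
    """End offset (exclusive) of the unit that begins at data[start]."""
    spec = second_byte_range(data[start])
    if spec is None:
        return start + 1                 # single-byte unit
    length, lo, hi = spec
    end = start + 1
    while end < min(start + length, len(data)):
        lo_e, hi_e = (lo, hi) if end == start + 1 else (0x80, 0xBF)
        if not lo_e <= data[end] <= hi_e:
            break
        end += 1
    return end

def unit_span(data, pos):
    """(start, end) of the unit containing data[pos], looking back at most
    three bytes and never scanning from the beginning of the text."""
    for start in range(pos, max(pos - 3, 0) - 1, -1):
        if not 0x80 <= data[start] <= 0xBF:
            end = unit_end(data, start)  # nearest byte that begins a unit
            if end > pos:
                return start, end
            break                        # that unit ended before pos
    return pos, pos + 1                  # stray continuation byte on its own

data = b"a\xe3\x80 \xc0\x80\x80\x80\x80\x80"
print(unit_span(data, 2))   # (1, 3): 80 belongs to the subpart E3 80
print(unit_span(data, 9))   # (9, 10): a stray 80 is its own unit
```

However long the run of stray continuation bytes is, each one is resolved
locally, and no single U+FFFD ever covers more than three code units.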