Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Shawn Steele via Unicode unicode at
Wed May 31 14:28:03 CDT 2017

> it’s more meaningful for whoever sees the output to see a single U+FFFD representing 
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid lead byte and 
> then another for an “unexpected” trailing byte.

I disagree.  It may be more meaningful for some applications to have a single U+FFFD representing an illegally encoded 2-byte NULL than to have two U+FFFDs.  But then you can't tell whether it was an illegally encoded 2-byte NULL, an illegally encoded 3-byte NULL, or something else, so information that other applications may care about is lost.
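
To make the information-loss point concrete, here is a rough Python sketch.  The one_per_sequence function and its declared_length helper are hypothetical names made up for illustration; they are one possible reading of the "single U+FFFD" policy, not anything taken from the proposal text.  The per-byte results come from Python 3's built-in decoder with errors="replace".

```python
REPLACEMENT = "\ufffd"

def declared_length(lead: int) -> int:
    """Length the lead byte's bit pattern claims (1 if it is not a lead byte)."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 1

def one_per_sequence(data: bytes) -> str:
    """Hypothetical policy: one U+FFFD for a lead byte plus the trail bytes it claims."""
    out, i = [], 0
    while i < len(data):
        n = declared_length(data[i])
        j = i + 1
        # Absorb up to n-1 continuation bytes (0x80..0xBF) after the lead byte.
        while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))  # strict decode of the chunk
        except UnicodeDecodeError:
            out.append(REPLACEMENT)
        i = j
    return "".join(out)

overlong_2 = b"\xc0\x80"      # overlong 2-byte encoding of U+0000
overlong_3 = b"\xe0\x80\x80"  # overlong 3-byte encoding of U+0000

# Collapsed policy: both inputs become the same single U+FFFD.
print(len(one_per_sequence(overlong_2)), len(one_per_sequence(overlong_3)))

# Python 3's built-in replacement keeps the two inputs distinguishable by count.
print(len(overlong_2.decode("utf-8", errors="replace")),
      len(overlong_3.decode("utf-8", errors="replace")))
```

Under the collapsed policy the two inputs are indistinguishable in the output, while the per-byte replacement still shows how long the bad sequence was.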

Personally, I prefer the "emit a U+FFFD if the sequence is invalid, drop the byte, and try again" approach.  
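
As a minimal sketch of that approach in Python (decode_drop_and_retry is a name invented here, and leaning on Python's strict decoder as the validity check is an assumption of this sketch, not a required implementation):

```python
def decode_drop_and_retry(data: bytes) -> str:
    """Emit U+FFFD when no valid sequence starts at the current byte,
    drop that one byte, and try again from the next byte."""
    out, i = [], 0
    while i < len(data):
        # Try the shortest well-formed sequence (1 to 4 bytes) starting at i.
        for n in (1, 2, 3, 4):
            try:
                out.append(data[i:i + n].decode("utf-8"))  # strict validity check
                i += n
                break
            except UnicodeDecodeError:
                continue
        else:
            # Nothing valid starts here: one U+FFFD, skip exactly one byte.
            out.append("\ufffd")
            i += 1
    return "".join(out)

print(ascii(decode_drop_and_retry(b"a\xc0\x80b")))  # 'a\ufffd\ufffdb'
```

Each ill-formed byte yields its own U+FFFD, so the decoder needs no state about what the bad lead byte claimed to introduce.  Note that for a truncated but otherwise valid prefix such as E2 82 this sketch emits two U+FFFDs, which differs from the "maximal subpart" practice of emitting one.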

