Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Philippe Verdy via Unicode
unicode at unicode.org
Tue May 16 05:44:00 CDT 2017
> The proposal actually does cover things that aren’t structurally valid,
> like your e0 e0 e0 example, which it suggests should be a single U+FFFD
> because the initial e0 denotes a three byte sequence, and your 80 80 80
> example, which it proposes should constitute three illegal subsequences
> (again, both reasonable). However, I’m not entirely certain about things
> e0 e0 c3 89
> which the proposal would appear to decode as
> U+FFFD U+FFFD U+FFFD U+FFFD (3)
> instead of a perhaps more reasonable
> U+FFFD U+FFFD U+00C9 (4)
> (the key part is the “without ever restricting trail bytes to less than
I also agree with that, because strings may be accessed from a random
position: if you land on byte 0x89, you can assume it's a trailing byte, so
you'll look backward and see 0xc3,0x89, which decodes correctly as U+00C9
without any error being detected.
So the only wrong bytes are the two initial occurrences of 0xe0, which are
individually converted to U+FFFD.
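
To make the backward lookup concrete, here is a minimal sketch in Python. The
name char_at and its return convention are my own assumptions for
illustration, not anything taken from the proposal; it leans on Python's
strict built-in UTF-8 decoder to validate the candidate sequence.

```python
def char_at(data: bytes, pos: int):
    """Hypothetical helper: find the code point covering data[pos].

    Returns (start, text) if a well-formed sequence covers pos,
    otherwise (pos, None).
    """
    start = pos
    # Step back over at most three trail bytes (0x80..0xBF) to find a lead byte.
    while start > 0 and pos - start < 3 and 0x80 <= data[start] <= 0xBF:
        start -= 1

    lead = data[start]
    if lead < 0x80:
        n = 1
    elif 0xC2 <= lead <= 0xDF:
        n = 2
    elif 0xE0 <= lead <= 0xEF:
        n = 3
    elif 0xF0 <= lead <= 0xF4:
        n = 4
    else:
        n = 0  # a trail byte or a byte that can never start a sequence

    chunk = data[start:start + n]
    # The sequence must be complete and must actually reach position pos.
    if n and len(chunk) == n and start + n > pos:
        try:
            return start, chunk.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            pass
    return pos, None


data = bytes.fromhex("e0e0c389")
print(char_at(data, 3))  # (2, 'É'): 0x89 is a trail byte of c3 89
print(char_at(data, 1))  # (1, None): the second e0 starts nothing valid
```

Landing on 0x89 recovers U+00C9 regardless of the two stray e0 bytes in
front of it, which is the behaviour a forward decoder should agree with.
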
In summary: when you detect any ill-formed sequence, replace only the first
code unit by U+FFFD and restart scanning from the next code unit, without
skipping over multiple bytes.
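
The rule above can be sketched as follows, in Python. This is only an
illustration under my own assumptions: the names expected_length and
decode_one_fffd_per_unit are hypothetical, and well-formedness is delegated
to Python's strict built-in UTF-8 decoder rather than spelled out by hand.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # trail byte (0x80..0xBF) or never-valid byte (C0, C1, F5..FF)


def decode_one_fffd_per_unit(data: bytes) -> str:
    """Replace only the first code unit of an ill-formed sequence, then rescan."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        if n and len(chunk) == n:
            try:
                out.append(chunk.decode("utf-8"))  # strict decoding
                i += n
                continue
            except UnicodeDecodeError:
                pass  # ill-formed: fall through to the replacement below
        out.append(REPLACEMENT)
        i += 1  # restart from the very next code unit, never skip further
    return "".join(out)


for hexbytes in ("e0e0c389", "808080", "e0e0e0"):
    decoded = decode_one_fffd_per_unit(bytes.fromhex(hexbytes))
    print(hexbytes, " ".join(f"U+{ord(c):04X}" for c in decoded))
```

For e0 e0 c3 89 this yields U+FFFD U+FFFD U+00C9, i.e. result (4) in the
quoted message; 80 80 80 and e0 e0 e0 each yield three U+FFFD.
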
This means that multiple occurrences of U+FFFD are not only the best
practice, they also match the intended design of UTF-8 to allow access from