Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Wed May 31 13:34:15 CDT 2017
On 31 May 2017, at 18:43, Shawn Steele via Unicode <unicode at unicode.org> wrote:
> It is unclear to me what the expected behavior would be for this corruption if, for example, there were merely a half dozen 0x80 in the middle of ASCII text? Is that garbage a single "character"? Perhaps because it's a consecutive string of bad bytes? Or should it be 6 characters since they're nonsense? Or maybe 2 characters because the maximum # of trail bytes we can have is 3?
It should be six U+FFFD characters, because 0x80 is not a lead byte. Basically, the new proposal is that we should decode bytes that structurally match UTF-8, and if the encoding is then illegal (because it’s over-long, because it’s a surrogate or because it’s over U+10FFFF) then the entire thing is replaced with U+FFFD. If, on the other hand, we get a sequence that isn’t structurally valid UTF-8, we replace the maximally *structurally* valid subpart with U+FFFD and continue.
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail bytes?
Then you get two U+FFFDs.
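To make the distinction concrete, here is a minimal sketch (my own illustration, not the proposal’s reference implementation) of a decoder following the proposed rule: a structurally complete sequence that turns out to be overlong, a surrogate, or above U+10FFFF collapses to a *single* U+FFFD, while a structural break gets one U+FFFD per maximal structurally valid subpart. For brevity it treats lead bytes 0xF8–0xFF as invalid singletons.

```python
REPLACEMENT = "\uFFFD"

def trail_count(lead: int) -> int:
    """Trail bytes implied by a lead byte under the structural view
    of UTF-8 (leads 0xC0-0xF7), or -1 for a byte that is no lead at all."""
    if 0xC0 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF7:
        return 3
    return -1  # stray trail byte (0x80-0xBF) or 0xF8-0xFF

def decode_structural(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                          # ASCII is always valid
            out.append(chr(b))
            i += 1
            continue
        k = trail_count(b)
        if k < 0:                             # not a lead byte: one U+FFFD
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume up to k trail bytes (0x80-0xBF) following the lead.
        j = i + 1
        while j < n and j <= i + k and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j - i - 1 < k:
            # Structurally broken: one U+FFFD for the maximal subpart.
            out.append(REPLACEMENT)
            i = j
            continue
        # Structurally complete: assemble the scalar value.
        cp = b & ((1 << (6 - k)) - 1)
        for t in data[i + 1:j]:
            cp = (cp << 6) | (t & 0x3F)
        # Semantically illegal (overlong, surrogate, out of range)?
        overlong = cp < (0x80, 0x800, 0x10000)[k - 1]
        if overlong or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            out.append(REPLACEMENT)          # one U+FFFD for the whole thing
        else:
            out.append(chr(cp))
        i = j
    return "".join(out)
```

On this sketch, six stray 0x80 bytes give six U+FFFDs, two bare lead bytes give two, but an overlong pair such as 0xC0 0x80 gives exactly one.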
> I can see how different implementations might be able to come up with "rules" that would help them navigate (or clean up) those minefields, however it is not at all clear to me that there is a "best practice" for those situations.
I’m not sure the whole “best practice” thing has been a lot of help here. Perhaps we should change it to say “Suggested Handling”, to make quite clear that filing a bug report against code that chooses some other option is not necessary?
> There also appears to be a special weight given to non-minimally-encoded sequences.
I don’t think that’s true, *although* it *is* true that UTF-8 decoders historically tended to allow such things, so one might assume that some software out there is generating them for whatever reason.
There are also *deliberate* violations of the minimal length encoding specification in some cases (for instance to allow NUL to be encoded in such a way that it won’t terminate a C-style string). Yes, you may retort, that isn’t “valid UTF-8”. Sure. It *is* useful, though, and it is *in use*. If a UTF-8 decoder encounters such a thing, it’s more meaningful for whoever sees the output to see a single U+FFFD representing the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid lead byte and then another for an “unexpected” trailing byte.
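For concreteness, the overlong NUL in question is the two-byte pair 0xC0 0x80 (the Modified UTF-8 convention used by, e.g., Java’s serialized string format). CPython’s built-in decoder follows the current best-practice recommendation, under which 0xC0 can never begin a well-formed sequence, so each byte is replaced separately; the proposal discussed here would collapse the structurally complete pair to a single U+FFFD instead:

```python
# Modified-UTF-8-style overlong NUL: 0xC0 0x80.
data = b"\xc0\x80"

# CPython follows the current recommendation: 0xC0 is never a valid lead,
# so each byte becomes its own replacement character.
decoded = data.decode("utf-8", errors="replace")
print(decoded)        # two U+FFFDs
print(len(decoded))   # 2
```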
Likewise, there are encoders that generate surrogates in UTF-8, which is, of course, illegal, but *does* happen. Again, they can provide reasonable justifications for their behaviour (typically they want the default binary sort to work the same as for UTF-16 for some reason), and again, replacing a single surrogate with U+FFFD rather than multiple U+FFFDs is more helpful to whoever/whatever ends up seeing it.
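As an illustration, a lone surrogate such as U+D800 encoded directly in UTF-8 (as CESU-8-style encoders produce) is the three-byte sequence 0xED 0xA0 0x80. Under the current recommendation 0xED 0xA0 is not a valid prefix of anything (trail bytes after 0xED must lie in 0x80–0x9F), so CPython’s decoder emits three U+FFFDs; the proposal would treat it as one structurally complete, semantically illegal sequence and emit one:

```python
# U+D800 encoded directly in UTF-8: ED A0 80.
data = b"\xed\xa0\x80"

# CPython applies the "maximal subpart" rule: 0xED alone is the longest
# valid prefix, so the three bytes become three replacement characters.
decoded = data.decode("utf-8", errors="replace")
print(decoded)        # three U+FFFDs
print(len(decoded))   # 3
```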
And, of course, there are encoders that are attempting to exploit security flaws, which will very definitely generate these kinds of things.
> It would seem to me that none of these illegal sequences should appear in practice, so we have either:
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of sequences, causing garbage (perhaps one of the above 2 coding errors).
I see no reason to suppose that the proposed behaviour would function any less well in those cases.
> Only in the first case, of a bad encoder, are the overlong sequences actually "real". And that shouldn't happen (it's a bad encoder after all).
Except some encoders *deliberately* use over-longs, and one would assume that since UTF-8 decoders historically allowed this, there will be data “in the wild” that has this form.
> The other scenarios seem just as likely, (or, IMO, much more likely) than a badly designed encoder creating overlong sequences that appear to fit the UTF-8 pattern but aren't actually UTF-8.
I’m not sure I agree that flipped bits, lost bytes and extra bytes are more likely than a “bad” encoder. Bad string manipulation is of course prevalent, though - there’s no way around that.
> The other cases are going to cause byte patterns that are less "obvious" about how they should be navigated for various applications.
This is true, *however* the new proposed behaviour is in no way inferior to the old proposed behaviour in those cases - it’s just different.