Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Tue May 16 03:45:48 CDT 2017
> On 16 May 2017, at 09:18, David Starner <prosfilaes at gmail.com> wrote:
> On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <alastair at alastairs-place.net> wrote:
>> If you’re about to mutter something about security, consider this: security code *should* refuse to compare strings that contain U+FFFD (or at least should never treat them as equal, even to themselves), because it has no way to know what that code point represents.
> Which causes various other security problems; if an object (file, database element, etc.) gets a name with a FFFD in it, it becomes impossible to reference. That an IEEE 754 float may not equal itself is a perpetual source of confusion for programmers.
That’s true anyway; imagine the database holds raw bytes, that just happen to decode to U+FFFD. There might seem to be *two* names that both contain U+FFFD in the same place. How do you distinguish between them?
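A minimal sketch of that collision, assuming Python 3 and two hypothetical stored names (the byte values are made up for illustration):

```python
# Two different raw byte strings, neither of which is valid UTF-8.
name_a = b"caf\xe9"   # Latin-1 bytes for "café"
name_b = b"caf\xff"   # a different, equally ill-formed byte string

# Decoding with replacement maps each bad byte to U+FFFD.
decoded_a = name_a.decode("utf-8", errors="replace")
decoded_b = name_b.decode("utf-8", errors="replace")

print(name_a == name_b)        # False: the stored bytes differ
print(decoded_a == decoded_b)  # True: both display as "caf" + U+FFFD
```

Once decoded, nothing in the strings records which original bytes each U+FFFD stood for.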
Clearly if you are holding Unicode code points that you know are validly encoded somehow, you may want to be able to match U+FFFDs, but that’s a special case where you have extra knowledge.
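One way to express both cases is a comparison that treats U+FFFD like a NaN unless the caller asserts extra knowledge. This is only a sketch; `names_match` and its `trusted` flag are hypothetical names, not an existing API:

```python
REPLACEMENT = "\ufffd"

def names_match(a: str, b: str, *, trusted: bool = False) -> bool:
    """Compare two decoded names.

    Unless the caller states that both strings are known to be validly
    encoded (trusted=True), any U+FFFD makes the comparison fail, even
    when the two strings are identical.
    """
    if not trusted and (REPLACEMENT in a or REPLACEMENT in b):
        return False
    return a == b

print(names_match("caf\ufffd", "caf\ufffd"))                # False
print(names_match("caf\ufffd", "caf\ufffd", trusted=True))  # True
print(names_match("cafe", "cafe"))                          # True
```

As with NaN, the untrusted mode means such a name never matches itself, which is exactly the trade-off under discussion.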
> In this case, it's pretty clear, but I don't see it as a general rule. Any rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or mojibake or random binary data.
I don’t see a problem; the point is that where a structurally valid UTF-8 encoding has been used, albeit in an invalid manner (e.g. encoding a number that is not a valid code point, or encoding a valid code point as an over-long sequence), a single U+FFFD is appropriate. That seems a perfectly sensible rule to adopt.
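To make the distinction concrete, here is a rough Python sketch that separates "structurally valid" from "well-formed". The helper names are hypothetical, and "structurally valid" is taken to mean only that the lead byte's bit pattern implies the right length and every following byte is in 80..BF:

```python
def expected_length(lead: int) -> int:
    """Length implied by the lead byte's bit pattern, or 0 if not a lead byte."""
    if lead < 0x80:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    return 0  # 80..BF (a trail byte) or F8..FF

def is_structurally_valid(seq: bytes) -> bool:
    """True if seq is one lead byte plus the right number of 80..BF bytes."""
    if not seq:
        return False
    return (expected_length(seq[0]) == len(seq)
            and all(b & 0xC0 == 0x80 for b in seq[1:]))

def is_well_formed(seq: bytes) -> bool:
    """True if Python's strict decoder accepts seq as exactly one code point."""
    try:
        return len(seq.decode("utf-8")) == 1
    except UnicodeDecodeError:
        return False

for seq in (b"\xc3\x89",          # U+00C9, fine
            b"\xc0\x80",          # over-long encoding of U+0000
            b"\xed\xa0\x80",      # surrogate D800
            b"\xf4\x90\x80\x80",  # 110000, beyond the last code point
            b"\xe0\xe0\xe0"):     # not even structurally valid
    print(seq.hex(" "), is_structurally_valid(seq), is_well_formed(seq))
```

The middle three cases are the ones the rule above is about: the structure is right but the content is not, so each would become a single U+FFFD.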
The proposal actually does cover things that aren’t structurally valid. Your e0 e0 e0 example, it suggests, should be a single U+FFFD because the initial e0 denotes a three-byte sequence, and your 80 80 80 example, it proposes, should constitute three illegal subsequences (again, both reasonable). However, I’m not entirely certain about things like
e0 e0 c3 89
which the proposal would appear to decode as
U+FFFD U+FFFD U+FFFD U+FFFD (3)
instead of a perhaps more reasonable
U+FFFD U+FFFD U+00C9 (4)
(the key part is the “without ever restricting trail bytes to less than 80..BF”)
and if Markus or others could explain why they chose (3) over (4) I’d be quite interested to hear the explanation.
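For reference, here is a rough Python sketch of one decoder that yields (4) for this input. It is an assumption about what a "structural" rule could look like, not the proposal's algorithm: a lead byte consumes only the bytes in 80..BF that actually follow it (up to the count it implies), and that group becomes either one code point or one U+FFFD. The name `decode_structural` is hypothetical.

```python
FFFD = "\ufffd"

def decode_structural(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        if lead & 0xE0 == 0xC0:              # 110xxxxx: one trail byte
            need, cp, minimum = 1, lead & 0x1F, 0x80
        elif lead & 0xF0 == 0xE0:            # 1110xxxx: two trail bytes
            need, cp, minimum = 2, lead & 0x0F, 0x800
        elif lead & 0xF8 == 0xF0:            # 11110xxx: three trail bytes
            need, cp, minimum = 3, lead & 0x07, 0x10000
        else:                                # stray trail byte or F8..FF
            out.append(FFFD)
            i += 1
            continue
        # Collect at most `need` bytes in 80..BF after the lead byte.
        j = i + 1
        while j < len(data) and j - i <= need and data[j] & 0xC0 == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = (j - i == need + 1)
        valid = (complete and minimum <= cp <= 0x10FFFF
                 and not 0xD800 <= cp <= 0xDFFF)
        out.append(chr(cp) if valid else FFFD)  # one U+FFFD per group
        i = j
    return "".join(out)

print(ascii(decode_structural(b"\xe0\xe0\xc3\x89")))  # '\ufffd\ufffd\xc9'
print(ascii(decode_structural(b"\xc0\x80")))          # '\ufffd'
print(ascii(decode_structural(b"\x80\x80\x80")))      # '\ufffd\ufffd\ufffd'
```

For the e0 e0 c3 89 input specifically, CPython's built-in `bytes.decode("utf-8", errors="replace")` also produces U+FFFD U+FFFD U+00C9, although it differs from this sketch on over-long sequences such as c0 80.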