Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
David Starner via Unicode
unicode at unicode.org
Tue May 16 03:18:41 CDT 2017
On Tue, May 16, 2017 at 12:42 AM Alastair Houghton <
alastair at alastairs-place.net> wrote:
> If you’re about to mutter something about security, consider this:
> security code *should* refuse to compare strings that contain U+FFFD (or at
> least should never treat them as equal, even to themselves), because it has
> no way to know what that code point represents.
Which causes various other security problems; if an object (file, database
element, etc.) gets a name with a U+FFFD in it, it becomes impossible to
reference. That an IEEE 754 NaN does not equal itself is already a perpetual
source of confusion for programmers.
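To make the objection concrete, here is a minimal sketch of the "never equal, even to themselves" policy. StrictName and find are hypothetical names invented for this illustration, not anything from a real library, and the lookup is a plain linear search that relies only on ==.

```python
REPLACEMENT = "\ufffd"


class StrictName:
    """Hypothetical name type: a name containing U+FFFD equals nothing, itself included."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, StrictName):
            return NotImplemented
        if REPLACEMENT in self.text or REPLACEMENT in other.text:
            return False  # the "never equal" policy from the quoted suggestion
        return self.text == other.text


def find(names, wanted):
    """Return the index of the first stored name equal to `wanted`, or None."""
    for index, name in enumerate(names):
        if name == wanted:
            return index
    return None


stored = [StrictName("report.txt"), StrictName("caf\ufffd.txt")]
nan = float("nan")

print(find(stored, StrictName("report.txt")))  # the ordinary name is found
print(find(stored, stored[1]))                 # the damaged name is not found, even via itself
print(nan == nan)                              # an IEEE 754 NaN behaves the same way
```

Once the second entry is stored, nothing that compares by equality can reach it again, which is the same trap NaN sets for floating-point code.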
> Would you advocate replacing
> e0 80 80
> with
> U+FFFD U+FFFD U+FFFD (1)
> rather than
> U+FFFD (2)
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t want to decode it as a NUL (that was the source of
> previous security bugs, as I recall), I also don’t see the logic in
> insisting that it must be decoded to *three* code points when it clearly
> only represented one in the input.
In this case, it's pretty clear, but I don't see it as a general rule. Any
rule has to handle e0 e0 e0 or 80 80 80 or any variety of charset or
mojibake or random binary data. 88 A0 8B D4 is UTF-16 Chinese, but I'm not
going to insist that it get replaced with U+FFFD U+FFFD because it's clear
(to me) it was meant as two characters.
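For anyone who wants to see what an existing decoder does with these byte sequences, here is a small sketch using Python's built-in UTF-8 codec with errors="replace". It shows one implementation's behaviour only, not what the standard requires; count_replacements is a helper name made up for this example, and reading the last sample as UTF-16 assumes big-endian byte order.

```python
def count_replacements(data: bytes) -> int:
    """Decode as UTF-8, replacing ill-formed input, and count the U+FFFD produced."""
    return data.decode("utf-8", errors="replace").count("\ufffd")


# The byte sequences mentioned in this thread, written as hex.
samples = ["e08080", "e0e0e0", "808080", "88a08bd4"]

for hex_text in samples:
    raw = bytes.fromhex(hex_text)
    print(hex_text, "->", count_replacements(raw), "replacement character(s)")

# The last sample read as big-endian UTF-16 instead of UTF-8.
as_utf16 = bytes.fromhex("88a08bd4").decode("utf-16-be")
print(len(as_utf16), "code points when read as UTF-16BE")
```

On CPython 3 I would expect counts of 3, 3, 3 and 4, since that decoder replaces each ill-formed byte here separately, while the UTF-16BE reading gives two code points. The decoder has no way of knowing which of those readings the sender had in mind.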