Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Thu May 18 02:54:11 CDT 2017
On 18 May 2017, at 01:04, Philippe Verdy via Unicode <unicode at unicode.org> wrote:
> I find it intriguing that the update intends to enforce the decoding of the **shortest** sequences, but now wants to treat **maximal sequences** as a single unit with arbitrary length. UTF-8 was designed to work only with some state machines that would NEVER need to parse more than 4 bytes.
This won’t change. You still don’t need to parse more than four bytes. In fact, you don’t need to do *anything*, even if your implementation doesn’t match the proposal, because *it’s only a recommendation*. But if you did choose to do something, you *still* don’t need to scan arbitrary numbers of bytes.
> For me, as soon as the first byte encountered is invalid, the current sequence should be stopped there and treated as an error (replaced by U+FFFD if replacement is enabled, instead of returning an error or throwing an exception),
This is still essentially true under the proposal. The only difference is that, instead of being a clever dick and taking account of the valid *code point* ranges while doing this (in order to ban certain trailing bytes given the values of their predecessors), you allow any trailing byte, and only worry about whether the complete sequence represents a valid code point or is over-long once you’ve finished reading it. You never need to read more than four bytes under the new proposal, because the lead byte tells you how many to expect, and you’d still stop and instantly replace with U+FFFD if you see a byte outside the 0x80-0xbf range, even if you hadn’t yet scanned the number of bytes the lead byte says to expect.
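To make the forward case concrete, here is a minimal Python sketch of a decoder along those lines. It is an illustration under stated assumptions, not text from the proposal: the names `decode_lenient`, `expected_length` and `MIN_VALUE` are invented for this example, and it assumes lead bytes are classified purely by bit pattern (0xC0-0xDF expect two bytes, 0xE0-0xEF three, 0xF0-0xF7 four).

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs a sequence of each length;
# anything below this is over-long. (Hypothetical helper table.)
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray trailing byte (0x80-0xBF) or 0xF8-0xFF


def decode_lenient(data):
    """Decode bytes, emitting one U+FFFD per structurally delimited bad sequence."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        length = expected_length(lead)
        if length == 0:
            out.append(REPLACEMENT)
            i += 1
            continue
        if length == 1:
            out.append(chr(lead))
            i += 1
            continue
        # Keep only the payload bits of the lead byte.
        cp = lead & (0xFF >> (length + 1))
        j = i + 1
        # Accept ANY trailing byte in 0x80-0xBF; stop at the first byte
        # outside that range. Never looks past i + length, so at most 4 bytes.
        while j < i + length and j < n and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        complete = j == i + length
        # Range checks happen only once the whole sequence has been read.
        valid = (
            complete
            and MIN_VALUE[length] <= cp <= 0x10FFFF
            and not 0xD800 <= cp <= 0xDFFF
        )
        out.append(chr(cp) if valid else REPLACEMENT)
        i = j
    return "".join(out)
```

In this sketch an over-long form such as `b"\xc0\x80"` or an encoded surrogate such as `b"\xed\xa0\x80"` comes out as a single U+FFFD, while a truncated sequence like `b"\xe2\x82A"` yields one U+FFFD followed by `A`.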
This also *does not* change the view of the underlying UTF-8 string based on iteration direction; you would still generate the exact same sequence of code points in both directions.
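For the backward direction, a rough Python sketch of how reverse iteration could work under the same rules follows. Again this is only an illustration: `decode_backward`, `expected_length` and `MIN_VALUE` are names invented here, lead bytes are assumed to be classified purely by bit pattern, and the idea is simply to step back over at most three trailing bytes to find a lead byte, then check whether that lead byte's expected length actually reaches the byte you started from.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that legitimately needs each sequence length.
MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(b):
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if b < 0x80:
        return 1
    if 0xC0 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF7:
        return 4
    return 0


def decode_backward(data):
    """Decode from the end of data; returns the characters in reverse order."""
    out = []
    end = len(data)
    while end > 0:
        # Step back over at most three trailing bytes looking for a lead byte.
        start = end - 1
        while start > 0 and end - start < 4 and 0x80 <= data[start] <= 0xBF:
            start -= 1
        length = expected_length(data[start])
        have = end - start
        if length == 0 or length < have:
            # No usable lead byte in reach, or its sequence ends before the
            # last byte: the last byte is a stray, replaced on its own.
            out.append(REPLACEMENT)
            end -= 1
            continue
        if length > have:
            # Truncated sequence: one replacement for the whole fragment.
            out.append(REPLACEMENT)
        elif length == 1:
            out.append(chr(data[start]))
        else:
            cp = data[start] & (0xFF >> (length + 1))
            for b in data[start + 1:end]:
                cp = (cp << 6) | (b & 0x3F)
            valid = (
                MIN_VALUE[length] <= cp <= 0x10FFFF
                and not 0xD800 <= cp <= 0xDFFF
            )
            out.append(chr(cp) if valid else REPLACEMENT)
        end = start
    return "".join(out)
```

Reversing the result of `decode_backward` should give the same code points a forward pass with the same rules produces; for instance `b"\xe2\x82A"` comes back as `A` then U+FFFD, i.e. U+FFFD then `A` in forward order.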