Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Thu May 18 02:55:49 CDT 2017
On 18 May 2017, at 06:01, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> On Thu, 18 May 2017 02:04:55 +0200
> Philippe Verdy via Unicode <unicode at unicode.org> wrote:
>> I find it intriguing that the update intends to enforce the decoding
>> of the **shortest** sequences, but now wants to treat **maximal
>> sequences** as a single unit with arbitrary length. UTF-8 was
>> designed to work only with some state machines that would NEVER need
>> to parse more than 4 bytes.
> If you look at the sample code in
> http://www.unicode.org/versions/Unicode2.0.0/appA.pdf, you'll see that
> it's working with 6-byte sequences. It's the Unicode, as opposed to
> ISO 10646, version that has always been restricted to 4 bytes.
There are good reasons for restricting it to four-byte sequences, mind; doing so increases the number of invalid code units (lead bytes 0xF5 and above are never valid), which makes it easier to distinguish UTF-8 from non-UTF-8 data. I don’t think anyone is proposing allowing 5-byte or 6-byte sequences.
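
To illustrate the detection point, here is a rough Python sketch (mine, not taken from the proposal or the standard). It relies on Python's built-in "utf-8" codec, which follows the four-byte restriction, to list the byte values that can never occur in well-formed UTF-8. The helper names bytes_never_used_by_utf8 and is_well_formed_utf8 are made up for this example. As far as I can tell, the older six-byte scheme would leave only 0xFE and 0xFF unused (plus 0xC0 and 0xC1 if overlong forms are rejected), so the four-byte limit gives a detector noticeably more to work with.

```python
def bytes_never_used_by_utf8():
    """Return the byte values that appear in no well-formed UTF-8 sequence."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded in UTF-8
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


def is_well_formed_utf8(data: bytes) -> bool:
    """Crude detector: does the whole buffer decode as strict UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print([hex(b) for b in bytes_never_used_by_utf8()])
    print(is_well_formed_utf8("café".encode("utf-8")))    # UTF-8 input
    print(is_well_formed_utf8("café".encode("latin-1")))  # Latin-1 input
    print(is_well_formed_utf8(b"\xf8\x88\x80\x80\x80"))   # old-style 5-byte form
```

Any buffer containing one of the never-used bytes can be rejected immediately, without tracking decoder state.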