Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Henri Sivonen via Unicode
unicode at unicode.org
Mon May 15 13:33:18 CDT 2017
On Mon, May 15, 2017 at 6:37 PM, Alastair Houghton
<alastair at alastairs-place.net> wrote:
> On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>> In reference to the proposal:
>> I think Unicode should not adopt the proposed change.
> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.
The currently specced behavior makes perfect sense when you layer error
emission on top of a fail-fast UTF-8 validation state machine.
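To make this concrete, here is a minimal sketch (in Python; the function
name and structure are mine, not from any shipping decoder) of a
fail-fast decoder that checks each byte against the well-formed ranges
of Table 3-7 of the Unicode Standard. On the first byte that cannot
continue a valid sequence it emits one U+FFFD for the maximal subpart
consumed so far and resumes at the offending byte, so the overlong
sequence <E0 80 80> naturally yields three replacement characters
rather than one:

```python
REPLACEMENT = '\uFFFD'

def decode_utf8_with_replacement(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD per maximal subpart (fail fast)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII fast path
            out.append(chr(b)); i += 1; continue
        # Lead byte: number of continuation bytes needed, plus the valid
        # range for the *first* continuation byte (Table 3-7). The
        # restricted first-continuation ranges are what reject overlong
        # forms, surrogates, and code points above U+10FFFF structurally.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlongs
        elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlongs
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes > U+10FFFF
        else:                              # C0, C1, F5..FF, stray 80..BF
            out.append(REPLACEMENT); i += 1; continue
        cp = b & (0x3F >> need)            # payload bits of the lead byte
        j, ok = i + 1, True
        for k in range(need):
            lo_k, hi_k = (lo, hi) if k == 0 else (0x80, 0xBF)
            if j >= n or not (lo_k <= data[j] <= hi_k):
                ok = False
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if ok:
            out.append(chr(cp))
        else:
            # Fail fast: one U+FFFD for the maximal subpart i..j-1,
            # then restart the state machine at the first bad byte.
            out.append(REPLACEMENT)
        i = j
    return ''.join(out)
```

With this structure, <E0 80 80> decodes as three U+FFFDs (E0 is rejected
as soon as 80 falls outside A0..BF, and each stray 80 is its own error),
whereas an ICU-style decoder under the proposal would emit one; the
per-maximal-subpart count falls out of the state machine for free.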
>> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
>> representative of implementation concerns of implementations that use
>> UTF-8 as their in-memory Unicode representation.
>> [...] ICU, etc.) that are stuck with UTF-16 as their in-memory
>> representation, which makes concerns of such implementations very
>> relevant, I think the Unicode Consortium should acknowledge that
>> UTF-16 was, in retrospect, a mistake [...]
> You may think that. There are those of us who do not.
My point is:
The proposal seems to arise from the "UTF-16 as the in-memory
representation" mindset. While I don't expect that case to go away, I
think the Unicode Consortium should recognize the serious technical
merit of the "UTF-8 as the in-memory representation" case, and
proposals like this should weigh the impact on both cases equally, even
though the "UTF-8 as the in-memory representation" case at present
appears to be the minority case.
That is, I think it's wrong to view things only or even primarily
through the lens of the "UTF-16 as the in-memory representation" case
that ICU represents.
hsivonen at hsivonen.fi