Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Mon May 15 10:37:13 CDT 2017
On 15 May 2017, at 11:21, Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> In reference to:
> I think Unicode should not adopt the proposed change.
Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting multiple errors there makes no sense.
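To make the single-error reading concrete, here is a rough sketch (hypothetical code, not ICU's actual implementation; `decode_single_error` is a made-up name) of a decoder that consumes a maximal sequence -- a lead byte plus the continuation bytes that follow it -- and emits a single U+FFFD when that whole sequence is ill-formed, over-long cases included:

```python
REPLACEMENT = "\uFFFD"

def decode_single_error(data: bytes) -> str:
    """Sketch: one U+FFFD per ill-formed sequence (over-longs included).
    Illustrative only; leans on Python's strict decoder for validation."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                          # ASCII
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xF7:               # plausible lead byte, incl. over-long leads
            n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            j = i + 1
            while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
                j += 1                        # gather continuation bytes only
            seq = data[i:j]
            try:
                out.append(seq.decode("utf-8"))   # strict: rejects over-longs too
            except UnicodeDecodeError:
                out.append(REPLACEMENT)           # ONE error for the whole sequence
            i = j
        else:                                 # stray continuation byte, F8..FF, etc.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)
```

Under this policy the over-long `C0 80` or `E0 80 80` each come out as a single U+FFFD, whereas a decoder following the per-byte/maximal-subpart recommendation (CPython's built-in `errors="replace"`, for instance) emits one U+FFFD per byte for those inputs.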
> ICU uses UTF-16 as its in-memory Unicode representation, so ICU isn't
> representative of implementation concerns of implementations that use
> UTF-8 as their in-memory Unicode representation.
> ICU, etc.) that are stuck with UTF-16 as their in-memory
> representation, which makes concerns of such implementation very
> relevant, I think the Unicode Consortium should acknowledge that
> UTF-16 was, in retrospect, a mistake
You may think that. There are those of us who do not. The fact is that UTF-16 makes sense as a default encoding in many cases. Yes, UTF-8 is more efficient for primarily ASCII text, but that advantage disappears for many other scripts, and handling surrogates (which is what proponents of UTF-8 or UCS-4 usually focus on) is no more complicated than handling combining characters, which you have to do anyway.
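For what it's worth, the surrogate handling being argued about amounts to a couple of lines of arithmetic (this is just the standard UTF-16 pair formula from the Unicode standard, written out as a sketch):

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a code point.
    Standard formula: U = 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00)."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

E.g. the pair D83D DE00 yields U+1F600. Spotting the pair is a two-value range check, which is not obviously harder than deciding where a combining-character cluster ends.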
> Therefore, despite UTF-16 being widely used as an in-memory
> representation of Unicode and in no way going away, I think the
> Unicode Consortium should be *very* sympathetic to technical
> considerations for implementations that use UTF-8 as the in-memory
> representation of Unicode.
I don’t think the Unicode Consortium should be unsympathetic to people who use UTF-8 internally, for sure, but I don’t see what that has to do with either the original proposal or with your criticism of UTF-16.
> If the proposed
> change was adopted, while Draconian decoders (that fail upon first
> error) could retain their current state machine, implementations that
> emit U+FFFD for errors and continue would have to add more state
> machine states (i.e. more complexity) to consolidate more input bytes
> into a single U+FFFD even after a valid sequence is obviously
> impossible.
“Impossible”? Why? You just need to add some error states (or *an* error state and a counter); it isn’t exactly difficult, and I’m sure ICU isn’t the only library that already did just that *because it’s clearly the right thing to do*.
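A rough sketch of the "error state plus counter" idea (hypothetical code, deliberately simplified: it does not validate the lead-byte-specific continuation ranges, so it would not catch over-longs with E0/F0 leads by itself, and always-invalid bytes like C0/C1 each draw their own replacement):

```python
REPLACEMENT = "\uFFFD"

class StreamDecoder:
    """Sketch of a byte-at-a-time UTF-8 decoder that keeps a counter of
    expected continuation bytes and emits one U+FFFD per truncated or
    interrupted sequence. Illustrative only."""
    def __init__(self):
        self.pending = 0      # continuation bytes still expected
        self.acc = 0          # accumulated code-point bits
        self.out = []

    def feed(self, b: int):
        if self.pending:
            if 0x80 <= b <= 0xBF:             # expected continuation byte
                self.acc = (self.acc << 6) | (b & 0x3F)
                self.pending -= 1
                if self.pending == 0:
                    self.out.append(chr(self.acc))
            else:
                # sequence cut short: ONE replacement, then resync on b
                self.pending = 0
                self.out.append(REPLACEMENT)
                self.feed(b)
        elif b < 0x80:                        # ASCII
            self.out.append(chr(b))
        elif 0xC2 <= b <= 0xDF:               # 2-byte lead
            self.pending, self.acc = 1, b & 0x1F
        elif 0xE0 <= b <= 0xEF:               # 3-byte lead
            self.pending, self.acc = 2, b & 0x0F
        elif 0xF0 <= b <= 0xF4:               # 4-byte lead
            self.pending, self.acc = 3, b & 0x07
        else:                                 # C0/C1, F5..FF, stray continuation
            self.out.append(REPLACEMENT)
```

Feeding `E0 80 41` produces a single U+FFFD followed by "A": the counter carries the state, and no extra pass over the input is needed.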