Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Richard Wordingham via Unicode
unicode at unicode.org
Wed May 31 12:11:13 CDT 2017
On Wed, 31 May 2017 15:12:12 +0300
Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> The write-up mentions
> https://bugs.chromium.org/p/chromium/issues/detail?id=662822#c13 . I'd
> like to draw everyone's attention to that bug, which is real-world
> evidence of a bug arising from two UTF-8 decoders within one product
> handling UTF-8 errors differently.
> Does it matter if a proposal/appeal is submitted as a non-member
> implementor person, as an individual person member or as a liaison
> member? http://www.unicode.org/consortium/liaison-members.html lists
> "the Mozilla Project" as a liaison member, but Mozilla-side
> conventions make submitting proposals like this "as Mozilla"
> problematic (we tend to avoid "as Mozilla" statements on technical
> standardization fora except when the W3C Process forces us to make
> them as part of charter or Proposed Recommendation review).
There may well be an advantage to being able to answer any questions on
the proposal at the meeting, especially if it isn't read until the
> > The modified text is a set of guidelines, not requirements. So no
> > conformance clause is being changed.
> I'm aware of this.
> > If people really believed that the guidelines in that section
> > should have been conformance clauses, they should have proposed
> > that at some point.
> It seems to me that this thread does not support the conclusion that
> the Unicode Standard's expression of preference for the number of
> REPLACEMENT CHARACTERs should be made into a conformance requirement
> in the Unicode Standard. This thread could be taken to support a
> conclusion that the Unicode Standard should not express any preference
> beyond "at least one and at most as many as there were bytes".
> On Tue, May 23, 2017 at 12:17 PM, Alastair Houghton via Unicode
> <unicode at unicode.org> wrote:
> > In any case, Henri is complaining that it’s too difficult to
> > implement; it isn’t. You need two extra states, both of which are
> > trivial.
> I am not claiming it's too difficult to implement. I think it
> inappropriate to ask implementations, even from-scratch ones, to take
> on added complexity in error handling on mere aesthetic grounds. Also,
> I think it's inappropriate to induce implementations already written
> according to the previous guidance to change (and risk bugs) or to
> make the developers who followed the previous guidance with precision
> be the ones who need to explain why they aren't following the new
How straightforward is the FSM for back-stepping?
> On Fri, May 26, 2017 at 6:41 PM, Markus Scherer via Unicode
> <unicode at unicode.org> wrote:
> > The UTF-8 conversion code that I wrote for ICU, and apparently the
> > code that various other people have written, collects sequences
> > starting from lead bytes, according to the original spec, and at
> > the end looks at whether the assembled code point is too low for
> > the lead byte, or is a surrogate, or is above 10FFFF. Stopping at a
> > non-trail byte is quite natural, and reading the PRI text
> > accordingly is quite natural too.
> I don't doubt that other people have written code with the same
> concept as ICU, but as far as non-shortest form handling goes in the
> implementations I tested (see URL at the start of this email) ICU is
> the lone outlier.
You should have researched implementations as they were in 2007.

My own code uses the same concept as Markus's ICU code: convert, then
check that the resulting value is legal for the length. As a check,
remember that for n > 1, n bytes could represent 2**(5n + 1) values if
overlongs were permitted.
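
To make the "convert, then check" idea concrete, here is a minimal
sketch in Python. It is not ICU's code or mine; the names decode_one,
payload_bits and MIN_FOR_LENGTH are made up for illustration, and it
assumes lead bytes C0..F7 are accepted structurally and judged only
once the value has been assembled.

```python
# Smallest scalar value that legitimately needs an n-byte sequence.
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000}


def payload_bits(n):
    """Payload bits in an n-byte sequence (n > 1): (7 - n) in the lead
    byte plus 6 per trail byte, which simplifies to 5n + 1."""
    return (7 - n) + 6 * (n - 1)


def decode_one(data, i):
    """Decode one unit starting at data[i].

    Returns (code_point, bytes_consumed); code_point is None when the
    unit is ill-formed and should become a single U+FFFD.
    """
    b = data[i]
    if b < 0x80:
        return b, 1
    if 0xC0 <= b <= 0xDF:
        n = 2
    elif 0xE0 <= b <= 0xEF:
        n = 3
    elif 0xF0 <= b <= 0xF7:
        n = 4
    else:
        return None, 1  # stray trail byte, or a byte that is never a lead

    cp = b & (0x7F >> n)  # keep only the payload bits of the lead byte
    used = 1
    # Collect trail bytes, stopping at the first non-trail byte.
    while used < n and i + used < len(data) and (data[i + used] & 0xC0) == 0x80:
        cp = (cp << 6) | (data[i + used] & 0x3F)
        used += 1
    if used < n:
        return None, used  # truncated sequence

    # Only now check that the value is legal for the length.
    if cp < MIN_FOR_LENGTH[n] or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        return None, used
    return cp, used
```

With this sketch, decode_one(b"\xc0\x80", 0) consumes both bytes and
reports one ill-formed unit, because the assembled value 0 is too low
for a 2-byte sequence.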
> > Aside from UTF-8 history, there is a reason for preferring a more
> > "structural" definition for UTF-8 over one purely along valid
> > sequences. This applies to code that *works* on UTF-8 strings
> > rather than just converting them. For UTF-8 *processing* you need
> > to be able to iterate both forward and backward, and sometimes you
> > need not collect code points while skipping over n units in either
> > direction -- but your iteration needs to be consistent in all
> > cases. This is easier to implement (especially in fast, short,
> > inline code) if you have to look only at how many trail bytes
> > follow a lead byte, without having to look whether the first trail
> > byte is in a certain range for some specific lead bytes.
> But the matter at hand is decoding potentially-invalid UTF-8 input
> into a valid in-memory Unicode representation, so later processing is
> somewhat a red herring as being out of scope for this step.
No. Both lossily converting a UTF-8-like string as a stream of bytes to
scalar values and moving back and forth through the string 'character'
by 'character' imply an ability to count the number of 'characters' in
the string. The bug you mentioned arose from two different ways of
counting the string length in 'characters'. Having two different
'character' counts for the same string is inviting trouble.
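
As a rough illustration of how two counts can diverge, the sketch
below compares a lead-byte-driven count with the length CPython 3's
built-in decoder produces under errors="replace". The function names
are hypothetical, and the structural rule shown (a lead byte plus up
to n - 1 trail bytes is one 'character') is my reading of the approach
Markus describes, not a copy of any product's code.

```python
def structural_length(data):
    """Count 'characters' by looking only at lead bytes and how many
    trail bytes follow; no range checks on the first trail byte."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0xC0:
            n = 1  # ASCII, or a stray trail byte counted on its own
        elif b < 0xE0:
            n = 2
        elif b < 0xF0:
            n = 3
        elif b < 0xF8:
            n = 4
        else:
            n = 1  # F8..FF are never lead bytes
        i += 1
        # Skip trail bytes only; stop early at a non-trail byte.
        while n > 1 and i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
            n -= 1
        count += 1
    return count


def builtin_length(data):
    """Length after lossy decoding with Python's own UTF-8 decoder."""
    return len(data.decode("utf-8", errors="replace"))


sample = b"A\xe0\x80\x80B"  # non-shortest form between two ASCII letters
print(structural_length(sample), builtin_length(sample))
```

For well-formed input the two counts agree; for the sample above they
do not, and that is the kind of disagreement that invites trouble when
both live in one product.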