Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Richard Wordingham via Unicode
unicode at unicode.org
Thu Jun 1 14:16:52 CDT 2017
On Thu, 1 Jun 2017 12:32:08 +0300
Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> On Wed, May 31, 2017 at 8:11 PM, Richard Wordingham via Unicode
> <unicode at unicode.org> wrote:
> > On Wed, 31 May 2017 15:12:12 +0300
> > Henri Sivonen via Unicode <unicode at unicode.org> wrote:
> >> I am not claiming it's too difficult to implement. I think it
> >> inappropriate to ask implementations, even from-scratch ones, to
> >> take on added complexity in error handling on mere aesthetic
> >> grounds. Also, I think it's inappropriate to induce
> >> implementations already written according to the previous guidance
> >> to change (and risk bugs) or to make the developers who followed
> >> the previous guidance with precision be the ones who need to
> >> explain why they aren't following the new guidance.
> > How straightforward is the FSM for back-stepping?
> This seems beside the point, since the new guidance wasn't advertised
> as improving backward stepping compared to the old guidance.
> (On the first look, I don't see the new guidance improving back
> stepping. In fact, if the UTC meant to adopt ICU's behavior for
> obsolete five and six-byte bit patterns, AFAICT, backstepping with the
> ICU behavior requires examining more bytes backward than the old
> guidance required.)
The greater simplicity comes from the alternative behaviour being
more 'natural'. It's a little difficult to count states without
constraints on the machines, but for forward stepping, even supporting
6-byte patterns just in case 20.1 bits eventually turn out not to be
enough, there are five intermediate states - '1 byte to go', '2
bytes to go', ... '5 bytes to go'. For backward stepping, there are
similarly five intermediate states - '1 trailing byte seen', and so on.
For the recommended handling, forward stepping has seven
intermediate states, each directly reachable from the starting state -
start byte C2..DF; start byte E0; start byte E1..EC, EE or EF; start
byte ED; start byte F0; start byte F1..F3; and start byte F4. No
further intermediate states are required.
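As a rough illustration of that forward machine, here is a minimal Python
sketch. It takes F4 as the last start byte and follows the well-formed
byte ranges of the Unicode Standard's UTF-8 table. The state labels and
the names lead_state, TRANSITIONS and decode_recommended are made up for
this example and do not come from any library. Note how 'C2..DF' doubles
as '1 trail byte to go' and 'E1..EC,EE,EF' as '2 trail bytes to go',
which is why no further states are needed.

```python
START = "start"

def lead_state(b):
    """Map a start byte to one of the seven intermediate states, else None."""
    if 0xC2 <= b <= 0xDF: return "C2..DF"
    if b == 0xE0: return "E0"
    if 0xE1 <= b <= 0xEC or b in (0xEE, 0xEF): return "E1..EC,EE,EF"
    if b == 0xED: return "ED"
    if b == 0xF0: return "F0"
    if 0xF1 <= b <= 0xF3: return "F1..F3"
    if b == 0xF4: return "F4"
    return None  # C0, C1, F5..FF and stray trail bytes never start a sequence

# state -> (lowest, highest accepted next byte, state after accepting it)
TRANSITIONS = {
    "C2..DF":       (0x80, 0xBF, START),
    "E0":           (0xA0, 0xBF, "C2..DF"),
    "E1..EC,EE,EF": (0x80, 0xBF, "C2..DF"),
    "ED":           (0x80, 0x9F, "C2..DF"),
    "F0":           (0x90, 0xBF, "E1..EC,EE,EF"),
    "F1..F3":       (0x80, 0xBF, "E1..EC,EE,EF"),
    "F4":           (0x80, 0x8F, "E1..EC,EE,EF"),
}

def decode_recommended(data):
    """Decode bytes, emitting one U+FFFD per maximal ill-formed subpart."""
    out = []
    state, cp, i = START, 0, 0
    while i < len(data):
        b = data[i]
        if state == START:
            i += 1
            if b < 0x80:
                out.append(chr(b))
                continue
            state = lead_state(b)
            if state is None:
                out.append("\ufffd")
                state = START
            else:
                # Keep only the payload bits of the start byte.
                cp = b & (0x1F if b < 0xE0 else 0x0F if b < 0xF0 else 0x07)
        else:
            lo, hi, nxt = TRANSITIONS[state]
            if lo <= b <= hi:
                cp = (cp << 6) | (b & 0x3F)
                state = nxt
                i += 1
                if state == START:
                    out.append(chr(cp))
            else:
                # The subpart ends here; b is re-examined from the start state.
                out.append("\ufffd")
                state = START
    if state != START:
        out.append("\ufffd")  # input ended inside a sequence
    return "".join(out)
```

With this sketch, the truncated sequence F0 90 80 yields a single U+FFFD,
whereas E0 80 yields two, because 80 cannot follow E0.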
For backward stepping with the recommended handling, I see a need for 8 intermediate states,
depending on how many trail bytes have been considered and whether the
last one was in the range 80..8F (precludes E0 and F0 immediately
preceding), 90..9F (precludes E0 and F4 immediately preceding) or A0..BF
(precludes ED and F4 immediately preceding). The logic feels quite
complicated. If I implement it, I'm not likely to code it up as an FSM.
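For what it's worth, a non-FSM version of the backward step might look
something like the Python sketch below. It is only a sketch under stated
assumptions: the names lead_rule and prev_boundary are invented here, and
the position i is assumed to be a boundary that forward decoding with the
recommended handling would also produce. It looks back over at most three
trail bytes, then checks whether the byte found there can start a
sequence of at least that length and whether the byte right after it
falls in the range that start byte allows.

```python
def lead_rule(b):
    """Return (trail_count, lo, hi) for b as a sequence start, else None.

    lo..hi is the range allowed for the byte immediately after b.
    """
    if b < 0x80: return (0, 0x80, 0xBF)          # ASCII: no trail bytes
    if 0xC2 <= b <= 0xDF: return (1, 0x80, 0xBF)
    if b == 0xE0: return (2, 0xA0, 0xBF)         # 80..9F precluded
    if b == 0xED: return (2, 0x80, 0x9F)         # A0..BF precluded
    if 0xE1 <= b <= 0xEF: return (2, 0x80, 0xBF)
    if b == 0xF0: return (3, 0x90, 0xBF)         # 80..8F precluded
    if 0xF1 <= b <= 0xF3: return (3, 0x80, 0xBF)
    if b == 0xF4: return (3, 0x80, 0x8F)         # 90..BF precluded
    return None                                  # trail byte, C0, C1, F5..FF

def prev_boundary(data, i):
    """Return the start of the character or maximal subpart ending at i."""
    assert 0 < i <= len(data)
    j, k = i - 1, 0
    # Walk back over at most three trail bytes.
    while j > 0 and k < 3 and 0x80 <= data[j] <= 0xBF:
        j -= 1
        k += 1
    rule = lead_rule(data[j])
    if rule is None:
        return i - 1          # no usable start byte: the last byte stands alone
    n, lo, hi = rule
    if k > n:
        return i - 1          # more trail bytes than the start byte can take
    if k >= 1 and not (lo <= data[j + 1] <= hi):
        return i - 1          # second byte is outside the start byte's range
    return j                  # complete character or truncated valid prefix
```

For example, prev_boundary(b"\xf0\x90\x80A", 3) returns 0, since F0 90 80
is one truncated but otherwise valid prefix, while
prev_boundary(b"\xed\xa0\x80", 3) returns 2, since A0 cannot follow ED.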
> > You should have researched implementations as they were in 2007.
> I don't see how the state of things in 2007 is relevant to a decision
> taken in 2017.
Because the argument is that the original decision taken in 2008 was
wrong. I have a feeling I have overlooked some of the discussion
around then, because I can't find my contribution in the archives, and I
thought I objected at the time.