Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Richard Wordingham via Unicode
unicode at unicode.org
Wed May 17 18:11:53 CDT 2017
On Wed, 17 May 2017 15:31:56 -0700
Doug Ewell via Unicode <unicode at unicode.org> wrote:
> Richard Wordingham wrote:
> > So it was still a legal way for a non-UTF-8-compliant process!
> Anything is possible if you are non-compliant. You can encode U+263A
> with 9,786 FF bytes followed by a terminating FE byte and call that
> "UTF-8," if you are willing to be non-compliant enough.
> > Note for example that a compliant implementation of full
> > upper-casing shall convert the canonically equivalent strings
> > <U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0313
> > COMBINING COMMA ABOVE> and <U+1F00 GREEK SMALL LETTER ALPHA WITH
> > PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI> to the canonically
> > inequivalent strings <U+0391 GREEK CAPITAL LETTER ALPHA, U+0399
> > GREEK CAPITAL LETTER IOTA, U+0313> and <U+1F08 GREEK CAPITAL
> > LETTER ALPHA WITH PSILI, U+0399 GREEK CAPITAL LETTER IOTA>. A
> > compliant Unicode process may not assume that this
> > is the right thing to do. (Or are some compliant Unicode processes
> > required to incorrectly believe that they are doing something they
> > mustn't do?)
> I'm afraid I don't get the analogy.
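The upper-casing example quoted above can be checked mechanically. What
follows is only a sketch, and it assumes Python 3, whose str.upper()
applies the unconditional full case mappings, and whose
unicodedata.normalize() is used with NFD to test canonical equivalence.
The helper name nfd and the variable names are mine, purely for
illustration.

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


# <U+1FB3 ALPHA WITH YPOGEGRAMMENI, U+0313 COMBINING COMMA ABOVE>
a = "\u1FB3\u0313"
# <U+1F00 ALPHA WITH PSILI, U+0345 COMBINING GREEK YPOGEGRAMMENI>
b = "\u1F00\u0345"

# The two inputs are canonically equivalent: both decompose to
# <U+03B1, U+0313, U+0345> once canonical reordering has been applied.
inputs_equivalent = nfd(a) == nfd(b)

# Full upper-casing, character by character.
upper_a = a.upper()  # <U+0391, U+0399, U+0313>
upper_b = b.upper()  # <U+1F08, U+0399>

# U+0399 is a starter (combining class 0), so U+0313 cannot be
# reordered across it; the two results therefore stay distinct under NFD.
outputs_equivalent = nfd(upper_a) == nfd(upper_b)

print(inputs_equivalent, outputs_equivalent)
```

So canonically equivalent inputs go in and canonically inequivalent
outputs come out, with every step being what the mapping requires.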
You can't build a full Unicode system out of Unicode-compliant parts.
However, having dug out Unicode Version 2 Appendix A Section 2 UTF-8
(in http://www.unicode.org/versions/Unicode2.0.0/appA.pdf), I find the
critical wording, "When converting from UTF-8 to Unicode values,
however, implementations do not need to check that the shortest
encoding is being used,...". There was no prohibition on
implementations performing the check, so whether C0 80 would be
interpreted as U+0000 or as an error was unpredictable.
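To illustrate the two behaviours that wording allowed, here is a minimal
sketch in Python. It handles only two-byte sequences, and the function
name decode_two_byte and its check_shortest flag are hypothetical names
of my own, not anything taken from the standard or from a real library.

```python
def decode_two_byte(lead, trail, check_shortest):
    """Decode one 2-byte UTF-8 sequence of the form 110xxxxx 10xxxxxx.

    With check_shortest=False this behaves like a decoder that takes
    the Unicode 2.0 wording at face value and skips the shortest-form
    check; with check_shortest=True it rejects overlong encodings.
    """
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Five payload bits from the lead byte, six from the trail byte.
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    # Anything below U+0080 fits in one byte, so a two-byte form of it
    # is not the shortest encoding.
    if check_shortest and code_point < 0x80:
        raise ValueError("non-shortest form")
    return code_point


# A decoder that omits the check turns C0 80 into U+0000 ...
lenient = decode_two_byte(0xC0, 0x80, check_shortest=False)

# ... while one that performs the check reports an error.
try:
    decode_two_byte(0xC0, 0x80, check_shortest=True)
    strict = "accepted"
except ValueError:
    strict = "rejected"

print(hex(lenient), strict)
```

Both decoders would have been acceptable under the Version 2 text, which
is exactly why the outcome for C0 80 could not be predicted.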