Re: Invalid code points

From: Doug Ewell (doug@ewellic.org)
Date: Sun May 31 2009 - 15:18:16 CDT

Next message: Hans Aberg: "Re: Invalid code points"

Previous message: Hans Aberg: "Re: Invalid code points"
In reply to: Hans Aberg: "Re: Invalid code points"
Next in thread: Hans Aberg: "Re: Invalid code points"
Reply: Hans Aberg: "Re: Invalid code points"

Hans Aberg <haberg at math dot su dot se> wrote:

>>> In particular, it would be great to know if the range U+0080, ...,
>>> U+009F is invalid.
>>
>> That bit is especially wrong. I can at least imagine why there might
>> be confusion about the noncharacters and surrogate code points, but
>> not the C1 controls.
>
> It is a bit disappointing: I was looking for a beginning (escape) byte
> sequence to tell that a string isn't UTF-8, among other valid strings.
> But perhaps it does not matter.

If you're thinking about inventing one, for your own use, then any byte
sequence that is not valid UTF-8 should do the job. One possibility
would be {0xA0}.
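
A minimal sketch of that idea in Python, assuming the marker is the single byte 0xA0. The names MARKER, is_valid_utf8, tag_non_utf8 and is_tagged are made up for illustration and are not part of any standard or library:

```python
# Hypothetical private convention: 0xA0 is a lone continuation byte, so no
# valid UTF-8 string can start with it.
MARKER = b"\xa0"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def tag_non_utf8(data: bytes) -> bytes:
    """Prefix data with MARKER when it is not valid UTF-8; else leave it alone."""
    return data if is_valid_utf8(data) else MARKER + data

def is_tagged(data: bytes) -> bool:
    """A tagged string starts with MARKER; a valid UTF-8 string never does."""
    return data.startswith(MARKER)
```

Because untagged strings are always valid UTF-8, a reader can tell the two kinds apart by looking at the first byte alone.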

Be sure you understand the difference between an invalid *byte sequence*
and an invalid *code point*. There are many invalid byte sequences in
UTF-8. As Mark pointed out, the only invalid code points are the
surrogates.
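
The distinction can be seen with Python's strict UTF-8 codec. This is only a sketch of how one implementation behaves, not a definition of the standard, and the helper names can_encode and can_decode are invented here:

```python
def can_encode(code_point: int) -> bool:
    """True if the code point can be encoded by Python's strict UTF-8 codec."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

def can_decode(data: bytes) -> bool:
    """True if the byte sequence is accepted by Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Code points: C1 controls and noncharacters encode; surrogates do not.
c1_control = chr(0x0080).encode("utf-8")      # two bytes: C2 80
noncharacter = chr(0xFFFE).encode("utf-8")    # three bytes: EF BF BE
surrogate_ok = can_encode(0xD800)             # rejected by the codec

# Byte sequences: many are invalid regardless of any code point.
lone_continuation = can_decode(b"\x80")       # continuation byte with no lead
overlong_nul = can_decode(b"\xc0\x80")        # overlong form of U+0000
encoded_surrogate = can_decode(b"\xed\xa0\x80")  # would-be U+D800
```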

The section of the Wikipedia article you cited actually contains quite a
concentration of misleading information:

"Unpaired surrogate halves may indicate an invalid UTF-16 string was
encoded, or a valid one with a CESU-8 converter."

Even in CESU-8, surrogate halves are expected to be paired
appropriately.

"U+FFFE may indicate encoding of a byte-swapped UTF-16 string as it
is a backwards BOM."

While true, this has very little to do with UTF-8. The process from
which such data was received would have to have been smart enough to
recognize UTF-16 text and convert it to UTF-8, but dumb enough to get
the UTF-16 byte order wrong in the first place.

"U+0080 through U+009F may indicate CP1252 was converted without
translating the characters to Unicode"

This has to do with the original content, not the validity of the UTF-8.
Single bytes of value 0x80 through 0x9F are simply errors. Unicode
scalar values from U+0080 through U+009F (represented in UTF-8 as {0xC2,
0x80} through {0xC2, 0x9F}) may indicate that CP1252 was converted as if
it were ISO 8859-1. In that case, the UTF-8 is perfectly valid but the
underlying data may not be correct.
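
A short Python sketch of that situation, using a made-up sample string in CP1252 curly quotes; the helper has_c1_controls is an illustrative name, not a library function:

```python
# Sample bytes assumed to be CP1252: 0x93 and 0x94 are curly double quotes.
raw = b"\x93quoted\x94"

# Wrong: treating the bytes as ISO 8859-1 maps 0x93/0x94 to C1 controls.
as_latin1 = raw.decode("latin-1")        # U+0093 ... U+0094
# Right: treating them as CP1252 yields the intended punctuation.
as_cp1252 = raw.decode("cp1252")         # U+201C ... U+201D

# Both results encode to perfectly valid UTF-8; only the content differs.
wrong_utf8 = as_latin1.encode("utf-8")   # C2 93 ... C2 94
right_utf8 = as_cp1252.encode("utf-8")   # E2 80 9C ... E2 80 9D

def has_c1_controls(text: str) -> bool:
    """Flag text containing U+0080..U+009F, a hint (not proof) of misconversion."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)
```

A check like has_c1_controls says something about the likely history of the data, not about whether the UTF-8 is well formed.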

"U+0080 through U+009F and nothing greater than U+00FF may indicate
double-converted UTF-8."

Again, this confuses validity of UTF-8 with validity of the underlying
content. In any event, incorrect conversion of CP1252 as if it were ISO
8859-1 (above) would fall into this category.
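
For illustration, here is a Python sketch of double conversion and of the quoted heuristic, assuming the intermediate step misreads UTF-8 bytes as ISO 8859-1. The function names are invented for this example:

```python
def double_convert(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as ISO 8859-1, encode to UTF-8 again."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")

def matches_heuristic(text: str) -> bool:
    """The quoted test: some U+0080..U+009F present, nothing above U+00FF."""
    has_c1 = any(0x80 <= ord(ch) <= 0x9F for ch in text)
    all_low = all(ord(ch) <= 0xFF for ch in text)
    return has_c1 and all_low

# U+201C is E2 80 9C in UTF-8; after double conversion each of those bytes
# becomes its own code point (U+00E2, U+0080, U+009C) and is encoded again.
doubled = double_convert("\u201c")

# The result is still valid UTF-8, and the heuristic fires on it.
doubled_text = doubled.decode("utf-8")
fires_on_doubled = matches_heuristic(doubled_text)

# It also fires on CP1252 misread as ISO 8859-1, which is a different fault.
fires_on_cp1252_mixup = matches_heuristic(b"\x93".decode("latin-1"))
```

Note that the heuristic inspects decoded code points, so it can only be applied to input that was valid UTF-8 in the first place.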

"U+DC80 through U+DCFF may be reserved for converting invalid byte
sequences (see above)"

This is flat wrong and bogus and ill-conceived and non-conformant, and
should never, ever be done, full stop.

--
Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages

This archive was generated by hypermail 2.1.5 : Sun May 31 2009 - 15:29:54 CDT