From: Doug Ewell (doug@ewellic.org)
Date: Sun May 31 2009 - 15:18:16 CDT
Hans Aberg <haberg at math dot su dot se> wrote:
>>> In particular, it would be great to know if the range U+0080, ...,
>>> U+009F is invalid.
>>
>> That bit is especially wrong. I can at least imagine why there might
>> be confusion about the noncharacters and surrogate code points, but
>> not the C1 controls.
>
> It is a bit disappointing: I was looking for a beginning (escape) byte
> sequence to tell that string isn't UTF-8, among other valid strings.
> But perhaps it does not matter.
If you're thinking about inventing one, for your own use, then any byte
sequence that is not valid UTF-8 should do the job. One possibility
would be {0xA0}.
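
A rough sketch of that idea in Python, in case it helps. The names MARKER, has_marker and is_valid_utf8 are made up for illustration, and using 0xA0 as a private marker is only the suggestion above, not any standard convention:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (strict decode succeeds)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Hypothetical private marker: 0xA0 is a continuation byte, so it can
# never begin a well-formed UTF-8 sequence.
MARKER = b"\xa0"

def has_marker(data: bytes) -> bool:
    """Return True if data starts with the private non-UTF-8 marker."""
    return data.startswith(MARKER)

tagged = MARKER + b"some legacy bytes"
plain = "na\u00efve".encode("utf-8")

print(has_marker(tagged), is_valid_utf8(tagged))  # True False
print(has_marker(plain), is_valid_utf8(plain))    # False True
```

Since no well-formed UTF-8 string can start with 0xA0, the marker cannot collide with real UTF-8 data. It says nothing about what the rest of the bytes are, of course.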
Be sure you understand the difference between an invalid *byte sequence*
and an invalid *code point*. There are many invalid byte sequences in
UTF-8. As Mark pointed out, the only invalid code points are the
surrogates (U+D800 through U+DFFF).
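
To make the distinction concrete, here is a small Python sketch. The helper names and the sample lists are mine, chosen for illustration, and it relies on Python's strict "utf-8" codec behaving as a conformant decoder and encoder:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def encodes_to_utf8(code_point: int) -> bool:
    """Return True if the code point can be encoded in UTF-8."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True

# Invalid *byte sequences*: there are many kinds.
ILL_FORMED = [
    b"\x80",          # lone continuation byte
    b"\xc0\xaf",      # overlong form of U+002F
    b"\xe2\x82",      # truncated three-byte sequence
    b"\xff",          # byte that never occurs in UTF-8
    b"\xed\xa0\x80",  # would-be encoding of surrogate U+D800
]

# Code points: C1 controls and noncharacters encode fine; surrogates do not.
ENCODABLE = [0x0080, 0x009F, 0xFFFE, 0x10FFFF]
SURROGATES = [0xD800, 0xDC80, 0xDFFF]

print([decodes_as_utf8(b) for b in ILL_FORMED])
print([encodes_to_utf8(cp) for cp in ENCODABLE])
print([encodes_to_utf8(cp) for cp in SURROGATES])
```

Every entry in the first list is rejected by the decoder, every code point in the second list encodes without complaint, and only the surrogates are refused.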
The section of the Wikipedia article you cited actually contains quite a
concentration of misleading information:
"Unpaired surrogate halves may indicate an invalid UTF-16 string was
encoded, or a valid one with a CESU-8 converter."
Even in CESU-8, surrogate halves are expected to be paired
appropriately.
"U+FFFE may indicate encoding of a byte-swapped UTF-16 string as it
is a backwards BOM."
While true, this has very little to do with UTF-8. The process from
which such data was received would have to have been smart enough to
recognize UTF-16 text and convert it to UTF-8, but dumb enough to get
the UTF-16 byte order wrong in the first place.
"U+0080 through U+009F may indicate CP1252 was converted without
translating the characters to Unicode"
This has to do with the original content, not the validity of the UTF-8.
Single bytes of value 0x80 through 0x9F are simply errors. Unicode
scalar values from U+0080 through U+009F (represented in UTF-8 as {0xC2,
0x80} through {0xC2, 0x9F}) may indicate that CP1252 was converted as if
it were ISO 8859-1. In that case, the UTF-8 is perfectly valid but the
underlying data may not be correct.
"U+0080 through U+009F and nothing greater than U+00FF may indicate
double-converted UTF-8."
Again, this confuses validity of UTF-8 with validity of the underlying
content. In any event, incorrect conversion of CP1252 as if it were ISO
8859-1 (above) would fall into this category.
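
For what it is worth, double conversion is easy to reproduce in Python. The function looks_double_converted below is a made-up name for the heuristic in the quoted sentence, and it is only a heuristic: it can miss cases and it can flag text that legitimately contains C1 controls.

```python
def looks_double_converted(text: str) -> bool:
    """Heuristic from the quoted claim: nothing above U+00FF, and at
    least one character in the C1 range U+0080 through U+009F."""
    return (all(ord(c) <= 0xFF for c in text)
            and any(0x80 <= ord(c) <= 0x9F for c in text))

text = "\u20ac"                    # EURO SIGN
once = text.encode("utf-8")        # b'\xe2\x82\xac'

# Mistake: the UTF-8 bytes are read as ISO 8859-1 and encoded again.
twice = once.decode("latin-1").encode("utf-8")
print(twice)                       # b'\xc3\xa2\xc2\x82\xc2\xac'

# The result is well-formed UTF-8, but decodes to U+00E2 U+0082 U+00AC.
garbled = twice.decode("utf-8")
print(looks_double_converted(garbled))  # True
print(looks_double_converted(text))     # False

# If the guess is right, reversing the steps recovers the original.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == text)            # True
```

Again, the double-converted bytes are valid UTF-8 throughout; the problem is entirely in the content.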
"U+DC80 through U+DCFF may be reserved for converting invalid byte
sequences (see above)"
This is flat wrong and bogus and ill-conceived and non-conformant, and
should never, ever be done, full stop.
--
Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
http://www.ewellic.org
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages