From: Dreiheller, Albrecht (albrecht.dreiheller@siemens.com)
Date: Wed Jun 03 2009 - 05:47:10 CDT
Doug Ewell wrote:
>>> In particular, it would be great to know if the range U+0080, …,
>>> U+009F is invalid.
>>
>> Those code points (encoded properly) are valid. However, their
>> appearance may indicate that an error occurred in processing, as the
>> C1 controls would be rare in real Unicode text (and, with the
>> exception of U+0085, are discouraged in XML). They most often arise by
>> treating Windows-1252 as if it were ISO-Latin-1.
>>
>> In other words, not invalid, but suspicious.
>
>But once again, this is a question of the accuracy or fidelity of the
>input data, before it was converted to UTF-8. It has nothing to do with
>the validity of the Unicode characters from U+0080 to U+009F, nor of
>their UTF-8 representations.
Evidently, the Wikipedia author intended to point out a third stage
of checking text data:
After checking for
(a) invalid UTF-8 byte sequences and
(b) invalid code points
there could be another step identifying
(c) typical UTF-8-related faults, using a heuristic that is based on certain preconditions.
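
The three stages can be sketched roughly as follows. This is only an illustration, not anything taken from the Wikipedia article: the function name check_utf8 and the labels "invalid", "suspicious" and "ok" are made up here. It also assumes that stage (b) means surrogates and values above U+10FFFF, which Python's strict UTF-8 decoder already rejects together with malformed byte sequences, so (a) and (b) collapse into one step.

```python
def check_utf8(data: bytes) -> str:
    """Classify bytes claimed to be UTF-8 (labels are hypothetical)."""
    try:
        # Stages (a) and (b): the strict decoder rejects malformed byte
        # sequences, encoded surrogates, and values above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "invalid"

    # Stage (c): heuristic only. C1 controls (U+0080..U+009F) are valid,
    # but rare in real text, so their presence is merely flagged.
    if any(0x80 <= ord(ch) <= 0x9F for ch in text):
        return "suspicious"
    return "ok"
```

Note that "suspicious" input has still passed stages (a) and (b); only the "invalid" result says anything about UTF-8 validity.
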
The two heuristics mentioned assume that a genuine control character from U+0080 to U+009F
is much less likely than a CP1252 text having been mistaken for ISO-8859-1,
or a UTF-8 text having been mistakenly converted to UTF-8 a second time.
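
To make the two assumptions concrete, here is a minimal sketch of how each fault could be undone once it is suspected. The function names are hypothetical, and the sketch assumes the second conversion treated the UTF-8 bytes as ISO-8859-1. Both functions are guesses about the history of the data and can be wrong for text that really contains C1 controls.

```python
def undo_double_utf8(text: str) -> str:
    """Undo 'UTF-8 bytes read as ISO-8859-1 and encoded again'.

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_c1_as_cp1252(text: str) -> str:
    """Reinterpret C1 controls as the CP1252 characters at the same byte."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte has no CP1252 mapping in Python's codec
                # (0x81, 0x8D, 0x8F, 0x90, 0x9D): keep the control.
                pass
        out.append(ch)
    return "".join(out)
```

If both are applied, the double-encoding check should probably run first, because replacing the C1 controls with CP1252 characters would destroy the byte pattern it looks for.
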
There are many other typical encoding-related faults for which heuristics could be specified.
Of course, as said before, these concepts must be strictly distinguished from detecting invalid UTF-8.
Albrecht