RE: Invalid code points

From: Dreiheller, Albrecht (albrecht.dreiheller@siemens.com)
Date: Wed Jun 03 2009 - 05:47:10 CDT


    Doug Ewell wrote:
    >>> In particular, it would be great to know if the range U+0080, ...,
    >>> U+009F is invalid.
    >>
    >> Those code points (encoded properly) are valid. However, their
    >> appearance may indicate that an error occurred in processing, as the
    >> C1 controls would be rare in real Unicode text (and, with the
    >> exception of U+0085, are discouraged in XML). They most often arise by
    >> treating Windows-1252 as if it were ISO-Latin-1.
    >>
    >> In other words, not invalid, but suspicious.
    >
    >But once again, this is a question of the accuracy or fidelity of the
    >input data, before it was converted to UTF-8. It has nothing to do with
    >the validity of the Unicode characters from U+0080 to U+009F, nor of
    >their UTF-8 representations.

    Obviously, the Wikipedia author's intention was to point out a third stage
    of checking text data:
    After checking for
    (a) invalid UTF-8 byte sequences and
    (b) invalid code points
    there could be another step identifying
    (c) typical UTF-8-related faults, using a heuristic based on certain preconditions.
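
    As a rough illustration of the three stages, here is a minimal Python sketch.
    The function name check_text is hypothetical, and reading "invalid code point"
    as "noncharacter" is an assumption made only for this example; other
    definitions of stage (b) are possible. Stage (c) merely reports C1 controls
    as suspicious and never as invalid.

```python
def check_text(data: bytes) -> list[str]:
    """Return findings for stages (a), (b), (c); an empty list means clean."""
    # (a) Invalid UTF-8 byte sequences: strict decoding rejects them.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"(a) invalid UTF-8 at byte offset {exc.start}"]

    findings = []
    for index, char in enumerate(text):
        cp = ord(char)
        # (b) Assumption: "invalid code point" is read here as a noncharacter
        #     (U+FDD0..U+FDEF, or the last two code points of any plane).
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            findings.append(f"(b) noncharacter U+{cp:04X} at index {index}")
        # (c) Heuristic: C1 controls are valid code points, but suspicious.
        elif 0x80 <= cp <= 0x9F:
            findings.append(f"(c) suspicious C1 control U+{cp:04X} at index {index}")
    return findings
```

    Only stage (a) says anything about UTF-8 validity; a finding from stage (c)
    is a hint about the history of the data, nothing more.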

    The two heuristics mentioned assume that the deliberate use of a control character
    from U+0080 to U+009F is much less likely than a CP1252 text having been mistaken
    for ISO-8859-1, or a UTF-8 text having been wrongly converted to UTF-8 a second time.
    There are many other typical encoding-related faults for which heuristics could be specified.
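
    The two heuristics could be sketched in Python roughly as follows. Both
    function names are hypothetical, and the second function assumes that the
    erroneous second conversion treated the UTF-8 bytes as ISO-8859-1; a
    conversion through CP1252 or another code page would need a different check.

```python
def repair_cp1252_as_latin1(text: str) -> str:
    """Map C1 controls back to the CP1252 characters they probably were."""
    out = []
    for char in text:
        if 0x80 <= ord(char) <= 0x9F:
            try:
                char = bytes([ord(char)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in CP1252; keep the control character
        out.append(char)
    return "".join(out)


def looks_double_encoded(text: str) -> bool:
    """Guess whether UTF-8 bytes were read as ISO-8859-1 and encoded again."""
    try:
        # Undo the assumed ISO-8859-1 step, then try to read the bytes as UTF-8.
        decoded = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False
    # Pure ASCII round-trips unchanged and is not evidence of anything.
    return decoded != text
```

    For instance, repair_cp1252_as_latin1("It\x92s") yields "It’s", and
    looks_double_encoded("Ã©") is true while looks_double_encoded("é") is false.
    Both are guesses: a text that really contains C1 controls, or really
    contains "Ã©", would be misjudged.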

    Of course, as said before, these concepts must be strictly distinguished from detecting invalid UTF-8.

    Albrecht


