Re: Invalid code points

From: Doug Ewell (doug@ewellic.org)
Date: Sun May 31 2009 - 15:18:16 CDT


    Hans Aberg <haberg at math dot su dot se> wrote:

    >>> In particular, it would be great to know if the range U+0080, ...,
    >>> U+009F is invalid.
    >>
    >> That bit is especially wrong. I can at least imagine why there might
    >> be confusion about the noncharacters and surrogate code points, but
    >> not the C1 controls.
    >
    > It is a bit disappointing: I was looking for a beginning (escape) byte
    > sequence to tell that a string isn't UTF-8, among other valid strings.
    > But perhaps it does not matter.

    If you're thinking about inventing one, for your own use, then any byte
    sequence that is not valid UTF-8 should do the job. One possibility
    would be {0xA0}.
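
    A minimal sketch of that idea, in Python, with a sentinel of my own
    choosing: a lone 0xA0 is a continuation byte with no lead byte, so it
    can never occur in well-formed UTF-8.

        SENTINEL = b'\xA0'          # marks "this string is not UTF-8"

        def mark_non_utf8(data: bytes) -> bytes:
            return SENTINEL + data

        def is_marked_non_utf8(data: bytes) -> bool:
            return data.startswith(SENTINEL)

        # The sentinel itself is rejected by any conformant UTF-8 decoder:
        try:
            SENTINEL.decode('utf-8')
        except UnicodeDecodeError:
            print('0xA0 by itself is not valid UTF-8')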

    Be sure you understand the difference between an invalid *byte sequence*
    and an invalid *code point*. There are many invalid byte sequences in
    UTF-8. As Mark pointed out, the only invalid code points are the
    surrogates.
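
    To illustrate the distinction, here is a rough Python sketch; the byte
    strings are examples I picked, not an exhaustive list.

        # Invalid *byte sequences*: there are plenty of these in UTF-8.
        for seq in (b'\xC0\xAF',        # overlong encoding of '/'
                    b'\xE2\x82',        # truncated three-byte sequence
                    b'\xED\xA0\x80'):   # would decode to the surrogate U+D800
            try:
                seq.decode('utf-8')
            except UnicodeDecodeError:
                print('invalid byte sequence:', seq)

        # Invalid *code points*: only the surrogates, which a strict
        # encoder refuses outright.
        try:
            '\ud800'.encode('utf-8')
        except UnicodeEncodeError:
            print('U+D800 is not a valid code point to encode')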

    The section of the Wikipedia article you cited actually contains quite a
    concentration of misleading information:

        "Unpaired surrogate halves may indicate an invalid UTF-16 string was
    encoded, or a valid one with a CESU-8 converter."

    Even in CESU-8, surrogate halves are expected to be paired
    appropriately.
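
    For illustration, a rough CESU-8 encoder in Python, built by hand with
    the 'surrogatepass' error handler (the function name is mine). Note
    that the supplementary character comes out as a correctly *paired*
    surrogate sequence.

        def cesu8_encode(text: str) -> bytes:
            # CESU-8 encodes each UTF-16 code unit separately, UTF-8 style.
            units = text.encode('utf-16-be')
            out = bytearray()
            for i in range(0, len(units), 2):
                cu = int.from_bytes(units[i:i+2], 'big')
                out += chr(cu).encode('utf-8', 'surrogatepass')
            return bytes(out)

        # U+10400 -> surrogate pair D801 DC00 -> ED A0 81 ED B0 80
        print(cesu8_encode('\U00010400').hex())   # 'eda081edb080'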

        "U+FFFE may indicate encoding of a byte-swapped UTF-16 string as it
    is a backwards BOM."

    While true, this has very little to do with UTF-8. The process from
    which such data was received would have to have been smart enough to
    recognize UTF-16 text and convert it to UTF-8, but dumb enough to get
    the UTF-16 byte order wrong in the first place.
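
    A sketch of that chain of events in Python; the sample text is mine.

        utf16_le = '\ufeffhi'.encode('utf-16-le')    # BOM + "hi", little-endian
        # A converter that wrongly assumes big-endian input:
        wrong = utf16_le.decode('utf-16-be').encode('utf-8')
        print(wrong.hex())   # starts with 'efbfbe', i.e. U+FFFE, the swapped BOM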

        "U+0080 through U+009F may indicate CP1252 was converted without
    translating the characters to Unicode"

    This has to do with the original content, not the validity of the UTF-8.
    Single bytes of value 0x80 through 0x9F are simply errors. Unicode
    scalar values from U+0080 through U+009F (represented in UTF-8 as {0xC2,
    0x80} through {0xC2, 0x9F}) may indicate that CP1252 was converted as if
    it were ISO 8859-1. In that case, the UTF-8 is perfectly valid but the
    underlying data may not be correct.
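
    A sketch of that misconversion in Python; 0x93 is the CP1252 LEFT
    DOUBLE QUOTATION MARK.

        raw = b'\x93'                              # CP1252-encoded text
        wrong = raw.decode('latin-1').encode('utf-8')
        right = raw.decode('cp1252').encode('utf-8')
        print(wrong.hex())   # 'c293'   -> valid UTF-8 for U+0093, a C1 control
        print(right.hex())   # 'e2809c' -> U+201C, what the byte actually meant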

        "U+0080 through U+009F and nothing greater than U+00FF may indicate
    double-converted UTF-8."

    Again, this confuses validity of UTF-8 with validity of the underlying
    content. In any event, incorrect conversion of CP1252 as if it were ISO
    8859-1 (above) would fall into this category.
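
    A sketch of double-converted UTF-8 in Python; the sample character is
    mine.

        once  = '\u201c'.encode('utf-8')                  # b'\xe2\x80\x9c'
        twice = once.decode('latin-1').encode('utf-8')    # read back as ISO 8859-1
        print([hex(ord(c)) for c in twice.decode('utf-8')])
        # ['0xe2', '0x80', '0x9c'] -- C1 controls appear, nothing above U+00FF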

        "U+DC80 through U+DCFF may be reserved for converting invalid byte
    sequences (see above)"

    This is flat wrong and bogus and ill-conceived and non-conformant, and
    should never, ever be done, full stop.

    --
    Doug Ewell  *  Thornton, Colorado, USA  *  RFC 4645  *  UTN #14
    http://www.ewellic.org
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

