Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 08 2004 - 00:19:11 CST


    Kenneth Whistler <kenw at sybase dot com> wrote:

    > I do not think this is a proposal to amend UTF-8 to allow
    > invalid sequences. So we should get that off the table.

    I hope you are right.

    > Apparently Lars is currently using PUA U+E080..U+E0FF
    > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping
    > of byte values uninterpretable as characters to be converted, and
    > is asking for standard Unicode values for this purpose, instead.

    If I understand correctly, he is using these PUA values when the data is
    in UTF-16, and using bare high-bit bytes (i.e. invalid UTF-8 sequences)
    when the data is in UTF-8, and expecting to convert between the two.
    That has at least two bad implications:

    (1) the PUA characters would not round-trip from UTF-8 to UTF-16 to
    UTF-8, but would be converted to the bare high-bit bytes, and

    (2) the bare high-bit bytes might or might not accidentally form valid
    UTF-8 sequences, which means they might not round-trip either.
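
    Both failure modes can be reproduced with a small sketch. Everything
    here is an assumption made for illustration, not Lars's actual code:
    the range U+EE80..U+EEFF is taken from Ken's example, the helper names
    utf8_to_text and text_to_utf8 are hypothetical, and a Python str
    stands in for the UTF-16 side.

```python
import codecs

PUA_BASE = 0xEE00  # assumed mapping: byte 0x80..0xFF <-> U+EE80..U+EEFF


def _bytes_to_pua(exc):
    """Decode-error handler: turn each undecodable byte into a PUA code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("bytes_to_pua", _bytes_to_pua)


def utf8_to_text(raw: bytes) -> str:
    """UTF-8 side -> text side: invalid bytes become U+EE80..U+EEFF."""
    return raw.decode("utf-8", errors="bytes_to_pua")


def text_to_utf8(text: str) -> bytes:
    """Text side -> UTF-8 side: U+EE80..U+EEFF become bare high-bit bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# (1) A genuine U+EE93 that was already in the UTF-8 data does not survive.
original = "\uEE93".encode("utf-8")          # well-formed bytes EE BA 93
round_tripped = text_to_utf8(utf8_to_text(original))
print(original.hex(" "), "->", round_tripped.hex(" "))

# (2) Two escaped bytes that happen to form a valid sequence do not survive.
escaped = "\uEEC3\uEEA9"                     # placeholders for bytes C3 A9
as_utf8 = text_to_utf8(escaped)              # bare bytes C3 A9
back = utf8_to_text(as_utf8)                 # decodes as U+00E9
print([f"U+{ord(c):04X}" for c in escaped], "->",
      [f"U+{ord(c):04X}" for c in back])
```

    In case (1) three well-formed bytes collapse into one invalid byte; in
    case (2) two placeholder characters merge into a single U+00E9.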

    > Say a process gets handed a "UTF-8" string that contains the
    > byte sequence <61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94>.
    >                         ^^                               ^^
    >
    > The 93 and 94 are just corrupt data -- it cannot be interpreted
    > as UTF-8, and may have been introduced by some process that
    > screwed up smart quotes from Code Page 1252 and UTF-8, for
    > example. Interpreting the string, we have:
    >
    > <U+0061, U+0062, U+0063, ???, U+004D, U+0430, U+4E8C, U+10302, ???>
    >
    > Now *if* I am interpreting Lars correctly, he is using 128
    > PUA code points to *validly* contain any such byte, so that
    > it can be retained. If the range he is using is U+EE80..U+EEFF,
    > then the string would be reinterpreted as:
    >
    > <U+0061, U+0062, U+0063, U+EE93, U+004D, U+0430, U+4E8C, U+10302,
    > U+EE94>
    >
    > which in UTF-8 would be the byte sequence:
    >
    > <61 62 63 EE BA 93 4D D0 B0 E4 BA 8C F0 90 8C 82 EE BA 94>
    >           ^^^^^^^^                               ^^^^^^^^
    >
    > This is now well-formed UTF-8, which anybody could deal with.
    > And if you interpret U+EE93 as meaning "a placeholder for the
    > uninterpreted or corrupt byte 0x93 in the original source",
    > and so on, you could use this representation to exactly
    > preserve the original information, including corruptions,
    > which you could feed back out, byte-for-byte, if you reversed
    > the conversion.
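
    The interpretation Ken describes can be sketched in a few lines. This
    is only an illustration under stated assumptions: the range
    U+EE80..U+EEFF is the one from his example, and the names
    decode_preserving and encode_restoring are hypothetical. The reverse
    step is lossless only if the input never contained real characters in
    that range.

```python
import codecs

PUA_BASE = 0xEE00  # assumed mapping: byte 0x80..0xFF <-> U+EE80..U+EEFF


def _pua_escape(exc):
    """Decode-error handler: keep each undecodable byte as a PUA placeholder."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _pua_escape)


def decode_preserving(raw: bytes) -> str:
    """Interpret bytes as UTF-8, mapping corrupt bytes to U+EE80..U+EEFF."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_restoring(text: str) -> bytes:
    """Reverse the conversion: placeholders go back to the original bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = bytes.fromhex("61 62 63 93 4D D0 B0 E4 BA 8C F0 90 8C 82 94")
text = decode_preserving(raw)

print([f"U+{ord(c):04X}" for c in text])   # code points, with U+EE93 and U+EE94
print(text.encode("utf-8").hex(" "))       # the well-formed UTF-8 form
print(encode_restoring(text) == raw)       # byte-for-byte recovery of the input
```

    Python's built-in "surrogateescape" error handler does the same job
    with lone surrogates U+DC80..U+DCFF instead of PUA code points.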

    Oh, how I hope that is all he is asking for.

    > Now moving from interpretation to critique, I think it unlikely
    > that the UTC would actually want to encode 128 such characters
    > to represent byte values -- and the reasons would be similar to
    > those adduced for rejecting the earlier proposal. Effectively,
    > in either case, these are proposals for enabling representation
    > of arbitrary, embedded binary data (byte streams) in plain text.
    > And that concept is pretty fundamentally antithetical to the
    > Unicode concept of plain text.

    Isn't this an excellent use for the PUA? These characters are private
    anyway; they are defined by some standard other than Unicode, which is
    not evident in the Unicode data.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


