Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Doug Ewell (doug@ewellic.org)
Date: Thu Nov 04 2010 - 18:46:25 CST

    Markus Scherer wrote:

    > While processing 16-bit Unicode text which is not assumed to be
    > well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
    > mostly-inert surrogate code point. However, you cannot unambiguously
    > encode a surrogate code point in 16-bit text (because you could not
    > distinguish a sequence of lead+trail surrogate code points from one
    > supplementary code point), and therefore it is not allowed to encode
    > surrogate code points in any well-formed UTF-8/16/32. [All of this is
    > discussed in The Unicode Standard, Chapter 3.]

    I'm probably missing something here, but I don't agree that it's OK for
    a consumer of UTF-16 to accept an unpaired surrogate without throwing an
    error, or converting it to U+FFFD, or otherwise raising a fuss.
    Unpaired surrogates are ill-formed, and have to be caught and dealt
    with.
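
As a rough illustration of the "catch and deal with" policy, here is a minimal sketch that scans a sequence of 16-bit code units, reports the position of every unpaired surrogate, and substitutes U+FFFD for it. The function name `repair_utf16` and its return shape are assumptions made for this example; it is not the utility discussed in this thread.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Return (repaired_units, bad_positions) for a list of 16-bit units.

    Well-formed lead+trail pairs are copied through unchanged; every
    unpaired surrogate is replaced by U+FFFD and its index is recorded.
    """
    repaired = []
    bad_positions = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            # Proper surrogate pair: keep both units.
            repaired.append(unit)
            repaired.append(units[i + 1])
            i += 2
            continue
        if is_lead(unit) or is_trail(unit):
            # Lone lead or lone trail: report and replace.
            bad_positions.append(i)
            repaired.append(REPLACEMENT)
        else:
            repaired.append(unit)
        i += 1
    return repaired, bad_positions


# 'A', a valid pair, a lone lead, 'B', a lone trail.
sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
fixed, bad = repair_utf16(sample)
```

A stricter consumer could raise an error as soon as `bad_positions` would become non-empty instead of substituting U+FFFD; which policy is appropriate is exactly the point under discussion.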

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

