Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Doug Ewell (doug@ewellic.org)
Date: Thu Nov 04 2010 - 18:46:25 CST

Next message: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Jim Monty: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Asmus Freytag: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Markus Scherer wrote:

> While processing 16-bit Unicode text which is not assumed to be
> well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
> mostly-inert surrogate code point. However, you cannot unambiguously
> encode a surrogate code point in 16-bit text (because you could not
> distinguish a sequence of lead+trail surrogate code points from one
> supplementary code point), and therefore it is not allowed to encode
> surrogate code points in any well-formed UTF-8/16/32. [All of this is
> discussed in The Unicode Standard, Chapter 3.]
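
[Illustration, not part of the original message.] The ambiguity described in the quoted text can be sketched in a few lines of Python. The encoder below is a hypothetical, deliberately naive one that does not reject surrogate code points; the function name is an assumption made for this example only.

```python
def encode_utf16_units(code_points):
    """Naively map code points to 16-bit code units.

    Deliberately performs NO check for surrogate code points,
    so that the resulting ambiguity becomes visible.
    """
    units = []
    for cp in code_points:
        if cp >= 0x10000:
            # Supplementary code point: split into a lead/trail pair.
            v = cp - 0x10000
            units.append(0xD800 + (v >> 10))
            units.append(0xDC00 + (v & 0x3FF))
        else:
            # BMP value (including, wrongly, surrogates): one unit.
            units.append(cp)
    return units


# Two surrogate code points and one supplementary code point
# produce the same sequence of 16-bit units.
as_two_surrogates = encode_utf16_units([0xD800, 0xDC00])
as_one_supplementary = encode_utf16_units([0x10000])
```

Both calls yield the units D800 DC00, so a decoder has no way to tell which input was meant.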

I'm probably missing something here, but I don't agree that it's OK for
a consumer of UTF-16 to accept an unpaired surrogate without throwing an
error, or converting it to U+FFFD, or otherwise raising a fuss.
Unpaired surrogates are ill-formed, and have to be caught and dealt
with.
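
[Illustration, not part of the original message.] A minimal sketch of the two reactions mentioned above, throwing an error or converting to U+FFFD, assuming the input is a plain list of 16-bit code unit values. The names repair_surrogates and strict are hypothetical and do not refer to any existing utility.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(u):
    return 0xD800 <= u <= 0xDBFF


def is_trail(u):
    return 0xDC00 <= u <= 0xDFFF


def repair_surrogates(units, strict=False):
    """Scan 16-bit code units for unpaired surrogates.

    Returns (repaired_units, error_indexes). With strict=True the
    first unpaired surrogate raises ValueError instead.
    """
    out = []
    errors = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            # Well-formed pair: copy both units through unchanged.
            out.extend((u, units[i + 1]))
            i += 2
            continue
        if is_lead(u) or is_trail(u):
            # Lead without trail, or trail without preceding lead.
            if strict:
                raise ValueError(
                    f"unpaired surrogate 0x{u:04X} at index {i}")
            errors.append(i)
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out, errors
```

For example, repair_surrogates([0x41, 0xD800, 0x42]) replaces the lone lead surrogate with U+FFFD and reports index 1, while the same call with strict=True raises an error instead.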

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s


This archive was generated by hypermail 2.1.5 : Thu Nov 04 2010 - 18:49:50 CST