From: Doug Ewell (doug@ewellic.org)
Date: Thu Nov 04 2010 - 18:46:25 CST
Markus Scherer wrote:
> While processing 16-bit Unicode text which is not assumed to be
> well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
> mostly-inert surrogate code point. However, you cannot unambiguously
> encode a surrogate code point in 16-bit text (because you could not
> distinguish a sequence of lead+trail surrogate code points from one
> supplementary code point), and therefore it is not allowed to encode
> surrogate code points in any well-formed UTF-8/16/32. [All of this is
> discussed in The Unicode Standard, Chapter 3.]
I'm probably missing something here, but I don't agree that it's OK for
a consumer of UTF-16 to accept an unpaired surrogate without throwing an
error, or converting it to U+FFFD, or otherwise raising a fuss.
Unpaired surrogates are ill-formed, and have to be caught and dealt
with.
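
A minimal sketch of the two positions in this thread, not taken from either message: a decoder over 16-bit code units that either raises on an unpaired surrogate, substitutes U+FFFD, or passes it through as a surrogate code point. The function name `decode_utf16_units` and the policy names "strict", "replace", and "passthrough" are assumptions made up for this illustration.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def _unpaired(unit, index, errors):
    """Apply the chosen policy to an unpaired surrogate code unit."""
    if errors == "strict":
        raise ValueError(f"unpaired surrogate 0x{unit:04X} at index {index}")
    if errors == "replace":
        return REPLACEMENT
    if errors == "passthrough":
        # Lenient processing: keep it as a (mostly inert) surrogate code point.
        return unit
    raise ValueError(f"unknown error policy: {errors!r}")


def decode_utf16_units(units, errors="strict"):
    """Decode a sequence of 16-bit code units into a list of code points."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # Lead surrogate: well-formed only if a trail surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                trail = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (trail - 0xDC00))
                i += 2
                continue
            out.append(_unpaired(u, i, errors))
        elif 0xDC00 <= u <= 0xDFFF:
            # Trail surrogate with no preceding lead.
            out.append(_unpaired(u, i, errors))
        else:
            out.append(u)
        i += 1
    return out
```

With "passthrough", the input [0xD83D, 0xDE00] still decodes to the single code point U+1F600, never to two surrogate code points, which is the ambiguity described in the quoted text: surrogate code points cannot be round-tripped through 16-bit text.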
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s