From: Doug Ewell (doug@ewellic.org)
Date: Fri Nov 05 2010 - 08:02:34 CST
Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>> I'm probably missing something here, but I don't agree that it's OK
>> for a consumer of UTF-16 to accept an unpaired surrogate without
>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>> dealt with.
>
> The question is whether you want every library that handles strings
> to perform the equivalent of a citizen's arrest, or whether you
> architect things so that the gatekeepers (border control) police the
> data stream.

If you can have upstream libraries check for unpaired surrogates at the
time they convert UTF-16 to Unicode code points, then your point is well
taken, because then the downstream libraries are no longer dealing with
UTF-16, but with code points. Doing conversion and validation at
different stages isn't a great idea; that's how character encodings get
involved with security problems.
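
To make the "validate at the time you convert" idea concrete, here is a minimal sketch in Python. The function name, the `strict` flag, and the choice of `ValueError` are my own assumptions for illustration and are not taken from any particular library. It checks surrogate pairing in the same pass that produces code points, so nothing downstream ever sees a lone surrogate.

```python
REPLACEMENT_CHARACTER = 0xFFFD


def utf16_to_code_points(units, strict=True):
    """Convert 16-bit code units to code points, validating in the same pass.

    Hypothetical helper: with strict=True an unpaired surrogate raises
    ValueError; with strict=False it is replaced by U+FFFD.
    """
    code_points = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows directly.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                code_points.append(
                    0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                )
                i += 2
                continue
            if strict:
                raise ValueError(f"unpaired high surrogate at index {i}")
            code_points.append(REPLACEMENT_CHARACTER)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            if strict:
                raise ValueError(f"unpaired low surrogate at index {i}")
            code_points.append(REPLACEMENT_CHARACTER)
        else:
            code_points.append(unit)
        i += 1
    return code_points
```

Either behavior (raise, or substitute U+FFFD) counts as "raising a fuss"; what matters is that the decision is made once, at the point of conversion, and not left for each downstream library to rediscover.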

Corrigendum #1 closed the door on interpretation of invalid UTF-8
sequences. I'm not sure why the approach to handling UTF-16 should be
any different.
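
As a rough illustration of treating the two encoding forms the same way, the sketch below uses Python's built-in strict decoders, which reject both a non-shortest-form UTF-8 sequence (the kind of input Corrigendum #1 addressed) and an unpaired UTF-16 surrogate. The helper name `is_well_formed` is hypothetical; only `bytes.decode` and `UnicodeDecodeError` are real Python features.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Return True if the strict built-in decoder accepts the byte sequence."""
    try:
        data.decode(encoding)  # errors="strict" is the default
    except UnicodeDecodeError:
        return False
    return True


# Non-shortest-form ("overlong") UTF-8 encoding of U+002F: rejected.
overlong_slash = b"\xc0\xaf"

# Lone high surrogate D800 followed by "A", in UTF-16LE: rejected.
lone_surrogate = b"\x00\xd8A\x00"

for label, data, encoding in [
    ("overlong UTF-8", overlong_slash, "utf-8"),
    ("unpaired UTF-16 surrogate", lone_surrogate, "utf-16-le"),
]:
    print(label, "accepted" if is_well_formed(data, encoding) else "rejected")
```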

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s