From: Mark Davis ☕ (mark@macchiato.com)
Date: Fri Nov 05 2010 - 14:57:38 CST
I'm in general agreement.
1. A Unicode 16-bit string can contain any sequence of 16-bit code units:
it might or might not be valid UTF-16.
2. Whenever a process is emitting a Unicode string, if it is *guaranteeing*
that it is UTF-16, it must catch any unpaired surrogates and fix them
(e.g. replace them with U+FFFD).
3. It is a burden on processes to always guarantee UTF-16 conformance,
and the vast majority of processing can handle a Unicode string robustly,
just treating the unpaired surrogates as UNASSIGNED.
4. Whenever a process is accepting a Unicode string, if it is requiring
that the string is UTF-16 it has a couple of choices: if the source is
'trusted' and purports to supply UTF-16, no problem; otherwise the process
needs to validate the input for safety.
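
As a rough illustration of points 1-3, here is a minimal Python sketch. It
assumes a 16-bit string is modelled as a sequence of integer code units; the
names code_points, is_well_formed_utf16 and to_utf16 are hypothetical and not
taken from any library. Ordinary processing walks the string leniently and
passes lone surrogates through untouched; only the emitting step that
guarantees UTF-16 repairs them.

```python
from typing import Iterator, List, Sequence

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def _is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def _is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def code_points(units: Sequence[int]) -> Iterator[int]:
    """Lenient walk over 16-bit code units (point 3).

    Properly paired surrogates are combined into one supplementary code
    point; an unpaired surrogate is yielded unchanged, so the caller can
    treat it like any other code point without special meaning.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if _is_high(u) and i + 1 < n and _is_low(units[i + 1]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """True if the code unit sequence contains no unpaired surrogate (point 1)."""
    return not any(0xD800 <= cp <= 0xDFFF for cp in code_points(units))


def to_utf16(units: Sequence[int]) -> List[int]:
    """Emit guaranteed UTF-16: each unpaired surrogate becomes U+FFFD (point 2)."""
    out: List[int] = []
    for cp in code_points(units):
        if 0xD800 <= cp <= 0xDFFF:
            out.append(REPLACEMENT)
        elif cp >= 0x10000:
            v = cp - 0x10000
            out.append(0xD800 + (v >> 10))
            out.append(0xDC00 + (v & 0x3FF))
        else:
            out.append(cp)
    return out
```

For example, to_utf16([0x41, 0xD83D, 0xDE00, 0xDC00]) keeps the valid pair and
returns [0x41, 0xD83D, 0xDE00, 0xFFFD].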
Mark
*— Il meglio è l’inimico del bene —*
On Fri, Nov 5, 2010 at 11:54, Asmus Freytag <asmusf@ix.netcom.com> wrote:
> On 11/5/2010 7:02 AM, Doug Ewell wrote:
>
>> Asmus Freytag<asmusf at ix dot netcom dot com> wrote:
>>
>>>> I'm probably missing something here, but I don't agree that it's OK
>>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>>> dealt with.
>>>>
>>> The question is whether you want every library that handles strings
>>> to perform the equivalent of a citizen's arrest, or whether you architect
>>> things so that the gatekeepers (border control) police the data stream.
>>>
>> If you can have upstream libraries check for unpaired surrogates at the
>> time they convert UTF-16 to Unicode code points, then your point is well
>> taken, because then the downstream libraries are no longer dealing with
>> UTF-16, but with code points. Doing conversion and validation at
>> different stages isn't a great idea; that's how character encodings get
>> involved with security problems.
>>
>
> Note that I am careful not to suggest that (and I'm sure Markus isn't
> either). "Handling" includes much more than code conversion. It includes
> uppercasing, spell checking, sorting, searching, the whole lot. Burdening
> every single one of those tasks with policing the integrity of the encoding
> seems wasteful, and, as I tried to explain, puts the error detection in a
> place where you'll be most likely prevented from doing something useful in
> recovery.
>
> Data import or code conversion routines are in a much better place,
> architecturally, to allow the user meaningful options to deal with corrupted
> data, from rejection to attempts at repair.
>
> However, some tasks, such as network identifier matching, are
> security-sensitive and must re-validate their input, even if the data has
> already passed a gatekeeper routine such as a validating code conversion
> routine.
>
>
>> Corrigendum #1 closed the door on interpretation of invalid UTF-8
>> sequences. I'm not sure why the approach to handling UTF-16 should be
>> any different.
>>
>>
>>
>
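
The gatekeeper arrangement described in the quoted message might look roughly
like the Python sketch below. It assumes strings are sequences of integer
16-bit code units; import_utf16, identifiers_match and the on_error policy
names are hypothetical, chosen only for illustration. The import routine
offers the caller a choice between rejecting and repairing, while the
security-sensitive comparison re-validates its arguments regardless of what
happened upstream.

```python
from typing import List, Sequence


def import_utf16(units: Sequence[int], on_error: str = "reject") -> List[int]:
    """Gatekeeper: return well-formed UTF-16 code units or raise ValueError.

    on_error="reject"  raises on the first unpaired surrogate;
    on_error="replace" substitutes U+FFFD for each unpaired surrogate.
    """
    if on_error not in ("reject", "replace"):
        raise ValueError(f"unknown policy: {on_error!r}")
    out: List[int] = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid surrogate pair: copy both code units through.
            out.append(u)
            out.append(units[i + 1])
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: apply the caller's policy.
            if on_error == "reject":
                raise ValueError(f"unpaired surrogate U+{u:04X} at index {i}")
            u = 0xFFFD
        out.append(u)
        i += 1
    return out


def identifiers_match(a: Sequence[int], b: Sequence[int]) -> bool:
    """Security-sensitive comparison: re-validate both inputs, then compare.

    Ill-formed input raises ValueError instead of being silently repaired,
    so two different corrupted identifiers can never compare equal.
    """
    return import_utf16(a, "reject") == import_utf16(b, "reject")
```

Routines between the import step and the comparison (uppercasing, sorting,
searching) would not call either function; that is the point of placing the
check at the border.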