Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Mark Davis ☕
Date: Fri Nov 05 2010 - 14:57:38 CST


    I'm in general agreement.

       1. A Unicode 16-bit string can contain any sequence of 16-bit code units:
       it might or might not be valid UTF-16.
       2. Whenever a process is emitting a Unicode string, if it is
       *guaranteeing* that it is UTF-16, it must catch any unpaired surrogates
       and fix them (e.g. replace them with U+FFFD).
       3. It is a burden on processes to always guarantee UTF-16 conformance,
       and the vast majority of processing can handle a Unicode string robustly,
       just treating the unpaired surrogates as UNASSIGNED.
       4. Whenever a process is accepting a Unicode string, if it is requiring
       that the string is UTF-16 it has a couple of choices: if the source is
       'trusted' and purports to supply UTF-16, no problem; otherwise the process
       needs to validate the input for safety.
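
A minimal sketch of points 2 and 4, assuming the string is held as a plain sequence of 16-bit code units (Python integers here). The function names `is_well_formed_utf16` and `repair_utf16` are hypothetical, chosen only for illustration; they are not taken from any existing library. The first is the kind of check an accepting process might run on untrusted input, the second is the kind of repair an emitting process might apply before guaranteeing UTF-16.

```python
from typing import List, Sequence

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_high(unit: int) -> bool:
    """True for a high (lead) surrogate code unit, D800..DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    """True for a low (trail) surrogate code unit, DC00..DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate in `units` is part of a proper pair."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if is_high(unit):
            # A high surrogate must be followed immediately by a low one.
            if i + 1 < n and is_low(units[i + 1]):
                i += 2
                continue
            return False
        if is_low(unit):
            # A low surrogate with no preceding high surrogate is unpaired.
            return False
        i += 1
    return True


def repair_utf16(units: Sequence[int]) -> List[int]:
    """Copy `units`, replacing each unpaired surrogate with U+FFFD."""
    out: List[int] = []
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if is_high(unit) and i + 1 < n and is_low(units[i + 1]):
            out.extend((unit, units[i + 1]))  # keep the valid pair intact
            i += 2
        elif is_high(unit) or is_low(unit):
            out.append(REPLACEMENT)  # unpaired surrogate
            i += 1
        else:
            out.append(unit)  # ordinary BMP code unit
            i += 1
    return out
```

For example, `repair_utf16([0x0041, 0xD800, 0x0042])` yields `[0x0041, 0xFFFD, 0x0042]`, while a well-formed pair such as `[0xD83D, 0xDE00]` passes through unchanged. Replacing one code unit with one U+FFFD keeps the string length stable; other replacement policies are equally possible.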


    *— Il meglio è l’inimico del bene —*

    On Fri, Nov 5, 2010 at 11:54, Asmus Freytag wrote:

    > On 11/5/2010 7:02 AM, Doug Ewell wrote:
    >> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
    >>>> I'm probably missing something here, but I don't agree that it's OK
    >>>> for a consumer of UTF-16 to accept an unpaired surrogate without
    >>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
    >>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
    >>>> dealt with.
    >>> The question is whether you want every library that handles strings
    >>> to perform the equivalent of a citizen's arrest, or whether you architect
    >>> things so that the gatekeepers (border control) police the data stream.
    >> If you can have upstream libraries check for unpaired surrogates at the
    >> time they convert UTF-16 to Unicode code points, then your point is well
    >> taken, because then the downstream libraries are no longer dealing with
    >> UTF-16, but with code points. Doing conversion and validation at
    >> different stages isn't a great idea; that's how character encodings get
    >> involved with security problems.
    > Note that I am careful not to suggest that (and I'm sure Markus isn't
    > either). "Handling" includes much more than code conversion. It includes
    > uppercasing, spell checking, sorting, searching, the whole lot. Burdening
    > every single one of those tasks with policing the integrity of the encoding
    > seems wasteful, and, as I tried to explain, puts the error detection in a
    > place where you'll be most likely prevented from doing something useful in
    > recovery.
    > Data import or code conversion routines are in a much better place,
    > architecturally, to allow the user meaningful options to deal with corrupted
    > data, from rejecting it to attempting repair.
    > However, some tasks, such as network identifier matching, are
    > security-sensitive and must re-validate their input, even if the data has
    > already passed a gate keeper routine such as a validating code conversion
    > routine.
    >> Corrigendum #1 closed the door on interpretation of invalid UTF-8
    >> sequences. I'm not sure why the approach to handling UTF-16 should be
    >> any different.
