Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Nov 04 2010 - 11:46:16 CST


    On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell <doug@ewellic.org> wrote:

    > It may be that broken UTF-16 text doesn't appear that often in the real
    > world. Certainly it's a test case that should be detected and handled
    > (and I always do so when rolling my own transcoders), but perhaps not
    > many people besides you have actually been bitten such that they needed
    > such a tool.
    >

    16-bit Unicode is convenient in that when you find an unpaired surrogate
    (that is, the text is not well-formed UTF-16), you can usually just treat it
    as a surrogate code point, which normally has default properties much like
    an unassigned code point or a noncharacter. It case-maps to itself,
    normalizes to itself, has default Unicode property values (except for the
    general category), etc.
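
    As a rough illustration of those default properties, here is a small
    Python check. It is only a sketch: Python strings hold code points rather
    than 16-bit code units, so this shows the code point properties, not
    UTF-16 processing as such. The helper name surrogate_properties is made
    up for this example.

```python
import unicodedata

# A lone lead surrogate, held in a Python str as the surrogate code point U+D800.
lone = "\ud800"


def surrogate_properties(ch):
    """Return (general category, case-maps to itself, NFC-normalizes to itself).

    Hypothetical helper for illustration only.
    """
    return (
        unicodedata.category(ch),
        ch.upper() == ch and ch.lower() == ch,
        unicodedata.normalize("NFC", ch) == ch,
    )


props = surrogate_properties(lone)
# Print only the properties; printing the lone surrogate itself could fail
# when the output stream encodes to UTF-8.
print(props)
```

    The general category is the one property that differs from an unassigned
    code point: it is reported as Cs (surrogate) rather than Cn.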

    In other words, when you process 16-bit Unicode text it takes no effort to
    handle unpaired surrogates, other than making sure that you only assemble a
    supplementary code point when a lead surrogate is really followed by a trail
    surrogate. Hence little need for cleanup functions -- but if you need one,
    it's trivial to write one for UTF-16.
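
    A minimal Python sketch of both ideas, assuming the text is available as
    a sequence of 16-bit integers. The function names are invented for this
    example, and replacing unpaired surrogates with U+FFFD is just one
    possible repair policy, not something prescribed above.

```python
def is_lead(u):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_trail(u):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def code_points(units):
    """Yield code points from 16-bit code units.

    A supplementary code point is assembled only when a lead surrogate is
    really followed by a trail surrogate; any other surrogate is passed
    through unchanged as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if is_lead(u) and i < n and is_trail(units[i]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        else:
            yield u


def repair(units, replacement=0xFFFD):
    """Return a copy of units with each unpaired surrogate replaced.

    Assumption: U+FFFD is the desired replacement; other policies (dropping
    the unit, raising an error) would be equally easy to write.
    """
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            out.extend((u, units[i + 1]))  # well-formed pair, keep as is
            i += 2
        else:
            out.append(replacement if is_lead(u) or is_trail(u) else u)
            i += 1
    return out


# 'A', a valid pair for U+1F600, a lone lead, 'B', a lone trail
sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
decoded = list(code_points(sample))
repaired = repair(sample)
```

    Note that code_points needs no special error path: the unpaired units
    simply come out as code points in the surrogate range.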

    markus


