Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Thu Nov 04 2010 - 15:52:19 CST

    Markus Scherer wrote:
    > Doug Ewell wrote:
    > > It may be that broken UTF-16 text doesn't appear that often in the
    > > real world.
    >
    > 16-bit Unicode is convenient in that when you find an unpaired surrogate
    > (that is, it's not well-formed UTF-16) you can usually just treat it like
    > a surrogate code point which normally has default properties much like an
    > unassigned code point or noncharacter. It case-maps to itself, normalizes
    > to itself, has default Unicode property values (except for the general
    > category), etc.
    >
    > In other words, when you process 16-bit Unicode text it takes no effort to
    > handle unpaired surrogates, other than making sure that you only assemble a
    > supplementary code point when a lead surrogate is really followed by a trail
    > surrogate. Hence little need for cleanup functions -- but if you need one,
    > it's trivial to write one for UTF-16.

    Thank you! This is what I've always understood about the design of the UTFs:
    they're generally quite robust. One errant character doesn't make the whole text
    unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
    reasonably straightforward to handle anomalies.
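
    To make "reasonably straightforward" concrete, here is a minimal sketch of
    the report-and-repair pass discussed in this thread, written in Python
    rather than Perl. The function names (repair_utf16, utf16le_to_utf8) are
    hypothetical, and the policy of substituting U+FFFD for each unpaired
    surrogate is an assumption; leaving the surrogate in place, as Markus
    describes, is an equally valid choice for in-memory processing. The input
    is assumed to be little-endian UTF-16 with no BOM.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed repair policy)


def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Scan 16-bit code units; return (code_points, problems).

    problems lists (index, unit) for every unpaired surrogate found.
    Each one is replaced by U+FFFD in code_points.
    """
    code_points = []
    problems = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            # Only assemble a supplementary code point when a lead
            # surrogate is really followed by a trail surrogate.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            code_points.append(cp)
            i += 2
        elif is_lead(unit) or is_trail(unit):
            # Unpaired surrogate: report it and substitute U+FFFD.
            problems.append((i, unit))
            code_points.append(REPLACEMENT)
            i += 1
        else:
            code_points.append(unit)
            i += 1
    return code_points, problems


def utf16le_to_utf8(data):
    """Transcode little-endian UTF-16 bytes (no BOM) to UTF-8, repairing as we go.

    A trailing odd byte, if any, is ignored in this sketch.
    """
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data[: count * 2])
    code_points, problems = repair_utf16(units)
    text = "".join(chr(cp) for cp in code_points)
    return text.encode("utf-8"), problems
```

    Under these assumptions, utf16le_to_utf8(data) returns the UTF-8 bytes
    together with a list of (index, code unit) pairs, so the caller can report
    each broken surrogate instead of dying on the first one.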

    So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16 file
    to a UTF-8 file and it died immediately on the first text file I tested it on. I
    got this error message:

        UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
        <$utf16_dat_fh> line 119.

    So I checked the documentation
    (http://search.cpan.org/dist/Encode/Unicode/Unicode.pm#Error_Checking) and read
    this:

        Unlike most encodings which accept various ways to handle errors,
        Unicode encodings simply croaks.

        ...

        Unlike other encodings where mappings are not one-to-one against
        Unicode, UTFs are supposed to map 100% against one another. So
        Encode is more strict on UTFs.

        Consider that "division by zero" of Encode :)

    I see nothing to grin about. Division by zero? Seriously? This effectively means
    I can't use Perl to transcode Unicode, at least not in the imperfect world *I*
    live in.

    And GNU iconv is no better. It fails to transcode the same file with an even
    more laconic error message:

        iconv: Data.txt: cannot convert

    I guess I should appeal to the maintainer of the Perl core Encode module to
    loosen the shackles a bit, eh?
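
    For comparison, this is roughly what "looser" could look like. The sketch
    below uses Python, not Perl, and says nothing about how Encode or iconv
    behave or could be configured. It relies on the errors argument that
    Python's built-in codecs accept; the function name transcode_lenient and
    the default of little-endian input are assumptions made for illustration.

```python
def transcode_lenient(utf16_bytes, encoding="utf-16-le"):
    """Transcode UTF-16 bytes to UTF-8 without failing on malformed input.

    errors="replace" asks the decoder to substitute U+FFFD for each
    malformed sequence instead of raising UnicodeDecodeError.
    """
    text = utf16_bytes.decode(encoding, errors="replace")
    return text.encode("utf-8")


def transcode_strict(utf16_bytes, encoding="utf-16-le"):
    """Same conversion with the default policy: raise on the first error."""
    return utf16_bytes.decode(encoding).encode("utf-8")
```

    With an unpaired lead surrogate such as DB82 in the input, the strict
    variant raises on the first error, much like the failures above, while the
    lenient one keeps going and marks the damage with U+FFFD.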

    Thank you all for your very helpful responses.

    Jim Monty
