Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Thu Nov 04 2010 - 15:52:19 CST

Next message: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Markus Scherer wrote:
> Doug Ewell wrote:
> > It may be that broken UTF-16 text doesn't appear that often in the
> > real world.
>
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like
> a surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes
> to itself, has default Unicode property values (except for the general
> category), etc.
>
> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.
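
To illustrate the point above, here is a minimal Python sketch of such a cleanup function. It is only an illustration under stated assumptions, not code from anyone in this thread: the input is assumed to be a list of 16-bit code units already extracted from the file, the repair policy (substituting U+FFFD for each unpaired surrogate) is one choice among several, and the name `repair_utf16_units` is hypothetical.

```python
REPLACEMENT = 0xFFFD  # assumption: replace each unpaired surrogate with U+FFFD


def is_lead(unit):
    """True if the code unit is a lead (high) surrogate."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True if the code unit is a trail (low) surrogate."""
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16_units(units):
    """Return (repaired_units, report).

    report is a list of (index, unit) for every unpaired surrogate found.
    A pair is kept only when a lead surrogate is immediately followed by
    a trail surrogate; any other surrogate is reported and replaced.
    """
    repaired, report = [], []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_lead(unit) and i + 1 < len(units) and is_trail(units[i + 1]):
            # Well-formed pair: copy both units through unchanged.
            repaired.extend((unit, units[i + 1]))
            i += 2
            continue
        if is_lead(unit) or is_trail(unit):
            # Unpaired surrogate: record it, then substitute.
            report.append((i, unit))
            unit = REPLACEMENT
        repaired.append(unit)
        i += 1
    return repaired, report


# Hypothetical sample: a lone lead, a valid pair, and a lone trail.
sample = [0x0041, 0xDB82, 0x0042, 0xD83D, 0xDE00, 0xDC00]
repaired, report = repair_utf16_units(sample)
```

Keeping the report separate from the repair makes the same pass usable either to list the problems or to fix them.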

Thank you! This is what I've always understood about the design of the UTFs:
they're generally quite robust. One errant character doesn't make the whole text
unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
reasonably straightforward to handle anomalies.
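
As a rough sketch of what I mean by "straightforward", here is how a lenient conversion might look in Python, using the `errors="replace"` handler of the standard codecs. The assumptions are mine: the input is little-endian UTF-16 with no byte order mark, each malformed sequence is replaced with U+FFFD rather than preserved, and the function name is hypothetical.

```python
def utf16le_to_utf8_lenient(data: bytes) -> bytes:
    """Transcode UTF-16LE bytes to UTF-8 without failing on bad input.

    errors="replace" makes the decoder substitute U+FFFD for each
    malformed sequence (such as an unpaired surrogate) instead of
    raising UnicodeDecodeError.
    """
    text = data.decode("utf-16-le", errors="replace")
    return text.encode("utf-8")


# Hypothetical sample: "A", the unpaired lead surrogate 0xDB82, then "B".
sample = "A".encode("utf-16-le") + b"\x82\xdb" + "B".encode("utf-16-le")
converted = utf16le_to_utf8_lenient(sample)
```

A strict decode of the same bytes raises an exception, so the only difference here is the error-handling policy, not the transcoding itself.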

So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16 file
to a UTF-8 file and it died immediately on the first text file I tested it on. I
got this error message:

    UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
    <$utf16_dat_fh> line 119.

So I checked the documentation
(http://search.cpan.org/dist/Encode/Unicode/Unicode.pm#Error_Checking) and read
this:

    Unlike most encodings which accept various ways to handle errors,
    Unicode encodings simply croaks.

    ...

    Unlike other encodings where mappings are not one-to-one against
    Unicode, UTFs are supposed to map 100% against one another. So
    Encode is more strict on UTFs.

    Consider that "division by zero" of Encode :)

I see nothing to grin about. Division by zero? Seriously? This effectively means
I can't use Perl to transcode Unicode, at least not in the imperfect world *I*
live in.

And GNU iconv is no better. It fails to transcode the same file with an even
more laconic error message:

    iconv: Data.txt: cannot convert

I guess I should appeal to the maintainer of the Perl core Encode module to
loosen the shackles a bit, eh?

Thank you all for your very helpful responses.

Jim Monty


This archive was generated by hypermail 2.1.5 : Thu Nov 04 2010 - 15:57:24 CST