Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Thu Nov 04 2010 - 18:07:55 CST


Thank you, Markus, for your clear, authoritative explanation and for talking me down from the ledge.

Björn Höhrmann kindly suggested using 'uconv' that comes with ICU. That's what I'll use to repair the corrupted UTF-16 text I have in hand.

My true object is to demonstrate to a software maker that its product emits bad Unicode text that cannot be transcoded from UTF-16 to, say, UTF-8 using any old transcoder. I'm just trying to make the world a better place.

Jim Monty

On Thu, Nov 4, 2010 at 2:52 PM, Jim Monty <jim.monty@yahoo.com> wrote:

>> In other words, when you process 16-bit Unicode text it takes no effort to
>> handle unpaired surrogates, other than making sure that you only assemble a
>> supplementary code point when a lead surrogate is really followed by a trail
>> surrogate. Hence little need for cleanup functions -- but if you need one,
>> it's trivial to write one for UTF-16.
>
> Thank you! This is what I've always understood about the design of the UTFs:
> they're generally quite robust. One errant character doesn't make the whole text
> unusable. And in the case of transcoding from, say, UTF-16 to UTF-8, it's
> reasonably straightforward to handle anomalies.
>
> So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16 file
> to a UTF-8 file and it died immediately on the first text file I tested it on. I
> got this error message:
>
>     UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
>     <$utf16_dat_fh> line 119.

There is a difference between processing "16-bit Unicode text" and converting to UTF-8 or UTF-32, and even well-formed UTF-16.

While processing 16-bit Unicode text which is not assumed to be well-formed UTF-16, you can treat (decode) an unpaired surrogate as a mostly-inert surrogate code point.

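
[Editorial sketch, not part of the original message.] The pairing rule described above, and the "trivial" cleanup function it alludes to, might look roughly like this in Python. It assumes the text is already available as a list of 16-bit integers; the function names are hypothetical and do not come from ICU, Perl's Encode, or any other library.

```python
LEAD_MIN, LEAD_MAX = 0xD800, 0xDBFF    # lead (high) surrogate range
TRAIL_MIN, TRAIL_MAX = 0xDC00, 0xDFFF  # trail (low) surrogate range


def is_pair_at(units, i):
    """True if units[i] is a lead surrogate directly followed by a trail."""
    return (LEAD_MIN <= units[i] <= LEAD_MAX
            and i + 1 < len(units)
            and TRAIL_MIN <= units[i + 1] <= TRAIL_MAX)


def decode_16bit_units(units):
    """Leniently decode 16-bit code units into code points.

    A proper lead+trail pair becomes one supplementary code point.
    An unpaired surrogate is passed through as a surrogate code point.
    """
    code_points = []
    i = 0
    while i < len(units):
        if is_pair_at(units, i):
            lead, trail = units[i], units[i + 1]
            code_points.append(
                0x10000 + ((lead - LEAD_MIN) << 10) + (trail - TRAIL_MIN))
            i += 2
        else:
            code_points.append(units[i])
            i += 1
    return code_points


def repair_unpaired_surrogates(units, replacement=0xFFFD):
    """Return a copy of units with each unpaired surrogate replaced."""
    repaired = []
    i = 0
    while i < len(units):
        if is_pair_at(units, i):
            repaired.extend(units[i:i + 2])   # keep well-formed pairs
            i += 2
        elif LEAD_MIN <= units[i] <= TRAIL_MAX:
            repaired.append(replacement)      # lone lead or lone trail
            i += 1
        else:
            repaired.append(units[i])
            i += 1
    return repaired
```

For example, given the units 0x0041, 0xD83D, 0xDE00, 0xDB82, 0x0042, the decoder assembles 0xD83D 0xDE00 into U+1F600 and passes the unpaired 0xDB82 through unchanged, while the repair function replaces only 0xDB82 with U+FFFD.
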
However, you cannot unambiguously encode a surrogate code point in 16-bit text (because you could not distinguish a sequence of lead+trail surrogate code points from one supplementary code point), and therefore it is not allowed to encode surrogate code points in any well-formed UTF-8/16/32. [All of this is discussed in The Unicode Standard, Chapter 3.]

So a converter is correct in treating an unpaired surrogate as an error.

On the other hand...

> I guess I should appeal to the maintainer of the Perl core Encode module to
> loosen the shackles a bit, eh?

Any conversion library should offer options for how to deal with errors. One way is to return an error, throw an exception, or equivalent. Another is to replace the offending sequence with some substitution character (usually U+FFFD when the target is a form of Unicode) and continue converting after that.

If the conversion libraries you are using do not support this (I don't know), then you could ask for such options. Or use conversion libraries that do support such options (like ICU and Java).

Best regards,
markus
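
[Editorial sketch, not part of the original message.] The two error-handling options described above (fail, or substitute U+FFFD and continue) can be illustrated with Python's built-in codecs. Python is used here only as a stand-in; this is not how ICU, Java, or Perl's Encode expose the choice. The helper name `utf16le_to_utf8` and its `on_error` parameter are hypothetical, and the input is assumed to be little-endian UTF-16 without a byte order mark.

```python
def utf16le_to_utf8(data, on_error="strict"):
    """Transcode UTF-16LE bytes to UTF-8 bytes.

    on_error is handed to bytes.decode as its `errors` argument:
    "strict" raises UnicodeDecodeError on malformed input, while
    "replace" substitutes U+FFFD and continues converting.
    """
    return data.decode("utf-16-le", errors=on_error).encode("utf-8")


# 'A', the unpaired surrogate 0xDB82, then 'B', as little-endian bytes.
bad = bytes([0x41, 0x00, 0x82, 0xDB, 0x42, 0x00])

# Option 1: treat the unpaired surrogate as an error.
try:
    utf16le_to_utf8(bad)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 2: substitute U+FFFD and keep converting.
repaired = utf16le_to_utf8(bad, on_error="replace")
```

With the strict setting the conversion stops at the unpaired surrogate; with the replace setting the text around it survives and only the offending code unit is lost.
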



    This archive was generated by hypermail 2.1.5 : Thu Nov 04 2010 - 18:11:12 CST