Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Nov 04 2010 - 11:46:16 CST

Next message: Jim Monty: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Mark Davis ☕: "Re: inquiry about collation testing"
In reply to: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Jim Monty: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Jim Monty: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell <doug@ewellic.org> wrote:

> It may be that broken UTF-16 text doesn't appear that often in the real
> world. Certainly it's a test case that should be detected and handled
> (and I always do so when rolling my own transcoders), but perhaps not
> many people besides you have actually been bitten such that they needed
> such a tool.
>

16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, the text is not well-formed UTF-16) you can usually just treat it
as a surrogate code point, which normally has default properties much like an
unassigned code point or noncharacter. It case-maps to itself, normalizes to
itself, has default Unicode property values (except for the general
category, which is Cs rather than Cn), etc.
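
As a rough illustration of those default properties, here is a small Python
check. It assumes Python's str type, which stores code points rather than
16-bit units but does allow a lone surrogate, and the standard unicodedata
module; the variable names are made up for this sketch.

```python
import unicodedata

# A str holding one unpaired lead surrogate (U+D800) and, for comparison,
# a noncharacter (U+FFFF).
lone = "\ud800"
nonchar = "\uffff"

# General category is the one property that differs from the default.
lone_category = unicodedata.category(lone)        # surrogate
nonchar_category = unicodedata.category(nonchar)  # unassigned / noncharacter

# Case mapping and normalization leave the surrogate code point untouched.
case_stable = lone.lower() == lone and lone.upper() == lone
nfc_stable = unicodedata.normalize("NFC", lone) == lone
nfd_stable = unicodedata.normalize("NFD", lone) == lone
```

Note that the sketch avoids printing the string itself, since encoding a lone
surrogate to UTF-8 for output would fail.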

In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by a trail
surrogate. Hence little need for cleanup functions -- but if you need one,
it's trivial to write one for UTF-16.
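
A minimal sketch of both ideas in Python, assuming the text is held as a list
of 16-bit code unit integers. The function names and the choice of U+FFFD as
the replacement are assumptions made for illustration, not something
prescribed here.

```python
from typing import Iterator, List


def is_lead(unit: int) -> bool:
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit: int) -> bool:
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def code_points(units: List[int]) -> Iterator[int]:
    """Yield code points from 16-bit units.

    A supplementary code point is assembled only when a lead surrogate is
    really followed by a trail surrogate; any other surrogate is passed
    through unchanged as a surrogate code point.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


def repair(units: List[int], replacement: int = 0xFFFD) -> List[int]:
    """Return well-formed UTF-16 units, replacing each unpaired surrogate."""
    repaired: List[int] = []
    for cp in code_points(units):
        if 0xD800 <= cp <= 0xDFFF:
            # Unpaired surrogate: substitute the replacement character.
            repaired.append(replacement)
        elif cp >= 0x10000:
            # Re-encode the supplementary code point as a surrogate pair.
            offset = cp - 0x10000
            repaired.append(0xD800 + (offset >> 10))
            repaired.append(0xDC00 + (offset & 0x3FF))
        else:
            repaired.append(cp)
    return repaired
```

For example, the units 0x41, 0xD83D, 0xDE00, 0xD800, 0x42 decode to the code
points U+0041, U+1F600, U+D800, U+0042, and repair() turns the lone 0xD800
into 0xFFFD while leaving the valid pair alone.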

markus

