From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Nov 04 2010 - 11:46:16 CST
On Thu, Nov 4, 2010 at 7:20 AM, Doug Ewell <doug@ewellic.org> wrote:
> It may be that broken UTF-16 text doesn't appear that often in the real
> world. Certainly it's a test case that should be detected and handled
> (and I always do so when rolling my own transcoders), but perhaps not
> many people besides you have actually been bitten such that they needed
> such a tool.
>
16-bit Unicode is convenient in that when you find an unpaired surrogate
(that is, the text is not well-formed UTF-16) you can usually just treat it as
a surrogate code point, which normally has default properties much like an
unassigned code point or noncharacter. It case-maps to itself, normalizes to
itself, has default Unicode property values (except for the general
category), etc.
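
As a quick illustration of those defaults, here is a minimal sketch in Python, whose strings can hold a lone surrogate code point. The helper name surrogate_properties is made up for this example; the calls themselves are from the standard unicodedata module and the built-in str methods.

```python
import unicodedata

def surrogate_properties(ch):
    """Report a few properties of a single code point (hypothetical helper)."""
    return {
        # General category is the one property that is not a plain default.
        "category": unicodedata.category(ch),
        # Case mapping leaves the code point unchanged.
        "upper_is_self": ch.upper() == ch,
        "lower_is_self": ch.lower() == ch,
        # Normalization leaves the code point unchanged.
        "nfc_is_self": unicodedata.normalize("NFC", ch) == ch,
        "nfkd_is_self": unicodedata.normalize("NFKD", ch) == ch,
    }

if __name__ == "__main__":
    # U+D800 is an unpaired lead surrogate, held here as a code point.
    print(surrogate_properties("\ud800"))
```

The general category comes back as Cs (surrogate), which is the exception mentioned above; the other checks should all hold.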
In other words, when you process 16-bit Unicode text it takes no effort to
handle unpaired surrogates, other than making sure that you only assemble a
supplementary code point when a lead surrogate is really followed by a trail
surrogate. Hence little need for cleanup functions -- but if you need one,
it's trivial to write one for UTF-16.
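
A minimal sketch of both points, assuming the text is held as a list of 16-bit code unit integers. The function names (code_points, clean_utf16) are made up for this example, and replacing unpaired surrogates with U+FFFD is just one possible cleanup policy, not something prescribed above.

```python
def is_lead(u):
    return 0xD800 <= u <= 0xDBFF

def is_trail(u):
    return 0xDC00 <= u <= 0xDFFF

def code_points(units):
    """Iterate code points over 16-bit code units.

    A supplementary code point is assembled only when a lead surrogate
    is really followed by a trail surrogate; any other surrogate is
    passed through as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        i += 1
        if is_lead(u) and i < n and is_trail(units[i]):
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        else:
            yield u

def clean_utf16(units, replacement=0xFFFD):
    """Return well-formed UTF-16 code units (assumed policy: substitute U+FFFD)."""
    out = []
    for cp in code_points(units):
        if 0xD800 <= cp <= 0xDFFF:
            out.append(replacement)            # unpaired surrogate
        elif cp >= 0x10000:
            cp -= 0x10000                      # re-encode as a surrogate pair
            out.append(0xD800 + (cp >> 10))
            out.append(0xDC00 + (cp & 0x3FF))
        else:
            out.append(cp)
    return out

if __name__ == "__main__":
    text = [0x41, 0xD83D, 0xDE00, 0xD800, 0x42, 0xDC00]
    print([hex(cp) for cp in code_points(text)])
    print([hex(u) for u in clean_utf16(text)])
```

The pair-assembly check in code_points is the only place that needs any care; the cleanup function is just a filter on top of it.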
markus