Unpaired surrogates (was: Re: Why Work at Encoding Level?)

From: Doug Ewell <doug_at_ewellic.org>
Date: Mon, 19 Oct 2015 13:32:07 -0700

Richard Wordingham wrote:

>> This discussion was originally about how to handle unpaired
>> surrogates, as if that were a normal use case.
>
> And the subject line was changed when the topic changed to
> traversing strings.

Granted. I've changed it again to reflect this specific issue.

> How about, 'The specification says that one must pass the number of
> _characters_ in the string.'? Even worse, some specifications talk of
> 'Unicode characters' when they mean UTF-16 code units. The word
> 'codepoint' is even worse, as a supplementary plane codepoint is
> represented by two BMP codepoints.

None of this lets any implementer or implementation off the hook. The
Unicode Standard (TUS) is very clear that an unpaired surrogate is not
to be interpreted in any way, and particularly not to be treated as an
abstract character. See, for example, conformance clause C1 and
definition D75.

> ICU (but perhaps it's actually Java) seems to have a culture of
> tolerating lone surrogates, and rules for handling lone surrogates are
> strewn across the Unicode standards and annexes.

I suspect you have an example. I'd be curious what any of them has to
say that does not equate to "this is an anomalous situation and
represents broken and ill-formed text."

Applications that treat unpaired surrogates as well-formed text do not
change the rules; they are in violation of the rules.

> It was once the
> case that basic Unicode support in regular expressions required a
> regular expression engine to be able to search for specified lone
> surrogates - a real show-stopper for an engine working in UTF-8.
> The Unicode collation algorithm conformance test once tested that
> implementations of collation collated lone surrogates correctly.
> Raising an exception was an automatic test failure! By contrast,
> no-one's proposed collation rules for broken bits of UTF-8 characters
> or non-minimal length forms.

Are these tests still included, or did someone notice that they were in
conflict with the standard and remove them?

>> That is like having an image editor that deletes every
>> 128th byte from a JPEG file, and then worrying about how to display
>> the file.
>
> 1. Of course, telemetry streams may very well contain damaged JPEG
> images!

Of course. But are they conformant to the JPEG standard? Is there a
standard way to repair and display them?

> 2. The problem with the bad handling of supplementary characters that
> seems to be associated with UTF-16 is that the damage is rarely as
> obvious as every 128th code unit. By contrast, bad UTF-8 handling
> usually comes to light as soon as the text processing moves beyond ASCII.

Of course. I could have said "deletes random bytes from a JPEG file."

An unpaired surrogate can be detected either immediately, or immediately
after the next code unit. In neither case is it to be interpreted as
anything other than invalid text.
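
To make the detection point concrete, here is a minimal sketch in
Python, assuming the input is a plain list of 16-bit integers. The
function name find_unpaired_surrogates is made up for illustration and
does not come from any library. A low surrogate with no preceding high
surrogate is caught immediately; a high surrogate is caught only after
looking at the next code unit, or at the end of input.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a list of UTF-16 code units."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: valid only if the next unit is a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)  # detected after inspecting the next unit (or end of input)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate not consumed as part of a pair: detected immediately.
            bad.append(i)
        i += 1
    return bad


if __name__ == "__main__":
    # "A", U+1F600 as a proper pair, "B": nothing to report.
    print(find_unpaired_surrogates([0x0041, 0xD83D, 0xDE00, 0x0042]))
    # A lone high surrogate between "A" and "B": reported at index 1.
    print(find_unpaired_surrogates([0x0041, 0xD800, 0x0042]))
```

What the caller does with the reported positions (reject the text,
substitute U+FFFD, and so on) is a separate policy decision; the sketch
only locates the ill-formed code units.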

Philippe Verdy wrote:

> No ! The "supplementary code points" (or "supplementary characters"
> when they are assigned to characters) are represented in UTF-16 as two
> **code units**, NOT as two "code points" (even if their binary value
> are related).

Surrogate values are not abstract characters, but they are code points
(D10). Note that Surrogate is one of the seven types of code points
(D10a).
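
The code point versus code unit distinction can be shown with a small
sketch, again in Python. The helper name to_utf16_code_units is
hypothetical. It encodes one code point into UTF-16 code units; a
supplementary code point comes out as two code units whose values
happen to fall in the surrogate range, and a surrogate code point is
rejected as input because it has no well-formed encoding.

```python
def to_utf16_code_units(cp):
    """Encode a single code point as a list of UTF-16 code units."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a code point: %r" % (cp,))
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are code points, but they cannot be encoded on their own.
        raise ValueError("surrogate code point U+%04X is not encodable" % cp)
    if cp < 0x10000:
        return [cp]  # BMP code point: one code unit
    v = cp - 0x10000  # 20-bit value split across the pair
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


if __name__ == "__main__":
    # One supplementary code point, two code units.
    print([hex(u) for u in to_utf16_code_units(0x1F600)])
```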

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
Received on Mon Oct 19 2015 - 15:33:21 CDT