Re: Deleting Lone Surrogates

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 4 Oct 2015 20:53:25 +0200

The default behavior for unassigned characters is to treat them like base
characters, so if one is followed by a combining mark, the pair forms a
default grapheme cluster, which is not appropriate here.

Surrogates are not characters (so they cannot have any character
properties), but they are assigned and so don't have "default" properties
(which are only meant for *unassigned* code points).
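As a rough illustration of the distinction, here is a minimal Python sketch
that asks the standard unicodedata module what it reports for a lone
surrogate and for the unassigned code point U+50005 mentioned in the quoted
message. The helper name describe() is hypothetical, and the values depend
on the Unicode version bundled with the interpreter.

```python
import unicodedata

# A lone high surrogate and an unassigned code point (U+50005, plane 5).
LONE_SURROGATE = "\ud800"
UNASSIGNED = "\U00050005"

def describe(ch):
    """Hypothetical helper: (General_Category, canonical combining class)."""
    return (unicodedata.category(ch), unicodedata.combining(ch))

surrogate_props = describe(LONE_SURROGATE)   # category Cs (surrogate)
unassigned_props = describe(UNASSIGNED)      # category Cn (unassigned)
```

The two code points land in different general categories, which is why
"treat it like an unassigned character" is a processing choice rather than
something the data forces.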

I still think that it is safer to treat them (for text segmentation
purposes) as pure isolates, i.e. exactly like basic controls such as
U+0000 NUL, or like U+FFFD REPLACEMENT CHARACTER, which is typically used
as a visible placeholder for various errors.
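To make the difference concrete, here is a deliberately simplified Python
sketch of the "isolate" idea. It is not the UAX #29 algorithm: it only
attaches combining marks (general category M*) to the preceding code point,
and refuses to do so when that code point is a control (Cc) or a surrogate
(Cs). The function names are hypothetical, and CR LF, Hangul, ZWJ sequences
and so on are ignored.

```python
import unicodedata

def is_isolate(ch):
    # Assumption for this sketch: controls and surrogates never join
    # with what follows them.
    return unicodedata.category(ch) in ("Cc", "Cs")

def is_mark(ch):
    # Mn, Mc and Me all start with "M".
    return unicodedata.category(ch).startswith("M")

def simple_clusters(text):
    """Split text into clusters: base + following marks, isolates alone."""
    clusters = []
    for ch in text:
        if clusters and is_mark(ch) and not is_isolate(clusters[-1][-1]):
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

after_base = simple_clusters("a\u0301")
after_unassigned = simple_clusters("\U00050005\u0301")
after_surrogate = simple_clusters("\ud800\u0301")
```

Under these assumptions the combining acute stays attached to "a" and to
the unassigned U+50005 (one cluster each), but is split off from the lone
surrogate (two clusters), which is the behavior argued for above.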

For normalisation purposes they should also have combining class 0 (i.e.
act as blockers against reordering for canonical equivalence), and should
not be treated as "transparent" (discarded and bypassed as if those
surrogates were not present at all).
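The blocking effect can be observed with Python's unicodedata module, which
accepts strings containing lone surrogates. This is only a sketch of how one
implementation behaves, not a statement about what the standard requires.
The two marks are U+0301 (combining class 230) and U+0323 (combining class
220); the variable names are illustrative.

```python
import unicodedata

ACUTE = "\u0301"        # COMBINING ACUTE ACCENT, ccc 230
DOT_BELOW = "\u0323"    # COMBINING DOT BELOW, ccc 220
LONE_SURROGATE = "\ud800"

# Marks directly adjacent: canonical ordering sorts them by combining class.
adjacent = "a" + ACUTE + DOT_BELOW
# Same marks with a lone surrogate (ccc 0) in between.
blocked = "a" + ACUTE + LONE_SURROGATE + DOT_BELOW

reordered = unicodedata.normalize("NFD", adjacent)
unchanged = unicodedata.normalize("NFD", blocked)
```

In the first string the dot below moves in front of the acute; in the second
the surrogate sits between them with class 0, so nothing is reordered across
it. A "transparent" treatment would instead have produced the same
reordering in both cases.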

2015-10-04 19:50 GMT+02:00 Markus Scherer <markus.icu_at_gmail.com>:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus
>
Received on Sun Oct 04 2015 - 13:54:51 CDT
