Re: Surrogates and noncharacters (was: Re: Ways to detect that XXXX...) from Daniel Bünzli on 2015-05-08 (Unicode Mail List Archive)

From: Daniel Bünzli <daniel.buenzli_at_erratique.ch>
Date: Sat, 9 May 2015 02:26:59 +0200

On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> Noncharacters are Unicode scalar values,

Noncharacters are Unicode scalar values, by definitions D14 and D76.

> while unpaired surrogates are not.

No surrogate code point is a Unicode scalar value, by D71, D73 and D76.
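
To make the three categories concrete, here is a minimal Python sketch. The function names are made up for this illustration and are not taken from the standard; the ranges are the usual ones (surrogates are U+D800..U+DFFF; noncharacters are U+FDD0..U+FDEF plus the last two code points of each plane).

```python
# Hypothetical helper names; only the numeric ranges come from the standard.

def is_code_point(cp: int) -> bool:
    # The Unicode codespace is 0..0x10FFFF.
    return 0x0000 <= cp <= 0x10FFFF

def is_surrogate(cp: int) -> bool:
    # High surrogates D800..DBFF and low surrogates DC00..DFFF.
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value(cp: int) -> bool:
    # A scalar value is any code point that is not a surrogate.
    return is_code_point(cp) and not is_surrogate(cp)

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return is_code_point(cp) and (
        0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    )

# A noncharacter is a scalar value; a surrogate is not.
print(is_scalar_value(0xFFFE), is_noncharacter(0xFFFE))
print(is_scalar_value(0xD800), is_surrogate(0xD800))
```

With these definitions every noncharacter satisfies is_scalar_value, and no surrogate does, which is the distinction drawn above.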

> This means noncharacters may appear in a well-formed UTF-8, -16, or
> -32 string,

I take "appear" to mean "be encoded". Yes, every Unicode encoding form allows all scalar values to be interchanged, by D79.

(However, noncharacters are not designed to be openly interchanged; see "Restricted interchange" on p. 31 of Unicode 7.0.0.)
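
As a rough check of this point, the sketch below round-trips a few noncharacters through Python's codecs, which are used here only as a stand-in for the three encoding forms. The helper name roundtrips is made up for the example, and the explicit big-endian codecs are chosen so that no byte order mark is added or interpreted.

```python
# Assumption: Python's "utf-8", "utf-16-be" and "utf-32-be" codecs stand in
# for the UTF-8, UTF-16 and UTF-32 encoding forms.
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")

def roundtrips(cp: int) -> bool:
    # Encode the single code point and decode it again in each encoding.
    s = chr(cp)
    return all(s.encode(enc).decode(enc) == s for enc in ENCODINGS)

# Noncharacters are scalar values, so each of these encodes and decodes.
for cp in (0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF):
    print(f"U+{cp:04X}", roundtrips(cp))
```

That the bytes are well-formed says nothing about whether the receiving side wants them, which is the "restricted interchange" caveat.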

> while unpaired surrogates may not.

No surrogate code point, *paired or not*, can be encoded in UTF-{8,16,32}, by D92, D91 and D90. All these encoding forms, by definition, assign only Unicode scalar values to code unit sequences (see also the already mentioned p. 31, which clarifies this).

However, UTF-16 code unit sequences may contain surrogate pairs (which, taken together, represent a single Unicode scalar value).
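
The difference between a surrogate code point and a surrogate pair can be sketched in Python as follows. Both function names are made up for this illustration, and Python's codecs (3.4 or later, where the UTF-16 and UTF-32 encoders also reject surrogates) are assumed as a stand-in for the encoding forms.

```python
def encodable(cp: int, enc: str) -> bool:
    # True if the codec accepts this single code point.
    try:
        chr(cp).encode(enc)
        return True
    except UnicodeEncodeError:
        return False

def utf16_code_units(cp: int) -> list:
    # UTF-16 code units for one Unicode scalar value.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]  # BMP scalar value: a single code unit
    v = cp - 0x10000  # 20 bits, split into two groups of 10
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]

# A lone surrogate code point is rejected by all three codecs.
print([encodable(0xD800, e) for e in ("utf-8", "utf-16-be", "utf-32-be")])

# A supplementary scalar value is carried by a surrogate pair of code units.
print([hex(u) for u in utf16_code_units(0x1F600)])
```

The pair produced for U+1F600 consists of two code units in the surrogate range, but what is being encoded is the scalar value U+1F600, not two surrogate code points.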

> They may both be part of a "Unicode string" which does not claim to be in any given encoding
> form.

I'm not sure what you mean by that, so I'll let someone else answer.

Best,

Daniel
Received on Fri May 08 2015 - 19:29:16 CDT