On Sat, 9 May 2015 02:26:59 +0200
Daniel Bünzli <daniel.buenzli_at_erratique.ch> wrote:
> On Saturday, 9 May 2015 at 00:37, Doug Ewell wrote:
> > Noncharacters are Unicode scalar values,
> (However, noncharacters are not designed to be openly interchanged; see
> "Restricted interchange" on p. 31 of 7.0.0.)
That didn't stop their being openly interchanged.
> > They may both be part of a "Unicode string" which does not claim to
> > be in any given encoding form.
> Not sure what you mean by that, so I'll let someone else answer.
There are a number of phrases whose declared meanings cannot be
deduced from the individual words. A UTF-8, UTF-16 or UTF-32 string
defines a sequence of scalar values. However, a Unicode 8-bit, 16-bit
or 32-bit string is merely a sequence of 8-bit, 16-bit or 32-bit
values that may occur in a UTF-8, UTF-16 or UTF-32 string
respectively. This definition has some odd consequences:
A Unicode 32-bit string is a UTF-32 string, for UTF-32 is not a
multi-word encoding. An arbitrary string of unsigned 32-bit values is
not in general a Unicode 32-bit string.
All strings of unsigned 16-bit values are Unicode 16-bit strings. Not
all (Unicode) 16-bit strings are UTF-16 strings.
Not all strings of unsigned 8-bit values are Unicode 8-bit strings, and
not all Unicode 8-bit strings are UTF-8 strings.
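To make the distinctions above concrete, here is a rough Python sketch of the categories under the reading given in this message (a "Unicode N-bit string" contains only values that can occur somewhere in well-formed UTF-N). The function names are illustrative and not taken from the standard or any library. No checker is given for Unicode 16-bit strings, since under this reading every sequence of 16-bit values qualifies.

```python
# Bytes that never occur in well-formed UTF-8: 0xC0, 0xC1 and 0xF5..0xFF.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def is_unicode_8bit(data: bytes) -> bool:
    """Every byte is one that can appear in some UTF-8 string."""
    return not any(b in NEVER_IN_UTF8 for b in data)


def is_utf8(data: bytes) -> bool:
    """The byte sequence as a whole is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding rejects ill-formed input
    except UnicodeDecodeError:
        return False
    return True


def is_utf16(units) -> bool:
    """16-bit values in which every surrogate is correctly paired."""
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one next
            low = next(it, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                return False
        elif 0xDC00 <= u <= 0xDFFF:  # unpaired low surrogate
            return False
    return True


def is_unicode_32bit(units) -> bool:
    """Every value is a scalar value, so this is also the UTF-32 check."""
    return all(
        0 <= u <= 0x10FFFF and not 0xD800 <= u <= 0xDFFF for u in units
    )
```

For instance, the single byte 0x80 passes is_unicode_8bit (it can occur as a continuation byte) but fails is_utf8, while the lone value 0xD800 is a perfectly good 16-bit string yet fails both is_utf16 and is_unicode_32bit.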
I can't think of a practical use for the specific concepts of Unicode
8-bit, 16-bit and 32-bit strings. Unicode 16-bit strings are
essentially the same as 16-bit strings, and Unicode 32-bit strings are
UTF-32 strings. 'Unicode 8-bit string' strikes me as an exercise in
pedantry; there are more useful categories of 8-bit strings that are
not UTF-8 strings.
Richard.
Received on Fri May 08 2015 - 22:15:23 CDT