Re: Counting Codepoints

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 12 Oct 2015 01:08:23 +0200

Both statements are false.

The ill-formed sequence <0xDC00, 0xD800, 0xDC20> is invalid for UTF-16,
because it contains one invalid code unit (the unpaired surrogate 0xDC00),
followed by a single code point (U+10020).
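
For reference, the arithmetic that maps the surrogate pair <0xD800, 0xDC20>
to U+10020 can be sketched as follows. This is only an illustration in
Python; the function name decode_surrogate_pair is made up for this example.

```python
def decode_surrogate_pair(high, low):
    """Combine a high (lead) and a low (trail) surrogate into one code point."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError("not a high surrogate: 0x%04X" % high)
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a low surrogate: 0x%04X" % low)
    # 10 bits come from each surrogate, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print("U+%04X" % decode_surrogate_pair(0xD800, 0xDC20))  # U+10020
```

The leading 0xDC00 cannot take part in such a pair, because a low surrogate
is only meaningful immediately after a high surrogate.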

All 3 surrogate codepoints U+DC00, U+D800 and U+DC20 are NOT encoded (as
they are not representable in valid UTF-16).

The number of codepoints in a **valid** UTF-16 string is perfectly well
defined.

If the encoded string is not valid UTF-16, then the number of codepoints in
it is NOT defined: whether the invalid code units are dropped or replaced,
and how many replacement codepoints are produced, depends on the
implementation. An implementation can also consider the whole string
invalid and return no code points at all, or stop returning code points
after the first error encountered and drop all the rest, or substitute all
the rest with a single replacement character.

Only the number of 16-bit code units is defined (this number does not
depend on UTF-16 validity).
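
To make the variation concrete, here is a minimal Python sketch that counts
code points in a list of 16-bit code units under several error-handling
policies. The function count_code_points and the policy names are
hypothetical, chosen only to mirror the behaviours listed above; no
particular library is claimed to work this way.

```python
POLICIES = ("replace", "drop", "stop", "replace_rest", "reject")

def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def count_code_points(units, policy="replace"):
    """Count the code points a decoder would return for 16-bit code units.

    Hypothetical policies for an unpaired surrogate:
      "replace":      it becomes one replacement character
      "drop":         it is silently skipped
      "stop":         decoding ends, the rest is dropped
      "replace_rest": the rest becomes one replacement character
      "reject":       the whole string is invalid, nothing is returned
    """
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    count = 0
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            count += 1      # well-formed surrogate pair: one code point
            i += 2
        elif is_high(u) or is_low(u):
            # Unpaired surrogate: the string is ill-formed from here on.
            if policy == "reject":
                return 0
            if policy == "stop":
                return count
            if policy == "replace_rest":
                return count + 1
            if policy == "replace":
                count += 1
            i += 1          # "drop" just skips the code unit
        else:
            count += 1      # ordinary BMP code unit: one code point
            i += 1
    return count

units = [0xDC00, 0xD800, 0xDC20]
for policy in POLICIES:
    print(policy, count_code_points(units, policy))
print("code units:", len(units))
```

For <0xDC00, 0xD800, 0xDC20> this gives 2, 1, 0, 1 and 0 code points
depending on the policy, while the number of code units is always 3.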

2015-10-11 23:20 GMT+02:00 Richard Wordingham <
richard.wordingham_at_ntlworld.com>:

> Is the number of codepoints in a UTF-16 string well defined?
>
> For example, which of the following two statements are true?
>
> (a) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
> 0xD800, 0xDC20> contains two codepoints, U+DC00 and U+10020.
>
> (b) The ill-formed three code-unit Unicode 16-bit string <0xDC00,
> 0xD800, 0xDC20> contains three codepoints, U+DC00, U+D800 and U+DC20.
>
> Statement (a) is probably more useful, but I couldn't find anything to
> rule that statement (b) is false.
>
> Richard.
>
>
Received on Sun Oct 11 2015 - 18:09:45 CDT
