Re: Why Work at Encoding Level?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 19 Oct 2015 21:35:16 +0200

2015-10-19 20:53 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> On Mon, 19 Oct 2015 10:07:31 -0700
> "Doug Ewell" <doug_at_ewellic.org> wrote:
>
> > This discussion was originally about how to handle unpaired
> > surrogates, as if that were a normal use case.
>
> And the subject line was changed when the topic changed to traversing
> strings.
>
> > Regardless of what encoding model is used to handle characters under
> > the hood, and regardless of how the Delete key should work with actual
> > characters or clusters, there is never any excuse for software to
> > create unpaired surrogates, or any other sort of invalid code unit
> > sequences.
>
> The word
> 'codepoint' is even worse, as a supplementary plane codepoint is
> represented by two BMP codepoints.
>

No! The "supplementary code points" (or "supplementary characters" when
they are assigned to characters) are represented in UTF-16 as two **code
units**, NOT as two "code points" (even if their binary values are related).
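
To make the distinction concrete, here is a minimal Python sketch. The helper
name utf16_code_units is only an illustration (it is not from any library),
and U+1F600 is an arbitrary supplementary code point chosen as an example. It
shows one code point being encoded as two UTF-16 code units:

```python
import struct


def utf16_code_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units that encode `text`.

    Hypothetical helper for illustration: it encodes to big-endian UTF-16
    (no BOM) and reads the bytes back two at a time.
    """
    data = text.encode("utf-16-be")
    return [unit for (unit,) in struct.iter_unpack(">H", data)]


s = "\U0001F600"                 # one supplementary code point
units = utf16_code_units(s)

print(len(s))                    # number of code points in the string
print([hex(u) for u in units])   # the UTF-16 code units that encode it
```

The string holds a single code point, while its UTF-16 form holds two code
units (0xD83D and 0xDE00). Those two values are code units, not two code
points of the string.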

The code points in the range U+D800..U+DFFF are NEVER characters: they are
permanently reserved so that they can never be assigned to any character.
These code points are thus assigned, but not to characters (otherwise those
characters would not be representable in valid UTF-16). These code points
also do not have any scalar value: there are no valid scalar values in the
range 0xD800..0xDFFF (the valid scalar values lie in two ranges of integers,
separated by this hole).
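
The two ranges can be written down as a short Python sketch. The function
names below are assumptions made for the illustration, not an existing API;
the numeric bounds are those of the Unicode code space (0..0x10FFFF) and of
the surrogate block (0xD800..0xDFFF):

```python
def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode code space 0..0x10FFFF."""
    return 0 <= n <= 0x10FFFF


def is_surrogate(n: int) -> bool:
    """True if n is a surrogate code point, U+D800..U+DFFF."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """True if n is a scalar value: a code point that is not a surrogate.

    Equivalent to n being in 0..0xD7FF or in 0xE000..0x10FFFF.
    """
    return is_code_point(n) and not is_surrogate(n)


for n in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000):
    print(hex(n), is_code_point(n), is_scalar_value(n))
```

With these definitions, 0xD7FF and 0xE000 are scalar values, 0xD800 and
0xDFFF are code points without a scalar value, and 0x110000 is not a code
point at all.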

So please don't mix "code points" and "code units" !
Received on Mon Oct 19 2015 - 14:37:20 CDT
