Re: Counting Codepoints

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 12 Oct 2015 17:29:13 +0200

2015-10-12 14:42 GMT+02:00 Mark Davis ☕️ <mark_at_macchiato.com>:

> If these are not all aligned, then all heck breaks loose: you are letting
> yourself in for code breakage and/or security problems.
>
> So the corresponding code point count would just return a count of 1 for
> an isolated surrogate.
>

But the behavior in this case is not defined at all, and applications are
free to do what they want when they encounter one. There is not even any
guarantee that further (correctly encoded) code points will be returned:
even if a replacement character such as U+FFFD is returned, it could
replace all the rest of the input.

So a count of 1 is possible for the first isolated surrogate, but all the
rest could just as well count as 0, or all the following characters could
be replaced by U+FFFD independently of what they initially represented.
This would also be a "sanitized" result.

TUS gives applications freedom of choice. There is absolutely no guarantee
that all possible "sanitized" results will be the same across
applications, and TUS does not even mandate which replacement character to
use (not necessarily U+FFFD; it could just as well be an ASCII '?'
character or a C0 <SUB> or <DEL> control when the result is further
converted by an application to some legacy 7-bit or 8-bit charset).
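
To make the point concrete, here is a minimal Python sketch of an input
scanner with three possible "sanitizing" policies. The function name
`scan` and the policy names are hypothetical, invented for this
illustration; none of them comes from TUS or from any library. The UTF-16
input is modeled as a list of 16-bit code units.

```python
def scan(units, policy):
    """Hypothetical scanner: UTF-16 code units -> list of code points.

    policy (names are illustrative assumptions):
      "replace-each": every lone surrogate becomes U+FFFD
      "question":     every lone surrogate becomes ASCII '?'
      "truncate":     emit one U+FFFD at the first lone surrogate, then stop
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone surrogate: ill-formed input, the result depends on policy.
            if policy == "replace-each":
                out.append(0xFFFD)
            elif policy == "question":
                out.append(ord("?"))
            elif policy == "truncate":
                out.append(0xFFFD)
                break
            else:
                raise ValueError("unknown policy: " + policy)
            i += 1
        else:
            out.append(u)
            i += 1
    return out


# 'a', a lone high surrogate, 'b', 'c'
units = [0x61, 0xD800, 0x62, 0x63]
for policy in ("replace-each", "question", "truncate"):
    print(policy, [hex(cp) for cp in scan(units, policy)])
```

All three outputs are "sanitized", yet they differ in content, and the
last one differs in length as well.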

My opinion is that the only really safe result is not to return any count
of code points but to throw an error instead. Counting code points with a
function that returns an integer is only valid if the UTF-16 input is
actually a valid representation of code points. Otherwise you cannot
return a single integer: the application using that integer could allocate
a processing buffer of that size and expect to get exactly that number of
code points when reading the data into it, and could then leave some
positions in that buffer uninitialized. Or the application could assume
that the input was left untouched, and then get an unexpected digital
signature mismatch.
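
As a rough sketch of what I mean, assuming the UTF-16 input is available
as a list of 16-bit code units, a strict counting function could look like
this in Python. The names `IllFormedUTF16Error` and
`count_code_points_strict` are hypothetical, made up for this example.

```python
class IllFormedUTF16Error(ValueError):
    """Hypothetical error type: the input is not well-formed UTF-16."""


def count_code_points_strict(units):
    """Count code points, refusing to answer for ill-formed input."""
    count = 0
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
            else:
                raise IllFormedUTF16Error("lone high surrogate at index %d" % i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            raise IllFormedUTF16Error("lone low surrogate at index %d" % i)
        else:
            i += 1
        count += 1
    return count


print(count_code_points_strict([0x61, 0xD83D, 0xDE00]))  # 'a' + one pair
try:
    count_code_points_strict([0x61, 0xD800, 0x62])
except IllFormedUTF16Error as exc:
    print("error:", exc)
```

The caller never receives an integer that the later scanner might
contradict.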

If your function counting code points and returning an integer counts
those lone surrogates as 1, it assumes that exactly one code point will be
returned for each lone surrogate, and it should document that clearly,
meaning that the result is only valid if this matches the results of the
actual input scanner. In that case the function will never fail or throw
an exception. But between two implementations the result of the scanner
could still be different, because the replacement character is not
specified. If the resulting "sanitized" string is then used to generate a
URI, the URI is also unpredictable and will vary between implementations,
as will its effective length. If it is used to generate an identifier
granting some new access, such as a user name, several new user names
could be generated from the same input.
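
The URI case can be sketched in Python as follows. The helpers
`count_lenient` and `sanitize` are hypothetical names for this example;
only `urllib.parse.quote` is a real standard-library function. Two
scanners that both agree with the count of 1 per lone surrogate still
produce different URIs because they picked different replacement
characters.

```python
from urllib.parse import quote


def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def count_lenient(units):
    """Count code points, counting each lone surrogate as 1."""
    count = 0
    i = 0
    while i < len(units):
        if is_high(units[i]) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2  # well-formed pair
        else:
            i += 1  # BMP code unit or lone surrogate
        count += 1
    return count


def sanitize(units, replacement):
    """Decode code units, substituting `replacement` for lone surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            out.append(chr(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)))
            i += 2
        elif is_high(u) or is_low(u):
            out.append(replacement)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)


units = [0x61, 0xD800, 0x62]       # 'a', lone high surrogate, 'b'
a = sanitize(units, "\ufffd")      # one implementation's choice
b = sanitize(units, "?")           # another implementation's choice
print(count_lenient(units), len(a), len(b))  # the counts agree
print(quote(a), quote(b))                    # the URIs do not
```

Here the percent-encoded forms differ both in content and in length, even
though the code point counts match.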

So in all cases using replacements will also create security problems. This
will not happen if you return no result and throw an exception instead (the
counting function should document this exception so that it is not thrown
unexpectedly and left unhandled, causing the program to abort prematurely
in an unsafe state, for example losing other data or leaving a transaction
elsewhere in an incoherent state).

For all programs taking some standard UTF input, the input scanner or
processing functions MUST be prepared to handle the encoding error
exception, which is a result just as expected as the return of a value or
the execution of some code! Sanitization is possible, but it is not
described in the standard and there are several conflicting ways of doing
it; it should be a separate subprocess, documented separately.
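
On the caller's side this could look like the following Python sketch,
which relies on the strict default of the built-in "utf-16-le" codec. The
function names `handle_input` and `sanitize_utf16le` are hypothetical;
the separate sanitizing step simply uses the codec's standard "replace"
error handler as one possible policy among several.

```python
def handle_input(raw):
    """Treat the encoding error as an ordinary, expected outcome."""
    try:
        text = raw.decode("utf-16-le")  # strict: lone surrogates raise
    except UnicodeDecodeError as exc:
        return "rejected: ill-formed UTF-16 (%s)" % exc.reason
    # len() of a Python 3 str is its number of code points.
    return "accepted: %d code points" % len(text)


def sanitize_utf16le(raw):
    """Separate, explicitly requested step with its own documented policy:
    here, the built-in "replace" handler, which substitutes U+FFFD."""
    return raw.decode("utf-16-le", errors="replace")


good = "a\U0001F600".encode("utf-16-le")
bad = bytes([0x61, 0x00, 0x00, 0xD8, 0x62, 0x00])  # 'a', lone D800, 'b'

print(handle_input(good))
print(handle_input(bad))
print(ascii(sanitize_utf16le(bad)))
```

The point of keeping `sanitize_utf16le` apart is that nothing calls it
implicitly: the caller has to opt in and thereby accept its policy.
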
Received on Mon Oct 12 2015 - 10:30:43 CDT
