Re: Counting Codepoints

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 13 Oct 2015 12:17:43 +0200

2015-10-13 8:36 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> For
> example, a MSKLC keyboard will deliver a supplementary character in
> two WM_CHAR messages, one for the high surrogate and one for the low
> surrogate.
>
I have not tested the actual behavior in 64-bit versions of Windows: does
the 64-bit version of the API still require two WM_CHAR messages rather
than a single one, now that the message data field has been extended to
64 bits? In that case no surrogates would be returned, but the
supplementary character directly. But maybe this has not changed, so that
the predefined Windows type for wide characters remains 16-bit (otherwise
even in the 32-bit version of the API a single message would have been
enough, with a 32-bit message data field): the "Unicode" versions of the
API assume a 16-bit encoding of strings everywhere, and the event message
most probably uses the same size of code units.
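
For illustration, here is a minimal sketch of the application side,
assuming the usual behaviour where each WM_CHAR message carries a single
UTF-16 code unit in wParam; the window procedure and the static variable
used to buffer the high surrogate are simplifications of mine, not the
way any particular editor does it:

    #include <windows.h>

    /* Sketch: recombine a surrogate pair delivered as two WM_CHAR
       messages (one UTF-16 code unit per message in wParam). The pending
       high surrogate is kept in a static; a real application would store
       it per window or per input context. */
    static WCHAR g_pendingHigh = 0;

    LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        if (msg == WM_CHAR) {
            WCHAR cu = (WCHAR)wParam;
            if (cu >= 0xD800 && cu <= 0xDBFF) {   /* high surrogate */
                g_pendingHigh = cu;               /* wait for the low half */
                return 0;
            }
            if (cu >= 0xDC00 && cu <= 0xDFFF) {   /* low surrogate */
                if (g_pendingHigh) {
                    UINT32 cp = 0x10000
                        + ((UINT32)(g_pendingHigh - 0xD800) << 10)
                        + (cu - 0xDC00);
                    g_pendingHigh = 0;
                    /* insert the supplementary code point cp in the buffer */
                    return 0;
                }
                /* lone low surrogate: discard it (or beep) */
                return 0;
            }
            g_pendingHigh = 0;                    /* ordinary BMP character */
            /* insert cu in the buffer */
            return 0;
        }
        return DefWindowProc(hwnd, msg, wParam, lParam);
    }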

The actual behavior is also tricky, because the basic layouts built with
MSKLC will have their character data translated "transparently" to other
"OEM" encodings according to the current input code page of the console
(using one of the code page mapping tables installed separately): the
transcoder will also need to translate the 16-bit Unicode input from the
WM_CHAR messages into the 8-bit input stream used by the console, and this
translation will need to read both surrogates at once before sending any
output.
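
A rough sketch of the buffering such a transcoder would need (the code
page, the output handling and the function name are placeholders of mine,
not anything taken from the actual console code):

    #include <windows.h>
    #include <stdio.h>

    /* Sketch: code units arrive one at a time (as from WM_CHAR), but
       WideCharToMultiByte can only produce meaningful output once a
       surrogate pair is complete, so the high surrogate is buffered. */
    static void transcode_unit(WCHAR cu, UINT codepage)
    {
        static WCHAR pending = 0;            /* buffered high surrogate */
        WCHAR units[2];
        int   n = 0;
        char  out[16];
        int   bytes;

        if (cu >= 0xD800 && cu <= 0xDBFF) {  /* high surrogate: wait */
            pending = cu;
            return;
        }
        if (pending) {                       /* complete the pair */
            units[n++] = pending;
            pending = 0;
        }
        units[n++] = cu;                     /* a lone low surrogate will
                                                only yield a default char */

        bytes = WideCharToMultiByte(codepage, 0, units, n,
                                    out, sizeof out, NULL, NULL);
        if (bytes > 0)
            fwrite(out, 1, (size_t)bytes, stdout);  /* feed the console */
    }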

Also, I don't think this is specific to MSKLC drivers. A driver built with
any other tool will use the same message format (this includes not just
keyboard layouts, which actually contain no code but just a data
structure, but also input methods that use their own message loop to
process and filter input events and deliver their own translated
messages).

Anyway, those Windows drivers cannot actually know how the editing
application will finally process the two surrogates: if the application
does not detect surrogates properly and chooses to discard one but not the
other, the driver is not at fault; it is a bug of the application. Those
MSKLC drivers actually have no view of the input buffer; they process the
input on the fly (though a more advanced input driver with its own message
processing loop could send its own messages to query the application about
what is in its buffer, or to instruct it to perform some custom substring
replacements/edits and update its caret position or selection).

So in my view, this is not a bug of the layout drivers themselves, and not
even a bug of the Windows core API. The editing application (or the common
interface component) has to be prepared to process both surrogates as one
character, to discard the lone surrogates it may see (after alerting the
user with some beep), or to substitute some custom replacement. It is this
application or component that needs to manage its input buffer correctly.
If that buffer uses 16-bit code units, deleting one position in the buffer
(for example when pressing Backspace or Delete) without looking at what is
deleted, or making a text selection in the middle of a surrogate pair (and
then blindly replacing that selection), will generate those lone
surrogates in the input buffer.
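
As a sketch of what "looking at what is deleted" means for Backspace over
a 16-bit buffer (the buffer layout and the function are hypothetical, just
to show the check):

    #include <windows.h>
    #include <string.h>

    /* Sketch: Backspace over a UTF-16 (WCHAR) edit buffer. If the code
       unit just before the caret is a low surrogate preceded by a high
       surrogate, delete both code units so no lone surrogate is left. */
    static void backspace(WCHAR *buf, size_t *len, size_t *caret)
    {
        size_t n;
        WCHAR  last;

        if (*caret == 0)
            return;

        n = 1;                                   /* code units to remove */
        last = buf[*caret - 1];
        if (last >= 0xDC00 && last <= 0xDFFF &&  /* low surrogate...     */
            *caret >= 2 &&
            buf[*caret - 2] >= 0xD800 && buf[*caret - 2] <= 0xDBFF)
            n = 2;                               /* ...preceded by high  */

        memmove(buf + *caret - n, buf + *caret,
                (*len - *caret) * sizeof(WCHAR));
        *caret -= n;
        *len   -= n;
    }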

The same considerations also apply to Linux input drivers and GUI
components, which use 8-bit encodings including UTF-8 (this is more
difficult because the Linux kernel is blind to the encoding, which is
defined only in the user's input locale environment): the same havoc could
happen if the editing application breaks in the middle of a multibyte
UTF-8 sequence, and applications must also be ready to accept arbitrary
byte sequences, including those that are not valid UTF-8 (how an
application actually handles the offending bytes remains
application-dependent). The same question then arises: how many code
points are in an 8-bit string that is not valid UTF-8? There will not be a
unique answer, because how applications filter those errors will vary.
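
To make that ambiguity concrete, here is one possible counting policy
sketched in C (counting every offending byte as one replacement); a
different but equally legitimate policy over the same bytes, such as
replacing a whole truncated sequence with a single U+FFFD or dropping the
bad bytes, would report a different count:

    #include <stddef.h>
    #include <stdio.h>

    /* Sketch: count code points in a possibly invalid UTF-8 byte string,
       treating each offending byte as one replacement (U+FFFD). It does
       not check overlong forms or surrogate ranges; it only shows that
       the count depends on the error policy chosen. */
    static size_t count_codepoints(const unsigned char *s, size_t len)
    {
        size_t i = 0, count = 0;
        while (i < len) {
            unsigned char b = s[i];
            size_t need, k;
            if      (b < 0x80)           need = 0;  /* ASCII          */
            else if ((b & 0xE0) == 0xC0) need = 1;  /* 2-byte lead    */
            else if ((b & 0xF0) == 0xE0) need = 2;  /* 3-byte lead    */
            else if ((b & 0xF8) == 0xF0) need = 3;  /* 4-byte lead    */
            else { count++; i++; continue; }        /* stray byte     */

            /* are the continuation bytes really there? */
            for (k = 1; k <= need && i + k < len
                     && (s[i + k] & 0xC0) == 0x80; k++)
                ;
            if (k <= need) { count++; i++; continue; }  /* truncated  */

            count++;
            i += need + 1;
        }
        return count;
    }

    int main(void)
    {
        /* "e acute", a lone continuation byte, a truncated 3-byte lead */
        const unsigned char junk[] = { 0xC3, 0xA9, 0x80, 0xE2, 0x82 };
        printf("%zu code points under this policy\n",
               count_codepoints(junk, sizeof junk));
        return 0;
    }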

You'd also have the same problem with console apps using the 8-bit
BIOS/DOS input emulation API, or with terminal applications listening for
input from a network socket that sends 8-bit data streams (the emulation
protocol will also need to filter that input and detect errors when the
input does not match the expected encoding, but how the protocol recovers
after an error remains protocol-dependent, and it is not certain that the
terminal emulator notifies the user when there are input errors; the
protocol may as well interrupt the communication with an EOF event and
close the communication channel).

In other words: as soon as there is a single UTF validation error in some
input, you cannot assert any single code point count for the whole input
content.