Best practices for replacing UTF-8 overlongs
markus.icu at gmail.com
Tue Dec 20 12:33:50 CST 2016
On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler <kenwhistler at att.net> wrote:
> You found the resulting text in TUS 9.0, p. 126 - 129. The origin of the
> text there about best practices for using U+FFFD was the discussion and
> resolution of PRI #121 in August, 2008:
Yes. However, some of the discussion in this thread is due to details that
the PRI did not spell out. There are basically two variants, a 2a and a 2b,
and the examples in PRI #121 work the same in both.
2a. As Richard said, "The natural logic is to read the requisite number of
continuation bytes, converting the whole to a codepoint value, and then
check that the codepoint value is allowed in UTF-8. Obviously one also has
to check that the requisite continuation bytes are present."
This naturally treats overlong sequences, surrogate-code-point sequences,
and 5/6-byte sequences (and prefixes thereof) as single errors.
(I suppose that the handling of lead bytes above F4 could be somewhat debatable.)
(This is what ICU does for UTF-8.)
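
To make 2a concrete, here is a minimal Python sketch of that logic. It is an
illustration only, not ICU's code: the name decode_2a is hypothetical, and
treating F5..FD as lead bytes of 4/5/6-byte sequences is an assumption (the
debatable case mentioned above).

```python
REPLACEMENT = "\ufffd"

def decode_2a(data: bytes) -> str:
    """Hypothetical 2a-style decoder: pick the length from the lead byte,
    consume the continuation bytes that are present, then validate the
    resulting value. Each bad sequence (or prefix) yields one U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        # (continuation bytes needed, payload bits of lead, smallest legal value)
        if 0xC0 <= lead <= 0xDF:
            need, cp, min_cp = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            need, cp, min_cp = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:
            need, cp, min_cp = 3, lead & 0x07, 0x10000
        elif 0xF8 <= lead <= 0xFB:           # old 5-byte form (assumption)
            need, cp, min_cp = 4, lead & 0x03, 0x200000
        elif 0xFC <= lead <= 0xFD:           # old 6-byte form (assumption)
            need, cp, min_cp = 5, lead & 0x01, 0x4000000
        else:                                # stray 80..BF, or FE/FF
            out.append(REPLACEMENT)
            i += 1
            continue
        i += 1
        got = 0
        # Read continuation bytes until we have enough or hit a non-continuation.
        while got < need and i < n and 0x80 <= data[i] <= 0xBF:
            cp = (cp << 6) | (data[i] & 0x3F)
            i += 1
            got += 1
        valid = (
            got == need                      # all continuation bytes present
            and min_cp <= cp <= 0x10FFFF     # not overlong, not out of range
            and not 0xD800 <= cp <= 0xDFFF   # not a surrogate code point
        )
        out.append(chr(cp) if valid else REPLACEMENT)
    return "".join(out)
```

With this sketch, the overlong C0 AF, the surrogate ED A0 80, and the 5-byte
F8 88 80 80 80 each come out as a single U+FFFD.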
2b. The text in the standard represents the workings of a state machine
that walks strictly valid sequences. Overlong/surrogate/etc. sequences
become multiple errors.
(This is what ICU converters do for multi-byte charsets like Shift-JIS and
In my opinion, 2a. "feels right" for UTF-8, because of the history and
mechanics of the encoding, and 2b. is a good fit for MBCS where concepts
like overlong sequences don't exist. (And for GB 18030 you do have to walk
a validity state machine; you can't just look at the lead byte.)
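
For comparison, a minimal Python sketch of the 2b state-machine behavior for
UTF-8. Again an illustration only: decode_2b and _trail_spec are hypothetical
names, and the byte ranges are the well-formed UTF-8 ranges from the standard.
Decoding stops as soon as a byte cannot continue a valid sequence, so each
byte of an overlong or surrogate sequence becomes its own error.

```python
REPLACEMENT = "\ufffd"

def _trail_spec(lead: int):
    """Return (trail byte count, low, high) where low..high is the range
    allowed for the FIRST trail byte; later trail bytes are 80..BF.
    Returns None if the byte can never start a well-formed sequence."""
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF    # excludes 3-byte overlongs
    if lead == 0xED:
        return 2, 0x80, 0x9F    # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF    # excludes 4-byte overlongs
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    if lead == 0xF4:
        return 3, 0x80, 0x8F    # excludes values above 10FFFF
    return None                 # 80..C1 and F5..FF

def decode_2b(data: bytes) -> str:
    """Hypothetical 2b-style decoder: walk only strictly valid sequences."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        spec = _trail_spec(lead)
        if spec is None:                     # cannot start a sequence
            out.append(REPLACEMENT)
            i += 1
            continue
        need, lo, hi = spec
        j, got = i + 1, 0
        while got < need and j < n and lo <= data[j] <= hi:
            j += 1
            got += 1
            lo, hi = 0x80, 0xBF              # remaining trail bytes
        if got == need:
            out.append(data[i:j].decode("utf-8"))   # known well-formed
        else:
            out.append(REPLACEMENT)          # one error for the valid prefix
        i = j
    return "".join(out)
```

Here C0 AF yields two U+FFFD, ED A0 80 yields three, and F8 88 80 80 80
yields five, while a truncated but otherwise valid prefix such as E2 82
still yields one.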