Re: Best practices for replacing UTF-8 overlongs

From: Doug Ewell <doug_at_ewellic.org>
Date: Mon, 19 Dec 2016 16:52:36 -0700

Karl Williamson wrote:

> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,

This is overstated at best. Decoders weren't required to detect overlong
sequences until 2000, but it was never legal to generate them. This was
stated explicitly in RFC 2279 and in Unicode 1.1, Appendix F. Correct
use of the instructions and table in RFC 2044 also precluded the
creation of overlong sequences.
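
To make the generate/detect distinction concrete, here is a minimal
Python sketch. It is not taken from any of the RFCs or from the Unicode
Standard. The names encode_shortest, is_overlong and MIN_FOR_LENGTH are
hypothetical, and the code assumes input limited to U+0000..U+10FFFF
with no surrogate or range checking. An encoder that follows the range
table can only ever emit the shortest form; a decoder has to compare
the decoded value against the minimum for the sequence length.

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000}


def encode_shortest(cp: int) -> bytes:
    """Encode one code point, choosing the length from the range table.

    Assumes 0 <= cp <= 0x10FFFF; surrogates are not rejected here.
    """
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def is_overlong(seq: bytes) -> bool:
    """True if seq has the bit layout of a 2- to 4-byte sequence but
    carries a value that would have fit in fewer bytes."""
    if not seq:
        return False
    lead = seq[0]
    if lead & 0xE0 == 0xC0:
        n, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        n, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        n, value = 4, lead & 0x07
    else:
        return False  # single byte, or not a lead byte at all
    if len(seq) != n or any(b & 0xC0 != 0x80 for b in seq[1:]):
        return False  # malformed for some other reason
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)
    return value < MIN_FOR_LENGTH[n]
```

With these definitions, is_overlong(b"\xc0\x80") and
is_overlong(b"\xe0\x80\x80") are both true, while encode_shortest(0)
yields the single byte 00.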

--
Doug Ewell | Thornton, CO, US | ewellic.org

Received on Mon Dec 19 2016 - 17:53:44 CST
