Re: UTC action on malformed/illegal UTF-8 sequences?

From: Doug Ewell
Date: Tue Oct 24 2000 - 01:47:52 EDT

"Hart, Edwin F." wrote:

> Does the UTC need to address the issue of malformed and illegal UTF-8
> sequences, etc.? The text in question is the example in D32 and the
> last sentence of the section on shortest encoding.


> The issue for UTC may be: If a process receives an "ill-formed"
> code sequence, should the standard specify the action or allow
> interpretation and give warnings (like RFC 2279). Will more software
> break if the ill-formed sequence is allowed or denied? Given the
> number of security problems and fixes I see a week, I personally
> think that the UTC needs to tighten the algorithms and require an
> exception condition rather than interpret the ill-formed code value
> sequences[.]

Ever since I first saw this topic come up on the list, I have moved
farther and farther over to the "security" side, and the conversion is
now complete. IMHO, there is *nothing* positive to be gained from
allowing 0xC0 0x80 to be interpreted as U+0000, as definition D32
explicitly allows, and as Markus Kuhn has pointed out, the code to
perform illegal-sequence checking is simple and quite fast (I know,
I've implemented it).
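
To give a sense of how little code the check takes, here is a minimal
sketch in Python. It is not the unpublished decoder mentioned above, and
the name decode_utf8_strict is made up for illustration. It assumes the
four-byte, U+10FFFF-limited form of UTF-8 rather than the six-byte form
in RFC 2279, and it treats every ill-formed sequence as an exception.

```python
def decode_utf8_strict(data: bytes) -> list[int]:
    """Return the code points in data, raising ValueError on ill-formed input.

    Illustrative sketch only; assumes UTF-8 limited to four bytes.
    """
    # (lead-byte mask, lead-byte pattern, trailing bytes, smallest legal value)
    forms = [
        (0x80, 0x00, 0, 0x0000),
        (0xE0, 0xC0, 1, 0x0080),
        (0xF0, 0xE0, 2, 0x0800),
        (0xF8, 0xF0, 3, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, trail, minimum in forms:
            if (lead & mask) == pattern:
                break
        else:
            raise ValueError(f"illegal lead byte 0x{lead:02X} at offset {i}")
        if i + trail >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        # Payload bits of the lead byte are the ones outside the mask.
        cp = lead & (~mask & 0xFF)
        for b in data[i + 1 : i + 1 + trail]:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        # The shortest-form check: one comparison per sequence.
        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        out.append(cp)
        i += trail + 1
    return out
```

With this sketch, decode_utf8_strict(b"\xC0\x80") raises ValueError
instead of returning U+0000, while b"\x00" still decodes normally.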

Ed had quoted Cris Bailiff thusly:

> The warning in RFC 2279 hasn't been heeded by a single unicode
> decoder that I have ever tested, commercial or free, including the
> Solaris 2.6 system libraries, the Linux unicode_console driver,
> Netscape Communicator and now, obviously, IIS.

Well, obviously Cris has never tested MY decoder. (OK, that's not
fair, since I've never published it.) But then:

> I've no idea how to put the brakes on the crash dive into a character
> encoding standard which seems to have no defined canonical encoding
> and no obvious way of performing deterministic comparisons.

Now we're back to the Bruce Schneier premise that Unicode is horribly
and irreparably flawed, when the truth is that UTF-8 would be just as
secure as any other encoding form ever invented if the UTC would only
tighten the spec and forbid conformant decoders from interpreting
overlong sequences, as Edwin suggests.
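
A small Python sketch may make the stakes concrete. It is an
illustration under assumptions, not a reconstruction of any real
product's code: the function decode_2byte and the sample request bytes
are hypothetical, and only the one- and two-byte forms are handled. The
point is that a byte-level check for "/" and a lenient decoder can
disagree about the same input, and that a strict decoder removes the
disagreement.

```python
def decode_2byte(data: bytes, strict: bool) -> str:
    """Decode one- and two-byte UTF-8 forms (hypothetical helper).

    With strict=False, overlong two-byte forms are interpreted.
    With strict=True, they raise ValueError.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif (b & 0xE0) == 0xC0 and i + 1 < len(data) \
                and (data[i + 1] & 0xC0) == 0x80:
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            if strict and cp < 0x80:
                raise ValueError(f"overlong sequence at offset {i}")
            out.append(chr(cp))
            i += 2
        else:
            raise ValueError(f"unsupported or ill-formed byte at offset {i}")
    return "".join(out)


# Hypothetical input: 0xC0 0xAF is an overlong encoding of "/" (U+002F).
request = b"..\xC0\xAF..\xC0\xAFsecret"

# A filter that scans the raw bytes sees no slash at all.
filter_sees_slash = b"/" in request

# A lenient decoder running after the filter produces slashes anyway.
lenient_result = decode_2byte(request, strict=False)
```

Here filter_sees_slash is False, yet lenient_result is "../../secret".
Calling decode_2byte(request, strict=True) raises ValueError, so the
two layers can no longer see different text.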

-Doug Ewell
 Fullerton, California
