Re: PDUTR #26 posted

From: DougEwell2@cs.com
Date: Tue Sep 18 2001 - 01:27:55 EDT


In a message dated 2001-09-17 16:24:05 Pacific Daylight Time,
david.hopwood@zetnet.co.uk writes:

> It doesn't reopen that specific type of security hole, because irregular
> UTF-8 sequences (as defined by Unicode 3.1) can only decode to characters
> above 0xFFFF, and those characters are unlikely to be "special" for any
> application protocol. However, I entirely agree that it's desirable that
> UTF-8 should only allow shortest form; 6-byte surrogate encodings have
> always been incorrect.

All Unicode code points of the form U+xxFFFE and U+xxFFFF (the last two
code points of each plane) are special, in that they are noncharacters and
may be treated in a special way by applications (e.g. as sentinels).
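
(A quick Python sketch, purely illustrative and the function name is my
own: the plane-final noncharacters are exactly the code points whose low
16 bits are FFFE or FFFF, so the test reduces to a single mask.)

def is_plane_final_noncharacter(cp: int) -> bool:
    # True for U+xxFFFE/U+xxFFFF in every plane. Note that
    # U+FDD0..U+FDEF are noncharacters too, and this mask does
    # not catch them.
    return (cp & 0xFFFE) == 0xFFFE

assert is_plane_final_noncharacter(0xFFFE)
assert is_plane_final_noncharacter(0x10FFFF)
assert not is_plane_final_noncharacter(0x10000)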

I don't agree that irregular UTF-8 sequences in general can only decode to
characters above 0xFFFF. For example, the following overlong
(non-shortest-form) UTF-8 sequences all decode to U+0000:

C0 80
E0 80 80
F0 80 80 80
F8 80 80 80 80
FC 80 80 80 80 80
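
To see why, here's a rough Python sketch (my own code, not from any spec)
of a decoder that uses bit arithmetic alone, with no shortest-form check:
it maps every sequence above to U+0000, while comparing against the
smallest code point each length may encode flags all of them as overlong.

def naive_decode(seq: bytes) -> int:
    # Decode one sequence by bit arithmetic, with no overlong check.
    first = seq[0]
    if first < 0x80:
        return first
    length = 0                       # count the leading 1-bits
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    cp = first & (mask - 1)          # payload bits of the lead byte
    for b in seq[1:length]:
        cp = (cp << 6) | (b & 0x3F)  # fold in 6 bits per trail byte
    return cp

# Smallest code point each length may encode (lengths 5 and 6 were
# legal under RFC 2279's 31-bit scheme).
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}

for seq in (b"\xC0\x80", b"\xE0\x80\x80", b"\xF0\x80\x80\x80",
            b"\xF8\x80\x80\x80\x80", b"\xFC\x80\x80\x80\x80\x80"):
    cp = naive_decode(seq)
    tag = "overlong" if cp < MIN_FOR_LENGTH[len(seq)] else "ok"
    print(seq.hex(" ").upper(), "->", "U+%04X" % cp, tag)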

It is true that the *specific* irregular UTF-8 sequences introduced (and
required) by CESU-8 decode to characters above 0xFFFF when interpreted as
CESU-8, and to pairs of surrogate code points when (incorrectly)
interpreted as UTF-8. But definition D29, arguably my least favorite part
of Unicode, requires that all UTFs (including UTF-8) be able to represent
unpaired surrogates, so the character count for the same chunk of data can
differ depending on whether it is interpreted as CESU-8 or as UTF-8.
That's a potential security hole.
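
To make the count mismatch concrete, here's a minimal Python sketch (again
my own, purely illustrative): the six CESU-8 bytes for U+10000, namely
ED A0 80 ED B0 80, read as one character under CESU-8 but as two surrogate
code points under a lenient "UTF-8" decoder that doesn't reject
D800..DFFF.

def lenient_utf8_3byte(buf: bytes) -> list:
    # Sketch only: assumes buf is a run of three-byte sequences, and
    # decodes them by bit arithmetic, surrogates included.
    out = []
    for i in range(0, len(buf), 3):
        b0, b1, b2 = buf[i], buf[i + 1], buf[i + 2]
        out.append(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F))
    return out

def cesu8_decode(buf: bytes) -> list:
    # CESU-8: decode the three-byte pieces, then pair the surrogates;
    # an unpaired surrogate falls through as-is.
    units = lenient_utf8_3byte(buf)
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units):
            out.append(0x10000 + ((u - 0xD800) << 10)
                       + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

data = bytes.fromhex("EDA080EDB080")    # CESU-8 for U+10000
print(len(lenient_utf8_3byte(data)))    # 2 -- U+D800, U+DC00
print(len(cesu8_decode(data)))          # 1 -- U+10000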

CESU-8 decoders that are really diligent could check for this, of course, but
when I think of CESU-8 the concept of "really diligent decoders" just doesn't
spring to mind. If the inventors were really diligent, they would have
implemented UTF-16 sorting correctly in the first place.

-Doug Ewell
 Fullerton, California


