Security Risks of Unicode

From: Doug Ewell (dewell@compuserve.com)
Date: Thu Jul 20 2000 - 10:58:19 EDT


Elliotte Rusty Harold <elharo@metalab.unc.edu> wrote:

> Bruce Schneier expresses some concerns about "Security Risks of
> Unicode" in the latest issue of his Cryptogram newsletter. Those who
> don't subscribe can see:
>
> http://www.counterpane.com/crypto-gram-0007.html#9

I'm no expert on computer security issues (ironic; that is the field I
thought I would go into after college), but not many people have
responded to this, so I will give it a try. Naturally I will end up
defending Unicode most of the time.

> At this point the concerns are mostly theoretical. Nonetheless I
> think they're reasonable, especially when you consider the recent
> discussions here about C1 control characters and the unintended
> consequences of these characters. Throw XML/Unicode encoded
> application protocols like SOAP and XML-RPC into the mix and who
> knows what can happen? Which is pretty much Schneier's point.

My problem with the premise presented here is that it assumes security
problems are the fault of the character encoding mechanism, not the
software that processes those characters. This is similar to the
argument that UTF-8 breaks terminal-host communication because it uses
bytes in the 0x80-0x9F range. My response, that if terminals have some
reason to expect UTF-8 then they should apply the UTF-8 conversion first
and process C1 controls afterward, appears to miss the point in some
way. So I may be missing Schneier's point as well, but I think if you
are going to receive and process Unicode characters, you have to process
them conformantly, which means learning what "conformant" means.
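A minimal sketch of that ordering, in Python: decode the UTF-8 first, then look for C1 controls among the resulting characters instead of among the raw bytes. The function names are made up for illustration, and a real terminal would of course do more than report what it finds.

```python
def c1_controls(data):
    """Decode bytes as UTF-8 first, then report any C1 control characters."""
    text = data.decode("utf-8")  # strict by default: malformed input raises
    return ["U+%04X" % ord(ch) for ch in text if 0x80 <= ord(ch) <= 0x9F]


def has_c1_range_bytes(data):
    """The naive byte-level check: flags any byte in 0x80-0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


a_grave = "\u00C0".encode("utf-8")   # bytes C3 80
print(has_c1_range_bytes(a_grave))   # True: byte 0x80 is present
print(c1_controls(a_grave))          # []: no C1 control after decoding
print(c1_controls(b"\xc2\x85"))      # ['U+0085']: a real C1 control (NEL)
```

The byte 0x80 in the first example is only a UTF-8 trail byte; it becomes a problem only if something interprets the bytes before decoding them.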

Schneier writes:

> The Unicode specification includes all sorts of complicated new escape
> sequences. They have things called UTF-8 and UTF-16, which allow
> several possible representations of various character codes, several
> different places where control-characters pop through, a scheme for
> placing diacriticals and accents in separated characters (looking very
> much like an escape), and hundreds of brand new punctuation characters
> and otherwise nonalphabetic characters.

A lot of this reminds me of the Amy Burns article that used to be linked
on the Unicode Web site, in which the author was too intimidated by the
size and complexity of Unicode to write about it coherently.

What "complicated new escape sequences" is he talking about? I can't
think of any new escape sequences introduced by Unicode, unless you
count surrogate handling and non-spacing modifiers, which I don't think
are particularly complicated.

Where do control characters "pop through" in Unicode? Is he talking
about the BOM, the replacement character, the deprecated activate/
inhibit pairs, or what? In any case I don't see what type of security
problems these characters will cause.

> What happens when:
>
> - We start attaching semantics to the new characters as delimiters,
> white space, etc? With thousands of characters and new characters
> being added all the time, it will be extremely difficult to categorize
> all the possible characters consistently, and where there is
> inconsistency, there tends to be security holes.

OK, now we're getting somewhere. The author is simply not aware that
in UnicodeData.txt and other data files provided by Unicode, there are
*normative* classifications for characters as (e.g.) white space.

"Delimiter" is an application-specific concept, so no character encoding
standard can ever cover that one completely, but otherwise this is a
matter of following the standard vs. ignoring parts of it.
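As a rough illustration of looking a classification up instead of guessing, Python's standard unicodedata module exposes the General_Category values from the Unicode character database. The helper name below is hypothetical, and it checks only the separator categories (Zs, Zl, Zp); controls such as TAB and LINE FEED are category Cc and would need separate handling.

```python
import unicodedata


def is_separator(ch):
    """True if the character's General_Category is Zs, Zl, or Zp."""
    return unicodedata.category(ch) in ("Zs", "Zl", "Zp")


for ch in (" ", "\u00A0", "\u3000", "\u2028", "\t", "_"):
    print("U+%04X %s %s" % (ord(ch), unicodedata.category(ch), is_separator(ch)))
```

The point is only that the answer comes from the published data, so two implementations that both consult it will agree.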

> - Somebody uses "modifier" characters in an unexpected way?

Perhaps this is the best point he makes. A non-spacing modifier can
turn 'a' into a-with-acute, and you don't know when (if ever) the acute
will arrive on the wire, so you must wait an arbitrary amount of time
before processing the 'a'. This is out of my expertise, so I can't say
he's wrong.
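To make the concern concrete, here is a small Python sketch, assuming the whole string has already arrived (so it says nothing about how long to wait on the wire). It attaches each combining mark to the character before it and shows that the two spellings of a-with-acute compare equal only after normalization. The clusters() helper is hypothetical and much simpler than proper text segmentation.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # non-zero combining class: attach to previous
        else:
            out.append(ch)     # starts a new cluster
    return out


decomposed = "a\u0301b"        # 'a' + COMBINING ACUTE ACCENT + 'b'
precomposed = "\u00E1b"        # LATIN SMALL LETTER A WITH ACUTE + 'b'

print(len(clusters(decomposed)))                                # 2
print(decomposed == precomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

A check that compares or filters text before normalizing it sees the first 'a' as a plain ASCII letter, which is the sort of surprise the question hints at.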

> - Somebody uses UTF-8 or UTF-16 to encode a conventional character in
> a novel way to bypass validation checks?

Then that is the fault of your validation checks. UTF-8 and UTF-16
are both simple and clearly defined, so there is no excuse for not
implementing them correctly and completely. Do not treat C0 AE as a
valid encoding of U+002E FULL STOP. Decode first, interpret second.
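A short Python sketch of "decode first, interpret second". It relies on Python's built-in strict UTF-8 decoder, which rejects overlong forms such as C0 AE; the wrapper name and the path-style example are assumptions made for illustration.

```python
def decode_or_reject(data):
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


print(decode_or_reject(b"."))                  # '.'
print(decode_or_reject(b"\xc0\xae"))           # None: overlong form of U+002E
print(decode_or_reject(b"\xc0\xae\xc0\xae/"))  # None: never reaches the "../" check
```

Any check for "." or "../" then runs on the decoded characters, so there is only one spelling of FULL STOP left to look for.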

> With Unicode, we probably won't be able to get a consistent definition
> of what to accept, what is a delimiter under what circumstance, or how
> to handle arbitrary streams safely.

See comments above. Unicode provides a very consistent definition of
what it is able to define, and that which it cannot define *must* be
left up to the implementation. If you really want to be secure about
delimiters and other "special" characters, restrict their repertoire to
that which existed in your favorite 8-bit character set, and reject
everything else.
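One possible reading of that advice, sketched in Python: pick a small fixed set of special characters, and reject any other character that the character database classifies as a separator, punctuation, or control, while still letting ordinary letters through. The allowed set, the function name, and the choice of categories are all assumptions for illustration.

```python
import unicodedata

ALLOWED_SPECIAL = set(" ,;")   # hypothetical: the only special characters accepted


def split_fields(text):
    """Split on ',' or ';' after rejecting any other special character."""
    for ch in text:
        if ch in ALLOWED_SPECIAL:
            continue
        # General_Category starting with Z (separator), P (punctuation),
        # or C (control, format, etc.) counts as "special" in this sketch.
        if unicodedata.category(ch)[0] in "ZPC":
            raise ValueError("unexpected special character U+%04X" % ord(ch))
    return [field.strip(" ") for field in text.replace(";", ",").split(",")]


print(split_fields("na\u00EFve, caf\u00E9;abc"))   # letters outside ASCII are fine
try:
    split_fields("a\u3001b")                       # U+3001 IDEOGRAPHIC COMMA
except ValueError as err:
    print("rejected:", err)
```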

> Unicode is just too complex to ever be secure.

Not if you make the effort to understand it.

-Doug Ewell
 Fullerton, California


