From: - - (crossroads0000@googlemail.com)
Date: Sun Dec 27 2009 - 11:56:04 CST
Hello.
I'm currently trying to figure out which steps to take after receiving
UTF-8 over a connection. I cannot trust the sender in any way, so
input validation and filtering HAS to be done. The UTF-8 data is text,
which is why I also want to filter out control characters which have
nothing to do with proper text presentation (that is, directional
markers may be allowed in the UTF-8 stream, but control characters like
U+0001 may not).
I want to present my steps here for comments and suggestions. Remember
that security is paramount and I welcome every suggestion on which
code points should also be filtered out. In this context, "filtering
out" means that they are simply removed from the stream.
Here's what I do right now:
1) Validate that the UTF-8 is well-formed, with no overlong byte
sequences and no 5- or 6-byte sequences.
2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
* 0x0000 - 0x001F (1st bunch of control characters)
* 0x007F - 0x009F (2nd bunch of control characters)
* 0xD800 - 0xDFFF (surrogate code points, have no use in UTF-8)
* 0xE000 - 0xF8FF (private use; since everyone can make up a
different character for a code point in private use, filter them all)
* 0xFEFF (byte order mark, no use in UTF-8 and may be
potentially dangerous if converted later to UTF-16 without proper
filtering)
* 0xFFFE (byte order mark in wrong endian format, guaranteed
never to be assigned as a Unicode character)
* 0xFFFF (also guaranteed never to be assigned as a Unicode character).
For the rest, allow all ***assigned*** code points, filter unassigned.
3) For code points in planes 3 to 13 (unassigned planes) filter the
complete range 0x30000 to 0xDFFFF.
4) For code points in plane 14 (SSP) allow all ***assigned*** code
points, filter unassigned.
5) For code points in planes 15 and 16 (private use) filter the
complete range 0xF0000 - 0x10FFFF. Same argument as before: since
everyone can make up a different character for a code point in private
use, filter them all.
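Steps 1) to 5) above can be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions that are not in the post: the names FILTERED_RANGES, is_allowed and sanitize are made up for illustration, "assigned" is approximated as "general category is not Cn" in whatever Unicode version the Python interpreter bundles, and the BMP private use area is taken to end at U+F8FF. Step 1 relies on Python's strict UTF-8 decoder, which rejects overlong forms, encoded surrogates, and 5- or 6-byte sequences.

```python
import unicodedata

# Ranges removed outright in steps 2, 3 and 5 (inclusive bounds).
# Illustrative table only; the name is hypothetical.
FILTERED_RANGES = (
    (0x0000, 0x001F),     # first bunch of control characters
    (0x007F, 0x009F),     # second bunch of control characters
    (0xD800, 0xDFFF),     # surrogates (the strict decoder rejects these anyway)
    (0xE000, 0xF8FF),     # BMP private use area (ends at U+F8FF)
    (0xFEFF, 0xFEFF),     # byte order mark
    (0xFFFE, 0xFFFF),     # never assigned as characters
    (0x30000, 0xDFFFF),   # planes 3 to 13, as listed in step 3
    (0xF0000, 0x10FFFF),  # planes 15 and 16 (private use)
)


def is_allowed(ch: str) -> bool:
    """Return True if the code point survives steps 2 to 5."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in FILTERED_RANGES):
        return False
    # Assumption: "Cn" means unassigned in the Unicode version that
    # ships with this Python build, which may lag the current standard.
    return unicodedata.category(ch) != "Cn"


def sanitize(raw: bytes) -> str:
    """Validate the bytes as UTF-8, then cut out filtered code points."""
    # Step 1: raises UnicodeDecodeError on ill-formed input.
    text = raw.decode("utf-8", errors="strict")
    return "".join(ch for ch in text if is_allowed(ch))
```

For example, sanitize(b"abc\x01def") drops the U+0001 and keeps the letters, while a directional marker such as U+200E passes through because its general category is Cf, not Cn. Whether ill-formed input should raise or be rejected in some other way is a design choice left open here.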
I'm looking forward to informed comments, especially on point 4). I'm
not sure whether I should allow any code points from plane 14,
especially since they seem to be mostly tags (what are they good
for?). Also, are the steps taken in points 1) to 5) enough?
My final question is this: which of the code points ***higher than***
127 that the previous steps allow do I have to "HTML encode"
if I display them in an HTML page? None? Or is it possible that
characters with code points outside the US-ASCII range may be
interpreted by the browser in a similar way to < & and > in the
US-ASCII range, thereby allowing for an XSS attack?
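As a point of reference for the question, here is a minimal sketch of what Python's standard-library html.escape does. It does not settle whether that is sufficient against XSS; it only shows that this particular function rewrites the ASCII characters &, <, > and (with quote=True) the two quote characters, and passes every code point above 127 through unchanged. The helper name to_html_text is hypothetical, and the sketch assumes the page is served with an explicit UTF-8 charset declaration so that the browser does not guess the encoding.

```python
import html


def to_html_text(text: str) -> str:
    """Escape text for use in HTML element content or attribute values."""
    # html.escape replaces &, < and >; quote=True also replaces " and '.
    # Non-ASCII code points are left as they are.
    return html.escape(text, quote=True)
```

With this helper, "<b>" becomes "&lt;b&gt;", whereas a lookalike such as U+FF1C (fullwidth less-than sign) is returned unchanged.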
Thanks for reading my lengthy post. :-)
This archive was generated by hypermail 2.1.5 : Sun Dec 27 2009 - 20:20:15 CST