From: Tom Gewecke (tom@bluesky.org)
Date: Sun Apr 23 2006 - 13:09:04 CST
On Apr 23, 2006, at 9:49 AM, Richard Wordingham wrote:
>
> It's actually very simple. Given an initial byte E1, the next two
> bytes must be of the form 10xxxxxx 10xxxxxx. If the parser then
> trusts alleged UTF-8 to be valid UTF-8 (which it should not), it can
> then ignore the non-x bits. Now, it is the second and third bytes
> that are incorrect, being FC and D0 rather than BC and 90, i.e. bit 6
> is 1 whereas it must be 0. The low six bits of FC (wrong) and BC
> (correct) are the same, as are those of D0 (wrong) and 90 (correct).
>
Thanks! This would explain some other weird things I have seen in Win
Outlook, where invalid byte sequences can get displayed as Chinese
characters.
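To make sure I follow, here is a rough sketch (Python, purely for
illustration, not anyone's actual decoder) of the lenient behaviour
described above: a decoder that keeps only the low six bits of each
continuation byte gets the same code point from E1 FC D0 as from the
valid sequence E1 BC 90.

    def lenient_decode_3byte(b1, b2, b3):
        # Assumes b1 is a 3-byte lead byte (1110xxxx) and does NOT check
        # that b2 and b3 are of the form 10xxxxxx -- it just masks off
        # everything above the low six bits.
        return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

    print(hex(lenient_decode_3byte(0xE1, 0xBC, 0x90)))  # 0x1f10, from the valid sequence
    print(hex(lenient_decode_3byte(0xE1, 0xFC, 0xD0)))  # 0x1f10 again, from the invalid one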
Apparently there is some code in circulation that also generates erroneous
UTF-8 like this, which is then pretty hard for a Win IE user to detect.
Any security issues from this ability to read invalid UTF-8 as if it
were valid?
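For what it's worth, a strict check along the lines of the Unicode spec's
continuation-byte ranges (again only a sketch, not any particular product's
code; lead bytes E1..EC take continuation bytes in 80..BF) would reject the
bad sequence outright:

    def is_valid_3byte_sequence(b1, b2, b3):
        # For lead bytes E1..EC, both continuation bytes must lie in
        # 80..BF, i.e. be of the form 10xxxxxx.
        return 0xE1 <= b1 <= 0xEC and 0x80 <= b2 <= 0xBF and 0x80 <= b3 <= 0xBF

    print(is_valid_3byte_sequence(0xE1, 0xBC, 0x90))  # True
    print(is_valid_3byte_sequence(0xE1, 0xFC, 0xD0))  # False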