From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Feb 25 2003 - 18:05:19 EST
ftang@netscape.com wrote:
Unfortunately, FSS-UTF in Unicode 1.1 IS NOT UTF-8. Most people
refer to UTF-8 by looking at RFC 2279
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
and RFC 2044 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html
but in those two RFCs, where the decoding process is stated, there is no
mention of checking for the non-shortest form.
2279 does mention it. Near the end of section 2 you have:
NOTE -- actual implementations of the decoding algorithm above
should protect against decoding invalid sequences. For
instance, a naive implementation may (wrongly) decode the
invalid UTF-8 sequence C0 80 into the character U+0000, which
may have security consequences and/or cause other problems. See
the Security Considerations section below.
And the Security Considerations section explains why one should check that.
It's only a NOTE in 2279, hence not a normative prescription, reflecting the
state of Unicode back in 1998. It's being made a normative MUST in 2279bis.
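
To make the NOTE concrete, here is a minimal Python sketch of the problem it describes. The names naive_decode, is_shortest_form and MIN_FOR_LENGTH are hypothetical, chosen for illustration, and the code assumes it is handed one complete sequence of 1 to 4 bytes. It is not the algorithm text from the RFC.

```python
# Smallest code point that legitimately needs a sequence of each length.
# Anything below this minimum is a non-shortest ("overlong") form.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8 sequence with no validity checks (illustration only)."""
    first = seq[0]
    if first < 0x80:
        return first
    if first >= 0xF0:
        cp = first & 0x07   # 4-byte lead: 3 payload bits
    elif first >= 0xE0:
        cp = first & 0x0F   # 3-byte lead: 4 payload bits
    else:
        cp = first & 0x1F   # 2-byte lead: 5 payload bits
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)  # each trail byte adds 6 bits
    return cp


def is_shortest_form(seq: bytes) -> bool:
    """True if the sequence could not have been encoded in fewer bytes."""
    return naive_decode(seq) >= MIN_FOR_LENGTH[len(seq)]


# The sequence from the NOTE: a naive decoder turns C0 80 into U+0000.
assert naive_decode(b"\xc0\x80") == 0x0000
assert not is_shortest_form(b"\xc0\x80")
```

A decoder that applies the is_shortest_form check before accepting a sequence rejects C0 80 instead of producing U+0000. Python's built-in b"\xc0\x80".decode("utf-8") already does so and raises UnicodeDecodeError.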
Likewise, ever since the surrogate code point range was designated in
Unicode 2.0, it has been invalid (or at least nonsensical) to encode
values from U+D800 through U+DFFF directly in UTF-8.
Again, RFC 2279 is the one people look at when they refer to UTF-8. And the
decoding process stated in there does not mention checking for the range that
maps directly to D800-DFFF.
Unfortunately true, but that's being fixed in 2279bis.
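
The missing range check is small. The following Python sketch shows one way to write it, using hypothetical helper names (decode_three_byte, is_surrogate) and assuming the input is a single three-byte sequence, which is the only length that can reach D800-DFFF.

```python
def decode_three_byte(seq: bytes) -> int:
    """Decode a three-byte UTF-8 sequence without any range checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True for the surrogate code points U+D800 through U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


# ED A0 80 follows the three-byte bit pattern but maps directly to U+D800.
cp = decode_three_byte(b"\xed\xa0\x80")
assert cp == 0xD800
assert is_surrogate(cp)  # a strict decoder rejects the sequence here
```

For comparison, Python's own strict codec refuses this input: b"\xed\xa0\x80".decode("utf-8") raises UnicodeDecodeError unless the "surrogatepass" error handler is requested.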
Well... that is another question. Is a UTF-8 sequence that represents U+FFFE
or U+FFFF a legal UTF-8 sequence?
Markus Scherer already answered this one: it's valid UTF-8 representing
non-characters that should not be exchanged across system boundaries. A
UTF-8 decoder is not necessarily located at such a boundary.
(Just like you may have a valid Base64-encoded file which encodes an illegal
GIF file. Your Base64 is legal, fully conforms to the Base64 decoding logic and
can be decoded, but the decoded file is not a legal GIF file that conforms
to the GIF file specification.)
Pretty apt analogy.
Where is the boundary between legal UTF-8 and legal Unicode?
At "system boundaries", which non-characters may not cross.
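
The split between the two layers can be sketched in Python as follows. The function names is_noncharacter and safe_to_send are hypothetical, and the noncharacter set used is the one in the current Unicode Standard (U+FDD0 through U+FDEF plus the last two code points of every plane), which is wider than the U+FFFE and U+FFFF discussed in this thread.

```python
def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters (assumed set: current Unicode Standard)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def safe_to_send(text: str) -> bool:
    """Boundary check: refuse text that contains any noncharacter."""
    return not any(is_noncharacter(ord(ch)) for ch in text)


# Layer 1: EF BF BE is well-formed UTF-8, so the decoder accepts it.
decoded = b"\xef\xbf\xbe".decode("utf-8")
assert decoded == "\ufffe"

# Layer 2: the check at the system boundary is what stops U+FFFE.
assert not safe_to_send(decoded)
```

The decoder and the boundary filter are kept separate on purpose: a decoder used purely inside a system has no reason to reject these sequences.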
-- François Yergeau