RE: please review the paper for me

From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Feb 25 2003 - 18:05:19 EST

Next message: Kenneth Whistler: "Re: Unicode 4.0 BETA available for review"

Previous message: Yung-Fong Tang: "Re: please review the paper for me"
Maybe in reply to: Yung-Fong Tang: "please review the paper for me"
Next in thread: Kenneth Whistler: "Re: please review the paper for me"

ftang@netscape.com wrote:

Unfortunately, FSS-UTF in Unicode 1.1 IS NOT UTF-8. Most people
refer to UTF-8 by looking at RFC 2279
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html
and RFC 2044 http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html
but in those two RFCs, the description of the decoding process does not
mention checking for the non-shortest form.

2279 does mention it. Near the end of section 2 you have:

        NOTE -- actual implementations of the decoding algorithm above
        should protect against decoding invalid sequences. For
        instance, a naive implementation may (wrongly) decode the
        invalid UTF-8 sequence C0 80 into the character U+0000, which
        may have security consequences and/or cause other problems. See
        the Security Considerations section below.

And the Security Considerations section explains why one should check for
that. It's only a NOTE in 2279, hence not a normative prescription,
reflecting the state of Unicode back in 1998. It's being made a normative
MUST in 2279bis.
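
To make the C0 80 case from the NOTE concrete, here is a minimal Python sketch. It is not taken from RFC 2279; the function names are hypothetical and only two-byte sequences are handled. It contrasts a decoder that skips the check with one that rejects the non-shortest form.

```python
def decode_two_byte_naive(b0, b1):
    """Combine the payload bits of a 2-byte sequence with no validity checks."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def decode_two_byte_strict(b0, b1):
    """Same decoding, but reject malformed and non-shortest-form input."""
    # Lead byte must look like 110xxxxx, trail byte like 10xxxxxx.
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a well-formed 2-byte sequence")
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    # Anything below U+0080 fits in one byte, so two bytes is an overlong form.
    if cp < 0x80:
        raise ValueError("non-shortest form")
    return cp


# The naive decoder turns C0 80 into U+0000, as the NOTE warns.
naive_result = decode_two_byte_naive(0xC0, 0x80)
```

With this sketch, `decode_two_byte_strict(0xC0, 0x80)` raises `ValueError`, while a legitimate sequence such as C3 A9 still decodes to U+00E9.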

Likewise, ever since the surrogate code point range was designated in
Unicode 2.0, it has been invalid (or at least nonsensical) to encode
values from U+D800 through U+DFFF directly in UTF-8.

Again, RFC 2279 is the one people look at when they refer to UTF-8. And the
decoding process stated in there does not mention checking for the range that
maps directly to D800-DFFF.

Unfortunately true, but that's being fixed in 2279bis.
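
As an illustration of the missing check, the following Python sketch decodes a three-byte sequence and rejects results in D800-DFFF. The function name is hypothetical and the code is an assumption about how such a check could look, not text from either RFC.

```python
def decode_three_byte_strict(b0, b1, b2):
    """Decode a 3-byte UTF-8 sequence, rejecting overlong forms and surrogates."""
    # Lead byte must look like 1110xxxx, both trail bytes like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a well-formed 3-byte sequence")
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    # Values below U+0800 fit in fewer bytes: non-shortest form.
    if cp < 0x800:
        raise ValueError("non-shortest form")
    # The extra range check: surrogate code points are not encodable in UTF-8.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point U+%04X" % cp)
    return cp


# E2 82 AC is the euro sign, U+20AC, and passes both checks.
euro = decode_three_byte_strict(0xE2, 0x82, 0xAC)
```

Without the range check, ED A0 80 would decode to U+D800; with it, `decode_three_byte_strict(0xED, 0xA0, 0x80)` raises `ValueError`.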

Well... that is another question. Is a UTF-8 sequence that represents
U+FFFE or U+FFFF a legal UTF-8 sequence?

Markus Scherer already answered this one: it's valid UTF-8 representing
non-characters that should not be exchanged across system boundaries. A
UTF-8 decoder is not necessarily located at such a boundary.
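
One way to picture this separation is a sketch in Python, where the built-in UTF-8 codec accepts EF BF BF and a second, independent check is applied only where data leaves the system. The names `is_noncharacter` and `check_for_interchange` are hypothetical, and the full noncharacter ranges come from the Unicode Standard rather than from this thread.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_for_interchange(text):
    """Hypothetical boundary check: refuse text that contains noncharacters."""
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError("noncharacter U+%04X at a system boundary" % ord(ch))
    return text


# Decoding succeeds: EF BF BF is well-formed UTF-8 for U+FFFF.
decoded = b"\xef\xbf\xbf".decode("utf-8")
```

Here the decoding step and the boundary step are deliberately distinct: `decoded` is a valid string inside the program, and only `check_for_interchange(decoded)` raises.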

(Just like you may have a valid Base64-encoded file which encodes an illegal
GIF file. Your Base64 is legal, fully conforms to the Base64 decoding logic,
and can be decoded, but the decoded file is not a legal GIF file conforming
to the GIF file specification.)

Pretty apt analogy.

Where is the boundary between legal UTF-8 and legal Unicode?

At "system boundaries", which non-characters may not cross.

-- 
François Yergeau

This archive was generated by hypermail 2.1.5 : Tue Feb 25 2003 - 18:45:07 EST