RE: please review the paper for me

From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Feb 25 2003 - 18:05:19 EST


    ftang@netscape.com wrote:

    Unfortunately, FSS-UTF in Unicode 1.1 IS NOT UTF-8. Most people
    refer to UTF-8 by looking at RFC 2279
    <http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2279.html>
    and RFC 2044
    <http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2044.html>,
    but when those two RFCs state the decoding process, they do not
    mention checking for the non-shortest form.
     

    2279 does mention it. Near the end of section 2 you have:
     
            NOTE -- actual implementations of the decoding algorithm above
            should protect against decoding invalid sequences. For
            instance, a naive implementation may (wrongly) decode the
            invalid UTF-8 sequence C0 80 into the character U+0000, which
            may have security consequences and/or cause other problems. See
            the Security Considerations section below.
     
    And the Security Considerations section explains why one should check for
    that. It's only a NOTE in 2279, hence not a normative prescription, reflecting
    the state of Unicode back in 1998. It's being made a normative MUST in 2279bis.
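
    To make the C0 80 case from the NOTE concrete, here is a minimal sketch
    (not taken from RFC 2279; the function names are hypothetical) that
    contrasts a naive two-byte decoder with one that rejects the
    non-shortest form:

```python
def decode_two_byte_naive(b0: int, b1: int) -> int:
    """Combine the payload bits of a two-byte sequence with no validation."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def decode_two_byte_strict(b0: int, b1: int) -> int:
    """Same decoding, but reject malformed and non-shortest-form input."""
    # Lead byte must look like 110xxxxx, trail byte like 10xxxxxx.
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    if cp < 0x80:
        # Values below U+0080 must be encoded in a single byte.
        raise ValueError("non-shortest form")
    return cp
```

    With these definitions, decode_two_byte_naive(0xC0, 0x80) yields 0
    (U+0000), while decode_two_byte_strict(0xC0, 0x80) raises ValueError.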

    Likewise, ever since the surrogate code point range was designated in
    Unicode 2.0, it has been invalid (or at least nonsensical) to encode
    values from U+D800 through U+DFFF directly in UTF-8.

    Again, RFC 2279 is the one people look at when they refer to UTF-8. And the
    decoding process stated there does not mention checking for the range that
    maps directly to D800-DFFF.
     

    Unfortunately true, but that's being fixed in 2279bis.
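
    For illustration only, a three-byte decoder with the surrogate check
    added might look like the sketch below. The function name is
    hypothetical, and the overlong check is included on the assumption that
    a decoder would want both tests:

```python
def decode_three_byte_strict(b0: int, b1: int, b2: int) -> int:
    """Decode one three-byte UTF-8 sequence, rejecting invalid forms."""
    # Lead byte must look like 1110xxxx, both trail bytes like 10xxxxxx.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    cp = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    if cp < 0x800:
        # Would fit in one or two bytes, so this is a non-shortest form.
        raise ValueError("non-shortest form")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points must not be encoded directly in UTF-8.
        raise ValueError("surrogate code point")
    return cp
```

    Here ED A0 80, which would otherwise decode to D800, is rejected, while
    an ordinary sequence such as E2 82 AC still decodes to U+20AC.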
     

    Well... that is another question. Is a UTF-8 sequence that represents
    U+FFFE or U+FFFF a legal UTF-8 sequence?
     

    Markus Scherer already answered this one: it's valid UTF-8 representing
    non-characters that should not be exchanged across system boundaries. A
    UTF-8 decoder is not necessarily located at such a boundary.
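
    One way to picture the separation is the sketch below, which keeps
    decoding and the interchange check apart. The helper names are
    hypothetical. The noncharacter test also covers U+FDD0..U+FDEF and the
    last two code points of every plane, which is an addition based on the
    Unicode Standard's definition rather than on anything said in this
    thread:

```python
def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in any plane."""
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


def check_for_interchange(text: str) -> str:
    """Hypothetical boundary check, applied only where text leaves a system."""
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError(f"noncharacter U+{ord(ch):04X} at system boundary")
    return text


# The decoder itself accepts EF BF BE: it is well-formed UTF-8 for U+FFFE.
decoded = b"\xef\xbf\xbe".decode("utf-8")
```

    Decoding succeeds, and only check_for_interchange(decoded) raises.
    Whether and where to apply such a check is a design decision for the
    application, not for the decoder.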
     

    (Just like you may have a valid Base64-encoded file which encodes an illegal
    GIF file. Your Base64 is legal, fully conforms to the Base64 decoding logic, and
    can be decoded, but the decoded file is not a legal GIF file conforming
    to the GIF file specification.)
     

    Pretty apt analogy.

    Where is the boundary between legal UTF-8 and legal Unicode?
     

    At "system boundaries", which non-characters may not cross.
     

    -- 
    François Yergeau
    

