Re: [question] UTF-8 issue

From: Doug Ewell (doug@ewellic.org)
Date: Fri Oct 09 2009 - 23:17:18 CDT

    I could be wrong, but I was quite sure that when Chat Depasucat asked
    about UTF-8 and non-shortest forms, he was not talking about
    normalization.

    The rules about UTF-8 non-shortest forms are straightforward: you must
    never generate them, and if you read one, you must detect it as invalid
    and not treat it as if it were encoded correctly. Checking for the lead
    bytes C0 and C1 is only part of the story; you must also detect these
    invalid sequences:

    * E0 followed by (80 through 9F)
    * F0 followed by (80 through 8F)
    * F4 followed by (90 through BF)

    and, of course, any occurrence of F5 through FF. (You will not ever
    find the 5- and 6-byte sequences that Eljay mentioned in real-world
    data, though you might find them in laboratory conditions, such as
    Markus Kuhn's test file at
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt .)

    As Michael D'Errico observed, you also need to detect sequences
    corresponding to loose surrogates:

    * ED followed by (A0 through BF)

    and perform all the other validity checks that are unrelated to
    non-shortest forms.
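
    The byte-range rules above can be sketched as a small validator. This is
    only an illustration, not code from this thread: the function name
    is_valid_utf8 is hypothetical, and the per-lead-byte ranges are my
    reading of the lists above, with every other trail byte assumed to fall
    in 80 through BF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                   # ASCII, single byte
            i += 1
            continue
        # need = number of trail bytes; lo..hi = allowed range of the
        # FIRST trail byte, which is where the special cases live.
        if 0xC2 <= lead <= 0xDF:           # C0 and C1 are never valid
            need, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # rejects E0 80..9F
        elif lead == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # rejects ED A0..BF (surrogates)
        elif 0xE1 <= lead <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # rejects F0 80..8F
        elif 0xF1 <= lead <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # rejects F4 90..BF
        else:
            # Stray trail byte (80..BF), C0, C1, or F5..FF.
            return False
        if i + need >= n:                  # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True
```

    For example, is_valid_utf8(b"\xc0\x80") and is_valid_utf8(b"\xed\xa0\x80")
    both return False, while is_valid_utf8(b"\xf4\x8f\xbf\xbf") (U+10FFFF)
    returns True.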

    --
    Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
    RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s
    

