From: Doug Ewell (doug@ewellic.org)
Date: Fri Oct 09 2009 - 23:17:18 CDT
I could be wrong, but I was quite sure that when Chat Depasucat asked
about UTF-8 and non-shortest forms, he was not talking about
normalization.
The rules about UTF-8 non-shortest forms are straightforward: you must
never generate them, and if you read one, you must detect it as invalid
and not treat it as if it were the correctly encoded character. Checking
for the lead bytes C0 and C1 is only part of the story; you must also
detect these invalid sequences:
* E0 followed by (80 through 9F)
* F0 followed by (80 through 8F)
* F4 followed by (90 through BF)
and, of course, any occurrence of the bytes F5 through FF. (You will
never find the 5- and 6-byte sequences that Eljay mentioned in
real-world data, though you might find them under laboratory conditions,
such as Markus Kuhn's test file at
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt .)
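
As a rough illustration of the byte ranges listed above, here is a
minimal Python sketch. The names FORBIDDEN_SECOND, lead_is_never_valid
and pair_is_invalid are hypothetical, chosen only for this example. It
looks at a lead byte and the byte that follows it, and nothing more; it
does not check that the second byte is a continuation byte at all, nor
that the sequence has the right length.

```python
# Hypothetical helper names; the byte ranges are the ones listed above.
FORBIDDEN_SECOND = {
    0xE0: range(0x80, 0xA0),  # E0 80..9F: non-shortest 3-byte form
    0xF0: range(0x80, 0x90),  # F0 80..8F: non-shortest 4-byte form
    0xF4: range(0x90, 0xC0),  # F4 90..BF: would encode above U+10FFFF
}


def lead_is_never_valid(lead: int) -> bool:
    """C0, C1 and F5..FF can never appear in well-formed UTF-8."""
    return lead in (0xC0, 0xC1) or lead >= 0xF5


def pair_is_invalid(lead: int, second: int) -> bool:
    """True if (lead, second) matches one of the forbidden patterns.

    This covers only the patterns above; a False result does not mean
    the pair is part of a well-formed sequence.
    """
    if lead_is_never_valid(lead):
        return True
    return second in FORBIDDEN_SECOND.get(lead, ())
```

For example, pair_is_invalid(0xE0, 0x9F) is true, while
pair_is_invalid(0xE0, 0xA0) is false because E0 A0 begins the shortest
encoding of U+0800.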
As Michael D'Errico observed, you also need to detect sequences
corresponding to loose surrogates:
* ED followed by (A0 through BF)
and perform all the other validity checks that are unrelated to
non-shortest forms.
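
One way to see all of these checks working together is to hand the
suspect bytes to a strict decoder. The sketch below assumes Python 3,
whose built-in "utf-8" codec rejects non-shortest forms, encoded
surrogates, and the bytes F5 through FF by default. The function name
is_valid_utf8 is made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (assumes Python 3)."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


# Byte strings taken from the patterns discussed in this message.
rejected = [
    b"\xc0\xaf",          # C0 lead byte: non-shortest 2-byte form
    b"\xe0\x9f\xbf",      # E0 followed by 80..9F
    b"\xf0\x8f\xbf\xbf",  # F0 followed by 80..8F
    b"\xf4\x90\x80\x80",  # F4 followed by 90..BF
    b"\xed\xa0\x80",      # ED followed by A0..BF: surrogate U+D800
    b"\xf8\x88\x80\x80\x80",  # 5-byte sequence, lead byte in F5..FF
]

accepted = [
    b"\xed\x9f\xbf",      # U+D7FF, just below the surrogate range
    b"\xf4\x8f\xbf\xbf",  # U+10FFFF, the highest code point
]
```

Every entry in rejected makes is_valid_utf8 return False, and every
entry in accepted makes it return True. Other decoders may be more
lenient, so this is a check on one implementation rather than a
substitute for reading the rules.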
-- Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ http://is.gd/2kf0s