Maybe you'll find this useful:
UTF-8 decoder capability and stress test
----------------------------------------
Markus Kuhn <mkuhn@acm.org> - 1999-04-14
This test text examines how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences. According to ISO
10646-1, sections R.7 and 2.3c, a device receiving UTF-8 shall
interpret a "malformed sequence in the same way that it interprets a
character that is outside the adopted subset".
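
As a rough illustration of what such handling can look like in practice, the
following Python sketch decodes a buffer containing one stray continuation
byte using the standard library's error handlers. The sample bytes are an
assumption chosen for illustration and are not part of the test text, and
substituting U+FFFD is only one possible policy for a "character outside the
adopted subset".

```python
# Sample input (an assumption for illustration): "a", a lone continuation
# byte 0x80, then "b".
data = b"a\x80b"

# Strict decoding rejects the malformed byte outright.
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER for the
# malformed byte and keeps decoding the rest of the buffer.
replaced = data.decode("utf-8", errors="replace")

print("strict decoding succeeded:", strict_ok)
print("with replacement:", repr(replaced))
```

The point of the test sequences is to see which of these behaviours (reject,
replace, or silently pass through) a given decoder actually shows.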
Test sequences (all enclosed in ""):
Correct UTF-8 text (Greek word 'kosme'): "κόσμε"
Correct 2-byte sequence (U+00000080): ""
Correct 3-byte sequence (U+00000800): "ࠀ"
Correct 4-byte sequence (U+00010000): "𐀀"
Correct 5-byte sequence (U+00200000): ""
Correct 6-byte sequence (U+04000000): ""
Correct 2-byte sequence (U+000007ff): "߿"
Correct 3-byte sequence (U+0000ffff): ""
Correct 4-byte sequence (U+001fffff): ""
Correct 5-byte sequence (U+03ffffff): ""
Correct 6-byte sequence (U+7fffffff): ""
Correct 2-byte sequence (U+0000): ""
Correct 3-byte sequence (U+0000): ""
Correct 4-byte sequence (U+0000): ""
Correct 5-byte sequence (U+0000): ""
Correct 6-byte sequence (U+0000): ""
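
The byte values behind the boundary cases above can be regenerated with a
small encoder for the original 1- to 6-byte UTF-8 scheme. This is a minimal
sketch, not code from the test text: the function name encode_legacy_utf8 and
its length parameter are hypothetical. Forcing a length longer than necessary
reproduces the multi-byte encodings of U+0000. Note that Python's own strict
decoder rejects those longer forms of U+0000 as well as the 5- and 6-byte
forms.

```python
# Largest code point representable in an n-byte sequence (index = n).
MAX_FOR_LENGTH = [0, 0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
# Marker bits of the first byte for an n-byte sequence (index = n).
LEAD_MARK = [0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC]


def encode_legacy_utf8(cp, length=0):
    """Encode cp in `length` bytes; length=0 picks the shortest form.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if length == 0:
        length = next(n for n in range(1, 7) if cp <= MAX_FOR_LENGTH[n])
    if not 1 <= length <= 6 or cp > MAX_FOR_LENGTH[length]:
        raise ValueError("code point does not fit in requested length")
    if length == 1:
        return bytes([cp])
    out = []
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    out.append(LEAD_MARK[length] | cp)  # first byte carries the top bits
    return bytes(reversed(out))


# First and last code point of each sequence length.
for cp in (0x80, 0x800, 0x10000, 0x200000, 0x4000000,
           0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF):
    print(f"U+{cp:08X}: {encode_legacy_utf8(cp).hex()}")

# U+0000 forced into 2 to 6 bytes.
for n in range(2, 7):
    print(f"U+0000 in {n} bytes: {encode_legacy_utf8(0, n).hex()}")
```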
Unexpected continuation byte (10000000): ""
Another lonely continuation byte (10111111): ""
Sequence of 2 unexpected continuation bytes: ""
Sequence of 3 unexpected continuation bytes: ""
Sequence of 4 unexpected continuation bytes: ""
Sequence of 5 unexpected continuation bytes: ""
Sequence of 6 unexpected continuation bytes: ""
Sequence of 7 unexpected continuation bytes: ""
Sequence of all 64 possible continuation bytes (10000000-10111111):
"
"
Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
each followed by a space character:
"
"
Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
each followed by a space character: " "
Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
each followed by a space character: " "
Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
each followed by a space character: " "
Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
each followed by a space character: " "
Impossible byte (11111110): ""
Impossible byte (11111111): ""
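
The byte classes exercised above follow directly from the leading bit
patterns, so they can be checked mechanically. The sketch below is an
illustration under that assumption. The function name classify_byte and its
labels are hypothetical and do not come from the test text.

```python
from collections import Counter


def classify_byte(b):
    """Return a label for the role a single byte value can play."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    if b < 0xE0:
        return "lead-2"         # 110xxxxx
    if b < 0xF0:
        return "lead-3"         # 1110xxxx
    if b < 0xF8:
        return "lead-4"         # 11110xxx
    if b < 0xFC:
        return "lead-5"         # 111110xx
    if b < 0xFE:
        return "lead-6"         # 1111110x
    return "impossible"         # 11111110 and 11111111


# Tally how many of the 256 byte values fall into each class.
counts = Counter(classify_byte(b) for b in range(256))
for label, n in counts.items():
    print(f"{label}: {n}")
```

The tallies match the counts quoted in the test descriptions: 64 continuation
bytes, then 32, 16, 8, 4 and 2 first bytes for sequences of 2 to 6 bytes, and
2 impossible bytes.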
2-byte sequence with last byte missing: ""
3-byte sequence with last byte missing: ""
4-byte sequence with last byte missing: ""
5-byte sequence with last byte missing: ""
6-byte sequence with last byte missing: ""
All these 5 sequences with last byte missing concatenated:
""
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>