Maybe you'll find this useful:
UTF-8 decoder capability and stress test
----------------------------------------
Markus Kuhn <mkuhn@acm.org> - 1999-04-14
This test text examines how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences. According to ISO
10646-1, sections R.7 and 2.3c, a device receiving UTF-8 shall
interpret a "malformed sequence in the same way that it interprets a
character that is outside the adopted subset".
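
As a rough illustration of the requirement above, here is a minimal Python sketch. It assumes Python 3's built-in "utf-8" codec as the decoder under test; the helper names decode_lenient and is_well_formed are made up for this example. Note that this codec follows the current UTF-8 definition (RFC 3629), so it also rejects the 5- and 6-byte forms and the multi-byte encodings of U+0000 listed below.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for whatever cannot be decoded."""
    return data.decode("utf-8", errors="replace")


def is_well_formed(data: bytes) -> bool:
    """Return True if the strict decoder accepts the whole byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # A lone continuation byte between two quotation marks.
    sample = b'"\x80"'
    print(is_well_formed(sample))
    print(repr(decode_lenient(sample)))
```

The quotation marks around each test sequence make it easy to see whether a decoder resynchronises after the malformed bytes: both quotes should survive decoding.
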
Test sequences (all enclosed in ""):
Correct UTF-8 text (Greek word 'kosme'):     "κόσμε"
Correct 2-byte sequence (U+00000080):        ""
Correct 3-byte sequence (U+00000800):        "ࠀ"
Correct 4-byte sequence (U+00010000):        "𐀀"
Correct 5-byte sequence (U+00200000):        ""
Correct 6-byte sequence (U+04000000):        ""
Correct 2-byte sequence (U+000007ff):        "߿"
Correct 3-byte sequence (U+0000ffff):        ""
Correct 4-byte sequence (U+001fffff):        ""
Correct 5-byte sequence (U+03ffffff):        ""
Correct 6-byte sequence (U+7fffffff):        ""
Correct 2-byte sequence (U+0000):            ""
Correct 3-byte sequence (U+0000):            ""
Correct 4-byte sequence (U+0000):            ""
Correct 5-byte sequence (U+0000):            ""
Correct 6-byte sequence (U+0000):            ""
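
The byte values behind the boundary cases above can be regenerated with a short sketch like the following. It assumes the original UTF-8 bit layout with sequences of up to six bytes, as the list above does; encode_legacy_utf8 is a hypothetical helper, not part of any library. The current definition (RFC 3629) allows at most four bytes, stops at U+10FFFF, and treats the multi-byte encodings of U+0000 as overlong and therefore invalid.

```python
def encode_legacy_utf8(cp: int, nbytes: int) -> bytes:
    """Encode cp into exactly nbytes bytes using the 1..6-byte UTF-8 layout.

    No shortest-form check is made, so overlong forms such as a 2-byte
    U+0000 can be produced on purpose for testing.
    """
    if not 1 <= nbytes <= 6:
        raise ValueError("nbytes must be between 1 and 6")
    if nbytes == 1:
        if not 0 <= cp <= 0x7F:
            raise ValueError("code point does not fit in 1 byte")
        return bytes([cp])
    # The lead byte carries (7 - nbytes) payload bits, each continuation 6.
    payload_bits = (7 - nbytes) + 6 * (nbytes - 1)
    if cp < 0 or cp >> payload_bits:
        raise ValueError("code point does not fit in %d bytes" % nbytes)
    # Lead byte: nbytes one-bits, a zero bit, then the highest payload bits.
    lead_prefix = (0xFF << (8 - nbytes)) & 0xFF
    out = [lead_prefix | (cp >> (6 * (nbytes - 1)))]
    # Continuation bytes: 10xxxxxx, most significant group first.
    for i in range(nbytes - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


if __name__ == "__main__":
    for cp, n in [(0x80, 2), (0x800, 3), (0x10000, 4),
                  (0x200000, 5), (0x4000000, 6), (0x0, 2)]:
        print("U+%08X in %d bytes: %s" % (cp, n, encode_legacy_utf8(cp, n).hex(" ")))
```
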
Unexpected continuation byte (10000000):     ""
Another lonely continuation byte (10111111): ""
Sequence of 2 unexpected continuation bytes: ""
Sequence of 3 unexpected continuation bytes: ""
Sequence of 4 unexpected continuation bytes: ""
Sequence of 5 unexpected continuation bytes: ""
Sequence of 6 unexpected continuation bytes: ""
Sequence of 7 unexpected continuation bytes: ""
Sequence of all 64 possible continuation bytes (10000000-10111111):
"
 
 
 "
Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
each followed by a space character:
"                
                 "
Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
each followed by a space character: "                "
Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
each followed by a space character: "        "
Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
each followed by a space character: "    "
Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
each followed by a space character: "  "
Impossible byte (11111110): ""
Impossible byte (11111111): ""
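
The byte classes exercised above follow directly from the bit patterns given in parentheses. The sketch below classifies a single byte by its leading bits, again assuming the original 1..6-byte layout; classify_byte and its labels are invented for this example. Counting the labels over all 256 byte values reproduces the 64, 32, 16, 8, 4 and 2 quoted in the test descriptions.

```python
from collections import Counter


def classify_byte(b: int) -> str:
    """Classify one byte by its role in the original 1..6-byte UTF-8 layout."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b < 0xC0:
        return "continuation"   # 10xxxxxx
    if b < 0xE0:
        return "lead-2"         # 110xxxxx
    if b < 0xF0:
        return "lead-3"         # 1110xxxx
    if b < 0xF8:
        return "lead-4"         # 11110xxx
    if b < 0xFC:
        return "lead-5"         # 111110xx
    if b < 0xFE:
        return "lead-6"         # 1111110x
    return "impossible"         # 11111110 and 11111111


if __name__ == "__main__":
    counts = Counter(classify_byte(b) for b in range(256))
    for label, n in counts.items():
        print(label, n)
```
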
2-byte sequence with last byte missing: ""
3-byte sequence with last byte missing: ""
4-byte sequence with last byte missing: ""
5-byte sequence with last byte missing: ""
6-byte sequence with last byte missing: ""
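
A decoder that reads from a stream has to tell "incomplete so far" apart from "truncated for good". The following sketch shows one way to do that with Python's incremental UTF-8 decoder from the standard codecs module; decode_stream is a hypothetical helper, and the euro sign (E2 82 AC) is only an example input, not one of the sequences in this test.

```python
import codecs


def decode_stream(chunks) -> str:
    """Strictly decode UTF-8 delivered in arbitrary chunks.

    Bytes of a sequence split across chunks are buffered. A sequence that
    is still incomplete when the input ends raises UnicodeDecodeError.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True tells the decoder that no more input will follow.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    # Complete once the second chunk arrives.
    print(decode_stream([b"\xe2\x82", b"\xac"]))
    # Last byte missing for good: the strict decoder reports an error.
    try:
        decode_stream([b"\xe2\x82"])
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

When the missing byte is followed by ordinary text, as with the closing quotation mark in the tests above, a lenient decoder should substitute the truncated sequence and still deliver the quote.
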
All these 5 sequences with last byte missing concatenated:
""
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>