UTF-8 stress test

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Apr 14 1999 - 17:29:34 EDT

Maybe you'll find this useful:

UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn <mkuhn@acm.org> - 1999-04-14

This test text examines how UTF-8 decoders handle various types of
corrupted or otherwise interesting UTF-8 sequences. According to ISO
10646-1, sections R.7 and 2.3c, a device receiving UTF-8 shall
interpret a "malformed sequence in the same way that it interprets a
character that is outside the adopted subset".
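
As a rough illustration of that rule, the sketch below uses Python's
built-in UTF-8 codec. This is an assumption for demonstration only: it is
a stricter, modern decoder than the ISO 10646-1 definition this test
targets (it also rejects 5- and 6-byte sequences and overlong forms). With
errors="replace" it substitutes U+FFFD for malformed input, which is one
common way of treating it like an unsupported character. The helper name
`show` is made up for this example.

```python
def show(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, replacing malformed input with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Well-formed text passes through unchanged.
greek = "κόσμε".encode("utf-8")
print(ascii(show(greek)))

# A lone continuation byte (10000000) is malformed and becomes U+FFFD.
print(ascii(show(b"\x80")))

# Without an error handler the same input is rejected outright.
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Whether a decoder substitutes a marker or refuses the input is a design
choice; the point of the test is that it must not silently accept the
bytes as something else.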

Test sequences (all enclosed in ""):

Correct UTF-8 text (Greek word 'kosme'): "κόσμε"
Correct 2-byte sequence (U+00000080): ""
Correct 3-byte sequence (U+00000800): "ࠀ"
Correct 4-byte sequence (U+00010000): "𐀀"
Correct 5-byte sequence (U+00200000): "��"
Correct 6-byte sequence (U+04000000): "��"
Correct 2-byte sequence (U+000007ff): "߿"
Correct 3-byte sequence (U+0000ffff): ""
Correct 4-byte sequence (U+001fffff): "��"
Correct 5-byte sequence (U+03ffffff): "��"
Correct 6-byte sequence (U+7fffffff): "��"
Overlong 2-byte sequence (U+0000): "��"
Overlong 3-byte sequence (U+0000): "��"
Overlong 4-byte sequence (U+0000): "��"
Overlong 5-byte sequence (U+0000): "��"
Overlong 6-byte sequence (U+0000): "��"
Unexpected continuation byte (10000000): "�"
Another lonely continuation byte (10111111): "�"
Sequence of 2 unexpected continuation bytes: "�"
Sequence of 3 unexpected continuation bytes: "��"
Sequence of 4 unexpected continuation bytes: "��"
Sequence of 5 unexpected continuation bytes: "��"
Sequence of 6 unexpected continuation bytes: "��"
Sequence of 7 unexpected continuation bytes: "��"
Sequence of all 64 possible continuation bytes (10000000-10111111):
"��
��
��
��"
Sequence of all 32 first bytes of 2-byte sequences (11000000-11011111),
each followed by a space character:
"� � � � � � � � � � � � � � � �
� � � � � � � � � � � � � � � � "
Sequence of all 16 first bytes of 3-byte sequences (11100000-11101111),
each followed by a space character: "� � � � � � � � � � � � � � � � "
Sequence of all 8 first bytes of 4-byte sequences (11110000-11110111),
each followed by a space character: "� � � � � � � � "
Sequence of all 4 first bytes of 5-byte sequences (11111000-11111011),
each followed by a space character: "� � � � "
Sequence of all 2 first bytes of 6-byte sequences (11111100-11111101),
each followed by a space character: "� � "
Impossible byte (11111110): "�"
Impossible byte (11111111): "�"
2-byte sequence with last byte missing: "�"
3-byte sequence with last byte missing: "��"
4-byte sequence with last byte missing: "��"
5-byte sequence with last byte missing: "��"
6-byte sequence with last byte missing: "��"
All these 5 sequences with last byte missing concatenated:
"��"
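
The raw bytes of these sequences rarely survive mail archives and viewers
intact, so it can help to regenerate them from their descriptions. The
Python sketch below does that for a few of the categories. The names
`encode_original_utf8`, `continuation_bytes`, `lead2_with_spaces` and
`truncated` are made up for this illustration. The encoder follows the
original 1- to 6-byte UTF-8 layout rather than the later 4-byte limit, and
the code points used for the truncated sequences are an assumption, since
the list above does not say which ones were meant.

```python
def encode_original_utf8(cp: int, length: int = 0) -> bytes:
    """Encode a code point using the original 1- to 6-byte UTF-8 layout.

    With length 0 the shortest form is used; a larger length produces a
    deliberately overlong sequence (for example U+0000 in 2 bytes).
    """
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    if not 0 <= cp < limits[-1]:
        raise ValueError("code point out of range")
    shortest = next(i + 1 for i, lim in enumerate(limits) if cp < lim)
    n = length or shortest
    if n < shortest or n > 6:
        raise ValueError("length cannot hold this code point")
    if n == 1:
        return bytes([cp])
    # Continuation bytes carry 6 payload bits each, lowest bits last.
    tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(n - 1)][::-1]
    # The lead byte has its n high bits set, followed by a zero bit.
    lead = ((0xFF << (8 - n)) & 0xFF) | (cp >> (6 * (n - 1)))
    return bytes([lead] + tail)


# All 64 possible continuation bytes (10000000-10111111).
continuation_bytes = bytes(range(0x80, 0xC0))

# All 32 first bytes of 2-byte sequences, each followed by a space.
lead2_with_spaces = b"".join(bytes([b, 0x20]) for b in range(0xC0, 0xE0))

# Sequences of length 2 to 6 with the last byte missing (assumed to be
# the lowest code point of each length).
firsts = [0x80, 0x800, 0x10000, 0x200000, 0x4000000]
truncated = [encode_original_utf8(cp, n)[:-1]
             for n, cp in zip(range(2, 7), firsts)]

# Overlong encodings of U+0000 in 2 to 6 bytes, shown as hex.
for n in range(2, 7):
    print(n, encode_original_utf8(0, n).hex())
```

Writing these byte strings to a file in binary mode and feeding that file
to the decoder under test avoids any re-encoding by the transport.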

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT