From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Oct 10 2004 - 15:59:25 CST
>> I'd like to see a UTF-8 stress test file.
>> It should consist of lines of UTF-8, separated each by a newline.
>> Each line should be malformed. Also, some idea of how to deal with
>> the malformed UTF-8 should be noted in a separate file.
>> Really, I just want some way to verify that I can detect every kind
>> of UTF-8 wrongness. I have some code I adapted from Unicode.org, but
>> I want to make sure my adaptions haven't broken the code.
>
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
"This file is not meant to be a conformance test. It does
not prescribes any particular outcome and therefore there is no way to
"pass" or "fail" this test file, even though the texts suggests a
preferable decoder behaviour at some places."
I'm wondering if Unicode.org has a proper conformance test? If not, I
suggest they make one. One where each test is separated by a single
newline, and no non-test lines exist... unless they wanted to make
some kind of "comment line" which is easy to parse (let's say starting
the line with "#").
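A rough sketch of how a file in that format could be read, assuming one test
per line, newline-separated, with comment lines starting with "#". No such
file exists yet, so the format is only the one proposed above, and the name
iter_test_lines is made up for illustration. The data is handled as raw
bytes, since the test lines are expected to be malformed UTF-8.

```python
def iter_test_lines(data: bytes):
    """Yield (line_number, raw_bytes) for each test line in `data`.

    Assumed (hypothetical) format: tests are separated by b"\n",
    and any line starting with b"#" is a comment.
    """
    for number, line in enumerate(data.split(b"\n"), start=1):
        if line.startswith(b"#"):
            continue  # comment line, not a test
        if line == b"":
            continue  # assumption: empty lines (e.g. after a trailing newline) carry no test
        yield number, line
```

Splitting on the byte 0x0A is safe here because that byte never occurs inside
a well-formed multi-byte UTF-8 sequence.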
For me to use that test programmatically, I'll need to break out my
non-UTF-8-aware text editor, delete all the non-test lines, and then
separate out the good and the bad UTF-8 into different files! That way I
can use readline-type code to do my UTF-8 verification.
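The good/bad separation could probably be scripted instead of done by hand.
A minimal sketch, using Python's built-in strict UTF-8 decoder as a stand-in
for the detector being verified (the code adapted from Unicode.org is not
shown here, so this is not that code). The names is_valid_utf8 and
split_good_bad are made up for illustration.

```python
def is_valid_utf8(line: bytes) -> bool:
    """Return True if `line` decodes as strict UTF-8, False otherwise."""
    try:
        line.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def split_good_bad(lines):
    """Partition an iterable of raw byte lines into (good, bad) lists."""
    good, bad = [], []
    for line in lines:
        if is_valid_utf8(line):
            good.append(line)
        else:
            bad.append(line)
    return good, bad
```

Each list can then be written to its own file in binary mode, and the decoder
under test run over both: every line in the "bad" file should be rejected and
every line in the "good" file accepted.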
It would be nice if someone had an "automated test ready" UTF-8 file.
If not, I'll modify this one and then put the results up on my website
someday (in a week or so).
-- Theodore H. Smith - Software Developer. http://www.elfdata.com