Re: UTF-8 stress test file?

From: Philipp Reichmuth (reichmuth@web.de)
Date: Tue Oct 12 2004 - 16:59:30 CST

Next message: Richard Cook: "outside decomposed, inside precomposed"

Previous message: Mike Ayers: "RE: bit notation in ISO-8859-x is wrong"
In reply to: Philippe Verdy: "Re: UTF-8 stress test file?"
Next in thread: Philippe Verdy: "Re: UTF-8 stress test file?"
Reply: Philippe Verdy: "Re: UTF-8 stress test file?"

Philippe Verdy wrote:
> Examples of bad assumptions that a reader could make:
>
> - [quote](...) Experience so far suggests
> that most first-time authors of UTF-8 decoders find at least one
> serious problem in their decoder by using this file.[/quote]
>
> This suggests to the reader that if its browser or editor does not
> display the contained test text as indicated, there's a problem in that
> application.

Well, to me it didn't. After all, the purpose of this file is to be a
stress test for UTF-8 decoders, as indicated in line 1. By testing
their decoders on this file, UTF-8 decoder authors tend to find problems
of some kind in their programs. So where is the problem again?

> But given that the file is not conforming to UTF-8 because
> of the "errors" it contains *on purpose*, No assumption should be made
> about how the browser or text editor will behave with the content of
> that file.

Where is any such assumption being made? Actually, most of your
statements on what is "wrong" with this file are based on the idea that
it imposes some expectations on parser behaviour. However, paragraph 1
explicitly rules this out. So what is the point?

> A conforming browser or editor should load that document without
> encoding violation problems, assuming it is encoded instead with
> ISO-8859-1 [...]

While that may be technically correct behaviour, it would sort of
defeat the purpose of testing a UTF-8 decoder, wouldn't it?
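
To illustrate why loading the file as ISO-8859-1 exercises nothing, here
is a minimal Python sketch. The byte strings are illustrative assumptions
of the kinds of malformed input a UTF-8 stress test typically contains,
not quotations from the file, and `accepts` is a hypothetical helper. Every
byte value is a valid ISO-8859-1 character, so that decoder can never
fail, while a strict UTF-8 decoder has to reject each sample.

```python
# Hypothetical samples of malformed UTF-8; these byte strings are
# illustrative assumptions, not quotations from the stress test file.
MALFORMED = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
}


def accepts(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for label, data in MALFORMED.items():
    print(
        f"{label:28} "
        f"utf-8: {accepts(data, 'utf-8')!s:5}  "
        f"iso-8859-1: {accepts(data, 'iso-8859-1')}"
    )
```

In this sketch the ISO-8859-1 column is always True, which is exactly why
it tells you nothing about the UTF-8 decoder.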

> Nothing is wrong if lines are displayed with more or less characters, or
> if "|" characters are not vertically aligned when using fixed fonts.

Assuming, however, that the file is used for its purpose of testing an
UTF-8 decoder, all lines should indeed align.

>> You should see the Greek word 'kosme': "κόσμε"
>> (...) [/quote]
>
> You can see the Greek word here in this message (because this message is
> properly UTF-8 encoded), but nothing is wrong in your editor or browser
> if the word is not readable as indicated, and you see for example the
> string "Îºá½¹ÏƒÎ¼Îµ" when your editor or browser loads the file as an
> ISO-8859-1 text.

Don't you think you are stretching things a bit? This is a UTF-8
parser stress test file. If an application opens it in a different
encoding, well, of course the results will be different, and things will
not look UTF-8-ish. Again, this is a non-issue. It's like distributing
a Linux binary for testing something and then getting complaints that it
doesn't work under DOS and that it shouldn't make assumptions about
operating systems.
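
For what it's worth, the garbled string quoted above can be reproduced
with a few lines of Python. This is only a sketch under two assumptions:
that the Greek word is spelled with U+1F79 (omicron with oxia), and that
the application treats "ISO-8859-1" as Windows-1252, as many browsers do.
Strict ISO-8859-1 would map byte 0x83 to an invisible control character
rather than to "ƒ".

```python
# Assumption: the word uses U+1F79 (Greek small letter omicron with oxia).
WORD = "\u03ba\u1f79\u03c3\u03bc\u03b5"

# The bytes a well-formed UTF-8 file stores for this word.
raw = WORD.encode("utf-8")

# Misreading those bytes in a single-byte encoding: every byte becomes
# one character, so the 11 bytes show up as 11 characters instead of 5.
misread = raw.decode("cp1252")
print(misread)

# Nothing is lost: re-encoding and decoding correctly restores the word.
restored = misread.encode("cp1252").decode("utf-8")
print(restored)
```

Nothing in the file is damaged by the misreading; the same bytes are
simply being shown under the wrong encoding.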

And so on.

Philipp
