From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 12 2004 - 14:18:20 CST
From: "Philippe Verdy" <verdy_p@wanadoo.fr>
To: "Doug Ewell" <dewell@adelphia.net>
Sent: Tuesday, October 12, 2004 8:24 PM
Subject: Re: UTF-8 stress test file?
> From: "Doug Ewell" <dewell@adelphia.net>
>> Theodore H. Smith <delete at elfdata dot com> wrote:
>>
>>>> - the file mixes UTF-8 and UTF-16
>>>
>>> Does this file mix UTF-8 and UTF-16? I thought it just had surrogates
>>> encoded into UTF-8? Of course a surrogate should never exist in UTF-8.
>>
>> You are right. Philippe's statement was incorrect, and also puzzling.
What is much more puzzling is the content of the referenced file:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Examples of bad assumptions that a reader could make:
- [quote](...) Experience so far suggests
that most first-time authors of UTF-8 decoders find at least one
serious problem in their decoder by using this file.[/quote]
This suggests to readers that if their browser or editor does not display
the test text as indicated, there is a problem in that application. But
given that the file does not conform to UTF-8 because of the "errors" it
contains *on purpose*, no assumption should be made about how a browser or
text editor will behave with its content. Any difference from what the text
says is "expected" is really not a bug, given that the whole file is
incorrect and is *not* UTF-8 encoded. In fact, if your browser or editor
still lets you view it as if it were UTF-8, and indicates to the user that
it is UTF-8 encoded without warning about the encoding violations that
should be detected, I really think that this browser or editor is not
conforming. A conforming browser or editor should load that document
without encoding violations by assuming instead that it is encoded with
ISO-8859-1, ISO-8859-2 or any other complete 8-bit encoding (an encoding
that has no invalid code position, so ISO-8859-3, which leaves some
positions unassigned, should not work without similar warnings). The only
thing that could be said is that the document respects only the
ISO 10646-1:2000 standard, but not its later version and not Unicode (so a
browser or editor could still accept the document as being encoded with
UTF-8:2000, but not with UTF-8).
- [quote](...) All lines in this file are exactly 79 characters long (plus
the line feed). In addition, all lines end with "|", except for the two test
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
U+0000 and U+007F. If you display this file with a fixed-width font,
these "|" characters should all line up in column 79 (right margin).[/quote]
Nothing is wrong if lines are displayed with more or fewer characters, or if
"|" characters are not vertically aligned when using fixed-width fonts.
- [quote] (...)
1 Some correct UTF-8 text
You should see the Greek word 'kosme': "κόσμε"
(...) [/quote]
You can see the Greek word here in this message (because this message is
properly UTF-8 encoded), but nothing is wrong in your editor or browser if
the word is not readable as indicated and you see, for example, the string
"Îºá½¹ÏƒÎ¼Îµ" when your editor or browser loads the file as an ISO-8859-1
text.
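To make the difference concrete, here is a small Python sketch of what the
same bytes look like under a legacy 8-bit reading. The exact code points of
the Greek word are an assumption (in particular U+1F79 for the accented
omicron); the variable names are hypothetical.

```python
# The Greek word from section 1, written with explicit escapes. U+1F79
# (omicron with oxia) is an assumption about the code point actually used.
word = "\u03ba\u1f79\u03c3\u03bc\u03b5"
raw = word.encode("utf-8")

print(len(word), "characters,", len(raw), "bytes")
print(raw.hex(" "))                    # the bytes as stored in the file
print(raw.decode("cp1252"))            # what a Windows-1252 reader shows
print(repr(raw.decode("iso8859_1")))   # ISO-8859-1 maps 0x83 to a C1 control
```

Both legacy decodings succeed without any error, which is why a reader that
picked one of them has nothing to warn about.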
- All of section 3 "Malformed sequences" should not be readable at all, or
could display random characters when the text is loaded as ISO-8859-1. Don't
expect to see "?" even if Internet Explorer displays them without warning the
user (this is a violation of the current UTF-8 encoding rules).
- Same thing for section 4 "Overlong sequences" (prohibited in UTF-8, but
tolerated in UTF-8:2000, i.e. the RFC version used by ISO 10646:2000). If you
see "?" characters without other warnings, your browser is not conforming,
exactly like browsers that would display the indicated slash "/".
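The overlong case can be illustrated with a short Python sketch. The function
naive_decode_2byte is a hypothetical, deliberately lenient decoding step (not
taken from any real browser); it is contrasted with Python's built-in strict
decoder and with silent substitution.

```python
def naive_decode_2byte(lead: int, trail: int) -> int:
    """Combine a 2-byte sequence without checking for overlong forms.

    Hypothetical lenient decoder step, shown only to illustrate the problem.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


overlong_slash = b"\xc0\xaf"

# A lenient decoder recovers the code point of the slash.
print(hex(naive_decode_2byte(*overlong_slash)))

# Python's built-in decoder is strict and refuses the sequence.
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# Substituting replacement characters hides the violation from the user.
print(overlong_slash.decode("utf-8", errors="replace"))
```

Of the three behaviours, only the strict one reports the encoding violation to
the caller; the other two display something and carry on.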
- Section 5 "Illegal code positions" (single and paired "UTF-16" surrogates)
is the one that should immediately throw an exception in the browser's UTF-8
decoder to force it to retry with another encoding (possibly with UTF-8:2000,
or with ISO-8859-1). Nothing is wrong in your browser if you see sequences
like "í €" or "í¿¿" when the file is loaded as Windows-1252, or if lines do
not line up or have strange layout when the file is loaded as ISO-8859-1.
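As a rough Python sketch of that "throw, then retry" behaviour for the
surrogate case (the variable names are hypothetical, and the two byte strings
are the UTF-8-style encodings of U+D800 and U+DFFF, chosen here as examples):

```python
high_surrogate = b"\xed\xa0\x80"   # UTF-8-style encoding of U+D800
low_surrogate = b"\xed\xbf\xbf"    # UTF-8-style encoding of U+DFFF

for raw in (high_surrogate, low_surrogate):
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Strict decoding fails, so fall back to a legacy 8-bit reading.
        print(repr(raw.decode("cp1252")))

# Only the explicit "surrogatepass" error handler lets the value through.
print(hex(ord(high_surrogate.decode("utf-8", errors="surrogatepass"))))
```

The strict decoder never yields a surrogate on its own; getting one requires
opting in explicitly, which is the kind of separate mode described above.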
- Subsection 5.3 "Other illegal code positions" also forgets all the illegal
*code points* (not "code positions"!) that are permanently reserved in the
16 other planes (outside the BMP), and the illegal positions found in the
Arabic compatibility block.
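For reference, the permanently reserved code points mentioned in the last
item can be enumerated with a few lines of Python. This is a sketch: the
helper name is_noncharacter is hypothetical, and the ranges are the Unicode
noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the
17 planes).

```python
def is_noncharacter(cp: int) -> bool:
    """True for code points that Unicode permanently reserves."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    in_arabic_block = 0xFDD0 <= cp <= 0xFDEF   # Arabic Presentation Forms-A
    ends_plane = (cp & 0xFFFE) == 0xFFFE       # U+nFFFE and U+nFFFF
    return in_arabic_block or ends_plane


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
outside_bmp = [cp for cp in noncharacters if cp > 0xFFFF]

print(len(noncharacters), "in total,", len(outside_bmp), "outside the BMP")
```

Note that Python's own UTF-8 codec does not reject these values, so a test of
this kind needs an explicit check such as the one sketched here.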
So who's puzzling here? Not me! It's the content of the text itself.