Re: UTF-8 stress test file?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 12 2004 - 14:18:20 CST

    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Doug Ewell" <dewell@adelphia.net>
    Sent: Tuesday, October 12, 2004 8:24 PM
    Subject: Re: UTF-8 stress test file?

    > From: "Doug Ewell" <dewell@adelphia.net>
    >> Theodore H. Smith <delete at elfdata dot com> wrote:
    >>
    >>>> - the file mixes UTF-8 and UTF-16
    >>>
    >>> Does this file mix UTF-8 and UTF-16? I thought it just had surrogates
    >>> encoded into UTF-8? Of course a surrogate should never exist in UTF-8.
    >>
    >> You are right. Philippe's statement was incorrect, and also puzzling.

    What is much more puzzling is the text contained in the referenced file:
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

    Examples of bad assumptions that a reader could make:

    - [quote](...) Experience so far suggests
    that most first-time authors of UTF-8 decoders find at least one
    serious problem in their decoder by using this file.[/quote]

    This suggests to readers that if their browser or editor does not display
    the test text as indicated, there is a problem in that application.
    But given that the file does not conform to UTF-8 because of the "errors"
    it contains *on purpose*, no assumption should be made about how a browser
    or text editor will behave with the content of that file. Any difference
    from what the text says is "expected" is really not a bug, given that the
    whole file is incorrect and is *not* UTF-8 encoded. In fact, if your browser
    or editor still lets you view it as if it were UTF-8, and indicates to the
    user that it is UTF-8 encoded without warning about the encoding
    violations that should be detected, I really think that this browser or
    editor is not conforming. A conforming browser or editor can load that
    document without encoding violations only by assuming that it is encoded
    instead in ISO-8859-1 or ISO-8859-2 or any other complete 8-bit encoding (an
    encoding that has no invalid code position, so ISO-8859-4 should not work
    without similar warnings). The only thing that could be said is that the
    document respects only the ISO 10646-1:2000 standard, but not its later
    version and not Unicode (so a browser or editor could still accept the
    document as being encoded with UTF-8:2000, but not with UTF-8).
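The behaviour argued for above (detect the violations, then fall back to a complete 8-bit encoding) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_text` is a hypothetical helper name, and the choice of ISO-8859-1 as the fallback is one option among the complete 8-bit encodings mentioned.

```python
from typing import Tuple


def load_text(raw: bytes) -> Tuple[str, str]:
    """Decode raw bytes and report which encoding was actually used.

    Hypothetical helper for illustration; not taken from any browser or editor.
    """
    try:
        # Python's default error handler is "strict": any malformed
        # sequence raises UnicodeDecodeError instead of being hidden.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte value 0x00-0xFF,
        # so this fallback cannot raise a decoding error.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    text, used = load_text(b"plain ASCII is valid UTF-8")
    print(used)
    text, used = load_text(b"\xc0\xaf")  # overlong form of "/"
    print(used, repr(text))
```

The second element of the returned pair is what lets the caller tell the user which encoding was really applied, rather than silently labelling the result as UTF-8.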

    - [quote](...) All lines in this file are exactly 79 characters long (plus
    the line feed). In addition, all lines end with "|", except for the two test
    lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
    U+0000 and U+007F. If you display this file with a fixed-width font,
    these "|" characters should all line up in column 79 (right margin).[/quote]

    Nothing is wrong if lines are displayed with more or fewer characters, or if
    "|" characters are not vertically aligned when using a fixed-width font.

    - [quote] (...)
    1 Some correct UTF-8 text

    You should see the Greek word 'kosme': "κόσμε"
    (...) [/quote]

    You can see the Greek word here in this message (because this message is
    properly UTF-8 encoded), but nothing is wrong in your editor or browser if
    the word is not readable as indicated, and you see for example the string
    "κόσμε" when your editor or browser loads the file as an ISO-8859-1
    text.
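What a reader sees when UTF-8 bytes are loaded as ISO-8859-1 can be reproduced directly. The sketch below is an assumption-laden illustration: `misread_as_latin1` is a hypothetical name, and the Greek word is spelled here with U+03CC for the accented omicron, which may not be the exact code point used in the test file.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    Hypothetical helper: it mimics an editor that loads a UTF-8 file
    with the wrong (but complete) 8-bit encoding.
    """
    return text.encode("utf-8").decode("iso-8859-1")


# Assumed spelling of the Greek word, written with escapes so that this
# source file does not itself depend on how it is encoded.
WORD = "\u03ba\u03cc\u03c3\u03bc\u03b5"

if __name__ == "__main__":
    garbled = misread_as_latin1(WORD)
    # Each Greek letter here takes two bytes in UTF-8, so the misread
    # string has two characters per original letter.
    print(len(WORD), len(garbled))
    print(ascii(garbled))
```

No error is raised at any point, which is the point being made: the misreading is legitimate behaviour for an application that treats the file as ISO-8859-1, and it is fully reversible by re-encoding the garbled string as ISO-8859-1 and decoding it as UTF-8.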

    - All of section 3 "Malformed sequences" should not be readable at all, or
    could display random characters when the text is loaded as ISO-8859-1. Don't
    expect to see "?" even if Internet Explorer displays them without warning the
    user (this is a violation of the current UTF-8 encoding rules).

    - Same thing for section 4 "Overlong sequences" (prohibited in UTF-8, but
    tolerated in UTF-8:2000, i.e. the RFC version used by ISO 10646:2000). If you
    see "?" characters without other warnings, your browser is not conforming,
    exactly like browsers that would display the indicated slash "/".

    - Section 5 "Illegal code positions" (single and paired "UTF-16" surrogates)
    is the one that should immediately throw an exception in the browser's UTF-8
    decoder to force it to retry with another encoding (possibly with UTF-8:2000,
    or with ISO-8859-1). Nothing is wrong in your browser if you see sequences
    like "í €" or "í¿¿" when the file is loaded as Windows-1252, or if lines do
    not line up or have strange layout when the file is loaded as ISO-8859-1.

    - Subsection 5.3 "Other illegal code positions" also forgets all the illegal
    *code points* (not "code positions"!) that are permanently reserved in the
    16 other planes (outside the BMP), and the illegal positions found in the
    Arabic compatibility block.
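The points about sections 3, 4, 5 and 5.3 can be checked with a short Python sketch. It is an illustration only: `strict_utf8_ok`, `SAMPLES` and `is_noncharacter` are hypothetical names, the byte sequences are chosen here as one representative per category, and Python's built-in strict decoder stands in for "a conforming UTF-8 decoder".

```python
def strict_utf8_ok(raw: bytes) -> bool:
    """Return True only if raw decodes under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# One illustrative byte sequence per category (chosen for this sketch).
SAMPLES = {
    "section 3, lone continuation byte": b"\x80",
    "section 4, overlong encoding of '/'": b"\xc0\xaf",
    "section 5, surrogate U+D800": b"\xed\xa0\x80",
    "section 5, surrogate U+DFFF": b"\xed\xbf\xbf",
}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        # Every sample is rejected as UTF-8 but decodes as Windows-1252.
        print(f"{label}: strict UTF-8 ok = {strict_utf8_ok(raw)}, "
              f"as Windows-1252 = {ascii(raw.decode('cp1252'))}")
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print("noncharacters in all 17 planes:", total)
```

The strict decoder raises on every sample instead of substituting "?" or "/", while the same bytes decode without complaint as Windows-1252. The `is_noncharacter` helper only identifies the reserved code points in the Arabic Presentation Forms range and at the end of each plane; how a decoder should treat them is left open here.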

    So who's puzzling here? Not me! It's the content of the text itself.


