Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jan 21 2005 - 17:12:49 CST


    On 21/01/2005 19:26, Richard T. Gillam wrote:

    > ...
    >
    >Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
    >sequence FEFF (or some variant of it) as a byte order mark when it
    >appears at the beginning of a text stream. ...
    >

    Thank you.

    > ...
    >
    >UTF-8 is both an encoding form and an encoding scheme, and it doesn't do
    >anything special with EF BB BF. It always comes through as U+FEFF, the
    >ZWNBSP. Applications that use EF BB BF as a signal that the text stream
    >is in UTF-8 and not some other encoding are implementing a higher-level
    >protocol based on UTF-8. UTF-8 itself doesn't treat this sequence as
    >special.
    >
    >

    This is not correct, at least with the UTF-8 encoding SCHEME. See the
    following from TUS section 15.9, pp.401-402:

    > In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>.
    > Although there are never any questions of byte order with UTF-8
    > text, this sequence can serve as signature for UTF-8 encoded text
    > where the character set is unmarked. ...

    > Systems that use the byte order mark must recognize when an initial
    > U+FEFF signals the byte order. In those cases, it is not part of the
    > textual content and should be removed before processing, because
    > otherwise it may be mistaken for a legitimate zero width no-break
    > space.

    This clearly implies that this byte sequence in UTF-8 is sometimes
    to be interpreted as a BOM and not as the character ZWNBSP, in fact not
    as part of the textual content at all. It also implies that Antoine is
    wrong to say that the UTF-8 BOM produced by Notepad etc. is a bug.
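
As an illustration of the two readings (a sketch only, not something from
TUS or from this thread): Python's standard codecs happen to offer both.
The "utf-8" codec passes an initial EF BB BF through as U+FEFF, while
"utf-8-sig" treats it as a signature and drops it. The helper name
strip_initial_bom is hypothetical; it shows the "remove before processing"
step for a system that has already decoded the text.

```python
import codecs

# EF BB BF followed by the UTF-8 bytes of "text"
data = codecs.BOM_UTF8 + "text".encode("utf-8")

# Plain "utf-8": the sequence is an ordinary character, so U+FEFF survives.
as_character = data.decode("utf-8")

# "utf-8-sig": a leading EF BB BF is a signature, not textual content.
as_signature = data.decode("utf-8-sig")


def strip_initial_bom(text: str) -> str:
    """Remove one leading U+FEFF; any later U+FEFF is kept as ZWNBSP.

    Hypothetical helper, assuming the caller has decided that an
    initial U+FEFF is a signature rather than content.
    """
    return text[1:] if text.startswith("\ufeff") else text
```

Only the stream-initial occurrence is touched; a U+FEFF in the middle of
the text is still a zero width no-break space under either reading.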

    >For that matter, applications that use the full panoply of
    >signature-byte sequences (0000FEFF for UTF-32BE, FFFE0000 for UTF-32LE,
    >FEFF for UTF-16BE, FFFE for UTF-16LE, EF BB BF for UTF-8, etc.) to
    >determine whether a byte stream is Unicode and what Unicode encoding
    >scheme it is are also implementing a higher-level protocol based on
    >Unicode.
    >
    >
    >
    Arguably, the Unicode encoding FORM is the lower-level protocol
    interface and the Unicode encoding SCHEME is the higher-level protocol
    interface. The latter includes the code point U+FEFF; the former
    includes the character ZWNBSP. The protocol which converts between these
    two may insert or interpret stream initial U+FEFF as a BOM, or simply
    convert it to or from ZWNBSP. This implies that the UTF-8 encoding FORM
    and the UTF-8 encoding SCHEME are not identical, although they agree in
    all other respects, which is why they are commonly confused.
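
A conversion layer of the kind described above might look like the
following. This is a minimal sketch under stated assumptions, not a
definitive implementation: the names SIGNATURES and decode_scheme are
hypothetical, the fallback to UTF-8 for unmarked streams is an assumption,
and the signature table is the one quoted earlier in this message.

```python
import codecs

# Longer signatures are tested first, because FF FE 00 00 (UTF-32LE)
# begins with FF FE (UTF-16LE).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def decode_scheme(data: bytes, default: str = "utf-8") -> str:
    """Decode a byte stream, consuming an initial signature if present.

    The signature is interpreted as a BOM and is not returned as part
    of the text. Unmarked streams are decoded with `default` (an
    assumption made for this sketch).
    """
    for bom, encoding in SIGNATURES:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)
```

One known ambiguity: a UTF-16LE stream whose first character after the
BOM is U+0000 starts with the same four bytes as the UTF-32LE signature,
so a sniffer like this one will read it as UTF-32LE.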

    Unix, it would appear, supports the UTF-8 encoding FORM but not the
    UTF-8 encoding SCHEME. Windows Notepad supports the UTF-8 encoding
    SCHEME but not the UTF-8 encoding FORM.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/

