From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jan 21 2005 - 17:12:49 CST
On 21/01/2005 19:26, Richard T. Gillam wrote:
> ...
>
>Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
>sequence FEFF (or some variant of it) as a byte order mark when it
>appears at the beginning of a text stream. ...
>
Thank you.
> ...
>
>UTF-8 is both an encoding form and an encoding scheme, and it doesn't do
>anything special with EF BB BF. It always comes through as U+FEFF, the
>ZWNBSP. Applications that use EF BB BF as a signal that the text stream
>is in UTF-8 and not some other encoding are implementing a higher-level
>protocol based on UTF-8. UTF-8 itself doesn't treat this sequence as
>special.
>
>
This is not correct, at least with the UTF-8 encoding SCHEME. See the
following from TUS section 15.9, pp.401-402:
> In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>.
> Although there are never any questions of byte order with UTF-8 text,
> this sequence can serve as signature for UTF-8 encoded text where the
> character set is unmarked. ...
> Systems that use the byte order mark must recognize when an initial
> U+FEFF signals the byte order. In those cases, it is not part of the
> textual content and should be removed before processing, because
> otherwise it may be mistaken for a legitimate zero width no-break space.
This clearly implies that this byte sequence in UTF-8 is sometimes to be
interpreted as a BOM and not as the character ZWNBSP, and in fact not as
part of the textual content at all. It also implies that Antoine is
wrong to say that the UTF-8 BOM produced by Notepad etc. is a bug.
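
The two readings of an initial EF BB BF can be seen side by side in a small sketch. It assumes Python's standard codecs, where "utf-8" passes the sequence through as U+FEFF and "utf-8-sig" treats it as a signature and removes it; the sample bytes are invented for illustration and are not taken from TUS or from Notepad.

```python
# Bytes as a signature-writing editor might save them:
# the UTF-8 signature EF BB BF followed by the text "hi" (illustrative data).
data = b"\xef\xbb\xbfhi"

# Plain "utf-8": the initial sequence comes through as the character U+FEFF.
as_character = data.decode("utf-8")

# "utf-8-sig": an initial EF BB BF is taken as a signature and dropped,
# so it is not part of the textual content.
as_signature = data.decode("utf-8-sig")

print(repr(as_character))  # '\ufeffhi'
print(repr(as_signature))  # 'hi'
```

Which of the two results is right depends on whether the consumer regards the initial U+FEFF as a signature or as content, which is the point at issue here.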
>For that matter, applications that use the full panoply of
>signature-byte sequences (0000FEFF for UTF-32BE, FFFE0000 for UTF-32LE,
>FEFF for UTF-16BE, FFFE for UTF-16LE, EF BB BF for UTF-8, etc.) to
>determine whether a byte stream is Unicode and what Unicode encoding
>scheme it is are also implementing a higher-level protocol based on
>Unicode.
>
>
>
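
A detector of the kind described in the quoted paragraph might look roughly like the following sketch. The function name `sniff_signature` and the returned encoding labels are assumptions made for illustration; only the `codecs.BOM_*` constants come from the Python standard library.

```python
import codecs

# (signature bytes, encoding label). The UTF-32 entries come first because
# the UTF-32LE signature FF FE 00 00 begins with the UTF-16LE signature FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def sniff_signature(data: bytes):
    """Return (encoding label, payload without signature), or (None, data).

    Hypothetical helper: it only inspects the first bytes and makes no
    attempt to validate the rest of the stream.
    """
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

print(sniff_signature(b"\xef\xbb\xbfhi"))
print(sniff_signature(b"plain bytes"))
```

One known weakness of this approach: UTF-16LE text whose first character after the signature is U+0000 starts with FF FE 00 00 and is reported as UTF-32LE, so the result is a guess rather than a proof.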
Arguably, the Unicode encoding FORM is the lower-level protocol
interface and the Unicode encoding SCHEME is the higher-level protocol
interface. The latter includes the code point U+FEFF; the former
includes the character ZWNBSP. The protocol which converts between these
two may insert or interpret a stream-initial U+FEFF as a BOM, or simply
convert it to or from ZWNBSP. This implies that the UTF-8 encoding FORM
and the UTF-8 encoding SCHEME are not identical, although they agree in
every other respect, which is why they are commonly confused.
Unix, it would appear, supports the UTF-8 encoding FORM but not the
UTF-8 encoding SCHEME. Windows Notepad supports the UTF-8 encoding
SCHEME but not the UTF-8 encoding FORM.
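
The converting protocol described above can be sketched as a thin layer over the plain codec. The names `scheme_encode` and `scheme_decode` are hypothetical, and the policy of always writing a signature is an assumption modelled on the Notepad behaviour discussed in this thread, not something TUS requires.

```python
BOM = "\ufeff"  # U+FEFF: byte order mark / ZWNBSP

def scheme_encode(text: str) -> bytes:
    # Hypothetical writer in the Notepad style: always prepend the signature.
    return (BOM + text).encode("utf-8")

def scheme_decode(data: bytes) -> str:
    # Hypothetical reader: interpret one stream-initial U+FEFF as a
    # signature and remove it; any later U+FEFF is left alone as ZWNBSP.
    text = data.decode("utf-8")
    return text[1:] if text.startswith(BOM) else text

saved = scheme_encode("a")

# A reader that applies the same convention recovers the text exactly.
assert scheme_decode(saved) == "a"

# A reader that knows only the plain codec sees an extra character.
assert saved.decode("utf-8") == "\ufeffa"

# The ambiguity: unsigned bytes that genuinely begin with ZWNBSP lose it.
assert scheme_decode("\ufeffa".encode("utf-8")) == "a"
```

The last assertion shows why the two layers cannot be mixed freely: the reader has no way to tell a signature from a leading ZWNBSP unless it knows which convention the writer followed.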
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 18:08:15 CST