Re: Conformance (was UTF, BOM, etc)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jan 21 2005 - 17:12:49 CST

Next message: Gregg Reynolds: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Mark Leisher: "The "JDGI" file grows [was re: UTF-8, BOM, 32'nd bit]"
In reply to: Richard T. Gillam: "RE: Conformance (was UTF, BOM, etc)"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Conformance"

On 21/01/2005 19:26, Richard T. Gillam wrote:

> ...
>
>Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
>sequence FEFF (or some variant of it) as a byte order mark when it
>appears at the beginning of a text stream. ...
>

Thank you.

> ...
>
>UTF-8 is both an encoding form and an encoding scheme, and it doesn't do
>anything special with EF BB BF. It always comes through as U+FEFF, the
>ZWNBSP. Applications that use EF BB BF as a signal that the text stream
>is in UTF-8 and not some other encoding are implementing a higher-level
>protocol based on UTF-8. UTF-8 itself doesn't treat this sequence as
>special.
>
>

This is not correct, at least with the UTF-8 encoding SCHEME. See the
following from TUS section 15.9, pp.401-402:

> In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>.
> Although there are never any questions of byte order with UTF-8 text,
> this sequence can serve as signature for UTF-8 encoded text where the
> character set is unmarked. ...

> Systems that use the byte order mark must recognize when an initial
> U+FEFF signals the byte order. In those cases, it is not part of the
> textual content and should be removed before processing, because
> otherwise it may be mistaken for a legitimate zero width no-break
> space.

This clearly implies that this byte sequence in UTF-8 is sometimes to be
interpreted as a BOM and not as the character ZWNBSP, in fact not as
part of the textual content at all. This also implies that Antoine is
wrong to say that the UTF-8 BOM produced by Notepad etc. is a bug.
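
To make the two readings concrete, here is a minimal Python sketch. It is
my own illustration, not anything taken from TUS. It relies on two codecs
from Python's standard library: "utf-8", which passes an initial U+FEFF
through as a character, and "utf-8-sig", which treats an initial EF BB BF
as a signature and drops it. The variable names are made up for the
example.

```python
import codecs

# EF BB BF followed by the UTF-8 bytes for "abc"
data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Reading 1: the initial U+FEFF is textual content (ZWNBSP) and is kept.
as_character = data.decode("utf-8")

# Reading 2: the initial U+FEFF is a signature and is removed.
as_signature = data.decode("utf-8-sig")

# A U+FEFF that is not at the start of the stream is content either way.
inner_decoded = "a\ufeffb".encode("utf-8").decode("utf-8-sig")
```

Both decoders accept exactly the same byte stream; they differ only in
what they do with the first three bytes.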

>For that matter, applications that use the full panoply of
>signature-byte sequences (0000FEFF for UTF-32BE, FFFE0000 for UTF-32LE,
>FEFF for UTF-16BE, FFFE for UTF-16LE, EF BB BF for UTF-8, etc.) to
>determine whether a byte stream is Unicode and what Unicode encoding
>scheme it is are also implementing a higher-level protocol based on
>Unicode.
>
>
>
Arguably, the Unicode encoding FORM is the lower-level protocol
interface and the Unicode encoding SCHEME is the higher-level protocol
interface. The latter includes the code point U+FEFF; the former
includes the character ZWNBSP. The protocol which converts between these
two may insert or interpret stream initial U+FEFF as a BOM, or simply
convert it to or from ZWNBSP. This implies that the UTF-8 encoding FORM
and the UTF-8 encoding SCHEME are not identical, although they agree in
all other respects, which is why they are commonly confused.
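
A rough sketch of such a converting protocol, in Python, might look like
the following. It is only an illustration under my own assumptions: the
function names read_scheme and write_scheme and the SIGNATURES table are
hypothetical, and falling back to UTF-8 when no signature is present is a
policy chosen for the example, not something the standard mandates. The
BOM constants come from Python's standard codecs module.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32LE) begins with FF FE,
# which on its own is the UTF-16LE signature.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def read_scheme(data, default="utf-8"):
    """Interpret a stream-initial signature, strip it, and decode the rest.

    Returns (text, encoding_name). Without a signature, the assumed
    default encoding is used and nothing is stripped.
    """
    for signature, encoding in SIGNATURES:
        if data.startswith(signature):
            return data[len(signature):].decode(encoding), encoding
    return data.decode(default), default


def write_scheme(text, encoding="utf-8", signature=True):
    """Encode text, optionally inserting the signature for the encoding."""
    prefix = b""
    if signature:
        # Raises StopIteration for an encoding not listed in SIGNATURES.
        prefix = next(sig for sig, enc in SIGNATURES if enc == encoding)
    return prefix + text.encode(encoding)
```

For example, read_scheme(write_scheme("abc", "utf-16-le")) gives back
("abc", "utf-16-le"). Note one ambiguity the sketch does not resolve:
UTF-16LE text that begins with a signature followed by U+0000 has the same
first four bytes as the UTF-32LE signature.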

Unix, it would appear, supports the UTF-8 encoding FORM but not the
UTF-8 encoding SCHEME. Windows Notepad supports the UTF-8 encoding
SCHEME but not the UTF-8 encoding FORM.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Fri Jan 21 2005 - 18:08:15 CST