Re: Names for UTF-8 with and without BOM

From: John Cowan (jcowan@reutershealth.com)
Date: Sat Nov 02 2002 - 18:46:39 EST

Next message: David Starner: "Re: Header Reply-To"

Previous message: John Hudson: "Re: ct, fj and blackletter ligatures"
In reply to: Tex Texin: "Re: Names for UTF-8 with and without BOM"
Next in thread: Tex Texin: "Re: Names for UTF-8 with and without BOM"
Reply: Tex Texin: "Re: Names for UTF-8 with and without BOM"

Tex Texin scripsit:

> I didn't think the XML standard allowed for utf-8 files to have a BOM.

This capability was never actually excluded, and was added by erratum
(and force-majeure, when it became clear that BOMful UTF-8 was going to
start becoming common). XML files are intended to be plain text, and
if a large source of plain text insists on a BOM, so be it.

> The standard is quite clear about requiring 0xFEFF for utf-16.
> I would have thought a proper parser would reject a non-utf-16 file
> beginning with something other than "<".

If by "<" you mean the *character* "<", then yes. If you mean the *byte*
0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32),
0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16
in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with
some other byte, e.g. 0x2B in UTF-7.
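
To make the byte cases above concrete, here is a minimal sketch of sniffing
the encoding family from the first few bytes of a file. The function name
sniff_xml_encoding and the labels it returns are hypothetical, chosen only for
illustration. The BOM-less UTF-16 and UTF-32 rows are not among the cases
listed above; they are added on the assumption that the first character is
"<". This guesses a family only and is not a substitute for reading the
encoding declaration.

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess an encoding family from the leading bytes of an XML entity.

    Hypothetical helper for illustration; it only looks at byte signatures.
    """
    # Longer signatures come before their own prefixes: FF FE 00 00
    # (UTF-32LE BOM) also starts with FF FE (UTF-16LE BOM), and
    # 3C 00 00 00 also starts with 3C 00 and with 3C.
    signatures = [
        (b"\x00\x00\xfe\xff", "UTF-32BE (BOM)"),
        (b"\xff\xfe\x00\x00", "UTF-32LE (BOM)"),
        (b"\x00\x00\x00\x3c", "UTF-32BE (no BOM)"),
        (b"\x3c\x00\x00\x00", "UTF-32LE (no BOM)"),
        (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
        (b"\xfe\xff", "UTF-16BE (BOM)"),
        (b"\xff\xfe", "UTF-16LE (BOM)"),
        (b"\x00\x3c", "UTF-16BE (no BOM)"),
        (b"\x3c\x00", "UTF-16LE (no BOM)"),
        (b"\x4c", "EBCDIC family"),
        (b"\x2b", "possibly UTF-7"),
        (b"\x3c", "ASCII-compatible"),
    ]
    for prefix, label in signatures:
        if data.startswith(prefix):
            return label
    return "unknown"


if __name__ == "__main__":
    # A UTF-8 file saved with a BOM, as Notepad writes it.
    print(sniff_xml_encoding(b"\xef\xbb\xbf<?xml version='1.0'?>"))
    # A plain ASCII-compatible file starting with the byte 0x3C.
    print(sniff_xml_encoding(b"<?xml version='1.0'?>"))
```

The table is ordered so that the first match wins; a real parser would then
go on to read the encoding declaration using the guessed family.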

> (The fact that notepad puts it there should be irrelevant.)

Actual practice is never quite irrelevant.

-- 
John Cowan   jcowan@reutershealth.com   http://www.reutershealth.com
    "Mr. Lane, if you ever wish anything that I can do, all you will have
        to do will be to send me a telegram asking and it will be done."
    "Mr. Hearst, if you ever get a telegram from me asking you to do
        anything, you can put the telegram down as a forgery."

This archive was generated by hypermail 2.1.5 : Sat Nov 02 2002 - 19:18:38 EST