Re: Names for UTF-8 with and without BOM

From: John Cowan (jcowan@reutershealth.com)
Date: Sat Nov 02 2002 - 18:46:39 EST


    Tex Texin scripsit:

    > I didn't think the XML standard allowed for utf-8 files to have a BOM.

    This capability was never actually excluded, and it was added explicitly
    by erratum (and by force majeure, once it became clear that BOMful UTF-8
    was going to become common). XML files are intended to be plain text, and
    if a large source of plain text insists on a BOM, so be it.

    > The standard is quite clear about requiring 0xFEFF for utf-16.
    > I would have thought a proper parser would reject a non-utf-16 file
    > beginning with something other than "<".

    If by "<" you mean the *character* "<", then yes. If you mean the *byte*
    0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32),
    0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16
    in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin
    with some other byte: for example, 0x2B in UTF-7.
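
The byte-level cases above can be sketched as a small prefix check in Python. This is only an illustration, not taken from any particular parser: the function name sniff_xml_family and the labels it returns are made up for this example, and the byte patterns assume the document starts with a BOM or with "<" (for EBCDIC and UTF-7, with "<?xm" and "+ADw" respectively).

```python
def sniff_xml_family(prefix: bytes) -> str:
    """Guess the encoding family of an XML file from its first bytes.

    Hypothetical helper for illustration; the labels are informal.
    """
    # Longer signatures come first: FF FE 00 00 (UTF-32LE BOM) also starts
    # with FF FE (UTF-16LE BOM), and 3C 00 00 00 also starts with 3C 00.
    signatures = [
        (b"\x00\x00\xfe\xff", "UTF-32BE (BOM)"),
        (b"\xff\xfe\x00\x00", "UTF-32LE (BOM)"),
        (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
        (b"\xfe\xff", "UTF-16BE (BOM)"),
        (b"\xff\xfe", "UTF-16LE (BOM)"),
        # No BOM: what the character "<" looks like in each family.
        (b"\x00\x00\x00\x3c", "UTF-32BE"),
        (b"\x3c\x00\x00\x00", "UTF-32LE"),
        (b"\x00\x3c", "UTF-16BE"),
        (b"\x3c\x00", "UTF-16LE"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),   # "<?xm" in EBCDIC
        (b"\x2b\x41\x44\x77", "UTF-7"),    # "+ADw", a shifted "<"
        (b"\x3c", "ASCII-compatible"),     # UTF-8, ISO 8859-x, etc.
    ]
    for signature, family in signatures:
        if prefix.startswith(signature):
            return family
    return "unknown"


if __name__ == "__main__":
    for sample in (b"\xef\xbb\xbf<?xml", b"<?xml", b"\xff\xfe<\x00?\x00"):
        print(sample[:4].hex(" "), "->", sniff_xml_family(sample))
```

The result is only a family: a real parser still has to read the encoding declaration (if any) to pick the exact encoding within it.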

    > (The fact that notepad puts it there should be irrelevant.)

    Actual practice is never quite irrelevant.

    -- 
    John Cowan   jcowan@reutershealth.com   http://www.reutershealth.com
        "Mr. Lane, if you ever wish anything that I can do, all you will have
            to do will be to send me a telegram asking and it will be done."
        "Mr. Hearst, if you ever get a telegram from me asking you to do
            anything, you can put the telegram down as a forgery."
    

