RE: PRODUCING and DESCRIBING UTF-8 with and without BOM

From: Joseph Boyle
Date: Mon Nov 04 2002 - 22:41:40 EST


    Yes, the software business is largely about dealing with the BADLY WRITTEN,
    the TRIVIAL, and the BRAIN-DEAD. Your point?

    Newline problems are a good analogy. They still require bookkeeping of
    different formats and attention in any new coding and cause new bugs, even
    though the problem has been around for decades. Nobody is holding their
    breath for any of the platforms to change their newline convention to match
    the others or even update all their tools to deal with the differences -
    bare LF still doesn't work in Notepad.

    -----Original Message-----
    From: Edward H Trager
    Sent: Monday, November 04, 2002 9:19 AM
    To: Unicode Mailing List
    Subject: Re: PRODUCING and DESCRIBING UTF-8 with and without BOM

    Hi, everyone,

    It's almost unbelievable to me how many email postings are wasted on
    discussions such as this UTF-8 BOM issue ... I guess it means that there is
    a lot of BADLY WRITTEN software out there in the world ;-)

    With regard to READING incoming UTF-8 text streams, surely any good software
    designer will do exactly as Michael Michka has suggested here:

    > INCOMING TEXT: Trivial to simply check. I say (once again) it's THREE
    > BYTES.
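
    For illustration, the three-byte check could look something like the
    Python sketch below. The helper names strip_utf8_bom and decode_utf8 are
    made up for this example; only codecs.BOM_UTF8 (the bytes EF BB BF) comes
    from the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 input, accepting it with or without a leading BOM."""
    return strip_utf8_bom(data).decode("utf-8")
```

    Python's built-in "utf-8-sig" codec performs the same check when
    decoding, so data.decode("utf-8-sig") is an alternative to the
    hand-written helper.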

    With regard to EMITTING outgoing UTF-8 text streams, IMHO the default should
    be to do what is simplest, which is to *not* output the BOM. It is superfluous
    to have it on UTF-8 streams. There's no harm in having a global option to
    turn BOM outputting on for the benefit of BRAIN-DEAD programs that are going
    to read the text:

    > EMITTING: They could simply choose globally whether to emit the BOM or
    > not. If they wanted to get "fancy" they could have a command line
    > option which said whether to emit the bytes or not. But that is
    > optional.
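
    A rough sketch of that emitting policy, again in Python: no BOM by
    default, with an opt-in switch. The function names and the --bom flag
    are hypothetical, chosen only for this example.

```python
import argparse
import codecs


def encode_utf8(text: str, emit_bom: bool = False) -> bytes:
    """Encode text as UTF-8, prepending the BOM only when asked to."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if emit_bom else payload


def parse_args(argv):
    """Parse a hypothetical --bom command-line switch (off by default)."""
    parser = argparse.ArgumentParser(description="Emit UTF-8 text.")
    parser.add_argument(
        "--bom",
        action="store_true",
        help="prepend the UTF-8 BOM for readers that insist on it",
    )
    return parser.parse_args(argv)
```

    A caller would then write encode_utf8(text, emit_bom=args.bom), so the
    choice is made once, globally, rather than at every output site.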

    The whole issue is analogous to the CR/LF issue in ASCII texts across
    different platforms. Well-written software is able to READ the text
    properly regardless of whether lines end in CR, LF, or CRLF.
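
    The same tolerant-reader idea for line endings might be sketched like
    this in Python; split_lines is a made-up helper name. It replaces CRLF
    first so that a CRLF pair is not counted as two line breaks.

```python
def split_lines(text: str) -> list:
    """Split text into lines, accepting CR, LF, or CRLF as the terminator."""
    # Normalize CRLF before bare CR so "\r\n" becomes one break, not two.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.split("\n")
```

    The built-in str.splitlines() also handles all three conventions, but
    it additionally breaks on other characters (form feed and U+2028, for
    example), which may or may not be wanted.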
