RE: PRODUCING and DESCRIBING UTF-8 with and without BOM

From: Joseph Boyle (Boyle@siebel.com)
Date: Mon Nov 04 2002 - 22:41:40 EST

Next message: Tex Texin: "Re: PRODUCING and DESCRIBING UTF-8 with and without BOM"

Previous message: Michael Everson: "Re: In defense of Plane 14 language tags (long)"
Maybe in reply to: Joseph Boyle: "PRODUCING and DESCRIBING UTF-8 with and without BOM"
Next in thread: Tex Texin: "Re: PRODUCING and DESCRIBING UTF-8 with and without BOM"
Reply: Tex Texin: "Re: PRODUCING and DESCRIBING UTF-8 with and without BOM"
Reply: Doug Ewell: "Re: PRODUCING and DESCRIBING UTF-8 with and without BOM"

Yes, the software business is largely about dealing with the BADLY WRITTEN,
the TRIVIAL, and the BRAIN-DEAD. Your point?

Newline problems are a good analogy. They still require bookkeeping of
different formats and attention in any new coding and cause new bugs, even
though the problem has been around for decades. Nobody is holding their
breath for any of the platforms to change their newline convention to match
the others or even update all their tools to deal with the differences -
bare LF still doesn't work in Notepad.

-----Original Message-----
From: Edward H Trager [mailto:ehtrager@umich.edu]
Sent: Monday, November 04, 2002 9:19 AM
To: Unicode Mailing List
Subject: Re: PRODUCING and DESCRIBING UTF-8 with and without BOM

Hi, everyone,

It's almost unbelievable to me how many email postings are wasted on
discussions such as this UTF-8 BOM issue ... I guess it means that there is
a lot of BADLY WRITTEN software out there in the world ;-)

With regard to READING incoming UTF-8 text streams, surely any good software
designer will do exactly as Michael Michka has suggested here:

> INCOMING TEXT: Trivial to simply check. I say (once again) it's THREE
> BYTES.
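
For illustration only, the three-byte check might look like the Python sketch below. The function name strip_utf8_bom is made up for this example and is not from any of the posts; codecs.BOM_UTF8 is the standard-library constant for the bytes EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper: only the first three bytes are examined, so a
    U+FEFF that appears later in the stream is left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
    without_bom = "héllo".encode("utf-8")
    # Both inputs decode to the same text once the BOM is removed.
    print(strip_utf8_bom(with_bom).decode("utf-8"))
    print(strip_utf8_bom(without_bom).decode("utf-8"))
```

A reader written this way accepts both flavours of input, which is all the quoted advice asks for.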

With regard to EMITTING outgoing UTF-8 text streams, IMHO the default should
be to do what is simplest, which is *not* output the BOM. It is superfluous
to have it on UTF-8 streams. There's no harm in having a global option to
turn BOM outputting on for the benefit of BRAIN-DEAD programs that are going
to read the text:

> EMITTING: They could simply choose globally whether to emit the BOM or
> not. If they wanted to get "fancy" they could have a command line
> option which said whether to emit the bytes or not. But that is
> optional.
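
As a rough sketch of that global switch, assuming a Python writer: the name encode_utf8 and the emit_bom parameter are invented for this example, and the default follows the suggestion above of not writing the BOM.

```python
import codecs


def encode_utf8(text: str, emit_bom: bool = False) -> bytes:
    """Encode text as UTF-8, optionally prefixed with the BOM.

    Hypothetical helper: emit_bom defaults to False, so the BOM is only
    written when the caller asks for it (for example, for a consumer
    that refuses BOM-less input).
    """
    payload = text.encode("utf-8")
    if emit_bom:
        return codecs.BOM_UTF8 + payload
    return payload


if __name__ == "__main__":
    print(encode_utf8("abc"))                 # no BOM by default
    print(encode_utf8("abc", emit_bom=True))  # BOM requested explicitly
```

A command-line tool could map a flag of its own onto emit_bom, but as the quoted text says, that part is optional.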

The whole issue is analogous to the CR\LF issue in ASCII texts across
different platforms. Well-written software is able to READ the text
properly regardless of whether lines end in CR, LF, or CR\LF.
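
To make the analogy concrete, here is a minimal Python sketch of a reader that treats CR, LF, and CR followed by LF as the same line break. The name split_lines is hypothetical. A regular expression is used rather than str.splitlines() because the latter also breaks on several other characters (form feed, U+2028, and so on), which goes beyond the three conventions discussed here.

```python
import re

# Try the two-character sequence first so CR LF counts as one line break.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_lines(text: str) -> list:
    """Split text into lines, accepting CR, LF, or CR LF as terminators."""
    lines = _NEWLINE.split(text)
    # A trailing terminator leaves an empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


if __name__ == "__main__":
    mixed = "unix\nmac\rdos\r\nlast"
    print(split_lines(mixed))
```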


This archive was generated by hypermail 2.1.5 : Mon Nov 04 2002 - 23:25:26 EST