On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
> If there is a file on disc called foo.txt, it is clearly not typed data.
> Thus, it appears to be Mr Davis' opinion that when such a file contains
> UTF-8 data, it is quite appropriate for there to be a BOM at the start.
In a global sense, it may be appropriate for a UTF-8 file to have a BOM.
However, in a Unix context - and UTF-8 was originally designed for Unix
and Unix-like systems - it is worthless and annoying.
Take, for example, three files:
A: <BOM>C<LF>AB<LF>
B: <BOM>ABC<LF>AB<LF>
C: <BOM>ABCDEFG<LF>
and the operation
grep "AB" A B > file; cat C >> file
you'll end up with
file: A:AB<LF>B:<BOM>ABC<LF>B:AB<LF><BOM>ABCDEFG<LF>
That's a document with two BOMs, and none at the start of the file.
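As a rough illustration of that sequence, here is a minimal Python sketch that writes the three files to a temporary directory and mimics the two commands at the byte level. The grep function is a hypothetical stand-in, not real grep: it assumes a plain substring match and the "name:" prefix that grep prints when it searches more than one file.

```python
import os
import tempfile

BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def grep(pattern, paths):
    """Byte-oriented stand-in for `grep pattern file...` (an assumption,
    not real grep): every line containing `pattern` is emitted with a
    "name:" prefix."""
    out = b""
    for path in paths:
        name = os.path.basename(path).encode()
        with open(path, "rb") as f:
            for line in f:
                if pattern in line:
                    out += name + b":" + line
    return out


with tempfile.TemporaryDirectory() as d:
    contents = {
        "A": BOM + b"C\nAB\n",
        "B": BOM + b"ABC\nAB\n",
        "C": BOM + b"ABCDEFG\n",
    }
    paths = {}
    for name, data in contents.items():
        paths[name] = os.path.join(d, name)
        with open(paths[name], "wb") as f:
            f.write(data)

    # grep "AB" A B > file; cat C >> file
    result = grep(b"AB", [paths["A"], paths["B"]])
    with open(paths["C"], "rb") as f:
        result += f.read()

print(result)
print("BOM count:", result.count(BOM))
print("starts with BOM:", result.startswith(BOM))
```

Neither step inspects or rewrites the bytes it copies, so both BOMs land in the middle of the output and nothing puts one at the front.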
There's no simple way to fix this; grep doesn't know if it's working on
UTF-8 text or raw binary or Latin-1 (I frequently do grep foo file |
recode l1..utf-8), and it doesn't know whether its output is going to
the screen or a file or the tail of a file or the input of another
program.
Again, while UTF-8 BOMs might work globally, on Unix they will be more
of a nuisance than a help.
--
David Starner / Давид Старнэр - starner@okstate.edu
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing
with the youth. -- Information Society, "Peace and Love, Inc."