Re: Unicode and end users

From: David Starner (starner@okstate.edu)
Date: Fri Feb 15 2002 - 14:24:20 EST


On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
> If there is a file on disc called foo.txt, it is clearly not typed data.
> Thus, it appears to be Mr Davis' opinion that when such a file contains
> UTF-8 data, it is quite appropriate for there to be a BOM at the start.

In a global sense, it may be appropriate for a UTF-8 file to have a BOM.
However, in a Unix context - and UTF-8 was originally designed for Unix
and Unix-like systems - it is worthless and annoying.

Take, for example, three files:

A: <BOM>C<LF>AB<LF>
B: <BOM>ABC<LF>AB<LF>
C: <BOM>ABCDEFG<LF>

and the operation

  grep "AB" A B > file; cat C >> file

you'll end up with

file: A:AB<LF>B:<BOM>ABC<LF>B:AB<LF><BOM>ABCDEFG<LF>

That's a document with two BOMs, neither of them at the start of the file.
There's no simple way to fix this: grep doesn't know whether it's working
on UTF-8 text, raw binary, or Latin-1 (I frequently do grep foo file |
recode l1..utf-8), and it doesn't know whether its output is going to
the screen, a file, the tail of a file, or the input of another
program.
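
The example can be checked without creating any files. What follows is only a rough Python sketch: the grep() function is a hypothetical byte-oriented stand-in, not the real tool, and it mimics just the "name:" prefix that grep adds to each matching line when given more than one file. The file contents are the three from the example, with <BOM> written as the UTF-8 bytes EF BB BF.

```python
# Sketch of: grep "AB" A B > file; cat C >> file
# Assumption: grep() below is an illustrative stand-in, not real grep.

BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF

files = {
    "A": BOM + b"C\nAB\n",
    "B": BOM + b"ABC\nAB\n",
    "C": BOM + b"ABCDEFG\n",
}

def grep(pattern, names):
    """Return matching lines, each prefixed with its file name and a colon.

    Works purely on bytes, so it has no idea that the first three bytes
    of a file might be a BOM rather than part of the first line.
    """
    out = b""
    for name in names:
        for line in files[name].splitlines(keepends=True):
            if pattern in line:
                out += name.encode("ascii") + b":" + line
    return out

# "> file" followed by "cat C >> file" is plain byte concatenation.
result = grep(b"AB", ["A", "B"]) + files["C"]

print(result.count(BOM))       # how many BOMs ended up in the output
print(result.startswith(BOM))  # whether the output begins with one
```

This prints 2 and False: two BOMs survive, both in the middle of the data, and the output does not start with one.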

Again, while UTF-8 BOMs might work globally, on Unix they will be more
of a nuisance than a help.

-- 
David Starner / Давид Старнэр - starner@okstate.edu
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."


