RE: Unicode and end users

From: Rick Cameron (Rick.Cameron@crystaldecisions.com)
Date: Fri Feb 15 2002 - 15:41:02 EST


I would say that this isn't an issue specific to Unix - many programmers who
work primarily on Windows like to use command-line text-processing tools.

And OTOH surely the case where a BOM is useful also occurs on Unix: when a
program that wants to operate in Unicode must import a text file. It's my
understanding that multibyte (MBCS) character sets are not uncommon on Unix
(for example, EUC). If my program is running on a Unix system where the
default character set is EUC-JP (as I believe it's called) and it tries to
import a text file containing UTF-8 without a BOM, how is the program
supposed to know that the file contains UTF-8 rather than EUC-JP?
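
To make the question concrete, here is a minimal Python sketch of the kind of
import logic I have in mind. The function names and the "euc_jp" default are
my own assumptions for illustration, not any particular program's behaviour:
if the data starts with the UTF-8 BOM, treat it as UTF-8; otherwise fall back
to the system default character set.

```python
import codecs


def guess_encoding(data: bytes, default: str = "euc_jp") -> str:
    """Pick a codec name for the given file contents.

    Hypothetical helper: 'default' stands in for the system's default
    character set (EUC-JP in the scenario described above).
    """
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    return default


def decode_text(data: bytes, default: str = "euc_jp") -> str:
    """Decode file contents using the BOM if present, else the default."""
    return data.decode(guess_encoding(data, default))


if __name__ == "__main__":
    text = "日本語"
    with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
    legacy = text.encode("euc_jp")
    print(guess_encoding(with_bom))  # BOM found, so UTF-8 is chosen
    print(guess_encoding(legacy))    # no BOM, so the default is chosen
    print(decode_text(with_bom) == decode_text(legacy))
```

A UTF-8 file without a BOM would be handed to the EUC-JP decoder here, which
either fails or yields garbage; that is exactly the case I am asking about.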

So not a Unix problem, but rather a problem with dumb command-line tools. I
wonder whether the GNU people have thought of making their command-line
tools aware of UTF-8 and BOMs.

Thanks

- rick cameron

-----Original Message-----
From: David Starner [mailto:starner@okstate.edu]
Sent: Friday, 15 February 2002 11:24
To: Rick Cameron
Cc: unicode@unicode.org
Subject: Re: Unicode and end users

On Fri, Feb 15, 2002 at 09:47:54AM -0800, Rick Cameron wrote:
> If there is a file on disc called foo.txt, it is clearly not typed
> data. Thus, it appears to be Mr Davis' opinion that when such a file
> contains UTF-8 data, it is quite appropriate for there to be a BOM at
> the start.

In a global sense, it may be appropriate for a UTF-8 file to have a BOM.
However, in a Unix context - and UTF-8 was originally designed for Unix and
Unix-like systems - it is worthless and annoying.

Take, for example, three files:

A: <BOM>C<LF>AB<LF>
B: <BOM>ABC<LF>AB<LF>
C: <BOM>ABCDEFG<LF>

and the operation

  grep "AB" A B > file; cat C >> file

you'll end up with

file: A:AB<LF>B:<BOM>ABC<LF>B:AB<LF><BOM>ABCDEFG<LF>

That's a document with two BOMs, and none at the start of the file. There's
no simple way to fix this; grep doesn't know if it's working on UTF-8 text
or raw binary or Latin-1 (I frequently do grep foo file | recode l1..utf-8),
and it doesn't know whether its output is going to the screen or a file or
the tail of a file or the input of another program.
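
The effect can be reproduced without a shell. The following Python sketch is
only a byte-level simulation under stated assumptions: the grep() function is
a hypothetical stand-in that matches a fixed byte string and prefixes each
hit with "name:", as grep does when given several files. It is not how grep
is implemented.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# The three example files, as raw bytes.
files = {
    "A": BOM + b"C\nAB\n",
    "B": BOM + b"ABC\nAB\n",
    "C": BOM + b"ABCDEFG\n",
}


def grep(pattern: bytes, names: list[str]) -> bytes:
    """Byte-oriented stand-in for 'grep pattern file1 file2 ...'.

    Like the real tool in this scenario, it knows nothing about
    encodings: a BOM is just three more bytes on the first line.
    """
    out = b""
    for name in names:
        for line in files[name].splitlines(keepends=True):
            if pattern in line:
                out += name.encode("ascii") + b":" + line
    return out


# grep "AB" A B > file; cat C >> file
result = grep(b"AB", ["A", "B"]) + files["C"]

if __name__ == "__main__":
    print(result)
    print("BOM count:", result.count(BOM))
    print("starts with BOM:", result.startswith(BOM))
```

The result contains two BOMs and does not begin with one, matching the
example above.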

Again, while UTF-8 BOMs might work globally, in Unix they will be more of
a nuisance than a help.

-- 
David Starner / Давид Старнэр - starner@okstate.edu
Pointless website: http://dvdeug.dhis.org
What we've got is a blue-light special on truth. It's the hottest thing 
with the youth. -- Information Society, "Peace and Love, Inc."


