From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Mar 25 2003 - 04:59:21 EST
Stefan Persson wrote:
> Let's say that I have two files, namely file1 & file2, in any Unicode
> encoding, both starting with a BOM, and I compile them into
> one by using
>
> cat file1 file2 > file3
>
> in Unix or
>
> copy file1 + file2 file3
>
> in MS-DOS, file3 will have the following contents:
>
> BOM
> contents from file1
> BOM
> contents from file2
>
> Is this in accordance with the Unicode standard, or do I have
> to remove the second BOM?
IMHO, Unicode should not specify such a behavior. What a shell command is
supposed to do is for the operating system to decide, not for a text
encoding standard.
BTW, consider that both Unix "cat" and DOS "copy" are not limited to Unicode
text files. Actually, they are not even limited to text files at all: you
could use them to concatenate a bitmap with a font with an HTML document
with a spreadsheet... whether the result makes sense or not is up to you
and/or to the applications that will process the resulting file.
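To make the byte-by-byte behaviour concrete, here is a small Python sketch.
The two "files" are hypothetical in-memory byte strings, both assumed to be
UTF-8 with a leading BOM. It only illustrates what a raw concatenation leaves
behind; it does not describe how "cat" or "copy" are implemented.

```python
import codecs

# Two hypothetical in-memory "files", each UTF-8 with a leading BOM.
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Raw concatenation, the moral equivalent of "cat file1 file2 > file3".
file3 = file1 + file2

# The "utf-8-sig" codec drops only a BOM at the very start of the data,
# so the BOM that began file2 survives as U+FEFF in the middle of the text.
text = file3.decode("utf-8-sig")
interior_bom_count = text.count("\ufeff")
```

Whether that leftover U+FEFF does any harm depends on the application that
later reads file3.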
Probably, there should be two separate commands (or different options of the
same command): one to do a raw byte-by-byte concatenation, and one to do an
encoding-aware concatenation of text files.
E.g., imagine a "cat" command with these extensions:
Synopsis:
cat [ -... ] [ -R encoding ] { [ -F encoding ] file }
Description:
...
If neither -R nor any -F is specified, the concatenation is
done byte by byte.
Options:
...
-R specifies the encoding of the resulting *text* file;
-F specifies the encoding of the following *text* file.
Your command above would now expand to something like this:
cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3
Provided with information about the input encodings and the expected output
encoding, "cat" could now correctly handle BOMs, endianness, new-line
conventions, and even perform character set conversions. Without this extra
info, "cat" would retain its good ol' byte-by-byte functionality.
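As a rough illustration of the encoding-aware mode, the following Python
sketch decodes each input with its declared encoding, drops a leading BOM,
and encodes the joined text once in the result encoding. The function name
text_cat and the sample data are hypothetical, and new-line conversion is
left out; this is a sketch of the idea, not of any existing "cat".

```python
def text_cat(parts, result_encoding):
    """Concatenate (data, encoding) pairs as text, then encode the result.

    parts plays the role of the "-F encoding file" pairs and
    result_encoding the role of "-R encoding" (both hypothetical).
    """
    chunks = []
    for data, encoding in parts:
        text = data.decode(encoding)
        # Drop a leading BOM so it cannot end up in the middle of the output.
        if text.startswith("\ufeff"):
            text = text[1:]
        chunks.append(text)
    # Python's "utf-16" codec writes a single BOM at the start when encoding.
    return "".join(chunks).encode(result_encoding)


# Hypothetical inputs: UTF-16LE text with a BOM, and Big5 text.
file1 = "\ufeffabc\n".encode("utf-16-le")
file2 = "中文\n".encode("big5")

file3 = text_cat([(file1, "utf-16-le"), (file2, "big5")], "utf-16")
decoded = file3.decode("utf-16")
```

Here file3 starts with exactly one BOM, and decoding it gives the text of
both inputs with no stray U+FEFF in between.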
Similar options could be added to any Unix command potentially dealing with
text files ("cp", "head", "tail", etc.), as well as to their equivalents in
DOS or other operating systems.
_ Marco