Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 12 Jul 2012 16:06:11 +0200

Right. Unix was unique when it was created in that it was built to
handle all files as unstructured binary files. The history elsewhere
is quite different: text files have always used another paradigm,
based on line records. Ends of lines initially were not really
control characters. And even today the Unix-style end of line (as
promoted on other systems through the C language) does not follow the
international standard (CR+LF, which was NOT a Microsoft invention
for DOS or Windows).

In fact the "plain text" concept was created by taking the common
denominator of many historical terminal protocols and file system
protocols. ASCII tried to unify all this, but most controls were only
given symbolic names, not standard functions.

Now, to reconcile the various incarnations of plain text, we also
have to live with at least 4 end-of-line styles in plain-text files,
yet many terminal protocols still only consider CR+LF (except those
for Unix shells, which use the original AT&T definition from the C
language; but even in that case, Unix terminals have also used other
conventions with various emulations: see termcap).
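
As an aside (Python below is my own choice of illustration, not part
of the original discussion), modern runtimes already absorb most of
these styles:

    # The four common end-of-line styles: LF (Unix), CR (classic Mac
    # OS), CR+LF (the international standard, used by DOS/Windows and
    # most terminal protocols), and NEL (U+0085, from the EBCDIC world).
    sample = "unix\nmac\rdos\r\nebcdic\u0085"
    # str.splitlines() accepts all four (plus a few more separators).
    print(sample.splitlines())  # ['unix', 'mac', 'dos', 'ebcdic']
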
Maybe you would think that "cat utf8file1.txt utf8file2.txt
>utf8file.txt" would create problems. For plain-text files, this is
no longer a problem, even if extra BOMs remain in the middle, where
they act as no-ops.
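
Here is a minimal sketch of why this is harmless (Python used purely
for illustration, with invented file contents):

    import codecs

    # Simulate "cat utf8file1.txt utf8file2.txt": plain byte
    # concatenation of two UTF-8 files that each start with a BOM.
    part1 = codecs.BOM_UTF8 + "Hello\n".encode("utf-8")
    part2 = codecs.BOM_UTF8 + "World\n".encode("utf-8")
    combined = part1 + part2

    # The result is still well-formed UTF-8; the second BOM simply
    # decodes to U+FEFF, a zero-width code point acting as a no-op.
    assert combined.decode("utf-8") == "\ufeffHello\n\ufeffWorld\n"
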
now try "cat utf8file1.txt utf16file2.txt > unknownfile.txt" and it
will not work. IT will not work as well each time you'll have text
files using various SBCS or DBCS encodings (there's never been any
standard encoding in the Unic filesystem, simply because the
concention was never stored in it; previous filesystems DID have the
way to track the encoding by storing metadata; even NTFS could track
the encoding, without guessing it from the content).
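
A minimal demonstration of the failure (again an illustrative Python
sketch):

    import codecs

    utf8_part = codecs.BOM_UTF8 + "Hello\n".encode("utf-8")
    utf16_part = codecs.BOM_UTF16_LE + "World\n".encode("utf-16-le")
    mixed = utf8_part + utf16_part  # what a plain "cat" would produce

    try:
        mixed.decode("utf-8")
    except UnicodeDecodeError as err:
        # Decoding stops at the 0xFF of the UTF-16 BOM: the bytes 0xFE
        # and 0xFF can never occur anywhere in well-formed UTF-8.
        print(err)
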
Nothing in fact prohibits Unix from supporting filesystems with
out-of-band metadata. But for now, you have to assume that the "cat"
tool is only usable for concatenating binary sequences, in arbitrary
order: it is not properly a tool for handling text files.

Use "ucat" instead to indicate that the input files are in some
standard UTF, and the BOMs will be silently handled, and the various
UTF's will be automatically recognizd with their leading BOM (all
other BOMs will be ignored and discarded, possibly with just a warning
if you work in a really pedantic mode).
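
I don't know of a standard implementation of such a "ucat", so here
is a minimal sketch of the behavior described above (in Python; the
UTF-8 fallback for BOM-less input is my assumption):

    import codecs
    import sys

    # (BOM, codec) pairs; the longer UTF-32 BOMs must be tested before
    # the UTF-16 ones, since BOM_UTF16_LE is a prefix of BOM_UTF32_LE.
    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def detect(data):
        for bom, codec in BOMS:
            if data.startswith(bom):
                return codec, len(bom)
        return "utf-8", 0  # no BOM: fall back to a default encoding

    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            data = f.read()
        codec, offset = detect(data)
        text = data[offset:].decode(codec)
        # Silently discard interior BOMs (U+FEFF) left by previous
        # concatenations, as described above.
        sys.stdout.write(text.replace("\ufeff", ""))
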
No-op codes are not a problem. They have always existed in all
terminal protocols, for various functions such as padding. Even
Unicode documents characters that have no meaning at all except in
very limited contexts (and these characters are strongly discouraged
in "plain text" documents).

More and more tools are now aware of the BOM as a convenient way to
work reliably with various UTFs. Its absence means that the platform
default encoding, the host default, or the encoding selected by the
user in their locale environment will be used.

BOMs are in fact most useful in contexts where the storage or
transmission platform does not allow storing out-of-band metadata
about the encoding. A BOM is extremely small, and it does not impact
performance.
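
Python's "utf-8-sig" codec is one concrete example of this kind of
BOM-aware behavior:

    # "utf-8-sig" writes a BOM on output and silently consumes one on
    # input (the file name is only for this example).
    with open("example.txt", "w", encoding="utf-8-sig") as f:
        f.write("Hello")
    with open("example.txt", "rb") as f:
        print(f.read())   # b'\xef\xbb\xbfHello' (3-byte BOM on disk)
    with open("example.txt", "r", encoding="utf-8-sig") as f:
        print(f.read())   # 'Hello' (the BOM was stripped on input)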

The BOM should now even be completely ignorable in all contexts,
including in the middle of combining sequences. There will remain a
few contexts where it may cause harm, but software that still does
not ignore it semantically (notably some syntactic parsers used by
programming languages) should be corrected (this also includes HTML5,
where it should have been ignored as well, except for its sole
function of allowing the autodetection of a standard UTF, wherever it
occurs, as opposed to the legacy "platform default").

It could even be changed so that it could be present in any
UTF-encoded text to allow transitions between distinct UTFs (for
example when concatenating UTF-8 text, which best suits most
alphabetic scripts, with UTF-16 text, which best suits ideographic
and Hangul scripts, or historic scripts encoded only outside the BMP
with few occurrences of ASCII punctuation and controls). It would act
as an out-of-band control function (treated like surrogates, which
have a dedicated special function and are not really characters by
themselves).
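
To make the idea concrete, here is a speculative sketch of such a
decoder (entirely my own illustration: this is NOT standard Unicode
behavior, and the scan for BOM byte sequences glosses over code-unit
alignment inside UTF-16/32 segments):

    import codecs

    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def decode_with_transitions(data):
        """Decode a byte stream where each BOM switches the current UTF."""
        out, codec, i = [], "utf-8", 0  # no leading BOM: assume UTF-8
        while i < len(data):
            for bom, new_codec in BOMS:
                if data.startswith(bom, i):
                    codec, i = new_codec, i + len(bom)  # switch UTFs
                    break
            else:
                # Decode the segment up to the next BOM, if any.
                hits = [data.find(bom, i) for bom, _ in BOMS]
                hits = [h for h in hits if h != -1]
                end = min(hits) if hits else len(data)
                out.append(data[i:end].decode(codec))
                i = end
        return "".join(out)

    # Example: a UTF-8 alphabetic segment, then UTF-16-BE ideographs.
    blob = (codecs.BOM_UTF8 + "Caf\u00e9 ".encode("utf-8")
            + codecs.BOM_UTF16_BE + "\u6f22\u5b57".encode("utf-16-be"))
    print(decode_with_transitions(blob))  # Café 漢字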

This solution would solve many problems and maximize interoperability
(there is no universal interoperability solution that can solve all
problems, but at least the UCS with its standardized UTFs solves
many). It would solve problems far more often than it would create
them with old legacy applications (most of which can be fixed by
updating or upgrading the same software). The old legacy solutions
will then become something needed only by a few geeks, and instead of
blocking them when they persist in maintaining such content, it will
be more valuable for them to isolate that content and offer it
through a proxying conversion filter.

BOMs are then not a problem but a solution (not the only one, but one
that helps fill the gap when other solutions are not usable or
available).

2012/7/12 Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk>:
> On 2012-07-12, Steven Atreju <snatreju_at_googlemail.com> wrote:
>> In the future simple things like '$ cat File1 File2 > File3' will
>> no longer work that easily. Currently this works *whatever* file,
>> and even program code that has been written more than thirty years
>> ago will work correctly. No! You have to modify content to get it
>> right!!
>
> Nice rant, but actually this has never worked like that. You can't cat
> .csv files with headers, html files, images, movies, or countless
> other "just files" and get a meaningful result, and never have been
> able to.
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>