From: Martin Duerst (duerst@w3.org)
Date: Mon Feb 17 2003 - 17:36:53 EST
Some comments:
- If you can avoid it, don't use a BOM at the start of a UTF-8
HTML file. Without one, the file will display nicely on more browsers.
- The W3C Validator http://validator.w3.org/ accepts the BOM for
HTML 4.01, and also XHTML. It probably should produce a warning.
It did when I originally added code to handle it. I have requested
that it be added again.
- Adding a BOM/ZWNBSP to the whitespace definition is a bad idea,
because it would allow a ZWNBSP in all kinds of places where
not seeing a space would be confusing (e.g. between attributes).
Also, HTML 4 is only being maintained, not being developed.
- That HTML 4.0 allows ZWSP (&#x200B;) as whitespace in
http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for
line breaking/rendering reasons (Thai), within element content.
This is in conflict with the whitespace definition for syntactic
purposes, which is formally given at
http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does
not include ZWSP (&#x200B;). I have filed a request for
clarification.
- RFC 2279 does not approve or disapprove of the BOM. Both Unicode
and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279
is being updated. See
http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.
- For XML, a BOM at the start of UTF-8 is allowed by an erratum at
http://www.w3.org/XML/xml-V10-2e-errata#E22. But similar to HTML,
it is better not to start your XML files with a BOM, because there are
XML parsers out there that don't like it (and rejecting it was okay
at least until 2001-07-25).
- The BOM is both rather handy in a Windows/Notepad scenario and
seriously disruptive in a Unix-like filter scenario (which may
also be on Windows). I have found that Notepad doesn't need the
BOM to detect that a file is UTF-8 if it has enough other information
(this is on a Japanese Win2000, your mileage may vary). It would be
nice if it had a setting to not produce a BOM.
- I append a small Perl program that removes a UTF-8 BOM if there
is one. Quite handy, I use it regularly. Feel free to use and change
it at your own risk.
(i.e. if it starts to eat up your files, don't blame me!)
Regards, Martin.
#!/usr/bin/perl
# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)
use strict;
use warnings;

if (@ARGV > 1) {
    print STDERR "Too many arguments!\n";
    exit 1;
}

my @file;        # file content
my $lineno = 0;
my $filename = $ARGV[0];

if (defined $filename) {
    # read everything first, because the file is rewritten on the spot
    open my $in, '<', $filename or die "Cannot read $filename: $!\n";
    binmode $in;
    while (<$in>) {
        s/^\xEF\xBB\xBF// if !$lineno++;   # only the first line can carry the BOM
        push @file, $_;
    }
    close $in;
    open my $out, '>', $filename or die "Cannot write $filename: $!\n";
    binmode $out;
    print {$out} @file;
    close $out or die "Cannot close $filename: $!\n";
}
else {  # STDIN -> STDOUT
    binmode STDIN;
    binmode STDOUT;
    while (<STDIN>) {
        s/^\xEF\xBB\xBF// if !$lineno++;
        push @file, $_;
    }
    print @file;
}
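
The program above collects the whole input in memory before writing
anything. That is needed when a file is rewritten on the spot, but not
for the plain filter case. The following is only a sketch of a streaming
variant under that assumption. The subroutine name copy_without_bom and
the sample data are made up for illustration and are not part of the
original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy everything from $in to $out, dropping a
# UTF-8 BOM (bytes EF BB BF) if the very first line starts with one.
# Lines are written as soon as they are read, so nothing is buffered.
sub copy_without_bom {
    my ($in, $out) = @_;
    binmode $in;
    binmode $out;
    my $first = 1;
    while (my $line = <$in>) {
        if ($first) {
            $line =~ s/^\xEF\xBB\xBF//;   # a BOM later in the data is left alone
            $first = 0;
        }
        print {$out} $line;
    }
    return;
}

# Small demonstration with in-memory filehandles instead of real files.
my $input  = "\xEF\xBB\xBF<html>\n<body>\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!\n";
open my $out, '>', \$output or die "Cannot open output: $!\n";
copy_without_bom($in, $out);
close $out;
print $output;
```

To use the same subroutine as a filter, it should be enough to call
copy_without_bom(\*STDIN, \*STDOUT) in place of the demonstration at
the end.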