From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Sat Feb 15 2003 - 21:25:18 EST
Found it! It's forbidden to start an HTML 4.0 page with a UTF-8 BOM. Proof:
1. Open the main page of Unicode. You can see that the HTML header says:
<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>
So, we are talking about HTML 4.0 here. The reference for HTML 4.0 is:
http://www.w3.org/TR/1998/REC-html40-19980424/
The section about the HTML header is Section 7.1, Introduction to the
structure of an HTML document:
http://www.w3.org/TR/1998/REC-html40-19980424/struct/global.html#h-7.1
which mentions:
"An HTML 4.0 document is composed of three parts:
1. a line containing HTML version information,
2. a declarative header section (delimited by the HEAD element),
3. a body, which contains the document's actual content. The body
may be implemented by the BODY element or the FRAMESET element.
White space (spaces, newlines, tabs, and comments) may appear before or
after each section. Sections 2 and 3 should be delimited by the HTML
element."
So "White space" is allowed before the line containing HTML version
information. But what is white space? It is defined in Section 9.1, White
space:
"The document character set includes a wide variety of white space
characters. Many of these are typographic elements used in some
applications to produce particular visual spacing effects. In HTML,
only the following characters are defined as white space characters:
* ASCII space (&#x0020;)
* ASCII tab (&#x0009;)
* ASCII form feed (&#x000C;)
* Zero-width space (&#x200B;)
Line breaks are also white space characters."
So, we need to know what a line break is! Well, Section 9.3.2 defines
that:
"A line break is defined to be a carriage return (&#x000D;), a line feed
(&#x000A;), or a carriage return/line feed pair."
That's all. So the only characters that are allowed in an HTML 4.0 web page
before the HTML header are U+0009, U+000A, U+000C, U+000D, U+0020, and
U+200B. U+FEFF, the BOM, is not among them. QED.
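
The check argued above can be sketched in a few lines of Python. This is
only an illustration, not anything taken from the HTML spec: the names
HTML40_WHITE_SPACE, leading_disallowed and starts_with_utf8_bom are made up
for this example, and it simplifies by treating everything before the first
"<" as the prefix to inspect (a comment, which the spec also allows there,
starts with "<" and so is not examined).

```python
# Characters that HTML 4.0 defines as white space (Sections 9.1 and 9.3.2),
# i.e. the only characters the argument above allows before the doctype.
HTML40_WHITE_SPACE = {
    "\u0009",  # tab
    "\u000A",  # line feed
    "\u000C",  # form feed
    "\u000D",  # carriage return
    "\u0020",  # space
    "\u200B",  # zero-width space
}


def leading_disallowed(text):
    """Return the characters before the first '<' that are not white space.

    Hypothetical helper: an empty list means the prefix is acceptable
    under the reading of the spec given in this message.
    """
    prefix = text.split("<", 1)[0]
    return [ch for ch in prefix if ch not in HTML40_WHITE_SPACE]


def starts_with_utf8_bom(raw):
    """Return True if the raw bytes begin with the UTF-8 encoding of U+FEFF."""
    return raw.startswith(b"\xef\xbb\xbf")


doctype = '<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>'
raw_page = b"\xef\xbb\xbf" + doctype.encode("utf-8")

print(starts_with_utf8_bom(raw_page))
print(leading_disallowed(raw_page.decode("utf-8")))
print(leading_disallowed(" \r\n\t" + doctype))
```

Decoding with Python's plain "utf-8" codec keeps the BOM as U+FEFF, which is
why it shows up in the prefix of the second call.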
roozbeh
PS: UTF-16 is an exception to that, since the BOM is not part of the
document and should be removed for processing.
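
As an illustration of that difference, here is how Python's standard codecs
happen to behave. This shows one decoder's behaviour only, not a statement
about what the HTML or Unicode specs require, and the variable names are
made up for the example: the "utf-16" codec consumes the BOM while decoding,
whereas the plain "utf-8" codec leaves it in the text as U+FEFF.

```python
import codecs

doc = "<!doctype HTML>"

# Same document, serialized with a BOM in each encoding.
utf16_bytes = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")
utf8_bytes = codecs.BOM_UTF8 + doc.encode("utf-8")

# "utf-16" uses the BOM to pick the byte order and then drops it.
utf16_text = utf16_bytes.decode("utf-16")

# Plain "utf-8" does not strip the BOM; it survives as a leading U+FEFF.
utf8_text = utf8_bytes.decode("utf-8")

print(utf16_text == doc)
print(utf8_text[0] == "\ufeff")
```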