Robert M. Gerlach wrote:
> When saving a webpage from within Microsoft Internet Explorer,
Which version? I've tested your issue with version 5 SP2 (more
precisely: 5.00.3314.2101).
> there are a few notable options...
...for the encoding of the file.
> and I'm really unsure as to what the differences are,
- "Unicode" saves the HTML source in UTF-16-LE encoding with BOM,
cf. <http://www.unicode.org/unicode/faq/utf_bom.html>,
- "Unicode (UTF-8)" saves it in -- guess what? -- UTF-8 encoding,
cf. <http://www.unicode.org/unicode/faq/utf_bom.html>,
- "Western European (ISO)" saves it in ISO 8859-1 encoding,
cf. <http://czyborra.com/charsets/iso8859.html#ISO-8859-1>,
- "Western European (Windows)" saves it in MS CP 1252 encoding,
cf. <http://czyborra.com/charsets/codepages.html#CP1252>.
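To make the difference between the four options concrete, here is a minimal sketch in Python (not what IE does internally). The sample string and the helper name `encode_all` are made up for illustration, and the assumption that the UTF-8 option writes no BOM is not something tested above.

```python
import codecs

# Hypothetical sample text: ASCII letters plus two non-ASCII Latin-1 letters.
SAMPLE = "Grüße"


def encode_all(text):
    """Return the bytes each of the four save options would roughly produce."""
    return {
        # "Unicode": UTF-16 little-endian, preceded by the BOM (FF FE).
        "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
        # Assumption: no BOM is written for UTF-8.
        "Unicode (UTF-8)": text.encode("utf-8"),
        "Western European (ISO)": text.encode("iso-8859-1"),
        "Western European (Windows)": text.encode("cp1252"),
    }


if __name__ == "__main__":
    for option, data in encode_all(SAMPLE).items():
        print(f"{option:28} {len(data):2} bytes  {data.hex(' ')}")
```

For this sample, the two "Western European" options yield identical bytes; they diverge for characters such as the euro sign, which CP 1252 encodes as the single byte 0x80 and ISO 8859-1 cannot represent at all.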
> which is "better," etc.
Some thoughts:
- CP 1252 is a proprietary encoding (though widely understood);
I'd prefer a standard encoding for the sake of portability.
- Both ISO 8859-1 and CP 1252 comprise a limited character
set; if your HTML source contains characters outside this set,
the UTFs are preferable. IE 5 SP2 does not warn you of this
situation; rather, it replaces every single character not
representable in the encoding chosen with the pertinent
numeric character reference (NCR,
cf. <http://www.w3.org/TR/html401/charset.html#h-5.3.1>).
Drawbacks:
· NCRs are hard to edit.
· NCRs take excessive storage (6 to 7 bytes per character).
· NCRs outside the current encoding are not correctly
displayed by Netscape 4.7x browsers.
- UTF-8 is more common for HTML sources than UTF-16.
- UTF-8 does not suffer from the BE vs. LE issue.
- For most alphabetic scripts, a UTF-8 encoded HTML source
takes less storage than a UTF-16 encoded one:
UTF-8 takes 1 byte per ASCII character (used for the HTML
tags, and in Latin-based scripts also for the bulk of the
text); it takes two bytes per character for most other
alphabetic scripts (e.g. Greek, Cyrillic, Hebrew, Arabic).
UTF-16 takes two bytes per character for both ASCII and
non-ASCII characters from alphabetic scripts.
- Both ISO 8859-1 and CP 1252 are handled easily by all text
editors; for the UTFs, you will need a Unicode-capable
text editor (which is no big deal in Win 2000 and Win XP,
otherwise cf. <http://www.hclrss.demon.co.uk/unicode/>
and <http://www.unicode.org/unicode/onlinedat/products.html>).
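The NCR substitution and the storage comparison from the list above can be tried out with a short Python sketch. It only approximates the behaviour described for IE 5 SP2: Python's built-in `xmlcharrefreplace` error handler stands in for IE's replacement, and the helper names `save_as` and `sizes` as well as the sample fragment are hypothetical.

```python
def save_as(text, encoding):
    """Encode text; characters the encoding lacks become decimal NCRs (&#NNN;)."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode(encoding, errors="xmlcharrefreplace")


def sizes(text):
    """Return the byte count of text in each encoding (UTF-16 without BOM)."""
    return {enc: len(save_as(text, enc))
            for enc in ("iso-8859-1", "utf-8", "utf-16-le")}


# Hypothetical fragment: ASCII markup around five Greek letters.
FRAGMENT = "<p>αβγδε</p>"

if __name__ == "__main__":
    print(save_as(FRAGMENT, "iso-8859-1"))
    print(sizes(FRAGMENT))
```

For this fragment, each Greek letter becomes a 6-byte NCR in ISO 8859-1 (37 bytes in total), against 17 bytes in UTF-8 and 24 bytes in UTF-16.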
So, if your HTML source is in a "Western" language, I'd
recommend "Western European (ISO)", otherwise "Unicode (UTF-8)".
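One way to read this recommendation as a mechanical test is sketched below in Python. The function name `recommended_option` is hypothetical, and "Western" is approximated as "every character fits into ISO 8859-1", which is an assumption rather than a definition given here.

```python
def recommended_option(html_source):
    """Suggest a save option: ISO 8859-1 if it can hold the text, else UTF-8."""
    try:
        # Strict encoding fails on the first character outside ISO 8859-1.
        html_source.encode("iso-8859-1")
    except UnicodeEncodeError:
        return "Unicode (UTF-8)"
    return "Western European (ISO)"


if __name__ == "__main__":
    for sample in ("Grüße", "Привет", "Price: 5 €"):
        print(repr(sample), "->", recommended_option(sample))
```

A source that contains the euro sign ends up under "Unicode (UTF-8)" with this check, because that character is outside ISO 8859-1.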
Best wishes,
Otto Stolz
This archive was generated by hypermail 2.1.2 : Tue Dec 04 2001 - 04:49:27 EST