Re: FW: Unicode and the UTF8 encoding in HTML

From: Otto Stolz ([email protected])
Date: Mon Sep 17 2001 - 10:36:27 EDT


Hello Unicoders,

this should finally go to the FAQ.

Hello JG,

on Thursday, September 13, 2001 1:23 AM, James Gardner wrote:
> I have a Microsoft Active Server Page which is saved
> as an ANSI file.

Note that older Microsoft documentation abused the term
ANSI for Microsoft's proprietary CP 1252 code, cf.
<http://czyborra.com/charsets/codepages.html#CP1252>.
(The lower half of that code table equals ASCII, hence
it is not repeated in codepages.html; you can see it in
<http://czyborra.com/charsets/iso646.html>.) Though I do
not know Microsoft Active Server and you did not mention
the editor you were using, I guess your page is in CP 1252.

If you gave a sample URL, I could verify this conjecture.
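
To make the difference between the two encodings concrete,
here is a minimal Python sketch (the sample characters are
chosen only for illustration). It shows that ASCII characters
are encoded identically in CP 1252 and UTF-8, while anything
beyond ASCII is not:

```python
# Compare how CP 1252 and UTF-8 encode the same characters.

# ASCII text (the lower half of the CP 1252 table) is byte-for-byte
# identical in ASCII, CP 1252 and UTF-8.
ascii_text = "Hello"
same_for_ascii = (
    ascii_text.encode("ascii")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("utf-8")
)

# Non-ASCII characters differ: one byte in CP 1252, several in UTF-8.
euro_cp1252 = "\u20ac".encode("cp1252")      # b'\x80'
euro_utf8 = "\u20ac".encode("utf-8")         # b'\xe2\x82\xac'
e_acute_cp1252 = "\u00e9".encode("cp1252")   # b'\xe9'
e_acute_utf8 = "\u00e9".encode("utf-8")      # b'\xc3\xa9'

print(same_for_ascii, euro_cp1252, euro_utf8)
```

So a page that contains only ASCII characters looks the same
in both encodings; the trouble starts with the first
character outside ASCII.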

> In this file I specify that it should use UTF8 encoding.

Does that mean you have a line in your page saying
<meta http-equiv=Content-Type content="text/html; charset=UTF-8">
?

If so, that line does declare the encoding to the browser,
but it does not cause the server to convert the page to
UTF-8 (as you apparently are assuming). In other words, if
you include that line in a CP1252 coded file, you are sending
the browser astray. Read more about HTML Document Representation in
<http://www.w3.org/TR/html401/charset.html>.
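
What "sending the browser astray" amounts to can be sketched
in Python as follows. This is only an illustration under the
assumption that the file was saved in CP 1252 and is decoded
as UTF-8 because of the Meta line; the sample text is made up,
and real browsers may recover differently:

```python
# A page saved in CP 1252 but declared (and hence decoded) as UTF-8.
page_text = "Price: 5 \u20ac, caf\u00e9"
stored_bytes = page_text.encode("cp1252")  # what the editor wrote to disk

# A strict UTF-8 decoder rejects the CP 1252 bytes outright.
try:
    stored_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD (the replacement character)
# for every byte sequence that is not valid UTF-8.
lenient = stored_bytes.decode("utf-8", errors="replace")

# The opposite mistake (UTF-8 bytes read as CP 1252) does not fail;
# it silently shows several wrong characters per original character.
mojibake = "\u20ac".encode("utf-8").decode("cp1252")

print(strict_ok, lenient, mojibake)
```

Either way, the declared and the actual encoding must agree;
the Meta line alone changes nothing in the stored bytes.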

Note also that (according to the official specifications)
pre-4.0 HTML could only contain Latin-1 characters, cf.
<http://czyborra.com/charsets/iso8859.html>.

What you really have to do:
- write your source in HTML 4.0 (or later) or in XML,
   including an appropriate document type declaration, cf.
   <http://www.w3.org/TR/html401/struct/global.html#h-7.2>;
- include the above-mentioned Meta tag in your HTML source;
- store the HTML source file in UTF-8 encoding;
- make sure that your server does not generate an
   HTTP header field that would contradict your charset
   setting.
Then, a suitable up-to-date browser should properly display
your page, provided that all required characters are contained
in the font used for display. Cf.
<http://www.hclrss.demon.co.uk/unicode/browsers.html>,
and <http://www.hclrss.demon.co.uk/unicode/fonts.html>,
respectively.
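
The steps listed above could be sketched in Python like this.
Everything here is an assumption made for illustration: the
file name, the page content, and the helper names
write_utf8_page, header_charset and header_contradicts are
hypothetical, and how you actually save the file or inspect
the header depends on your editor and server:

```python
import os
import tempfile
from email.message import Message

# HTML 4.01 source with a document type declaration and the Meta tag.
HTML_PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Euro test</title>
</head>
<body>
<p>Price: 5 \u20ac</p>
</body>
</html>
"""


def write_utf8_page(path, html):
    """Store the HTML source in UTF-8 encoding."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(html)


def header_charset(content_type_value):
    """Return the lower-cased charset parameter of a Content-Type
    header value, or None if the header names no charset."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()


def header_contradicts(content_type_value, declared="utf-8"):
    """True if the HTTP header names a charset other than the declared one."""
    charset = header_charset(content_type_value)
    return charset is not None and charset != declared.lower()


# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "euro-utf.htm")
write_utf8_page(path, HTML_PAGE)

with open(path, "rb") as f:
    raw = f.read()
raw.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8

print(header_contradicts("text/html; charset=ISO-8859-1"))
print(header_contradicts("text/html; charset=UTF-8"))
```

A header without any charset parameter does not contradict
the Meta tag; one that names a different charset does, and in
HTML 4.01 the HTTP header takes precedence over the Meta tag.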

You may wish to study some examples:
<http://www.rz.uni-konstanz.de/y2k_uralt/test/Euro-UTF.htm>,
<http://www.rz.uni-konstanz.de/y2k_uralt/test/Go-UTF.htm>.

Furthermore, I recommend having your HTML syntax
(including the proper specification of the encoding)
checked by <http://validator.w3.org/>.
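
Before submitting a page to the validator, a rough local
check of the declared encoding is possible. The following
Python sketch is not what the W3C validator does; it is a
simplified stand-in with hypothetical helper names
(MetaCharsetFinder, declared_charset, matches_declaration)
that only tests whether the bytes of a page can be decoded
in the charset named in its Meta tag:

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset named in the first Content-Type Meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            msg = Message()
            msg["Content-Type"] = attributes.get("content") or ""
            self.charset = msg.get_content_charset()


def declared_charset(raw_bytes):
    """Return the charset declared in the page's Meta tag, or None."""
    finder = MetaCharsetFinder()
    # The Meta tag itself is plain ASCII, and Latin-1 decoding never
    # fails, so this is sufficient for locating the declaration.
    finder.feed(raw_bytes.decode("latin-1"))
    return finder.charset


def matches_declaration(raw_bytes):
    """True if the bytes can be decoded in the declared charset."""
    charset = declared_charset(raw_bytes)
    if charset is None:
        return False
    try:
        raw_bytes.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return False
    return True


page = ('<html><head><meta http-equiv=Content-Type '
        'content="text/html; charset=UTF-8"></head>'
        '<body>5 \u20ac</body></html>')
print(matches_declaration(page.encode("utf-8")))
print(matches_declaration(page.encode("cp1252")))
```

Note the limits of such a check: a pure-ASCII page passes
whatever it declares, and a single-byte charset accepts
nearly any byte sequence, so this catches only the
"declared UTF-8, stored otherwise" case reliably.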

Cf. also
<http://www.hclrss.demon.co.uk/unicode/htmlunicode.html>.

> The data (text) that is put into the page when it is created
> by the server is stored as unicode.

This seems to contradict the first sentence quoted above.
Now I am completely at a loss about your real problem.

> Do I need to save a file as unicode as well as specifying utf8
> encoding to properly display unicode on the web?

Definitely yes. You have to create a standard-compliant file,
and you have to tell the reader with which of several possible
standards your file actually complies, so the reader can make
head or tail of it.

> This electronic communication is confidential and for the
> exclusive use of the addressee.

So you post it to a list read world-wide??

Best wishes,
   Otto Stolz



This archive was generated by hypermail 2.1.2 : Mon Sep 17 2001 - 09:32:25 EDT