Re: FW: Unicode and the UTF8 encoding in HTML

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Sep 17 2001 - 10:36:27 EDT


Hello Unicoders,

this should finally go to the FAQ.

Hello JG,

on Thursday, September 13, 2001 1:23 AM, James Gardner wrote:
> I have a Microsoft Active Server Page which is saved
> as an ANSI file.

Note that older Microsoft documentation abused the term
ANSI for Microsoft's proprietary CP 1252 code, cf.
<http://czyborra.com/charsets/codepages.html#CP1252>.
(The lower half of that code table equals ASCII, hence
it is not repeated in codepages.html; you can see it in
<http://czyborra.com/charsets/iso646.html>.) Though I do
not know Microsoft Active Server and you did not mention
the editor you were using, I guess your page is in CP 1252.

If you gave me a sample URL, I could verify this conjecture.

> In this file I specify that it should use UTF8 encoding.

Does that mean you have a line in your page saying
<meta http-equiv=Content-Type content="text/html; charset=UTF-8">
?

If so, that line declares the encoding to the browser,
but it does not cause the server to convert the file to UTF-8
(as you apparently are assuming). In other words, if you include
that line in a CP1252-coded file, you are leading the browser
astray. Read more about HTML Document Representation in
<http://www.w3.org/TR/html401/charset.html>.
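
To illustrate what "astray" means here, the following is a minimal
Python sketch (not ASP, and not taken from your page): it encodes a
made-up sample string as CP 1252 and then reads the bytes back as
UTF-8, as a browser would when the Meta tag claims UTF-8. The sample
text and the variable names are my own assumptions.

```python
# Hypothetical sample text containing characters outside ASCII.
text = "Gr\u00f6\u00dfe: 5 \u20ac"   # o-umlaut, sharp s, euro sign

cp1252_bytes = text.encode("cp1252")  # what a CP 1252 ("ANSI") editor stores
utf8_bytes = text.encode("utf-8")     # what the UTF-8 label promises

# A strict UTF-8 decoder rejects the CP 1252 bytes outright.
try:
    cp1252_bytes.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# A lenient decoder (closer to what browsers do) substitutes U+FFFD
# replacement characters for the bytes it cannot interpret.
shown = cp1252_bytes.decode("utf-8", errors="replace")

print(decodes_cleanly)
print(shown)
```

The ASCII part of the text survives, since CP 1252 and UTF-8 agree
there; only the characters from the upper half of the code table are
damaged.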

Note also that (according to the official specifications)
pre-4.0 HTML could only contain Latin-1 characters, cf.
<http://czyborra.com/charsets/iso8859.html>.

What you really have to do:
- write your source in HTML 4.0 (or later) or in XML,
   including an appropriate document type declaration, cf.
   <http://www.w3.org/TR/html401/struct/global.html#h-7.2>;
- include the above-mentioned Meta tag in your HTML source;
- store the HTML source file in UTF-8 encoding;
- make sure that your server does not generate an
   HTTP header field that would contradict your charset
   setting.
Then, a suitable up-to-date browser should properly display
your page, provided that all required characters are contained
in the font used for display. Cf.
<http://www.hclrss.demon.co.uk/unicode/browsers.html>,
and <http://www.hclrss.demon.co.uk/unicode/fonts.html>,
respectively.
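
The first three steps above can be sketched as follows. This is only
an illustration in Python under my own assumptions (the file name
euro-utf8.htm, the page content, and the helper save_as_utf8 are all
hypothetical); your editor or server-side script would do the
equivalent.

```python
import os
import tempfile

# HTML 4.01 source with a document type declaration and the Meta tag
# that declares the encoding. The page content is made up.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Euro test</title>
</head>
<body>
<p>Price: 5 \u20ac</p>
</body>
</html>
"""

def save_as_utf8(path, text):
    """Store the source so that the bytes on disk really are UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

path = os.path.join(tempfile.mkdtemp(), "euro-utf8.htm")
save_as_utf8(path, PAGE)

# Read the raw bytes back to confirm what was actually stored.
with open(path, "rb") as f:
    raw = f.read()

print(b"\xe2\x82\xac" in raw)   # the euro sign as three UTF-8 bytes
```

The fourth step (the HTTP header) depends on the server configuration
and is not covered by this sketch.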

You may wish to study some examples:
<http://www.rz.uni-konstanz.de/y2k_uralt/test/Euro-UTF.htm>,
<http://www.rz.uni-konstanz.de/y2k_uralt/test/Go-UTF.htm>.

Furthermore, I recommend having your HTML syntax
(including the proper specification of the encoding)
checked by <http://validator.w3.org/>.

Cf. also
<http://www.hclrss.demon.co.uk/unicode/htmlunicode.html>.

> The data (text) that is put into the page when it is created
> by the server is stored as unicode.

This seems to contradict the first sentence quoted above.
Now, I am completely at a loss about your real problem.

> Do I need to save a file as unicode as well as specifying utf8
> encoding to properly display unicode on the web?

Definitely yes. You have to create a standards-compliant file,
and you have to tell the reader with which of several possible
standards your file actually complies, so the reader can make
head or tail of it.
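
Both halves of this answer can be checked mechanically. The following
Python sketch is merely an illustration under my own assumptions: the
function names canonical and check_page are hypothetical, the regular
expressions are a crude stand-in for a real HTML parser, and the
sample page is made up. It tests whether the bytes of a page decode
in the charset its Meta tag declares, and whether an HTTP Content-Type
value, if given, names the same charset.

```python
import codecs
import re

# Crude patterns; a real checker would parse the HTML properly.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
HEADER_CHARSET = re.compile(
    r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def canonical(name):
    """Map charset aliases (UTF-8, utf8, ...) to one spelling."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()

def check_page(html_bytes, http_content_type=None):
    """Return a list of problems; an empty list means no conflict found."""
    problems = []
    match = META_CHARSET.search(html_bytes)
    meta = canonical(match.group(1).decode("ascii")) if match else None
    if meta is None:
        problems.append("no charset declared in a meta element")
    else:
        try:
            html_bytes.decode(meta)
        except (UnicodeDecodeError, LookupError):
            problems.append("file bytes do not decode as " + meta)
    if http_content_type is not None and meta is not None:
        header = HEADER_CHARSET.search(http_content_type)
        if header and canonical(header.group(1)) != meta:
            problems.append("HTTP header contradicts the meta element")
    return problems

html = ('<meta http-equiv=Content-Type content="text/html; charset=UTF-8">\n'
        '<p>5 \u20ac</p>\n')
good = html.encode("utf-8")    # stored as declared
bad = html.encode("cp1252")    # declared UTF-8, stored as CP 1252

print(check_page(good))
print(check_page(bad))
print(check_page(good, "text/html; charset=ISO-8859-1"))
```

Note that a successful decode does not prove the label is right (pure
ASCII text decodes under many charsets), so this can only reveal
contradictions, not confirm correctness.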

> This electronic communication is confidential and for the
> exclusive use of the addressee.

So you post it to a list read world-wide??

Best wishes,
   Otto Stolz



This archive was generated by hypermail 2.1.2 : Mon Sep 17 2001 - 09:32:25 EDT