Re: charset in HTTP vs. HTML meta (was Re: UTF-16 and HTML META charset)

From: Erik van der Poel (erik@netscape.com)
Date: Tue Feb 22 2000 - 17:08:37 EST


Hi Glen,

You may already know that there is a pretty good open mailing list at
W3C to discuss Web I18N issues. To subscribe, send an email to

  www-international-request@w3.org

with the word "subscribe" in the Subject field, nothing in the body. The
archives are here:

  http://lists.w3.org/Archives/Public/www-international/

Glen Perkins wrote:
>
> Yes, this is a question I was discussing with Andrea Vine and some others a
> few days ago: whether 'tis nobler to use HTTP headers, HTML meta tags, or
> both, under various real-world circumstances. What are the rules for this in
>
> 1) any standard

Some of the specs have been pretty clear about this. In particular,
HTML4 has a pretty good section on this:

  http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2

From the client's point of view, this spec is clear. For server-side
software implementors, the spec is also quite clear, though you need to
look at yet another section to discover that servers are allowed to peek
inside HTML documents to find META HTTP-EQUIV elements:

  http://www.w3.org/TR/REC-html40/struct/global.html#edef-META

For intermediaries (such as transcoding proxies and translation
services), the spec is not very explicit, but implementors may assume
that since the HTTP charset has a higher priority than META charset, it
is OK to modify only the HTTP charset, leaving the META charset as is
(i.e. wrong, after transcoding). So, if you include a META charset but
omit the HTTP charset, these transcoders may not work. I haven't tested
them.
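
As a rough sketch of the priority rule discussed above (HTTP charset
first, then META charset), here is how a client might pick a charset.
The function names, the simplified regular expressions and the
"iso-8859-1" fallback are assumptions made for illustration; this is not
taken from any browser's source.

```python
import re

# Simplified patterns, for illustration only; a real HTML parser is stricter.
_HTTP_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*'
    rb'charset\s*=\s*([^\s"\'>;]+)',
    re.IGNORECASE,
)


def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    m = _HTTP_CHARSET.search(content_type)
    return m.group(1).lower() if m else None


def charset_from_meta(body):
    """Return the charset named in a META HTTP-EQUIV element, or None."""
    m = _META_CHARSET.search(body)
    return m.group(1).decode("ascii", "replace").lower() if m else None


def effective_charset(content_type, body, default="iso-8859-1"):
    """HTTP charset wins over META charset; `default` is an assumed fallback."""
    return charset_from_http(content_type) or charset_from_meta(body) or default
```

With a page whose META says koi8-r, a transcoder that rewrites only the
HTTP header to windows-1251 still yields the right answer here, because
the header is consulted first. If the header carries no charset, the
(possibly stale) META value is all that is left.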

In the real world, I have heard but not confirmed that transcoding
proxies are still in use for Cyrillic today (and for Japanese, but only
in the past(?)), and that some Web sites provide translation services
where you can enter the URL of a page to translate, e.g.:

  http://world.altavista.com/

Now, the specs for the charset names themselves are not at all clear.
There is a central Internet registration authority called IANA (Internet
Assigned Numbers Authority) that keeps the registry of charset names. It
has been poorly managed from Day One, resulting in the current mishmash
on the Net. The situation is getting better, but not very quickly. Here
is the registry:

  ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets

At some point, I hope to find some time to generate tables of the
charset names that are actually used by Netscape 4.X, MSIE5 and Mozilla
5.0. For now, you can get some idea by looking at the following
out-of-date document:

  http://people.netscape.com/erik/nav-charsets/

> 2) in practice in various versions of Netscape Navigator

Netscape 4.X's market share is shrinking (for several reasons), so I
won't comment in detail, but here are a couple of bugs in various
versions that I'm aware of:

The early 4.X versions (roughly 4.0 - 4.05) gave priority to META
charset instead of HTTP charset. So, if both were present, and META said
something different from HTTP, Netscape would listen to the META
charset. This was wrong, and was a bug that I fixed for 4.06 (if I
recall correctly).

The same versions also rendered any document containing a META charset
twice, even if they had got the charset right the first time. I have
heard but not confirmed that this caused problems for pages with
JavaScript (scripts that should not have been executed twice), not to
mention the frustration for the user when a long document takes a long
time to render twice. This META charset problem was so bad that Jamie
Zawinski outlawed its use on mozilla.org's Web site in the early days
(before he left).

All versions of Netscape from 1.1 to 4.X (but not Mozilla 5.0) are based
on an architecture (if it deserves that term) where the document first
passes through an I18N stream module that not only converts the incoming
document to some font API's encoding, but also auto-detects charsets and
even looks for META charset. Unfortunately, that META charset parser was
too permissive, allowing virtually any "META" element followed
eventually by the string "charset=" to cause a change in behavior. Not
only was this parser too permissive (and different from the actual HTML
parser), it would sometimes succeed and sometimes fail due to a TCP/IP
peculiarity called "slow start", where the first packet is smaller than
subsequent ones. The I18N stream module's META charset parser only
worked on the first block of data (first packet). The following is the
latest version of the bad META parser as it was after Netscape released
the sources on March 31st, 1998, but before mozilla.org switched to the
new architecture based on the totally rewritten layout engine:

  http://lxr.mozilla.org/classic/source/lib/libi18n/metatag.c#23
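
To illustrate the first-block problem described above, here is a small
sketch. It does not reproduce metatag.c; the pattern, the function
names and the 4096-byte limit are all assumptions chosen for the
example. It only shows why a scanner that looks at the first block alone
gives different answers depending on where the network splits the data.

```python
import re

# Loose pattern for illustration; it is not the pattern Netscape used.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def sniff_first_block(blocks):
    """Search only the first block, as the fragile approach does."""
    first = blocks[0] if blocks else b""
    m = META_CHARSET.search(first)
    return m.group(1).decode("ascii").lower() if m else None


def sniff_buffered(blocks, limit=4096):
    """Collect blocks until `limit` bytes are available, then search once."""
    buf = bytearray()
    for block in blocks:
        buf += block
        if len(buf) >= limit:
            break
    m = META_CHARSET.search(bytes(buf[:limit]))
    return m.group(1).decode("ascii").lower() if m else None


doc = (b'<html><head><title>x</title>'
       b'<meta http-equiv="Content-Type" '
       b'content="text/html; charset=shift_jis"></head></html>')

# Same document, delivered whole or split after a short first block.
whole = [doc]
split = [doc[:30], doc[30:]]
```

Here sniff_first_block(whole) finds the charset but
sniff_first_block(split) does not, while sniff_buffered gives the same
answer for both deliveries.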

I'll bet that this is more than you wanted to know. :-)

> Given the current state of things, what's the best approach to serving up
> dynamic content in multiple languages?

Even though there were problems with META charset in the early Netscape
4.X versions, I suspect that for many server-side applications today it
is a good idea to emit both the HTTP charset and the META charset (and
make sure they say the same thing, of course).
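
A minimal sketch of emitting both declarations from one value, so that
they cannot disagree. The function name and the page skeleton are
made up for the example, and how the headers are handed to a particular
server is left out, since that depends on the server's API.

```python
from html import escape


def build_html_response(text, charset):
    """Return (headers, body) that name the same charset in HTTP and META."""
    meta = ('<meta http-equiv="Content-Type" '
            f'content="text/html; charset={charset}">')
    page = (f"<html><head>{meta}<title>Example</title></head>"
            f"<body><p>{escape(text)}</p></body></html>")
    # Raises UnicodeEncodeError if the text cannot be represented in charset.
    body = page.encode(charset)
    headers = [
        ("Content-Type", f"text/html; charset={charset}"),
        ("Content-Length", str(len(body))),
    ]
    return headers, body
```

For example, build_html_response("Привет", "koi8-r") yields a
Content-Type header and a META element that both say koi8-r, with the
body bytes actually encoded in koi8-r.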

The HTTP charset is a good idea because it allows recent Netscape 4.X
versions to render the document correctly the first time, and because it
allows transcoding intermediaries to do their thing without having to
parse HTML. (I'm thinking particularly of transcoding proxies, of
course.)

The META charset is a good idea because many users are using browsers
that allow them to save HTML files to disk, losing the HTTP charset
info. Those saved HTML files will have the wrong charset info if the
user retrieved them via a transcoding intermediary, of course, but this
may be relatively rare (Cyrillic, AltaVista, and some others).

> Assume you're trying to create a website with dynamically-generated pages in
> lots of languages, but only one language per page. It's not necessarily easy
> to tell the server, page by page, what encoding is being transmitted.

It may not be easy, but it's quite important that you do it, and get it
right. I have no idea what server(s) you and others are using, and I
don't know much about servers in the first place, but I have been told
that both Netscape's and Microsoft's servers have APIs that allow
server-side apps to do all sorts of things, including altering the
Content-Type header of an outgoing stream. I don't know whether Apache
has any APIs.

> Is the
> safest, most reliable approach currently to use only the most common,
> ASCII-based legacy encodings, use no HTTP Content-Type: text/html;
> charset=foo header, but instead include the ASCII meta (http-equiv) tag on
> every page?

There is no single right answer to this question, I feel. My answer is
to use ASCII-based legacy encodings with charset names that are
supported by all of the browser versions, with both HTTP and META
charset. YMMV (Your Mileage May Vary).

However, there are other examples that you can draw your own conclusions
from. Yahoo Japan uses a comment near the top of the page with a pair of
bytes that only occur in EUC-JP (and not Shift-JIS and ISO-2022-JP).
Since 99.9% of all Japanese-speaking users use browsers that have been
configured (in some cases by default) to auto-detect EUC-JP vs Shift-JIS
vs ISO-2022-JP, Yahoo Japan's method is quite effective. Don't quote me
on the 99.9% figure. It's just a wild guess, stated more for the sake of
the argument than anything else. Try typing www.yahoo.co.jp into my
HTTP/HTML source viewer:

  http://webtools.mozilla.org/web-sniffer/

> (The reason for this approach, by the way, is that it would both work
> reliably now and prepare the way nicely for a rather gradual change from
> those legacy encodings into UTF-8, which would be just another ASCII-based
> encoding in this scenario. It's too early for UTF-8 for the general,
> consumer web pages, but the same web server could begin serving UTF-8 behind
> the firewall, where we could be more daring.)

Yes, this sounds good. Let me add that there are good ways to migrate
towards UTF-8. Server-side software can look at the User-Agent header to
decide whether or not it is safe to send UTF-8. As I mentioned before,
it is *not* safe to send UTF-8 to Netscape 4.X. Server-side software can
request UTF-8, and clients can send UTF-8, if HTML forms contain the
accept-charset attribute with UTF-8. Note that Netscape 4.X does not
support accept-charset in forms, though. I'm mentioning this attribute
for future migration planning.
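
A sketch of the User-Agent approach, under loud assumptions: the test
for Netscape 4.X (a "Mozilla/4." prefix without the word "compatible")
is a common heuristic rather than a guaranteed rule, the sample
User-Agent strings are only illustrative, and the function names are
invented for this example. When no User-Agent is sent, the sketch stays
with the legacy charset.

```python
from html import escape


def looks_like_netscape_4(user_agent):
    """Heuristic: Netscape 4.X sends "Mozilla/4.x" without "compatible"."""
    ua = user_agent or ""
    return ua.startswith("Mozilla/4.") and "compatible" not in ua


def choose_charset(user_agent, legacy_charset):
    """Pick UTF-8 only when the client does not look like Netscape 4.X."""
    if not user_agent or looks_like_netscape_4(user_agent):
        return legacy_charset
    return "utf-8"


def form_open_tag(action):
    """Opening FORM tag asking the client to submit the form data as UTF-8."""
    return f'<form method="post" action="{escape(action)}" accept-charset="UTF-8">'
```

For instance, choose_charset("Mozilla/4.7 [en] (WinNT; I)", "shift_jis")
keeps shift_jis, while a User-Agent such as
"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)" gets utf-8.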

> Would there be problems caused by leaving off the HTTP header charset
> declaration and doing all the charset declarations in the HTML meta tag?
> Would these problems be significant enough that some method really would
> have to be found to include an HTTP header that matched the page's meta tag?

For now, you may choose to emit only the META charset. But I think I
have given good reasons why you ought to work on getting the HTTP
charset in there too, eventually.

> Would it actually be better to declare a wrong encoding in an HTTP header
> than declare none at all, for some reason (still assuming all pages were
> correctly meta tagged)?

No, there is no situation where this is "better" (though the early
Netscape 4.X versions wouldn't be adversely affected by this). Most
current browser versions would have trouble with that, since they give
the HTTP charset priority over the META charset.

> I'm leaving aside the question of non-ASCII-compatible encodings like
> UTF-16, which obviously have different issues. If your meta tag is written
> in UTF-16, somehow you're going to have to know the encoding before you can
> read a meta tag, via HTTP, BOM, or some heuristic. It just doesn't seem
> likely to me that any such encoding would be practical on a busy consumer
> website that only serves one language per page, but has to have that page
> work on a very wide range of browsers. I'm willing to put those encodings
> aside for now in favor of ASCII-compatible encodings.

Wise decision. Please consider my migration proposals above.

Erik


