Re: UTF-8 code in HTML

From: Jonathan Coxhead (
Date: Tue Apr 11 2000 - 20:34:40 EDT

   Markus Scherer replied to,

 | > Don't we need some conventional file extensions for both plain
 | > text and HTML encoded in UTF-8, UTF-16, etc? E.g.
 | >
 | > ".utf" => text/plain; charset = utf-8
 | > ".uni" => text/plain; charset = utf-16
 | > ".utfml" => text/html; charset = utf-8
 | > ".uniml" => text/html; charset = utf-16


 | It is not feasible to have a different extension per encoding, and
 | is - luckily - not necessary with HTML and XML pages since they are
 | self-describing.

   You must mean something I don't understand by "self-describing".
If I have 3 HTML files side-by-side in a directory, one in UTF-8,
another in, say, big-endian Unicode, and a third in Shift-JIS,
there is no way they can be self-describing, because in order to
parse the HTML, you have to understand the encoding already.

   The server could open the file and read some of it, and guess that
if every alternate byte is a 0---or a lot of them are---then it might
be Unicode; and that if it has a lot of characters with bit 7 clear,
and otherwise obeys the syntax of UTF-8, that it might be UTF-8;
you could even hope for a BOM. But these are only heuristics. Why
should the server have to examine the file in order to be able to
serve it? There seems to be a category error of some sort here ...
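
   To make the guesswork concrete, here is a minimal sketch in Python
of the sort of sniffing described above. The function name
guess_encoding and the 50% threshold for zero bytes are assumptions
made for illustration; nothing here is a standard algorithm, and the
answer is a guess, not a fact about the file.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess an encoding from raw bytes. A heuristic, not a guarantee."""
    # A BOM, if present, is the strongest hint available.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"

    # Many zero bytes in alternate positions suggest 16-bit Unicode.
    # The 50% threshold is an arbitrary assumption.
    half = len(data) // 2
    if half > 0:
        if data[0::2].count(0) > half * 0.5:
            return "utf-16-be"
        if data[1::2].count(0) > half * 0.5:
            return "utf-16-le"

    # Otherwise, see whether the bytes obey the syntax of UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

   Note that pure ASCII passes the UTF-8 test, and so does any other
byte sequence that happens to be well-formed, which is exactly why
this stays a heuristic.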

   And in any case, even if HTML were self-describing, and we
didn't mind opening the file and checking the contents before serving
it, what about plain text? Near-arbitrary byte sequences are legal in
plain text---I imagine a short document could be constructed that is
legal, and even plausible, as both Unicode and UTF-8.
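
   As a small illustration, the Python sketch below decodes the same
ten bytes both ways. The sample string is my own choice; whether the
16-bit reading is "plausible" to a human reader is a separate
question, but both decodings are legal.

```python
# Ten ASCII bytes: trivially valid UTF-8.
data = b"plain text"

as_utf8 = data.decode("utf-8")

# Read as big-endian 16-bit Unicode, each pair of bytes becomes one
# code unit in the range U+6000..U+7FFF, i.e. five CJK ideographs.
as_utf16 = data.decode("utf-16-be")

code_points = [ord(ch) for ch in as_utf16]
```

   Neither decode raises an error, so nothing in the bytes themselves
tells the server which reading was intended.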

 | You should provide your HTTP server with the
 | information about your pages that you have it serve up. This
 | information would include charset, language, and maybe more.

   We are in complete agreement---but the way this is usually done is
via the file name extension, so I've been waiting for ".uni", ".utf"
etc to start appearing, and they haven't yet. I think the answer is
probably just that Unicode technology is still a little way away from
the mainstream.
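
   For what it's worth, the server side of such a convention is
trivial. Here is a sketch in Python, using the extensions from the
proposal quoted above; the table, the function name content_type_for
and the fallback type are all hypothetical, not any real server's
configuration.

```python
import os.path

# Hypothetical extension table, taken from the quoted proposal.
EXTENSION_TYPES = {
    ".utf": "text/plain; charset=utf-8",
    ".uni": "text/plain; charset=utf-16",
    ".utfml": "text/html; charset=utf-8",
    ".uniml": "text/html; charset=utf-16",
}

def content_type_for(filename: str) -> str:
    """Pick a Content-Type from the file name alone, without opening the file."""
    _, ext = os.path.splitext(filename)
    # The fallback for unknown extensions is an assumption.
    return EXTENSION_TYPES.get(ext.lower(), "application/octet-stream")
```

   The point is that the server never has to look inside the file:
content_type_for("index.utfml") is decided from the name alone.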

 | If you don't provide this information, then the browser can still
 | get it out of the HTML page's <meta> tags.

   So what *should* the server put in the charset field of the
header? Something like

      Content-Type: text/html; charset = unknown

(or, equivalently, just remain silent on the matter) and let the
browser figure it out? It might work---it seems that it is the status
quo---but I don't see it working for plain text.
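
   Here is roughly what "let the browser figure it out" amounts to,
as a Python sketch. The regular expression and the 1024-byte limit
are assumptions for illustration, not what any particular browser
does. It scans the raw bytes for an ASCII "charset=" declaration.

```python
import re
from typing import Optional

# Look for charset=... in the raw bytes, before any decoding.
META_CHARSET = re.compile(
    rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_meta_charset(data: bytes, limit: int = 1024) -> Optional[str]:
    """Return the declared charset, or None if no declaration is found."""
    match = META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

page = '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
found_sjis = sniff_meta_charset(page.encode("shift_jis"))
found_utf16 = sniff_meta_charset(page.encode("utf-16-be"))
```

   This finds the declaration in the Shift-JIS copy, because the tag
itself is plain ASCII there, but finds nothing in the 16-bit copy,
where every other byte is a zero. And plain text has no tag to look
for at all.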

 | By the way, the default charset for HTML is ISO8859-1, not US-
 | ASCII, I hope.

   The default *document* character set is now ISO 10646 (Unicode). It
used to be ISO 8859-1. But this only specifies how to interpret
numeric entity references. The encoding by which the characters in
the file themselves are represented is another matter entirely. Shift-
JIS is not illegal, as far as I know, as long as it's announced
properly. Which is where I came in ...
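
   The distinction can be shown in a few lines of Python. This is a
sketch using the standard html module; the sample character, U+65E5
(decimal 26085), is my own choice. The same character is written once
literally and once as a numeric reference, both in a Shift-JIS file.

```python
import html

# The character U+65E5, written literally and as a numeric reference.
literal = "<p>\u65e5</p>".encode("shift_jis")
reference = "<p>&#26085;</p>".encode("shift_jis")

# The bytes differ, and reading them needs the announced encoding...
decoded_literal = literal.decode("shift_jis")

# ...but the number in the reference is a Unicode code point,
# whatever encoding the file itself uses.
decoded_reference = html.unescape(reference.decode("shift_jis"))
```

   Both routes arrive at the same character, but only once the file's
own encoding is known.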

 o o o (_|/
