Re: UTF-8 code in HTML

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Tue Apr 11 2000 - 20:34:40 EDT

Next message: Rick McGowan: "Re: UTF-8 code in HTML"
Previous message: Markus Scherer: "Re: UTF-8 code in HTML"
Maybe in reply to: Fady Elias: "UTF-8 code in HTML"
Next in thread: Rick McGowan: "Re: UTF-8 code in HTML"

Markus Scherer replied to,

with,

| It is not feasible to have a different extension per encoding, and it
| is - luckily - not necessary with HTML and XML pages since they are
| self-describing.

You must mean something I don't understand by "self-describing".
If I have 3 HTML files side-by-side in a directory, one in UTF-8,
another in, say, big-endian Unicode, and a third in shift-JIS,
there is no way they can be self-describing, because in order to
parse the HTML, you have to understand the encoding already.

The server could open the file and read some of it, and guess that
if every alternate byte is a 0---or a lot of them are---then it might
be Unicode; and that if it has a lot of characters with bit 7 clear,
and otherwise obeys the syntax of UTF-8, that it might be UTF-8;
you could even hope for a BOM. But these are only heuristics. Why
should the server have to examine the file in order to be able to
serve it? There seems to be a category error of some sort here ...
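
A minimal sketch of the kind of guessing described above, in Python.
The function name guess_encoding and the 50% zero-byte threshold are
illustrative assumptions, not anything a real server is known to do,
and the result is only ever a guess.

```python
def guess_encoding(data: bytes) -> str:
    """Guess an encoding from raw bytes. A heuristic, not a guarantee."""
    # A byte order mark is the strongest hint available.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # Many zero bytes at even offsets suggest big-endian 16-bit Unicode
    # carrying mostly Latin text. The threshold is an arbitrary assumption.
    if len(data) >= 2:
        even = data[0::2]
        if even.count(0) / len(even) > 0.5:
            return "utf-16-be"
    # Otherwise check whether the bytes obey UTF-8 syntax.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

Note that a shift-JIS file falls through to "unknown" here, and that
pure ASCII is reported as UTF-8, which is exactly why this cannot
replace an explicit declaration.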

And in any case, even if HTML were self-describing, and we
didn't mind opening the file and checking the contents before serving
it, what about plain text? Near-arbitrary byte sequences are legal in
plain text---I imagine a short document could be constructed that is
legal, and even plausible, as both Unicode and UTF-8.
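
One way to see the ambiguity, as a small Python sketch: any
even-length run of ASCII letters is also a legal sequence of 16-bit
big-endian code units. The sample string is an arbitrary choice for
illustration.

```python
# Ten ASCII bytes: valid UTF-8, and also five valid big-endian 16-bit units.
data = b"plain text"

as_utf8 = data.decode("utf-8")       # the text a human probably intended
as_utf16 = data.decode("utf-16-be")  # the same bytes read as 16-bit Unicode

# Each pair of lowercase ASCII bytes lands in the CJK ideograph block,
# so the second reading is legal even though it is not what was meant.
in_cjk_block = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)
```

Neither decode raises an error, so syntax alone cannot tell the two
readings apart.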

| You should provide your HTTP server with the
| information about your pages that you have it serve up. This
| information would include charset, language, and maybe more.

We are in complete agreement---but the way this is usually done is
via the file name extension, so I've been waiting for ".uni", ".utf",
etc. to start appearing, and they haven't yet. I think the answer is
probably just that Unicode technology is still a little way away from
the mainstream.
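
A sketch of what extension-driven charset labelling might look like,
in Python. The table is hypothetical: ".uni" and ".utf" are the
extensions floated above, ".sjis" is added purely for illustration,
and none of them is an established convention. The names
CHARSET_BY_EXTENSION and content_type_for are likewise made up.

```python
import os

# Hypothetical extension-to-charset table; not a registered convention.
CHARSET_BY_EXTENSION = {
    ".utf": "utf-8",
    ".uni": "utf-16be",
    ".sjis": "shift_jis",
}

def content_type_for(filename: str, media_type: str = "text/html") -> str:
    """Build a Content-Type value from the last file name extension."""
    ext = os.path.splitext(filename)[1].lower()
    charset = CHARSET_BY_EXTENSION.get(ext)
    if charset is None:
        # Unknown extension: stay silent rather than guess.
        return media_type
    return f"{media_type}; charset={charset}"
```

With this table, content_type_for("index.html.utf") yields
"text/html; charset=utf-8", while "index.html" yields plain
"text/html" with no charset at all.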

| If you don't provide this information, then the browser can still
| get it out of the HTML page's <meta> tags.

So what *should* the server put in the charset field of the
header? Something like

Content-Type: text/html; charset = unknown

(or, equivalently, just remain silent on the matter) and let the
browser figure it out? It might work---it seems that it is the status
quo---but I don't see it working for plain text.

| By the way, the default charset for HTML is ISO8859-1, not
| US-ASCII, I hope.

The default *document* character set is now ISO10646 (Unicode). It
used to be ISO8859-1. But this only specifies how to interpret
numeric entity references. The encoding by which the characters in
the file themselves are represented is another matter entirely.
Shift-JIS is not illegal, as far as I know, as long as it's announced
properly. Which is where I came in ...
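
The distinction can be sketched in Python, using the standard html
module to resolve the reference. The sample text and the helper name
resolve are assumptions for illustration: the literal character's
bytes change with the file's encoding, while the numeric reference
always names the same ISO10646 code point.

```python
import html

# A literal U+65E5, then a numeric reference to the same code point (26085).
text = "\u65e5 &#26085;"

def resolve(raw: bytes, encoding: str) -> str:
    """Decode with the announced encoding, then expand entity references."""
    return html.unescape(raw.decode(encoding))

# The bytes on disk differ per encoding...
encoded = {enc: text.encode(enc) for enc in ("utf-8", "shift_jis", "utf-16-be")}

# ...but once the encoding is known, every version resolves identically.
resolved = {enc: resolve(raw, enc) for enc, raw in encoded.items()}
```

Every entry in resolved is the same two-character-plus-space string,
but only because each decode was given the right encoding name.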

        /|
o o o (_|/
        /|
       (_/


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT