Re: UTF-8 code in HTML

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Apr 11 2000 - 21:48:13 EDT


Jonathan,

you seem to underestimate HTML and long-existing browsers.
HTML can have a meta tag near the beginning of the file that emulates the Content-Type header that the HTTP response should carry, like
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Any application that reads HTML needs to look only at the first couple hundred bytes (at worst) to distinguish among the following encoding families:
- based on ASCII
- UTF-16BE
- UTF-16LE

This can be done by searching for the byte sequences that "<html" or a similar tag would produce in each of the above.

From there, it can read the <meta> tag and then read the entire file with the correct encoding.
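
As a rough sketch of the steps above (Python, not taken from any particular browser): probe the first bytes for a tag in each of the three families, then decode provisionally and pull the charset out of the <meta> tag. All function names, the 512-byte window, and the list of probe tags are assumptions made for illustration. The alignment check is there because a UTF-16LE tag preceded by another character also looks like a UTF-16BE tag at an odd byte offset.

```python
import re
from typing import Optional

# All names here are hypothetical; the window size and probe tags are assumptions.
WINDOW = 512
PROBES = ("<html", "<!doctype", "<head", "<meta")
# (codec used for provisional decoding, bytes per code unit)
FAMILIES = (("latin-1", 1), ("utf-16-be", 2), ("utf-16-le", 2))
META_CHARSET = re.compile(r"""<meta[^>]*charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)


def _found_aligned(data: bytes, needle: bytes, unit: int) -> bool:
    """True if needle occurs in data at an offset that is a multiple of unit."""
    pos = data.find(needle)
    while pos != -1:
        if pos % unit == 0:
            return True
        pos = data.find(needle, pos + 1)
    return False


def sniff_family(head: bytes) -> Optional[str]:
    """Return a codec for provisional decoding, or None if no probe matches."""
    window = head[:WINDOW].lower()  # lowercases ASCII letters only
    for codec, unit in FAMILIES:
        if any(_found_aligned(window, p.encode(codec), unit) for p in PROBES):
            return codec
    return None


def declared_charset(head: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag near the start of the file."""
    codec = sniff_family(head)
    if codec is None:
        return None
    text = head[:WINDOW].decode(codec, errors="replace")
    match = META_CHARSET.search(text)
    return match.group(1).lower() if match else None
```

A caller would pass the first few hundred bytes of the file to declared_charset() and, if it returns a name, reopen or re-decode the whole file with that encoding.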

Many applications also, or instead, use the Unicode signature byte sequence (ef bb bf for UTF-8, fe ff for UTF-16BE, ff fe for UTF-16LE) to detect Unicode.
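
A minimal sketch of the signature check, in Python. It covers only the three signatures listed above; the function name and the return convention (encoding name plus signature length) are assumptions for illustration.

```python
import codecs
from typing import Optional, Tuple

# Only the three signatures named in the text are checked here.
SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),          # ef bb bf
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # fe ff
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # ff fe
)


def detect_signature(head: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, signature length), or (None, 0) if no signature is present."""
    for signature, encoding in SIGNATURES:
        if head.startswith(signature):
            return encoding, len(signature)
    return None, 0
```

The returned length lets the caller skip the signature bytes before decoding the rest of the file.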

This makes HTML files self-describing. The scenario that you describe works. The same holds for XML, and it works even better there, because the first few characters are precisely specified and the encoding declaration comes earlier, in a deterministic place.
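
For the XML case, a sketch along the same lines (Python). It assumes an ASCII-compatible file with no signature in front of the declaration; the function name is made up for illustration, and a UTF-16 file would need the same provisional decoding as for HTML.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration at the very start of the data.
XML_DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(head: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if there is none."""
    match = XML_DECL_ENCODING.match(head)
    return match.group(1).decode("ascii") if match else None
```

Because the declaration must be the first thing in the file, a single anchored match is enough; there is no need to scan for a tag.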

This is simple and works. Of course, a server should get the charset from the page author and send it in the HTTP headers, ahead of the page itself. If the author does not set the charset, then the server could either follow the above process or not send a charset at all, I suppose.

The document character set is indeed Unicode, as you point out. However, the default charset (encoding) for HTML is not Unicode; it is Latin-1. These are two different animals.
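
A small Python illustration of the difference, using only the standard library: a numeric character reference always names a Unicode code point (the document character set), while the bytes in the file depend on the encoding.

```python
import html

# A character reference names a Unicode code point, whatever the file's encoding.
euro = html.unescape("&#8364;")  # U+20AC EURO SIGN

# The same character stored under two encodings yields different bytes.
e_acute_latin1 = "\u00e9".encode("latin-1")  # one byte: e9
e_acute_utf8 = "\u00e9".encode("utf-8")      # two bytes: c3 a9
```

So a Latin-1 encoded page can still refer to the euro sign through &#8364;, even though Latin-1 itself has no byte for it.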

For plain text ("ASCII") files, most applications rely just on the signature byte sequence to detect the Unicode encoding, or default to whatever is common on the platform if there is no Unicode signature.

There is no need for any new extensions as long as the existing, common techniques are used.

markus


