Re: UTF-8 code in HTML

From: Yves Arrouye (yves@realnames.com)
Date: Wed Apr 12 2000 - 01:09:49 EDT


>> If I have 3 HTML files side-by-side in a directory, one in UTF-8,
>> another in, say, big-endian Unicode, and a third in shift-JIS,
>> there is no way they can be self describing, because in order to
>> parse the HTML, you have to understand the encoding already.

Most commonly used encodings are supersets of ASCII, so a parser can safely
read as far as the meta tag that declares the content type. That meta tag is
itself written in ASCII. For these encodings the parser has no special work
to do, and the author can use the appropriate meta tag to make the document
self-describing.
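
As a rough illustration of the point above, here is a minimal Python sketch
that reads the leading bytes of a page as if they were ASCII, pulls the
charset out of the meta tag, and only then decodes the whole document. The
helper name sniff_meta_charset, the 1024-byte limit and the sample page are
my own assumptions for the example, not what any particular browser does.

```python
import re

# Hypothetical helper: look for a charset declaration in the first bytes of
# a document, relying only on the ASCII-compatible part of the encoding.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(data: bytes, limit: int = 1024):
    """Return the declared charset in lowercase, or None if none is found."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None


# Sample page (an assumption for the example): ASCII markup around a body
# whose text is encoded in Shift_JIS.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=Shift_JIS"></head>'
    b"<body>\x93\xfa\x96\x7b\x8c\xea</body></html>"
)

charset = sniff_meta_charset(page)
text = page.decode(charset)  # decode only once the charset is known
```

The scan works here only because the markup before the declaration consists
of bytes that mean the same thing in ASCII and in the declared encoding.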

Today, very few people publish HTML documents encoded in UCS-2, UCS-4,
UTF-16 or UTF-32. When they do, the readers of these documents need to be
able to recognize these encodings since they are not supersets of ASCII.
Recognizing these is trivial, and some browsers do it. If you want to avoid
being at the mercy of browsers' recognition of encodings, UTF-8 is an
appropriate encoding that is a superset of ASCII.
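
To show why recognizing the 16- and 32-bit encodings is not hard, here is a
small Python sketch. It assumes one common approach: check for a byte order
mark first, then fall back on the pattern of zero bytes around the first two
ASCII characters (an HTML file normally starts with "<"). The function name
guess_wide_encoding is hypothetical, and real browsers may well do this
differently.

```python
import codecs

# UTF-32 BOMs are checked first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def guess_wide_encoding(data: bytes):
    """Guess a 16- or 32-bit encoding from the first bytes, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    head = data[:4]
    if len(head) < 4:
        return None
    # No BOM: look at where the zero bytes fall around ASCII characters.
    if head[:3] == b"\x00\x00\x00" and head[3] != 0:
        return "utf-32-be"
    if head[1:] == b"\x00\x00\x00" and head[0] != 0:
        return "utf-32-le"
    if head[0] == 0 and head[1] != 0 and head[2] == 0 and head[3] != 0:
        return "utf-16-be"
    if head[0] != 0 and head[1] == 0 and head[2] != 0 and head[3] == 0:
        return "utf-16-le"
    return None  # probably an ASCII-compatible encoding
```

A result of None means the usual ASCII-based scan for a meta tag can go
ahead.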

Modern browsers implement some sort of automatic encoding detection anyway
(it is amazing how many Japanese Web pages carry no charset information;
some even play tricks to force the recognition of a given charset: for
example, Yahoo! includes a comment with a byte sequence that only exists in
EUC-JP in order to "help" the browser recognize the encoding as such very
early). While these detectors usually can't help you if you have files in
iso-8859-1, -2 and -15, they're still very useful. And if you indicate the
encoding of your files in the files themselves, you'll be very safe (still,
with a caveat for 16- or 32-bit encodings today).
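
A short Python sketch of why detection is of little help with the ISO 8859
family: the same bytes decode without error in all three encodings, so
nothing in the byte stream says which one was meant. The two sample bytes
are chosen by me for the example.

```python
# The same two bytes, decoded under three single-byte encodings.
sample = b"\xa4\xe8"

decoded = {
    name: sample.decode(name)
    for name in ("iso-8859-1", "iso-8859-2", "iso-8859-15")
}

for name, text in decoded.items():
    # Each decode succeeds; only the resulting characters differ.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

A detector would have to fall back on language statistics here, which is
why declaring the encoding in the file is the safer route.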

YA.



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT