RE: help

From: Mike Brown (mbrown@corp.webb.net)
Date: Tue Aug 29 2000 - 00:49:35 EDT


> Are there any HTML pages in the Unicode character set
> i.e The entire HTML page is in Unicode ( including the
> tags , attributes ) .

Based on the way you asked your question, I think some clarifications are in
order.

An HTML or XML document exists in abstract form as a sequence of abstract
characters from a very large subset of the repertoire covered by Unicode
(actually, ISO/IEC 10646-1). The document may exist in tangible form, for
storage or transmission, as a sequence of bits, which in turn may be grouped
into bytes or other fixed bit widths.

The procedure for mapping the abstract Unicode characters to certain
bit/byte/whatever sequences is an encoding "scheme". A mapping of particular
characters to particular bit/byte/whatever sequences is a "character set".
Most character sets map single characters to single octets (8-bit bytes),
but it is not uncommon for characters to be mapped to sequences of more than
one octet (UTF-8 and Shift-JIS, for example). Unicode Technical Report #17
describes a number of intermediate layers of abstraction, but for your
purposes, you are probably only concerned with these kinds of character
sets.
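To make the single-octet versus multi-octet distinction concrete, here is a small Python sketch. Python, the sample characters, and the codec names "latin-1" and "shift_jis" are my own choices for illustration, and the helper octets() is hypothetical. It encodes the same abstract characters under different character sets and prints the resulting octets in hex.

```python
# Same abstract characters, different character sets, different octets.
# The sample characters and codec names are illustrative assumptions.
e_acute = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
hiragana_a = "\u3042"   # HIRAGANA LETTER A


def octets(text, charset):
    """Return the encoded form of text as space-separated hex octets."""
    return " ".join(f"{b:02X}" for b in text.encode(charset))


print(octets(e_acute, "latin-1"))       # E9        (one octet)
print(octets(e_acute, "utf-8"))         # C3 A9     (two octets)
print(octets(hiragana_a, "shift_jis"))  # 82 A0     (two octets)
print(octets(hiragana_a, "utf-8"))      # E3 81 82  (three octets)
```

The abstract character is the same in each pair of calls; only the mapping to octets changes.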

There is a list of character sets approved for use on the Internet at
http://www.isi.edu/in-notes/iana/assignments/character-sets. The character
set names and aliases in this list are what may go in "charset" parameters
of MIME and HTTP "Content-Type" headers, or in the "encoding" attribute in
the prolog of an XML entity.
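As a rough sketch of how a "charset" parameter gets used in practice, the Python below pulls the parameter out of a Content-Type header value and looks up a matching codec. The header value is made up, and Python's codec registry is only an approximation of the IANA list (it knows many of the same names and aliases, but not all of them).

```python
import codecs
from email.message import Message

# A made-up header value, for illustration only.
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"

charset = msg.get_content_charset()   # the parameter value, lowercased
codec = codecs.lookup(charset)        # raises LookupError for unknown names

print(charset)      # utf-8
print(codec.name)   # utf-8
```

A receiver would then decode the document's octets with that codec; if the lookup fails, it has no reliable way to turn the octets back into characters.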

Right now each character in the Unicode repertoire is expressed as a
"Unicode value": a sequence of 1 or 2 code values, each 16 bits wide, notated
like "U+1234" in print. This leads people to believe that there is a
"Unicode character set" that maps abstract characters to specific bit
sequences. The reality, as expressed in UTR #17, is that 16-bit code values
may manifest in different ways in different computer architectures. The
issues basically boil down to matters of endianness when the values are
split into 8-bit chunks, and the extra octets that might be added to the
beginning of an encoded document to signal which byte order was used (byte
order marks).
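The endianness point is easy to see with a few lines of Python. This is only an illustrative sketch: the choice of U+1234 as the sample character and the use of Python's "utf-16-be" and "utf-16-le" codecs are my assumptions.

```python
import codecs

text = "\u1234"  # a single character, written U+1234 in print

big = text.encode("utf-16-be")     # most significant octet first
little = text.encode("utf-16-le")  # least significant octet first
print(big.hex(), little.hex())     # 1234 3412

# A byte order mark (U+FEFF) at the start tells the reader which order was used.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())  # feff fffe

# Decoding as plain "utf-16" reads the mark, picks the order, and drops the mark.
assert (codecs.BOM_UTF16_BE + big).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == text

# A character outside the first 65,536 needs two 16-bit values.
print("\U00010000".encode("utf-16-be").hex())  # d800dc00
```

The same 16-bit value yields two different octet sequences, which is why "U+1234" by itself does not tell you what the bytes on the wire look like.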

Since Unicode aims to cover every abstract character, essentially *every*
character set maps some subset of Unicode's repertoire to bit/byte sequences;
so in a sense, all encoded documents are "in the Unicode character set". The Unicode
Standard, certain IETF RFCs, and certain amendments to ISO/IEC 10646-1
define a few encoding schemes / transformation formats that effectively map
the entire Unicode repertoire to bit/byte sequences. There are a few
character sets implied or specified by these schemes/formats, and these do
appear in the IANA's character set list:

ISO-10646-UTF-1 or csISO10646UTF1
UNICODE-1-1 or csUnicode11
UNICODE-1-1-UTF-7 or csUnicode11UTF7
UTF-7
UTF-8
UTF-16
UTF-16BE
UTF-16LE

The first 4 are deprecated and all but abandoned. You will also have a hard
time finding any UTF-16, UTF-16BE, or UTF-16LE encoded HTML documents because,
as someone else pointed out, few browsers support them. You can find UTF-8
encoded HTML documents pretty easily, though. Any document consisting purely
of ASCII bytes will do, even if it uses &#1234; or &SGMLentity; references
to non-ASCII characters. This is because UTF-8 is a superset of ASCII
(0x00-0x7F).
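Here is a quick Python check of that claim, as a sketch only: the HTML fragment is made up, and html.unescape() stands in for whatever a browser does when it resolves a numeric character reference.

```python
import html

# A made-up, ASCII-only HTML fragment with a numeric character reference.
page = b"<p>Cyrillic letter: &#1234;</p>"

# Every octet is below 0x80, so ASCII and UTF-8 decoding agree.
assert all(b < 0x80 for b in page)
as_ascii = page.decode("ascii")
as_utf8 = page.decode("utf-8")
assert as_ascii == as_utf8

# Resolving the reference yields a non-ASCII character (decimal 1234 = 0x04D2).
resolved = html.unescape(as_utf8)
non_ascii = [c for c in resolved if ord(c) > 0x7F]
print([f"U+{ord(c):04X}" for c in non_ascii])  # ['U+04D2']
```

So the bytes of the document never leave the ASCII range, yet the document can still refer to any character in the repertoire.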

If you want to actually see some non-ASCII characters represented as UTF-8
byte sequences in pages that declare themselves to be UTF-8 encoded, have a
look through the HTML at http://czyborra.com/.

Perhaps after reading this you may decide you don't really want to see HTML
"in the Unicode character set" at all :)

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://www.skew.org/xml/


