Re: Unicode in web pages

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Sep 04 2000 - 18:44:09 EDT


On Mon, 4 Sep 2000 03:21 (GMT-0800), you have written:
> I need to take input from a web-page, and store it in a database.

I have no experience with this part of your problem. I expect that some
expert from the Unicode forum will step in; otherwise, ask me for copies
of some earlier contributions to the Unicode list that I have kept on
this topic.

> Web pages are then driven from this database.

As a WWW author, I can give you a few hints on this side of your problem,
and answer your general questions.

> [...] UTF-8. What then reaches the database is a series of ASCII values
> with foreign characters such as Japanese, or accented characters, con-
> verted to a few symbols.

What you probably get is the UTF-8 encoded text. The ASCII, or possibly
Latin-1 (ISO 8859-1), characters you are seeing are an artefact of your
display method: if you use a tool that only knows about Latin-1, it
will display any byte sequence as a sequence of Latin-1 characters,
one character per byte.
Cf. <http://czyborra.com/charsets/iso8859.html> for an explanation of
"Latin-1", "ISO 8859-1", and similar terms.
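
To see this effect in isolation, here is a minimal Python sketch. The
sample string is made up for illustration, and the sketch assumes the
database holds the UTF-8 bytes unchanged; it says nothing about how your
particular database or viewing tool behaves.

```python
# Hypothetical sample text; any non-ASCII string would do.
text = "Zürich 日本"

# What the browser sends (and, presumably, what the database stores):
# the UTF-8 encoded bytes.
utf8_bytes = text.encode("utf-8")

# What a Latin-1-only tool shows: every byte as one Latin-1 character.
misread = utf8_bytes.decode("latin-1")

# The stored bytes are intact, so decoding them as UTF-8 recovers the text.
recovered = utf8_bytes.decode("utf-8")

print(len(text), "characters,", len(utf8_bytes), "bytes")
print("Latin-1 view starts with:", misread[:8])
print("round trip ok:", recovered == text)
```

If the last line reports a successful round trip on your own data, the
"few symbols" are only a display problem, not data corruption.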

UTF-8 is an encoding (the prevailing one in WWW pages) of Unicode.
A Unicode value is a number of at most 21 bits (U+0000 through U+10FFFF)
representing one character; UTF-8 uses one through four bytes to
represent a Unicode value; cf. <http://czyborra.com/utf/> for the details.
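
As a rough illustration of the one-through-four-bytes rule, the following
Python sketch encodes four sample characters, one from each length class.
The choice of samples is mine, not taken from the page cited above.

```python
# One sample character per UTF-8 length class (samples chosen for illustration).
samples = {
    "A": "U+0041, ASCII letter",
    "\u00e9": "U+00E9, e with acute accent",
    "\u20ac": "U+20AC, Euro sign",
    "\U00010000": "U+10000, first value beyond the 16-bit range",
}

# Number of bytes UTF-8 needs for each sample.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, note in samples.items():
    encoded = ch.encode("utf-8")
    print(f"{note}: {encoded.hex(' ')} ({lengths[ch]} byte(s))")

# The largest Unicode value, U+10FFFF, needs this many bits.
max_bits = (0x10FFFF).bit_length()
print("bits needed for U+10FFFF:", max_bits)
```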

> Some web sites seem to say that for html, unicode must be changed
> to [the &#xxxx;] numeric character reference format.

This is a misconception. Actually, you have several options.

First of all, you have to mark your HTML source as HTML 4.0 or better,
as HTML 3.2 (and below) allows only Latin-1 (i. e. 8-bit) characters.
(It has been argued that you do not have to do so, as the major browsers
display your pages all right even if you omit that declaration; however,
this is not standard-conforming, so you may run into trouble. I think
the risk is not worth taking.)
Cf. <http://www.w3.org/TR/REC-html40/struct/global.html#h-7.2>.
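
Since your pages are generated from a database, a generator might emit
the declaration along these lines. This is only a sketch in Python:
build_page is a hypothetical helper, the HTML 4.01 Strict document type
is just one of the declarations the specification offers, the META
element is one common way to name the encoding, and the sketch assumes
that title and body contain no markup characters needing escaping.

```python
# HTML 4.01 Strict document type declaration (one of several valid choices).
DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
           '"http://www.w3.org/TR/html4/strict.dtd">')


def build_page(title, body, charset="utf-8"):
    """Hypothetical helper: return a minimal HTML 4 page as encoded bytes.

    Assumes title and body are plain text without '<', '>' or '&'.
    """
    html = (
        f"{DOCTYPE}\n"
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={charset}">\n'
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"<p>{body}</p>\n"
        "</body>\n</html>\n"
    )
    # The bytes sent must really be in the encoding the page announces.
    return html.encode(charset)


page = build_page("Euro test", "Price: 5 \u20ac")
print(page.decode("utf-8"))
```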

In HTML 4, the document character set is invariably Unicode (or UCS, which
is kept in sync with Unicode), i. e. all numeric character references are
based on Unicode/UCS values. However, you can choose amongst several
encodings to transfer your HTML source to the browser; your HTML source
can contain any character the encoding is capable of transferring. If you
choose a limited encoding, such as ASCII (7-bit), Latin-1 (8-bit), or some
of the far-east national standards, you can natively use the characters
from its respective character set; if you choose UTF-8, you can transfer
any character. Cf. <http://www.w3.org/TR/REC-html40/charset.html> for
details.
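
The difference between a limited encoding and UTF-8 can be checked with
a short Python sketch. can_encode is a hypothetical helper, and the three
sample characters and four encodings are chosen only for illustration.

```python
def can_encode(ch, encoding):
    """Hypothetical helper: True if the encoding can carry ch natively."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample characters: e with acute, the Euro sign, and the ideograph U+7881.
chars = {"e-acute": "\u00e9", "Euro": "\u20ac", "Go": "\u7881"}
encodings = ["ascii", "iso-8859-1", "iso-8859-15", "utf-8"]

# For each character, record which encodings can transfer it directly.
table = {
    name: [can_encode(ch, enc) for enc in encodings]
    for name, ch in chars.items()
}

for name, row in table.items():
    cells = ", ".join(
        f"{enc}: {'yes' if ok else 'no'}" for enc, ok in zip(encodings, row)
    )
    print(f"{name:8} {cells}")
```

Only UTF-8 carries all three; anything an encoding cannot carry natively
has to be written as a numeric character reference instead.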

As mentioned above, you can always include the numeric reference to any
character, irrespective of the encoding chosen; however, Netscape Navigator
used to have a bug, so it would interpret the numeric character references
only for those characters contained in the character set of the transfer
encoding (where such references are virtually useless, as you can send
those characters directly anyway). In the current 4.7 version, this bug is
gone (at least on my Unix box). You may want to experiment with my example
HTML files illustrating this topic, and to study their HTML sources:
  <http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-1.htm>
  <http://www.rz.uni-konstanz.de/y2k/test/Euro-Latin-9.htm>
  <http://www.rz.uni-konstanz.de/y2k/test/Euro-UTF.htm>
  <http://www.rz.uni-konstanz.de/y2k/test/Go-Latin.htm>
  <http://www.rz.uni-konstanz.de/y2k/test/Go-UTF.htm>
For the three Euro files, you need a font containing the Euro currency
symbol U+20AC; for the two Go files, you need a font containing the
far-east character U+7881.
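
If you generate such pages by program, the mix of native characters and
numeric references can be produced mechanically. The following Python
sketch uses the standard "xmlcharrefreplace" error handler; the sample
text is made up, and this is not how the example files above were made.

```python
import html

# Made-up sample: Euro sign, two u-umlauts, and the ideograph U+7881.
text = "5 \u20ac f\u00fcr M\u00fcller, \u7881"

# Latin-1 transfer encoding: the umlauts go out natively, everything else
# becomes a decimal numeric character reference based on its Unicode value.
latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# ASCII transfer encoding: every non-ASCII character becomes a reference.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace")

print(latin1)
print(ascii_only)

# A conforming reader resolves the references back to the same characters,
# whichever transfer encoding was used.
restored = html.unescape(ascii_only.decode("ascii"))
print("round trip ok:", restored == text)
```

Note that the reference number depends only on the Unicode value (8364
for U+20AC), never on the transfer encoding.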

Best wishes,
   Otto Stolz


