Re: Unicode, European languages and HTML - help !

From: Otto Stolz
Date: Mon May 29 2000 - 13:00:08 EDT


I haven't seen any response to your query in the Unicode list, so
I am going to contribute my 0,02 EUR.

On Tue, 16 May 2000 12:01:21 -0800 (GMT-0800) you wrote:
> Since Unicode covers all European character sets, we decided to setup
> both web server and Oracle database to use Unicode. Is that a correct
> assumption ?

Yes, definitely.

Though I am not proficient in designing Oracle databases, I guess you
have to decide which encoding to use:
- either UTF-16 (which equals UCS-2 for contemporary languages)
- or UTF-8.

UCS-2 uses 16 bits per character; UTF-16 adds an extension mechanism
(surrogate pairs) to code rarely used characters (such as hieroglyphs)
in 32 bits per character.

UTF-8 uses 8 bits, 16 bits, 24 bits or 32 bits per character; an ASCII
character occupies 8 bits, almost any other European character 16 bits
(the Euro currency symbol occupies 24 bits). HTML pages can be encoded
directly in UTF-8.
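
To make the sizes above concrete, here is a minimal Python sketch that
measures a few sample characters. The sample characters and the helper
name encoded_bits are my own choices for illustration; "utf-16-be" is
used only so that no byte-order mark is counted.

```python
# Sample characters picked for illustration (written as escapes so the
# script itself stays pure ASCII).
SAMPLES = [
    ("A", "ASCII letter"),
    ("\u00e9", "Latin small letter e with acute"),
    ("\u20ac", "Euro sign"),
    ("\U00013000", "Egyptian hieroglyph, outside the 16-bit range"),
]


def encoded_bits(ch, encoding):
    """Return the number of bits one character occupies in the given encoding."""
    return 8 * len(ch.encode(encoding))


for ch, description in SAMPLES:
    print(
        f"U+{ord(ch):04X} ({description}): "
        f"UTF-8 {encoded_bits(ch, 'utf-8')} bits, "
        f"UTF-16 {encoded_bits(ch, 'utf-16-be')} bits"
    )
```

The last sample is the only one that needs 32 bits in UTF-16, because
it is coded as a surrogate pair.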

> We have decided to code all our HTML pages using Unicode values
> (i.e. &#...;). However, countries seem to complain that it is not
> displaying properly.

This depends on the browsers used by your customers.

- For Netscape 3, the end-user has to manually enable Unicode support,
  cf. <>. This is probably
  not a feasible way to go, in your case.

- Netscape 4, and above, will display decimal Numeric Character References
  (NCRs) all right -- but only if you declare your WWW page as UTF-8 encoded
  (in contrast to the HTML 4.0 specification!). Furthermore, Netscape does
  not correctly display hexadecimal NCRs.

- IE 5 displays both decimal and hexadecimal NCRs (only tested with the
  examples below). I do not have old versions readily available to test;
  if I remember correctly, IE 4 would also display at least the decimal
  NCRs in my examples.

In any case, you have to declare your HTML source to be HTML 4.0 or later,
because HTML 3.2 (and earlier versions) did not allow for NCRs beyond 255.
Try my examples.
Of course, only characters locally installed in one of your fonts
can be displayed. Have a look also at the examples' source texts:
They contain both a DOCTYPE declaration (to declare them as HTML 4.0)
and a META tag (to announce UTF-8 encoding for Netscape -- this does not
interfere with other browsers). Alternatively, you could announce the
latter in the corresponding HTTP header. These examples have links to the
pertinent official specifications.
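
As a rough illustration of the recipe above (DOCTYPE, META tag, decimal
NCRs), here is a small Python sketch that generates such a page. It is
not the source of my examples; the names to_decimal_ncrs, build_page
and PAGE_TEMPLATE, as well as the sample text, are made up for this
sketch.

```python
import html

# Skeleton with an HTML 4.0 DOCTYPE and a META tag announcing UTF-8.
PAGE_TEMPLATE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>{title}</TITLE>
</HEAD>
<BODY>
<P>{body}</P>
</BODY>
</HTML>
"""


def to_decimal_ncrs(text):
    """Escape HTML markup characters, then turn every non-ASCII
    character into a decimal NCR such as &#8364;."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def build_page(title, body):
    """Return a pure-ASCII HTML 4.0 page with all other characters as NCRs."""
    return PAGE_TEMPLATE.format(
        title=to_decimal_ncrs(title), body=to_decimal_ncrs(body)
    )


print(build_page("Price list", "Preis: 10 \u20ac, Gr\u00f6\u00dfe: XL"))
```

Since the resulting page is pure ASCII, it is byte-for-byte valid UTF-8,
so the META declaration is truthful.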

If a significant part of your user community uses old browsers that
do not handle UCS NCRs well (even if UTF-8 is announced) then you
will have to resort to browser-sniffing CGI scripts: A server-side
script could evaluate the browser and re-code the HTML source on the
fly, as suitable for the respective browser. I haven't pursued this
way, though.
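
Just to show what I mean by browser sniffing, here is an untested-in-
production Python sketch. The version thresholds merely restate my
browser notes above; the function names, the User-Agent patterns and
the ISO-8859-1 fallback are assumptions of this sketch, and a real
script would need a far more careful browser table.

```python
import re


def supports_ucs_ncrs(user_agent):
    """Crude guess whether the browser displays NCRs beyond 255.

    Assumption: IE 4+ and Netscape 4+ (which identifies itself as
    "Mozilla/4...") do; everything else does not.
    """
    match = re.search(r"MSIE (\d+)", user_agent)
    if match:
        return int(match.group(1)) >= 4
    match = re.match(r"Mozilla/(\d+)", user_agent)
    if match:
        return int(match.group(1)) >= 4
    return False


def render_for(user_agent, text):
    """Return (Content-Type header value, body bytes) for a text fragment."""
    if supports_ucs_ncrs(user_agent):
        # Pure ASCII with decimal NCRs, announced as UTF-8.
        return "text/html; charset=UTF-8", text.encode("ascii", "xmlcharrefreplace")
    # Fallback (an assumption): ISO-8859-1 bytes; characters outside
    # that character set are replaced with "?".
    return "text/html; charset=ISO-8859-1", text.encode("iso-8859-1", "replace")


print(render_for("Mozilla/4.7 [en] (WinNT; I)", "10 \u20ac"))
print(render_for("Mozilla/3.04 (Win95; I)", "Gr\u00f6\u00dfe"))
```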

> If we use unicode for all of our HTML pages, [...]
> will the user input be stored as unicode characters in the Oracle
> database ?

This was discussed some time ago on the Unicode list. I do not have
much experience with HTML forms, so I think somebody else should answer
this part of the question.

> When we send the bulk email to each country with their data - do we need
> to convert the unicode data coming from the database into each
> individual charset ?

This should work without converting if you provide the correct MIME headers
with your mail, cf. <>,
<> and
Of course, your recipients will have to use up-to-date mail clients
to display and print your mail, cf.
If I understand your application correctly, you can advise the addressees
of this mail to use such software (in contrast to the customers replying
to your HTML forms).
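
For illustration, a minimal Python sketch of a mail that carries UTF-8
text with the MIME headers set by the standard email package. The
addresses, the subject and the helper name build_mail are placeholders
invented for this sketch; actually sending the message is left out.

```python
from email.header import Header
from email.mime.text import MIMEText


def build_mail(sender, recipient, subject, body):
    """Build a text/plain message whose body and subject are UTF-8."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = sender
    msg["To"] = recipient
    # Non-ASCII subjects are sent as MIME encoded-words.
    msg["Subject"] = Header(subject, "utf-8")
    return msg


BODY = "Gr\u00f6\u00dfe XL kostet 10 \u20ac.\n"
msg = build_mail(
    "sender@example.com", "kunde@example.com", "Preisliste in \u20ac", BODY
)
# The printed headers include MIME-Version, Content-Type with the
# charset parameter, and Content-Transfer-Encoding.
print(msg.as_string())
```

The point is that the data is sent as UTF-8 throughout, with no
per-country conversion; the receiving mail client picks the charset
from the Content-Type header.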

Best wishes,
   Otto Stolz
