Re: UTF-8 code in HTML

From: Addison Phillips [GSC]
Date: Sat Apr 15 2000 - 19:40:29 EDT

Those are some useful ideas, Mark, but I doubt that they'll sway anyone
toward using UTF-8 sooner.

The real issue here, if I understand correctly, is when to switch the
Received Wisdom from "serve legacy" to "serve UTF-8". The relative ability
of browsers is not really at issue here, since it is a problem that is a)
correctable by the user and b) going away anyways.

I actually see this as a server architecture problem. We have *plenty* of
reasons left on the server side not to use UTF-8, even though Unicode
encodings simplify our lives tremendously.

Some examples:

1. Content creators ("page designers", not programmers) need to have
tools that invisibly support UTF-8, including boring old text editors. This
is actually a serious obstacle: if the site is created and maintained in
Latin-1, then we have to maintain a whole infrastructure for just that
purpose. Do NOT tell me that translating 3000 HTML snippets into UTF-8
"automagically" is the answer!
2. Our scripting and CGI languages need fixing. Perl 5.6 has the requisite
support (but it is *brand* new). So many other Web technologies do not. For
example, I've got a guy busy next week lobotomizing PHP using ICU...
3. The JVMs need to be updated. Even NN4.7 still carries around a less than
completely recent JVM. Also next week I have a guy making a chat client that
sends everything to the server as an unprocessed stream (plus a locale tag)
because the server can run J2 and the client has to be able to run J1.0
(basically). One significant issue is that I have to transmit the JVM
version to the server so that more modern JVMs that can do
real-honest-to-betsy-UTF8 can actually do the conversion themselves. You
never know when the JRE is going to be installed...
4. Template language processors need to be updated. Yes, UTF-8 can "sneak
by" the processor in most cases, but what about things like toUpper( )?
Awareness of Unicode is valuable here too.
5. URL encodings, storage, and the like. Some web servers are darned cranky
if the bytes encoded in the %hh escapes don't actually match the file system
byte values. Universal use of UTF-8 here would *really* help.
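
To make points 4 and 5 concrete, here is a minimal Python sketch (Python is my choice for illustration, not something used in the systems described above). It shows a byte-oriented uppercase routine letting UTF-8 "sneak by" untouched, and the same file name producing different %hh escapes depending on the encoding assumed. The sample string and file name are made up for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# Point 4: a byte-oriented processor only knows about ASCII letters.
text = "caf\u00e9"                      # "cafe" with e-acute (sample text)
utf8_bytes = text.encode("utf-8")       # b'caf\xc3\xa9'

# bytes.upper() changes ASCII letters only; the two bytes of the
# e-acute pass through unchanged, so the result is only half uppercased.
bytewise = utf8_bytes.upper().decode("utf-8")

# A Unicode-aware uppercase handles the accented letter as well.
unicode_aware = text.upper()

# Point 5: the same file name yields different %hh escapes per encoding.
file_name = "caf\u00e9.html"            # hypothetical file name
url_utf8 = quote(file_name, encoding="utf-8")
url_latin1 = quote(file_name, encoding="latin-1")

# A server whose file system stores the name as UTF-8 bytes will not
# find a match if the client escaped the name as Latin-1.
bytes_on_disk = file_name.encode("utf-8")
latin1_request_matches = unquote_to_bytes(url_latin1) == bytes_on_disk
utf8_request_matches = unquote_to_bytes(url_utf8) == bytes_on_disk
```

If everything (URLs and file names alike) is UTF-8, the second comparison is the only case that ever arises, which is the point of item 5.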

What I've actually been telling my customers for the last while is that
UTF-8 is coming: it's a matter of time now. Legacy encodings are all very
nice for specific applications, but use Unicode internally to build pages,
interact with your database, etc. One area where I've been unsure has been
whether to convert actual File System Assets (files stored on disk) to
UTF-8, if I later intend to serve them as legacy encoding. For now I've
defaulted to leaving them in their code page and converting everything to match
the container (file). But it is time to actually say "*everything* as UTF-8"
and convert it based on the browser version string as necessary.
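
A rough sketch of what "store everything as UTF-8, convert on the way out" could look like, in Python. This is an illustration under stated assumptions, not a description of any real product: the marker strings in `UTF8_CAPABLE_MARKERS`, the function names, and the Latin-1 fallback are all hypothetical, and a real deployment would need its own, carefully tested browser list.

```python
# Hypothetical substrings marking browsers assumed to display UTF-8 well.
# These markers are illustrative only, not a vetted compatibility list.
UTF8_CAPABLE_MARKERS = ("MSIE 5", "Netscape6")


def choose_charset(user_agent: str, legacy_charset: str = "iso-8859-1") -> str:
    """Pick the charset to serve, based on the browser version string."""
    if any(marker in user_agent for marker in UTF8_CAPABLE_MARKERS):
        return "utf-8"
    return legacy_charset


def render(stored_utf8: bytes, user_agent: str,
           legacy_charset: str = "iso-8859-1"):
    """Return (Content-Type value, body bytes) for content stored as UTF-8."""
    charset = choose_charset(user_agent, legacy_charset)
    text = stored_utf8.decode("utf-8")
    # Characters the legacy charset cannot represent are emitted as
    # numeric character references instead of being dropped.
    body = text.encode(charset, errors="xmlcharrefreplace")
    return f"text/html; charset={charset}", body


# Usage: one stored asset, two different browsers.
asset = "caf\u00e9 \u65e5".encode("utf-8")
modern = render(asset, "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)")
legacy = render(asset, "Mozilla/4.7 [en] (Win98; I)")
```

The design choice here is that the stored asset never changes; only the last step, encoding for the wire, depends on who is asking.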

Let's say that the figures I posted earlier in the week are accurate. If 2/3
of the Mac users run Netscape and 35% of the population runs IE5, then that
is awfully close to half of the eyeballs out there able to view UTF-8
without interruption (in their own language). Another six months plus NN6
should make this case compelling, no?

I think restraint should be used, in the meantime, in telling people simply
that "UTF-8 solves all your problems"... like most I18N issues, it
substitutes one set of problems for another. It's just (as Bill Hall always
says) that "one set of problems is *much* more interesting than the other."

Oh, a fly in Mark's ointment: how many versions of how many browsers would
we have instructions for? IE is maddeningly different from version to
version. Netscape is at least somewhat consistent, but it has several
versions (and more limited localization: do we show the English or the
Japanese version of the browser in Korea?).



Addison P. Phillips
Senior Globalization Consultant
Global Sight Corporation
(+1) 408.350.3600 - Telephone
Going global with your web site? Global Sight provides Web-based software
solutions that simplify the process, cut costs, and save time.

----- Original Message -----
From: Mark Davis
To: Glen Perkins
Cc: Unicode List; Addison Phillips [GSC]
Sent: Saturday, April 15, 2000 1:51 PM
Subject: Re: UTF-8 code in HTML

> One thought:
> 1. Make a simple web page explaining how to set up different browsers with
> the right fonts to read UTF-8.
> 2. Make a button-like GIF that says something like "Display Problems?"
> with a link to the page.
> 3. Get volunteers to translate this page and the text in the GIF into
> multiple languages.
> 4. Post the pages and GIFs on the Unicode site in an accessible area.
> 5. Encourage people to use the linked GIFs on their own sites, and/or copy
> them and modify as they see fit.
> Do you think this kind of thing would help?
> Mark
