Re: UTF-8 code in HTML

From: Glen Perkins (Glen.Perkins@NativeGuide.com)
Date: Sat Apr 15 2000 - 05:19:12 EDT


I wonder how big a problem a typical large corporation would actually face
if they switched from the current "legacy encodings" in each world market to
UTF-8. I'm not wondering if there would be a problem, yes or no, from a
purist perspective. I mean what are the numbers, market by market, of people
who would have problems with UTF-8 vs. the numbers of people who have
problems caused by the current encodings, weighted by the seriousness of
those problems.

For example, how big is the risk of using UTF-8 for the US market? It
seems as though it's probably a little riskier than Latin-1, but is it
really? How much riskier? By "how much", I mean what percentage of visitors
to the site would have a problem with UTF-8 vs. what percentage would have a
problem with Latin-1. It's not as if there are no Latin-1 problems, after
all. If you build a "Latin-1" app server, people will immediately start
shoving CP1252 curly quotes, trademark signs, etc. into it, which will
probably break when served to a Mac or Unix box.
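
To make the curly-quote problem concrete, here is a minimal sketch in Python. The sample string and the helper name c1_controls are my own illustration, not taken from any real app server. It shows that CP1252 punctuation, when its bytes are read as Latin-1, lands in the C1 control range rather than on printable characters:

```python
# Hypothetical input: what a Windows user might paste into a form.
pasted = "\u201cAcme\u2122\u201d"  # curly quotes around Acme, plus a trademark sign

# From a CP1252 client these are single bytes in the 0x80-0x9F range.
raw = pasted.encode("cp1252")

# Anything that believes the bytes are Latin-1 maps them to
# C1 control characters instead of punctuation.
misread = raw.decode("latin-1")


def c1_controls(s):
    """Return the characters of s in the C1 control range U+0080..U+009F."""
    return [ch for ch in s if 0x80 <= ord(ch) <= 0x9F]


print(raw)
print([hex(ord(ch)) for ch in c1_controls(misread)])

# The same text survives a UTF-8 round trip unchanged.
assert pasted.encode("utf-8").decode("utf-8") == pasted
```

What a given Mac or Unix browser then does with those control characters varies, which is exactly why the real-world failure rate is a question of measurement rather than principle.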

Then, what percentage of the French market would have trouble with UTF-8 vs.
Latin-1? You have similar CP1252 problems, plus the Euro issue. What
percentage of browsers would have problems with a well-built UTF-8 page *in
French*, given the actual installed base of browsers in France today?
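
The Euro issue can be sketched the same way. This is only an illustration, with try_encode being a hypothetical helper name: Latin-1 has no code point for the Euro sign at all, while CP1252, ISO 8859-15, and UTF-8 each encode it differently.

```python
EURO = "\u20ac"  # the Euro sign


def try_encode(ch, codec):
    """Return the encoded bytes, or None if the codec cannot represent ch."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


# Compare how each encoding handles the same character.
for codec in ("latin-1", "cp1252", "iso8859-15", "utf-8"):
    print(codec, try_encode(EURO, codec))
```

So a French page that needs the Euro sign is already outside strict Latin-1, whichever alternative it ends up using.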

What of the Polish market? Addison, your point is well taken about the
browsers of non-Polish-OS users needing special setup for viewing UTF-8
encoded Polish text, but for a realistic market analysis, that may not
matter very much. More likely, the questions would be, how would a UTF-8
encoded web page fare in the actual Polish market and, again realistically,
even if it had some problems, how much would it matter, given that the
Polish market is likely to comprise only a very small percentage of your
worldwide viewership. I believe that with a Polish OS and a reasonably
recent browser in default configuration, UTF-8 would work fine. (Correct me
if I'm wrong.) Then, the question would be, how many Polish-speaking users
of non-Polish OSes are there, and would you be targeting them anyway? In
fact, would you have Polish content on your website at all if you couldn't
just piggyback on the app server's UTF-8 infrastructure built for other
markets? After all, your US viewers who still browse with Netscape 1.0 (or
Lynx) may outnumber all of your Polish viewers, and you probably don't make
major design decisions based on the needs of Netscape 1.0 or Lynx users. And
if the Polish user isn't using a Polish OS because he works in Germany, for
example, then maybe your German pages are actually more applicable to him
anyway. If so, then the browser's failure to handle Polish in UTF-8 by
default probably won't matter.

Then there's Japan. Here it appears that a native Japanese speaker using a
Japanese OS and Netscape 4.x in its default state will be unable to render
Japanese encoded in UTF-8, because the default font for UTF-8 is a Western
font. Jungshik is saying that Netscape does "font
switching", if I'm understanding him correctly, which should obviate this
problem, but it was my understanding that this didn't become a feature until
Mozilla. Maybe it's true of Netscape 4.7, but I thought all Netscape 4.x's
in Japan had a "one default font per encoding" limitation, and that a
non-Japanese font was made the UTF-8 default in Japanese Netscape 4.x.

Well, Japan's a big market, so the question then becomes what percentage of
Japanese viewers would have trouble viewing Japanese (not Polish) in UTF-8
vs. what percentage would have various troubles viewing, say, EUC-JP. Not,
"does the problem exist", but to what extent, and how fast is it
disappearing.

Then Korea, Taiwan, China, etc. Are they the same as Japan? By that I mean
1) in the behavior of the browsers, 2) in the current market share of the
browsers, and 3) (hard to measure but important) how much do those users of
UTF-8-challenged browsers really correspond to your target market anyway?

I'd be interested to know if anyone has more detailed info on just what the
current magnitude of the problem would be per market. (Not just standard
browser stats, but the degree to which UTF-8 would work for the native
language on each browser used in that locale/market.) I'd like to see
statistics of this sort tracked on the home page of Unicode.org. We could
have projections of milestones that would certainly make for good PR:
"According to the Unicode Consortium, the percentage of browsers worldwide
that are unable to handle a Unicode page in the user's native language will
drop below 5% by mid-August...." It may be that the benefits are going to
outweigh the remaining problems sooner than even we realize. Since a lot of
people rely on the folks on this list to tell them when UTF-8 is "safe to
use", we ought to keep our eye on these numbers. We don't want the idea that
"the market isn't really ready for UTF-8 to the browser" to become
fossilized as conventional wisdom, carved in stone like "Unicode doubles the
size of all text data", independent of changing market statistics.
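
On the "doubles the size" claim specifically, a rough check is easy to run. The sketch below uses three short sample strings of my own choosing and a hypothetical helper called sizes; it compares byte counts in a legacy encoding, UTF-8, and UTF-16, and is not a measurement of real site content.

```python
# Sample text per market, paired with a typical legacy encoding (illustrative only).
samples = {
    "English": ("Hello, world", "latin-1"),
    "French": ("caf\u00e9 cr\u00e8me", "latin-1"),
    "Japanese": ("\u65e5\u672c\u8a9e", "euc_jp"),
}


def sizes(text, legacy_codec):
    """Return the encoded length in bytes for the legacy codec, UTF-8, and UTF-16."""
    return {
        "legacy": len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte order mark
    }


for market, (text, codec) in samples.items():
    print(market, sizes(text, codec))
```

For these samples the ASCII text is the same size in UTF-8 as in Latin-1, the French text grows by one byte per accented letter, and the Japanese text goes from two bytes per character in EUC-JP to three in UTF-8. Doubling only shows up for ASCII text stored as UTF-16.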

__Glen Perkins__

----- Original Message -----
From: Addison Phillips [GSC]

It's exciting that we're on the cusp of general support for Unicode in the
browsers, operating systems, and languages (perl just got a transfusion, for
example)... and the support will "just be there" without thinking. About
time.

Best regards,

Addison


