RE: Major site in unicode?

From: addison@inter-locale.com
Date: Mon Oct 02 2000 - 13:29:39 EDT


It knows because:

1. You sent the page in that character set; or
2. You embedded a token in the page to tell the CGI program what the
character set was; or
3. You used the (IE-only) hack to get the browser to embed it in a hidden
field; or
4. You guessed it based on a heuristic (or from the user's session
information, maintained in the URL or in a cookie).
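
A minimal sketch of how a CGI handler might combine those four options,
in Python purely for illustration. The field name `page_charset`, the
session key `charset`, and the default are hypothetical, and `_charset_`
is assumed to be the hidden-field name the IE hack relies on. The sketch
also assumes the browser percent-encodes non-ASCII bytes in the query.

```python
from urllib.parse import parse_qs

# Assumption: the charset the form page was delivered in (option 1).
DEFAULT_CHARSET = "iso-8859-1"


def resolve_form_charset(raw_query: bytes, session: dict) -> str:
    """Pick the charset to use when decoding a submitted form."""
    # latin-1 never fails and leaves ASCII field names and
    # percent-escapes unchanged, so it is safe for a first pass.
    fields = parse_qs(raw_query.decode("latin-1"), encoding="latin-1")

    # Options 2 and 3: a token embedded in the page (hypothetical name),
    # or a hidden field filled in by the browser.
    for name in ("page_charset", "_charset_"):
        if fields.get(name):
            return fields[name][0].lower()

    # Option 4: a value remembered in the user's session.
    if session.get("charset"):
        return session["charset"].lower()

    # Option 1: fall back to the charset the page was sent in.
    return DEFAULT_CHARSET


def decode_field(raw_query: bytes, name: str, session: dict) -> str:
    """Return one form field decoded with the resolved charset."""
    charset = resolve_form_charset(raw_query, session)
    fields = parse_qs(raw_query.decode("latin-1"), encoding=charset)
    return fields[name][0]
```

The order of the checks is a design choice: an explicit token in the
submission is trusted over the session, and the session over the default.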

This sounds complex, but it isn't all that bad. Very few users will be
foolish enough to change their display encoding to something that displays
the page incorrectly...

Actually, all this talk of "setting browser to Unicode" and "setting the
browser to code page" is based on a poor assumption or set of
assumptions. What's getting set is the character encoding of the HTML page
itself. If done correctly, the browser will read it from the HTTP header
and/or the META tag.
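
As an illustration only, here is a small Python sketch of a CGI response
that declares the same encoding in both places. The function name
`render_page` and its parameters are hypothetical.

```python
def render_page(body_html: str, charset: str = "utf-8") -> bytes:
    """Build a CGI response whose HTTP header and META tag agree."""
    html = (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={charset}">\n'
        "</head><body>\n"
        f"{body_html}\n"
        "</body></html>\n"
    )
    # The header is plain ASCII; only the body uses the declared charset.
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    return header.encode("ascii") + html.encode(charset)
```

Keeping the header and the META tag generated from one variable avoids
the two declarations drifting apart.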

The current best practice for creating multilingual-capable web sites
(even if they happen to be monolingual at any one URL) is to use Unicode
(either UTF-8 or UTF-16, depending on your operating
environment) internally at the server. A decision can be made to deliver
either UTF-8 or a non-Unicode legacy encoding at page delivery time. At
this point in time, most pages are NOT delivered as UTF-8, even though the
server-side systems are entirely Unicode, because of the problems cited
earlier with older Netscape and IE browsers and their still relatively
large market share.
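
A rough Python sketch of that delivery-time step, assuming the page text
is held as a Unicode string on the server. The helper name
`encode_for_delivery` is hypothetical, and falling back to numeric
character references is one possible policy, not the only one.

```python
def encode_for_delivery(text: str, charset: str) -> bytes:
    """Convert server-side Unicode text to the charset chosen for delivery."""
    # Characters the target charset cannot represent are emitted as
    # numeric character references (for example &#8364;), which an HTML
    # browser can still interpret.
    return text.encode(charset, errors="xmlcharrefreplace")
```

For example, `encode_for_delivery("Grüße", "iso-8859-1")` yields the
legacy bytes, while the same call with `"utf-8"` yields UTF-8.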

Choosing this architecture allows you to construct single-source code
systems, access databases and data warehouses, and build applications in a
locale-independent way. This vastly simplifies maintenance, testing, and
deployment compared to legacy charset systems.

Many programmers, of course, would like to eliminate the complexity of
the charset conversion at delivery time, and this day is coming. I suggest
that you parse User-Agent strings at the start of a session with a user and
determine whether UTF-8 can be sent to the browser (it can in the majority
of cases and the vast majority of Western and Eastern European cases; Asian
locales are the big hangup here), and store the result in the session
(see #4 above).
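
A hedged Python sketch of such a check. The version thresholds are an
assumption taken from the browsers named elsewhere in this thread
(IE 5.0 and Netscape 4.7), and the function names and the session key
`charset` are hypothetical. Real User-Agent strings vary, so treat this
as a starting heuristic.

```python
import re

# Assumption: minimum versions believed to handle UTF-8 acceptably.
MIN_IE = 5.0
MIN_NETSCAPE = 4.7


def can_send_utf8(user_agent: str) -> bool:
    """Guess from a User-Agent string whether UTF-8 is safe to deliver."""
    # IE also announces itself as "Mozilla/4.0", so check MSIE first.
    ie = re.search(r"MSIE (\d+\.\d+)", user_agent)
    if ie:
        return float(ie.group(1)) >= MIN_IE
    ns = re.match(r"Mozilla/(\d+\.\d+)", user_agent)
    if ns:
        return float(ns.group(1)) >= MIN_NETSCAPE
    # Unknown client: stay with the legacy encoding.
    return False


def charset_for_session(session: dict, user_agent: str, legacy: str) -> str:
    """Decide once per session and remember the result."""
    if "charset" not in session:
        session["charset"] = "utf-8" if can_send_utf8(user_agent) else legacy
    return session["charset"]
```

Defaulting unknown clients to the legacy encoding is the conservative
choice; a site with a mostly modern audience might invert it.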

Hope this helps.

Addison

===========================================================
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:addison@inter-locale.com

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 2 Oct 2000, Raghu Kolluru wrote:

> > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > >> possibly 8859-2 as well. Unicode is an ISO standard too (ISO/IEC
> > >> 10646-1).
> > >
> > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > Central European or Western European site is only in Unicode,
> > > then all will show up correctly?
> >
> > If your browser is old enough that it can only be "set to" a single
> > character set, and this setting cannot be overridden by a "charset=X"
> > tag in the HTML page, then no, it will not be displayed correctly.
> > But this sort of rigidity is not present in modern browsers.
>
> How does the CGI program know that the data submitted is of
> "charset=EUC-JP"?
>
> Raghu Kolluru, Software Engg.
> GO.com | Walt Disney Internet Group
> 206-664-4267 | raghu.kolluru@dig.com
>
>
>
> > -----Original Message-----
> > From: Doug Ewell [mailto:dewell@compuserve.com]
> > Sent: Sunday, October 01, 2000 11:48 PM
> > To: Unicode List
> > Subject: Re: Major site in unicode?
> >
> >
> > >> I assume that "the ISO standard" refers to ISO/IEC 8859-1 and
> > >> possibly 8859-2 as well. Unicode is an ISO standard too (ISO/IEC
> > >> 10646-1).
> > >
> > > So if my browser is set to ISO 8859-1 or ISO 8859-2, but a
> > > Central European or Western European site is only in Unicode,
> > > then all will show up correctly?
> >
> > If your browser is old enough that it can only be "set to" a single
> > character set, and this setting cannot be overridden by a "charset=X"
> > tag in the HTML page, then no, it will not be displayed correctly.
> > But this sort of rigidity is not present in modern browsers.
> >
> > >> The browser you are thinking of is Netscape Navigator (pre-4.7).
> > >> Support for Unicode in all browsers is improving steadily, and as
> > >> it does, your 'adamant' programmers will end up using
> > >> Unicode-encoded sites without even realizing it.
> > >
> > > When? 5 years from now? As for using Unicode without realizing
> > > it, what do you mean? If a Russian's browser is set to CP1251, what
> > > happens if the site is in Unicode? At present he gets garbage. I've
> > > tried the setting that automatically changes to the character set of
> > > the page. Doesn't work very well. I think the character set
> > > indication gets left out in many sites.
> >
> > Browsers are supposed to be able to switch automatically to the
> > character set used by the target page, but they cannot necessarily do
> > this blindly by auto-detecting the character set. It is supposed to be
> > indicated by the page using the "charset=X" tag. Sites that do not do
> > this are not giving browsers a fair chance to display the page
> > properly. This is not the fault of Unicode or the browser, but of the
> > HTML author.
> >
> > > I don't disagree with this. It's just at present moment, Netscape
> > > and Explorer don't seem ready. What would really be needed is the
> > > browser automatically detects the site as being in Unicode, and
> > > switches to that character set. Then sites could switch over without
> > > worry. That is not the case at the moment. So the user has to
> > > change the character set himself.
> >
> > Try using a recent version of your favorite browser (IE version 5.0 or
> > above, or NN version 4.7 or above).
> >
> > I think the real problem here is that you, your team, and your users
> > in Russia are working with older versions of software that did not
> > properly handle Unicode, and are assuming that newer versions will not
> > support Unicode either. Thankfully, this is not the case.
> >
> > -Doug Ewell
> > Fullerton, California
> >
>


