RE: Unicode on a non-Unicode web page

From: Helena Shih (hshih@jtcsv.com)
Date: Fri Sep 08 2000 - 18:45:15 EDT


Hi Paul. I am curious to know whether

1. the ICU conversion code is buggy, or
2. the XMLConverter sample is buggy.

If you can kindly point out the bugs in the ICU code to us, we would really
appreciate it. The XMLConverter sample was not designed or coded to be robust
or easy to use, so I would recommend using the uconv application instead. It
is in the 'icuapps' module, which is also checked into the CVS repository
for ICU.

Please feel free to submit your bug report to us at
http://oss.software.ibm.com/developerworks/opensource/icu/bugs. Thank you!!

-----Original Message-----
From: Paul Deuter [mailto:Paul.Deuter@plumtree.com]
Sent: Thursday, September 07, 2000 1:47 PM
To: Unicode List
Subject: RE: Unicode on a non-Unicode web page

Your question is essentially "How do I mix characters encoded in more than
one character set on a single page?"

A normal page has one document and that one document will expect characters
to be encoded in the character set specified in the meta tag in the header.
It is possible to have a compound document consisting of one or more
documents each in its own FRAME. Each frame will have its own header and
therefore can have a different character set than the main page (see example
below). It is also possible to use IFRAMEs which also have their own
header. IFRAMEs however are not supported by Netscape. These are the only
ways I know of using multiple character sets on one page.

Finally, you also have the solution already suggested: encode everything
as UTF-8 and use that as your main character set. I don't know of an easy
way of converting 8859-2 to UTF-8. One hard way is to open the file in
Notepad on Windows 2000, on a machine that has 8859-2 as the ANSI character
set, and save it as UTF-8. There is also an XMLConverter program that comes
with the ICU source, but I have found it to be buggy.
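
For reference, the conversion itself amounts to a decode followed by an
encode. The following is only a minimal sketch in Python, assuming a Python 3
interpreter is at hand. The names latin2_to_utf8 and retag_charset are
hypothetical, and retag_charset is deliberately naive: it only matches the
exact spelling "charset=ISO-8859-2".

```python
def latin2_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-2 bytes and re-encode the same text as UTF-8."""
    return data.decode("iso-8859-2").encode("utf-8")


def retag_charset(html: str) -> str:
    """Naive helper (an assumption): make the declared charset match."""
    return html.replace("charset=ISO-8859-2", "charset=UTF-8")


# "Článek" as ISO-8859-2 bytes: Č is 0xC8 and á is 0xE1 in that charset.
latin2_bytes = b"\xc8l\xe1nek"
utf8_bytes = latin2_to_utf8(latin2_bytes)
print(utf8_bytes)
```

The meta tag has to be updated along with the bytes, otherwise the browser
will still interpret the converted page as Latin 2.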

FRAMES example:

<HTML>
<HEAD>
<TITLE>Simple set of frames</TITLE></HEAD>
<FRAMESET COLS="*,500">
   <FRAME SRC="FRM1.HTM">
   <FRAME SRC="FRM2.HTM">
</FRAMESET>
</HTML>

FRM1.HTM:
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=shift-jis">
<meta http-equiv="Content-Language" content="ja">
<TITLE>Frame 1 HTML</TITLE></HEAD>
<BODY>
<P>
Japanese 'Ü'ę'É Text
</P>
</BODY>
</HTML>

FRM2.HTM:
<HTML>
<HEAD>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
<meta http-equiv="Content-Language" content="ru">
<TITLE>Frame 2 HTML</TITLE></HEAD>
<BODY>
<P>
Russian енгенгенг Text
</P>
</BODY>
</HTML>
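
The point of the frames approach is that each frame file must actually be
stored in the byte encoding its own meta tag declares. A rough sketch of
generating the two frame files that way in Python follows. The names
FRAME_TEMPLATE, build_frame, and write_frames are hypothetical, and the
sample strings "日本語" and "Русский" are placeholders chosen for
illustration, not the text of the original files.

```python
from pathlib import Path

FRAME_TEMPLATE = (
    "<HTML>\n<HEAD>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset={charset}">\n'
    '<meta http-equiv="Content-Language" content="{lang}">\n'
    "<TITLE>{title}</TITLE></HEAD>\n"
    "<BODY>\n<P>\n{text}\n</P>\n</BODY>\n</HTML>\n"
)


def build_frame(title: str, charset: str, lang: str, text: str) -> bytes:
    """Render one frame document and encode it in the charset it declares."""
    html = FRAME_TEMPLATE.format(charset=charset, lang=lang, title=title, text=text)
    return html.encode(charset)


# Placeholder content (an assumption), one charset per frame.
frames = {
    "FRM1.HTM": build_frame("Frame 1 HTML", "Shift_JIS", "ja", "Japanese 日本語 Text"),
    "FRM2.HTM": build_frame("Frame 2 HTML", "windows-1251", "ru", "Russian Русский Text"),
}


def write_frames(directory: str) -> None:
    """Write the frames as raw bytes so no further re-encoding takes place."""
    for name, payload in frames.items():
        Path(directory, name).write_bytes(payload)
```

Writing the files as bytes keeps the stored encoding and the declared charset
in step.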

Paul Deuter
paul.deuter@plumtree.com

-----Original Message-----
From: Gary P. Grosso [mailto:gpg@arbortext.com]
Sent: Thursday, September 07, 2000 7:32 AM
To: Unicode List
Subject: Unicode on a non-Unicode web page

Hi Unicoders,

I am working on software to emit HTML in the encoding
and character set of the user's choice, from SGML/XML
documents which can contain any Plane 1 Unicode character.
The question is what to do with characters outside the
selected encoding. I thought I would use numeric character
references, and IE5 at least seems to render them well, but
Netscape Communicator 4.6 doesn't.
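
As an illustration of that approach, here is a small sketch in Python,
whose standard "xmlcharrefreplace" error handler writes a decimal numeric
character reference for every character the target charset cannot
represent. The function name encode_with_ncr is hypothetical, and this is
not the code of the software described here.

```python
def encode_with_ncr(text: str, charset: str) -> bytes:
    """Encode text, emitting &#NNNN; for characters outside the charset."""
    return text.encode(charset, errors="xmlcharrefreplace")


# Czech text fits in ISO-8859-2; DJE, GAMMA, and HIRAGANA KA do not.
page_text = "Článek \u0402 \u0393 \u304b"
encoded = encode_with_ncr(page_text, "iso-8859-2")
print(encoded)
```

Characters that the charset covers are written as ordinary bytes, so only the
isolated outsiders become references.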

One way to look at this is: how do I use Unicode as an
"escape" to include some isolated content on a web page
of arbitrary encoding?

For example, I have something such as:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Unicode in a Latin 2 page</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
</head>
<body style="line-height: 16pt"><div class="pgbrk" style="padding-top: 48pt">
<p>Článek Úvod Žádný čest čin činěn činů činům činnost činnosti
jakmile jako jakož jakožto jazyka jež jediné jednat jednotkou
jednotlivec</p>
<p>CYRILLIC CAPITAL LETTER DJE: &#1026;</p>
<p>CAPITAL LETTER GAMMA: &#x0393;</p>
<p>HIRAGANA LETTER KA: &#12363;</p>
<p>jeho jejich jemu jimi jiného jinému jiných jiným jinými jsou každému každý</p>
</div>
</body>
</html>

which probably looks awful since your email client is not likely
set to display Latin 2, but which can also be seen at:

http://www.angelfire.com/mi/virtualattic/latin2_test.html

If I change the meta tag to:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
then Netscape does slightly better (still stumbles over &#x-anything
and doesn't display the hiragana, but does display the DJE and GAMMA
if I use decimal values) but of course now the Czech words are not
displayed properly.
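
Since the decimal form fares better in Netscape than the &#x form, one
workaround is to rewrite hexadecimal references as decimal ones before
emitting the page. A minimal sketch in Python, with the hypothetical name
hex_ncr_to_decimal, assuming the references are well formed:

```python
import re

# Matches hexadecimal references such as &#x0393; (either case of x).
_HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def hex_ncr_to_decimal(html: str) -> str:
    """Rewrite &#xHHHH; references as their decimal &#NNNN; equivalents."""
    return _HEX_NCR.sub(lambda match: "&#%d;" % int(match.group(1), 16), html)


line = "<p>CAPITAL LETTER GAMMA: &#x0393;</p>"
print(hex_ncr_to_decimal(line))
```

Decimal references and all other markup pass through unchanged.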

My question(s):

Is there some way I can nudge Netscape's browser to display these?

Is there a better way to write this admittedly mongrel HTML content?
I have heard somewhere that it is possible to change the charset choice
"on the fly", and if that would work, I would appreciate a pointer to
somewhere that says how best to do this.

Thanks in advance for any insights.

---
Gary Grosso
ggrosso@arbortext.com
Arbortext, Inc.
Ann Arbor, MI, USA


