re: encoding??????

From: Sandeep Krishna (sandeepkrishna@noida.hcltech.com)
Date: Thu Sep 28 2000 - 06:07:50 EDT


Hi,
thanks a lot for providing solutions to many of my problems......

but i shall take the liberty to ask some more....

* actually i have been trying to use ASPs (UTF-8 encoding..) to write
unicode characters to an Oracle DB table (varchar2 field)... and then
retrieve them back..
(i used UTF-8 encoding both for writing to the database and for
retrieving and displaying..)

there were some amazing observations...

* each unicode character was taking 7 bytes in the database (instead of
the expected 2 or 3...)
* some unicode characters (or rather code points) like 'F95F', when encoded
in UTF-8, were being encoded as EF A5 BF when they should have been encoded
as EF A5 9F.. in fact many unicode characters whose encoded form had to have
a byte in the range (80..9F) were somehow getting that byte changed to BF ...
thus resulting in incorrect retrieval....
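
As a quick check of the byte values quoted above, the following Python sketch encodes U+F95F as UTF-8 and then imitates the reported corruption. The helper `simulate_lossy_step` is hypothetical: it only assumes that some conversion step replaces every byte in 80..9F with BF, which matches the symptom described but is not a confirmed diagnosis of the ASP/Oracle setup.

```python
def simulate_lossy_step(data: bytes) -> bytes:
    """Hypothetical step (an assumption, not a known Oracle routine):
    replace every byte in 0x80..0x9F with 0xBF."""
    return bytes(0xBF if 0x80 <= b <= 0x9F else b for b in data)


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


ch = "\uF95F"                        # the code point from the message
expected = ch.encode("utf-8")        # correct UTF-8 form
corrupted = simulate_lossy_step(expected)
wrong = corrupted.decode("utf-8")    # what would be read back

print(hex_bytes(expected))           # EF A5 9F
print(hex_bytes(corrupted))          # EF A5 BF
print(f"U+{ord(wrong):04X}")         # U+F97F
```

Note that the damaged sequence is still well-formed UTF-8, so it decodes without any error, only to a different code point (U+F97F instead of U+F95F). That would explain why retrieval looks "successful" but returns the wrong character.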

I was unable to find the reasons for these strange occurrences....

THEN SOMEONE SUGGESTED CHANGES IN THE REGISTRY.....

actually the registry entries for oracle show 3 entries for NLS_LANG,
and that too at both the WEB SERVER end and the DATABASE SERVER end.....
so that makes tooooooo many combinations...

AND FINALLY HOW DOES THE CHANGING OF REGISTRY TO AMERICAN_AMERICA.UTF8
IMPACT THE DATABASE STORAGE OR DISPLAY PROCESS??????

kindly suggest......

thanks and regards,

Sandeep

----- Original Message -----
From: <Jukka.Korpela@hut.fi>
To: Unicode List <unicode@unicode.org>
Sent: Thursday, September 28, 2000 2:18 PM
Subject: Re: Encoding????????????

On Wed, 27 Sep 2000, Sandeep Krishna wrote:

> can someone tell me...what does the Encoding in the browser (IE5)
> imply.....

That's a good question. Internet Explorer 5 is relatively advanced in the
area of handling different encodings. It seems to honor the encoding
("charset") as advertized in HTTP headers, and it seems to try to make
an educated guess (based on the actual content of data) when no encoding
is specified. The details are somewhat obscure and undocumented, though.

IE 5 _also_ lets the user override its guess of the encoding; a good thing
to do, since quite a few pages are still sent without proper designation
of the encoding. The Encoding menu on IE 5 has a dual purpose: you can check
from it what encoding the browser has assumed when interpreting the data
(either from the HTTP headers, or from META tags which try to simulate
them, or by heuristic guessing, or by user's explicit selection) - you'll
see that alternative checked - or you can make your own guess of what
the encoding really is.
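
To make the "encoding as advertised in HTTP headers" part concrete, here is a minimal Python sketch that pulls the charset parameter out of a Content-Type header value. The function name `charset_from_content_type` and the sample header values are made up for illustration; this says nothing about how IE 5 itself does it, and it does not attempt the META-tag or heuristic fallbacks.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the charset named in a Content-Type header value
    (lower-cased), or None when the header does not name one."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html; charset=Big5"))   # big5
print(charset_from_content_type("text/html"))                 # None
```

The last case is the one where a browser has to fall back on META tags, guessing, or the user's choice from the menu.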

> does it mean that the Encoding (say UTF-8 or Chinese Big5) shall be
> used for encoding/ decoding any data ..(page) to be displayed or
> sent....

Basically, for interpreting the data that the browser has to display.
(It may affect e.g. how data sent via forms is encoded by the browser, but
I've never studied that side of the matter.)

> i mean if i use an encoding like Big5 ....
> how does it encode a chinese character...similar to utf-8 or
> differently..???

Do you mean as an IE 5 user, or as a Web document author? If you, as a
user, change the selection in the Encoding menu, you're telling the
browser to treat the data according to that encoding. Whether it makes
sense depends on how the data has actually been encoded.
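
A small Python sketch of what "treat the data according to that encoding" means in practice: the same bytes are decoded under three different assumptions, and only the one that matches how the data was actually encoded gives back the original text. The sample string is an arbitrary choice for illustration.

```python
text = "中文"                    # sample text, chosen for illustration
data = text.encode("big5")      # the bytes as a Big5 page would send them

# Correct assumption: the original text comes back.
as_big5 = data.decode("big5")

# Wrong assumption (UTF-8): the bytes are not valid UTF-8, so the
# decoder substitutes U+FFFD replacement characters.
as_utf8 = data.decode("utf-8", errors="replace")

# Wrong assumption (Latin-1): every byte maps to some character,
# so there is no error, just unrelated text.
as_latin1 = data.decode("latin-1")

print(as_big5 == text)          # True
print(as_utf8 == text)          # False
print(as_latin1 == text)        # False
```

Changing the selection in the menu corresponds to switching the codec name in `decode`; it never changes the bytes themselves.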

Big5, also known as "Traditional Chinese", is not a Unicode encoding at
all, so it is surely different from UTF-8. For a short characterization of
Big5, see http://www.dpliv.com/nckuaa/tech/bg5hist.html

> and can i display a Korean character... using big5???

Depends on what you mean by a "Korean character". I suppose you mean
Hangul. As far as I know, Big5 doesn't contain them. For Hangul, you
can use either some Korean standard, or Unicode (see part "10.4 Hangul"
in the Unicode standard). There are various practical considerations.
For example, software used by people in Korea might be better equipped
to handle data encoded according to a Korean standard. People elsewhere
might cope with Unicode encoded data better. (For example, my IE 4,
in a fairly vanilla Windows environment with a few Unicode fonts
installed, can display UTF-8 encoded Korean texts just fine - I hope I
could just understand them! - but doesn't seem to be able to handle any
Korean encoding.) To gain maximum audience, as a Web author, you might
consider making your documents available in both (or several) encodings,
and link them together (for obvious reasons, with link texts in plain
Ascii, which probably means you'd have to use plain English) so that
people can try the other version if the first one is illegible.
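
One way to check the repertoire question is to try encoding a Hangul string with each candidate encoding. The Python sketch below uses the codecs that ship with Python (`big5`, `euc_kr`, `utf-8`) as stand-ins for "Big5", "some Korean standard" and "Unicode"; the sample string and the `results` dictionary are illustrative only.

```python
hangul = "한글"   # two Hangul syllables, chosen for illustration

results = {}
for codec in ("big5", "euc_kr", "utf-8"):
    try:
        results[codec] = hangul.encode(codec)
    except UnicodeEncodeError:
        # The encoding has no representation for these characters.
        results[codec] = None

for codec, encoded in results.items():
    if encoded is None:
        print(f"{codec}: cannot represent the text")
    else:
        print(f"{codec}: {len(encoded)} bytes")
```

With Python's codec tables, Big5 fails while EUC-KR and UTF-8 both succeed, which is in line with offering Korean text either in a Korean encoding or in a Unicode encoding.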

> pls explain the Encoding.part??????

It's a somewhat confusing issue, and not directly related to Unicode
(though it naturally affects Unicode encodings too). If you mean the
encoding concept in general, perhaps my
http://www.hut.fi/u/jkorpela/chars.html#encoding
illustrates it a bit; that tutorial of mine contains references to what you
might find more readable presentations of the topic - mine is somewhat
technical (and, to be honest, it does not use _quite_ the same terminology
as the Unicode standard). If you mean encoding as relevant to HTML
documents on the World Wide Web, then I'd refer to my attempt at simple
instructions on using "extended" character repertoires there
http://www.hut.fi/u/jkorpela/HTML/chars.var
But it's still somewhat confusing I'm afraid. There's an authoritative
(though, I'm afraid, perhaps even more confusing) presentation at
http://www.w3.org/TR/html4/charset.html#h-5.2
http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4

P.S. Please send only plain text to mailing lists, not multipart messages
in plain text and "HTML" (cf.
http://extra.newsguy.com/%7eschramm/nhtml.html ).

--
Yucca, http://www.hut.fi/u/jkorpela/ or http://yucca.hut.fi/yucca.html



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT