Re: encoding??????

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Thu Sep 28 2000 - 07:21:10 EDT


Hello Sandeep,

One important issue you have to account for there: if you are using ASP and
your script is writing to a database, then UTF-8 is not involved in your
script. ASP *code* is always in UCS-2/UTF-16 encoding. Period. If you set
the @CODEPAGE/Session.CodePage to 65001 it will affect data you send to the
browser or retrieve from the browser, but not things that go through COM
interfaces, such as the ADO (via ODBC) that you are using in script to talk
to the database.
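
To make the distinction concrete, here is a minimal sketch in Python (not
ASP) of the same code point in the two forms involved. It assumes nothing
about your setup; the sample character U+F95F is the one from your message,
and the helper name hex_bytes is made up for illustration.

```python
# Illustration only: one code point, two encodings.
# UTF-16 is the form script strings take and the form that crosses COM
# interfaces; UTF-8 is what codepage 65001 produces for the browser.

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)

text = "\uF95F"  # the code point from the original question

utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
utf8 = text.encode("utf-8")

print(hex_bytes(utf16))  # 5F F9
print(hex_bytes(utf8))   # EF A5 9F
```

Only the second form is what @CODEPAGE/Session.CodePage controls; the first
is what gets handed to ADO.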

The flawed conversion to UTF-8 is therefore actually happening in the ODBC
layer, which is where such conversions take place. If
you are using Windows 2000, then your ODBC drivers should be recent enough
(I assume you are using Windows 2000 since it is not really possible to
encode data as UTF-8 in ASP on NT4). You might want to try using the Oracle
OLE DB provider, instead, as it should do a much better job on Unicode text.
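
The EF A5 9F -> EF A5 BF symptom you describe can be reproduced with a
small Python sketch. This is only a guess at the mechanism, not a
description of what the driver really does: it assumes some layer treats
each UTF-8 byte as a character in a single-byte character set that has
nothing assigned in the range 80..9F, and substitutes BF for anything it
cannot map. The function name lossy_single_byte_pass is hypothetical.

```python
# Sketch under a stated assumption: a conversion layer that cannot map
# bytes 0x80..0x9F and substitutes 0xBF for each of them.

def lossy_single_byte_pass(data: bytes, replacement: int = 0xBF) -> bytes:
    """Replace every byte in 0x80..0x9F with the replacement byte."""
    return bytes(replacement if 0x80 <= b <= 0x9F else b for b in data)

def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)

good = "\uF95F".encode("utf-8")
bad = lossy_single_byte_pass(good)

print(hex_bytes(good))  # EF A5 9F
print(hex_bytes(bad))   # EF A5 BF

# The damaged bytes are still well-formed UTF-8, so they decode without
# error, but to a different code point (U+F97F instead of U+F95F).
print(f"U+{ord(bad.decode('utf-8')):04X}")  # U+F97F
```

Since the damaged sequence is still valid UTF-8, nothing fails on the way
back out; you simply retrieve the wrong character.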

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

----- Original Message -----
From: "Sandeep Krishna" <sandeepkrishna@noida.hcltech.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Thursday, September 28, 2000 3:03 AM
Subject: re: encoding??????

> Hi,
> thanks a lot for providing solutions to many of my problems......
>
> but i shall take the liberty to ask some more....
>
> * actually i have been trying to use ASPs (UTF-8 encoding..) to write
> unicode characters to an Oracle DB table (varchar2 field)... and then
> retrieve them back..
> (i used UTF-8 encoding for both writing to the database and also for
> retrieving and displaying..)
>
> there were some amazing observations...
>
> * each unicode character was taking 7 bytes in the database. (instead of
> the expected 2 or 3...)
> * some unicode characters (or rather code points) like 'F95F' were being
> encoded in UTF-8 as EF A5 BF, when they should have been encoded as
> EF A5 9F.. in fact many unicode characters whose encoded form had to have a
> byte in the range (80..9F) were somehow having that byte changed to BF ...
> thus resulting in incorrect retrieval....
>
> I was unable to find the reasons for these strange occurrences....
>
> THEN SOMEONE SUGGESTED CHANGES IN THE REGISTRY.....
>
> actually the registry entries for oracle shows 3 entries for NLS_LANG.
> and that too at the WEB SERVER end and at the DATABASE SERVER end.....
> so that makes tooooooo many combinations...
>
> AND FINALLY HOW DOES THE CHANGING OF REGISTRY TO AMERICAN_AMERICA.UTF8
> IMPACT THE DATABASE STORAGE OR DISPLAY PROCESS??????
>
> kindly suggest......
>
> thanks and regards,
>
> Sandeep
>
>
> ----- Original Message -----
> From: <Jukka.Korpela@hut.fi>
> To: Unicode List <unicode@unicode.org>
> Sent: Thursday, September 28, 2000 2:18 PM
> Subject: Re: Encoding????????????
>
>
> On Wed, 27 Sep 2000, Sandeep Krishna wrote:
>
> > can someone tell me...what does the Encoding in the browser (IE5)
> > imply.....
>
> That's a good question. Internet Explorer 5 is relatively advanced in the
> area of handling different encodings. It seems to honor the encoding
> ("charset") as advertized in HTTP headers, and it seems to try to make
> an educated guess (based on the actual content of data) when no encoding
> is specified. The details are somewhat obscure and undocumented, though.
>
> IE 5 _also_ lets the user override its guess of the encoding; a good thing
> to do, since quite many pages are still sent without proper designation
> of the encoding. The Encoding menu on IE 5 has dual purpose: you can check
> from it what encoding the browser has assumed when interpreting the data
> (either from the HTTP headers, or from META tags which try to simulate
> them, or by heuristic guessing, or by user's explicit selection) - you'll
> see that alternative checked - or you can make your own guess of what
> the encoding really is.
>
> > does it mean that the Encoding (say UTF-8 or Chinese Big5) shall be
> > used for encoding/ decoding any data ..(page) to be displayed or
> > sent....
>
> Basically, for interpreting the data that the browser has to display.
> (It may affect e.g. how data sent via forms is encoded by the browser, but
> I've never studied that side of the matter.)
>
> > i mean if i use an encoding like Big5 ....
> > how does it encode a chinese character...similar to utf-8 or
> > differently..???
>
> Do you mean as an IE 5 user, or as a Web document author? If you, as a
> user, change the selection in the Encoding menu, you're telling the
> browser to treat the data according to that encoding. Whether it makes
> sense depends on how the data has actually been encoded.
>
> Big5, also known as "Traditional Chinese" is not a Unicode encoding at
> all, so it is surely different from UTF-8. For a short characterization of
> Big5, see http://www.dpliv.com/nckuaa/tech/bg5hist.html
>
> > and can i display a Korean character... using big5???
>
> Depends on what you mean by a "Korean character". I suppose you mean
> Hangul. As far as I know, Big5 doesn't contain them. For Hangul, you
> can use either some Korean standard, or Unicode (see part "10.4 Hangul"
> in the Unicode standard). There are various practical considerations.
> For example, software used by people in Korea might be better equipped
> to handle data encoded according to a Korean standard. People elsewhere
> might cope with Unicode encoded data better. (For example, my IE 4,
> in a fairly vanilla Windows environment with a few Unicode fonts
> installed, can display UTF-8 encoded Korean texts just fine - I hope I
> could just understand them! - but doesn't seem to be able to handle any
> Korean encoding.) To gain maximum audience, as a Web author, you might
> consider making your documents available in both (or several) encodings,
> and link them together (for obvious reasons, with link texts in plain
> Ascii, which probably means you'd have to use plain English) so that
> people can try the other version if the first one is illegible.
>
> > pls explain the Encoding.part??????
>
> It's a somewhat confusing issue, and not directly related to Unicode
> (though it naturally affects Unicode encodings too). If you mean the
> encoding concept in general, perhaps my
> http://www.hut.fi/u/jkorpela/chars.html#encoding
> illustrates a bit; that tutorial of mine contains references to
> presentations which you might find more readable - mine is somewhat
> technical (and, to be honest, it does not use _quite_ the same terminology
> as the Unicode standard). If you mean encoding as relevant to HTML
> documents on the World Wide Web, then I'd refer to my attempt at simple
> instructions on using "extended" character repertoires there
> http://www.hut.fi/u/jkorpela/HTML/chars.var
> But it's still somewhat confusing I'm afraid. There's an authoritative
> (though, I'm afraid, perhaps even more confusing) presentation at
> http://www.w3.org/TR/html4/charset.html#h-5.2
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.4
>
> P.S. Please send only plain text to mailing lists, not as multipart
> in plain text and "HTML" (cf.
> http://extra.newsguy.com/%7eschramm/nhtml.html ).
>
> --
> Yucca, http://www.hut.fi/u/jkorpela/ or http://yucca.hut.fi/yucca.html
>
>


