From: Yung-Fong Tang (ftang@netscape.com)
Date: Thu Mar 13 2003 - 19:54:29 EST
I have not touched Java for years (probably 5 years), so I could be wrong.
Jain, Pankaj (MED, TCS) wrote:
> Hi ftang/james..
>
> Thanks for the detailed explanation. Now I know the root problem of my
> error.
>
> I have the following string in the database as Long, in which the special
> character (?) is equivalent to ndash (-):
>
> E8C ? 6 to 10
>
> I am using the following code to write the string from the database to a
> property file, and in the property file I am getting the following string:
>
> value= E8C \uFFE2\uFF80\uFF93 6 to 10
>
> And as \uFFE2\uFF80\uFF93 is not equivalent to ndash, I am not able to
> figure out why it appears in the property file.
>
> Do I need to specify an encoding such as UTF-8 in my Java program?
>
> Please let me know where the problem is.
>
> Here is my code..
>
> while(rsResult.next())
>
> {
>
> /*Get the file contents from the value column*/
>
> ipStream = rsResult.getBinaryStream("VALUE");
>
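What is rsResult? A Blob?
You probably need to wrap the InputStream in a BufferedInputStream and a
DataInputStream, and use readChar or readUTF from the DataInput interface
instead.
See http://www.webdeveloper.com/java/java_jj_read_write.html and
http://java.sun.com/j2se/1.4/docs/api/java/io/DataInputStream.html#readUTF()
for more info.
(Note that readUTF expects Java's modified UTF-8 with a two-byte length
prefix, so if the column holds plain UTF-8 text, an InputStreamReader
created with the "UTF-8" encoding may be a better fit.)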
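A minimal sketch of reading the stream as characters rather than bytes,
assuming the VALUE column really holds plain UTF-8 bytes (the E2 80 93
sequence suggests it does). The class name, the readUtf8 helper and the
in-memory test data are made up for illustration; in the real code the
stream would come from rsResult.getBinaryStream("VALUE").

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class Utf8StreamReading {

    // Hypothetical helper: decodes the whole stream as UTF-8 into a String.
    static String readUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, "UTF-8");
        StringBuffer buf = new StringBuffer();
        int ch;
        // Reader.read() returns one decoded UTF-16 code unit, or -1 at end of stream.
        while ((ch = reader.read()) != -1) {
            buf.append((char) ch);
        }
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the database stream: "E8C ", then E2 80 93, then " 6 to 10".
        byte[] stored = { 'E', '8', 'C', ' ', (byte) 0xE2, (byte) 0x80, (byte) 0x93,
                          ' ', '6', ' ', 't', 'o', ' ', '1', '0' };
        String value = readUtf8(new ByteArrayInputStream(stored));
        // The three UTF-8 bytes collapse into a single char at index 4.
        System.out.println(Integer.toHexString(value.charAt(4))); // prints 2013
    }
}
```

With the value decoded this way, the string passed to prop.setProperty
would contain one EN DASH character instead of three unrelated ones.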
> strBuf = new StringBuffer();
>
> while((chunk = ipStream.read())!=-1)
>
> {
>
> byte byChunk = new Integer(chunk).byteValue();
>
> strBuf.append((char) byChunk);
>
> }
>
Here is your problem: you read the stream byte by byte. Each byte of the
UTF-8 sequence ends up as a separate Java char, instead of the three bytes
being decoded into one character.
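To illustrate why the property file shows exactly \uFFE2\uFF80\uFF93, here
is a small sketch that repeats the narrowing and cast from the quoted loop
on the three UTF-8 bytes of EN DASH. The class and method names are
hypothetical; only the byte and char casts mirror the quoted code.

```java
public class ByteCastDemo {

    // Mirrors the quoted loop: narrow the int from read() to a byte, then cast to char.
    static char castLikeQuotedLoop(int chunk) {
        byte byChunk = (byte) chunk; // 0xE2 becomes -30, because byte is signed
        return (char) byChunk;       // sign-extended to 0xFFE2, not 0x00E2
    }

    // Formats a char as a backslash-u escape with four uppercase hex digits.
    static String escape(char c) {
        String hex = Integer.toHexString(c).toUpperCase();
        while (hex.length() < 4) {
            hex = "0" + hex;
        }
        return "\\u" + hex;
    }

    public static void main(String[] args) {
        int[] utf8EnDash = { 0xE2, 0x80, 0x93 };
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < utf8EnDash.length; i++) {
            out.append(escape(castLikeQuotedLoop(utf8EnDash[i])));
        }
        // Prints the three escapes FFE2, FF80 and FF93 seen in the property file.
        System.out.println(out);
    }
}
```

Every byte with the high bit set is negative as a Java byte, so the cast to
char fills the upper 8 bits with ones. That is where the FF prefix on each
of the three characters comes from.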
> prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
>
> }
>
> /*Write to o/p stream*/
>
> //opFile = new FileOutputStream(strFileName+".properties");
>
> opFile = new FileOutputStream(strFileName);
>
> /*Store the Properties files*/
>
> prop.store(opFile, "Resource Bundle created from Database View
> "+vctView.get(i));
>
>
> Thanks
>
> -Pankaj
>
> -----Original Message-----
> From: ftang@netscape.com [mailto:ftang@netscape.com]
> Sent: Tuesday, March 11, 2003 6:09 PM
> To: Jain, Pankaj (MED, TCS)
> Cc: 'jameskass@att.net'; 'unicode@unicode.org'
> Subject: Re: Unicode character transformation through XSLT
>
>
> Because the following steps got applied to your Unicode data:
>
> 1. The \u escapes are converted to Unicode, so
>
>\uFFE2\uFF80\uFF93
>
> becomes
> three Unicode characters:
>
>U+FFE2, U+FF80, U+FF93
>
> This is OK.
> 2. Some code that throws away the high 8 bits got applied to your data, so
> it became 3 bytes:
> E2 80 93
>
> 3. Some code then treats those bytes as UTF-8 and tries to convert them
> to UCS2 again, so
>
> E2 = 1110 0010 and the right most 4 bits 0010 will be used for UCS2
> 80 = 1000 0000 and the right most 6 bits 00 0000 will be used for UCS2
> 93 = 1001 0011 and the right most 6 bits 01 0011 will be used for UCS2
>
> [0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
> U+2013 is EN DASH
>
> So... there is something very, very bad in your code which will
> corrupt your data.
> Steps 2 and 3 are very bad. You probably need to find out where
> they are and remove that code.
>
> Read my paper at
> http://people.netscape.com/ftang/paper/textintegrity.html
> Your Java code probably has one or two of the bugs listed in my
> paper.
>
> Jain, Pankaj (MED, TCS) wrote:
>
>>James,
>>Thanks, it's working for me now.
>>But I still wonder why \uFFE2\uFF80\uFF93 gives an ndash in
>>HTML.
>>If you have any information on this, then please let me know.
>>
>>Thanks
>>-Pankaj
>>
>>-----Original Message-----
>>From: jameskass@att.net [mailto:jameskass@att.net]
>>Sent: Monday, March 10, 2003 7:59 PM
>>To: Jain, Pankaj (MED, TCS)
>>Cc: 'unicode@unicode.org'
>>Subject: Re: Unicode character transformation through XSLT
>>
>>
>>.
>>Pankaj Jain wrote,
>>
>>
>>
>>>My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
>>>from resource bundle property file which is equivalent to ndash(-) and
>>>its
>>>
>>>
>>
>>U+2013 is the ndash (–). It is represented in UTF-8 by three
>>hex bytes: E2 80 93.
>>
>>But, \uFFE2 is fullwidth not sign
>>\uFF80 is half width katakana letter ta
>>and \uff93 is half width katakana letter mo.
>>
>>Perhaps the reason you see three question marks is that the font
>>you are using doesn't support full width and half width characters.
>>
>>What happens if you replace your string \uFFE2\uFF80\uFF93 with
>>\u2013 ?
>>
>>Best regards,
>>
>>James Kass
>>.
>>
>>
>>
>
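For reference, steps 2 and 3 from the quoted March 11 message can be
sketched in Java as follows. This is only an illustration: the class and
method names are made up, and the decoder handles just one three-byte
UTF-8 sequence, with no validation of the lead or continuation bytes.

```java
public class EnDashRoundTrip {

    // Step 2: keep only the low 8 bits of each UTF-16 code unit.
    static int[] dropHighBits(String s) {
        int[] bytes = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            bytes[i] = s.charAt(i) & 0xFF;
        }
        return bytes;
    }

    // Step 3: decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
    static int decodeThreeByteUtf8(int b0, int b1, int b2) {
        return ((b0 & 0x0F) << 12)  // right-most 4 bits of the lead byte
             | ((b1 & 0x3F) << 6)   // right-most 6 bits of the second byte
             | (b2 & 0x3F);         // right-most 6 bits of the third byte
    }

    public static void main(String[] args) {
        // The three characters that ended up in the property file.
        String corrupted = "\uFFE2\uFF80\uFF93";
        int[] b = dropHighBits(corrupted); // E2 80 93
        int codePoint = decodeThreeByteUtf8(b[0], b[1], b[2]);
        System.out.println(Integer.toHexString(codePoint)); // prints 2013
    }
}
```

This is why the corrupted string can still show up as an EN DASH in HTML:
the two bad steps happen to undo the earlier damage.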