From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Mar 12 2003 - 17:38:42 EST
Generally, try instantiating an InputStreamReader (or a similar Reader) on your input, with an
explicit encoding="UTF8". That will convert the UTF-8 bytes to the 16-bit Unicode (UTF-16) that
Java uses internally.
Always use XYZReader classes for text input and XYZWriter classes for text output.
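
As a minimal sketch of that advice (the class and method names here are made up for illustration,
and a ByteArrayInputStream stands in for whatever stream your database hands you), decoding
through an InputStreamReader could look like this:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical helper class, not part of the original code.
public class Utf8ReadSketch {
    /** Decodes all bytes from the stream as UTF-8 and returns the text. */
    public static String readUtf8(InputStream in) throws IOException {
        // "UTF-8" is the canonical name; Java also accepts the historical alias "UTF8".
        Reader reader = new InputStreamReader(in, "UTF-8");
        StringBuffer buf = new StringBuffer();
        char[] chunk = new char[1024];
        int n;
        // read() returns decoded 16-bit chars, not raw bytes.
        while ((n = reader.read(chunk)) != -1) {
            buf.append(chunk, 0, n);
        }
        reader.close();
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+00E9 encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] utf8 = { 'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };
        String text = readUtf8(new ByteArrayInputStream(utf8));
        System.out.println(text.length());        // 4
        System.out.println((int) text.charAt(3)); // 233
    }
}
```

Five input bytes become four chars because the Reader combines the two-byte UTF-8 sequence into
one UTF-16 code unit.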
java.sun.com has tutorials on Internationalization etc. that I recommend.
See also http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/
Your code takes UTF-8 byte values, mis-casts them first to signed bytes and then to unsigned
16-bit values, and re-interprets these mistreated byte values as if they were UTF-16 code units.
Let's take this line by line to see what happens:
Jain, Pankaj (MED, TCS) wrote:
> Here is my code..
>
> while(rsResult.next())
> {
> /*Get the file contents from the value column*/
> ipStream = rsResult.getBinaryStream("VALUE");
This is where the problem starts. You read the input as raw bytes and never decode it as UTF-8 text.
> strBuf = new StringBuffer();
> while((chunk = ipStream.read())!=-1)
> {
> byte byChunk = new Integer(chunk).byteValue();
Now you get one byte at a time. In Java, byte is a signed type, so 0x80..0xff are actually negative
values: 0x80=-128 .. 0xff=-1.
> strBuf.append((char) byChunk);
This sign-extends the signed byte value and then casts it to an unsigned 16-bit unit (Java char is
16 bits wide). For example, 0x80 is negative (-128); sign-extended and cast to unsigned, it becomes
0xff80. You append this mistreated value to a StringBuffer, which treats it as a UTF-16 code unit.
> }
> prop.setProperty(rsResult.getString("KEY"), strBuf.toString());
> }
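
To make the effect of the two casts concrete, here is a small sketch (hypothetical class and
method names, not taken from the code above) that reproduces the conversion for one byte at a
time and compares it with decoding the same bytes as UTF-8:

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, for illustration only.
public class SignExtensionSketch {
    /** Reproduces the quoted loop's conversion for a single value from read(). */
    public static char miscast(int chunk) {
        byte b = (byte) chunk; // 0x80..0xff become -128..-1
        return (char) b;       // sign-extended, so 0x80 becomes 0xff80
    }

    /** Decodes the same bytes as UTF-8 text instead. */
    public static String decode(byte[] utf8) throws UnsupportedEncodingException {
        return new String(utf8, "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 }; // UTF-8 for U+00E9
        for (int i = 0; i < utf8.length; i++) {
            // Mask to 0..255 first, as InputStream.read() would return it.
            System.out.println(Integer.toHexString(miscast(utf8[i] & 0xff)));
        }
        // The loop prints ffc3 and then ffa9: two bogus code units.
        System.out.println(Integer.toHexString(decode(utf8).charAt(0))); // e9
    }
}
```

Note that masking with 0xff instead of casting would only avoid the sign extension. Each byte
would still be treated as a character of its own, so the UTF-8 sequences would remain undecoded.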
markus