Re: Unicode character transformation through XSLT

From: Yung-Fong Tang ([email protected])
Date: Tue Mar 11 2003 - 19:08:49 EST

Next message: Michael (michka) Kaplan: "Re: sorting order between win98/xp"

Previous message: Yung-Fong Tang: "sorting order between win98/xp"
In reply to: Jain, Pankaj (MED, TCS): "RE: Unicode character transformation through XSLT"
Next in thread: Jain, Pankaj (MED, TCS): "RE: Unicode character transformation through XSLT"

This happens because the following steps were applied to your Unicode data:

1. The \u escapes are converted to Unicode:

\uFFE2\uFF80\uFF93

becomes
three Unicode characters:

U+FFE2, U+FF80, U+FF93

This step is OK.
2. Some code then throws away the high 8 bits of each character, so
the data becomes 3 bytes:
E2 80 93

3. Some other code then treats those bytes as UTF-8 and tries to convert them to UCS-2 again:

E2 = 1110 0010, and the rightmost 4 bits (0010) are used for UCS-2
80 = 1000 0000, and the rightmost 6 bits (00 0000) are used for UCS-2
93 = 1001 0011, and the rightmost 6 bits (01 0011) are used for UCS-2

[0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 0x2013
U+2013 is EN DASH.
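
The three steps above can be reproduced in a few lines. This is only a Python sketch for illustration; the code in question is Java and is not shown in this thread, so the function names here (unescape, drop_high_bits, decode_three_byte_utf8) are made up and do not correspond to anything in it.

```python
import re

def unescape(text):
    # Step 1: replace each \uXXXX escape with the code point it names.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

def drop_high_bits(chars):
    # Step 2 (the bug): keep only the low 8 bits of each 16-bit unit.
    return bytes(ord(c) & 0xFF for c in chars)

def decode_three_byte_utf8(seq):
    # Step 3: decode one 3-byte UTF-8 sequence by hand.
    # Lead byte contributes its low 4 bits, each trail byte its low 6 bits.
    lead, mid, last = seq
    return ((lead & 0x0F) << 12) | ((mid & 0x3F) << 6) | (last & 0x3F)

escaped = r"\uFFE2\uFF80\uFF93"
chars = unescape(escaped)
raw = drop_high_bits(chars)
code_point = decode_three_byte_utf8(raw)

print(" ".join(f"U+{ord(c):04X}" for c in chars))  # U+FFE2 U+FF80 U+FF93
print(" ".join(f"{b:02X}" for b in raw))           # E2 80 93
print(f"U+{code_point:04X}")                       # U+2013
```

The hand decoding agrees with a standard UTF-8 decoder: raw.decode("utf-8") also yields U+2013.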

So somewhere in your code there is something very bad that corrupts
your data.
Steps 2 and 3 are both bugs. You probably need to find out where they
happen and remove that code.
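
To see why step 2 is harmful even when the input is correct, here is a small Python sketch (again only an illustration, with made-up names, not the Java code from this thread). It assumes the property value is written as \u2013, the escape for EN DASH. Encoding through a real UTF-8 encoder gives E2 80 93 and round-trips cleanly, while throwing away the high 8 bits leaves a single byte 13, which no longer identifies the character.

```python
import re

def unescape(text):
    # Replace each \uXXXX escape with the code point it names.
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

dash = unescape(r"\u2013")          # one character, U+2013 EN DASH

# Correct path: let a UTF-8 encoder produce the bytes, then decode them.
wire = dash.encode("utf-8")
back = wire.decode("utf-8")

# Buggy path (step 2): keep only the low 8 bits of the character.
truncated = bytes(ord(c) & 0xFF for c in dash)

print(" ".join(f"{b:02X}" for b in wire))       # E2 80 93
print(back == dash)                             # True
print(" ".join(f"{b:02X}" for b in truncated))  # 13
```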

Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
Your Java code probably has one or two of the bugs listed in it.

Jain, Pankaj (MED, TCS) wrote:

>James,
>thanks, it's working for me now.
>But I still wonder why \uFFE2\uFF80\uFF93 gives an ndash in
>HTML.
>If you have any information on this, then please let me know.
>
>Thanks
>-Pankaj
>
>-----Original Message-----
>From: [email protected] [mailto:[email protected]]
>Sent: Monday, March 10, 2003 7:59 PM
>To: Jain, Pankaj (MED, TCS)
>Cc: '[email protected]'
>Subject: Re: Unicode character transformation through XSLT
>
>
>.
>Pankaj Jain wrote,
>
>
>
>>My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
>>from resource bundle property file which is equivalent to ndash(-) and
>>its
>>
>>
>
>U+2013 is the ndash (–). It is represented in UTF-8 by three
>hex bytes: E2 80 93.
>
>But \uFFE2 is the fullwidth not sign,
>\uFF80 is the halfwidth katakana letter ta,
>and \uFF93 is the halfwidth katakana letter mo.
>
>Perhaps the reason you see three question marks is that the font
>you are using doesn't support full width and half width characters.
>
>What happens if you replace your string \uFFE2\uFF80\uFF93 with
>\u2013 ?
>
>Best regards,
>
>James Kass
>.
>
>
>


This archive was generated by hypermail 2.1.5 : Tue Mar 11 2003 - 19:46:09 EST