Re: Unicode character transformation through XSLT

From: Yung-Fong Tang (ftang@netscape.com)
Date: Tue Mar 11 2003 - 19:08:49 EST


    Because the following steps were applied to your Unicode data:

    1. The \u escapes are converted to Unicode:

    \uFFE2\uFF80\uFF93

    becomes three Unicode characters:

    U+FFE2, U+FF80, U+FF93

    This is OK.
    2. A "throw away the high 8 bits" step was applied to your data, so
    it became 3 bytes:
    E2 80 93

    3. Some code then treats those bytes as UTF-8 and converts them to UCS-2 again, so

    E2 = 1110 0010, and the rightmost 4 bits (0010) are used for UCS-2
    80 = 1000 0000, and the rightmost 6 bits (00 0000) are used for UCS-2
    93 = 1001 0011, and the rightmost 6 bits (01 0011) are used for UCS-2

    [0010] [00 0000] [01 0011] = 0010 0000 0001 0011 = 2013
    U+2013 is EN DASH
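
The three steps above can be reproduced in a few lines. This is only a sketch in Python, not the Java code from the original report; the helper name decode_three_byte_utf8 is made up for illustration, and the masking in step 2 is an assumption about what the unknown buggy code does.

```python
# Step 1: the \u escapes decode to three Unicode characters.
escaped = r"\uFFE2\uFF80\uFF93"
chars = escaped.encode("ascii").decode("unicode_escape")
code_points = [ord(c) for c in chars]

# Step 2 (assumed bug): keep only the low 8 bits of each character.
low_bytes = bytes(cp & 0xFF for cp in code_points)


def decode_three_byte_utf8(data):
    """Hypothetical helper: decode one 3-byte UTF-8 sequence by hand."""
    lead, cont1, cont2 = data
    if lead >> 4 != 0b1110 or cont1 >> 6 != 0b10 or cont2 >> 6 != 0b10:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((lead & 0x0F) << 12) | ((cont1 & 0x3F) << 6) | (cont2 & 0x3F)


# Step 3: the truncated bytes happen to be valid UTF-8 for U+2013.
result = decode_three_byte_utf8(low_bytes)

print([f"U+{cp:04X}" for cp in code_points])
print(low_bytes.hex(" ").upper())
print(f"U+{result:04X}")
```

The hand-written decoder agrees with the built-in codec: low_bytes.decode("utf-8") also yields U+2013.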

    So somewhere in your code there is something very bad that corrupts
    your data.
    Steps 2 and 3 are the problem. You probably need to find out where they
    happen and remove that code.
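
One way to see why steps 2 and 3 have to go is a round-trip check. The sketch below, again in Python rather than Java, compares a pipeline that masks characters down to bytes with one that uses an explicit UTF-8 encode and decode. Both function names are hypothetical, and the lossy pipeline is only a guess at what the real code does.

```python
original = "\uFFE2\uFF80\uFF93"


def lossy_pipeline(text):
    """Assumed buggy path: drop the high 8 bits, then decode as UTF-8."""
    truncated = bytes(ord(ch) & 0xFF for ch in text)
    return truncated.decode("utf-8", errors="replace")


def safe_pipeline(text):
    """Encode and decode with the same explicit charset; nothing is masked."""
    return text.encode("utf-8").decode("utf-8")


lossy_result = lossy_pipeline(original)
safe_result = safe_pipeline(original)

print(lossy_result == original)  # False: the text has turned into U+2013
print(safe_result == original)   # True: the three characters survive
```

A test like this, run on strings containing characters above U+00FF, can help locate the stage where the high bits are being dropped.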

    Read my paper at http://people.netscape.com/ftang/paper/textintegrity.html
    Your Java code probably has one or two of the bugs listed in my paper.

    Jain, Pankaj (MED, TCS) wrote:

    >James,
    >thanks, it's working for me now.
    >But I still have a doubt about why \uFFE2\uFF80\uFF93 gives an ndash in
    >HTML.
    >If you have any information on this, then please let me know.
    >
    >Thanks
    >-Pankaj
    >
    >-----Original Message-----
    >From: jameskass@att.net [mailto:jameskass@att.net]
    >Sent: Monday, March 10, 2003 7:59 PM
    >To: Jain, Pankaj (MED, TCS)
    >Cc: 'unicode@unicode.org'
    >Subject: Re: Unicode character transformation through XSLT
    >
    >
    >.
    >Pankaj Jain wrote,
    >
    >
    >
    >>My problem is that, I am getting Unicode character(\uFFE2\uFF80\uFF93)
    >>from resource bundle property file which is equivalent to ndash(-) and
    >>its
    >>
    >>
    >
    >U+2013 is the ndash (–). It is represented in UTF-8 by three
    >hex bytes: E2 80 93.
    >
    >But \uFFE2 is the fullwidth not sign,
    >\uFF80 is the halfwidth katakana letter ta,
    >and \uFF93 is the halfwidth katakana letter mo.
    >
    >Perhaps the reason you see three question marks is that the font
    >you are using doesn't support full width and half width characters.
    >
    >What happens if you replace your string \uFFE2\uFF80\uFF93 with
    >\u2013 ?
    >
    >Best regards,
    >
    >James Kass
    >.
    >
    >
    >


