Re: Encoding issue, clues needed

From: George W Gerrity (g.gerrity@gwg-associates.com.au)
Date: Sun Dec 23 2007 - 19:40:13 CST

    On 2007-12-24, at 08:15, Jeroen Ruigrok van der Werven wrote:

    > On my FreeBSD system I am trying to track down an encoding issue with
    > ncurses and Python. After having beaten my head against it for the
    > entire day, I figured someone on this list would have a clue.
    >
    > Characters from the Basic Latin block are OK, but any multibyte
    > character seems to get mangled in one way or another.
    >
    > For example, a character such as 的 (U+7684) gets changed to U+fffd
    > ~Z ~D. In general, most characters get transformed to a
    > U+fffd + ~.. + ~.. sequence, where .. is a Basic Latin printable
    > character (apparently between U+0040 and U+007e).
    >
    > I am not seeing what is probably a very simple mangling. Even writing
    > everything out as bit and hex sequences did not show much of a system,
    > aside from the last digit being preserved, e.g. U+7684 still has a 4
    > at the end since D is U+0044.
    >
    > Python uses UTF-16 (UCS-2) internally and my locale, to which
    > everything is decoded, uses UTF-8.
    >
    > So to take the example, U+7684 would be e7.9a.84 in UTF-8 and the
    > sequence I got, aside from U+fffd, is 7e.5a.7e.44.
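
A quick way to double-check the hand-computed byte values is to let Python do the encoding. The sketch below assumes Python 3; the helper name `dotted_hex` and the variable names are illustrative only and not taken from the original post. It prints the UTF-8 and UTF-16 forms of U+7684 next to the observed sequence and checks the "last digit preserved" observation against the UTF-8 trailing bytes.

```python
# Sketch: compare the expected encodings of U+7684 with the reported output.
ch = "\u7684"  # 的

utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")


def dotted_hex(data: bytes) -> str:
    """Format bytes in the dotted-hex style used in this thread."""
    return ".".join(f"{b:02x}" for b in data)


# Bytes reported after the U+fffd in the mangled output.
observed = bytes([0x7E, 0x5A, 0x7E, 0x44])

print("UTF-8   :", dotted_hex(utf8))
print("UTF-16BE:", dotted_hex(utf16_be))
print("observed:", dotted_hex(observed), "=", observed.decode("ascii"))

# Compare the low four bits of each UTF-8 trailing byte with the
# letter that follows each tilde in the observed sequence.
trailing = utf8[1:]
letters = observed[1::2]
low_nibbles_match = [(t & 0x0F) == (l & 0x0F) for t, l in zip(trailing, letters)]
print("low nibbles match:", low_nibbles_match)
```

If the low nibbles line up, the letters after each tilde are probably derived from the individual UTF-8 bytes rather than from the code point as a whole.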

    The e7 in the correct encoding and the 7e in the incorrect one
    suggest to me a byte-order and/or bit-order problem. Are all the
    Little-Endian/Big-Endian flags correct in the entire system build?
    Are definitions such as Unsigned Integer and the sizes for all
    integers defined correctly in all source code for the entire build?
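
    The byte-order question can be probed from Python itself. This is only a sketch, assuming Python 3; `swap_nibbles` and `reverse_bits` are hypothetical helpers written for this illustration. It reports the native byte order, shows that only the UTF-16 form depends on it, and tests which single-byte transformation, if any, turns e7 into 7e.

```python
import sys

ch = "\u7684"  # 的

# Byte order of the machine running this script: "little" or "big".
print("native byte order:", sys.byteorder)

# UTF-16 depends on byte order; UTF-8 is a plain byte sequence and does not.
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
utf8 = ch.encode("utf-8")
print(utf16_be.hex(), utf16_le.hex(), utf8.hex())


def swap_nibbles(byte: int) -> int:
    """Exchange the high and low four bits of one byte."""
    return ((byte & 0x0F) << 4) | (byte >> 4)


def reverse_bits(byte: int) -> int:
    """Reverse the bit order within one byte."""
    return int(f"{byte:08b}"[::-1], 2)


print(hex(swap_nibbles(0xE7)), hex(reverse_bits(0xE7)))
```

    A nibble swap maps e7 to 7e, while reversing the bits leaves e7 unchanged. Since 0x7e is also the ASCII tilde seen in the mangled output, the resemblance between e7 and 7e may be a coincidence and not evidence of an endianness fault.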

    George
    ------
    Dr George W Gerrity      Ph:   +61 2 6386 3431
    GWG Associates           Fax:  +61 2 6386 4431
    P O Box 229              Time: +10 hours (ref GMT)
    Harden, NSW 2587         PGP RSA Public Key Fingerprint:
    AUSTRALIA                73EF 318A DFF5 EB8A 6810 49AC 0763 AF07

    > To give two more examples for completeness' sake:
    >
    > 居 - U+5c45 - e5.b1.85 - U+fffd ~E (7e.45)
    > 把 - U+628a - e6.8a.8a - U+fffd ~J~J (7e.4a.7e.4a)
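
One pattern that appears to fit all three reported outputs can be tested in a few lines. The sketch below assumes Python 3. It rests on a guess, not on confirmed ncurses behaviour: each UTF-8 byte in the range 0x80 to 0x9f is shown as a tilde followed by the character 0x40 below it. The function name `tilde_notation` is hypothetical, and bytes outside that range are simply skipped, so the U+fffd part of the output is not modelled.

```python
def tilde_notation(data: bytes) -> str:
    """Hypothetical rendering: bytes 0x80-0x9f become '~' plus chr(byte - 0x40).

    Bytes outside that range are ignored in this sketch.
    """
    out = []
    for b in data:
        if 0x80 <= b <= 0x9F:
            out.append("~" + chr(b - 0x40))
    return "".join(out)


# Character -> tilde part of the output reported in this thread.
examples = {
    "\u7684": "~Z~D",  # 的
    "\u5c45": "~E",    # 居
    "\u628a": "~J~J",  # 把
}

for ch, reported in examples.items():
    encoded = ch.encode("utf-8")
    hex_bytes = ".".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04x}", hex_bytes, tilde_notation(encoded), reported)
```

If the computed and reported columns agree, the text is likely reaching the display layer as individual bytes instead of as decoded characters, which would point towards locale or wide-character setup and away from arithmetic on the code points.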
    >
    > Is this some sort of signed/unsigned issue?
    >
    > Mmm, of course, ideas strike when you are about to send this...
    >
    > í - U+00ed gives me two U+fffd; in UTF-8 it would be c3.ad, both of
    > which are above 7e. There's some cut-off happening, but I am not
    > seeing in which direction I need to continue seeking.
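
To make the signed/unsigned and cut-off ideas concrete, the bytes of í can be viewed three ways. This is a minimal sketch assuming Python 3; it only shows what each interpretation would produce and does not claim that any of them is what ncurses or Python actually does here.

```python
import struct

encoded = "\u00ed".encode("utf-8")  # í -> c3.ad

for b in encoded:
    # Reinterpret the same eight bits as a signed 8-bit value.
    (signed,) = struct.unpack("b", bytes([b]))
    # Keep only the low seven bits, as a 7-bit-clean channel would.
    masked = b & 0x7F
    print(f"{b:02x}: unsigned={b} signed={signed} 7-bit={masked:02x} ({chr(masked)!r})")
```

Both bytes come out negative when treated as signed, and both map into printable ASCII when the high bit is stripped. Neither result is U+fffd, so a plain sign or 7-bit truncation alone would not seem to explain the two replacement characters.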
    >
    > Ideas are very much welcome!
    >
    > --
    > Jeroen Ruigrok van der Werven <asmodai(-at-)in-nomine.org> / asmodai
    > イェルーン ラウフロック ヴァン デル ウェルヴェン
    > http://www.in-nomine.org/ | http://www.rangaku.org/
    > In every stone sleeps a crystal...
    >


