Re: Problems encoding the spanish o

From: Pim Blokland (pblokland@planet.nl)
Date: Mon Nov 17 2003 - 07:26:19 EST


    pepe pepe wrote:

    > We have the following sequence of characters "...ización Map.." that is
    > the same as "...ización Map..." and that, after undergoing some
    > transformations, becomes "...izaci&#56186;&#56333;ap...."
    > As you can see, the two characters 56186 and 56333 seem to represent
    > the sequence "ón M". Any idea?

    Yes, your input text is evidently being flagged as UTF-8 even though
    it is Latin-1 (or some other code page that has ó at index 243,
    hex F3).
    On top of that, the process that mistakes it for UTF-8 makes two
    further mistakes: it does not report an error when it encounters
    malformed byte sequences, and it outputs the result as two 16-bit
    numbers (a surrogate pair) instead of one 21-bit number.
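
    To make the first point concrete, here is a minimal Python sketch. It
    assumes the input really was Latin-1, which is my guess from the
    symptoms and not something confirmed; the variable names are only
    illustrative. It shows the bytes that "ón M" produces and that a
    strict UTF-8 decoder refuses them.

```python
# Assumption: the original text was encoded as Latin-1 (ISO 8859-1).
data = "ón M".encode("latin-1")

# Space-separated upper-case hex of the four bytes (needs Python 3.8+).
hex_bytes = data.hex(" ").upper()
print(hex_bytes)  # F3 6E 20 4D

# A conforming UTF-8 decoder must reject this: F3 announces a 4-byte
# sequence, but 6E, 20 and 4D are not continuation bytes (10xxxxxx).
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)  # False
```

    A process that behaved like the strict decoder here would have
    stopped with an error instead of silently producing garbage.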

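    If you take the byte sequence (hex) F3 6E 20 4D and treat it as
    UTF-8 without caring that it is not valid (F3 announces a four-byte
    sequence, but the next three bytes are not continuation bytes), then
    taking the low 3 bits of F3 and the low 6 bits of each following
    byte gives the value (hex) EE80D. Again not caring that no character
    is assigned at that code point, turning it into UTF-16 yields the
    surrogate pair DB7A DC0D, decimal 56186 and 56333, which is what
    you got in your output.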
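
    The arithmetic can be checked with a short Python sketch. The two
    helper functions are hypothetical names of my own; they only imitate
    what a non-validating converter presumably does and are not taken
    from any particular library.

```python
def lenient_utf8_decode4(b):
    """Combine four bytes as if they formed one 4-byte UTF-8 sequence.

    No validation at all: the low 3 bits of the lead byte and the low
    6 bits of each following byte are simply concatenated.
    """
    lead, t1, t2, t3 = b
    return ((lead & 0x07) << 18) | ((t1 & 0x3F) << 12) | ((t2 & 0x3F) << 6) | (t3 & 0x3F)


def to_surrogates(cp):
    """Split a supplementary-plane value (>= 0x10000) into a UTF-16 surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


value = lenient_utf8_decode4(bytes([0xF3, 0x6E, 0x20, 0x4D]))
high, low = to_surrogates(value)

print(hex(value))
print(high, low)  # 56186 56333
```

    The two decimal numbers are exactly the ones from the garbled
    output, which supports the explanation that the text was decoded
    leniently as UTF-8 and each surrogate was then written out as its
    own numeric reference.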

    Pim Blokland


