From: Pim Blokland (pblokland@planet.nl)
Date: Mon Nov 17 2003 - 07:26:19 EST
pepe pepe wrote:
> We have the following sequence of characters "...ización Map.." that is
> the same than "...ización Map..." that after suffering some
> transformations becomes to "...izaci&56186;&56333;ap...."
> AS you can see the two characters 56186 and 56333 seem to represent this
> sequences "ón M". Any idea?.
Yes, your input text obviously gets flagged as being in UTF-8
format, even though it is Latin-1 (or some other code page that has
an ó at index 243).
Not only that, but the process that mistakes it for UTF-8 also fails
to report an error when it encounters malformed byte sequences, AND
it outputs the result as two 16-bit numbers instead of one 21-bit
number.
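For comparison, here is roughly what a strict decoder does with those
bytes. This is only a sketch in Python (chosen for illustration; nothing
is known about the software in the original poster's pipeline), using
the built-in codecs and nothing else:

```python
# The four bytes that the Latin-1 text "ón M" occupies.
raw = bytes([0xF3, 0x6E, 0x20, 0x4D])

# Decoded with the encoding it was actually written in.
as_latin1 = raw.decode("latin-1")

# A conforming UTF-8 decoder rejects the sequence: 0xF3 announces a
# four-byte character, but 0x6E is not a continuation byte (10xxxxxx).
try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print(as_latin1, strict_utf8_ok)
```

A process that behaved like the strict branch would have stopped with
an error instead of silently producing garbage.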
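If you take the byte sequence (hex) F3 6E 20 4D, treat it as UTF-8
and don't care that it's not valid, F3 looks like the lead byte of a
four-byte sequence, so the decoder keeps the low 3 bits of F3 and the
low 6 bits of each of the next three bytes. That maps to the value
(hex) EE80D. Again, not caring that this is not an assigned code
point, turning it into UTF-16 yields the surrogate pair DB7A DC0D
(decimal 56186 and 56333), which is what you got in your output.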
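The arithmetic can be checked with a few lines of Python. This is a
sketch of the presumed faulty behaviour, not the actual code of
whatever process mangled the text; the function names lenient_utf8_4
and to_surrogates are made up for this illustration:

```python
def lenient_utf8_4(b0, b1, b2, b3):
    """Combine four bytes as a 4-byte UTF-8 sequence WITHOUT checking
    that b1..b3 are real continuation bytes (the presumed bug)."""
    return (((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12)
            | ((b2 & 0x3F) << 6) | (b3 & 0x3F))


def to_surrogates(cp):
    """Split a value above 0xFFFF into two UTF-16 code units."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


value = lenient_utf8_4(0xF3, 0x6E, 0x20, 0x4D)  # bytes of Latin-1 "ón M"
high, low = to_surrogates(value)

print(hex(value), hex(high), hex(low), high, low)
# prints: 0xee80d 0xdb7a 0xdc0d 56186 56333
```

The two decimal numbers at the end are exactly the ones in the mangled
output.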
Pim Blokland