Just in case it helps, Ruby (since version 1.9) also uses option 3).
Regards, Martin.
On 2012/11/17 6:48, Buck Golemon wrote:
> When decoding bytes to unicode using the "latin1" scheme, there are three
> options for bytes not defined in the ISO-8859-1 standard.
>
> 1) Throw an error.
> 2) Insert the replacement character (U+FFFD), indicating an unknown character.
> 3) Insert the Unicode character whose code point equals the byte value. This
> means that completely random bytes will always decode successfully.
>
> The Python language currently implements option three. Is this correct?
> There is an option to produce errors or replacements for encodings which
> have undefined characters, but as implemented, latin1 currently defines
> characters for all 256 bytes, so the option does nothing.
>
> Restated, are the first 256 characters of unicode intended to be exactly
> compatible with a latin1 codec?
> This would imply that unicode has inserted character definitions into the
> ISO-8859-1 standard.
>
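The behaviour described in the quoted message can be checked with a short script. This is a minimal sketch, assuming Python 3 and its built-in "latin1" and "ascii" codecs. The variable names are illustrative only and do not come from the original thread.

```python
# Minimal sketch (assumes Python 3): how the built-in "latin1" codec
# treats each of the 256 possible byte values.
all_bytes = bytes(range(256))

# Even with errors="strict", no byte is rejected, because the codec
# defines a character for every byte value.
decoded = all_bytes.decode("latin1", errors="strict")

# Option 3: each byte becomes the code point with the same numeric value.
identity = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The mapping is reversible, so arbitrary bytes survive a round trip.
round_trip = decoded.encode("latin1") == all_bytes

# For contrast, "ascii" leaves 0x80-0xFF undefined, so the error handler
# actually matters there (options 1 and 2 from the list above).
replaced = b"\x80".decode("ascii", errors="replace")
try:
    b"\x80".decode("ascii", errors="strict")
    ascii_strict_raises = False
except UnicodeDecodeError:
    ascii_strict_raises = True

print(identity, round_trip, ascii_strict_raises)
```

If the identity check holds, the error handler passed to the "latin1" decoder has no observable effect, which matches the observation that the option "does nothing" for this codec.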
Received on Sat Nov 17 2012 - 06:45:35 CST