Re: latin1 decoder implementation from Martin J. Dürst on 2012-11-17 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Sat, 17 Nov 2012 21:42:00 +0900

Just in case it helps, Ruby (since version 1.9) also uses option 3).

Regards, Martin.

On 2012/11/17 6:48, Buck Golemon wrote:
> When decoding bytes to Unicode using the "latin1" scheme, there are three
> options for bytes not defined in the ISO-8859-1 standard.
>
> 1) Throw an error.
> 2) Insert the replacement character (U+FFFD), indicating an unknown character.
> 3) Insert the Unicode character with the same numeric value. This means that
> even completely random bytes will always decode successfully.
>
> The Python language currently implements option three. Is this correct?
> There is an option to produce errors or replacements for encodings which
> have undefined characters, but as implemented, latin1 currently defines
> characters for all 256 bytes, so the option has no effect.
>
> Restated, are the first 256 characters of Unicode intended to be exactly
> compatible with a latin1 codec?
> This would imply that Unicode has inserted character definitions into the
> ISO-8859-1 standard.
>
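
For reference, the behaviour described in the quoted message can be checked with a short sketch. It assumes Python 3 and its built-in "latin-1" codec. The wrapper name decode_latin1 and the variable names are made up for illustration, and the ASCII lines are there only to show what options 1) and 2) look like for a codec that does leave some bytes undefined.

```python
def decode_latin1(data: bytes, errors: str = "strict") -> str:
    """Hypothetical wrapper around Python's built-in latin-1 codec."""
    return data.decode("latin-1", errors=errors)


all_bytes = bytes(range(256))
decoded = decode_latin1(all_bytes)

# Option 3: each byte becomes the code point with the same numeric value,
# including the 0x80-0x9F range.
assert [ord(ch) for ch in decoded] == list(range(256))

# No byte is undefined for this codec, so the error handler never runs
# and every setting gives the same result.
assert decode_latin1(all_bytes, "replace") == decoded
assert decode_latin1(all_bytes, "ignore") == decoded

# Contrast: the ASCII codec leaves bytes 0x80-0xFF undefined.
# Option 2 (replacement character):
assert b"\x80".decode("ascii", errors="replace") == "\ufffd"

# Option 1 (error), which is the default "strict" handler:
try:
    b"\x80".decode("ascii")
except UnicodeDecodeError:
    strict_ascii_failed = True
else:
    strict_ascii_failed = False
assert strict_ascii_failed
```

Because the mapping is one-to-one, decoded.encode("latin-1") returns the original bytes, so arbitrary binary data survives a round trip through this codec.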
Received on Sat Nov 17 2012 - 06:45:35 CST
