Re: cp1252 decoder implementation from Buck Golemon on 2012-11-16 (Unicode Mail List Archive)

From: Buck Golemon <buck_at_yelp.com>
Date: Fri, 16 Nov 2012 15:56:18 -0800

I find that the unicode.org cp1252 mapping file leaves those bytes undefined
as well, so the issue stems from there.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
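
As a quick check of which bytes are affected, here is a minimal Python 3 sketch that asks the built-in cp1252 codec to decode every byte value and collects the ones it rejects. The helper name `undefined_cp1252_bytes` is made up for this illustration.

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's built-in cp1252 codec cannot decode."""
    undefined = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # The codec's decoding table has no entry for this byte.
            undefined.append(value)
    return undefined


print([hex(b) for b in undefined_cp1252_bytes()])
# ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```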

Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to
map it to the equally non-semantic U+0081?

This would allow systems that follow the HTML5 standard and use cp1252 in
place of latin1 to remain binary-faithful and reversible.
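
One way this mapping could be tried out in Python 3 without touching the codec itself is a custom error handler. The sketch below is only an illustration under that assumption, not how the shipped cp1252 codec behaves; the handler name `cp1252_c1_passthrough` and the constant `UNDEFINED_CP1252` are hypothetical. On decode it maps each undefined byte to the code point with the same number, and on encode it maps those five code points back to their bytes.

```python
import codecs

# The five byte values cp1252 leaves undefined (assumed fixed for this sketch).
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def cp1252_c1_passthrough(error):
    """Map undefined cp1252 bytes to same-numbered code points, and back."""
    bad = error.object[error.start:error.end]
    if isinstance(error, UnicodeDecodeError):
        # bad is a bytes slice; iterating yields ints.
        if all(b in UNDEFINED_CP1252 for b in bad):
            return "".join(chr(b) for b in bad), error.end
    elif isinstance(error, UnicodeEncodeError):
        # bad is a str slice; it may cover a run of unencodable characters.
        if all(ord(ch) in UNDEFINED_CP1252 for ch in bad):
            return bytes(ord(ch) for ch in bad), error.end
    # Anything else is a genuine error and is re-raised unchanged.
    raise error


codecs.register_error("cp1252_c1_passthrough", cp1252_c1_passthrough)

# Every byte value should survive a decode/encode round trip.
data = bytes(range(256))
text = data.decode("cp1252", "cp1252_c1_passthrough")
assert text.encode("cp1252", "cp1252_c1_passthrough") == data
```

Characters that cp1252 cannot represent at all (for example U+4E2D) still raise UnicodeEncodeError with this handler, so it only widens the codec by the five pass-through values.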

On Fri, Nov 16, 2012 at 3:38 PM, Buck Golemon <buck_at_yelp.com> wrote:

> cp1252 (aka windows-1252) defines 27 characters which ISO-8859-1 does not.
> This leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) with undefined
> semantics.
>
> Currently the Python cp1252 decoder allows us to ignore/replace/error on
> these bytes, but there's no facility for allowing these unknown bytes to
> round-trip through the codec, as the latin1 codec does.
>
> I'd like to get this "fixed" but I will have a very hard time convincing
> anyone that it's wrong.
>
Received on Fri Nov 16 2012 - 17:57:57 CST
