I find that the unicode.org cp1252 mapping file also leaves those bytes
undefined, so the issue stems from there:
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and to
map it to the equally non-semantic U+0081?
This would allow systems that follow the html5 standard and use cp1252 in
place of latin1 to continue to be binary-faithful and reversible.
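
To make the proposal concrete, here is a rough sketch of the mapping I have in mind, written as a codec error handler. It assumes Python 3, and the handler name "cp1252_passthrough" is made up for illustration; it is not part of the standard library, and this is only one way the behaviour could be expressed.

```python
import codecs

# The five bytes that CP1252.TXT leaves undefined.
UNDEFINED = frozenset((0x81, 0x8D, 0x8F, 0x90, 0x9D))

def cp1252_passthrough(exc):
    """Hypothetical handler: map each undefined byte to the code point
    with the same value (0x81 <-> U+0081) in both directions."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Iterating over bytes yields ints, so chr() gives U+0081 etc.
        return "".join(chr(b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        ch = exc.object[exc.start]
        if ord(ch) in UNDEFINED:
            # Handle one character at a time and resume right after it.
            return bytes([ord(ch)]), exc.start + 1
    # Anything else is a genuine error; re-raise it unchanged.
    raise exc

codecs.register_error("cp1252_passthrough", cp1252_passthrough)

# Every possible byte value survives a decode/encode round trip.
data = bytes(range(256))
text = data.decode("cp1252", "cp1252_passthrough")
roundtrip = text.encode("cp1252", "cp1252_passthrough")
```

With this handler, the 27 cp1252-specific bytes still decode to their usual characters (0x80 becomes U+20AC), and only the five undefined bytes fall through to the same-valued code point.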
On Fri, Nov 16, 2012 at 3:38 PM, Buck Golemon <buck_at_yelp.com> wrote:
> cp1252 (aka windows-1252) defines 27 characters which iso-8859-1 does not.
> This leaves five bytes with undefined semantics.
>
> Currently the python cp1252 decoder allows us to ignore/replace/error on
> these bytes, but there's no facility for allowing these unknown bytes to
> round-trip through the codec, as the latin1 codec does.
>
> I'd like to get this "fixed" but I will have a very hard time convincing
> anyone that it's wrong.
>
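
For reference, the behaviour described in the quoted message can be checked with a short script. This is a sketch assuming Python 3 and its bundled cp1252 and latin1 codecs; the variable names are only for illustration.

```python
# Bytes 0x80-0x9F are C1 controls in iso-8859-1; cp1252 reassigns 27 of them.
# Collect the ones the cp1252 decoder refuses to decode.
undefined = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)

raw = b"\x81"

# The existing error handlers all lose the original byte.
ignored = raw.decode("cp1252", "ignore")    # byte is dropped
replaced = raw.decode("cp1252", "replace")  # byte becomes U+FFFD

# latin1, by contrast, round-trips every byte.
latin1_roundtrip = raw.decode("latin1").encode("latin1")
```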
Received on Fri Nov 16 2012 - 17:57:57 CST