Aw: Re: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Jörg Knappen on 2014-01-29 (Unicode Mail List Archive)

From: Jörg Knappen <jknappen_at_web.de>
Date: Wed, 29 Jan 2014 15:59:43 +0100 (CET)

A little postscriptum to this old thread:

On PyPI, there is now a codec available that handles the peculiar definition of "latin1" inside MySQL.

The package is called mysql-latin1-codec and provides an encoding consisting of cp1252 plus the five bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D, which are undefined in the Python codec for cp1252.

https://pypi.python.org/pypi/mysql-latin1-codec/1.0
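
To illustrate what "cp1252 plus five bytes" amounts to, here is a minimal sketch in plain Python. It does not use the package and makes no claim about how the package is implemented or what its API looks like; the function name decode_mysql_latin1 is made up for this example. It assumes the five extra bytes map to the C1 control characters with the same code point values, as they do in latin1.

```python
# Sketch only: NOT the mysql-latin1-codec package, just the idea behind it.
# Assumption: the five bytes undefined in Python's cp1252 codec decode to
# the C1 control characters U+0081, U+008D, U+008F, U+0090, U+009D.

def decode_mysql_latin1(data: bytes) -> str:
    """Decode byte by byte as cp1252, falling back to latin1 where undefined."""
    chars = []
    for byte in data:
        raw = bytes([byte])
        try:
            chars.append(raw.decode('cp1252'))
        except UnicodeDecodeError:
            # Only reached for the bytes cp1252 leaves undefined.
            chars.append(raw.decode('latin1'))
    return ''.join(chars)


if __name__ == '__main__':
    # 0x80 is the euro sign in cp1252; 0x81 has no cp1252 mapping.
    print(repr(decode_mysql_latin1(b'\x80\x81')))
```

Working byte by byte is slow but keeps the fallback rule obvious; a real codec would use a precomputed table.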

--Jörg Knappen

Sent: Wednesday, 30 October 2013 at 19:14
From: "Buck Golemon" <[email protected]>
To: "Frédéric Grosshans" <[email protected]>
Cc: "Jörg Knappen" <[email protected]>, unicode <[email protected]>
Subject: Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <[email protected]> wrote:

On 30/10/2013 17:32, "Jörg Knappen" wrote:

The data contained not only latin-1 type mangling for the non-existent Windows characters, but also sequences with the raw C1 control characters for all of latin-1. So I had to handle them, too.
The data weren't consistent at all, not even in their errors.
--Jörg Knappen

Your question helped me dust off and repair a non-working Python snippet I wrote for a similar problem. I was stuck on the mix of windows-1252 and latin1 control characters (in connection with Chinese characters). I include it below for reference.

The Python snippet below does not need sed; it defines a function, unscramble(S), which works on strings. Extending it to files should be easy.

Frédéric Grosshans

def Step1Filter(S):
    for c in S:
        # works character by character because of the cp1252/latin1 ambiguity
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
            yield c.encode('latin1')

def unscramble(S):
    return b''.join(c for c in Step1Filter(S)).decode('utf8')
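
As a quick check of the snippet above, the following sketch runs it on a few known cases. The two functions are repeated so the block is self-contained; scramble is a hypothetical helper, added only to simulate the damage by reading UTF-8 bytes as cp1252 with a latin1 fallback. It assumes Python 3.

```python
def Step1Filter(S):
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Covers the five bytes cp1252 leaves undefined and raw C1 controls
            yield c.encode('latin1')

def unscramble(S):
    return b''.join(Step1Filter(S)).decode('utf8')

def scramble(S):
    """Hypothetical helper: misread the UTF-8 bytes of S as cp1252/latin1."""
    out = []
    for byte in S.encode('utf8'):
        raw = bytes([byte])
        try:
            out.append(raw.decode('cp1252'))
        except UnicodeDecodeError:
            out.append(raw.decode('latin1'))
    return ''.join(out)

if __name__ == '__main__':
    # Plain cp1252 mangling: UTF-8 bytes C3 A9 were read as two characters
    assert unscramble('Ã©') == 'é'
    # UTF-8 bytes C3 90: 0x90 is undefined in cp1252, so U+0090 appears
    assert unscramble('Ã\x90') == 'Ð'
    # Same bytes C3 80, mangled once via cp1252 and once as a raw C1 control
    assert unscramble('Ã€') == 'À'
    assert unscramble('Ã\x80') == 'À'
    # Round trip through the simulated damage
    for original in ('é', '€', 'Ð', '中'):
        assert unscramble(scramble(original)) == original
```

The last two single-string cases show why the fallback matters for inconsistent data: both kinds of mangling decode to the same character.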

PS: If anyone is interested in a licence, I consider this simple enough to be in the public domain and uncopyrightable.

The encoding you've implemented above is known as windows-1252 to the WHATWG and all browsers [1][2].

The implementation of cp1252 in Python is instead a direct consequence of the unicode.org definition [3].

[1] http://encoding.spec.whatwg.org/index-windows-1252.txt

[2] http://bukzor.github.io/encodings/cp1252.html

[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
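
The difference between the two definitions can be checked from Python itself. This is a small sketch, assuming Python 3 and its bundled cp1252 codec; the function name is made up for the example. It lists the bytes the codec refuses to decode, which are the ones the WHATWG index maps to the C1 control characters of the same value.

```python
def undefined_in_python_cp1252():
    """Return the byte values Python's cp1252 codec cannot decode."""
    missing = []
    for byte in range(0x100):
        try:
            bytes([byte]).decode('cp1252')
        except UnicodeDecodeError:
            missing.append(byte)
    return missing

if __name__ == '__main__':
    for byte in undefined_in_python_cp1252():
        # latin1 maps every byte to the code point with the same value,
        # which is also what the WHATWG windows-1252 index does here.
        code_point = ord(bytes([byte]).decode('latin1'))
        print('0x%02X -> U+%04X' % (byte, code_point))
```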

Received on Wed Jan 29 2014 - 09:01:01 CST
