On 30/10/2013 17:32, "Jörg Knappen" wrote:
> The data did not only contain latin-1 type mangling for the 
> non-existent Windows characters, but also sequences with the raw
> C1 control characters for all of latin-1. So I had to do them, too.
> The data weren't consistent at all, not even in their errors.
> --Jörg Knappen
Your question helped me dust off and repair a non-working Python snippet 
I wrote for a similar problem. I had been stuck on the mix of 
windows-1252 and latin1 control characters (in my case, involving Chinese 
characters). I include it below for reference.
The Python snippet below does not need sed; it defines a function, 
unscramble(S), that works on strings. Extending it to files should be 
easy.
     Frédéric Grosshans
def Step1Filter(S):
    # Works character by character because of the cp1252/latin1 ambiguity.
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Fallback for raw C1 controls, including the bytes where
            # cp1252 is undefined (81, 8D, 8F, 90, 9D).
            yield c.encode('latin1')

def unscramble(S):
    # Rebuild the original byte string, then decode it as UTF-8.
    return b''.join(Step1Filter(S)).decode('utf8')
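
The extension to files mentioned above could look roughly like the sketch below. It repeats the two functions so that it runs on its own. The helper name unscramble_file and its two path parameters are hypothetical, and the sketch assumes that the mangled file is itself stored as UTF-8 and that every line in it is mangled.

```python
def Step1Filter(S):
    # Same per-character logic as the snippet above.
    for c in S:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            yield c.encode('latin1')

def unscramble(S):
    return b''.join(Step1Filter(S)).decode('utf8')

def unscramble_file(src_path, dst_path):
    # Hypothetical helper: repairs a file line by line.
    # Assumption: the mangled text is stored as UTF-8 on disk.
    # newline='' keeps the original line endings untouched.
    with open(src_path, encoding='utf8', newline='') as src, \
         open(dst_path, 'w', encoding='utf8', newline='') as dst:
        for line in src:
            dst.write(unscramble(line))

if __name__ == '__main__':
    # cp1252-style mangling of U+20AC (bytes E2 82 AC) ...
    print(unscramble('\u00e2\u201a\u00ac'))
    # ... and the raw C1 variant of the same character.
    print(unscramble('\u00e2\u0082\u00ac'))
```

Working line by line is safe here because a line break is a single ASCII byte and so never falls inside a multi-byte UTF-8 sequence. Text that is already correct and contains characters outside both cp1252 and latin1 will raise UnicodeEncodeError, so a file that mixes mangled and clean lines would need extra handling.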
PS: If anyone is interested in a licence, I consider this simple enough 
to be in the public domain and uncopyrightable.
Received on Wed Oct 30 2013 - 11:58:30 CDT