Re: Re: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

From: Buck Golemon <buck_at_yelp.com>
Date: Wed, 29 Jan 2014 10:21:55 -0800

Jörg:

This is the definition of cp1252 used by the WHATWG and all current browser
implementations.
I asked the cp1252 maintainer to update the definition so that we
don't have two competing standards, but the request was rejected.
I've been considering naming it cp1252-whatwg.
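
For illustration, here is a minimal sketch of how that definition differs
from Python's built-in cp1252 codec. It assumes Python 3, and the name
decode_cp1252_whatwg is hypothetical, not part of any library or of the
mysql-latin1-codec package. The five bytes that Python leaves undefined
(81, 8D, 8F, 90, 9D) fall through to latin1, so they map to the C1 control
characters with the same numbers:

```python
def decode_cp1252_whatwg(data: bytes) -> str:
    """Decode bytes as cp1252, falling back to latin1 per byte.

    Hypothetical helper: Python's cp1252 codec raises UnicodeDecodeError
    on 0x81, 0x8D, 0x8F, 0x90 and 0x9D; here those bytes decode to
    U+0081, U+008D, U+008F, U+0090 and U+009D instead.
    """
    chars = []
    for i in range(len(data)):
        one_byte = data[i:i + 1]
        try:
            chars.append(one_byte.decode('cp1252'))
        except UnicodeDecodeError:
            # Only reached for the five bytes undefined in Python's cp1252.
            chars.append(one_byte.decode('latin1'))
    return ''.join(chars)


# 0x80 is the euro sign in cp1252; 0x81 is one of the five undefined bytes.
text = decode_cp1252_whatwg(b'\x80\x81')
```

Working byte by byte is slow but keeps the fallback obvious; a real codec
would use a 256-entry lookup table instead.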

On Wed, Jan 29, 2014 at 6:59 AM, "Jörg Knappen" <jknappen_at_web.de> wrote:

> A little postscript to this old thread:
>
> On PyPI, there is now a codec available that handles the peculiar
> definition of "latin1" inside MySQL.
> The package is called mysql-latin1-codec and features an encoding
> consisting of cp1252 plus
> 0x81, 0x8D, 0x8F, 0x90, 0x9D (these five bytes are undefined in
> the Python codec for cp1252).
>
> https://pypi.python.org/pypi/mysql-latin1-codec/1.0
>
> --Jörg Knappen
>
> *Sent:* Wednesday, 30 October 2013, 19:14
> *From:* "Buck Golemon" <buck_at_yelp.com>
> *To:* "Frédéric Grosshans" <frederic.grosshans_at_gmail.com>
> *Cc:* "Jörg Knappen" <jknappen_at_web.de>, unicode <unicode_at_unicode.org>
> *Subject:* Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8
> twice"
>
>
> On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <
> frederic.grosshans_at_gmail.com> wrote:
>>
>> On 30/10/2013 17:32, "Jörg Knappen" wrote:
>>
>>>
>>> The data did not only contain latin-1 type mangling for the non-existent
>>> Windows characters, but also sequences with the raw
>>> C1 control characters for all of latin-1. So I had to do them, too.
>>> The data weren't consistent at all, not even in their errors.
>>> --Jörg Knappen
>>
>> Your question helped me dust off and repair a non-working Python snippet
>> I wrote for a similar problem. I was stuck on the mixing of windows-1252
>> and latin1 controls (in connection with Chinese characters). I include it
>> below for reference.
>>
>> The Python snippet below does not need sed; it defines a function,
>> unscramble(S), which works on strings. Extending it to files should be
>> easy.
>>
>> Frédéric Grosshans
>>
>>
>> def Step1Filter(S):
>>     for c in S:
>>         # Works character by character because of the cp1252/latin1 ambiguity
>>         try:
>>             yield c.encode('cp1252')
>>         except UnicodeEncodeError:
>>             # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
>>             yield c.encode('latin1')
>>
>> def unscramble(S):
>>     return b''.join(c for c in Step1Filter(S)).decode('utf8')
>>
>> PS: If anyone is interested in a licence, I consider this simple enough
>> to be in the public domain and uncopyrightable.
>>
>
> The encoding you've implemented above is known as windows-1252 by the
> WHATWG and all browsers [1][2].
> The implementation of cp1252 in Python is instead a direct consequence of
> the unicode.org definition [3].
>
> [1] http://encoding.spec.whatwg.org/index-windows-1252.txt
> [2] http://bukzor.github.io/encodings/cp1252.html
> [3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>

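For anyone who wants to try the quoted unscramble() approach, here is a
minimal round-trip sketch, assuming Python 3. The function mangle() is a
hypothetical helper that only simulates the "UTF-8 twice" damage (UTF-8
bytes misread as cp1252, with latin1 for the five undefined bytes); it is
not taken from the thread or from any package. The other two functions
restate the quoted snippet with lower-case names.

```python
def step1_filter(s):
    """Yield one byte per character: cp1252 first, latin1 as fallback."""
    for c in s:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            # Reached for characters outside cp1252, notably the C1
            # controls U+0081, U+008D, U+008F, U+0090 and U+009D.
            yield c.encode('latin1')


def unscramble(s):
    """Undo one round of 'UTF-8 read as cp1252/latin1' mangling."""
    return b''.join(step1_filter(s)).decode('utf8')


def mangle(s):
    """Hypothetical helper: simulate the damage for testing purposes."""
    chars = []
    for b in s.encode('utf8'):
        raw = bytes([b])
        try:
            chars.append(raw.decode('cp1252'))
        except UnicodeDecodeError:
            chars.append(raw.decode('latin1'))
    return ''.join(chars)


# 'Í' is U+00CD, i.e. UTF-8 bytes C3 8D; 8D is undefined in Python's cp1252,
# so this case exercises the latin1 fallback in both directions.
samples = ['Jörg', 'Í', '€']
restored = [unscramble(mangle(s)) for s in samples]
```

Note that unscramble() raises UnicodeEncodeError or UnicodeDecodeError on
text that was never mangled this way, so inconsistent data such as Jörg
describes would still need per-record error handling.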
Received on Wed Jan 29 2014 - 12:23:05 CST
