Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Jörg Knappen on 2013-10-30 (Unicode Mail List Archive)

From: Jörg Knappen <jknappen_at_web.de>
Date: Wed, 30 Oct 2013 17:32:13 +0100 (CET)

The data contained not only Latin-1-type mangling for the characters that do not exist in windows-1252, but also sequences with the raw C1 control characters for all of Latin-1. So I had to handle those, too.

The data weren't consistent at all, not even in their errors.
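
A combined repair covering both kinds of mangling could be sketched in Python 3 as follows. This is not the script that was run on the actual data; the names `mangled_forms`, `TABLE`, `PATTERN` and `fix_double_utf8` are made up for illustration, and the sketch assumes that bytes undefined in windows-1252 were passed through as C1 control characters.

```python
import re

def mangled_forms(ch):
    """Return the 'UTF-8 twice' spellings of ch under both misreadings."""
    raw = ch.encode("utf-8")
    latin1 = raw.decode("latin-1")  # bytes 0x80-0x9F become raw C1 controls
    mixed = []
    for b in raw:
        try:
            mixed.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in Python's cp1252
            # codec; assumption: the mangling passed them through unchanged.
            mixed.append(chr(b))
    return {latin1, "".join(mixed)}

# Same code point range as the sed generator: U+00A0 .. U+016F.
TABLE = {
    form: chr(cp)
    for cp in range(0xA0, 0x170)
    for form in mangled_forms(chr(cp))
}
PATTERN = re.compile("|".join(re.escape(form) for form in TABLE))

def fix_double_utf8(text):
    """Replace every known mangled sequence with the character it stands for."""
    return PATTERN.sub(lambda m: TABLE[m.group(0)], text)

if __name__ == "__main__":
    print(fix_double_utf8("J\u00c3\u00b6rg Knappen"))
```

Because the table holds both spellings of each character, the input does not need to be consistent about which misreading produced a given sequence. Like the sed rules, it would also rewrite a correctly encoded text that happens to contain one of these two-character sequences.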

--Jörg Knappen

Sent: Wednesday, 30 October 2013 at 16:58
From: "Frédéric Grosshans" <[email protected]>
To: "Jörg Knappen" <[email protected]>
Cc: [email protected]
Subject: Re: Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

On 30/10/2013 16:13, "Jörg Knappen" wrote:
> Thanks again!
> My updated sed pattern generator now looks like:
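> # Python 2: emit sed rules that undo "UTF-8 twice" for U+00A0..U+016F
> r = range(0xa0, 0x170)
> file = open("fixu8.sed", "w")
> for i in r:
>     # mangled form when the UTF-8 bytes were misread as latin-1
>     pat1 = ("s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8")
>             + "/" + unichr(i).encode("utf-8") + "/g")
>     print >>file, pat1
>     try:
>         # mangled form when the bytes were misread as windows-1252
>         pat2 = ("s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8")
>                 + "/" + unichr(i).encode("utf-8") + "/g")
>     except UnicodeDecodeError:
>         # byte undefined in windows-1252: fall back to the latin-1 rule
>         pat2 = pat1
>     if pat1 != pat2:
>         print >>file, pat2
> file.close()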
> This handles double UTF-8 mangled through both latin-1 and
> windows-1252. It is probably enough for now: the rate of errors is low
> enough for practical purposes (i.e., lower than the natural error rate
> introduced by typing errors).
>
Why do you do both latin1 and windows-1252? Windows-1252 is supposed to
be a superset of latin1, so it should be enough. Or is there a problem
with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?
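
One way to look at this question is to compare the two misreadings directly for the range used in the generator. The Python 3 sketch below is not from the thread. It relies on Python's `cp1252` codec, which treats those five bytes as undefined, and the list names `cp1252_fails` and `both_needed` are chosen here for illustration.

```python
both_needed = []   # characters whose latin-1 and windows-1252 forms differ
cp1252_fails = []  # characters that strict windows-1252 cannot decode at all

for cp in range(0xA0, 0x170):
    raw = chr(cp).encode("utf-8")
    latin1_form = raw.decode("latin-1")
    try:
        cp1252_form = raw.decode("cp1252")
    except UnicodeDecodeError:
        # The UTF-8 continuation byte is one of 81, 8D, 8F, 90, 9D.
        cp1252_fails.append(chr(cp))
        continue
    if cp1252_form != latin1_form:
        # The continuation byte lies in 0x80-0x9F: latin-1 yields a C1
        # control, windows-1252 yields a printable character.
        both_needed.append(chr(cp))

print(len(cp1252_fails), "characters hit an undefined windows-1252 byte")
print(len(both_needed), "characters have distinct latin-1 and windows-1252 forms")
```

Characters in the first list can only be matched through the latin-1 rule. Characters in the second list need both rules if the data mixes the two manglings.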

Frédéric

Received on Wed Oct 30 2013 - 11:34:06 CDT
