Sent: Wednesday, 30 October 2013 at 16:58
From: "Frédéric Grosshans" <frederic.grosshans@gmail.com>
To: "Jörg Knappen" <jknappen@web.de>
Cc: unicode@unicode.org
Subject: Re: Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"
On 30/10/2013 16:13, "Jörg Knappen" wrote:
> Thanks again!
> My updated sed pattern generator now looks like:
> r = range(0xa0, 0x170)
> file = open("fixu8.sed", "w")
> for i in r:
>     pat1 = ("s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8")
>             + "/" + unichr(i).encode("utf-8") + "/g")
>     print >>file, pat1
>     try:
>         pat2 = ("s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8")
>                 + "/" + unichr(i).encode("utf-8") + "/g")
>     except:
>         pat2 = pat1
>     if (pat1 != pat2):
>         print >>file, pat2
> This handles double UTF-8 mangled through either latin-1 or windows-1252.
> That is probably enough for now; the error rate is low enough for
> practical purposes (i.e., lower than the natural error rate introduced
> by typing errors).
>
Why do you do both latin1 and windows-1252? Windows-1252 is supposed to
be a superset of latin1, so it should be enough. Or is there a problem
with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?
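
To make the question concrete, here is a minimal Python 3 sketch (my own illustration, not part of the generator quoted above; the names classify and counts are hypothetical). For each code point in the same range, it checks whether decoding the UTF-8 bytes as latin-1 and as windows-1252 gives the same mangled string, a different one, or fails because a byte is undefined in windows-1252. It assumes Python's strict "cp1252" codec, which rejects the five undefined bytes.

```python
def classify(cp):
    """Compare the latin-1 and cp1252 manglings of one code point."""
    raw = chr(cp).encode("utf-8")          # correct UTF-8 bytes
    via_latin1 = raw.decode("latin-1")     # latin-1 accepts every byte
    try:
        via_cp1252 = raw.decode("cp1252")  # strict: fails on 81, 8D, 8F, 90, 9D
    except UnicodeDecodeError:
        return "cp1252-undefined"
    return "same" if via_latin1 == via_cp1252 else "different"


# Tally the three cases over the range used by the sed generator.
counts = {}
for cp in range(0xA0, 0x170):
    kind = classify(cp)
    counts[kind] = counts.get(kind, 0) + 1

for kind, n in sorted(counts.items()):
    print(kind, n)
```

If this sketch is right, code points in the "different" group would need a separate pattern for each mangling, and those in the "cp1252-undefined" group could only be produced by latin-1 or by a decoder that passes the undefined bytes through unchanged.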
Frédéric