On Fri, 7 Dec 2012 17:48:12 -0800
Buck Golemon <buck_at_yelp.com> wrote:
> > If you already have existing data in 1252 or a variation (and can’t
> > tell them apart), then nothing’s gained by making NEW requirements
> > for 1252 which the old data won’t conform to.
>
>
> Old latin1 documents can contain 0x81 and still be valid.
> All browsers decode latin1 documents with cp1252.
> In all cases, such a document would decode with a U+0081 character,
> with no error.

Are there *valid* Latin-1 documents with 0x81? 0x81 looks more like a
bit of mojibake. Surely what's more at issue is finding the least bad
handling of partially corrupt text, e.g. with a view to correcting
errors, just as we don't discard emails with grammatical errors in the
text.
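
To make the 0x81 case concrete, here is a rough Python sketch. It assumes
Python's built-in "cp1252" codec, which leaves 0x81 undefined, and falls
back to Latin-1 for such bytes to imitate the browser behaviour described
in the quoted text. The name decode_browser_style is made up for this
illustration and is not any browser's actual code.

```python
def decode_browser_style(data: bytes) -> str:
    """Decode as cp1252, but pass undefined bytes through as C1 controls.

    Illustrative only: a byte-at-a-time fallback to Latin-1 for the bytes
    that Python's strict cp1252 codec rejects (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    out = []
    for value in data:
        byte = bytes([value])
        try:
            out.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in strict cp1252: map the byte to the code point
            # with the same number, e.g. 0x81 -> U+0081.
            out.append(byte.decode("latin-1"))
    return "".join(out)


sample = b"caf\xe9 \x81 \x93hi\x94"

try:
    sample.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False  # the strict decoder reports 0x81 as an error

lenient = decode_browser_style(sample)
```

The strict decoder rejects the sample, while the lenient one keeps every
byte, which is what leaves the text open to later correction.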

As for Shawn Steele's recommendation to create new data in UTF-8,
there are 8-bit channels that corrupt UTF-8, such as replies via the
Yahoo groups web interface, which irrecoverably mangles some
continuation bytes.
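
As an illustration of how such damage can arise, here is a Python sketch of
a purely hypothetical 8-bit channel. It assumes the channel reads the bytes
as strict cp1252 and substitutes anything it cannot map; I do not know what
the Yahoo interface really does internally, so lossy_channel is an invented
stand-in, not a description of it.

```python
def lossy_channel(payload: bytes) -> bytes:
    """Hypothetical channel: treat bytes as strict cp1252, replace the rest."""
    text = payload.decode("cp1252", errors="replace")  # 0x81 -> U+FFFD
    return text.encode("cp1252", errors="replace")     # U+FFFD -> b"?"


original = "\u00c1 \u00e9"       # "Á é"
sent = original.encode("utf-8")  # C3 81 20 C3 A9

received = lossy_channel(sent)
decoded = received.decode("utf-8", errors="replace")

# The continuation byte 0x81 of U+00C1 has no strict cp1252 mapping, so it
# is lost for good; the A9 of U+00E9 happens to survive the round trip.
```

Once the 0x81 has become "?", nothing in the received bytes says which
continuation byte was there, so the first character cannot be restored.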

Richard.
Received on Sat Dec 08 2012 - 08:05:00 CST