Re: data for cp1252 from Richard Wordingham on 2012-12-08 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 8 Dec 2012 14:00:58 +0000

On Fri, 7 Dec 2012 17:48:12 -0800
Buck Golemon <buck_at_yelp.com> wrote:

> > If you already have existing data in 1252 or a variation (and can’t
> > tell them apart), then nothing’s gained by making NEW requirements for
> > 1252 which the old data won’t conform to.
>
>
> Old latin1 documents can contain 0x81 and still be valid.
> All browsers decode latin1 documents with cp1252.
> In all cases, such a document would decode with a U+0081 character,
> with no error.

Are there *valid* Latin-1 documents with 0x81? 0x81 looks more like a
bit of mojibake. Surely what's more at issue is finding the least bad
handling of partially corrupt text, e.g. with a view to correcting
errors, just as we don't discard emails with grammatical errors in the
text.
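
For readers who want to see the 0x81 behaviour concretely, here is a minimal Python sketch. It assumes Python's built-in "latin-1" and "cp1252" codecs. The helper decode_1252_lenient is a hypothetical name, written only to imitate the browser behaviour Buck describes (undefined bytes passed through as C1 controls); it is not a standard library function.

```python
# Sample bytes: "café " in Latin-1 followed by the stray byte 0x81.
data = b"caf\xe9 \x81"

# ISO 8859-1 maps every byte to the code point of the same value,
# so 0x81 silently becomes the C1 control character U+0081.
latin1_text = data.decode("latin-1")

# Python's strict cp1252 codec leaves 0x81 undefined and rejects it.
try:
    data.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False


def decode_1252_lenient(raw: bytes) -> str:
    """Decode as cp1252, mapping undefined bytes to the same-valued code point.

    Hypothetical helper: it mimics the "no error, U+0081" behaviour
    described in the quoted message, one byte at a time.
    """
    out = []
    for b in raw:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))  # e.g. 0x81 -> U+0081
    return "".join(out)


lenient_text = decode_1252_lenient(data)
```

Here strict_ok ends up False, while both latin1_text and lenient_text end in U+0081. The lenient decoder differs from Latin-1 only for bytes that cp1252 does define, such as 0x80, which it decodes as the euro sign.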

As for Shawn Steele's recommendation to create new data in UTF-8,
there are 8-bit channels that corrupt UTF-8, such as replies via the
Yahoo groups web interface, which irrecoverably mangles some
continuation bytes.
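
The mechanism behind that mangling is not stated here. As an illustration only, the sketch below assumes a channel that reinterprets the bytes as cp1252 and substitutes anything it cannot map. The function through_cp1252_channel is a hypothetical stand-in for such a channel, not a description of what any particular web interface actually does. It shows how only some continuation bytes would be lost under that assumption.

```python
def through_cp1252_channel(utf8_bytes: bytes) -> bytes:
    """Hypothetical 8-bit channel: decode as cp1252, then re-encode.

    Bytes that cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
    become U+FFFD on decoding and then "?" on re-encoding, so the
    original byte value cannot be recovered.
    """
    text = utf8_bytes.decode("cp1252", errors="replace")
    return text.encode("cp1252", errors="replace")


# U+00E9 is C3 A9 in UTF-8; both bytes are defined in cp1252 and survive.
safe = through_cp1252_channel("\u00e9".encode("utf-8"))

# U+0141 is C5 81 in UTF-8; the continuation byte 0x81 is undefined
# in cp1252 and is replaced, leaving an invalid UTF-8 sequence.
mangled = through_cp1252_channel("\u0141".encode("utf-8"))

try:
    mangled.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```

Under this assumption safe is unchanged, mangled is b"\xc5?", and still_valid is False: the lead byte survives but the continuation byte is gone for good.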

Richard.
Received on Sat Dec 08 2012 - 08:05:00 CST