Erik hit the nail on the head. To go a bit further, there are two different issues here:
1. Handling cp1252.
Dropping or misinterpreting the bytes is simply not acceptable business practice, nor is the inability to deal with such a common code page, at least at the boundary. If a system cannot deal with particular bytes, there is little choice but to map the text to a 'safe' character set that can be processed by the system. UTF-EBCDIC (UTR#16) can help with the situation on EBCDIC machines.
2. Mislabeled text.
There is plenty of mislabeled text out on the web, and it is particularly ugly to deal with. The people on the front lines receiving that text have to do their best with it. For now, that involves lots of messy heuristics -- I'm sure that Erik and Frank could tell us some stories. For the future, the specifications need to require charset parameters (HTML is definitely flawed there -- XHTML's increased rigor will help), and the tools generating the text have to improve to the point where their output is correctly labeled. It would help, for example, if FrontPage and its ilk *always* generated a charset parameter.
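To make the two issues concrete, here is a rough Python sketch of the kind of boundary heuristic I have in mind. It is only an illustration, not anybody's actual gateway code: the function name, the alias list, and the choice to treat unlabeled text as ISO-8859-1 are all my own assumptions. The idea is that text labeled ISO-8859-1 (or not labeled at all) that contains octets in the 0x80-0x9F range is probably windows-1252, so it is decoded as such and can then be re-encoded into something the downstream system handles.

```python
# Hypothetical helper; names and alias list are assumptions for illustration.
C1_RANGE = range(0x80, 0xA0)
LATIN1_LABELS = {"", "iso-8859-1", "latin-1", "latin1", "us-ascii"}


def decode_possibly_mislabeled(data, declared=None):
    """Return (text, charset_actually_used) for incoming bytes.

    Assumption: missing labels are treated like ISO-8859-1.
    """
    label = (declared or "").strip().lower()
    if label not in LATIN1_LABELS:
        # Trust any other label as given.
        return data.decode(label), label

    if any(byte in C1_RANGE for byte in data):
        # Graphic characters in the C1 range suggest windows-1252.
        try:
            return data.decode("cp1252"), "windows-1252"
        except UnicodeDecodeError:
            # A few octets (e.g. 0x81) are undefined in cp1252;
            # fall through and keep them as C1 controls instead.
            pass
    # ISO-8859-1 decoding never fails and never drops an octet.
    return data.decode("latin-1"), "iso-8859-1"


if __name__ == "__main__":
    text, used = decode_possibly_mislabeled(b"\x80 100", "ISO-8859-1")
    print(used, text.encode("utf-8"))
```

Note that nothing is dropped: octets that cannot be windows-1252 are kept as C1 controls, and the result can be passed on as UTF-8 (or UTF-EBCDIC, where available) rather than as the original octets.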
Mark
Erik van der Poel wrote:
> Alain LaBonté wrote:
> >
> > Erik van der Poel wrote:
> > >
> > > If you don't use browsers in mainframe environments, then why are we
> > > even talking about that here?
> >
> > Because texts can be copied and pasted in email messages from HTML files
> > and then inevitably they can go to a mainframe environment (or UNIX) and
> > back... Data loss guaranteed if the C1 space is used for graphic
> > characters. In French this is dramatic (for the EURO sign as well, and for
> > Finnish too).
>
> There is a boundary between mainframes and the Internet. There is a
> gateway at that boundary. The gateway should take care of the octets in
> the C1 range, so that the big mainframe doesn't choke on the data
> produced by the little PC. The gateway will need to do this for UTF-8
> *anyway*, so it might as well do it for windows-1252 too.
>
> On Unix, these C1 octets can be mapped to appropriate glyph codes, if
> the user has installed some of the more modern X fonts, such as
> *-iso10646-1. The Unix apps are still somewhat behind perhaps, but they
> will eventually catch up or die. We are planning to make some changes to
> Unix Mozilla 5.0 to deal with these windows-1252 characters (and UTF-8
> too of course).
>
> Erik