Terminals, E-Mail, and non-standard encodings (was: CP1252 under UNIX)

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Mon Apr 03 2000 - 15:26:58 EDT

On 2000-03-31 at 07:41 PST, Doug Ewell wrote:
> It was stated that "1252 violates the very basis for character set
> standards" and "All standard character sets comply with ISO 4873 and ISO
> 2022." This is based on the fact that terminal-host communication
> relies on character sets that comply with 4873 and 2022,

On 2000-03-31 at 08:35 PST, Frank da Cruz wrote:
> Some have suggested that those who read their email with terminals or
> emulators should "upgrade" their email clients to "properly" handle CP1252
> and other private code pages. That's absurd. If I write or buy an email
> client that obeys all the rules, I should not need to change it constantly
> as creative new ways are found to break the rules.

> Others have suggested that it is the responsibility of end users to take
> defensive measures to prevent nonstandard character sets from hanging or
> confusing their terminals or emulators. That's not right either. Our
> terminals and emulators *use* the C1 controls in valid, real-life,
> standards-conforming applications. I do not know in advance if a particular
> email message is going to fry my terminal, especially if its character set
> is not announced. Anyway, what if I have a real terminal, such as a VT420?
> In that case, there is no recourse -- whenever such a message arrives,
> the terminal becomes useless and must be reset. The message can not be
> read unless you put the terminal into debug mode, in which case it will
> show C1 Control Pictures in place of "smart quotes" (and for that matter,
> C0 Control Pictures in place of CR, LF, Tab, etc).

Obeying all the rules means that the combination of E-mail client and
operating system must make sure that the only control characters sent to the
terminal are really meant to control the terminal, i.e. stem from this very
software. Any control character stemming from elsewhere, in particular from
sources outside the system at hand, must be filtered out! Otherwise, any rogue
would be able to harm the system from the outside via its network connections.

An E-mail client that passes illegal bytes from the incoming message to
the display system is deficient, imho. Even layout controls, such as carriage
return, line feed, and tabs, need to be transformed from the SMTP conventions
to the local system conventions.
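
As a minimal sketch of that transformation, assuming Python and a
hypothetical helper named to_local_layout (neither comes from the original
discussion), the CR LF line ends used on the wire can be mapped to the local
convention, and tabs expanded, before anything reaches the display:

```python
import os


def to_local_layout(text: str, tab_width: int = 8) -> str:
    """Convert CRLF-delimited mail text to the local layout conventions."""
    # SMTP transmits lines terminated by CR LF; split on exactly that pair.
    lines = text.split("\r\n")
    # Expanding tabs is an assumption: it suits a display without tab stops.
    return os.linesep.join(line.expandtabs(tab_width) for line in lines)
```

Whether tabs are expanded or passed on is a local decision; the point is only
that the client, not the sender, decides what layout control reaches the
terminal.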

A decent E-mail client will filter out illegal characters from the incoming
mail, i.e. C0 and C1 characters from properly tagged ISO 8859-1 messages,
and C0 characters from properly tagged CP 1252 messages (if it is designed
to support the latter). A decent E-mail client (if it is prepared to handle
CP 1252 at all) will translate characters in the 80..9F range from incoming
CP 1252 messages to the equivalent escape sequences if the terminal is
capable of displaying the respective character, or to fall-back
representations otherwise. For messages not tagged with their respective
encoding, a decent E-mail client will assume ISO 646, and hence filter out
all characters above 7E, as well as all C0 characters.
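
The filtering rules above could be sketched roughly as follows. This is an
illustration in Python, not a description of any existing client: the name
sanitize, the FALLBACKS table, and the assumption that the terminal can show
only the ISO 8859-1 repertoire are all hypothetical. Only the fall-back
branch is shown; a terminal-specific escape-sequence branch is left out. Line
feed and tab are kept as layout, which is also an assumption.

```python
from typing import Optional

# Hypothetical fall-back representations for CP 1252 characters from the
# 0x80..0x9F range, for a terminal that cannot display them directly.
FALLBACKS = {
    "\u2018": "'", "\u2019": "'",    # single "smart quotes"
    "\u201c": '"', "\u201d": '"',    # double "smart quotes"
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2026": "...",                 # horizontal ellipsis
    "\u20ac": "EUR",                 # euro sign
}

LAYOUT = {"\n", "\t"}  # assumption: the only controls allowed through


def is_control(ch: str) -> bool:
    """True for C0 controls (incl. DEL) and C1 controls."""
    code = ord(ch)
    return code < 0x20 or code == 0x7F or 0x80 <= code <= 0x9F


def sanitize(body: bytes, charset: Optional[str] = None) -> str:
    """Return text containing no control characters from the message."""
    tag = (charset or "").lower()
    if tag in ("cp1252", "windows-1252"):
        # Bytes that CP 1252 leaves undefined become U+FFFD here.
        text = body.decode("cp1252", errors="replace")
    elif tag == "iso-8859-1":
        text = body.decode("latin-1")
    else:
        # Untagged (or, in this sketch, any other tag): assume ISO 646,
        # so every byte above 0x7E is dropped before decoding.
        text = bytes(b for b in body if b <= 0x7E).decode("ascii")

    text = text.replace("\r\n", "\n")  # SMTP line ends to local newlines
    out = []
    for ch in text:
        if ch in LAYOUT:
            out.append(ch)
        elif is_control(ch):
            continue  # control characters from outside never pass
        else:
            # Anything outside the assumed terminal repertoire gets "?".
            out.append(FALLBACKS.get(ch, ch if ord(ch) <= 0xFF else "?"))
    return "".join(out)
```

With this sketch, the same bytes 93 and 94 become plain double quotes when
the message is tagged CP 1252, and are silently dropped when it is tagged
ISO 8859-1 or not tagged at all; in no case do they reach the terminal as
C1 controls.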

So, the possible harm C1 characters can do to a terminal session is not
a valid argument against CP-1252-encoded messages. If a terminal does
hang on such a message (or is fried by it), it is always the fault of
the E-mail client or the operating system.

Whether all E-mail clients should indeed be able to accept (and handle
correctly!) CP-1252-encoded messages is an entirely unrelated question.
It can only be answered on the basis of market demand, on which I will
not comment.

On 2000-03-31 at 08:35 PST, Frank da Cruz wrote:
> IBM has done an excellent job of keeping their private EBCDIC code pages
> private, and converting them to standard character sets for interchange,
> and for that matter even publishing official mappings, so ISVs don't have
> to guess and come up with incompatible ones.

Actually, IBM's users' groups, such as SHARE Europe (formerly SEAS) had to
exert much pressure to get IBM there. Here are two quotes from the very same
official report on the SHARE Europe 1990 annual meeting (in October 1990):
- Klaus [Daube, Chairman of the SHARE Europe STWG for National Language
  Architecture] gave a [...] survey of the work undertaken by SEAS/SHARE
  Europe from the positive step taken at Spring Meeting 1980 [...] through
  to the 1990 SHARE Europe White Paper[...]
- Ted [Sasala, Director of NL Support, IBM] assured the audience that IBM
  now recognize the problems presented by National Language differences[...]
Note that, in 1990, IBM still thought of their character set and character
encoding problems in terms of "National Language differences".
(Accordingly, at the SEAS Spring Meeting 1989, Ed Hart had given a
presentation titled "National Language Problems in the U.S.?", and had
answered this question with a definite "Yes!", talking of US-English code
pages only.)

The official mappings were published only in September 1990 (in "Character
Data Representation Architecture, Level 1, Registry", SC09-1391-00). Until
then, ISVs did have to guess and indeed came up with incompatible mappings,
often based on the myth of there being just one EBCDIC and just one ISO
7-bit code dubbed ASCII. These incompatible mappings persisted long after
1990, as IBM did not reveal the de-facto encoding built into their
software (such as compilers); only after more pressure from the users'
groups did they eventually define and publish their de-facto EBCDIC page.
I think it was dubbed CP 1047 and published in 1992, but I don't have the
pertinent documents at hand.
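
To illustrate why "just one EBCDIC" is a myth, here is a small Python sketch
comparing two EBCDIC code pages. CP 037 and CP 500 are chosen only because
Python ships codecs for them; they are not claimed to be the pages involved
in the history above, and the helper name differing is made up for this
example.

```python
def differing(chars: str, page_a: str, page_b: str) -> dict:
    """Map each character to its (page_a, page_b) bytes where they differ."""
    result = {}
    for ch in chars:
        a, b = ch.encode(page_a), ch.encode(page_b)
        if a != b:
            result[ch] = (a.hex(), b.hex())
    return result


# Letters and digits agree between the two pages; some punctuation does not.
print(differing("Aa0[]!|^", "cp037", "cp500"))
```

A mapping table guessed from one page therefore garbles exactly those
punctuation characters when applied to data from the other.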

I think only market demand (or call it pressure from users' organizations)
will eventually cause vendors to comply with character-encoding standards.

I suggest that E-mail clients (and browsers, btw.) should *never* try to
guess the character encoding; rather they should notify the end-user of
all inconsistencies in incoming messages (or HTML pages, respectively)
and offer the following choices:
- send a complaint to the author of the offending message, explaining the
- filter out illegal characters, replacing them with some sort of sub-
- let the user choose an encoding to try (explaining the possible choices
  and their effects).
Only if all offences (such as tagging CP 1252 data as ISO 8859-1) are
reported will we be able to accumulate enough market demand for correct
implementations. IBM eventually succeeded just because their users
were aware of the issues and demanded solutions.
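
Detecting such an offence without guessing at the encoding could look like
the following Python sketch. The function name find_offences and the choice
to treat tab, line feed and carriage return as legal layout are assumptions
made for illustration; the sketch only reports, and leaves the decision to
the user.

```python
def find_offences(body: bytes, charset: str) -> list:
    """Return (offset, byte) pairs that are illegal for the declared charset."""
    tag = charset.lower()
    offences = []
    for offset, byte in enumerate(body):
        # C0 controls other than TAB, LF and CR (kept as layout: an assumption).
        c0 = (byte < 0x20 and byte not in (0x09, 0x0A, 0x0D)) or byte == 0x7F
        c1 = 0x80 <= byte <= 0x9F
        if tag == "iso-8859-1":
            bad = c0 or c1
        elif tag in ("cp1252", "windows-1252"):
            bad = c0
        else:
            # Sketch only: every other tag is checked as 7-bit ISO 646.
            bad = c0 or byte > 0x7E
        if bad:
            offences.append((offset, byte))
    return offences


# "Smart quotes" encoded as CP 1252 bytes, but tagged as ISO 8859-1.
report = find_offences(b"\x93quoted\x94", "ISO-8859-1")
if report:
    print("inconsistent message, offending bytes at:", report)
```

A non-empty report is the point at which the client would notify the user
and offer the choices listed above, rather than silently reinterpreting
the bytes.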

Best wishes,
   Otto Stolz

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT