RE: DEC multilingual code page, ISO 8859-1, etc.

From: Frank da Cruz (fdc@columbia.edu)
Date: Fri Mar 24 2000 - 10:29:41 EST


Chris Pratley wrote:

> Doug, although I was nowhere near Microsoft when the decision to define
> Windows-1252 was taken, I think your paragraph is good:
>
> No doubt MS added the extra graphics characters to the 0x80-0x9F range
> without any worries about conflicting with the C1 range, since few
> people dreamed in the mid-1980s that one day PCs would all be connected
> to the Internet and would be exchanging data with ISO 2022-compliant
> systems.
>
Actually everybody dreamed it. If you had your own desktop computer around
1980, the very first thing you wanted was to communicate with your company
or university minis and mainframes. Even those who had no access to minis
and mainframes immediately set about inventing BBSs, Fidonet, etc. In fact,
I might go so far as to say that the whole point of the desktop computing
revolution WAS communications.

Before the IBM PC made its debut, we had Kermit communications software
working on everything from Apple II's, Commodores, Ataris, and CP/M-80 and
86, to Unix, VMS, IBM mainframes, PDP-11s, and DEC-10s and 20s. Within
weeks of the PC's appearance, we had Kermit working there too, and it
quickly became one of the most popular packages for DOS. A review of any of
the mailing lists (Info-IBMPC, the Simtel archives, etc.) will show the
importance of communications -- and, in particular, terminal-oriented
communications -- at the time.

> When I played around on Atari400/800 and Apple II computers in the late
> 70s early 80s, those all had their own character sets just like DOS
> did. It doesn't seem to bother anyone now that those vendors made up their
> own sets of characters.
>
Every company has the right to build any kind of computer they want and to
make up any kind of character sets they can imagine for private use within
their own proprietary environment, and of course many companies did this,
and continue to do it -- Apple, IBM, Hewlett Packard, etc. That's not the
problem.

> I remember at that time no one could share files because you
> couldn't even read each other's disks let alone character encodings, and
> modems were 300 baud and only used by the very few. Sharing data across
> computer systems was just not part of the design spec.
>
But they all did it anyway because we (the Kermit Project) provided the
software that let them do it. The success of the Kermit protocol and
software is largely due to its adherence to standards and to the
fundamental principle of data communications and networking: the concept
of a "standard intermediate representation" for all data that goes on the
wire. No matter how crazy a computer's internal character set, text is
converted to a standard one when transferred in text mode to another
computer. Thus each computer needs to know only its own private encodings
and the corresponding standard ones. NO COMPUTER OR APPLICATION should
OTHER computer's private, nonstandard encodings. For a more detailed
treatment of this concept, see:

  ftp://kermit.columbia.edu/kermit/e/accents.ps
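
To make the principle concrete, here is a minimal sketch in modern terms
(Python, purely for illustration; the choice of ISO 8859-1 as the wire set
and the sample byte values are illustrative assumptions, not a description
of what Kermit actually does on any given transfer):

  # Sketch: each side converts between its own private encoding and an
  # agreed standard wire encoding; neither side ever needs to know the
  # other's private code page.
  WIRE = "iso-8859-1"        # the standard intermediate representation

  def to_wire(local_bytes, local_encoding):
      # Sender: private encoding -> standard encoding.
      return local_bytes.decode(local_encoding).encode(WIRE)

  def from_wire(wire_bytes, local_encoding):
      # Receiver: standard encoding -> private encoding.
      return wire_bytes.decode(WIRE).encode(local_encoding)

  # A CP437 (DOS) machine sends "café" to a Mac Roman machine:
  # 'é' is 0x82 in CP437, 0xE9 on the wire, 0x8E in Mac Roman.
  wire = to_wire(b"caf\x82", "cp437")    # -> b'caf\xe9'
  print(from_wire(wire, "mac-roman"))    # -> b'caf\x8e'

Each machine implements only two tables -- its own set against the standard
one -- instead of one table for every other vendor's private set.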

> And if we had the benefit of hindsight, there are a bunch more things I
> would fix before the conflict of windows codepages with iso-2022-jp in
> plain-text data transfer.
>
Me too. But CP437 and its successors were not, in themselves, a bad thing.
What's bad is putting them on the wire. This is a cardinal principle of all
networking protocols, and in particular of the Internet, at least until its
commercialization. See, for example, the RFCs of Padlipsky (1971-1985) or
his book "The Elements of Networking Style".

> BTW, with IE and Office, we try to support the text and HTML encodings used
> on Windows, DOS, Mac, Unix, and even EBCDIC (no, we don't handle Atari or
> TRS-80 yet nor will we)...
>
But you shouldn't have to do any of that. Handle your own code pages
internally, and convert to and from standard encodings (ISO, JIS, UTF-8,
etc) on the wire. Apple, IBM, and everybody else should do the same thing.
Don't put Windows Code Pages in email because you can't assume the recipient
is reading the mail on Windows. Don't put them in Web pages, because the
Web is browsed by clients running on every imaginable platform, even by
Lynx as seen through an ISO 2022-compliant terminal or emulator.
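
To see concretely what goes wrong, consider this small sketch (Python
again, purely for illustration): the very bytes that windows-1252 assigns
to printable characters are C1 control codes to an ISO-compliant receiver.

  raw = b"\x93quoted\x94"                 # windows-1252 curly quotes
  print(raw.decode("windows-1252"))       # the quotes, as the author intended
  print(repr(raw.decode("iso-8859-1")))   # '\x93quoted\x94'
  # In ISO 8859-1 (and ISO 6429), 0x93 and 0x94 are the C1 controls STS
  # and CCH -- not printable characters at all.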

> ... but I see a lot of complaints from Unix users on this list about not
> being able to read HTML pages encoded in windows-1252 when the characters
> in the 0x80-0x9F range are used. I'm curious why the makers of whatever
> browsers these are don't simply add support for non-ISO encodings like
> windows-1252 and be done with it (whether windows-1252 is registered at
> the glacial IANA or not shouldn't matter - we tried to register
> windows-1252 there for years with no response, yet the missing
> registration is claimed to be the fault of Microsoft. Bizarre). Isn't it
> fairly trivial and also worthwhile to support this encoding? I'm genuinely
> curious, so no flames please.
>
It's just plain wrong. If you can do it then anyone can. Standards are
there for a reason. They are hard-fought -- as every reader of this mailing
list knows -- and represent compromises that all parties to their making can
live with and subscribe to.

Unicode is a standard in this true sense, arrived at by dialog and consensus
among competing and conflicting interests, as are ISO 10646 and, before
them, ISO 8859, 2022, 4873, 6429, and 646. MIME character-set
registration, on the other hand, blithely undercuts this time-honored and
proven practice by conferring the aura of "standard" on quite literally any
character set any company cares to make up, and this makes life increasingly
difficult for everybody -- developers and users alike -- as a plethora of
private character sets appears on web pages, in email, in netnews, and
everywhere else. There is no excuse for this. If you make email or Web-
authoring software, you have more than enough standard character sets to
choose from, including Unicode. Use them.
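
Doing the right thing costs little. As a sketch (modern Python; the message
text is invented for illustration), convert from the local code page to a
standard encoding and label it honestly before anything goes on the wire:

  from email.message import EmailMessage

  local = b"It costs \x8099"        # 0x80 is the euro sign in windows-1252
  msg = EmailMessage()
  msg["Subject"] = "Price list"
  # Transcode the local code page to a standard encoding and label it.
  msg.set_content(local.decode("windows-1252"), charset="utf-8")
  print(msg["Content-Type"])        # text/plain; charset="utf-8"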

Sorry if this seems like a detour from Unicode. It's not. As
D.V. Henkel-Wallace <gumby@henkel-wallace.org> pointed out yesterday,
if we are tolerant of standards abuse, then why shouldn't Unicode itself
be the next victim?

- Frank


