RE: CP1252 under Unix

From: Frank da Cruz (fdc@columbia.edu)
Date: Sat Mar 25 2000 - 12:36:22 EST


Chris Pratley <chrispr@MICROSOFT.com> wrote:
> Frank, please. It is of no benefit in this always-connected world for any
> large corporation to push its own character encoding over Unicode. Let me
> debunk that myth yet again.
>
No need. Of course we all agree that Microsoft is out in front in deploying
Unicode -- good!

> The reason is that Unicode solves the problems that used to require new
> encodings (such as, how do you support "smart" typographically correct
> quotes ... even though the proposed 8-bit standard does not provide them?).
>
As Markus's later message demonstrates, "smart" quotes are not so smart. A
standard is a compromise. By its very nature it does not please anybody --
all parties to making it give up something in order to get something. No
standards body dealing with 8-bit sets ever thought it was worthwhile to
include a variety of typographical quote variations. That means:

 1. Companies that are passionate about providing such things to their
    customers can take a more active role in the standards process to
    argue the case for "smart" quotes as against, say, Icelandic letters,
    OE digraphs, or dotless i's; or:

 2. They make do with regular quotes, leaving it up to the rendering
    software to make them look nice; or:

 3. They issue a private code page that includes them.

When a company chooses (3), it is the company's responsibility to keep the
code page private. Of course it can't control what users and ISVs do, but
the company itself should not field software that uses a private code page
for interchange: that is quite simply antisocial behavior.

> cp-1252 was not designed to subvert iso-8859-1. If that were the
> case, it would not have been designed as a superset.
>
A subtle point, but one which does not hold for the other CP12xx's.
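
The superset claim can be checked mechanically. The following is only a minimal Python sketch, assuming the codec tables shipped with Python ("cp1252", "latin-1", "cp1250", "iso8859_2") are faithful to the code pages and standards in question; the helper name graphic_differences is made up for illustration. It compares only the graphic positions 0xA0-0xFF and uses CP1250 against ISO 8859-2 as one example of another CP12xx.

```python
def graphic_differences(codepage, iso_name):
    """Return the byte values in 0xA0..0xFF that the two charsets decode differently."""
    diffs = []
    for b in range(0xA0, 0x100):
        one = bytes([b])
        # errors="replace" turns any unassigned position into U+FFFD instead of raising
        left = one.decode(codepage, errors="replace")
        right = one.decode(iso_name, errors="replace")
        if left != right:
            diffs.append(b)
    return diffs


# CP1252 against ISO 8859-1: no differences expected in the graphic region
cp1252_vs_latin1 = graphic_differences("cp1252", "latin-1")

# CP1250 against ISO 8859-2: some positions are assigned differently
cp1250_vs_latin2 = graphic_differences("cp1250", "iso8859_2")

print(cp1252_vs_latin1)
print([hex(b) for b in cp1250_vs_latin2])
```

An empty first list and a non-empty second list would be consistent with the point made above; the sketch says nothing about the 0x80-0x9F region, which is where the trouble lies.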

> There is no holy war against non-Unicode encodings that has to be fought
> anymore from where I stand.
>
Because you're standing at the top of the hill? :-)

Look at the other messages in this thread. Everybody except me, with only
one other exception so far, is quite willing to treat CP1252 as a standard
character set, despite the fact that it has never been approved by any
standards body and that it, like all PC code pages, violates the very basis
for character set standards by overstepping the structural bounds.

> For example, at Microsoft we have customers and
> governments coming to us on a regular basis and asking for a new code page
> for their language. The standing answer is: you have it already - Unicode.
>
Excellent!

> As Peter Constable and a few others will note, that doesn’t quite cut
>                                                          ^
Markus was right: here's a "smart" quote, Right Single Quotation Mark
(should it be an apostrophe?). Yet the mail in which it arrived includes no
announcement of CP1252 or any other character set. Luckily 0x92 is the ISO
6429 "Private Use 2" control, which has no function in my terminal window,
and so was simply ignored. (It could have been worse; see below.) It only
made it seem to me that you didn't know how to spell didn't :-)
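
For readers without a terminal at hand, the two readings of that byte can be reproduced with a short Python sketch. It assumes Python's "cp1252" and "latin-1" codecs; the sample bytes are a reconstruction of the word as it arrived, not a copy of the original mail.

```python
import unicodedata

# The word as it arrived: byte 0x92 where the apostrophe belongs (reconstructed sample)
raw = b"doesn\x92t"

as_cp1252 = raw.decode("cp1252")   # what the sending software assumed
as_latin1 = raw.decode("latin-1")  # ISO 8859-1 reading; 0x92 maps to the C1 control U+0092

quote = as_cp1252[5]
ctrl = as_latin1[5]

print(unicodedata.name(quote))     # name of the character under the CP1252 reading
print(unicodedata.category(ctrl))  # "Cc" means a control character with no glyph
```

With no charset announced, the receiver has no way to know which of the two readings was intended.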

> Markus seems to have the practical approach to the current problems. I was
> curious why, if cp-1252 on the net is such a problem for Unix (Mac?) users,
> no one is rushing to fix the browsing experience.
>
Here is exactly why: All standard character sets comply with ISO 4873 and
ISO 2022 because terminals, printers, and myriad communication and other
devices *use* the control regions for control, not graphics. Obviously,
however, Web browsers do not -- neither do word processors. Thus there
is *apparently* no problem using CP1252 on the Web.
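
The structure in question is simple enough to write down. This is a simplified Python sketch of the 8-bit code layout, not a full rendering of ISO 4873 or ISO 2022 (it ignores the special status of SPACE and DEL, and all designation and shift mechanisms); the function name region is invented for the example.

```python
def region(byte: int) -> str:
    """Classify one byte value by its area in the 8-bit code structure."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not an 8-bit value")
    if byte <= 0x1F:
        return "C0"  # control characters
    if byte <= 0x7F:
        return "GL"  # left-hand graphics (ASCII printables)
    if byte <= 0x9F:
        return "C1"  # control characters again, NOT graphics
    return "GR"      # right-hand graphics


for value in (0x1B, 0x41, 0x92, 0xE9):
    print(hex(value), region(value))
```

A PC code page such as CP1252 places graphic characters at values this layout reserves as C1, which is the conflict described above.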

The point of a standard character set is that you can use it for anything,
not just Web browsing or word processing. This is why PC code pages are
not, and never can be, standard character sets: You CAN'T use them for
anything.

Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>, who we must blame for instigating
this discussion, said:
>
> Before I start flaming Chris Pratley <chrispr@microsoft.com> for using
> unannounced CP1252 characters in the very same email (2000-03-25
> 00:40:17Z) in which he claims proper handling of exactly this issue by
> contemporary Microsoft® products, let me first test whether the observed
> effect isn’t by any chance the fault of the unicode@unicode.org list
> server or some other system beyond the control of Microsoft.
>
Then Markus went on to list the graphics in the 0x80-0x9F range of CP1252.
Now, I was reading his message in a terminal window (in Windows, by the
way, not Unix) that conforms to ISO 2022, 4873, and 6429. Here's what
happened:

0x9F: LATIN CAPITAL LETTER Y WITH DIAERESIS
  This is C1 control APC (Application Program Command). It makes any ANSI
  X3.64 / ISO-6429 compliant terminal hang forever waiting for the rest of
  the APC sequence, which never comes. Thank goodness for the reset button.

0x98: SMALL TILDE
  This is C1 control SOS (Start of String), another string-bearing C1
  control. Again, the terminal hangs forever waiting for the rest of the
  string.

And so on. It is quite impossible to read this message from a standards-
compliant terminal.
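
Why the terminal hangs can be illustrated with a toy parser. This is a rough Python sketch under stated assumptions, not the logic of any real emulator: it treats the five C1 string introducers (DCS, SOS, OSC, PM, APC) alike, swallows everything until the String Terminator ST (0x9C), and ignores all other C1 controls. Real terminals may also offer ways to cancel a pending string. The name feed and the sample text are made up.

```python
ST = 0x9C  # String Terminator

# C1 controls that open a control string which runs until ST
STRING_INTRODUCERS = {0x90: "DCS", 0x98: "SOS", 0x9D: "OSC", 0x9E: "PM", 0x9F: "APC"}


def feed(data: bytes):
    """Return (displayed_bytes, pending); pending names an unterminated control string or is None."""
    shown = bytearray()
    pending = None
    for b in data:
        if pending is not None:
            if b == ST:
                pending = None  # string finished, normal display resumes
            continue            # everything inside the string is swallowed
        if b in STRING_INTRODUCERS:
            pending = STRING_INTRODUCERS[b]
        elif 0x80 <= b <= 0x9F:
            continue            # other C1 controls: ignored in this sketch
        else:
            shown.append(b)
    return bytes(shown), pending


# A CP1252 "Y with diaeresis" (byte 0x9F) in the middle of ordinary text
shown, pending = feed(b"capital Y: \x9f and the rest of the message")
print(shown, pending)
```

Since CP1252 text never supplies the terminating ST, the parser stays in the pending state and nothing after the offending byte is displayed.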

Later, Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> wrote:
> I hope, you can all read (and save and load print and edit and quote and
> reply to!!!) the following nicely with your email and web clients.
>
He encloses the same list of characters, but this time in UTF-8. No problems
whatsoever, and I can even see the characters this time.
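
The difference can be sketched in Python as well. One caveat, stated loudly: the raw UTF-8 bytes do contain values in the 0x80-0x9F range (as continuation bytes), so this assumes a terminal that decodes UTF-8 first and only then looks for controls. The sketch also assumes Python's "cp1252" codec, which leaves a few positions in 0x80-0x9F unassigned.

```python
# Collect the graphic characters CP1252 places in the C1 region
chars = []
for b in range(0x80, 0xA0):
    try:
        chars.append(bytes([b]).decode("cp1252"))
    except UnicodeDecodeError:
        pass  # position unassigned in this codec's CP1252 table

text = "".join(chars)

# Send them as UTF-8 instead, then decode on the receiving side
wire = text.encode("utf-8")
decoded = wire.decode("utf-8")

# After decoding, is any character a C1 control (U+0080..U+009F)?
c1_code_points = [c for c in decoded if 0x80 <= ord(c) <= 0x9F]

print(decoded)
print(c1_code_points)
```

After decoding, every character sits at its own Unicode code point, well clear of the control range, so there is nothing for the terminal to misinterpret.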

I realize I tend to beat this topic to death in a quixotic manner, but if we
forget all the lessons of the first 40-50 years of computing, then we're
going to go through all those bad times again and again. Why do these
topics come up again and again? It's because so many of the readers of this
list do not use standards-compliant terminals or emulators; most probably
don't use terminals at all; some don't even know what they are. That does
not mean they don't exist or are going away or that the standards are no
longer in effect. Since these issues tend not to come up in the
word-processing, Web-browsing environment, we must keep reminding ourselves
that Unicode is a PLAIN TEXT standard that does not require a GUI, and just
because we might not personally feel the effects when standards are violated
does not mean that other people will not be affected.

We all agree that every effort must be made to move from 7- and 8-bit
character sets to Unicode. Good, let's do it! However, we must not believe
that CP1252 is in some way a step in that direction, let alone a NECESSARY
step. It isn't -- it is nothing but a time waster, and advocating its use
on the Web, in email, or for any other interchange purpose is just plain
wrong.

- Frank


