Re: CP1252 under UNIX

From: Doug Ewell (dewell@compuserve.com)
Date: Fri Mar 31 2000 - 10:51:57 EST


Here's what I have to contribute to this hot topic.

1. Tagging CP1252 text as "ISO-8859-1" is evil.

Don't do it. It should not be hard for either e-mail software or Web
tools to determine if the source text contains characters in the range
0x80 to 0x9F and to apply a Windows-1252 tag if it does. For that
matter, the software should apply a US-ASCII tag if there are no characters above
0x7F. If you create Web pages tagged as 8859-1 that contain 1252
characters, your pages have a problem and users should not have to
second-guess your tag; and if you write software that tags e-mail or Web
pages incorrectly, shame on you.
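
The check described above can be sketched in a few lines of Python. This is only an illustration, not anyone's actual mailer code: the function name pick_charset_label is made up, and it assumes the input is already known to be 8-bit Western text (ASCII, 8859-1, or 1252), not UTF-8 or some other encoding. It also assumes that bytes in 0x80 to 0x9F are meant as 1252 printable characters and not as C1 control codes.

```python
def pick_charset_label(data: bytes) -> str:
    """Pick a charset tag for 8-bit Western text from its byte values.

    Hypothetical helper: assumes the text is ASCII, ISO 8859-1, or
    Windows-1252, and that C1 control codes are never intended.
    """
    # No byte above 0x7F: plain ASCII is the most widely readable tag.
    if all(b < 0x80 for b in data):
        return "US-ASCII"
    # Any byte in 0x80-0x9F: printable only in 1252, so tag it as such.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    # High bytes, but only in 0xA0-0xFF, where 8859-1 and 1252 agree.
    return "ISO-8859-1"


print(pick_charset_label(b"plain text"))        # US-ASCII
print(pick_charset_label(b"caf\xe9"))           # ISO-8859-1
print(pick_charset_label(b"\x93quoted\x94"))    # windows-1252
```

The order of the tests matters only for readability; the three cases are mutually exclusive.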

2. "CP1252 is not a standard."

Oh, but it is. True, it's not an ISO or ANSI standard, not a de jure
standard, but it IS a de facto standard. It is an industry standard.
It is used by a LOT of people.

The difference between a de jure standard and a de facto standard is
sort of like the difference in American law between direct evidence and
circumstantial evidence. Any judge in the U.S. will tell you that
circumstantial evidence IS evidence, although of a lower "quality" than
direct evidence. Similarly, a de jure standard is usually preferable
to a de facto standard, but the latter is still a standard.

It was stated that "1252 violates the very basis for character set
standards" and "All standard character sets comply with ISO 4873 and ISO
2022." This is based on the fact that terminal-host communication
relies on character sets that comply with 4873 and 2022, and it implies,
quite in contrast to the misguided who believe that terminals don't
matter, that terminals are the ONLY thing that matters! By this metric,
no EBCDIC code page could ever be a standard. Even UTF-8 could not be
a standard, because of its use of bytes in the 0x80-0x9F range.
(Or are the ISO 2022 escape sequences mentioned in Annex R what make
UTF-8 a standard?)
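
The point about UTF-8 is easy to check for yourself. The Python sketch below (the helper name bytes_in_c1_range is invented for this example) lists the bytes of a string's UTF-8 encoding that fall in 0x80 to 0x9F. Ordinary characters such as U+00C0 and U+20AC produce such bytes as trail bytes.

```python
C1_RANGE = range(0x80, 0xA0)  # 0x80 through 0x9F inclusive


def bytes_in_c1_range(text: str) -> list:
    """Return the UTF-8 bytes of text that fall in 0x80-0x9F.

    Hypothetical helper for illustration only.
    """
    return [b for b in text.encode("utf-8") if b in C1_RANGE]


# U+00C0 encodes as C3 80; U+20AC encodes as E2 82 AC.
print([hex(b) for b in bytes_in_c1_range("\u00c0")])  # ['0x80']
print([hex(b) for b in bytes_in_c1_range("\u20ac")])  # ['0x82']
print(bytes_in_c1_range("A"))                         # []
```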

That said, CP1252 is not supported by everyone (any more than UTF-8 is,
at least yet) and you can make your text available to a great many more
people by encoding it in ISO 8859-1 instead.

3. "If you support 1252 you have to support the hundreds of private
character sets being created every day."

They are? In Western Europe and North America the REALLY COMMON 8-bit
character sets are ASCII, ISO 8859-1, CP1252, MacRoman, and maybe a
smattering of CP437. Are any others so common that they present the
kind of headache we are talking about? In Central and Eastern Europe,
of course, there is a lot more diversity in encoding, but this has
nothing to do with 8859-1 vs. 1252. There is a difference between the
burden of supporting commonly used character sets like CP1252 and that
of supporting every character set ever invented (e.g. CP861).

I can think of one significant 8-bit character set that has been created
in the last 3 years and that has the potential for really widespread
use: ISO 8859-15. Everything else has already been out there for years.

<off-topic>

4. "Does anyone else find it bizarre that we engage in so much
internecine warfare on this list, when the whole purpose we are on it
is to further the cause of Unicode?"

No, this is typical. I am on other Internet mailing lists, and most of
those that support a cause have some sort of bickering between list
members about who is more devoted to the cause or who represents the
"true" solution or what is best for the cause. People are like that.
That is why there are family squabbles, civil wars, and battles among
and between religious groups.

</off-topic>

-Doug Ewell
 Fullerton, California


