Re: XML Blueberry Requirements

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Thu Jun 21 2001 - 21:01:05 EDT


Misha Wolf has written:
> In addition, XML 1.0 attempts to adapt to the line-end conventions of
> various modern operating systems, but discriminates against the
> convention used on IBM and IBM-compatible mainframes. XML 1.0 documents
> generated on mainframes must either violate the local line-end
> conventions, or employ otherwise unnecessary translation phases before
> and after XML parsing and generation.

I think a formal language definition (such as XML or a programming
language) should be independent of any particular encoding conventions.
Source text should always comply with the system conventions, so the
normal tools (particularly the system's text editors) can handle it; and
the compilers and interpreters should handle this source according to
these conventions. Whenever a plain text is sent to another system, it
should be converted so as to comply with the target system while
retaining its original meaning.
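
To illustrate the kind of conversion meant here, a minimal sketch in
Python follows. The function name convert_line_ends is hypothetical, and
treating U+0085 (NEL) as the Unicode counterpart of the mainframe
new-line character is an assumption of this sketch, not something any
system mandates.

```python
import re

# Line-end conventions recognised by this sketch (an assumption):
# CR+LF, lone CR, lone LF, and NEL (U+0085).
# CR+LF is listed first so that it is matched as one separator, not two.
_LINE_END = re.compile("\r\n|\r|\n|\u0085")


def convert_line_ends(text, target):
    """Return text with every recognised line end replaced by target."""
    # A function is used as the replacement so that target is inserted
    # literally, whatever characters it contains.
    return _LINE_END.sub(lambda match: target, text)


if __name__ == "__main__":
    mixed = "one\r\ntwo\nthree\u0085four\rfive"
    print(repr(convert_line_ends(mixed, "\n")))
```

Only the separators change; the content of each line is left untouched,
which is what "retaining its original meaning" amounts to for plain text.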

On Thu, 21 Jun 2001 09:40:22 -0400, Elliotte Rusty Harold wrote:
> The concern with respect to IBM is that one of the world's largest
> corporations, with thousands of patents, legions of programmers,
> billions of dollars in revenue, and resources pouring out of every
> orifice is somehow unable to handle documents where lines end with
> carriage returns and line feeds, as documents do on every non-IBM
> system on the planet.

Not true. There are lots of non-IBM systems on this planet that use
line separators other than "carriage returns and line feeds"; on
the other hand, there are lots of IBM systems on this planet that
comply with this very convention.

Both MS-DOS and MS-Windows delimit lines in plain-text files by the
2-character sequence CR+LF. It is the IBM systems in this area that I
was referring to in the previous paragraph. If I am not mistaken,
OS/2 (an IBM invention) also complies with this convention.

In Unix, both IBM (AIX) and non-IBM, lines are separated by single LF
characters rather than CR+LF.

> The only reason there's a problem here at all is because IBM
> tried to go it alone as a monopoly and set standards by fiat for years
> rather than working with the rest of the industry. Consequently their
> mainframe character sets don't really interoperate well with everybody
> else's character sets.

This is an entirely distorted view of the pertinent history.

As I have perceived it:

- EBCDIC is based on punched-card technology. On punched cards, the line-
  end is identified with the card-boundary; so, the line-separator is
  perceived as a physical phenomenon rather than a character. Hence,
  systems based on this technology tend to designate line-ends out-of-band
  (not as characters) and handle them in their file system. I used sys-
  tems based on this design principle, from various vendors (indeed,
  IBM was the last one I encountered); typically, the system's API would
  have functions to read, write, or seek, a line per call, and the system
  would supply (or require, respectively) the line-length and the line-
  content, in distinct fields (record-oriented I/O).

- ASCII (and, in due course, ISO 8859) is based on punched-tape and
  teletype technology. On punched tape, the line-separator is a special
  hole-combination, as any character is; likewise, on a teletype
  connection, the new-line function is transmitted as a sequence of
  particular characters. Hence, systems based on this technology tend
  to separate lines by particular (sequences of) control-characters;
  typically, the system's API would have functions to read, write, or
  seek, a block (or an arbitrary number of characters) per call, and
  it's the application's job to handle the lines by scanning for separ-
  ator characters on input, and inserting them, on output (stream-
  oriented I/O).
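
The two design principles above can be contrasted in a short Python
sketch. The record layout used here (a 2-byte big-endian length followed
by the line content) is purely hypothetical and does not reproduce any
particular vendor's file format; the function names are likewise made up
for illustration.

```python
import io
import struct


def write_records(lines):
    """Record-oriented output: store each line as length + content."""
    return b"".join(struct.pack(">H", len(line)) + line for line in lines)


def read_records(f):
    """Record-oriented input: one line per step, no separator characters."""
    while True:
        header = f.read(2)
        if len(header) < 2:
            return  # end of file
        (length,) = struct.unpack(">H", header)
        yield f.read(length)


def read_stream_lines(f, sep=b"\n", block_size=4096):
    """Stream-oriented input: read blocks and scan them for separators."""
    buf = b""
    while True:
        block = f.read(block_size)
        if not block:
            break
        buf += block
        while True:
            idx = buf.find(sep)
            if idx < 0:
                break  # separator not (yet) complete; read more data
            yield buf[:idx]
            buf = buf[idx + len(sep):]
    if buf:
        yield buf  # final line without a trailing separator


if __name__ == "__main__":
    lines = [b"first", b"", b"third"]
    print(list(read_records(io.BytesIO(write_records(lines)))))
    print(list(read_stream_lines(io.BytesIO(b"first\n\nthird\n"))))
```

In the record-oriented case the line boundary never appears in the
content, so any byte value may occur inside a line; in the stream-oriented
case the application has to agree with the producer on the separator.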

It is a mere coincidence that Windows and Unix systems are prevailing,
these days, and this is certainly not due to line-separator issues.

When the ISO 8859 series was defined, IBM enhanced their character codes
to provide round-trip conversions with all ISO 8859 parts: the respective
mainframe codes were dubbed CECP (= Country-extended code page, if memory
serves correctly); later the CDRA (= Character Data Representation Archi-
tecture, or some such) gave a rigid definition of characters, coded
character sets, conversions, and so forth. So, the IBM mainframe character
sets really interoperate as well with other character sets as possible
(short of abandoning them altogether).

The real discrepancy with respect to line-separators is between the
MS-Windows and Unix worlds; a third system (such as EBCDIC) cannot serve
both parties at the same time.

Best wishes,
  Otto Stolz


