Re: ISO compliance of Linux console

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Jan 13 1999 - 10:44:18 EST


Andries.Brouwer@cwi.nl wrote on 1999-01-13 14:05 UTC:
> From: "H. Peter Anvin" <hpa@transmeta.com>
>
> Something that is a major concern to me is the fact that the Linux
> console is a botch of ISO 2022, in a way that really isn't fixable
> (ISO 2022 treats the two halves of an 8-bit charset separately, Linux
> doesn't. The current select code <ESC> ( B for ISO 8859-1 is really
> the code for US-ASCII in the G0 area, and so is the *same* for *all*
> the ISO 8859 codes -- they differ only in their G1 codes.) I'm
> somewhat at a loss for how to fix it, too. One option is to say that
> we don't care, and assign our own escape codes to everything. That
> would rather suck, though.
>
> There is no reason why we couldn't do it precisely following
> ISO 2022. Nobody knows these details. Nobody uses these details.
> And if someone does, then this is part of the changes to 2.4.
> I think we can make the changes almost invisible.
>
> But of course parts of ISO 2022 are obsolete. (The
> G0 part is fixed by definition following ISO 4873.)

Agreed. The DEC VT100 terminal was supposed to be ISO 2022 and ISO 6429
compliant. The Linux console is supposed to be VT100 compatible. So if
the Linux console violates these ISO standards, this is a bug that
should be fixed.

I also agree that it is an excellent idea not to allow G0 to be anything
else but US-ASCII. Allowing G0 to become filled with non-ASCII
characters is just an annoying hazard ("Help, my terminal is messed
up!") and not a useful feature. Alternative non-ASCII 7-bit codes should
never be supported for robustness reasons. If ISO/IEC 4873:1991
(Information technology -- ISO 8-bit code for information interchange --
Structure and rules for implementation) says so, so much the better.

By the way, many of these ISO standards are also freely available from
ECMA under a different name, and nobody should write or extend a VT100
terminal emulator these days without having a copy of them:

  ECMA-35, Character Code Structure and Extension Techniques,
  6th edition (December 1994). (= ISO 2022)

  ECMA-43, 8-Bit Coded Character Set Structure and Rules,
  3rd edition (December 1991). (= ISO 4873)

  ECMA-48, Control Functions for Coded Character Sets,
  5th edition (June 1991). (= ISO 6429)

All these can be downloaded or ordered free of charge on paper and/or
CD-ROM from

  http://www.ecma.ch/

A big warning to first read ISO 6429 and ISO 2022 should be added to the
header comments in Linux console.c.

The ESC command sequence parser in the Linux console has clearly been
written by somebody who did not read ISO 6429 or ISO 2022. There are
very simple rules for the syntax of ESC sequences (only certain
characters are allowed as inner or terminator elements after an ESC,
etc.). An ESC sequence parser should first read an ESC sequence into a
buffer based on these syntax rules and interpret the sequence when a
terminating character has been received. This allows the parser to
correctly jump over unknown ESC sequences. There are also clear rules
for private extension codes in ISO 6429 and ISO 2022, which some of the
Linux specific extensions violate.
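
As a rough illustration of the buffering approach described above (not
the actual console.c code), here is a minimal Python sketch. The name
split_sequences and the token labels are made up for this example. It
collects each sequence purely by syntax, so a caller can ignore any
complete sequence it does not recognise. As a simplification, a byte
that breaks the syntax makes the sketch abandon the sequence; a real
terminal also has to deal with control characters inside sequences.

```python
ESC, CSI = 0x1B, 0x9B

def split_sequences(data: bytes):
    """Yield ('text', b), ('esc', b) or ('csi', b) tokens (hypothetical API)."""
    state = "ground"          # ground | esc | csi_param | csi_inter
    buf = bytearray()         # bytes of the sequence being collected
    text = bytearray()        # ordinary bytes seen outside sequences
    for b in data:
        if state == "ground":
            if b in (ESC, CSI):
                if text:
                    yield ("text", bytes(text))
                    text.clear()
                buf = bytearray([b])
                state = "esc" if b == ESC else "csi_param"
            else:
                text.append(b)
            continue
        buf.append(b)
        if state == "esc":
            if b == 0x5B and len(buf) == 2:      # ESC [ is the 7-bit CSI
                state = "csi_param"
            elif 0x20 <= b <= 0x2F:              # intermediate byte
                pass
            elif 0x30 <= b <= 0x7E:              # final byte
                yield ("esc", bytes(buf))
                state = "ground"
            else:                                # syntax error: abandon
                state = "ground"
        elif state == "csi_param":
            if 0x30 <= b <= 0x3F:                # parameter byte
                pass
            elif 0x20 <= b <= 0x2F:              # first intermediate byte
                state = "csi_inter"
            elif 0x40 <= b <= 0x7E:              # final byte
                yield ("csi", bytes(buf))
                state = "ground"
            else:
                state = "ground"
        else:                                    # csi_inter
            if 0x20 <= b <= 0x2F:
                pass
            elif 0x40 <= b <= 0x7E:
                yield ("csi", bytes(buf))
                state = "ground"
            else:
                state = "ground"
    if text:
        yield ("text", bytes(text))
```

The interpreter then only has to look up complete tokens: anything it
does not know is already delimited and can be dropped without eating
the characters that follow.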

For the precise syntax of ISO 6429 control sequences, have a look at
<ftp://ftp.ecma.ch/ecma-st/e048-pdf.pdf> in section 5.4.

Below follows a quick tutorial on the syntax of ESC sequences for those
not familiar with the relevant standards:

The format of an ISO 6429 control sequence is

   CSI P ... P I ... I F

where

a) CSI is represented by bit combinations 0x1b (representing ESC) and
"[" in a 7-bit code or by bit combination 0x9b in an 8-bit code, see
5.3;

b) P ... P are Parameter Bytes, which, if present, consist of bit
combinations from "0" to "?" (second 16-character column in G0);

c) I ... I are Intermediate Bytes, which, if present, consist of bit
combinations from " " to "/" (first 16-character column in G0). Together
with the Final Byte F, they identify the control function;

d) F is the Final Byte; it consists of a bit combination from "@" to "~"
(last four 16-character columns in G0); it terminates the control
sequence and together with the Intermediate Bytes, if present,
identifies the control function. Bit combinations "p" to "~" are
available as Final Bytes of control sequences for private (or
experimental) use.
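
The four rules a) to d) translate almost directly into a regular
expression. The following Python sketch is only an illustration; the
function name parse_csi and the returned dictionary are assumptions of
this example, not part of any standard or of the Linux console.

```python
import re

# CSI (ESC [ or 0x9b), parameter bytes "0".."?", intermediate bytes
# " ".."/", and exactly one final byte "@".."~".
_CSI_RE = re.compile(rb"(?:\x1b\[|\x9b)([0-?]*)([ -/]*)([@-~])")

def parse_csi(seq: bytes):
    """Split a complete control sequence; return None if malformed."""
    m = _CSI_RE.fullmatch(seq)
    if m is None:
        return None
    params, intermediates, final = m.groups()
    return {
        "params": params,
        "intermediates": intermediates,
        "final": final,
        # Final bytes "p" to "~" are for private or experimental use.
        "private": b"p" <= final <= b"~",
    }
```

For example, parse_csi(b"\x1b[1;31m") gives parameters b"1;31", no
intermediate bytes and the non-private final byte b"m", while a
sequence with a parameter byte after an intermediate byte is rejected.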

ISO 2022 (section 6.3) defines additional ESC sequences that do not
start with CSI or ESC [. ISO 2022 sequences have the form

  ESC I ... I F

where I is in the range " " to "/" (first 16-character column in G0) and
F is in the range "0" to "~" (last five 16-character columns in G0).
Values for F in the range "0" to "?" (second 16-character column in G0)
are reserved for private extensions.
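
A check against these ranges is only a few lines. The Python sketch
below is an illustration under the rules just stated; classify_esc is
a hypothetical helper name. Note that it treats ESC [ as an ordinary
two-byte sequence, although that combination is the 7-bit form of CSI
and would normally be handled separately.

```python
def classify_esc(seq: bytes):
    """Classify a complete ESC I ... I F sequence; None if malformed."""
    if len(seq) < 2 or seq[0] != 0x1B:
        return None
    *intermediates, final = seq[1:]
    if not all(0x20 <= b <= 0x2F for b in intermediates):
        return None                      # I must be " " to "/"
    if not 0x30 <= final <= 0x7E:
        return None                      # F must be "0" to "~"
    return {
        "intermediates": bytes(intermediates),
        "final": bytes([final]),
        # Final bytes "0" to "?" are reserved for private extensions.
        "private": 0x30 <= final <= 0x3F,
    }
```

Applied to b"\x1b(U", this reports the final byte b"U" as not private,
whereas b"\x1b(0" has a final byte in the private range.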

So if Linux uses, say, ESC ( U to switch to the IBM CP437 set, then Linux
has used a sequence that is clearly reserved for use by the ISO
registry. In addition, IBM CP437 is a set with ASCII plus 128 graphical
characters and therefore only a private ESC % ... sequence as specified
in ISO 2022 section 6.3.11 can be applied here. All these ESC sequences
should be replaced by ones that form appropriate private extensions in
the sense of ISO 2022.

Even if some ISO-violating ESC sequences have to remain in the Linux
console for backwards compatibility, we should introduce additional
proper ISO 6429 and ISO 2022 private extensions for all non-standard
functions, document only those, and declare the old ISO-violating ESC
sequences to be deprecated.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


