DRAFT                                                 J.W. van Wingen
                                                      1999-09-08
                                                      version 0.0

TOWARDS A GENERAL MODEL FOR CHARACTER CODING AND TRANSFORMATIONS

The problem
+++++++++++

A text, considered as a sequence of symbols sometimes cannot be
represented on a medium or transferred to another, because some of the
symbols are not there available for use. To avoid loss of information
these symbols are to be transformed to sequences of the available
symbols, in such a way that each sequence can be transformed back to
the original unambiguously.
Let T be a text, and f a function, then
  f: T -> T' ,
where T' is the transformed text.
Equally, a function f' will transform
  f' T' -> T .
A general and convenient way to specify those functions is a "Turing
Machine".

Turing machines
+++++++++++++++

A Turing machine (TM) is an abstract concept, consisting of a
"processor", a "tape", with numbered positions, each occupied by a
"symbol". and a read/write "head", pointing to some position on the
tape. The processor has one of a finite number of "states". On it a
program runs that directs the movement of the head and changes the
state, based on what is read from the tape. If a "final state" is
reached, the machine stops (this may never occur).

A TM is determined by a sextuple (S,T,Q,q0,F,P)
S:  a finite set of symbols
T:  a finite set of  input symbols, subset of S
Q:  a finite set of states
q0: the start state
F:  a finite set of final states
P:  a program

The set S we call the "alphabet" of the TM. One element is the empty
symbol "e". The program consists of a number of rules of the forms:
    q,t -> q',s, move
where q,q' E Q, t E T, s E S, move = one position to the left, the
right, or nothing at all. (Here E means "is element of".)
    q,e -> f,e, none
(f E F, the machine stops at finding an empty symbol on the tape.)

Turing machines are being used to describe several theories in computer
science. Important is that of "computability". Another is that for the
transformation of strings. Those TMs are called "finite transducers".
(D Wood, Ch. 7.2)

TMs are general. Any function that can be computed can be expressed in
terms of a Turing machine (Church's these, 1936). This implies also
that additions to the simple TM concept (as described above) do not
provide any more functionality. Such a TM, for example one with more
than one tape, may facilitate understandability of a description,
without resorting to extending the basic concept of a TM.

Finite transducers of coded characters
++++++++++++++++++++++++++++++++++++++

A multitape TM may describe the decoding of a text, coded according to
some standard. Therefore we consider as input symbols "bytes".
A byte is in essence a bit pattern, usually of 8 bits. This implies
that there exist 256 different bytes. We denote these by xXX, where
XX is a hexadecimal number. We assume, without introducing a
restriction, that the TM is able to assign a decimal value to any byte.

Then we consider a three tape TM, with three heads. One is the input
tape A, read only, containing bytes at the first N positions, and the
the empty symbol at the rest. The output tape Z, write only, will
contain symbols as have been coded, for example the ASCII repertoire of
graphic characters. There is an intermediate tape G of 256+1 positions,
numbered from 0, read only, on each of which a symbol stands at the
position decimally equivalent to an ASCII graphic character. All other
positions contain the empty symbol (that is, 0-32, 127-255). G[256]=e.

We may formulate the working of this TM by a program written in some
pseudo-language.

    i:=0; x:=0;
    while x /=e do
        i:=i+1;
        x:=A[i]; y:=Z[x];
        if y/=e then
            j:=j+1;
            Z[j]:=y;
    endwhile;
    stop;

This program has the following properties:
1.  Apart from the final state, there is only one state,
2.  Bytes which do not represent an ASCII character are ignored, and do
    not occur in the output,
3.  The head on the tape only moves to the right.
(It is left to the reader to reformulate the program into the strict
rules for writing of TMs given above.)

It is easy to generalize this TM to, say, the graphic repertoire of
8859-1, or to process some control characters. However, some parts of
8859 contain a sentence like "This position shall not be used". What to
do if a byte marked like that will occur on the input tape?

It may come as a surprise, but decoding is hardly mentioned in the SC2
standards at all. Thus the implementer may make his own choice. He may
ignore these bytes, or he may signal an "error", introducing a second
final state, at which the TM stops "without success".

In the same way UCS-2 may be handled. Two bytes are read, and the length
of G extended to 65536+1. But what if the second byte read is the empty
symbol?

Multi-state transducers
+++++++++++++++++++++++

Up to now we only discussed TMs where the symbol read is uniquely mapped
on a single other symbol, if the result is defined. That is, we have
defined a partial function (partial because not defined for every value
of the argument). With other words:
    TM:  f: x -> f(x), where x E T and f(x) E S.
Now we will consider TMs where f(x) depends on a "state" of the TM. The
state q E Q may change if a certain symbol c E C is read from the input
tape (C is subset of T).

As an example we take the structure specified in ISO/IEC 4873, level 3
(somewhat simplified). There are 3 elements in C, called "locking
shifts", LS1R, LS2R, LS3R. The TM has 4 intermediate tapes, each 128
bytes long, called here G0,G1,G2,G3. We then can describe the TM by a
program:

    i:=0; x:=0; q:=1;
    while x /=e do
        i:=i+1;
        x:=A[i];
        if x=LS1R then q:=1 else
        if x=LS2R then q:=2 else
        if x=LS3R then q:=3 else
        begin
            if x<128 then y:=G0[x] else
            if q=1 then y:=G1[x] else
            if q=2 then y:=G2[x] else
            if q=3 then y:=G3[x];
            j:=j+1;
            Z[j]:=y;
        end;
    endwhile;
    stop;

We ignore here the problems of having included control characters in
the G sets. Note, however, that the three shift symbols are not being
copied to the output tape.

The states of the TM as modified by the occurence of a locking shift on
the input tape remain as they are until another locking shift is met.
But 4873 has also a level 2, which calls for single shifts, SS2, SS3.
These correspond with two states, additional to the basic, or start
state, each pertaining only to the next symbol from the input. Thus the
tape contains single symbols and pairs of symbols, both resulting in a
single output symbol. The TM may then be illustrated by the following
program:

    i:=0; x:=0; q:=1;
    while x /=e do
        i:=i+1;
        x:=A[i];
        if x=SS2 then q:=2 else
        if x=SS3 then q:=3;
        if q>1 then
            i:=i+1;
            x:=A[i];
            if x<128 then x:=e;
        if x<128 then y:=G0[x] else
            if q=1 then y:=G1[x] else
            if q=2 then begin y:=G2[x]; q:=1 end else
            if q=3 then begin y:=G3[x]; q:=1 end;
        j:=j+1;
        Z[j]:=y;
    endwhile;
    stop;

Not all pairs define an output symbol. A state changing symbol, like SS2
or SS3, must be followed by a member of a specified subset of the input
alphabet. In particular, pairs of SS2 or SS3 and an ASCII or control
character are illegal.

In general, the input alphabet T is the union of the sets:
    A  non-pairable symbols
    B  symbols pairable with a preceding member of C
    C  state changing symbols
    e  the empty symbol

It is not needed that B and C be disjunct, provided that no ambiguity
arises at pairing. The situation may be further complicated by
specifying that a look-up table has to be consulted to verify whether a
pair is legal. This table may be expressed by a number of intermediary
tapes to the TM.

State-starting and -stopping sequences
++++++++++++++++++++++++++++++++++++++

As we have seen, a certain symbol makes the state of the TM change. Will
it remain in that state forever? It may not, if a certain other symbol
is met. Then:
 -  the TM returns to its previous state (pushdown automaton)
 -  the TM changes to some other state
These, and more complex situations may be described with the help of
graph theory. But this would lead us too far for the time being. An
example of state-changing and -stopping gives SGML. The state of the TM
is changed to word identification by finding an "&", while a ";" stops
the sequence, changing to the basic state, after which the word found
can be looked up in a table, and an output symbol produced.
Alternatively, return to the previous state may be effectuated when a
fixed number of symbols is read after the start (or "escape") symbol.
An example is that of "pairs" described above, another that of MIME,
where an equals sign ("=") is followed by two hexadecimal digits, which
may be looked up in a table to produce a single symbol.

Look-ahead languages
++++++++++++++++++++

A well-known problem is that of representing members of a large alphabet
by members of a small alphabet. It should be possible, if required, to
reconstruct the originals without ambiguity. Therefore, the rules must
be chosen with great care. If we take the previous example from MIME
all non-ASCII characters from Latin-1 are being turned into 3-character
strings, each beginning with a "=" sign, which acts as single state-
changing symbol, an escape character. The TM expects two hexadecimal
characters after it. What to do if something else is found? An equals
sign may occur in the text, like x:=y;. There are some solutions:
1.  Treat any occurrence of a non-hex after = as an error,
2.  Do not change anything, but then x:=37 is ambiguous,
3.  Transform all = in the original to =3D.

Anyhow, before deciding what to do, two symbols after the equals sign
have to be read. This is what is called in the formal language theory
"look ahead". A set of reading rules, requiring looking ahead k symbols
is called a "LL(k) language". In our case with MIME we have LL(2). Using
this terminology for pairs of symbols as specified for 4873, level 2, we
may call this convention LL(1). It describes a mixed coding system,
where some characters are coded with one, others with two bytes. Of a
pair the first byte represents SS2 or SS3. Thus the input alphabets
consists of (for the meaning of A,B,C see above):

    A:  x20-x7E (ASCII)
    B:  xA0-xFF (right half)
    C:  SS2,SS3
    e

But 4873 is not the only standard specifying a mixed coding system with
one or two byte character representations. ISO/IEC 6937 does the same
thing. There the input alphabet consists of:

    A:  x20-x7E (except those in B)
    B:  selected letters
    C:  xC0-xCF (so-called non-spacing diacritical marks)
    e

Note that not all pairs c-b are legal, only those occurring in a list,
the "repertoire" of 6937.

Another example is UTF8 for the characters coded in UCS-2 from x0000 to
x07FF. Here we have:

    A:  x20-x7E (ASCII)
    B:  x80-xBF
    C:  xC2-xDF
    e

Note that not all pairs c-b are legal, for some may be marked in UCS-2
with "this position shall not be used". Also a c may not be followed
by an arbitrary byte, invalid pairs may occur in the input, and should
be checked for.

Conclusion
++++++++++

Turing machines are a very effective tool to describe coding and
decoding rules for characters, and for transformations between large and
small character sets. They show clearly difference and likeness which
otherwise may be hidden in a verbal description. Application of TMs may
be very helpful to explain why a certain approach is preferable to some
other. With composing music knowledge of the theory of harmony is
indispensable to avoid a horrible noise. So is theoretical computer
science a requirement for not making a mess of a system.

References
++++++++++

JE Hopgood & JD Ullman
    Formal languages and their relation to automata
    Addison-Wesley 1969
    2nd ed. Introduction to Automata Theory, Languages, and Computation
    Addison-Wesley 1979

Derick Wood
    Theory of Computation
    NY Wiley, 1987

R Gregory Taylor
    Models of Computation and Formal Languages
    Oxford University Press, 1998

VJ Rayward-Smith
    A First Course in Formal Languages
    Oxford, Blackwell,1986

VJ Rayward-Smith
    A First Course in Computability
    Oxford, Blackwell,1985