DRAFT J.W. van Wingen 1999-09-08 version 0.0 TOWARDS A GENERAL MODEL FOR CHARACTER CODING AND TRANSFORMATIONS The problem +++++++++++ A text, considered as a sequence of symbols sometimes cannot be represented on a medium or transferred to another, because some of the symbols are not there available for use. To avoid loss of information these symbols are to be transformed to sequences of the available symbols, in such a way that each sequence can be transformed back to the original unambiguously. Let T be a text, and f a function, then f: T -> T' , where T' is the transformed text. Equally, a function f' will transform f' T' -> T . A general and convenient way to specify those functions is a "Turing Machine". Turing machines +++++++++++++++ A Turing machine (TM) is an abstract concept, consisting of a "processor", a "tape", with numbered positions, each occupied by a "symbol". and a read/write "head", pointing to some position on the tape. The processor has one of a finite number of "states". On it a program runs that directs the movement of the head and changes the state, based on what is read from the tape. If a "final state" is reached, the machine stops (this may never occur). A TM is determined by a sextuple (S,T,Q,q0,F,P) S: a finite set of symbols T: a finite set of input symbols, subset of S Q: a finite set of states q0: the start state F: a finite set of final states P: a program The set S we call the "alphabet" of the TM. One element is the empty symbol "e". The program consists of a number of rules of the forms: q,t -> q',s, move where q,q' E Q, t E T, s E S, move = one position to the left, the right, or nothing at all. (Here E means "is element of".) q,e -> f,e, none (f E F, the machine stops at finding an empty symbol on the tape.) Turing machines are being used to describe several theories in computer science. Important is that of "computability". Another is that for the transformation of strings. Those TMs are called "finite transducers". (D Wood, Ch. 7.2) TMs are general. Any function that can be computed can be expressed in terms of a Turing machine (Church's these, 1936). This implies also that additions to the simple TM concept (as described above) do not provide any more functionality. Such a TM, for example one with more than one tape, may facilitate understandability of a description, without resorting to extending the basic concept of a TM. Finite transducers of coded characters ++++++++++++++++++++++++++++++++++++++ A multitape TM may describe the decoding of a text, coded according to some standard. Therefore we consider as input symbols "bytes". A byte is in essence a bit pattern, usually of 8 bits. This implies that there exist 256 different bytes. We denote these by xXX, where XX is a hexadecimal number. We assume, without introducing a restriction, that the TM is able to assign a decimal value to any byte. Then we consider a three tape TM, with three heads. One is the input tape A, read only, containing bytes at the first N positions, and the the empty symbol at the rest. The output tape Z, write only, will contain symbols as have been coded, for example the ASCII repertoire of graphic characters. There is an intermediate tape G of 256+1 positions, numbered from 0, read only, on each of which a symbol stands at the position decimally equivalent to an ASCII graphic character. All other positions contain the empty symbol (that is, 0-32, 127-255). G[256]=e. We may formulate the working of this TM by a program written in some pseudo-language. i:=0; x:=0; while x /=e do i:=i+1; x:=A[i]; y:=Z[x]; if y/=e then j:=j+1; Z[j]:=y; endwhile; stop; This program has the following properties: 1. Apart from the final state, there is only one state, 2. Bytes which do not represent an ASCII character are ignored, and do not occur in the output, 3. The head on the tape only moves to the right. (It is left to the reader to reformulate the program into the strict rules for writing of TMs given above.) It is easy to generalize this TM to, say, the graphic repertoire of 8859-1, or to process some control characters. However, some parts of 8859 contain a sentence like "This position shall not be used". What to do if a byte marked like that will occur on the input tape? It may come as a surprise, but decoding is hardly mentioned in the SC2 standards at all. Thus the implementer may make his own choice. He may ignore these bytes, or he may signal an "error", introducing a second final state, at which the TM stops "without success". In the same way UCS-2 may be handled. Two bytes are read, and the length of G extended to 65536+1. But what if the second byte read is the empty symbol? Multi-state transducers +++++++++++++++++++++++ Up to now we only discussed TMs where the symbol read is uniquely mapped on a single other symbol, if the result is defined. That is, we have defined a partial function (partial because not defined for every value of the argument). With other words: TM: f: x -> f(x), where x E T and f(x) E S. Now we will consider TMs where f(x) depends on a "state" of the TM. The state q E Q may change if a certain symbol c E C is read from the input tape (C is subset of T). As an example we take the structure specified in ISO/IEC 4873, level 3 (somewhat simplified). There are 3 elements in C, called "locking shifts", LS1R, LS2R, LS3R. The TM has 4 intermediate tapes, each 128 bytes long, called here G0,G1,G2,G3. We then can describe the TM by a program: i:=0; x:=0; q:=1; while x /=e do i:=i+1; x:=A[i]; if x=LS1R then q:=1 else if x=LS2R then q:=2 else if x=LS3R then q:=3 else begin if x<128 then y:=G0[x] else if q=1 then y:=G1[x] else if q=2 then y:=G2[x] else if q=3 then y:=G3[x]; j:=j+1; Z[j]:=y; end; endwhile; stop; We ignore here the problems of having included control characters in the G sets. Note, however, that the three shift symbols are not being copied to the output tape. The states of the TM as modified by the occurence of a locking shift on the input tape remain as they are until another locking shift is met. But 4873 has also a level 2, which calls for single shifts, SS2, SS3. These correspond with two states, additional to the basic, or start state, each pertaining only to the next symbol from the input. Thus the tape contains single symbols and pairs of symbols, both resulting in a single output symbol. The TM may then be illustrated by the following program: i:=0; x:=0; q:=1; while x /=e do i:=i+1; x:=A[i]; if x=SS2 then q:=2 else if x=SS3 then q:=3; if q>1 then i:=i+1; x:=A[i]; if x<128 then x:=e; if x<128 then y:=G0[x] else if q=1 then y:=G1[x] else if q=2 then begin y:=G2[x]; q:=1 end else if q=3 then begin y:=G3[x]; q:=1 end; j:=j+1; Z[j]:=y; endwhile; stop; Not all pairs define an output symbol. A state changing symbol, like SS2 or SS3, must be followed by a member of a specified subset of the input alphabet. In particular, pairs of SS2 or SS3 and an ASCII or control character are illegal. In general, the input alphabet T is the union of the sets: A non-pairable symbols B symbols pairable with a preceding member of C C state changing symbols e the empty symbol It is not needed that B and C be disjunct, provided that no ambiguity arises at pairing. The situation may be further complicated by specifying that a look-up table has to be consulted to verify whether a pair is legal. This table may be expressed by a number of intermediary tapes to the TM. State-starting and -stopping sequences ++++++++++++++++++++++++++++++++++++++ As we have seen, a certain symbol makes the state of the TM change. Will it remain in that state forever? It may not, if a certain other symbol is met. Then: - the TM returns to its previous state (pushdown automaton) - the TM changes to some other state These, and more complex situations may be described with the help of graph theory. But this would lead us too far for the time being. An example of state-changing and -stopping gives SGML. The state of the TM is changed to word identification by finding an "&", while a ";" stops the sequence, changing to the basic state, after which the word found can be looked up in a table, and an output symbol produced. Alternatively, return to the previous state may be effectuated when a fixed number of symbols is read after the start (or "escape") symbol. An example is that of "pairs" described above, another that of MIME, where an equals sign ("=") is followed by two hexadecimal digits, which may be looked up in a table to produce a single symbol. Look-ahead languages ++++++++++++++++++++ A well-known problem is that of representing members of a large alphabet by members of a small alphabet. It should be possible, if required, to reconstruct the originals without ambiguity. Therefore, the rules must be chosen with great care. If we take the previous example from MIME all non-ASCII characters from Latin-1 are being turned into 3-character strings, each beginning with a "=" sign, which acts as single state- changing symbol, an escape character. The TM expects two hexadecimal characters after it. What to do if something else is found? An equals sign may occur in the text, like x:=y;. There are some solutions: 1. Treat any occurrence of a non-hex after = as an error, 2. Do not change anything, but then x:=37 is ambiguous, 3. Transform all = in the original to =3D. Anyhow, before deciding what to do, two symbols after the equals sign have to be read. This is what is called in the formal language theory "look ahead". A set of reading rules, requiring looking ahead k symbols is called a "LL(k) language". In our case with MIME we have LL(2). Using this terminology for pairs of symbols as specified for 4873, level 2, we may call this convention LL(1). It describes a mixed coding system, where some characters are coded with one, others with two bytes. Of a pair the first byte represents SS2 or SS3. Thus the input alphabets consists of (for the meaning of A,B,C see above): A: x20-x7E (ASCII) B: xA0-xFF (right half) C: SS2,SS3 e But 4873 is not the only standard specifying a mixed coding system with one or two byte character representations. ISO/IEC 6937 does the same thing. There the input alphabet consists of: A: x20-x7E (except those in B) B: selected letters C: xC0-xCF (so-called non-spacing diacritical marks) e Note that not all pairs c-b are legal, only those occurring in a list, the "repertoire" of 6937. Another example is UTF8 for the characters coded in UCS-2 from x0000 to x07FF. Here we have: A: x20-x7E (ASCII) B: x80-xBF C: xC2-xDF e Note that not all pairs c-b are legal, for some may be marked in UCS-2 with "this position shall not be used". Also a c may not be followed by an arbitrary byte, invalid pairs may occur in the input, and should be checked for. Conclusion ++++++++++ Turing machines are a very effective tool to describe coding and decoding rules for characters, and for transformations between large and small character sets. They show clearly difference and likeness which otherwise may be hidden in a verbal description. Application of TMs may be very helpful to explain why a certain approach is preferable to some other. With composing music knowledge of the theory of harmony is indispensable to avoid a horrible noise. So is theoretical computer science a requirement for not making a mess of a system. References ++++++++++ JE Hopgood & JD Ullman Formal languages and their relation to automata Addison-Wesley 1969 2nd ed. Introduction to Automata Theory, Languages, and Computation Addison-Wesley 1979 Derick Wood Theory of Computation NY Wiley, 1987 R Gregory Taylor Models of Computation and Formal Languages Oxford University Press, 1998 VJ Rayward-Smith A First Course in Formal Languages Oxford, Blackwell,1986 VJ Rayward-Smith A First Course in Computability Oxford, Blackwell,1985