Unicoders!
My memory was jogged by Marco Cimarosti's new RFC for ATF-8. I
felt sure that I had heard something similar somewhere.
After a long, diligent search through dusty filing cabinets,
I did indeed discover a very old RFC for BTF-8 that long
predates ATF-8, UTF-8, or WHATEVERTF-8. Because of the continuing
relevance of Baudot Code to telexes, and because of the current
interest in the invention of TFs, I thought I should bring it to
the attention of the list.
--Ken
=======================================================================
Telegraphy Working Group                                 K. Whistlestop
Request for Comments: 2OLD4U                      Creed Machinery, Ltd.
Category: Disinformational                                 October 1916

         BTF-8, an 8-bit transformation format of Baudot Code
Status of this Memo
This memo provides disinformation for the Internet community. This memo
does not specify an Internet standard of any kind. In fact, if you
think it specifies any standard, I don't know what you've been smoking
lately. Distribution of this memo is unlimited.
Abstract
The Baudot Multiplex System, as codified in the International
Telegraph Alphabet number 1 [ITA 1], defines a 5-bit character set
which encompasses one of the world's writing systems (the only one
that really counts, of course). 5-bit characters, however, are not
compatible with many current applications and protocols. BTF-8,
the object of this memo, has the characteristic of preserving the
full English alphabet range (well, the uppercase, anyway). Letters
are encoded in one octet having the usual US-ASCII value (or rather
what will be the US-ASCII value, when US-ASCII is invented). This
provides compatibility with telegrams that rely on US-ASCII values
but are transparent to other values.
1. Introduction
The Baudot Multiplex System defines a 5-bit character set which
encompasses 56 characters for the world's most important writing
system. That's right, you heard me correctly--56 characters. But
how do they do that, since 5 bits cover only 32 combinations, you
might ask. Well, there's nothing up my sleeves, you see--it's all
done with smoke and mirrors. 26 characters are devoted to uppercase
letters A-Z. And 26 characters are devoted to "Figures": numbers and
punctuation, plus a BELL code to wake up the sleeping operator at the
other end and a "Who are you?" code to check you have reached the correct
sleeping operator. Two codes, #31 for LTRS and #27 for FIGS,
switch back and forth between the letters codes and the figures
codes. That leaves four codes for BLANK, SPACE, CR, and LF, which are
valid for both letters and figures. The LTRS and FIGS encodings, however,
are hard to use in many current applications and protocols that assume
8-bit characters without state switches.
Furthermore, the Baudot Multiplex System, as implemented in Creed
teleprinter machinery, requires a start bit and 1.5 stop bits, and
is transmitted asynchronously. Newer systems able to deal with 8-bit
characters cannot process 7.5-bit (1 start + 5 data + 1.5 stop)
asynchronous Baudot Multiplex Codes.
This situation has led to the development of so-called transformation
formats (TF), each with different, confusing characteristics.
BTF-8, the object of this memo, uses all bits of an
octet, but has the quality of preserving the full US-ASCII range:
US-ASCII characters are encoded in one octet having the normal
US-ASCII value, and any octet with such a value can only stand for a
US-ASCII character, and nothing else.
- LTRS and FIGS codes are removed, and the figures values are recoded
to their US-ASCII values, so as to avoid stateful switching.
- US-ASCII values do not appear otherwise in a BTF-8 encoded character
stream. This provides compatibility with telegrams or
filing cabinets that file based on US-ASCII values but are
transparent to other values.
- Round-trip conversion is easy between BTF-8 and the Baudot Multiplex
System.
- Character boundaries are easily found from anywhere in an octet
stream.
- The lexicographic sorting order of Baudot Multiplex System strings
is mucked up beyond belief. Of course this is of limited interest
since the sort order is not culturally valid in either case. (And I'm
not sure anybody has even tried to sort asynchronous character streams
on Creed teleprinters, but that is another story anyway.)
- The octet values FE and FF never appear. But then, neither do the
octet values 5B..FD, so it isn't clear why we should single out FE
and FF, is it?
2. BTF-8 definition
In BTF-8, characters are encoded using a single octet. What could be
simpler? The letters are recoded according to their value in US-ASCII.
The figures and control codes are recoded according to their value
in US-ASCII. The LTRS and FIGS codes are tossed in the bit bucket.
The table below summarizes this format.
The letter x indicates bits available for encoding bits of the Baudot
character value.
Baudot Multiplex Code (binary)      BTF-8 octet sequence (binary)
00000-11111                         0xxxxxxx
Encoding from the Baudot Multiplex System to BTF-8 proceeds as follows:
1) Assume "letters" state initially.
2) Process the asynchronous stream of Baudot Multiplex codes sequentially,
stripping 1 start bit and 1.5 stop bits, to obtain the 5-bit coded value.
3) When the FIGS code is encountered, set the state to "figures".
4) When the LTRS code is encountered, set the state to "letters".
5) For all other codes encountered, if in "letters" state, convert to
US-ASCII with the LETTER_TO_BTF8 table, otherwise convert to
US-ASCII with the FIGURE_TO_BTF8 table.
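The five encoding steps above can be sketched in C roughly as follows.
This is only an illustration, not part of the memo's definition: the
function name baudot_to_btf8 and the two #define names are made up
here, and the input is assumed to be 5-bit code values whose start and
stop bits have already been stripped. The figure entries for "-"
(index 3) and "=" (index 30) are assumed to follow the worked examples
in Section 3.

```c
#include <stddef.h>

#define BAUDOT_FIGS 27  /* switches to "figures" state */
#define BAUDOT_LTRS 31  /* switches to "letters" state */

/* Forward tables; 0xFF marks the two shift codes, which are never emitted. */
static const unsigned char LETTER_TO_BTF8[32] = {
    0x00, 0x45, 0x0A, 0x41, 0x20, 0x53, 0x49, 0x55,
    0x0D, 0x44, 0x52, 0x4A, 0x4E, 0x46, 0x43, 0x4B,
    0x54, 0x5A, 0x4C, 0x57, 0x48, 0x59, 0x50, 0x51,
    0x4F, 0x42, 0x47, 0xFF, 0x4D, 0x58, 0x56, 0xFF };

/* Assumption: index 3 is '-' (0x2D) and index 30 is '=' (0x3D). */
static const unsigned char FIGURE_TO_BTF8[32] = {
    0x00, 0x33, 0x0A, 0x2D, 0x20, 0x27, 0x38, 0x37,
    0x0D, 0x05, 0x34, 0x07, 0x2C, 0x40, 0x3A, 0x28,
    0x35, 0x2B, 0x29, 0x32, 0x24, 0x36, 0x30, 0x31,
    0x39, 0x3F, 0x2A, 0xFF, 0x2E, 0x2F, 0x3D, 0xFF };

/* Hypothetical helper: converts n Baudot codes to BTF-8 octets.
   out must have room for n octets; returns the number written. */
size_t baudot_to_btf8(const unsigned char *codes, size_t n,
                      unsigned char *out)
{
    int figures = 0;                 /* step 1: start in "letters" state */
    size_t written = 0;

    for (size_t i = 0; i < n; i++) { /* step 2: process sequentially */
        unsigned char c = codes[i] & 0x1F;
        if (c == BAUDOT_FIGS) {      /* step 3 */
            figures = 1;
        } else if (c == BAUDOT_LTRS) { /* step 4 */
            figures = 0;
        } else {                     /* step 5 */
            out[written++] = figures ? FIGURE_TO_BTF8[c]
                                     : LETTER_TO_BTF8[c];
        }
    }
    return written;
}
```

With these tables, the codes 3, 27, 30, 23, 28 come out as the octets
41 3D 31 2E, matching the first example in Section 3.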
Decoding from BTF-8 to the Baudot Multiplex System proceeds as follows:
1) Assume "letters" state initially.
2) For each character in the BTF-8 string, determine whether it is
in the letters set or the figures set. (BLANK, SPACE, CR, and LF are
in both sets and never require a change of state.)
3) If the character is in the letters set and "letters" state is set,
convert to Baudot code with the BTF8_TO_LETTER table.
4) If the character is in the figures set and "figures" state is set,
convert to Baudot code with the BTF8_TO_FIGURE table.
5) If the character is only in the letters set and "figures" state is set,
first emit the LTRS code, set the state to "letters", and then
convert to Baudot code with the BTF8_TO_LETTER table.
6) If the character is only in the figures set and "letters" state is set,
first emit the FIGS code, set the state to "figures", and then
convert to Baudot code with the BTF8_TO_FIGURE table.
7) Emit each converted 5-bit value, prefixing a start bit and appending
1.5 stop bits.
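A minimal sketch of the decoding direction, under some loudly stated
assumptions: the names btf8_to_baudot and find_code are invented for
this illustration, framing bits are not added, and instead of the
BTF8_TO_LETTER and BTF8_TO_FIGURE tables it simply searches the two
forward tables, which is slower but keeps the example short. It emits
LTRS before a letter that arrives in "figures" state and FIGS before a
figure that arrives in "letters" state, which is what a round trip
through the encoder requires. The figure entries for "-" and "=" are
assumed to follow the examples in Section 3.

```c
#include <stddef.h>

#define BAUDOT_FIGS 27
#define BAUDOT_LTRS 31

static const unsigned char LETTER_TO_BTF8[32] = {
    0x00, 0x45, 0x0A, 0x41, 0x20, 0x53, 0x49, 0x55,
    0x0D, 0x44, 0x52, 0x4A, 0x4E, 0x46, 0x43, 0x4B,
    0x54, 0x5A, 0x4C, 0x57, 0x48, 0x59, 0x50, 0x51,
    0x4F, 0x42, 0x47, 0xFF, 0x4D, 0x58, 0x56, 0xFF };

/* Assumption: index 3 is '-' (0x2D) and index 30 is '=' (0x3D). */
static const unsigned char FIGURE_TO_BTF8[32] = {
    0x00, 0x33, 0x0A, 0x2D, 0x20, 0x27, 0x38, 0x37,
    0x0D, 0x05, 0x34, 0x07, 0x2C, 0x40, 0x3A, 0x28,
    0x35, 0x2B, 0x29, 0x32, 0x24, 0x36, 0x30, 0x31,
    0x39, 0x3F, 0x2A, 0xFF, 0x2E, 0x2F, 0x3D, 0xFF };

/* Returns the Baudot code whose table entry equals octet, or -1.
   The two shift positions are skipped so 0xFF never matches. */
static int find_code(const unsigned char table[32], unsigned char octet)
{
    for (int i = 0; i < 32; i++)
        if (i != BAUDOT_FIGS && i != BAUDOT_LTRS && table[i] == octet)
            return i;
    return -1;
}

/* Hypothetical helper: converts n BTF-8 octets to Baudot codes.
   out must have room for 2 * n codes (worst case: a shift before
   every character). Returns the number of codes written, or
   (size_t)-1 if an octet is outside the BTF-8 repertoire. */
size_t btf8_to_baudot(const unsigned char *in, size_t n,
                      unsigned char *out)
{
    int figures = 0;                      /* start in "letters" state */
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        int as_letter = find_code(LETTER_TO_BTF8, in[i]);
        int as_figure = find_code(FIGURE_TO_BTF8, in[i]);

        if (as_letter < 0 && as_figure < 0)
            return (size_t)-1;            /* illegal value in BTF-8 */

        if (figures && as_figure < 0) {   /* letter while in figures */
            out[written++] = BAUDOT_LTRS;
            figures = 0;
        } else if (!figures && as_letter < 0) { /* figure while in letters */
            out[written++] = BAUDOT_FIGS;
            figures = 1;
        }
        out[written++] = (unsigned char)(figures ? as_figure : as_letter);
    }
    return written;
}
```

Characters present in both tables (BLANK, SPACE, CR, LF) have the same
code in either state, so they pass through without any shift.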
The applicable tables are shown here, expressed in C (= Baudot Code
01110). The value 0xFF is an unused value in the table, corresponding
to the LTRS or FIGS codes, or illegal values in BTF-8.
unsigned char LETTER_TO_BTF8 [32] =
{ 0x00, 0x45, 0x0A, 0x41, 0x20, 0x53, 0x49, 0x55,
0x0D, 0x44, 0x52, 0x4A, 0x4E, 0x46, 0x43, 0x4B,
0x54, 0x5A, 0x4C, 0x57, 0x48, 0x59, 0x50, 0x51,
0x4F, 0x42, 0x47, 0xFF, 0x4D, 0x58, 0x56, 0xFF };
unsigned char FIGURE_TO_BTF8 [32] =
{ 0x00, 0x33, 0x0A, 0x2D, 0x20, 0x27, 0x38, 0x37,
0x0D, 0x05, 0x34, 0x07, 0x2C, 0x40, 0x3A, 0x28,
0x35, 0x2B, 0x29, 0x32, 0x24, 0x36, 0x30, 0x31,
0x39, 0x3F, 0x2A, 0xFF, 0x2E, 0x2F, 0x3D, 0xFF };
unsigned char BTF8_TO_LETTER [91] =
{ 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* 0 */
0xFF, 0xFF, 0x02, 0xFF, 0xFF, 0x08, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* 1 */
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0x04, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* 2 */
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* 3 */
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0x03, 0x19, 0x0E, 0x09, 0x01, 0x0D, 0x1A, /* 4 */
0x14, 0x06, 0x0B, 0x0F, 0x12, 0x1C, 0x0C, 0x18,
0x16, 0x17, 0x0A, 0x05, 0x10, 0x07, 0x1E, 0x13, /* 5 */
0x1D, 0x15, 0x11 };
unsigned char BTF8_TO_FIGURE [65] =
{ 0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0x09, 0xFF, 0x0B, /* 0 */
0xFF, 0xFF, 0x02, 0xFF, 0xFF, 0x08, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, /* 1 */
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0x04, 0xFF, 0xFF, 0xFF, 0x14, 0xFF, 0xFF, 0x05, /* 2 */
0x0F, 0x12, 0x1A, 0x11, 0x0C, 0x03, 0x1C, 0x1D,
0x16, 0x17, 0x13, 0x01, 0x0A, 0x10, 0x15, 0x07, /* 3 */
0x06, 0x18, 0x0E, 0xFF, 0xFF, 0x1E, 0xFF, 0x19,
0x0D };
Actual code to strip the start and stop bits of the asynchronous
stream, convert the 5-bit Baudot code thus extracted to a numeric
value, and then to use these tables is left as an exercise to the
reader.
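For readers who would rather not do the exercise, here is one possible
sketch of the framing step only. Everything about the input format is
an assumption made for this illustration: the line is taken to be
sampled twice per bit time, so that the 7.5-bit frame becomes 15 whole
samples (2 for the start bit, 10 for the data bits, 3 for the 1.5 stop
bits); frames are assumed to arrive back to back; and the data bits
are assumed to be sent lowest-order bit first. The name strip_framing
is hypothetical, and the framing samples are skipped rather than
checked.

```c
#include <stddef.h>

#define SAMPLES_PER_BIT 2   /* assumed sampling rate: two samples per bit */
#define FRAME_SAMPLES   15  /* (1 + 5 + 1.5) bit times * 2 samples */

/* Hypothetical helper: extracts one 5-bit code from each complete
   15-sample frame in samples[] (each element 0 or 1). codes must have
   room for n_samples / FRAME_SAMPLES entries. Returns the number of
   codes extracted; a trailing partial frame is ignored. */
size_t strip_framing(const unsigned char *samples, size_t n_samples,
                     unsigned char *codes)
{
    size_t n_codes = 0;

    for (size_t pos = 0; pos + FRAME_SAMPLES <= n_samples;
         pos += FRAME_SAMPLES) {
        unsigned char value = 0;
        for (int bit = 0; bit < 5; bit++) {
            /* Skip the start bit, then read the first sample of
               each data bit. */
            size_t idx = pos + (size_t)(1 + bit) * SAMPLES_PER_BIT;
            if (samples[idx])
                value |= (unsigned char)(1u << bit);
        }
        codes[n_codes++] = value;   /* stop-bit samples are dropped */
    }
    return n_codes;
}
```

The resulting code values are what the LETTER_TO_BTF8 and
FIGURE_TO_BTF8 tables are indexed by.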
3. Examples
For simplicity these examples omit the start bit (always set) and the
1.5 stop bits (also always set). Note that bit values in the Baudot
Codes start with the lowest-order bit on the left, and with higher-order
bits to the right, so that "11000" = 3, the Baudot Code for "A".
In case you haven't purchased your Creed transmitting and teleprinting
devices yet, this arrangement used to correspond to the five levers
the operator pressed on a chording keyboard (see [ITA 1] for a
photograph): the two on the left corresponding to the first two
fingers of the left hand, and the three on the right corresponding
to the first three fingers of the right hand.
However, this has all been simplified in the Creed machines to make
use of an ordinary typewriter-style keyboard--the machine automatically
translates a keypress into the activation of the appropriate combination
of levers for perforating tape, controlled by compressed air!!
Now anyone who has passed a competent secretarial course
can serve as a telegraph operator, thus opening the door to hiring
cheap, compliant female labor to keep your telegraphy operating costs down.
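The lowest-bit-first notation described above is easy to misread, so
here is a small sketch that turns a printed group such as "11000" into
its numeric code value. The function name baudot_from_notation is
hypothetical and exists only for this illustration.

```c
/* Hypothetical helper: parses exactly five '0'/'1' characters written
   lowest-order bit first and returns the code value 0..31, or -1 if
   the string is malformed. */
int baudot_from_notation(const char *s)
{
    int value = 0;

    for (int bit = 0; bit < 5; bit++) {
        if (s[bit] != '0' && s[bit] != '1')
            return -1;              /* too short or not a binary digit */
        if (s[bit] == '1')
            value |= 1 << bit;      /* leftmost character is bit 0 */
    }
    return s[5] == '\0' ? value : -1;   /* reject trailing characters */
}
```

For instance, "11000" yields 3 (the letter "A"), "11011" yields 27
(FIGS), and "11111" yields 31 (LTRS).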
The Baudot sequence "A=1." (11000 11011 01111 11101 00111) may be encoded
as follows:
41 3D 31 2E
The Baudot sequence "HI MOM :-)" (00101 01100 00100 00111 00011 00111
00100 11011 01110 11000 01001) may be encoded as follows:
48 49 20 4D 4F 4D 20 3A 2D 29
The Baudot sequence representing the Han characters for the Japanese
word "nihongo" -- no wait!, what could I be thinking??
MIME registrations
This memo is meant to serve as the basis for registration of a MIME
character encoding (charset) as per [RFC1521]. The proposed charset
parameter value is "BTF-8". This string would label media types
containing text consisting of characters from the repertoire of ITA 1
encoded to a sequence of octets using the encoding scheme
outlined above.
Security Considerations
Security issues are not discussed in this memo. German spies may be
listening, and we all know what an Enigma their codes and coding
machinery are.
Acknowledgments
The following have participated in the drafting and discussion of
this memo:
Dewey, Cheetham, and Howe        My Dog Fluffy
Phillip Airtime                  Sy Burnett
Tilly Graham
Bibliography
[ITA 1] International Telegraph Alphabet number 1. For nice
pictures of equipment, see:
http://ourworld.compuserve.com/homepages/sam_hallas/telhist2/telehist.htm
[RFC1521] Borenstein, N., and N. Freed, "MIME (Multipurpose
Internet Mail Extensions) Part One: Mechanisms for
Specifying and Describing the Format of Internet
Message Bodies", RFC 1521, Bellcore, Innosoft, September
1993.
[US-ASCII] Coded Character Set--7-bit American Standard Code for
Information Interchange, ANSI X3.4-1986.
Author's Address
Ken Whistlestop
Creed Machinery, Ltd.
Tel: Garfield exchange #42
Fax: Same to you, buddy!
EMail: What's that?