Early Years of Unicode
The Unicode® Standard
"…begin at 0 and add the next character"
Antecedents and the First Year - 1988
The concept of a 16-bit universal code was not new.
Even the original 1984 principles for the ISO multi-byte character
encoding were directed to that end:
"...develop the ISO standard for graphic
character repertoire and coding for an international two byte graphic
character set… Consider the needs of programming languages to have the
same amount of storage for each character..." [ISO/TC97/SC2 N1436,
1984]
Other antecedents to Unicode are found in the Star
workstation introduced by Xerox in 1980 and in aspects of the two-byte
standards found in the Far East.
Groundwork for the Unicode project began in late
1987 with initial discussions among three software engineers -- Joe
Becker of Xerox Corporation, Lee Collins, then also of Xerox, and
Mark Davis, then of Apple Computer.
By early 1988 they had completed three main investigations:
(a) comparisons of fixed-width and mixed-width text access;
(b) investigations of the total system storage requirements for
two-byte text; and
(c) preliminary character counts for all world alphabets.
Based on these investigations and their experience
with different character encodings, Becker, Collins, and Davis derived
the basic architecture for Unicode.
The beginning of the Unicode Standard may be marked
by the publication of Unicode 88, a paper written by Joe Becker in
February 1988.
Unicode -- The Beginnings
“For me, the need for Unicode first struck
about 12 years ago [1985]. While I had done some
internationalization while working in Europe, I hadn’t worked on any
of the more interesting scripts. Two programmers, Ken Krugler
and I, were working on a “skunkworks” project in Sapporo, Japan. Our
goal was to produce the first Kanji Macintosh.
Working with our Japanese counterparts was
made somewhat more challenging because of the translation issues. In
the best of all possible worlds, we would all have spoken a common
language. Second best would have been having a technically savvy
translator, experienced with software engineering design and
concepts. What we actually had was one, lone Apple marketing person,
who happened to be bilingual.
Imagine yourself in that situation, having to
discuss how to combine Huffman encoding and run-length encoding to
compress Japanese input dictionaries. We soon learned the full
impact of the phrase "to lose something in translation!"
But then our translator had to leave, and we
were left with just vestiges of English on their side, and minuscule
Japanese on ours. We then found out just how useful a whiteboard
can be.
Yet one day we hit a stumbling block, and were
just not making progress. We had known that Japanese needed two
bytes to encompass the large character set, and we had prototyped
how to adapt the system software to use two-byte characters.
However, we were having trouble figuring out exactly how things fit
together with our counterparts' data formats.
Remember [that] we were new to this, so it
didn't hit us right away. But all of a sudden, we could see the
light go on in both of our faces: we had assumed that the standard
Shift-JIS character set was a uniform two-byte standard. We were so,
so wrong. You needed a mixture of single and double bytes to
represent even the most common text. Worse yet, some bytes could be
both whole single-byte characters and parts of double-byte
characters. We weren't in Kansas anymore!
We persevered, and ended up producing a
successful product [Apple KanjiTalk]. But -- although we
kicked around different ideas for a radically new kind of character
set -- we never did anything with these ideas. That is, not until we
heard about a proposal from colleagues at Xerox for such a new kind
of character set, a character set originated by Joe Becker, a
character set that he baptized 'Unicode'."
Mark Davis, President and Co-founder
of the Unicode Consortium
Quoted from the keynote address, “10 years of Unicode,”
Eleventh International Unicode Conference, September 1997
(©Unicode, Inc. 1997)
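The Shift-JIS pitfall Davis describes is easy to demonstrate. Below
is a minimal Python sketch, added here as an illustration (it is not
part of the original account), using the well-known example of the
kanji 表: its Shift-JIS trail byte is 0x5C, the same value as a
complete one-byte character (backslash in Python's shift_jis mapping,
yen sign in JIS X 0201).

    # A byte value serving double duty in Shift-JIS: a whole character
    # on its own, and the trail byte of a double-byte character.

    data = "表".encode("shift_jis")       # the kanji U+8868
    print(data)                           # b'\x95\\' -- trail byte is 0x5C

    # The same byte value, standing alone, is a whole character:
    print(b"\x5c".decode("shift_jis"))    # '\'

    # A naive byte-wise scan for backslashes misfires inside the kanji:
    print(data.count(b"\x5c"))            # 1 -- a false hit mid-character

Any byte-at-a-time scan must therefore track lead/trail state, which
is exactly the ambiguity that surprised the KanjiTalk team.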
1985-1987
During this period, in addition to co-authoring
Apple's KanjiTalk, Davis was involved in further
discussions of a universal character set, prompted by
the development of Apple File Exchange.
“We [at Xerox PARC] decided to put a Japanese
system on Alto (the prototype personal computer) with Fuji Xerox
[1975]. It was the first personal computer built at Xerox PARC. It
was the model that Steve Jobs used [for the first Apple machine].
Thousands were built, none sold. We needed a 16-bit [character
encoding]. Joe Becker came up with the initial 16-bit [design]. This
was for the Xerox Star product [1981], which was truly multilingual. I
managed the JDS product in Star. Star went on to 27 languages,
including Japanese, Chinese and English."
Bill English, first CFO of the Unicode
Consortium
(interview with Laura Wideburg © 1998)
At Xerox, Huan-mei Liao, Nelson Ng, Dave Opstad,
and Lee Collins began work on a database to map the relationships
between identical Japanese (JIS) and Chinese (simplified and
traditional) characters for quickly building a font for extended
Chinese characters. Xerox users (e.g., Nakajima of the University of
Toronto) were using JIS to extend the Xerox Chinese character set and
vice versa. This opened the discussion of Han unification.
February 1987
Peter Fenwick visited Xerox PARC, joined by
Nakajima from Toronto, and Alan Tucker and Karen Smith-Yoshimura of
the Research Libraries Group, Inc. (RLG). The discussion led to the
architecture of what later became known as the Unicode Standard –
“begin at 0 and add the next character.”
At Apple, discussions of a “universal character set” were sparked by
the Apple File Exchange development.
“I was one of two authors for KanjiTalk for
Apple. Because of this, I met a bunch of people from Xerox doing
multilingual [work] – Joe [Becker], Lee [Collins], Andy Daniels, Dave
Opstad, Eric Mader. I met them and ended up hiring Lee to work for
me at Apple in 1987. We talked to Joe about his idea of Unicode and
wanted to see if this was practical. If you double the text,
what impact would it have? We made some trials with Shift-JIS
encoding, [which] is difficult and easy to corrupt. We looked
at alternatives but Unicode was the most efficient. Lee, Joe and I
started meeting regularly [beginning of the Unicode Working Group].”
Mark Davis
(interview with Laura Wideburg © 1998)
Fall of 1987
Mark Davis began Apple’s participation in ANSI X3L2.
September 1987
Joe Becker from Xerox and Mark Davis from Apple
begin discussing multilingual issues. Dave Opstad, of Xerox,
presents his evaluation that seven years' experience with the Xerox
Character Code Standard (XCCS) compression scheme shows that a
fixed-width design is preferable to a variable-width one.
December 1987
Earliest documented use of the term “Unicode,”
which Joe Becker coined as the name of the new “unique, universal,
and uniform character encoding.”
February 1988
Lee Collins, now at Apple, works with Davis on new
character encoding proposals for future Apple systems. One
proposal includes fixed-width, 16-bit characters, under the name “High
Text” (in opposition to ASCII as “Low Text”). Collins
investigates:
- total system storage requirements with two-byte text;
- comparisons of fixed-width and mixed-width text access; and
- establishment of preliminary character counts for all
world alphabets.
“At Apple, we were not easy converts, however.
We had some serious issues, both technical and practical. On the
technical side:
* Would the increase in the size of text for
America and Western Europe be acceptable to our customers there?
* Could the Chinese, Japanese, and Korean ideographs be successfully
unified?
* Even then, could all the modern characters in common use actually
fit into 16 bits?
Our investigations, headed by Lee Collins,
showed that we could get past these technical issues.
As far as the text size, when we tested the
percentage of memory or disk space actually occupied by character
data in typical use, we found that it was rather small. Small not in
absolute terms, but small compared to the amount of overhead in data
structures and formatting information. Nowadays, of course, with
video and sound data taking so much space, the percentage is even
smaller.
Concerning unification, when we looked at the
unification of CJK ideographs, we had the successful example of the
Research Libraries Group's East Asian Character Code (EACC), a
bibliographic code, to show the way. We could see that by using the very same
unification rules that the Japanese used for JIS, we could unify
characters across the three languages.
And, in terms of character count, when we
counted up the upper bounds for the modern characters in common use,
we came in well under 16 bits.
Moreover, we also verified that no matter how
you coded it, a mixed-byte character set was always less efficient
to access than Unicode.
We ended up satisfying ourselves that the
overall architecture was correct.”
Mark Davis
("10 years of Unicode")
Based upon these investigations, the “Unicode
Principles” were derived, outlining the architecture of the future
Unicode Standard.
“I was pushing at the beginning that we had to
start assigning codes or nothing would ever happen. We started
code charts. Lee on the Apple side, Joe on Xerox. I devoted
Lee full time to Unicode. He was the prime force in the Unified Han.
Without it, Unicode wouldn't work very well. Lee looked at tens of
thousands of character codes.”
Mark Davis
(interview with Wideburg © 1998)
April 1988
First Unicode text prototypes begin at Apple. Apple decides
to incorporate Unicode support into TrueType.
June 1988
A meeting is held at the Research Libraries Group in Palo Alto
to discuss the criteria for Han unification. A method is
devised for combining frequency-of-use orderings for Chinese,
Japanese, and Korean (CJK).
“My first memory [of Unicode] is meeting at
Apple on Bubb Road in Cupertino, with Joe [Becker] and Lee Collins, Alan
Tucker and Karen Smith-Yoshimura, who were involved in CJK.
Lee would say RLG had been in CJK for a long
time. Unicode had heard about us. RLG had done a unified
Han set and Lee was working along the same lines.
Why RLG? RLG is interested in meeting
the needs of research librarians. Automation was insufficient
because putting [CJK records] into a computer required
transliteration, which created deficiencies.
Most of the meetings were held at RLG in the
early days. We had a meeting room for the Unicode Working
Group. We ended up being the host institution for the most part.
Working with Unicode has been wonderful.
In the past the librarians always had to
justify themselves to computer people. The difference with Unicode
is that people ARE interested in scholarship. A fair number of them
have doctorates and are interested in arcane scripts. They have to
use libraries and they appreciate libraries. They have an
awareness of the needs of scholarship. They don't focus
exclusively on the needs of the marketplace.
The other thing is that they are willing to
share their knowledge and teach. In other groups the knowledge
keepers make you feel inferior. [In the Unicode group], like Mark
Davis, he will explain it carefully if you have a question. And you
know Asmus, he is a teacher. They all have a “teacherly” way.
The great people are kind and humane. They hold a high
position, but they don't "put on dogs."
Joan Aliprand, Current Secretary of the
Unicode Consortium
(interview with Laura Wideburg © 1998)
(Joan Aliprand, who hails from Australia, did not elaborate upon
this colloquialism).
July 1988
Apple purchases the Research Libraries Group's
Chinese-Japanese-Korean (CJK) character database for the study of Han
unification.
Joe Becker presents the “Unicode Principles” in the
document “Unicode 88” at the /usr/group/International Subcommittee
meeting in Dallas in August 1988.
“Joe took an early draft of “Unicode 88” and
distributed it at a conference [summer of 1988 ANSI]. Other
companies saw that this was the wave of the future and we started
taking on more people.
But the practical question still remained:
Could we get a critical mass of companies to accept this new
encoding?
Here, we were taking a gamble. We were all
convinced that this was the correct technical solution. But more
than that, we were all moved by the dream of bringing computers to
every part of the world. Into Japanese, into Chinese, into Arabic or
into Hindi, it should be as easy to localize a program as it is to
localize into French, into German, or into British English. (Some
people don't realize that you do have to make a separate version for
the UK--if only they would use modern spelling!)
Unicode would not only make it easier and less
costly to localize for our existing major countries, it would also
make it possible to localize for minor languages, languages that
were previously excluded by being just too costly for the effort.
People would not be shut off from computers because of their native
language. Of course, this also made economic sense for our
companies; all major software vendors now sell a majority of their
products internationally, not just domestically.
But as time went on we saw that our gamble was
worth taking.
As time goes on, ever more programmers must
internationalize their programs. They look at Unicode, and look at
the alternatives. Once the fundamental support is provided for
Unicode in the operating systems or linkable libraries, the best
choice is clear. The momentum behind the wave of Unicode adoption
will not let up. By the year 2000, the majority of programmers
working on new development will be using Unicode."
Mark Davis
(“10 years of Unicode”)
September 1988
Joe Becker and Lee Collins go to ANSI X3L2 to argue
for Han unification and for the use of C0 and C1 within ISO DP 10646.
Becker later presents a paper on Unicode to ISO Working Group 2.
In the fall of 1988, Collins began building a
database of Unicode characters. The original design ordered characters
alphabetically within scripts, and excluded all composite characters.
Xerox had already built up a database of Unified Han for font
construction. Collins used a database of EACC characters from RLG to
start a Han unification database at Apple. Becker and Collins later
correlated the two databases, and Collins continued to extend the
Apple database with character correspondences for other national
standards.
Interviews conducted by Laura Wideburg in 1998 are
printed with permission from the interviewer. © Laura Wideburg 1998.
They may not be reprinted without Wideburg's permission.
All other material is ©Unicode, Inc. 1998 and may
be reprinted for educational purposes or for press releases. Any
other reprinting requires the permission of the
Unicode Consortium.