Early Years of Unicode
The Unicode® Standard
"…begin at 0 and add the next character"
Antecedents and the First Year - 1988
The concept of a 16-bit universal code was not new.
Even the original 1984 principles for the ISO multi-byte character
encoding were directed to that end:
"...develop the ISO standard for graphic
character repertoire and coding for an international two byte graphic
character set… Consider the needs of programming languages to have the
same amount of storage for each character..." [ISO/TC97/SC2 N1436,
1984]
Other antecedents to Unicode are found in the Star
workstation introduced by Xerox in 1980 and in aspects of the two-byte
standards found in the Far East.
Groundwork for the Unicode project began in late
1987 with initial discussions among three software engineers -- Joe
Becker of Xerox Corporation, Lee Collins, then also of Xerox, and
Mark Davis, then of Apple Computer.
By early 1988 they had completed three main investigations:
(a) comparisons of fixed-width and mixed-width text access;
(b) investigations of the total system storage requirements for
two-byte text; and
(c) preliminary character counts for all world alphabets.
Based on these investigations and their experience
with different character encodings, Becker, Collins, and Davis derived
the basic architecture for Unicode.
The beginning of the Unicode Standard may be marked
by the publication of Unicode 88, a paper written by Joe Becker in
February 1988.
Unicode -- The Beginnings
“For me, the need for Unicode first struck
about 12 years ago [1985]. While I had done some
internationalization while working in Europe, I hadn’t worked on any
of the more interesting scripts. Two programmers, Ken Krugler
and I, were working on a “skunkworks” project in Sapporo, Japan. Our
goal was to produce the first Kanji Macintosh.
Working with our Japanese counterparts was
made somewhat more challenging because of the translation issues. In
the best of all possible worlds, we would all have spoken a common
language. Second best would have been having a technically savvy
translator, experienced with software engineering design and
concepts. What we actually had was one, lone Apple marketing person,
who happened to be bilingual.
Imagine yourself in that situation, having to
discuss how to combine Huffman encoding and run-length encoding to
compress Japanese input dictionaries. We soon learned the full
impact of the phrase "to lose something in translation!"
But then our translator had to leave, and we
were left with just vestiges of English on their side, and minuscule
Japanese on ours. We then found out just how useful a whiteboard
can be.
Yet one day we hit a stumbling block, and were
just not making progress. We had known that Japanese needed two
bytes to encompass the large character set, and we had prototyped
how to adapt the system software to use two-byte characters.
However, we were having trouble figuring out exactly how things fit
together with our counterparts' data formats.
Remember [that] we were new to this, so it
didn't hit us right away. But all of a sudden, we could see the
light go on in both of our faces: we had assumed that the standard
Shift-JIS character set was a uniform two-byte standard. We were so,
so wrong. You needed a mixture of single and double bytes to
represent even the most common text. Worse yet, some bytes could be
both whole single-byte characters and parts of double-byte
characters. We weren't in Kansas anymore!
We persevered, and ended up producing a
successful product [Apple KanjiTalk]. But -- although we
kicked around different ideas for a radically new kind of character
set -- we never did anything with these ideas. That is, not until we
heard about a proposal from colleagues at Xerox for such a new kind
of character set, a character set originated by Joe Becker, a
character set that he baptized 'Unicode'."
Mark Davis, President and Co-founder
of the Unicode Consortium
Quoted from the keynote address, “10 years of Unicode,”
Eleventh International Unicode Conference, September 1997
(©Unicode, Inc. 1997)
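The Shift-JIS pitfall Davis describes is easy to demonstrate. Below
is a minimal Python sketch, added here as an illustration (it is not
part of the original account), using the well-known example of the
kanji 表: its Shift-JIS trail byte is 0x5C, the same value as a
complete one-byte character (backslash in Python's shift_jis mapping,
yen sign in JIS X 0201).

    # A byte value serving double duty in Shift-JIS: a whole character
    # on its own, and the trail byte of a double-byte character.

    data = "表".encode("shift_jis")       # the kanji U+8868
    print(data)                           # b'\x95\\' -- trail byte is 0x5C

    # The same byte value, standing alone, is a whole character:
    print(b"\x5c".decode("shift_jis"))    # '\'

    # A naive byte-wise scan for backslashes misfires inside the kanji:
    print(data.count(b"\x5c"))            # 1 -- a false hit mid-character

Any byte-at-a-time scan must therefore track lead/trail state, which
is exactly the ambiguity that surprised the KanjiTalk team.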
1985-1987
During this period, in addition to co-authoring
Apple's KanjiTalk, Davis was involved in further
discussions of a universal character set, prompted by
the development of Apple File Exchange.
“We [at Xerox PARC] decided to put a Japanese
system on Alto (the prototype personal computer) with Fuji Xerox
[1975]. It was the first personal computer built at Xerox PARC. It
was the model that Steve Jobs used [for the first Apple machine].
Thousands were built, none sold. We needed a 16-bit [character
encoding]. Joe Becker came up with the initial 16-bit [design]. This
was for the Xerox Star product [1981], which was truly multilingual. I
managed the JDS product in Star. Star went on to 27 languages,
including Japanese, Chinese and English."
Bill English, first CFO of the Unicode
Consortium
(interview with Laura Wideburg © 1998)
At Xerox, Huan-mei Liao, Nelson Ng, Dave Opstad,
and Lee Collins began work on a database to map the relationships
between identical Japanese (JIS) and Chinese (simplified and
traditional) characters for quickly building a font for extended
Chinese characters. Xerox users (e.g., Nakajima of the University of
Toronto) were using JIS to extend the Xerox Chinese character set and
vice versa. This opened the discussion of Han unification.
February 1987
Peter Fenwick visited Xerox PARC, joined by
Nakajima from Toronto, and Alan Tucker and Karen Smith-Yoshimura of
the Research Libraries Group, Inc. (RLG). The discussion led to the
architecture of what later became known as the Unicode Standard –
“begin at 0 and add the next character.”
At Apple, discussions of a “universal character set” were sparked by
the Apple File Exchange development.
“I was one of two authors for KanjiTalk for
Apple. Because of this, I met a bunch of people from Xerox doing
multilingual [work] – Joe [Becker], Lee [Collins], Andy Daniels, Dave
Opstad, Eric Mader. I met them and ended up hiring Lee to work for
me at Apple in 1987. We talked to Joe about his idea of Unicode and
wanted to see if this was practical. If you double the text,
what impact would it have? We made some trials with Shift-JIS
encoding, [which] is difficult and easy to corrupt. We looked
at alternatives but Unicode was the most efficient. Lee, Joe and I
started meeting regularly [beginning of the Unicode Working Group].”
Mark Davis
(interview with Laura Wideburg © 1998)
Fall of 1987
Mark Davis began Apple’s participation in ANSI X3L2.
September 1987
Joe Becker from Xerox and Mark Davis from Apple
begin discussing multilingual issues. Dave Opstad, of Xerox,
presents his evaluation that seven years' experience with the Xerox
Character Code Standard (XCCS) compression scheme shows that a
fixed-width design is preferable to a variable-width one.
December 1987
Earliest documented use of the term “Unicode,”
which Joe Becker coined as the name of the new “unique, universal,
and uniform character encoding.”
February 1988
Lee Collins, now at Apple, works with Davis on new
character encoding proposals for future Apple systems. One
proposal includes fixed-width, 16-bit characters, under the name “High
Text” (in opposition to ASCII as “Low Text”). Collins
investigates:
- total system storage requirements with two-byte text;
- comparisons of fixed-width and mixed-width text access; and
- establishment of preliminary character counts for all
world alphabets.
“At Apple, we were not easy converts, however.
We had some serious issues, both technical and practical. On the
technical side:
* Would the increase in the size of text for
America and Western Europe be acceptable to our customers there?
* Could the Chinese, Japanese, and Korean ideographs be successfully
unified?
* Even then, could all the modern characters in common use actually
fit into 16 bits?
Our investigations, headed by Lee Collins,
showed that we could get past these technical issues.
As far as the text size, when we tested the
percentage of memory or disk space actually occupied by character
data in typical use, we found that it was rather small. Small not in
absolute terms, but small compared to the amount of overhead in data
structures and formatting information. Nowadays, of course, with
video and sound data taking so much space, the percentage is even
smaller.
Concerning unification, when we looked at the
unification of CJK ideographs, we had the successful example of the
Research Libraries Group's East Asian Character Code (EACC), a
bibliographic code, to show the way. We could see that by using the very same
unification rules that the Japanese used for JIS, we could unify
characters across the three languages.
And, in terms of character count, when we
counted up the upper bounds for the modern characters in common use,
we came in well under 16 bits.
Moreover, we also verified that no matter how
you coded it, a mixed-byte character set was always less efficient
to access than Unicode.
We ended up satisfying ourselves that the
overall architecture was correct.”
Mark Davis
("10 years of Unicode")
Based upon these investigations, the “Unicode
Principles” were derived, outlining the architecture of the future
Unicode Standard.
“I was pushing at the beginning that we had to
start assigning codes or nothing would ever happen. We started
code charts. Lee on the Apple side, Joe on Xerox. I devoted
Lee full time to Unicode. He was the prime force in the Unified Han.
Without it, Unicode wouldn't work very well. Lee looked at tens of
thousands of character codes.”
Mark Davis
(interview with Wideburg © 1998)
April 1988
First Unicode text prototypes begin at Apple. Apple decides
to incorporate Unicode support into TrueType.
June 1988
A meeting is held at the Research Libraries Group in Palo Alto
to discuss the criteria for Han unification. A method is
devised for combining frequency-of-use orderings for Chinese,
Japanese, and Korean (CJK).
“My first memory [of Unicode] is meeting at
Apple on Bubb Road in Cupertino, with Joe [Becker] and Lee Collins, Alan
Tucker and Karen Smith-Yoshimura, who were involved in CJK.
Lee would say RLG had been in CJK for a long
time. Unicode had heard about us. RLG had done a unified
Han set and Lee was working along the same lines.
Why RLG? RLG is interested in meeting
the needs of research librarians. Automation was insufficient
because putting [CJK records] into a computer required
transliteration, which created deficiencies.
Most of the meetings were held at RLG in the
early days. We had a meeting room for the Unicode Working
Group. We ended up being the host institution for the most part.
Working with Unicode has been wonderful.
In the past the librarians always had to
justify themselves to computer people. The difference with Unicode
is that people ARE interested in scholarship. A fair number of them
have doctorates and are interested in arcane scripts. They have to
use libraries and they appreciate libraries. They have an
awareness of the needs of scholarship. They don't focus
exclusively on the needs of the marketplace.
The other thing is that they are willing to
share their knowledge and teach. In other groups the knowledge
keepers make you feel inferior. [In the Unicode group], like Mark
Davis, he will explain it carefully if you have a question. And you
know Asmus, he is a teacher. They all have a “teacherly” way.
The great people are kind and humane. They hold a high
position, but they don't "put on dogs."
Joan Aliprand, Current Secretary of the
Unicode Consortium
(interview with Laura Wideburg © 1998)
(Joan Aliprand, who hails from Australia, did not elaborate upon
this colloquialism).
July 1988
Apple purchases the Research Libraries Group's
Chinese-Japanese-Korean (CJK) character database for the study of Han
unification.
Joe Becker presents the “Unicode Principles” in the
document “Unicode 88” at the /usr/group/International Subcommittee
meeting in Dallas in August 1988.
“Joe took an early draft of “Unicode 88” and
distributed it at a conference [summer of 1988 ANSI]. Other
companies saw that this was the wave of the future and we started
taking on more people.
But the practical question still remained:
Could we get a critical mass of companies to accept this new
encoding?
Here, we were taking a gamble. We were all
convinced that this was the correct technical solution. But more
than that, we were all moved by the dream of bringing computers to
every part of the world. Into Japanese, into Chinese, into Arabic or
into Hindi, it should be as easy to localize a program as it is to
localize into French, into German, or into British English. (Some
people don't realize that you do have to make a separate version for
the UK--if only they would use modern spelling!)
Unicode would not only make it easier and less
costly to localize for our existing major countries, it would also
make it possible to localize for minor languages, languages that
were previously excluded by being just too costly for the effort.
People would not be shut off from computers because of their native
language. Of course, this also made economic sense for our
companies; all major software vendors now sell a majority of their
products internationally, not just domestically.
But as time went on we saw that our gamble was
worth taking.
As time goes on, ever more programmers must
internationalize their programs. They look at Unicode, and look at
the alternatives. Once the fundamental support is provided for
Unicode in the operating systems or linkable libraries, the best
choice is clear. The momentum behind the wave of Unicode adoption
will not let up. By the year 2000, the majority of programmers
working on new development will be using Unicode."
Mark Davis
(“10 years of Unicode”)
September 1988
Joe Becker and Lee Collins go to ANSI X3L2 to argue
for Han unification and for the use of C0 and C1 within ISO DP 10646.
Becker later presents a paper on Unicode to ISO Working Group 2.
In the fall of 1988, Collins began building a
database of Unicode characters. The original design ordered characters
alphabetically within scripts, and excluded all composite characters.
Xerox had already built up a database of Unified Han for font
construction. Collins used a database of EACC characters from RLG to
start a Han unification database at Apple. Becker and Collins later
correlated the two databases, and Collins continued to extend the
Apple database with character correspondences for other national
standards.
Interviews conducted by Laura Wideburg in 1998 are
printed with permission from the interviewer. © Laura Wideburg 1998.
They may not be reprinted without Wideburg's permission.
All other material is ©Unicode, Inc. 1998 and may
be reprinted for educational purposes or for press releases. Any
other reprinting requires the permission of the
Unicode Consortium.