Here is a summary of all the answers I received to my "historical"
questions.
Sorry for the length of this post, but I think that many people will find
this worth reading. Thanks again to all the people who took the time to
reply.
_ Marco
--- --- --- ---
Q: When did the Unicode project start, and who started it?
A: [Magda Danish]
I am currently working on a few web pages that cover the history of
Unicode.
A: [Mark Davis]
While we will continue to flesh out and improve these pages, the initial
versions are publicly available, under "Historical Data" on:
<http://www.unicode.org/unicode/consortium/consort.html>
A: [Kenneth Whistler]
The short answer is that Joe Becker (Xerox) and Lee Collins (Apple) were
highly instrumental in getting the ball rolling on this, and the
preliminary work they did, primarily on Han unification, dated from 1987.
However, "the Unicode project" had many beginnings -- many points where you
could mark a milestone in its early development. And the Unicode Consortium
celebrated a number of 10-year anniversaries, starting from 1998 and
continuing through last year.
A: [Joseph Becker]
Don't forget Mark Davis (then of Apple), who was more than highly
instrumental in getting the ball rolling!
And, don't forget my "Unicode '88" manifesto, which was the clear
intentional inception of Unicode as a specific initiative. I drafted it in
February 1988, after the enthusiastic reception of my Unicode proposal at
Uniforum, and the final draft is dated August 1988. Since the Consortium has in
fact handed it out as marking the start of Unicode, I think its mention
might be clarified in our official history, which currently says:
"September 1988 ... Becker later presents paper on Unicode to ISO WG2."
A: [Nelson H.F. Beebe]
I remember reading this article more than 15 years ago, and being impressed
by the possibilities that it represented:
@String{j-SCI-AMER = "Scientific American"}
@Article{Becker:1984:MWP,
author = "Joseph D. Becker",
title = "Multilingual Word Processing",
journal = j-SCI-AMER,
volume = "251",
number = "1",
pages = "96--107",
month = jul,
year = "1984",
CODEN = "SCAMAC",
ISSN = "0036-8733",
bibdate = "Tue Feb 18 10:44:43 MST 1997",
bibsource = "Compendex database",
abstract = "The advantages of computerized typing and editing are now being
extended to all the living languages of the world. Even a complex script
such as Japanese or Arabic be processed.",
acknowledgement = ack-nhfb # " and " # ack-rc,
affiliationaddress = "Xerox Office Systems Div, Palo Alto, CA, USA",
classification = "723",
journalabr = "Sci Am",
keywords = "Character Sets; data processing; word processing",}
It was followed up by this more formal one:
@String{j-CACM = "Communications of the ACM"}
@Article{Becker:1987:AWP,
author = "Joseph D. Becker",
title = "{Arabic} word processing",
journal = j-CACM,
volume = "30",
number = "7",
pages = "600--610",
month = jul,
year = "1987",
CODEN = "CACMA2",
ISSN = "0001-0782",
bibdate = "Thu May 30 09:41:10 MDT 1996",
bibsource = "http://www.acm.org/pubs/toc/",
URL = "http://www.acm.org/pubs/toc/Abstracts/0001-0782/28570.html",
acknowledgement = ack-nhfb,
keywords = "algorithms; design; documentation; human factors; measurement",
review = "ACM CR 8902-0084",
subject = "{\bf H.4.1}: Information Systems, INFORMATION SYSTEMS
APPLICATIONS, Office Automation, Word processing. {\bf J.5}: Computer
Applications, ARTS AND HUMANITIES, Linguistics. {\bf I.7.1}: Computing
Methodologies, TEXT PROCESSING, Text Editing, Languages.",}
The latter is not in unicode.bib, but will soon be.
--- --- --- ---
Q: Is it true that Han unification was the core of Unicode, and that the idea
of a universal encoding came afterwards?
A: [Kenneth Whistler]
The effort by Xerox and Apple to do a Han unification was key to the
motivation that eventually led to a serious effort to actually *do* Unicode
and then to establish the Unicode Consortium to standardize and promote it.
However, the idea of a universal encoding predated that considerably. In
some respects the Xerox Character Code Standard (XCCS) was a serious attempt
at providing a universal character encoding (although it did not include a
unified Han encoding, but only Japanese kanji). XCCS 2.0 (1980) contained,
in addition to Japanese kanji: Latin (with IPA), Hiragana, Bopomofo,
Katakana, Greek, Cyrillic, Runic, Gothic, Arabic, Hebrew, Georgian,
Armenian, Devanagari, Hangul jamo, and a wide variety of symbols. The early
Unicoders mined XCCS 2.0 heavily for the early drafts of Unicode 1.0, and
always regarded it as the prototype for a universal encoding.
Additionally, you have to consider that the beginning of the ISO
project for a Multi-octet Universal Character Set (10646) predated the
formal establishment of Unicode. Part of the impetus for the serious work
to standardize Unicode was, of course, discontent with the then architecture
of the early drafts of 10646.
--- --- --- ---
Q: Who invented the name "Unicode", and when?
A: [Kenneth Whistler]
This one has a definitive answer: Joe Becker coined the term, for "unique,
universal, and uniform character encoding", in 1987. First documented use
is in December, 1987.
A: [Nelson H. F. Beebe]
On the origin of the name Unicode, my bibliography at
<http://www.math.utah.edu/pub/tex/bib/index-table-u.html#unicode>,
<ftp://ftp.math.utah.edu/pub/tex/bib/unicode.*> has this to say:
Historical note: a library search on the name ``Unicode'' turns up
several entries that predate its use for an international computer
character set standard. These include:
- ``Unicode'': the universal telegraphic phrase-book, London (1889, 1896,
1901, 1910).
- Unicode: three-letter difference telegraphic code, Prague, Czechoslovakia
(1956, 1967).
- UNICODE automatic coding for UNIVAC scientific data automation system
1103A or 1105, Sperry Rand Corporation. Univac Division (1959).
- Atle Grahl-Madsen, UNICODE 72: - Two-letter, three-letter and numerical -
country codes, Bergen, Norway (1971, 1972).
- David L. Szekely, Unicode: ein Verfahren zur Optimierung der begrifflichen
Denkleistung: eine Einführung in die ``vereinheitlichte Wissenschaft'',
Basel (1979).
- U. S. Dept. of Health and Human Services, ``Food protection unicode''
(1988).
- Unicode single transition time coding (1976).
- Unicode state assignment techniques (1996).
My suspicion is that someone remembered the UNIVAC UNICODE system
when the name was being selected for the ISO 10646 companion project, but
let's let the people involved respond. I'm only guessing.
--- --- --- ---
Q: When did the ISO 10646 project start?
A: [Kenneth Whistler]
Unfortunately, the document register for early WG2 documents doesn't have
dates for all the early documents, and I don't have all the early documents
to check. But...
The 4th meeting of WG2 was held in London in February, 1986. The
first three meetings were in Geneva, Turin, and London, respectively. That
puts the likely timeframe for the Geneva meeting, and the establishment of
WG2 by SC2, at about 1984. The *only* project for WG2 was 10646.
Some of the older oldtimers on the list may have more exact
information about the early WG2 work.
A: [Tim Greenwood]
A paper that I wrote ("International Character Sets - the 7/8 bit story")
for an April 1985 conference at Digital references a note from Masami
Hasegawa, the original editor of 10646. This note was dated 17 October
1984. Masami's paper "Towards Multi-Lingual Data Processing" for the same
conference has the paragraph
'In the plenary meeting of TC97/SC2 of ISO, which is a sub-committee
for information coding, it was decided that an International Standard is
needed for a two byte graphic character set. Thus a working group WG2,
two-octet graphic, was formed to write a draft proposal.'
--- --- --- ---
Q: When did Unicode and ISO 10646 merge?
A: [Kenneth Whistler]
It wasn't a single date that can be pointed to, like the signing of an
armistice. In some respects, Unicode and ISO 10646 are *still* merging, as
modifications and amendments to deal with niggling little architectural
edge cases are worked out.
However the key dates were:
January 3, 1991. Incorporation of the Unicode Consortium, which
signalled to SC2 that the Unicoders were serious in their intentions.
May, 1991. Meeting #19 of WG2 in San Francisco. An ad hoc meeting
took place between WG2 members and some Unicoders, which paved the way for
the later "merger" of the standards.
June, 1991. The 10646 DIS 1 was defeated in its balloting. This
left an architectural compromise with the Unicode Standard as the only
reasonable way forward; the Unicode Standard was at that point in copy edit
and about to go to press.
June 3, 1991. The date of "10646M proposal draft to merge Unicode
and 10646", by Ed Hart. This was a key document in the resulting merger of
features.
August, 1991. The Geneva WG2 meeting accepted Han unification and
combining marks, dropped the byte-by-byte restrictions on code values for
UCS-2, and accepted the Unicode repertoire additions. From that point
forward, the overall shape of what became ISO/IEC 10646-1:1993 was clear.
A: [Otto Stolz]
The merger was initiated by an informal meeting of Unicode and WG2 members
during the JTC1/SC2/WG2 meeting in San Francisco, California, USA, in May
1991. At that time, ISO DIS 10646 (the first one) was still in ballot, so no
formal discussion, let alone an agreement, was allowed by JTC1's rules.
By mid-July, DIS 10646 was formally voted down (P-members: 8 YES, 11
NO, 2 abstained; O-members: 1 YES, 3 NO, 0 abstained). 9 out of the 14 NO
votes mentioned the merger ("only one universal code") in their national
comments.
The merger and the basic architecture were agreed on at the
ISO/IEC JTC1/SC2/WG2 meeting in Geneva, Switzerland, August 19th through
23rd, 1991.
In October 1991, the ISO SC2 plenary (in Rennes, France) unanimously
authorized WG2 to issue a new DIS 10646 in January 1992 for a 4-month (i.e.
shortened) vote.
A: [Tim Greenwood]
See
<http://groups.google.com/groups?q=hasegawa+ISO+10646&hl=en&selm=10635%40sun103.crosfield.co.uk&rnum=2>
for a report on the first (or one of the first) merger meetings.
--- --- --- ---
Q: What is the name of the GB and JIS standards that have the same
repertoire as Unicode?
A: [Kenneth Whistler]
GB 13000 has the same repertoire as ISO/IEC 10646-1:1993. JIS X 0221 has the
same repertoire as ISO/IEC 10646-1:1993.
Those two were effectively national publications of 10646. You can
work out the correlations with Unicode from that.
GB 18030:2000 in principle has the same repertoire (but different
encoding) as ISO/IEC 10646-1:2000, i.e. the same as Unicode 3.0. (But there
were small problems in it.) However, the 4-byte form of GB 18030 maps all
Unicode code points, assigned or not, so it will (in theory, at least)
always have the same repertoire as Unicode.
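To make that last point concrete, here is a minimal sketch (not an
authoritative decoder) of why the 4-byte form tracks Unicode automatically:
the 4-byte sequences from 0x90 0x30 0x81 0x30 upward map linearly onto the
supplementary planes. The BMP portion of the 4-byte form uses range tables
and is not handled here.

# Sketch: decode a 4-byte GB 18030 sequence in the supplementary-plane range
# (0x90 0x30 0x81 0x30 .. 0xE3 0x32 0x9A 0x35), which maps linearly onto
# U+10000..U+10FFFF.  The BMP part of the 4-byte form is not covered.
def gb18030_4byte_to_codepoint(b1, b2, b3, b4):
    # b1, b3 take 126 values (0x81..0xFE); b2, b4 take 10 values (0x30..0x39).
    linear = (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126
              + (b3 - 0x81)) * 10 + (b4 - 0x30)
    base = ((0x90 - 0x81) * 10) * 126 * 10   # linear index of 0x90 0x30 0x81 0x30
    return 0x10000 + (linear - base)

# Example: 0x90 0x30 0x81 0x30 -> U+10000; Python's own gb18030 codec agrees.
assert gb18030_4byte_to_codepoint(0x90, 0x30, 0x81, 0x30) == 0x10000
assert b"\x90\x30\x81\x30".decode("gb18030") == "\U00010000"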
--- --- --- ---
Q: When did Unicode stop being "16 bits"? (I.e., when were surrogates
added?)
A: [Kenneth Whistler]
In terms of publication, with Unicode 2.0 in 1996. However, the decision was
taken by the UTC considerably before publication.
Amendment 1 to 10646-1 (UTF-16) was proposed to WG2 in WG2 N970,
dated 7 February 1994. Mark Davis was the project editor for that
amendment.
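For reference, the surrogate mechanism introduced by that amendment is a
simple arithmetic mapping; here is a minimal sketch of it (the standard's
text, of course, is normative):

# Sketch of the UTF-16 surrogate-pair mapping for supplementary code points.
def to_surrogates(cp):
    """Map U+10000..U+10FFFF to a (high, low) surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                          # 20 bits
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Example: U+10384 <-> D800 DF84
assert to_surrogates(0x10384) == (0xD800, 0xDF84)
assert from_surrogates(0xD800, 0xDF84) == 0x10384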
--- --- --- ---
Q: I can't remember the version when some scripts were added: Syriac,
Thaana, Sinhala, Tibetan, Myanmar, Ethiopic, Cherokee, Canadian Syllabics,
Ogham, Runes, Khmer, Mongolian, Yi, Etruscan, Gothic, Deseret, CJK ext. A,
CJK ext. B.
A: [Rick McGowan]
Tibetan was in 1.0, but was REMOVED in the merger with ISO 10646 (Unicode
1.1), and came back in a different form in Unicode 2.0. For the rest, you
should go to the Enumerated Versions page of the web site!
A: [Kenneth Whistler]
See pp. 968-969 of TUS 3.0.
Tibetan was in Unicode 1.0, then was removed. It was re-added, in a
new encoding, in Unicode 2.0.
Syriac, Thaana, Sinhala, Myanmar, Ethiopic, Cherokee, Canadian
Syllabics, Ogham, Runic, Khmer, Mongolian, Yi, CJK Extension A were added
in Unicode 3.0.
Old Italic (including Etruscan), Gothic, Deseret, and CJK Extension
B were added in Unicode 3.1.
A: [Mark Davis]
For when particular characters were added to Unicode, you can also consult
the new DerivedAge.txt, currently in the BETA at:
<http://www.unicode.org/Public/BETA/Unicode3.2/DerivedAge-3.2.0d2.txt>.
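For anyone who wants to automate that lookup, here is a small sketch of
reading DerivedAge.txt. It assumes the usual UCD line layout
("XXXX[..YYYY] ; <version> # comment"); the beta file above may of course
still change.

# Sketch: find the Unicode version in which a code point was first assigned,
# using DerivedAge.txt (assumes the usual "range ; version # comment" layout).
def load_ages(path):
    ranges = []
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()       # drop comments
            if not line:
                continue
            cps, version = [part.strip() for part in line.split(";")]
            if ".." in cps:
                start, end = (int(x, 16) for x in cps.split(".."))
            else:
                start = end = int(cps, 16)
            ranges.append((start, end, version))
    return ranges

def age_of(cp, ranges):
    for start, end, version in ranges:
        if start <= cp <= end:
            return version
    return None                                         # unassigned

# e.g. Tibetan code points around U+0F00 should report "2.0" (see above).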
--- --- --- ---
Q: Roughly, how many ideographs are in modern use in extensions A and B?
A: [Kenneth Whistler]
Not many. I'll defer to the IRG experts to make a guess there.
A: [Thomas Chan]
Recently you asked about estimates of usage of Plane 2 characters. Since a
large percentage are CNS 11643-1992 characters (and perhaps the oldest IT
source), that may provide a clue. In the "Concluding Remarks" section of
Christian Wittern's "Taming the Masses" [1], he notes that the higher CNS
planes (ignoring planes 1 and 2, which are in the BMP, and perhaps some
parts of plane 3) are rarely used in historic texts, and he expects even
lower usage in modern texts.
[1] <http://www.gwdg.de/~cwitter/cw/taming.html>.
--- --- --- ---
Q: Roughly, when will version 3.2 become official?
A: [Kenneth Whistler]
March, 2002.
--- --- --- ---
Q: Roughly, when will the version 4 book be published?
A: [Kenneth Whistler]
Currently still scheduled for March, 2003, but schedule slip is always a
possibility on a major publication project like this.
--- --- --- ---
Q: When was ASCII first published and by whom?
A: [Kenneth Whistler]
1967. By ANSI X3.4.
Actually, that was preceded by ASCII per se, the earliest form of
which was published as a standard in 1963 by ASA (American Standards
Association -- the predecessor to ANSI). But the 1963 version of ASCII had
some differences from what we now know as ASCII.
A: [Nelson H. F. Beebe]
That was about 1964 (a few months AFTER IBM System/360 was
announced; that delay is the reason we suffered the EBCDIC/ASCII mess for
over 30 years).
The best source for such early information is this book:
@String{pub-AW = "Ad{\-d}i{\-s}on-Wes{\-l}ey"}
@String{pub-AW:adr = "Reading, MA, USA"}
@Book{Mackenzie:CCS80,
author = "Charles E. Mackenzie",
title = "Coded Character Sets: History and Development",
publisher = pub-AW,
address = pub-AW:adr,
pages = "xxi + 513",
year = "1980",
ISBN = "0-201-14460-3",
LCCN = "QA268 .M27 1980",
bibdate = "Wed Dec 15 10:38:43 1993",
price = "US\$24.95",
series = "The Systems Programming Series",}
I checked my copy, and found references on pp. 423ff to ASCII-63
(``When ASCII became an approved American standard in 1963, it was not
complete.''), and to ASCII-65, ASCII-67, and USASCII-8 (1964).
A: [John G. Otto]
ANSI 1960s (I'm thinking 1964).
A: [Otto Stolz]
Some of your questions are probably answered in Roman Czyborra's WWW pages,
particularly in
<http://czyborra.com/unicode/standard.html>,
<http://czyborra.com/charsets/iso646.html>,
<http://czyborra.com/charsets/iso8859.html>,
<http://czyborra.com/charsets/cjk.html>,
<http://czyborra.com/charsets/codepages.html>.
--- --- --- ---
Q: What standard was current before ASCII? (Baudot, was it?) How many bits
did it use?
A: [Doug Ewell]
Before ASCII there was a wide variety of different encoding standards. Many
were designed on the basis of punched card codes or the character
repertoires on printer chains, and many did not seem to be "designed" at
all, but just thrown together. What was great about ASCII was that it was
the first encoding to be anywhere near "universal," even within the United
States.
You may have heard that EBCDIC predated ASCII, but that is only partially
true. ASCII, being designed as a national standard from the outset, went
through years of balloting and committee haggling. EBCDIC, designed by
IBM, went into production with comparatively little delay. That accounts
for the earlier widespread usage of EBCDIC, but in fact the two were
developed concurrently.
Some of the more popular character encodings that existed before
ASCII were FIELDATA, PTTC, and BCDIC (the 6-bit predecessor to EBCDIC).
The links to Roman Czyborra's and Dik Winter's Web sites are
valuable. Follow them, if you have not already done so. Also, there is a
book, "Coded Character Sets, History and Development," by Charles E.
Mackenzie, that goes into extraordinary detail about these early character
sets. The book is from 1980 and is out of print; I had to pay Amazon USD 66
for a slightly used copy. But you may find it in a technical library.
Another good source for information on early character sets is Frank
da Cruz, "Mr. Kermit."
A: [Frank da Cruz]
I don't have a lot to add to what's been said other than what's already
published in the character sets chapter of the C-Kermit book:
<http://www.columbia.edu/kermit/ck60manual.html>, in which the main items
of interest are some early Russian sets like DKOI and KOI-7, and the
original KOI-8, which we learned about when we visited the USSR in 1989.
An extremely detailed and thorough history of ASCII and EBCDIC can
be found in the Mackenzie book. For the full reference see number 52 in the
References section of: <http://www.columbia.edu/acis/history/>.
A: [John G. Otto]
The Baudot technique was, rather than having separate code points for
lower-case and capitals, to have a shift/un-shift character. The last I saw
it used was on S-100 based micro-computers that generally used 8-bit ASCII,
but were driving old Teletype machines.
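To make the shift/un-shift idea concrete, here is a small hypothetical
sketch of Baudot/ITA2-style decoding. The character tables are made-up
three-entry fragments for illustration only; consult a real ITA2 chart for
the actual assignments.

# Hypothetical sketch of shift-state decoding, Baudot/ITA2 style: the LTRS
# and FIGS codes switch a persistent mode instead of each character carrying
# its own case/bank.  The tables are illustrative fragments, not a real chart.
LTRS, FIGS = 0x1F, 0x1B
LETTERS = {0x01: "E", 0x03: "A", 0x10: "T"}    # tiny illustrative subset
FIGURES = {0x01: "3", 0x03: "-", 0x10: "5"}    # tiny illustrative subset

def decode(codes):
    table, out = LETTERS, []
    for c in codes:
        if c == LTRS:
            table = LETTERS            # switch to the letters bank
        elif c == FIGS:
            table = FIGURES            # switch to the figures bank
        else:
            out.append(table.get(c, "?"))
    return "".join(out)

# The same 5-bit value means "A" or "-" depending on the current shift state:
assert decode([0x03, FIGS, 0x03, LTRS, 0x03]) == "A-A"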
A: [Murray Sargent]
Btw, I didn't see anyone comment on BCD, which preceded EBCDIC and ASCII
and was the first encoding that I used back in 1962 on the IBM 709. It was
different from the old Hollerith stuff, but I don't remember the details
and I couldn't find it documented on the Internet. If you find out the
encoding, I'd like to see it for old times' sake. Alternatively I could ask
some of my old pals if they still have documentation kicking around
somewhere.
A: [Alistair Vining]
I just found: <http://www.cwi.nl/~dik/english/codes/stand.html>, whose
author (Dik Winter) notes that he 'stop[s] approximately where Roman
Czyborra starts'. Thai EBCDIC, JISCII, 6-bit ISO codes, ASCII-1963 etc.
Looks very thorough to me, but I wasn't there...
--- --- --- ---
Q: Did the ASCII standard expire, and when?
A: [Rick McGowan]
It has not "expired". It is balloted for maintenance every 5 years, and
continues to be re-affirmed.
A: [Kenneth Whistler]
No, it is still a standard.
A: [John G. Otto]
Not that I can tell.
--- --- --- ---
Q: When was ISO 646 published?
A: [Kenneth Whistler]
1972.
A: [Nelson H. F. Beebe]
From the bibliography of this book
@String{pub-DP = "Digital Press"}
@String{pub-DP:adr = "12 Crosby Drive, Bedford, MA 01730, USA"}
@Book{daCruz:1997:UCK,
author = "Frank {da Cruz} and Christine M. Gianone",
title = "Using {C-Kermit}",
publisher = pub-DP,
address = pub-DP:adr,
edition = "Second",
pages = "xxii + 662",
year = "1997",
ISBN = "1-55558-164-1",
LCCN = "TK5105.9.D33 1997",
bibdate = "Thu Jan 13 14:33:16 2000",}
ISO 8859 is dated 1987--1995, and ISO 646 is dated 1983. I have an
entry for the latter, but not the former, in my bibliography at
<http://www.math.utah.edu/pub/tex/bib/index-table-i.html#isostd>,
<ftp://ftp.math.utah.edu/pub/tex/bib/isostd.bib>. It reads:
@Book{ISO:1983:ISB,
author = "{International Organization for Standardization}",
title = "{ISO Standard 646}, 7-Bit Coded Character Set for
Information Processing Interchange",
publisher = pub-ISO,
address = pub-ISO:adr,
edition = "Second",
year = "1983",
ISBN = "????",
LCCN = "????",
bibdate = "Mon Feb 05 17:48:01 2001",
note = "Also available as ECMA-6.",
URL = "http://www.iso.ch/cate/d4777.html",
acknowledgement = ack-nhfb,}
The preamble to the bibliography file gives the Web address for ISO,
which should suffice to track down the exact references; I'll certainly
have to do that to fill the 8859 hole!
--- --- --- ---
Q: I think that ISO 646 expired. When?
A: [Kenneth Whistler]
No, it is still a standard. The current version is the ISO-646-IRV, revised
in 1991.
--- --- --- ---
Q: When was ISO 8859 published?
A: [Kenneth Whistler]
It comes in many parts, each of which has a separate publication date.
A: [Tim Greenwood]
The above paper {"International Character Sets - the 7/8 bit story"} has it
that the ECMA standard was approved in December 1984 and that ISO and ANSI
were approving it as the paper was written in early 1985.
--- --- --- ---
Q: When did the first double-byte encoding appear?
A: [John G. Otto]
At least the late 1960s. Control Data used a 6-bit byte on their 60-bit word
machines designed by Cray, but to get lower-case and a few more special
characters, they would use a 6/12 scheme in which a couple of characters
(caret ^ and at-sign @, IIRC) were borrowed to modify the next 6 bits to
mean lower-case, etc.
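The mechanism here differs slightly from the Baudot one above: instead of a
persistent shift state, an escape code modifies only the next 6-bit code.
The sketch below uses invented code values and an invented table purely to
illustrate the idea; it is not the actual CDC display code.

# Hypothetical sketch of a 6/12-style scheme: a reserved escape code means
# "treat the NEXT 6-bit code as its lower-case / extended counterpart".
# The escape value and table are invented for illustration only.
ESCAPE = 0o76
SIX_BIT = {0o01: "A", 0o02: "B", 0o03: "C"}    # invented 6-bit fragment

def decode_6_12(codes):
    out, i = [], 0
    while i < len(codes):
        if codes[i] == ESCAPE and i + 1 < len(codes):
            out.append(SIX_BIT.get(codes[i + 1], "?").lower())   # 12-bit pair
            i += 2
        else:
            out.append(SIX_BIT.get(codes[i], "?"))               # plain 6-bit
            i += 1
    return "".join(out)

# "A", escaped "b", then "C":
assert decode_6_12([0o01, ESCAPE, 0o02, 0o03]) == "AbC"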
--- --- --- ---
Q: Are OpenType fonts currently implemented in any platform other than
Windows?
A: [John H. Jenkins]
OpenType fonts work without modification on Mac OS X, in that the glyphs can
be displayed. Any Mac application can access the OT data in the font, parse
it, and process it appropriately using public functions. The one piece
still missing is automatic support for OT layout data in the system.
A: [Eric Muller]
FreeType implements OpenType, including layout. By construction, FreeType
only requires an ANSI C implementation, and was written with embedded
systems in mind. Thus, the answer to your question could be "all".
A: [John Hudson]
'OpenType support' means a number of different things.
Support for the font file format and rasterisation of the TT or CFF
outlines is widespread, including Windows, OSX (native), earlier Mac
systems (CFF only, using ATM), and implementations of FreeType.
Support for individual OpenType Layout typographic features varies
from application to application.
Support for script shaping and character-level pre-formatting, e.g. for
Indic scripts, is present in Windows apps that use Uniscribe for text
processing, and I believe the FreeType developers have also been working on
Indic shaping, although I am not sure whether this has been released yet.
A: [Alan Wood]
Yes. Apple supplies 4 Japanese OpenType fonts with Mac OS X - Hiragino Kaku
Gothic Pro, Hiragino Kaku Gothic Std, Hiragino Maru Gothic Pro and Hiragino
Mincho Pro.
Adobe supplies TektonPro with InDesign 1.5 for Mac OS 9.
--- --- --- ---