The ISO / Unicode Merger
Ed Hart Memo
Ed Hart memo
Accredited Standards Committee Doc. No.: X3L2/91-75
X3, Information Processing Systems Date: 31 May, 1991
X3L2, Codes and Character Sets Project: 396-D
Ref. Doc:
Reply to: Edwin Hart
To: X3L2 Applied Physics Laboratory
McClure Center 3-140
Johns Hopkins Road
Laurel, MD 20723
USA
Voice: +1 301 953-6926
FAX: +1 301 953-1093
Subject: Personal Contribution, Report on Informal Meeting to Discuss
10646-Unicode Merger
Action Requested: Information
Recently, we held an informal discussion between proponents of ISO-IEC
DIS 10646 (from JTC1/SC2/WG2) and Unicode (from the Unicode Consortium)
for the purpose of exploring the possibility of merging the two
incompatible codes into one code. We achieved a breakthrough because the
diverse group achieved consensus on several issues that divided 10646 and
Unicode. Although several issues remain to be resolved, and our proposal
needs to be accepted by the formal organizations involved, it is
appropriate to share this good news with you. We ask for your support
of this effort.
_________________________
Many information users and developers are concerned about the real
possibility that we will need to support two incompatible multi-octet codes, ISO
10646 and Unicode. Some may say that Unicode is not an international standard
and therefore deserves neither support nor recognition. However, we live in an
imperfect world where regardless of whether Unicode is an international standard
or not, many of us will have to choose to support it unless we do something
soon. For the reasons stated in the enclosed document, I believe that the world
is too small to have two incompatible multi-octet codes with the same goal. I
also believe that both DIS 10646 and Unicode complement each other and have
features valuable to a multi-octet code. Therefore, an international standard
that merges the best features of DIS 10646 and Unicode makes good sense to me,
and I hope to you also. That is my goal.
In May, the 10646 Working Group, JTC1/SC2/WG2, met in San Francisco,
California, USA. This appeared to be the perfect time and place to hold a
discussion between the 10646 proponents and the Unicode proponents. The results
of such discussions could be extremely useful in resolving issues if the DIS
10646 should fail to obtain a majority of the ballots. Although we wanted to
hold these discussions at the WG2 meeting, JTC1 rules prevented discussing any
changes to DIS 10646 while it was out for ballot. Accordingly, we did not
discuss any changes to DIS 10646 at the WG2 meeting. Rather, after the meeting
ended, we met informally to discuss merging the two codes into one.
I believe that we achieved a breakthrough because we achieved consensus on
several issues that divided 10646 and Unicode. This was particularly
encouraging because the participants presented a diverse industry cross-section.
We came from eight countries, over a dozen (12) different enterprises, included
both product developers and users, and represented both the 10646 and
Unicode codes. If it was a breakthrough that we had the discussions, it was a miracle
to get consensus among such a diverse group. The initial results are enclosed
for you to read and reach your own conclusions.
While encouraging as a first step, the proposal needs additional work. When
the proposal to merge DIS 10646 and Unicode is completed, I will submit it to
JTC1/SC2/WG2 for consideration. Meanwhile, I am making the draft available for
your consideration, your comments, and if you think it appropriate, your
support.
Thank you for your consideration.
_______________________________________________________________________________
Document: 10646M/91-01
Date: 30 May, 1991
Subject: Summary of Results of Informal Meeting to Discuss
Merging of DIS 10646 and Unicode into One Code
From: Edwin Hart, Moderator 10646M (Merger) Ad Hoc Group
Reply to: Edwin Hart
Johns Hopkins University
Applied Physics Laboratory
11100 Johns Hopkins Road
Laurel, MD 20723-6099
Electronic Mail: HART@APLVM.BITNET or
HART@APLVM.JHUAPL.EDU
Voice: +1 (301) 953-6926
Facsimile: +1 (301) 953-1093
This document represents the first draft of what we hope
will become a proposal to merge DIS 10646 and Unicode into
one code. The primary advantage of this proposal is that
it is built on consensus of people supporting ISO 10646 and
others supporting Unicode. We plan to submit a final
consensus document to WG2 for consideration at the WG2
editing meeting planned for August, 1991 in Geneva,
Switzerland. At that time, we plan to work within WG2 to
refine the 10646 standard.
Summary
We affirm our strong support of the effort by ISO-IEC
JTC1/SC2/WG2 to develop 10646. We believe that ISO with its open
and responsive procedures will give careful consideration to our
proposal to refine the DIS 10646. In addition, we believe that
the Unicode Consortium has provided valuable insight and
technical solutions to newer requirements. We also believe that
having a single international standard that incorporates the best
features of DIS 10646 and Unicode as outlined in this proposal is
far superior to having two incompatible standards with same goal.
Therefore, after the completion of the May, 1991 ISO-IEC
JTC1/SC2/WG2 meeting in San Francisco, California in the USA, the
delegates attended an informal meeting. At the meeting, we
discussed requirements to merge ISO-IEC DIS 10646 and Unicode.
The people attending the informal meeting included some who
favored the ISO 10646 code and others who favored Unicode. We
believed that achieving consensus among these people would lead
to a merger proposal more likely to be supported by ISO-IEC
JTC1/SC2 and the Unicode Consortium.
In view of the diverse views represented at the meeting, the
results are surprisingly positive. We succeeded in reaching a
consensus on major design issues that had previously separated
the DIS 10646 and Unicode codes and made them incompatible. We
believe that this proposal paves the way for a merger of the best
features of DIS 10646 and Unicode into one multi-octet code
standard. Yet, this is merely a first step; further work and
consensus are required to produce a final proposal. In summary,
although ISO and the Unicode Consortium have not yet endorsed
this proposal, it is promising because it was the result of a
consensus of many people who represented both the ISO 10646 and
Unicode Consortium efforts.
However, our work would have been almost impossible had it not
been preceded by the excellent proposals submitted to WG2 by
ECMA, Canada and China. To form our consensus, we used these
proposals and new information on the Chinese, Japanese and Korean
Joint Research Group (CJK-JRG) announced at the WG2 meeting in
San Francisco.
We believe this new proposal is very promising and those
attending agreed to work to build support for it within their
respective companies, and national and industry standard bodies,
including ECMA and the Unicode Consortium.
General Objectives
We adopted the following objectives for the group:
1. Create a proposal to merge the best features of DIS 10646
and Unicode such that the proposal is acceptable to both
ISO and the Unicode Consortium.
2. Increase cooperation between ISO-IEC JTC1/SC2 and the
Unicode Consortium.
3. Define action items and the timing to complete them.
Participants
Except for Mr. Jenkins, the following people participated in
the Wednesday afternoon discussions:
Jerry Andersen IBM, USA
Lloyd Anderson Ecological Linguistics, USA
Joseph Becker Xerox, USA
F. Avery Bishop Digital, USA
Willy Bohn University of Hanover, Germany
Mark Davis Apple, USA
Asmus Freytag Microsoft, USA
Joachim Friemelt Siemens, Germany
Edwin Hart SHARE Inc./Johns Hopkins University, USA
Masami Hasegawa Digital Japan
Huang, Weimin CESI, China
Olle Jarnefors Royal Institute of Technology, Sweden
John Jenkins Apple, USA
Bo Jensen IBM Denmark
Mike Ksar HP, USA
Takayuki Sato HP Japan
Isai Scheinberg IBM Canada
Karen Smith-Yoshimura The Research Libraries Group, USA
Michel Suignard Microsoft, France
J. G. Van Stee IBM, USA
Kenneth Whistler Metaphor, USA
Zhang, Zhoucai CCID, China
On Thursday, Mr. Jenkins joined the group but Mr. Stee and
Mr. Whistler were absent. In addition, Mr. Jenkins left before
voting, and Mr. Hasegawa, Mr. Ksar, and Mr. Bohn were unable to
stay for all the votes.
On Friday, except for Mr. Friemelt (who had to leave before
we concluded the meeting), the following participated in the
voting: Mr. Anderson, Mr. Bishop, Mr. Bohn, Mr. Freytag, Mr.
Friemelt, Mr. Hart, Mr. Hasegawa, Mr. Jenkins, Mr. Sato, Mr.
Scheinberg, and Mr. Suignard.
Advantages of Having Only One Multi-Octet Code Standard
Here is a list of advantages to having one global multi-octet
code standard:
1. Why should we be concerned about two standards?
a. Inevitable requirement to support both
i. 10646 because it is an international standard
ii. Unicode for compatibility with Unicode-based
products
b. Cost of supporting both
i. The cost to do both is probably very large
ii. Must consider the costs to convert between the two
c. Erosion of %single code standard% mind-set
i. If two, why not three? four? ten?
d. Diminishes advantages of either alone without the
other
i. Single code standard solves many problems that
would not be solved if we have two or more of them
ii. May introduce the requirement to switch between
the two
2. The importance of de-jure standards
a. Increasingly used as procurement requirements
i. Gives customer more options for interconnection of
products from different vendors
b. Integral part of vast, interlocking family of
standards, each assuming the others
c. Better acceptance, because every country can
participate
i. Not perceived as dominated by U.S.
3. Problems of code conversion
a. Must identify both the source and the target code, but
often no way to do this
b. Conversion is application/subsystem dependent, and it
often cannot be confined to one place (that is, it is
much more expensive)
c. Solving same problem in several places introduces
probability of getting some solutions out of
synchronization with others
d. An uncontrollable, moving target (that is, you never
own more than one of the two codes, you cannot control
repertoires, etc.)
e. Complicated by repertoire differences
f. No %right% way to manage the differences
i. Mismatch can range from minor irritation to
catastrophe
g. Further complicated by differences in character
semantics
i. No tested solution is known
ii. At best, makes translation even more difficult
4. The Costs of code conversion
a. Monetary cost of developing, testing, maintaining,
etc.
b. Diversion of human and other resources by developers
c. Performance and memory penalties (extra overhead)
d. Errors and other problems are inevitable
e. Customer dissatisfaction
f. Customer conversion requirements will divert resources
for creating local solutions
g. Forces tradeoffs between satisfying installed base and
meeting new market requirements
5. Other advantages
a. One reference source for the code
Areas of Consensus
1. Remove the %C0%and %C1% restrictions.
We support the ECMA proposal, point 1, %To remove the
restriction on the so-called C1 space.% This point is also
included in the Canadian proposal, and other national body
positions on DIS 10646 including the ones from China and the
US.
Vote Thursday: 17 for/ 0 against/ 2 abstain (Davis,
Freytag)
In addition, pending a careful review by computer
communication, systems, and applications experts, from ISO,
ECMA, CCITT, and within our enterprises, we believe it
desirable to allow encoding graphic characters in the %C0%
space presently reserved in DIS 10646. This refines point 2
from the Canadian proposal. Annex ____ provides more
details on this refinement (the %Bohn% refinement, named for
Willy Bohn, who proposed it) of the ECMA proposal.
Vote Thursday: 16 for/ 0 against/ 3 abstain (Bishop,
Hasegawa, Sato)
Removing the %C0% restriction in addition to removing the
%C1% restriction will provide for flexibility by allowing
the encoding of more characters in the base multilingual
plane that is the most important 2-octet plane for
interchange and interworking. A consequence of removing the
%C0% restriction is that 10646 must change the way 1-octet
control characters are encoded by placing the 1-octet
control character into the least significant octet of the
current compaction method and padding the most significant
octets to the width of the current compaction method. In
addition, the 1-octet compaction method must be adjusted to
ensure that the control characters are correctly handled.
2. Create an International Repertoire of Unified Chinese,
Japanese, and Korean Ideographs and Encode This Set of
Ideographs into the Base Multilingual Plane.
We propose a refinement to point 5 of the Canadian proposal.
We believe that coding an international repertoire of
unified Chinese, Japanese, and Korean ideographs in the base
multilingual plane is mandatory for international
interworking and processing efficiency. The encoding of the
international C/J/K repertoire must be completed by the end
of 1991. We propose to use the CJK-JRG results if it is
available in 1991; otherwise we propose to use the best
information available at that time.
Vote Thursday: 17 for/ 0 against/ 1 abstain (Ksar), 1
absent (Hasegawa)
Recent statements by the Japanese delegates to WG2 showed
their strong support for the CJK-JRG. From this
information, the group concluded that the unification of
Chinese, Japanese, and Korean ideographs so highly desired
by the international community is feasible. Providing that
WG2 continues to recognize the stated Japanese requirement
to encode its characters in its own 10646 plane, Japan
recognized the need for an international repertoire of
Chinese, Japanese, and Korean ideographs. A meeting of the
CJK-JRG has been called (Tokyo, July, 1991) to start
creating an international repertoire and ordering.
3. Allow the Option to Use Non-Spacing Marks.
Pending careful review by ISO TC46 and CCITT, we propose to
refine point iv) 2) of the ECMA proposal for floating
diacritical marks as follows: The third Code Extension
Level should specify:
a. In addition to diacritics, non-spacing marks should
include stress marks, tone marks, and those used for
text processing operations such as underlining or
mathematical notation for the name of a vector.
b. Non-spacing marks should follow the base character for
consistency.
c. Imaging and the order of multiple non-spacing
diacritics should follow well-defined rules. (See
Annex ____.)
d. To allow for compliance with future versions of 10646
that may encode additional pre-composed characters,
allow both encoding a character as a pre-composed
character or as a base character with one or more non-
spacing marks. (That is, delete the ECMA statement
%if the accented letter is already coded as a single
character, the alternative representation by means of
floating diacritical marks is not allowed.%) This
assumes that future revisions of 10646 will take
certain characters that used floating marks in the
current version of 10646 and encode them as pre-
composed characters.
e. All sequences of codes should be allowed because of
the difficulty of enforcing a legislation against
certain sequences of code positions.
Vote Thursday: 16 for/ 0 against/ 1 abstain (Sato)/ absent
(Bohn, Hasegawa, Ksar)
4. Define the merger (10646M) of DIS 10646 and Unicode as a 4-
octet code.
Vote Thursday: 16 for/ 0 against/ 0 abstain/ absent
(Hasegawa, Ksar, Bohn)
We support the 4-octet definition of the merger of DIS 10646
and Unicode. Using 4-octets allows the flexibility needed
to expand the code repertoire to meet all foreseeable
requirements.
5. Location of Space for Presentation Forms
We would support a drastic reduction or elimination of the
presentation forms in the base multilingual plane while
retaining codes necessary to transcode existing standards in
plain text. People were concerned that DIS 10646 reserved
too much unused code space in the base multilingual plane.
A final determination of the presentation codes will be made
in consultation with Arabic and other experts.
Vote Thursday: 15 for/ 0 against/ 1 abstain (Becker)
6. Combine the Repertoires of DIS 10646 and Unicode into the
Merged Code.
We propose that the repertoire of the base multilingual
plane of the merged code, 10646M, be derived from a superset
composed of the union of the repertoires of DIS 10646 and
Unicode; for example, the superset should include pre-
composed Latin, Greek, Hangul, Vietnamese, and additional
symbols.
Vote Friday: 10 for/ 0 against/ 0 abstain
7. Simplify the Compaction Methods.
We are concerned about the complexity of the DIS 10646
compaction forms. For simplicity, we propose that there be
several parts to the standard:
Part 1: General introduction, terminology, etc.
Part 2: The base multilingual plane (BMP). This part of
the standard will specify the 2-octet
implementation of the BMP. Other parts are not
required for conforming implementations of the
BMP. This part may be implemented without
announcers.
Part 3: The full four-octet canonical form.
Part 4: Mechanisms for other compaction methods to be
determined.
In the absence of other introducers for 10646 data, Part 2
should be assumed.
Vote Friday: 10 for/ 0 against/ 0 abstain
8. Make Annex H Part of the 10646 Conformance statement.
We recommend moving Annex H of DIS 10646 into the main body
of the standard and making it a requirement for conformance.
Vote Friday: 9 for/ 0 against/ 0 abstain/ 1 absent (Bohn)
Due to time limitations we were unable to discuss and make
recommendations to resolve the following differences between DIS
10646 and Unicode.
9. Coding of Semantics versus Shape.
For example, parenthesis, brackets and braces are coded as
open/close in Unicode, and as left/right in DIS 10646.
10. Using Any Multi-Octet Coded-Character-Set Will Require
Program Changes.
The following two examples show that neither DIS 10646 nor
Unicode may be blindly used with the C programming language.
a. C Language Wide-Character (wchar_t) Model
Padding ISO 8859/1 characters with the decimal 032 value
precludes the direct use (without conversion) of 10646
compaction forms 2-4 as the wchar_t data type in the C
programming language. This is point 3 in the Canadian
position statement.
b. NULL Characters in the C Language
Unicode may use 000 as the first or second octet of the 2-
octet code. The C language uses the NULL (000) octet as a
character string terminator for 1-octet character data.
Therefore, C programs must be rewritten to use Unicode.
11. Other Issues
The above list of differences between Unicode and DIS 10646
is not exhaustive. Other lower priority issues also need to
be considered.
Action Items to Promote the Agreement
1. Participants will lobby for this proposal with their country
and company constituencies. (All, immediately)
2. Ask the Unicode Consortium member companies to place a
discussion of this document on the agenda of the next
Unicode Consortium meeting on June 7. The Unicode
Consortium should formally state that it agrees or disagrees
with the general direction and state any of its concerns
with specific points. (Whistler)
3. Form a joint editing committee to help draft the final 10646
merged standard. (Freytag provides updated code tables,
Hasegawa provides updated structure and text, 15 Aug. list
the areas of the DIS 10646 document that would require
changes)
4. For closer cooperation between ISO and the Unicode
Consortium, we encourage the Unicode Consortium to pursue
becoming a liaison member of JTC1/SC2, and for JTC1/SC2 to
accept the Consortium as a liaison member. (Unicode
Consortium, Aug., 1991)
5. Send this report to the national bodies and ask them to
consider our consensus agreement in their votes on ISO-IEC
DIS 10646. (Hart, 29 May)
6. Provide a list of the advantages of having one multi-octet
code rather than two. (Andersen, done)
7. (Point 1) Coordinate an investigation of the impact of
coding in C0. (Scheinberg, 15 Aug.)
8. (Point 2) Using formal minutes and other information,
summarize the Tokyo CJK-JRG meeting. (Collins, 31 July)
9. (Point 3) Provide the Annex describing the rules to be used
with multiple non-spacing marks. (Whistler, 9 June)
10. (Point 3) Coordinate review by ISO TC46 and CCITT of
proposed use of non-spacing marks. (Smith-Yoshimura (TC46)
and Friemelt (CCITT), Aug. 15)
11. (Point 5) Coordinate a review of the need to reserve so
large an area for presentation forms for Arabic and other
scripts on the base multilingual plane. (Ksar and Friemelt,
15 Aug.).
12. (Point 6) Investigate need for composed characters from
Cyrillic and Polytonic Greek. (Why did WG2 include them in
the DIS?) (Whistler, 15 Aug.)
13. (Point 7) Coordinate an investigation of which compaction
methods to propose in %Part 4%. (Jarnefors, 15 Aug.)
14. Create 10646M electronic distribution list. Send electronic
mail message to Hart [** see below] to subscribe. (Hart, done)
(End of Document)