Date: Wed, 15 Sep 93 13:06:12 +0200
From: Olle Jarnefors <ojarnef@admin.kth.se>
To: Glenn Adams <glenn@metis.com>, Edwin Hart <hart@aplvm.bitnet>,
        Sten G Lindberg <stenl@stovm1.vnet.ibm.com>,
        keld@dkuug.dk (Keld Jørn Simonsen)
Subject: A short technical overview of ISO 10646/Unicode
 
Dear universal coders,
 
I have put together a short (less than six pages) technical
overview of UCS and Unicode, which I plan to distribute freely on
the Internet. I would very much appreciate any suggestions for
improvement or other comments on this text, which I can
incorporate in the final version. Here are some questions that
especially concern me:
 
-- Is the text short enough to have a chance to be read by
   interested people?
 
-- Are important aspects of UCS or Unicode missing?
 
-- Are some aspects treated in too great detail?
 
-- Can some of the explanations be made easier to understand
   (without becoming too long)?
 
-- Bad English?
 
The current version of the text follows. There is also a Swedish
version, not included here. You can fetch the latest versions by
anonymous FTP to   othello.admin.kth.se
in directory       pub/misc/ucs
These files are of interest:
 
unicode-iso10646-oview.txta       latest English version
unicode-iso10646-oview-diff.txta  " with changes from previous version marked
unicode-iso10646-or.txt1          latest Swedish version (in Latin-1)
unicode-iso10646-or-diff.txt1     " with changes from previous version marked
unicode-iso10646-or.txts          latest Swedish version (in Swedish 646)
unicode-iso10646-or-diff.txts     " with changes from previous version marked
 
Best regards,
 
/Olle
 
-----
 
(unicode-iso10646-oview.txta Ap4 930914 OJ)
 
 
 
 
A SHORT OVERVIEW OF ISO 10646 AND UNICODE
 
 
By: Olle Jarnefors          Royal Institute of Technology, Sweden
    <ojarnef@admin.kth.se>                   Fax:  +46 8 10 25 10
 
 
The purpose of this text is to give a brief technical overview
of the new character set standard ISO 10646. I have omitted
descriptions of the history of the standard as well as general
talk about why a standard of this type is badly needed.
 
Previous knowledge: The reader should have some knowledge about
coded character sets, have seen an ASCII table, and know of some
8-bit character sets, like Latin-1 (ISO 8859-1).
 
 
1. Most important facts
-----------------------
 
ISO 10646 is a new character set standard that was published in
1993 by the International Organization for Standardization
(ISO). Its name is "Universal Multiple-Octet Coded Character
Set", *UCS*, and it's the first coded character set with the
ambition to eventually include all characters used in all the
written languages in the world (and, in addition, all
mathematical and other symbols). The current first edition
covers at least all major languages and all commercially
important languages. To make it possible to represent every
character with a unique bit sequence, the UCS representation
must consist of at least two octets (16 bits).
 
UCS is intended to be used both for internal data representation
in computer systems and for data communication. New operating
systems and computers using UCS have already been released by
Microsoft (Windows NT) and Apple (Newton). The standard has,
however, been subject to strong criticism both from data
communication experts (Internet) and from some groups in Japan.
 
ISO 10646 is a fundamental standard affecting almost all parts
of information technology, but UCS is only a coded character
set, not a complete system for text representation. For several
important aspects of text, as it's treated in modern text
processing programs, UCS needs to be supplemented by further
standards or rules. Some examples of this are italic text,
tables, mathematical formulas, fonts, and document structure.
(The simple kind of text that can be represented by UCS or other
character set standards alone -- a linear sequence of
characters, with a fixed division into lines and pages -- is
called *plain text*.)
 
*Unicode* is a coded character set specified by a consortium of
major American computer manufacturers. The latest version, 1.1,
is identical to the 2-octet form of UCS (on implementation
level 3).
 
 
2. The structure of the coding space
------------------------------------
 
The first version of UCS includes 34 203 different characters.
Of these, 21 204 are ideographic characters used in Chinese,
Japanese and Korean. Two octets give 65 536 different character
codes. To guarantee that the coding space will not be filled
even in the future, a *4-octet form* of UCS (UCS-4) is also
defined. So far, only the *2-octet form* (UCS-2) is used in
practice.
 
The 65 536 positions in the 2-octet form of UCS are divided
into 256 *rows* with 256 *cells* in each. The first octet of a
character representation gives the row number, the second the
cell number. The first row, row 0, contains exactly the same
characters as ISO 8859-1 and the first 128 characters are thus
the ASCII characters. If you have the octet representing an ISO
8859-1 character, you will get its UCS representation by
putting an octet with value 0 in front of it. UCS includes the
same control characters as ISO 8859 and these are also in row
0. An overview of the content of all rows is found in the
annex.
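
The row and cell arithmetic described above can be sketched in a
few lines of Python. This is only an illustration under the
assumptions stated in the text (row octet first, then cell
octet); the function names are invented for this example and are
not taken from ISO 10646.

```python
def ucs2_octets(code):
    """Split a 2-octet UCS code position into (row, cell)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 2-octet code position")
    return code >> 8, code & 0xFF


def latin1_to_ucs2(data):
    """Put an octet with value 0 in front of every ISO 8859-1 octet."""
    out = bytearray()
    for octet in data:
        out += bytes([0, octet])
    return bytes(out)


print(ucs2_octets(0x01FA))              # (1, 250)
print(latin1_to_ucs2(b"\xc5").hex())    # 00c5
```

Going in the other direction is only possible for characters in
row 0, since the row octet is simply dropped.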
 
In the 4-octet form 2 147 483 648 different characters can be
represented. (The first bit of the first octet shall be 0 so
only 31 of the 32 bits are used by UCS.) This coding space is
subdivided into 128 *groups*, each containing 256 planes. The
first octet in a character representation indicates the group
number and the second the plane number. The third and fourth
octets give the row number and the cell number of the
character. Those characters that can be represented by the
2-octet form of UCS belong to plane 0 of group 0, which is
called the Basic Multilingual Plane (BMP). If you have the
2-octet representation of a character, you get the 4-octet
representation by putting two octets with the value 0 first.
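
As a rough sketch of the 4-octet structure, the following Python
fragment splits a code into its four octets and widens 2-octet
data to the 4-octet form. The helper names are made up for this
illustration; only the layout (group, plane, row, cell) comes
from the description above.

```python
def ucs4_octets(code):
    """Split a 4-octet UCS code into (group, plane, row, cell)."""
    if not 0 <= code <= 0x7FFFFFFF:      # the first bit must be 0
        raise ValueError("not a 31-bit UCS-4 code")
    return (code >> 24, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF)


def ucs2_to_ucs4(data):
    """Put two octets with value 0 in front of every 2-octet character."""
    if len(data) % 2:
        raise ValueError("2-octet data must have an even number of octets")
    return b"".join(b"\x00\x00" + data[i:i + 2]
                    for i in range(0, len(data), 2))


# 128 groups x 256 planes x 256 rows x 256 cells
print(128 * 256 * 256 * 256)                        # 2147483648
print(ucs4_octets(0x000001FA))                      # (0, 0, 1, 250)
print(ucs2_to_ucs4(bytes.fromhex("01FA")).hex())    # 000001fa
```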
 
 
3. What is accepted as a character in UCS?
------------------------------------------
 
When deciding which graphic characters should be included in UCS
the most important principle has been that a new character to be
accepted must differ from all already included characters both
in meaning and in appearance.
 
Alternative graphic forms of existing characters (font variants,
glyphs) are consequently not given their own character codes. In
Chinese, Japanese and Korean there is a very large number of
ideographic characters with the same historical origin and only
very minor differences in appearance. These national variants of
the same ideograph have been given a joint UCS character code, a
solution which is known as *CJK unification* and is
controversial in Japan.
 
Nor is a completely new way of using an existing character --
the same appearance but a different meaning -- sufficient
justification for getting a new coded representation in UCS. The
old punctuation symbol asterisk, "*", has in recent years also
been used as a multiplication sign in various programming
languages. This case is regarded as two different uses of the
same character, which is given only one UCS representation.
 
Two important exceptions to the rule that a character is a set
of those graphic forms which can be used with the same
meaning(s) are made in UCS:
 
-- Letters with exactly the same appearance that occur in
   several different scripts are given different coded
   representations. There are, for example, one Latin "P", one
   Greek "P" (capital rho), and one Cyrillic "P" (Cyrillic R).
 
-- A comparatively small number of improper characters which are
   included in other practically important coded character sets
   are accepted in UCS. This is to make possible the fully
   reversible conversion of data coded in these character sets
   to UCS and back again to the original character set. Such
   characters are called *compatibility characters*. One example
   is the character SUPERSCRIPT TWO which can be found in UCS
   only because it's included in the character set ISO 8859-1.
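
The two exceptions can be inspected with Python's standard
unicodedata module. Note the assumption: that module reflects a
much later version of the Unicode character database than the
1993 edition described here, so this is a sketch of the idea
rather than a check against ISO 10646-1:1993 itself. The helper
name describe is invented for this example.

```python
import unicodedata


def describe(ch):
    """Return (code position, row number, character name) for one character."""
    code = ord(ch)
    return f"{code:04X}", code >> 8, unicodedata.name(ch)


# Three letters that look alike but belong to different scripts.
for ch in ("\u0050", "\u03A1", "\u0420"):
    print(describe(ch))

# A compatibility character: the database marks SUPERSCRIPT TWO as a
# superscript variant of DIGIT TWO (0032).
print(unicodedata.decomposition("\u00B2"))    # <super> 0032
```

The three letters end up in rows 00, 03 and 04, which agrees
with the Latin, Greek and Cyrillic rows listed in the annex.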
 
What is said above is only a general outline of the principles
used to identify individual characters in UCS and Unicode. These
are unfortunately not described at all in the text of ISO 10646.
In many specific cases it's of course not at all clear how to
apply them. Quite a number of the decisions made are fairly
arbitrary.
 
One important feature of UCS is that a large number of code
positions are reserved for *private use characters*. No future
edition of ISO 10646 will use these positions. There is room for
6 400 private characters in the 2-octet form, and more than 500
million in the 4-octet form.
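
The figure for the 2-octet form follows directly from the rows
that the annex lists as Private Use Area (E0--F8). A minimal
Python sketch of that arithmetic, with names chosen only for
this example:

```python
# Rows E0--F8 of the Basic Multilingual Plane, as listed in the annex.
PRIVATE_USE_ROWS = range(0xE0, 0xF8 + 1)
CELLS_PER_ROW = 256


def is_private_use(code):
    """True if a 2-octet code position lies in the private use rows."""
    return (code >> 8) in PRIVATE_USE_ROWS


print(len(PRIVATE_USE_ROWS) * CELLS_PER_ROW)    # 6400
print(is_private_use(0xE000), is_private_use(0xF900))
```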
 
 
4. Implementation levels
-------------------------
 
Three different *implementation levels* of UCS are defined in
ISO 10646. On the two lower levels certain parts of full UCS,
which complicate the implementation of text-processing programs,
are excluded.
 
-- The simplest implementation level 1 works exactly like the
   older simple coded character sets like ISO 8859: Each graphic
   character occupies one position and moves the active position
   one step in the writing direction (even though the movement
   needn't be constant; it isn't when a proportional font is
   used). This model works well for, among others, the Latin,
   Greek, and Cyrillic scripts. The composite characters
   (consisting of a base letter and one or more diacritical
   marks) used in a certain language must, however, be included
   in the character set as single characters in their own right.
   UCS includes the composite characters of all official
   languages and also of most other languages with a
   well-established orthography using these scripts.
 
   Other languages that can be handled on implementation level 1
   are Japanese and Chinese. This is also possible for the
   Arabic and Hebrew scripts but there is an extra complication:
   These scripts are normally written from right to left, but
   when words in e.g. the Latin script are included in the text,
   these are written in their normal direction, from left to
   right. In computer memory all characters are stored in the
   *logical order*, i.e. the order in which the sounds of the
   words are pronounced and the letters normally are input. Two
   alternative methods to handle *bi-directional* text can be
   used together with UCS, one based on the international
   standard ISO 6429 and one defined for Unicode.
 
-- On implementation level 2 the South Asian scripts, e.g.
   Devanagari, can also be handled. These have further
   implications for display software, since in many cases both
   the appearance and the position of a certain letter are
   determined by the nearest surrounding letters.
 
-- On the full implementation level 3 programs must also be able
   to handle independent *combining characters*, e.g. accents
   and other diacritical marks that are printed over, under or
   through ordinary letters. Such characters can be freely
   combined with other characters and UCS sets no limit on the
   number of combining characters attached to a base character.
   A complication for programmers is that on this level some
   composite characters can each be coded in several different
   ways. As an example, the Danish letter "A with ring above and
   acute accent" can be represented in three different ways:
 
      01FA
      (the simple representation that must be used on level 1 and 2)
 
      00C5 0301
      ("A with ring above" + combining acute accent)
 
      0041 030A 0301
      ("A" + combining ring above + combining acute accent)
 
   (The coded representations in UCS are usually given in
   hexadecimal notation. 01FA indicates two octets, first the
   octet with the value 1, then the octet with the hex value FA,
   i.e. the decimal value 250.)
 
   Formally, the first alternative above is considered as a
   representation of a single *precomposed* character, while the
   second and third alternatives represent different *composite
   sequences* of several characters. Programs should, however,
   treat these three alternatives as fully equivalent
   representations of the same thing.
 
   Implementation level 3 is necessary for full support of the
   Korean Hangul script and also for full support of IPA, the
   International Phonetic Alphabet.
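
One way a program might treat the three level 3 representations
of "A with ring above and acute accent" as equivalent is to
decompose each of them fully before comparing. The sketch below
uses the normalization forms in Python's unicodedata module.
This is an assumption made for illustration: those normalization
forms were defined by Unicode later and are not part of
ISO 10646-1:1993. The function name same_text is invented here.

```python
import unicodedata

FORMS = [
    "\u01FA",                # precomposed character
    "\u00C5\u0301",          # A with ring above + combining acute accent
    "\u0041\u030A\u0301",    # A + combining ring above + combining acute
]


def same_text(a, b):
    """Compare two strings after full decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(all(same_text(FORMS[0], form) for form in FORMS))    # True
print(FORMS[0] == FORMS[2])                                # False

# The hexadecimal notation 01FA read as two octets.
print(list(bytes.fromhex("01FA")))                         # [1, 250]
```

A plain comparison of the coded representations reports the
forms as different, which is exactly the complication for
programmers mentioned above.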
 
 
5. Data communication problems
------------------------------
 
Many data communication protocols treat octets with values in
the range 0-31 (which in 7-bit and 8-bit character sets
represent control characters) specially. Some octets that in
ASCII represent graphic characters cannot be included in
file names in some important operating systems. For these
reasons algorithmic transformation methods have been defined
for UCS data. The most important is called *UTF-2*. It
replaces the coded representations 0000-007F (the ASCII
characters) with the corresponding octet in the range hex
00-7F. The other coded representations of UCS are transformed
to sequences of two or more octets in the range hex 80-FF.
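
The two properties just described can be illustrated with
Python's built-in "utf-8" codec. This rests on an assumption
that should be stated loudly: the codec is used here as a
stand-in, on the understanding that the transformation called
UTF-2 in this text is the one that later became known as UTF-8.
The function name transform is invented for this example.

```python
def transform(text):
    """Encode text with the UTF-8 codec (assumed stand-in for UTF-2)."""
    return text.encode("utf-8")


for ch in ("\u0041", "\u00C5", "\u01FA"):
    octets = transform(ch)
    print(f"{ord(ch):04X} -> {octets.hex()}")

# ASCII characters keep their octet value; all other characters become
# two or more octets, each in the range hex 80-FF.
print(transform("A") == b"A")
print(all(octet >= 0x80 for octet in transform("\u01FA")))
```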
 
 
6. Sources
----------
 
UCS is defined in:
 
   ISO/IEC International Standard 10646-1:1993(E):
   Information technology -- Universal Multiple-Octet Coded
   Character Set (UCS) -- Part 1: Architecture and Basic
   Multilingual Plane.
   ISO, 1993.
 
Unicode version 1.0 is defined in two books:
 
   The Unicode Consortium:
   The Unicode Standard Worldwide Character Encoding. Version 1.0.
   Volume 1 (Architecture, non-ideographic characters)
   Addison-Wesley, 1991
 
   The Unicode Consortium:
   The Unicode Standard Worldwide Character Encoding. Version 1.0.
   Volume 2 (Ideographic characters)
   Addison-Wesley, 1992
 
The changes made between version 1.0 and version 1.1 are
specified in:
 
   Unicode Technical Report #4: The Unicode Standard, Version 1.1
   The Unicode Consortium, 1993 (Prepublication Edition)
 
A good account of the history of ISO work on multi-octet
character sets and the merger between ISO 10646 and Unicode can
be found in:

   Michael Y. Ksar:
   Untying tongues. ISO/IEC breaks down computer barriers in
      processing worldwide languages
   ISO Bulletin, No. 6 (June 1993)
 
 
Annex: Overview of the Basic Multilingual Plane (group=00, plane=00)
--------------------------------------------------------------------
_______ ___________________________________________________________________
 
Row(s)  Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________
 
======= A-ZONE (alphabetical characters and symbols) =======================
00      (Control characters,) Basic Latin, Latin-1 Supplement (=ISO 8859-1)
01      Latin Extended-A, Latin Extended-B
02      Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03      Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04      Cyrillic
05      Armenian, Hebrew
06      Basic Arabic, Arabic Extended
07--08  (Reserved for future standardization)
09      Devanagari, Bengali
0A      Gurmukhi, Gujarati
0B      Oriya, Tamil
0C      Telugu, Kannada
0D      Malayalam
0E      Thai, Lao
0F      (Reserved for future standardization)
10      Georgian
11      Hangul Jamo
12--1D  (Reserved for future standardization)
1E      Latin Extended Additional
1F      Greek Extended
20      General Punctuation, Super/subscripts, Currency, Combining Symbols
21      Letterlike Symbols, Number Forms, Arrows
22      Mathematical Operators
23      Miscellaneous Technical Symbols
24      Control Pictures, OCR, Enclosed Alphanumerics
25      Box Drawing, Block Elements, Geometric Shapes
26      Miscellaneous Symbols
27      Dingbats
28--2F  (Reserved for future standardization)
30      CJK Symbols and Punctuation, Hiragana, Katakana
31      Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32      Enclosed CJK Letters and Months
33      CJK Compatibility
34--4D  Hangul
======= I-ZONE (ideographic characters) ===================================
4E--9F  CJK Unified Ideographs
======= O-ZONE (open zone) ================================================
A0--DF  (Reserved for future standardization)
======= R-ZONE (restricted use zone) ======================================
E0--F8  (Private Use Area)
F9--FA  CJK Compatibility Ideographs
FB      Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD  Arabic Presentation Forms-A
FE      Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic-B
FF      Halfwidth and Fullwidth Forms, Specials
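
The zone boundaries in the table above can be turned into a
small lookup. This Python sketch only restates the row ranges
given in the annex; the names ZONES and zone_of are invented for
the example.

```python
# (first row, last row, zone), taken from the table above.
ZONES = [
    (0x00, 0x4D, "A-zone"),
    (0x4E, 0x9F, "I-zone"),
    (0xA0, 0xDF, "O-zone"),
    (0xE0, 0xFF, "R-zone"),
]


def zone_of(code):
    """Return the zone of a 2-octet code position, based on its row."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 2-octet code position")
    row = code >> 8
    for first, last, name in ZONES:
        if first <= row <= last:
            return name
    raise AssertionError("the zones cover all 256 rows")


print(zone_of(0x01FA))    # A-zone
print(zone_of(0x4E00))    # I-zone
```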
 
 
 
(unicode-iso10646-oview.txta Ap4 930914: END)
