Date: Wed, 15 Sep 93 13:06:12 +0200
From: Olle Jarnefors
To: Glenn Adams, Edwin Hart, Sten G Lindberg, keld@dkuug.dk (Keld Jørn Simonsen)
Subject: A short technical overview of ISO 10646/Unicode

Dear universal coders,

I have put together a short (less than six pages) overview of UCS and Unicode from a technical viewpoint, which I plan to distribute freely on the Internet. I would very much appreciate any suggestions for improvement or other comments on this text from you, which I can incorporate in the final version.

Here are some questions that especially bother me:

-- Is the text short enough to have a chance to be read by interested people?
-- Are important aspects of UCS or Unicode missing?
-- Are some aspects treated in too great detail?
-- Can some of the explanations be made easier to understand (without becoming too long)?
-- Bad English?

The current version of the text follows. There is also a Swedish version, not included here. You can fetch the latest versions by anonymous FTP to othello.admin.kth.se in directory pub/misc/ucs

These files are of interest:

unicode-iso10646-oview.txta        latest English version
unicode-iso10646-oview-diff.txta     " with changes from previous version marked
unicode-iso10646-or.txt1           latest Swedish version (in Latin-1)
unicode-iso10646-or-diff.txt1        " with changes from previous version marked
unicode-iso10646-or.txts           latest Swedish version (in Swedish 646)
unicode-iso10646-or-diff.txts        " with changes from previous version marked

Best regards,

/Olle

-----
(unicode-iso10646-oview.txta Ap4 930914 OJ)

A SHORT OVERVIEW OF ISO 10646 AND UNICODE

By: Olle Jarnefors
Royal Institute of Technology, Sweden
Fax: +46 8 10 25 10

The purpose of this text is to give a brief technical overview of the new character set standard ISO 10646. I have omitted descriptions of the history of the standard as well as general talk about why a standard of this type is badly needed.

Previous knowledge: The reader should have some knowledge about coded character sets, have seen an ASCII table, and know of some 8-bit character sets, like Latin-1 (ISO 8859-1).

1. Most important facts
-----------------------

ISO 10646 is a new character set standard that was published in 1993 by the International Organization for Standardization (ISO). Its name is "Universal Multiple-Octet Coded Character Set", *UCS*, and it is the first coded character set with the ambition to eventually include all characters used in all the written languages of the world (and, in addition, all mathematical and other symbols). The current first edition at least covers all major languages and all commercially important languages.

To make it possible to represent every character with a unique bit sequence in UCS, this representation must consist of at least two octets (16 bits).

UCS is intended to be used both for internal data representation in computer systems and for data communication. New operating systems and computers using UCS have already been released by Microsoft (Windows NT) and Apple (Newton). The standard has, however, been subject to strong criticism both from data communication experts (Internet) and from some groups in Japan.

ISO 10646 is a fundamental standard affecting almost all parts of information technology, but UCS is only a coded character set, not a complete system for text representation. For several important aspects of text, as it is treated in modern text processing programs, UCS needs to be supplemented by further standards or rules. Some examples of this are italic text, tables, mathematical formulas, fonts, and document structure. (The simple kind of text that can be represented by UCS or other character set standards alone -- a linear sequence of characters, with a fixed division into lines and pages -- is called *plain text*.)

*Unicode* is a coded character set specified by a consortium of major American computer manufacturers.
The latest version, 1.1, is identical to the 2-octet form of UCS (on implementation level 3).

2. The structure of the coding space
------------------------------------

In the first version of UCS 34 203 different characters are included. Of these, 21 204 are ideographic characters used in Chinese, Japanese and Korean. To guarantee that the coding space will not fill up in the future (2 octets give only 65 536 different character codes), a *4-octet form* of UCS (UCS-4) is also defined. So far, however, only the *2-octet form* (UCS-2) is used in practice.

The 65 536 positions in the 2-octet form of UCS are divided into 256 *rows* with 256 *cells* in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO 8859-1, and its first 128 characters are thus the ASCII characters. If you have the octet representing an ISO 8859-1 character, you get its UCS representation by putting an octet with the value 0 in front of it. UCS includes the same control characters as ISO 8859, and these are also in row 0. An overview of the content of all rows is found in the annex.

In the 4-octet form 2 147 483 648 different characters can be represented. (The first bit of the first octet shall be 0, so only 31 of the 32 bits are used by UCS.) This coding space is subdivided into 128 *groups*, each containing 256 *planes*. The first octet in a character representation indicates the group number and the second the plane number. The third and fourth octets give the row number and the cell number of the character. Those characters that can be represented by the 2-octet form of UCS belong to plane 0 of group 0, which is called the Basic Multilingual Plane (BMP). If you have the 2-octet representation of a character, you get the 4-octet representation by putting two octets with the value 0 in front of it.

3. What is accepted as a character in UCS?
------------------------------------------

When deciding which graphic characters should be included in UCS, the most important principle has been that a new character, to be accepted, must differ from all already included characters both in meaning and in appearance.

Alternative graphic forms of existing characters (font variants, glyphs) are consequently not given their own character codes. In Chinese, Japanese and Korean there is a very large number of ideographic characters with the same historical origin and only very minor differences in appearance. These national variants of the same ideograph have been given a joint UCS character code, a solution which is known as *CJK unification* and is controversial in Japan.

Nor is a completely new way of using an existing character (the same appearance but a different meaning) sufficient justification for a new coded representation in UCS. The old punctuation symbol asterisk, "*", has in recent years also been used as a multiplication sign in various programming languages. This case is regarded as two different uses of the same character, which is given only one UCS representation.

Two important exceptions to the rule that a character is a set of those graphic forms which can be used with the same meaning(s) are made in UCS:

-- Letters with exactly the same appearance that occur in several different scripts are given different coded representations. There is, for example, one Latin "P", one Greek "P" (capital rho), and one Cyrillic "P" (Cyrillic R).

-- A comparatively small number of improper characters which are included in other practically important coded character sets are accepted in UCS. This is to make possible the fully reversible conversion of data coded in these character sets to UCS and back again to the original character set. Such characters are called *compatibility characters*.
One example is the character SUPERSCRIPT TWO, which can be found in UCS only because it is included in the character set ISO 8859-1.

What is said above is only a general outline of the principles used to identify individual characters in UCS and Unicode. These are unfortunately not described at all in the text of ISO 10646. In many specific cases it is of course not at all clear how to apply them. Quite a number of the decisions made are fairly arbitrary.

One important feature of UCS is that a large number of code positions are reserved for *private use characters*. No future edition of ISO 10646 will use these positions. There is room for 6 400 private characters in the 2-octet form, and more than 500 million in the 4-octet form.

4. Implementation levels
------------------------

Three different *implementation levels* of UCS are defined in ISO 10646. On the two lower levels certain parts of full UCS, which complicate the implementation of text processing programs, are excluded.

-- The simplest level, implementation level 1, works exactly like the older simple coded character sets such as ISO 8859: Each graphic character occupies one position and moves the active position one step in the writing direction (though the step need not be constant; it is not when a proportional font is used).

   This model works well for, among others, the Latin, Greek, and Cyrillic scripts. The composite characters (consisting of a base letter and one or more diacritical marks) used in a certain language must, however, be included in the character set as single characters in their own right. UCS includes the composite characters of all official languages and also of most other languages with a well-established orthography using these scripts.

   Other languages that can be handled on implementation level 1 are Japanese and Chinese. This is also possible for the Arabic and Hebrew scripts, but there is an extra complication: These scripts are normally written from right to left, but when words in e.g.
   the Latin script are included in the text, these are written in their normal direction, from left to right. In computer memory all characters are stored in the *logical order*, i.e. the order in which the sounds of the words are pronounced and the letters are normally input. Two alternative methods to handle *bi-directional* text can be used together with UCS, one based on the international standard ISO 6429 and one defined for Unicode.

-- On implementation level 2 the South Asian scripts, e.g. Devanagari, can also be handled. These place further demands on display software, since in many cases both the appearance and the position of a letter are determined by its nearest surrounding letters.

-- On the full implementation level 3 programs must also be able to handle independent *combining characters*, e.g. accents and other diacritical marks that are printed over, under or through ordinary letters. Such characters can be freely combined with other characters, and UCS sets no limit on the number of combining characters attached to a base character.

   A complication for programmers is that on this level some composite characters can each be coded in several different ways. As an example, the Danish letter "A with ring above and acute accent" can be represented in three different ways:

      01FA            (the simple representation that must be used on levels 1 and 2)
      00C5 0301       ("A with ring above" + combining acute accent)
      0041 030A 0301  ("A" + combining ring above + combining acute accent)

   (The coded representations in UCS are usually given in hexadecimal notation. 01FA indicates two octets, first the octet with the value 1, then the octet with the decimal value 250.)

   Formally, the first alternative above is considered a representation of a single *precomposed* character, while the second and third alternatives represent different *composite sequences* of several characters.
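
   The hexadecimal notation and the three representations listed above can be tried out with a short Python sketch. It is only an illustration: the helper names (octets, to_text, decomposed) are made up for this example, and the comparison relies on the canonical decomposition (NFD) offered by Python's standard unicodedata module, a mechanism from later versions of the Unicode standard that is not described in ISO 10646-1:1993. It is assumed here as one possible way to treat the alternatives alike.

```python
import unicodedata

# The three coded representations quoted in the text (hexadecimal notation).
REPRESENTATIONS = ["01FA", "00C5 0301", "0041 030A 0301"]


def octets(code):
    """Split one 2-octet code such as '01FA' into (first octet, second octet)."""
    value = int(code, 16)
    return value >> 8, value & 0xFF


def to_text(representation):
    """Turn a space-separated list of hexadecimal codes into a Python string."""
    return "".join(chr(int(code, 16)) for code in representation.split())


def decomposed(representation):
    """Assumption: NFD (a later Unicode mechanism) serves as the common form."""
    return unicodedata.normalize("NFD", to_text(representation))


# Show the octet pairs behind each representation.
for rep in REPRESENTATIONS:
    print(rep, [octets(code) for code in rep.split()])

# Check whether all three reduce to one fully decomposed sequence.
print(len({decomposed(rep) for rep in REPRESENTATIONS}) == 1)
```

   Here octets("01FA") gives (1, 250), the two octet values mentioned in the text, and the final line prints True because all three alternatives decompose to "A" + combining ring above + combining acute accent.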
   Programs should, however, treat these three alternatives as fully equivalent representations of the same thing.

   Implementation level 3 is necessary for full support of the Korean Hangul script and also for full support of IPA, the International Phonetic Alphabet.

5. Data communication problems
------------------------------

Many data communication protocols give special treatment to octets with values in the range 0-31 (which in 7-bit and 8-bit character sets represent control characters). Some octets that in ASCII represent graphic characters cannot be included in file names in some important operating systems. For these reasons algorithmic transformation methods have been defined for UCS data. The most important is called *UTF-2*. It replaces the coded representations 0000-007F (the ASCII characters) with the corresponding octet in the range hex 00-7F. The other coded representations of UCS are transformed to sequences of two or more octets in the range hex 80-FF.

6. Sources
----------

UCS is defined in:

   ISO/IEC International Standard 10646-1:1993(E): Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. ISO, 1993.

Unicode version 1.0 is defined in two books:

   The Unicode Consortium: The Unicode Standard: Worldwide Character Encoding. Version 1.0. Volume 1 (Architecture, non-ideographic characters). Addison-Wesley, 1991.

   The Unicode Consortium: The Unicode Standard: Worldwide Character Encoding. Version 1.0. Volume 2 (Ideographic characters). Addison-Wesley, 1992.

The changes made between version 1.0 and version 1.1 are specified in:

   Unicode Technical Report #4: The Unicode Standard, Version 1.1. The Unicode Consortium, 1993 (Prepublication Edition).

A good account of the history of ISO work on multi-octet character sets and the merger between ISO 10646 and Unicode can be found in:

   Michael Y. Ksar: Untying tongues. ISO/IEC breaks down computer barriers in processing worldwide languages. ISO Bulletin, No.
   6 (June 1993)

Annex: Overview of the Basic Multilingual Plane (group=00, plane=00)
--------------------------------------------------------------------

_______ ___________________________________________________________________
Row(s)  Content (script, other groups of characters, reserved area)
_______ ___________________________________________________________________

======= A-ZONE (alphabetical characters and symbols) =======================
00      (Control characters,) Basic Latin, Latin-1 Supplement (=ISO 8859-1)
01      Latin Extended-A, Latin Extended-B
02      Latin Extended-B, IPA Extensions, Spacing Modifier Letters
03      Combining Diacritical Marks, Basic Greek, Greek Symbols and Coptic
04      Cyrillic
05      Armenian, Hebrew
06      Basic Arabic, Arabic Extended
07--08  (Reserved for future standardization)
09      Devanagari, Bengali
0A      Gurmukhi, Gujarati
0B      Oriya, Tamil
0C      Telugu, Kannada
0D      Malayalam
0E      Thai, Lao
0F      (Reserved for future standardization)
10      Georgian
11      Hangul Jamo
12--1D  (Reserved for future standardization)
1E      Latin Extended Additional
1F      Greek Extended
20      General Punctuation, Super/subscripts, Currency, Combining Symbols
21      Letterlike Symbols, Number Forms, Arrows
22      Mathematical Operators
23      Miscellaneous Technical Symbols
24      Control Pictures, OCR, Enclosed Alphanumerics
25      Box Drawing, Block Elements, Geometric Shapes
26      Miscellaneous Symbols
27      Dingbats
28--2F  (Reserved for future standardization)
30      CJK Symbols and Punctuation, Hiragana, Katakana
31      Bopomofo, Hangul Compatibility Jamo, CJK Miscellaneous
32      Enclosed CJK Letters and Months
33      CJK Compatibility
34--4D  Hangul
======= I-ZONE (ideographic characters) ===================================
4E--9F  CJK Unified Ideographs
======= O-ZONE (open zone) ================================================
A0--DF  (Reserved for future standardization)
======= R-ZONE (restricted use zone) ======================================
E0--F8  (Private Use Area)
F9--FA  CJK Compatibility Ideographs
FB      Alphabetic Presentation Forms, Arabic Presentation Forms-A
FC--FD  Arabic Presentation Forms-A
FE      Combining Half Marks, CJK Compatibility Forms, Small Forms, Arabic Presentation Forms-B
FF      Halfwidth and Fullwidth Forms, Specials

(unicode-iso10646-oview.txta Ap4 930914: END)