11 Revision and updating of the UCS
12 Subsets
13 Coded representation forms of the UCS
14 Implementation levels
15 Use of control functions with the UCS
16 Declaration of identification of features
17 Structure of the code tables and lists
18 Block names
19 Characters in bi-directional context
20 Special characters
21 Presentation forms of characters
22 Compatibility characters
23 Order of characters
24 Combining characters
25 Special features of individual scripts
26 Code tables and lists of character names
27 CJK unified ideographs

Annexes
A Collections of graphic characters for subsets
B List of combining characters
C Transformation format for 16 planes of Group 00 (UTF-16)
D UCS Transformation Format 8 (UTF-8)
E Mirrored characters in Arabic bi-directional context
F Alternate format characters
G Alphabetically sorted list of character names
H The use of "signatures" to identify UCS
J Recommendation for combined receiving/originating devices with internal storage
K Notations of octet value representations
L Character naming guidelines
M Sources of characters
N External references to character repertoires
P Additional information on characters
Q Code mapping table for Hangul syllables
R Procedure for the unification and arrangement of CJK Ideographs

Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane

1 Scope

ISO/IEC 10646 specifies the Universal Multiple-Octet Coded Character Set (UCS). It is applicable to the representation, transmission, interchange, processing, storage, input, and presentation of the written form of the languages of the world as well as additional symbols.

This part of ISO/IEC 10646 specifies the overall architecture, and
- defines terms used in ISO/IEC 10646;
- describes the general structure of the coded character set;
- specifies the Basic Multilingual Plane (BMP) of the UCS, and defines a set of graphic characters used in scripts and the written form of languages on a world-wide scale;
- specifies the names for the graphic characters of the BMP, and the coded representations;
- specifies the four-octet (32-bit) canonical form of the UCS: UCS-4;
- specifies a two-octet (16-bit) BMP form of the UCS: UCS-2;
- specifies the coded representations for control functions;
- specifies the management of future additions to this coded character set.

The UCS is a coding system different from that specified in ISO 2022. The method to designate UCS from ISO 2022 is specified in 16.2.

NOTE - It is intended that character code positions for additional scripts and symbols will be allocated in this Part 1 of this International Standard when sufficient input and review is provided by national standards organizations or other qualified experts.

2 Conformance

2.1 General

Whenever private use characters are used as specified in ISO/IEC 10646, the characters themselves shall not be covered by these conformance requirements.
2.2 Conformance of information interchange A coded-character-data-element (CC-data-element) within coded information for interchange is in conformance with ISO/IEC 10646 if a) all the coded representations of graphic characters within that CC-data-element conform to clauses 6 and 7, to an identified form chosen from clause 13 or Annex C or Annex D, and to an identified implementation level chosen from clause 14; b) all the graphic characters represented within that CC-data-element are taken from those within an identified subset (clause 12); c) all the coded representations of control functions within that CC-data-element conform to clause 15. A claim of conformance shall identify the adopted form, the adopted implementation level and the adopted subset by means of a list of collections and/or characters. 2.3 Conformance of devices A device is in conformance with ISO/IEC 10646 if it conforms to the requirements of item a) below, and either or both of items b) and c). NOTE - The term device is defined (in 4.18) as a component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. A device may be a conventional input/output device, or a process such as an application program or gateway function. A claim of conformance shall identify the document that contains the description specified in a) below, and shall identify the adopted form(s), the adopted implementation level, the adopted subset (by means of a list of collections and/or characters), and the selection of control functions adopted in accordance with clause 15. a) Device description: A device that conforms to ISO/IEC 10646 shall be the subject of a description that identifies the means by which the user may supply characters to the device and/or may recognize them when they are made available to the user, as specified respectively, in subclauses b), and c) below. 
b) Originating device: An originating device shall allow its user to supply any characters from an adopted subset, and be capable of transmitting their coded representations within a CC-data-element in accordance with the adopted form and implementation level. c) Receiving device: A receiving device shall be capable of receiving and interpreting any coded representation of characters that are within a CC-data-element in accordance with the adopted form and implementation level, and shall make any corresponding characters from the adopted subset available to the user in such a way that the user can identify them. Any corresponding characters that are not within the adopted subset shall be indicated to the user. The way used for indicating them need not distinguish them from each other. NOTES 1 An indication to the user may consist of making available the same character to represent all characters not in the adopted subset, or providing a distinctive audible or visible signal when appropriate to the type of user. 2 See also annex J for receiving devices with re-transmission capability. 3 Normative references The following standards contain provisions which, through reference in this text, constitute provisions of this part of ISO/IEC 10646. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this part of ISO/IEC 10646 are encouraged to investigate the possibility of applying the most recent editions of the standards listed below. Members of IEC and ISO maintain registers of currently valid International Standards. ISO/IEC 2022:1994 Information technology — Character code structure and extension techniques. ISO/IEC 6429:1992 Information technology — Control functions for coded character sets. 4 Definitions For the purposes of ISO/IEC 10646, the following definitions apply : 4.1 Basic Multilingual Plane (BMP): Plane 00 of Group 00. 
4.2 block: A contiguous range of code positions to which a set of characters that share common characteristics, such as script, are allocated. A block cannot overlap another block. One or more of the code positions within a block may have no character allocated to it. 4.3 canonical form: The form with which characters of this coded character set are specified using four octets to represent each character. 4.4 CC-data-element (coded-character-data-element): An element of interchanged information that is specified to consist of a sequence of coded representations of characters, in accordance with one or more identified standards for coded character sets. 4.5 cell: The place within a row at which an individual character may be allocated. 4.6 character: A member of a set of elements used for the organisation, control, or representation of data. 4.7 character boundary: Within a stream of octets the demarcation between the last octet of the coded representation of a character and the first octet of that of the next coded character. 4.8 coded character: A character together with its coded representation. 4.9 coded character set: A set of unambiguous rules that establishes a character set and the relationship between the characters of the set and their coded representation. 4.10 code table: A table showing the characters allocated to the octets in a code. 4.11 collection: A set of coded characters which is numbered and named and which consists of those coded characters whose code positions lie within one or more identified ranges. NOTE - If any of the identified ranges include code positions to which no character is allocated, the repertoire of the collection will change if an additional character is assigned to any of those positions at a future amendment of this International Standard. However it is intended that the collection number and name will remain unchanged in future editions of this International Standard. 
4.12 combining character: A member of an identified subset of the coded character set of ISO/IEC 10646 intended for combination with the preceding non-combining graphic character, or with a sequence of combining characters preceded by a non-combining character (see also 4.14). NOTE - This part of ISO/IEC 10646 specifies several subset collections which include combining characters. 4.13 compatibility character: A graphic character included as a coded character of ISO/IEC 10646 primarily for compatibility with existing coded character sets. 4.14 composite sequence: A sequence of graphic characters consisting of a non-combining character followed by one or more combining characters (see also 4.12). NOTES 1 A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each character in the sequence. 2 A composite sequence is not a character and therefore is not a member of the repertoire of ISO/IEC 10646. 4.15 control function: An action that affects the recording, processing, transmission or interpretation of data, and that has a coded representation consisting of one or more octets. 4.16 default state: The state that is assumed when no state has been explicitly specified. 4.17 detailed code table: A code table showing the individual characters, and normally showing a partial row. 4.18 device: A component of information processing equipment which can transmit and/or receive coded information within CC-data-elements. (It may be an input/output device in the conventional sense, or a process such as an application program or gateway function.) 4.19 fixed collection: A collection in which every code position within the identified range(s) has a character allocated to it, and which is intended to remain unchanged in future editions of this International Standard. 4.20 graphic character: A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed. 
4.21 graphic symbol: The visual representation of a graphic character or of a composite sequence.
4.22 group: A subdivision of the coding space of this coded character set; of 256 x 256 x 256 cells.
4.23 high-half zone: A set of cells reserved for use in UTF-16 (see Annex C); an RC-element corresponding to any of these cells may be used as the first of a pair of RC-elements which represents a character from a plane other than the BMP.
4.24 interchange: The transfer of character coded data from one user to another, using telecommunication means or interchangeable media.
4.25 interworking: The process of permitting two or more systems, each employing different coded character sets, meaningfully to interchange character coded data; conversion between the two codes may be involved.
4.26 low-half zone: A set of cells reserved for use in UTF-16 (see Annex C); an RC-element corresponding to any of these cells may be used as the second of a pair of RC-elements which represents a character from a plane other than the BMP.
4.27 octet: An ordered sequence of eight bits considered as a unit.
4.28 plane: A subdivision of a group; of 256 x 256 cells.
4.29 presentation; to present: The process of writing, printing, or displaying a graphic symbol.
4.30 presentation form: In the presentation of some scripts, a form of a graphic symbol representing a character that depends on the position of the character relative to other characters.
4.31 private use plane: A plane within this coded character set the contents of which are not specified in ISO/IEC 10646 (see clause 10).
4.32 RC-element: A two-octet sequence comprising the R-octet and the C-octet (see 6.2) from the four-octet sequence that corresponds to a cell in the coding space of this coded character set.
4.33 repertoire: A specified set of characters that are represented in a coded character set.
4.34 row: A subdivision of a plane; of 256 cells.
4.35 script: A set of graphic characters used for the written form of one or more languages.
4.36 supplementary plane: A plane that accommodates characters which have not been allocated to the Basic Multilingual Plane.
4.37 unpaired RC-element: An RC-element in a CC-data-element that is either:
• an RC-element from the high-half zone that is not immediately followed by an RC-element from the low-half zone, or
• an RC-element from the low-half zone that is not immediately preceded by an RC-element from the high-half zone.
4.38 user: A person or other entity that invokes the service provided by a device. (This entity may be a process such as an application program if the "device" is a code converter or a gateway function, for example.)
4.39 zone: A sequence of cells of a code table, comprising one or more rows, either in whole or in part, containing characters of a particular class (see clause 8).

5 General structure of the UCS

The general structure of the Universal Multiple-Octet Coded Character Set (referred to hereafter as "this coded character set") is described in this explanatory clause, and is illustrated in figures 1 and 2. The normative specification of the structure is given in the following clauses.

The value of any octet is expressed in hexadecimal notation from 00 to FF in ISO/IEC 10646 (see annex K).

The canonical form of this coded character set (the way in which it is to be conceived) uses a four-dimensional coding space, regarded as a single entity, consisting of 128 three-dimensional groups.

NOTE - Thus, bit 8 of the most significant octet in the canonical form of a coded character can be used for internal processing purposes within a device as long as it is set to zero within a conforming CC-data-element.

Each group consists of 256 two-dimensional planes. Each plane consists of 256 one-dimensional rows, each row containing 256 cells. A character is located and coded at a cell within this coding space or the cell is declared unused.
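The dimensions given above can be checked with a little arithmetic. The following Python sketch is illustrative only and is not part of ISO/IEC 10646; the constant names and the function `locate` are hypothetical. It computes the size of the coding space and splits a code position into its group, plane, row and cell values.

```python
# Illustrative sketch only; names are hypothetical, not defined by ISO/IEC 10646.
GROUPS = 128             # groups 00 to 7F
PLANES_PER_GROUP = 256   # planes 00 to FF
ROWS_PER_PLANE = 256     # rows 00 to FF
CELLS_PER_ROW = 256      # cells 00 to FF

CELLS_PER_PLANE = ROWS_PER_PLANE * CELLS_PER_ROW
CELLS_PER_GROUP = PLANES_PER_GROUP * CELLS_PER_PLANE
TOTAL_CELLS = GROUPS * CELLS_PER_GROUP


def locate(code_position):
    """Split a code position into (group, plane, row, cell) values."""
    if not 0 <= code_position < TOTAL_CELLS:
        raise ValueError("code position lies outside the 128 groups")
    group = (code_position >> 24) & 0xFF
    plane = (code_position >> 16) & 0xFF
    row = (code_position >> 8) & 0xFF
    cell = code_position & 0xFF
    return group, plane, row, cell
```

Because only 128 groups exist, the total is 2 to the power 31 cells, which is why bit 8 of the most significant octet is always zero in a conforming CC-data-element.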
In the canonical form, four octets are used to represent each character, and they specify the group, plane, row and cell, respectively. The canonical form consists of four octets since two octets are not sufficient to cover all the characters in the world, and a 32-bit representation follows modern processor architectures. The four-octet canonical form can be used as a four-octet coded character set, in which case it is called UCS-4. The first plane (Plane 00 of Group 00) is called the Basic Multilingual Plane. The Basic Multilingual Plane includes characters in general use in alphabetic, syllabic and ideographic scripts together with various symbols and digits. The subsequent planes are regarded as supplementary or private use planes, which will accommodate additional graphic characters (see clause 9). The planes that are reserved for private use are specified in clause 10. The contents of the cells in private use zones are not specified in ISO/IEC 10646. Each character is located within the coded character set in terms of its Group-octet, Plane-octet, Row-octet, and Cell-octet. In addition to the canonical form, a two-octet BMP form is specified. Thus, the Basic Multilingual Plane can be used as a two-octet coded character set identified as UCS-2. Subsets of the coding space may be used in order to give a sub-repertoire of graphic characters. A UCS Transformation Format (UTF-16) is specified in Annex C which can be used to represent characters from 16 planes of group 00, additional to the BMP, in a form that is compatible with the two-octet BMP form. A UCS Transformation Format (UTF-8) is specified in Annex D which can be used to transmit text data through communication systems which are sensitive to octet values for control characters coded according to the 8-bit structure of ISO/IEC 2022, and to ISO/IEC 4873. 
UTF-8 also avoids the use of octet values according to ISO/IEC 4873 which have special significance during the parsing of file-name character strings in widely-used file-handling systems.

6 Basic structure and nomenclature

6.1 Structure

The Universal Multiple-Octet Coded Character Set as specified in ISO/IEC 10646 shall be regarded as a single entity.

This entire coded character set shall be conceived of as comprising 128 groups of 256 planes. Each plane shall be regarded as containing 256 rows of characters, each row containing 256 cells. In a code table representing the contents of a plane (such as in figure 2), the horizontal axis shall represent the least significant octet, with its smaller value to the left; and the vertical axis shall represent the more significant octet, with its smaller value at the top.

Each axis of the coding space shall be coded by one octet. Within each octet the most significant bit shall be bit 8 and the least significant bit shall be bit 1. Accordingly, the weight allocated to each bit shall be

bit 8   bit 7   bit 6   bit 5   bit 4   bit 3   bit 2   bit 1
  128      64      32      16       8       4       2       1

[Figure 1 - Entire coding space of the Universal Multiple-Octet Coded Character Set. The figure shows Groups 00 to 7F, each made up of Planes 00 to FF; each plane contains 256 x 256 cells.]

[Figure 2 - Group 00 of the Universal Multiple-Octet Coded Character Set. Axes: Plane-octet, Row-octet, Cell-octet. The figure shows the Basic Multilingual Plane (Plane 00) with the S-zone (rows D8..DF) and the private use zone (rows E0..F8), the supplementary planes, and the private use planes 0F, 10, and E0 to FF.]

NOTE - Labels “S-zone” and “Private use zone” are specified in clause 8.

6.2 Coding of characters

In the canonical form of the coded character set, each character within the entire coded character set shall be represented by a sequence of four octets. The most significant octet of this sequence shall be the group-octet. The least significant octet of this sequence shall be the cell-octet.
Thus this sequence may be represented as

m.s.                                                l.s.
Group-octet   Plane-octet   Row-octet   Cell-octet

where m.s. means the most significant octet, and l.s. means the least significant octet. For brevity, the octets may be termed

m.s.                            l.s.
G-octet   P-octet   R-octet   C-octet

Where appropriate, these may be further abbreviated to G, P, R, and C.

The value of any octet shall be represented by two hexadecimal digits, for example: 31 or FE. When a single character is to be identified in terms of the values of its group, plane, row, and cell, this shall be represented such as:

0000 0030 for DIGIT ZERO
0000 0041 for LATIN CAPITAL LETTER A

When referring to characters within an identified plane, the leading four digits (for G-octet and P-octet) may be omitted. For example, within plane 00, 0030 may be used to refer to DIGIT ZERO.

6.3 Octet order

The sequence of the octets that represent a character, and the most significant and least significant ends of it, shall be maintained as shown above. When serialized as octets, a more significant octet shall precede less significant octets. When not serialized as octets, the order of octets may be specified by agreement between sender and recipient (see 16.1 and annex H).

6.4 Naming of characters

ISO/IEC 10646 assigns a unique name to each character. The name of a character either:
a. denotes the customary meaning of the character, or
b. describes the shape of the corresponding graphic symbol, or
c. follows the rule given in clause 27 for Chinese/Japanese/Korean (CJK) unified ideographs.

Guidelines to be used for constructing the names of characters in cases a. and b. are given in annex L.

6.5 Identifiers for characters

ISO/IEC 10646 defines a short identifier for each character. The short identifier for any character is distinct from the short identifier for any other character. These short identifiers are independent of the language in which this standard is written, and are thus retained in all translations of the text.
The following alternative forms of notation of a short identifier are defined here. a. The eight-digit form of short identifier shall consist of the sequence of eight hexadecimal digits that represents the code position of the character (see 6.2). b. The four-digit form of short identifier shall consist of the last four digits of the eight-digit form. It is not defined if the first four digits of the eight-digit form are not all zeroes; that is, for characters allocated outside the Basic Multilingual Plane. c. The character "-" (HYPHEN-MINUS) may, as an option, precede the 8-digit form of short identifier. d. The character "+" (PLUS SIGN) may, as an option, precede the 4-digit form of short identifier. e. The prefix letter "U" (LATIN CAPITAL LETTER U) may, as an option, precede any of the four forms of short identifier defined in a. to d. above. The CAPITAL letters A to F, and U that appear within identifiers may be replaced by the corresponding SMALL letters. The full syntax of the notation of a short identifier, in Backus-Naur form, is: { U | u } [ {+}xxxx | {-}xxxxxxxx ] where "x" represents one hexadecimal digit (0 to 9, A to F, or a to f), for example: -hhhhhhhh +kkkk Uhhhhhhhh U+kkkk where hhhhhhhh indicates the eight-digit form and kkkk indicates the four-digit form. NOTES 1 As an example the identifier for LATIN SMALL LETTER LONG S (see tables for Row 01 in clause 26) may be notated in any of the following forms: 0000017F -0000017F U0000017F U-0000017F 017F +017F U017F U+017F Any of the capital letters may be replaced by the corresponding small letter. 2 Two special prefixed forms of notation have also been used, in which the letter T (LATIN CAPITAL LETTER T or LATIN SMALL LETTER T) replaces the letter U in the corresponding prefixed forms. 
The forms of notation that included the prefix letter T indicated that the identifier refers to a character in ISO/IEC 10646-1 First Edition (before the application of any Amendments), whereas the forms of notation that include the prefix letter U always indicate that the identifier refers to a character in ISO/IEC 10646 at the most recent state of amendment. Corresponding identifiers of the form T-xxxxxxxx and U-xxxxxxxx refer to the same character except when xxxxxxxx lies in the range 00003400 to 00004DFF inclusive. Forms of notation that include no prefix letter always indicate a reference to the most recent state of amendment of ISO/IEC 10646, unless otherwise qualified. 7 General requirements for the UCS The following requirements apply to the entire coded character set. a) The values of P-, and R-, and C-octets used for representing graphic characters shall be in the range 00 to FF. The values of G-octets used for representation of graphic characters shall be in the range 00 to 7F. On any plane, code positions FFFE and FFFF shall not be used. NOTE - Code position FFFE is reserved for "signature" (see annex H). Code position FFFF can be used for internal processing uses requiring a numeric value that is guaranteed not to be a coded character such as in terminating tables, or signaling end-of-text. Since it is the largest two-octet value, it may also be used as the final value in binary or sequential searching index. b) Code positions to which a character is not allocated, except for the positions reserved for private use characters or for transformation formats, are reserved for future standardization and shall not be used for any other purpose. Future editions of ISO/IEC 10646 will not allocate any characters to code positions reserved for private use characters or for transformation formats. c) The same graphic character shall not be allocated to more than one code position. 
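As an informal illustration, the short-identifier notation of 6.5 and the range limits of requirement a) can be combined in a small Python sketch. It is not part of ISO/IEC 10646. The function names are hypothetical, the regular expression is one reading of the syntax given in 6.5, and the historical T-prefixed forms are deliberately not handled.

```python
import re

# One reading of the 6.5 syntax: optional U/u, then either an optional "+"
# with four hexadecimal digits or an optional "-" with eight.
_SHORT_ID = re.compile(r"[Uu]?(?:\+?([0-9A-Fa-f]{4})|-?([0-9A-Fa-f]{8}))")


def parse_short_identifier(text):
    """Return the code position named by a short identifier (hypothetical helper)."""
    match = _SHORT_ID.fullmatch(text)
    if match is None:
        raise ValueError("not a short identifier: %r" % text)
    digits = match.group(1) or match.group(2)
    return int(digits, 16)


def in_permitted_range(code_position):
    """Check the numeric limits of clause 7 a) only.

    A True result does not mean that a character is allocated there.
    """
    if not 0 <= code_position <= 0x7FFFFFFF:      # G-octet must be 00 to 7F
        return False
    return (code_position & 0xFFFF) not in (0xFFFE, 0xFFFF)
```

For example, `parse_short_identifier("U+017F")` and `parse_short_identifier("-0000017F")` both yield the code position of LATIN SMALL LETTER LONG S, while `in_permitted_range(0x0001FFFE)` is false because FFFE is excluded on every plane.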
There are graphic characters with similar shapes in the coded character set; they are used for different purposes and have different character names.

8 The Basic Multilingual Plane

Plane 00 of Group 00 shall be the Basic Multilingual Plane (BMP). The BMP can be used as a two-octet coded character set in which case it shall be called UCS-2 (see 13.1).

Code positions 0000 0000 to 0000 001F in the BMP are reserved for control characters, and code position 0000 007F is reserved for the character DELETE (see clause 15). Code positions 0000 0080 to 0000 009F are reserved for control characters.

Code positions 0000 D800 to 0000 DFFF are reserved for the use of UTF-16 (see Annex C). These positions are known as the S-zone.

Code positions 0000 E000 to 0000 F8FF are reserved for private use (see clause 10). These positions are known as the private use zone.

Code positions FFFE and FFFF are reserved.

9 Other planes

9.1 Planes reserved for future standardization

Planes 11 to DF in Group 00 and Planes 00 to FF in Groups 01 to 5F are reserved for future standardization, and thus those code positions shall not be used for any other purpose.

9.2 Planes accessible by UTF-16

Each code position in Planes 01 to 10 of Group 00 has a unique mapping to a four-octet sequence in accordance with the UTF-16 form of coded representation (see Annex C). This form is compatible with the two-octet BMP form of UCS-2 (see 13.1).

Code positions in Planes 11 to FF of Group 00, or in Planes 00 to FF of other groups, do not have a mapping to the UTF-16 form.

10 Private use groups, planes, and zones

10.1 Private use characters

Private use characters are not restrained in any way by ISO/IEC 10646. Private use characters can be used to provide user-defined characters. For example, this is a common requirement for users of ideographic scripts.

NOTE 1 - For meaningful interchange of private use characters, an agreement, independent of ISO/IEC 10646, is necessary between sender and recipient.
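The reservations that clause 8 makes within the BMP, including the private use zone discussed here, can be summarized in a short Python sketch. It is illustrative only; the function name and the returned labels are hypothetical and are not terms defined by ISO/IEC 10646, apart from "S-zone" and "private use zone".

```python
def classify_bmp_position(rc):
    """Classify a BMP code position (0000 to FFFF) using the clause 8 ranges."""
    if not 0 <= rc <= 0xFFFF:
        raise ValueError("not a BMP code position")
    # 0000-001F and 0080-009F are reserved for control characters;
    # 007F is reserved for DELETE and is grouped with them here.
    if rc <= 0x001F or rc == 0x007F or 0x0080 <= rc <= 0x009F:
        return "control"
    if 0xD800 <= rc <= 0xDFFF:
        return "S-zone"
    if 0xE000 <= rc <= 0xF8FF:
        return "private use zone"
    if rc >= 0xFFFE:
        return "reserved"
    return "other"
```

Counting the positions labelled "private use zone" over the whole BMP gives 6400, and the S-zone covers 2048 positions.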
Private use characters can be used for dynamically-redefinable character applications. NOTE 2 - For meaningful interchange of dynamically-redefinable characters, an agreement, independent of ISO/IEC 10646 is necessary between sender and recipient. ISO/IEC 10646 does not specify the techniques for defining or setting up dynamically-redefinable characters. 10.2 Code positions for private use characters The code positions of the 32 groups from Group 60 to Group 7F shall be for private use. The code positions of Plane 0F and Plane 10, and of the 32 planes from Plane E0 to Plane FF, of Group 00 shall be for private use. The 6400 code positions E000 to F8FF of the Basic Multilingual Plane shall be for private use. The contents of these code positions are not specified in ISO/IEC 10646 (see 10.1). 11 Revision and updating of the UCS The revision and updating of this coded character set will be carried out by ISO/IEC JTC1/SC2. NOTE - It is intended that in future editions of ISO/IEC 10646, the names and allocation of the characters in this edition will remain unchanged. 12 Subsets ISO/IEC 10646 provides the specification of subsets of coded graphic characters for use in interchange, by originating devices, and by receiving devices. There are two alternatives for the specification of subsets: limited subset and selected subset. An adopted subset may comprise either of them, or a combination of the two. 12.1 Limited subset A limited subset consists of a list of graphic characters in the specified subset. This specification allows applications and devices that were developed using other codes to interwork with this coded character set. A claim of conformance referring to a limited subset shall list the graphic characters in the subset by the names of graphic characters or code positions as defined in ISO/IEC 10646. 12.2 Selected subset A selected subset consists of a list of collections of graphic characters as defined in ISO/IEC 10646. 
The collections from which the selection may be made are listed in annex A of each part of ISO/IEC 10646. A selected subset shall always automatically include the Cells 20 to 7E of Row 00 of Plane 00 of Group 00. A claim of conformance referring to a selected subset shall list the collections chosen as defined in ISO/IEC 10646. 13 Coded representation forms of the UCS ISO/IEC 10646 provides two alternative forms of coded representation of characters. NOTE - The characters from the ISO/IEC 646 IRV repertoire are coded by simple zero extensions to their coded representations in ISO/IEC 646 IRV. Therefore, their coded representations have the same integer values when represented as 8-bit, 16-bit, or 32-bit integers. For implementations sensitive to a zero-valued octet (e.g. for use as a string terminator), use of 8-bit based array data type should be avoided as any zero-valued octet may be interpreted incorrectly. Use of data types at least 16-bits wide is more suitable for UCS-2, and use of data types at least 32-bits wide is more suitable for UCS-4. 13.1 Two-octet BMP form This coded representation form permits the use of characters from the Basic Multilingual Plane with each character represented by two octets. Within a CC-data-element conforming to the two-octet BMP form, a character from the Basic Multilingual Plane shall be represented by two octets comprising the R-octet and the C-octet as specified in 6.2 (i.e. its RC-element). NOTE - A coded graphic character using the two-octet BMP form may be implemented by a 16-bit integer for processing. 13.2 Four-octet canonical form The canonical form permits the use of all the characters of ISO/IEC 10646, with each character represented by four octets. Within a CC-data-element conforming to the four-octet canonical form, every character shall be represented by four octets comprising the G-octet, the P-octet, the R-octet, and the C-octet as specified in 6.2. 
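The two forms can be illustrated with a minimal Python sketch, assuming serialization with the more significant octet first as required by 6.3. The function names are hypothetical and the sketch is not part of ISO/IEC 10646.

```python
def to_two_octet_form(code_position):
    """Serialize a BMP character as its RC-element (two octets, m.s. first)."""
    if not 0 <= code_position <= 0xFFFF:
        raise ValueError("only BMP characters have a two-octet BMP form")
    return code_position.to_bytes(2, "big")


def to_four_octet_form(code_position):
    """Serialize any character as G, P, R and C octets (m.s. first)."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("G-octet must be in the range 00 to 7F")
    return code_position.to_bytes(4, "big")


# LATIN CAPITAL LETTER A (0000 0041) in both forms.
ucs2 = to_two_octet_form(0x0041)
ucs4 = to_four_octet_form(0x0041)
```

Both results contain zero-valued octets, which is the reason the note in clause 13 advises against 8-bit array types where a zero octet acts as a string terminator.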
NOTE - A coded graphic character using the four-octet canonical form may be implemented by a 32-bit integer for processing. 14 Implementation levels ISO/IEC 10646 specifies three levels of implementation. Combining characters are described in 24 and listed in annex B. 14.1 Implementation level 1 When implementation level 1 is used, a CC-data-element shall not contain coded representations of combining characters (see clause B.1) nor of characters from HANGUL JAMO block (see clause 25). When implementation level 1 is used the unique-spelling rule shall apply (25.2). 14.2 Implementation level 2 When implementation level 2 is used, a CC-data-element shall not contain coded representations of characters listed in clause B.2. When implementation level 2 is used the unique-spelling rule shall apply (25.2). 14.3 Implementation level 3 When implementation level 3 is used, a CC-data-element may contain coded representations of any characters. 15 Use of control functions with the UCS This coded character set provides for use of control functions encoded according to ISO/IEC 6429 or similarly structured standards for control functions, and standards derived from these. A set or subset of such coded control functions may be used in conjunction with this coded character set. These standards encode a control function as a sequence of one or more octets. When a control character of ISO/IEC 6429 is used with this coded character set, its coded representation as specified in ISO/IEC 6429 shall be padded to correspond with the number of octets in the adopted form (see clause 13). Thus, the least significant octet shall be the bit combination specified in ISO/IEC 6429, and the more significant octet(s) shall be zeros. For example, the control character FORM FEED is represented by "000C" in the two-octet form, and "0000 000C" in the four-octet form. 
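The padding rule can be sketched in Python as follows. This is an illustration only, with a hypothetical function name; it assumes that the adopted form is identified by its number of octets (2 or 4) and that octets are serialized most significant first.

```python
def pad_bit_combination(bit_combination, octets_in_form):
    """Pad one ISO/IEC 6429 bit combination with leading zero octets."""
    if octets_in_form not in (2, 4):
        raise ValueError("adopted form must use 2 or 4 octets")
    if not 0 <= bit_combination <= 0xFF:
        raise ValueError("a bit combination is a single octet")
    return bit_combination.to_bytes(octets_in_form, "big")


FORM_FEED = 0x0C
two_octet = pad_bit_combination(FORM_FEED, 2)    # octets 00 0C
four_octet = pad_bit_combination(FORM_FEED, 4)   # octets 00 00 00 0C
```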
For escape sequences, control sequences, and control strings (see ISO/IEC 6429) consisting of a coded control character followed by additional bit combinations in the range 20 to 7F, each bit combination shall be padded by octet(s) with value 00. For example, the escape sequence "ESC 02/00 04/00" is represented by "001B 0020 0040" in the two-octet form, and "0000 001B 0000 0020 0000 0040" in the four-octet form.

NOTE - The term “character” appears in the definition of many of the control functions specified in ISO/IEC 6429, to identify the elements on which the control functions will act. When such control functions are applied to coded characters according to ISO/IEC 10646 the action of those control functions will depend on the type of element from ISO/IEC 10646 that has been chosen, by the application, to be the element (or character) on which the control functions act. These elements may be chosen to be characters (non-combining characters and/or combining characters) or may be chosen in other ways (such as composite sequences) when applicable.

Code extension control functions for the ISO/IEC 2022 code extension techniques (such as designation escape sequence, single shift, and locking shift) shall not be used with this coded character set.

16 Declaration of identification of features

16.1 Purpose and context of identification

CC-data-elements conforming to ISO/IEC 10646 are intended to form all or part of a composite unit of coded information that is interchanged between an originator and a recipient.

The identification of ISO/IEC 10646 (including the form), the implementation level, and any subset of the coding space that have been adopted by the originator must also be available to the recipient. The route by which such identification is communicated to the recipient is outside the scope of ISO/IEC 10646.
However, some standards for interchange of coded information may permit, or require, that the coded representation of the identification applicable to the CC-data-element forms a part of the interchanged information. This clause specifies a coded representation for the identification of UCS with an implementation level and a subset of ISO/IEC 10646, and also of a C0 and a C1 set of control functions from ISO/IEC 6429 for use in conjunction with ISO/IEC 10646. Such coded representations provide all or part of an identification data element, which may be included in information interchange in accordance with the relevant standard.

If two or more of the identifications are present, the order of those identifications shall follow the order as specified in this clause.

NOTE - An alternative method of identification is described in annex N.

16.2 Identification of UCS coded representation form with implementation level

When the escape sequences from ISO/IEC 2022 are used, the identification of a coded representation form of UCS (see clause 13) and an implementation level (see clause 14) specified by ISO/IEC 10646 shall be by a designation sequence chosen from the following list:

ESC 02/05 02/15 04/00    UCS-2 with implementation level 1
ESC 02/05 02/15 04/01    UCS-4 with implementation level 1
ESC 02/05 02/15 04/03    UCS-2 with implementation level 2
ESC 02/05 02/15 04/04    UCS-4 with implementation level 2
ESC 02/05 02/15 04/05    UCS-2 with implementation level 3
ESC 02/05 02/15 04/06    UCS-4 with implementation level 3

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall consist only of the sequences of bit combinations as shown above.

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 15.
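The list in 16.2 lends itself to a table lookup. The Python sketch below is illustrative only: the names `octet`, `FINAL_OCTET` and `designation_sequence` are hypothetical, and it assumes that the notation xx/yy denotes column xx and row yy of the code table, so that the octet value is 16 × xx + yy, and that ESC is the bit combination 01/11.

```python
ESC = 0x1B  # ESC is assumed to be bit combination 01/11


def octet(column_row: str) -> int:
    """Convert column/row notation such as '02/05' to an octet value (0x25)."""
    column, row = (int(part) for part in column_row.split("/"))
    return column * 16 + row


# Final octets from the list in 16.2, keyed by (form, implementation level).
FINAL_OCTET = {
    ("UCS-2", 1): "04/00", ("UCS-4", 1): "04/01",
    ("UCS-2", 2): "04/03", ("UCS-4", 2): "04/04",
    ("UCS-2", 3): "04/05", ("UCS-4", 3): "04/06",
}


def designation_sequence(form: str, level: int) -> bytes:
    """Return the unpadded designation sequence, as used within ISO/IEC 2022 data."""
    final = FINAL_OCTET[(form, level)]
    return bytes([ESC, octet("02/05"), octet("02/15"), octet(final)])


print(designation_sequence("UCS-2", 3).hex().upper())  # 1B252F45
```

Within a CC-data-element conforming to ISO/IEC 10646, each octet of the result would additionally be zero-padded as required by clause 15.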
16.3 Identification of subsets of graphic characters

When the control sequences of ISO/IEC 6429 are used, the identification of subsets (see clause 12) specified by ISO/IEC 10646 shall be by a control sequence IDENTIFY UNIVERSAL CHARACTER SUBSET (IUCS) as shown below.

CSI Ps... 02/00 06/13

Ps... means that there can be any number of selective parameters. The parameters are to be taken from the subset collection numbers as shown in annex A of each part of ISO/IEC 10646. When there is more than one parameter, each parameter value is separated by an octet with value 03/11. Parameter values are represented by digits where octet values 03/00 to 03/09 represent digits 0 to 9.

If such a control sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall consist only of the sequences of bit combinations as shown above.

If such a control sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 15.

16.4 Identification of control function set

When the escape sequences from ISO/IEC 2022 are used, the identification of each set of control functions (see clause 15) of ISO/IEC 6429 to be used in conjunction with ISO/IEC 10646 shall be an identifier sequence of the type shown below.

ESC 02/01 04/00    identifies the full C0 set of ISO/IEC 6429
ESC 02/02 04/03    identifies the full C1 set of ISO/IEC 6429

For a subset of C0 or C1 sets, the final octet F shall be obtained from the International Register of Coded Character Sets. The identifier sequences for these sets shall be:

ESC 02/01 F    identifies a C0 set
ESC 02/02 F    identifies a C1 set

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall consist only of the sequences of bit combinations as shown above.

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 15.
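As an illustration of the IUCS control sequence of 16.3, the following Python sketch assembles the unpadded octets. The function name `iucs_sequence` is hypothetical. The coded representation of CSI is not given in this clause; the sketch assumes the 8-bit representation 09/11 of ISO/IEC 6429 and lets the caller supply another one. The collection numbers 7 and 25 serve only as sample parameter values.

```python
def iucs_sequence(collection_numbers, csi: bytes = b"\x9b") -> bytes:
    """Build CSI Ps... 02/00 06/13 for the given collection numbers."""
    parameters = []
    for number in collection_numbers:
        if not isinstance(number, int) or number < 0:
            raise ValueError("collection numbers are non-negative integers")
        # Digits 0 to 9 are represented by octet values 03/00 to 03/09.
        parameters.append(str(number).encode("ascii"))
    # Parameter values are separated by an octet with value 03/11.
    return csi + b"\x3b".join(parameters) + bytes([0x20, 0x6D])


print(iucs_sequence([7, 25]).hex().upper())  # 9B373B3235206D
```

A sequence built this way is in the form used within ISO/IEC 2022 data; the padding of clause 15 is not applied here.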
16.5 Identification of the coding system of ISO/IEC 2022

When the escape sequences from ISO/IEC 2022 are used, the identification of a return, or transfer, from UCS to the coding system of ISO/IEC 2022 shall be by the escape sequence ESC 02/05 04/00.

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be padded in accordance with clause 15.

If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall consist only of the sequences of bit combinations as shown above.

NOTE - Escape sequence ESC 02/05 04/00 is normally used for return to the restored state of ISO/IEC 2022. The escape sequence ESC 02/05 04/00 specified here is sometimes not exactly as specified in ISO/IEC 2022 due to the presence of padding octets. For this reason the escape sequences in 16.2 for the identification of UCS include the octet 02/15 to indicate that the return does not always conform to that standard.

17 Structure of the code tables and lists

The clauses 26 and 27 set out the detailed code tables and the lists of character names for the graphic characters. Together, these specify graphic characters, their coded representation, and the character name for each character.

The graphic symbols are to be regarded as typical visual representations of the characters. ISO/IEC 10646 does not attempt to prescribe the exact shape of each character. The shape is affected by the design of the font employed, which is outside the scope of ISO/IEC 10646.

Graphic characters specified in ISO/IEC 10646 are uniquely identified by their names. This does not imply that the graphic symbols by which they are commonly imaged are always different. Examples of graphic characters with similar graphic symbols are LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA, and CYRILLIC CAPITAL LETTER A.
The meaning attributed to any character is not specified by ISO/IEC 10646; it may differ from country to country, or from one application to another.

For the alphabetic scripts, the general principle has been to arrange the characters within any row in approximate alphabetic sequence; where the script has capital and small letters, these are arranged in pairs. However, this general principle has been overridden in some cases. For example, for those scripts for which a relevant standard exists, the characters are allocated according to that standard. This arrangement within the code tables will aid conversion between the existing standards and this coded character set. In general, however, it is anticipated that conversion between this coded character set and any other coded character set will use a table lookup technique.

It is not intended, nor will it often be the case, that the characters needed by any one user will be found all grouped together in one part of the code table.

Furthermore, the user of any script will find that needed characters may have been coded elsewhere in this coded character set. This especially applies to the digits, to the symbols, and to the use of Latin letters in dual-script applications.

Therefore, in using this coded character set, the reader is advised to refer first to the block names list in annex A.2 or an overview of the BMP in figures 3 and 4, and then to turn to the specific code table rows for the relevant script and for symbols and digits. In addition, annex G contains an alphabetically sorted list of character names.

18 Block names

Named blocks of contiguous code positions are specified within a plane for the purpose of allocation of characters sharing some common characteristic, such as script. The blocks specified within the BMP are listed in A.2 of Annex A, and are illustrated in Figures 3 and 4.
19 Characters in bi-directional context

A class of left/right handed pairs of characters have special significance in the context of bi-directional text. In this context the terms LEFT or RIGHT in the character name are also intended to imply "opening" or "closing" forms of character shape, rather than a strict left-hand or right-hand form. These characters are listed below.

Code position    Name
0028             LEFT PARENTHESIS
0029             RIGHT PARENTHESIS
005B             LEFT SQUARE BRACKET
005D             RIGHT SQUARE BRACKET
007B             LEFT CURLY BRACKET
007D             RIGHT CURLY BRACKET
2045             LEFT SQUARE BRACKET WITH QUILL
2046             RIGHT SQUARE BRACKET WITH QUILL
207D             SUPERSCRIPT LEFT PARENTHESIS
207E             SUPERSCRIPT RIGHT PARENTHESIS
208D             SUBSCRIPT LEFT PARENTHESIS
208E             SUBSCRIPT RIGHT PARENTHESIS
2329             LEFT-POINTING ANGLE BRACKET
232A             RIGHT-POINTING ANGLE BRACKET
3008             LEFT ANGLE BRACKET
3009             RIGHT ANGLE BRACKET
300A             LEFT DOUBLE ANGLE BRACKET
300B             RIGHT DOUBLE ANGLE BRACKET
300C             LEFT CORNER BRACKET
300D             RIGHT CORNER BRACKET
300E             LEFT WHITE CORNER BRACKET
300F             RIGHT WHITE CORNER BRACKET
3010             LEFT BLACK LENTICULAR BRACKET
3011             RIGHT BLACK LENTICULAR BRACKET
3014             LEFT TORTOISE SHELL BRACKET
3015             RIGHT TORTOISE SHELL BRACKET
3016             LEFT WHITE LENTICULAR BRACKET
3017             RIGHT WHITE LENTICULAR BRACKET
3018             LEFT WHITE TORTOISE SHELL BRACKET
3019             RIGHT WHITE TORTOISE SHELL BRACKET
301A             LEFT WHITE SQUARE BRACKET
301B             RIGHT WHITE SQUARE BRACKET

The interpretation and rendering of any of these characters depend on the state related to the symmetric swapping characters (see F.2.2) and on the direction of the character being rendered that are in effect at the point in the CC-data-element where the coded representation of the character appears.

For example, if the character ACTIVATE SYMMETRIC SWAPPING occurs and if the direction of the character is from right to left, the character shall be interpreted as if the term LEFT or RIGHT in its name had been replaced by the term RIGHT or LEFT, respectively.
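The name substitution described in the example above can be sketched in Python. The function `interpreted_name` and its two Boolean parameters are hypothetical; the sketch models only the replacement of the terms LEFT and RIGHT in a name and does not check that the name belongs to the list of this clause.

```python
import re

_OPPOSITE = {"LEFT": "RIGHT", "RIGHT": "LEFT"}


def interpreted_name(name: str, swapping_active: bool, right_to_left: bool) -> str:
    """Return the name under which a paired character is to be interpreted."""
    if not (swapping_active and right_to_left):
        return name
    # Replace the whole word LEFT by RIGHT and vice versa, wherever it occurs.
    return re.sub(r"\b(LEFT|RIGHT)\b", lambda match: _OPPOSITE[match.group(1)], name)


print(interpreted_name("LEFT PARENTHESIS", True, True))   # RIGHT PARENTHESIS
print(interpreted_name("LEFT PARENTHESIS", True, False))  # LEFT PARENTHESIS
```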
NOTE - In the context of Arabic bi-directional text, certain mathematical symbols may also have special significance (see annex E).

20 Special characters

There are some characters that do not have printable graphic symbols. These characters include space characters. They are

Code position    Name
0020             SPACE
00A0             NO-BREAK SPACE
2000             EN QUAD
2001             EM QUAD
2002             EN SPACE
2003             EM SPACE
2004             THREE-PER-EM SPACE
2005             FOUR-PER-EM SPACE
2006             SIX-PER-EM SPACE
2007             FIGURE SPACE
2008             PUNCTUATION SPACE
2009             THIN SPACE
200A             HAIR SPACE
3000             IDEOGRAPHIC SPACE

Currency symbols in ISO/IEC 10646 do not necessarily identify the currency of a country. For example, YEN SIGN can be used for Japanese yen and Chinese yuan. Also, DOLLAR SIGN is used in numerous countries including the United States of America.

There is a special class of characters called Alternate Format Characters which are included for compatibility with some industry practices. These are described in annex F.

21 Presentation forms of characters

Each presentation form of a character provides an alternative form, for use in a particular context, to the nominal form of the character or sequence of characters from the other zones of graphic characters. The transformation from the nominal form to the presentation forms may involve substitution, superimposition, or combination.

The rules for the superimposition, choice of differently shaped characters, or combination into ligatures, or conjuncts which are often of extreme complexity are not specified in ISO/IEC 10646.

In general, presentation forms are not intended to be used as a substitute for the nominal forms of the graphic characters specified elsewhere within this coded character set. However, specific applications may encode these presentation forms instead of the nominal forms for specific reasons among which is compatibility with existing devices.
The rules for searching, sorting, and other processing operations on presentation forms are outside the scope of ISO/IEC 10646.

Within the BMP these characters are mostly allocated to positions in rows FB to FF.

22 Compatibility characters

Compatibility characters are included in ISO/IEC 10646 primarily for compatibility with existing coded character sets to allow two-way code conversion without loss of information.

Within the BMP many of these characters are allocated to positions within rows F9, FA, FE, and FF, and within rows 31 and 33. Some compatibility characters are also allocated within other rows.

23 Order of characters

Usually, coded characters appear in a CC-data-element in logical order (logical or backing store order corresponds approximately to the order in which characters are entered from the keyboard, after corrections such as insertions, deletions, and overtyping have taken place). This applies even when characters of different dominant direction are mixed: left-to-right (Greek, Latin, Thai) with right-to-left (Arabic, Hebrew), or with vertical (Mongolian) script.

Some characters may not appear linearly in final rendered text. For example, the medial form of the short i in Devanagari is displayed before the character that it logically follows in the CC-data-element.

24 Combining characters

This clause specifies the use of combining characters. A list of combining characters is shown in clause B.1. A list of combining characters not allowed in implementation level 2 is shown in clause B.2.

NOTE - The names of many script-independent combining characters contain the word "COMBINING".

24.1 Order of combining characters

Coded representations of combining characters shall follow that of the graphic character with which they are associated (for example, coded representations of LATIN SMALL LETTER A followed by COMBINING TILDE represent a composite sequence for Latin "ã").
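The ordering rule can be illustrated with a short Python sketch. The list of combining characters in annex B is not reproduced here, so the sketch uses the general category reported by Python's `unicodedata` module (categories beginning with "M") as an assumed stand-in for it; the function names are hypothetical.

```python
import unicodedata


def is_combining(character: str) -> bool:
    """Approximate test for a combining character (assumption: category M*)."""
    return unicodedata.category(character).startswith("M")


def starts_with_base(sequence: str) -> bool:
    """True if the sequence does not begin with a combining character."""
    return bool(sequence) and not is_combining(sequence[0])


composite = "\u0061\u0303"  # LATIN SMALL LETTER A followed by COMBINING TILDE
for character in composite:
    print(f"{ord(character):04X} {unicodedata.name(character)}")
print(starts_with_base(composite))  # True
```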
If a combining character is to be regarded as a composite sequence in its own right, it shall be coded as a composite sequence by association with the character SPACE. For example, grave accent can be composed as SPACE followed by COMBINING GRAVE ACCENT.

NOTE - Indic matras form a special category of combining characters, since the presentation can depend on more than one of the surrounding characters. Thus it might not be desirable to associate Indic matra with the character SPACE.

24.2 Appearance in code tables

Combining characters intended to be positioned relative to the associated character are depicted within the character code tables above, below, to the right of, to the left of, in, around, or through a dotted circle. In presentation, these characters are intended to be positioned relative to the preceding base character in some manner, and not to stand alone or function as base characters. This is the motivation for the term "combining".

Diacritics are the principal class of combining characters used in European alphabets.

In the code tables for some scripts, such as Hebrew, Arabic, and the scripts of India and South East Asia, combining characters are indicated in relation to dotted circles to show their position relative to the base character. Many of these combining characters encode vowel letters; as such they are not generally referred to as "diacritical marks".

24.3 Multiple combining characters

There are instances where more than one combining character is applied to a single graphic character. ISO/IEC 10646 does not restrict the number of combining characters that can follow a base character. The following rules shall apply:

a) If the combining characters can interact in presentation (for example, COMBINING MACRON and COMBINING DIAERESIS), then the position of the combining characters in the resulting graphic display is determined by the order of the coded representation of the combining characters.
The presentations of combining characters are to be positioned from the base character outward. For example, combining characters placed above a base character are stacked vertically, starting with the first encountered in the sequence of coded representations and continuing for as many marks above as are required by the coded combining characters following the coded base character. For combining characters placed below a base character, the situation is inverted, with the combining characters starting from the base character and stacking downward.

An example of multiple combining characters above the base character is found in Thai, where a consonant letter can have above it one of the vowels 0000 0E34 to 0000 0E37 and, above that, one of four tone marks 0000 0E48 to 0000 0E4B. The order of the coded representation is: base consonant, followed by a vowel, followed by a tone mark.

b) Some specific combining characters override the default stacking behaviour by being positioned horizontally rather than stacking, or by forming a ligature with an adjacent combining character. When positioned horizontally, the order of coded representations is reflected by positioning in the dominant order of the script with which they are used. For example, horizontal accents in a left-to-right script are coded left-to-right.

Prominent characters that show such override behaviour are associated with specific scripts or alphabets. For example, the COMBINING GREEK KORONIS (0000 0343) requires that, together with a following acute or grave accent, they be rendered side-by-side above a letter, rather than the accent marks being stacked above the COMBINING GREEK KORONIS. The order of the coded representations is: the letter itself, followed by that of the breathing mark, followed by that of the accent marks.
Two Vietnamese tone marks which have the same graphic appearance as the Latin acute and grave accent marks do not stack above the three Vietnamese vowel letters which already contain the circumflex diacritic (â, ê, ô). Instead, they form ligatures with the circumflex component of the vowel letters.

c) If the combining characters do not interact in presentation (for example, when one combining character is above a graphic character and another is below), the resultant graphic symbol from the base character and combining characters in different orders may appear the same. For example, the coded representations of LATIN SMALL LETTER A, followed by COMBINING CARON, followed by COMBINING OGONEK may result in the same graphic symbol as the coded representations of LATIN SMALL LETTER A, followed by COMBINING OGONEK, followed by COMBINING CARON.

Combining characters in Hebrew or Arabic scripts do not normally interact. Therefore, the sequence of their coded representations in a composite sequence does not affect its graphic symbol. The rules for forming the combined graphic symbol are beyond the scope of ISO/IEC 10646.

NOTE - Where combining characters are used for the generation of composite sequences in implementation level 3, this facility may be used to provide an alternative coded representation of text. For example, in implementation level 3 the French word "là" may be represented by the characters LATIN SMALL LETTER L followed by LATIN SMALL LETTER A WITH GRAVE, or may be represented by the characters LATIN SMALL LETTER L followed by LATIN SMALL LETTER A followed by COMBINING GRAVE ACCENT.

24.4 Collections containing combining characters

In some collections of characters listed in annex A, such as collections 14 (BASIC ARABIC) or 25 (THAI), both combining characters and non-combining characters are included.
When implementation level 1 or 2 is adopted, a CC-data-element shall not contain the coded representations of combining characters listed in annex B, even though the adopted subset may include them.

Other collections of characters listed in annex A comprise only combining characters, for example collection 7 (COMBINING DIACRITICAL MARKS). Such a collection shall not be included in the adopted subset when implementation level 1 is adopted.

25 Special features of individual scripts

25.1 Hangul syllable composition method

In rendering, a sequence of Hangul Jamo (from HANGUL JAMO block: 1100 to 11FF) are [...] end of the sequence, after the Hangul Jamo character which completes the syllable block.

25.2 Features of Indic alphabetic scripts

In the tables for Rows 09 to 0D and 0F, and for the MYANMAR block in Row 10, of the BMP (see 26) the graphic symbols shown for some characters appear to be formed as compounds of the graphic symbols for two other characters in the same table.

Examples:

Row 0B Tamil. The graphic symbol for 0B94 TAMIL LETTER AU appears as if it is constructed from the graphic symbols for:
0B93 TAMIL LETTER OO and
0BD7 TAMIL AU LENGTH MARK

Row 0D Malayalam. The graphic symbol for 0D4A MALAYALAM VOWEL SIGN O appears as if it is constructed from the graphic symbols for:
0D46 MALAYALAM VOWEL SIGN E and
0D3E MALAYALAM VOWEL SIGN AA

In such cases a single coded character may appear to the user to be equivalent to the sequence of two coded characters whose graphic symbols, when combined, are visually similar to the graphic symbol of that single character, as in a composite sequence (4.14).

In Levels 1 and 2 a "unique-spelling" rule shall apply. When this rule applies, no coded character from a table for Rows 09 to 0D or 0F, or for the MYANMAR block in Row 10, shall be regarded as equivalent to a sequence of two or more other coded characters taken from the same table.
NOTE - In Levels 1 and 2, if such a sequence occurs in a CC-data-element it is always made available to the user as two distinct characters in accordance with their respective character names.

26 Code tables and lists of character names

26.1 General

An overview of the Basic Multilingual Plane is shown in figure 3.

Detailed code tables and lists of character names for the Basic Multilingual Plane are shown on the following pages and in applicable Amendments.

Guidelines to be used for constructing names of characters are given in annex L for information. In some cases, a name of a character is followed by additional explanatory statements not part of the name. These statements are in parentheses and not in capital letters except for the initials of the word, where required.

26.2 Character names and annotations for Hangul syllables

Names for the Hangul syllable characters in code positions (hex) 0000 AC00 - 0000 D7A3 are derived from their code position numbers by the numerical procedure described below. Lists of names for these characters are not provided.

1. Obtain the code position number of the Hangul syllable character. It is of the form 0000 h1h2h3h4 where h1, h2, h3, and h4 are hexadecimal digits; h1h2 is the Row number within the BMP and h3h4 is the cell number within the row. The number h1h2h3h4 lies within the range AC00 to D7A3.

2. Derive the decimal numbers d1, d2, d3, d4 that are numerically equal to the hexadecimal digits h1, h2, h3, h4 respectively.

3. Calculate the character index C from the formula:

C = 4096 × (d1 - 10) + 256 × (d2 - 12) + 16 × d3 + d4

Note: If C < 0 or > 11,171 then the character is not in the HANGUL SYLLABLES block.

4. Calculate the syllable component indices I, P, F from the following formulae:

I = C / 588          (Note: 0 ≤ I ≤ 18)
P = (C % 588) / 28   (Note: 0 ≤ P ≤ 20)
F = C % 28           (Note: 0 ≤ F ≤ 27)

where "/" indicates integer division (i.e. x / y is the integer quotient of the division), and "%" indicates the modulo operation (i.e. x % y is the remainder after the integer division x / y).

5. Obtain the Latin character strings that correspond to the three indices I, P, F from columns 2, 3, and 4 respectively of Table 1 below (for I = 11 and for F = 0 the corresponding strings are null). Concatenate these three strings in left-to-right order to make a single string, the syllable-name.

6. The character name for the character at position 0000 h1h2h3h4 is then:

HANGUL SYLLABLE s-n

where "s-n" indicates the syllable-name string derived in step 5.

Example.

For the character in code position D4DE:
d1 = 13, d2 = 4, d3 = 13, d4 = 14.
C = 10462
I = 17, P = 16, F = 18.
The corresponding Latin character strings are: P, WI, BS.
The syllable-name is PWIBS, and the character name is:

HANGUL SYLLABLE PWIBS

Annotations for the Hangul syllable characters in code positions (hex) 0000 AC00 - 0000 D7A3 are also derived from their code position numbers by a similar numerical procedure described below.

7. Carry out steps 1 to 4 as described above.

8. Obtain the Latin character strings that correspond to the three indices I, P, F from columns 5, 6, and 7 respectively of Table 1 below (for I = 11 and for F = 0 the corresponding strings are null). Concatenate these three strings in left-to-right order to make a single string, and enclose it within parentheses to form the annotation.

Example.

For the character in code position D4DE:
d1 = 13, d2 = 4, d3 = 13, d4 = 14.
C = 10462
I = 17, P = 16, F = 18.
The corresponding Latin character strings are: ph, wi, ps, and the annotation is (phwips).
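The procedure of steps 1 to 8 can be sketched in Python as follows. This is an illustration, not part of the procedure itself: the function and variable names are hypothetical, and the string lists are transcribed from Table 1, with the null strings written as empty strings.

```python
# Syllable name elements (columns 2 to 4 of Table 1).
I_NAME = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
P_NAME = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
F_NAME = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB", "LS", "LT",
          "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C", "K", "T", "P", "H"]
# Annotation elements (columns 5 to 7 of Table 1).
I_NOTE = ["k", "kk", "n", "t", "tt", "r", "m", "p", "pp", "s",
          "ss", "", "c", "cc", "ch", "kh", "th", "ph", "h"]
P_NOTE = [s.lower() for s in P_NAME]  # columns 3 and 6 differ only in letter case
F_NOTE = ["", "k", "kk", "ks", "n", "nc", "nh", "t", "l", "lk", "lm", "lp", "ls", "lth",
          "lph", "lh", "m", "p", "ps", "s", "ss", "ng", "c", "ch", "kh", "th", "ph", "h"]


def syllable_indices(code_position: int) -> tuple[int, int, int]:
    """Steps 1 to 4: derive the indices I, P, F from a code position."""
    if not 0xAC00 <= code_position <= 0xD7A3:
        raise ValueError("not in the HANGUL SYLLABLES block")
    d1, d2, d3, d4 = (int(h, 16) for h in f"{code_position:04X}")
    c = 4096 * (d1 - 10) + 256 * (d2 - 12) + 16 * d3 + d4
    return c // 588, (c % 588) // 28, c % 28


def hangul_name(code_position: int) -> str:
    """Steps 5 and 6: build the character name."""
    i, p, f = syllable_indices(code_position)
    return "HANGUL SYLLABLE " + I_NAME[i] + P_NAME[p] + F_NAME[f]


def hangul_annotation(code_position: int) -> str:
    """Steps 7 and 8: build the annotation."""
    i, p, f = syllable_indices(code_position)
    return "(" + I_NOTE[i] + P_NOTE[p] + F_NOTE[f] + ")"


print(hangul_name(0xD4DE))        # HANGUL SYLLABLE PWIBS
print(hangul_annotation(0xD4DE))  # (phwips)
```

The two printed results reproduce the worked examples for code position D4DE.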
Table 1: Elements of Hangul syllable names and annotations

                Syllable name elements          Annotation elements
Index number    I string  P string  F string    I string  P string  F string
0               G         A                     k         a
1               GG        AE        G           kk        ae        k
2               N         YA        GG          n         ya        kk
3               D         YAE       GS          t         yae       ks
4               DD        EO        N           tt        eo        n
5               R         E         NJ          r         e         nc
6               M         YEO       NH          m         yeo       nh
7               B         YE        D           p         ye        t
8               BB        O         L           pp        o         l
9               S         WA        LG          s         wa        lk
10              SS        WAE       LM          ss        wae       lm
11                        OE        LB                    oe        lp
12              J         YO        LS          c         yo        ls
13              JJ        U         LT          cc        u         lth
14              C         WEO       LP          ch        weo       lph
15              K         WE        LH          kh        we        lh
16              T         WI        M           th        wi        m
17              P         YU        B           ph        yu        p
18              H         EU        BS          h         eu        ps
19                        YI        S                     yi        s
20                        I         SS                    i         ss
21                                  NG                              ng
22                                  J                               c
23                                  C                               ch
24                                  K                               kh
25                                  T                               th
26                                  P                               ph
27                                  H                               h

Row-octet
00 .. 33    Rows 00 to 33 (see Figure 4)
34 .. 4D    CJK Unified Ideographs Extension A
4E .. 9F    CJK Unified Ideographs
A0 .. A3    Yi Syllables
A4          Yi Radicals
A5 .. AB
AC .. D7