L2/00-150R2

Universal Multiple-Octet Coded Character Set (UCS)

ISO/IEC JTC 1/SC 2/WG 2 N 2230

Date: 2000-07-21

Source:	US national body (Author: V.S. Umamaheswaran)
Title:	Proposal for Unique Sequence Identifiers (USI-s) and repertoire specifications including these USI-s.
Action:	For consideration and adoption by WG 2
Status:	New proposal
Distribution:	ISO/IEC JTC 1/SC 2 and WG 2

Summary

This document proposes a new identifier, the Unique Sequence Identifier (USI), and its use in repertoire specifications, as an enhanced response to the proposal for a composition identifier in document SC2/WG2 N2189 and to other similar requirements.

Requirement

Outside the context of the Unicode (and ISO/IEC 10646) standard, there is a need to express collections of entities that are required by a specific application -- for example, to state all the letters (including accented letters), digits, symbols, etc. of a given national language such as Lithuanian. While most of the entities in such a collection may have a single code position allocated to them, others can be represented only as sequences of code positions. Such sequences could be either combining character sequences (for example, accented Latin letters) or sequences of coded characters (such as Philippine NG, or Swiss Ch). At present, such sequences do not have a standardized unique identifier in the same sense as the characters that are encoded in the standard. This requirement is expressed in the contribution from Finland and Germany (in document L2/00-089). When one examines the reasons behind the Lithuanian proposal, one of the driving requirements is again the desire to be able to state uniquely the repertoire required by Lithuania. While the elements of such sequences -- such as the combining accent marks that are needed -- do have coded representations (and hence unique identifiers) that could be referenced in a repertoire, the specific sequences cannot currently be assigned a standardized unique identifier.

Background

One of the principles of encoding fully composed characters in the Unicode Standard (and in ISO/IEC 10646) has been to include such a character only when it can be shown that a decomposed representation is not acceptable. A set of fully composed characters that could be decomposed was included in the first version of the standard for reasons of compatibility with the then-existing international, national or industry standards. There have been some recent proposals for adding fully composed characters, for example from Lithuania (see L2/99-349). These have not been accepted by the UTC or by ISO/IEC JTC1/SC2/WG2, for several reasons -- the primary reason being the implications of Normalization (see UTR #15, and L2/00-078).

Clause 6.5 of ISO/IEC 10646-1: 2000 contains a short identification mechanism to reference characters that are encoded in the standard. It can be summarized as:

The full syntax of the notation of a short identifier, in Backus-Naur form, is:

{ U | u } [ {+}xxxx | {-}xxxxxxxx ]

where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f).
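The notation above can be checked mechanically. The following is a minimal sketch, not part of the proposal: it assumes that braces mark an optional element and that the square brackets group the two alternative forms (four or eight hexadecimal digits). The name `parse_short_identifier` is hypothetical.

```python
import re

# Assumption: braces in the notation above mark an optional element, and the
# square brackets group the two alternative forms (four or eight hex digits).
_SHORT_ID = re.compile(r"[Uu]?(?:\+?([0-9A-Fa-f]{4})|-?([0-9A-Fa-f]{8}))")


def parse_short_identifier(text):
    """Return the code position named by a short identifier as an int."""
    match = _SHORT_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a short identifier: {text!r}")
    # Exactly one of the two groups matched: the 4-digit or the 8-digit form.
    return int(match.group(1) or match.group(2), 16)
```

Under this reading, `U+0041`, `0041`, `u-00000041` and `00000041` all name the same code position, while `U+41` is rejected because it has neither four nor eight digits.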

Annex A of 10646-1 contains identified collections of graphic characters for subsets of 10646. Each collection is specified either by enumerating individual code positions or range(s) of code positions (one form of the short identifier) within the standard, or as a union of individually enumerated collections or range(s) of identified collections, for example:

24 MALAYALAM 0D00 - 0D7F 200C 200D

250 GENERAL FORMAT CHARACTERS Collections 200 - 203

However, such enumerations are at present restricted to characters defined in the standard.
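As an illustration of this style of enumeration, a collection can be modelled as a list of inclusive code position ranges, with single positions written as one-element ranges. This is a sketch only: the names `MALAYALAM`, `in_collection` and `union` are hypothetical, and the two individual positions in the MALAYALAM entry are read here as 200C and 200D (an assumption).

```python
# A collection modelled as a list of inclusive (first, last) ranges.
# Assumption: the two individual positions of collection 24 are 200C and 200D.
MALAYALAM = [(0x0D00, 0x0D7F), (0x200C, 0x200C), (0x200D, 0x200D)]


def in_collection(code_position, collection):
    """True if the code position falls inside any range of the collection."""
    return any(first <= code_position <= last for first, last in collection)


def union(*collections):
    """A collection defined as the union of other collections."""
    merged = []
    for collection in collections:
        merged.extend(collection)
    return merged
```

Note that every member of such a collection is a single code position, which is exactly the limitation described above: there is no way to list a sequence as a member.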

Proposal

Paragraphs along the following lines are proposed to be added to an appropriate clause (such as clause 6.5), to Annex A, or to a separate Annex of the standard.

Note: This proposal is worded towards amending 10646-1: 2000. However, equivalent paragraphs should be considered for the Unicode standard also.

Unique Identification of Sequences

An entity that is represented by a sequence of 'n' code positions from the standard is identified by a Unique Sequence Identifier (USI) having the following form:

<UID1, UID2, UID3, ..., UIDn>

where UID1, UID2, etc. represent the unique identifiers of the corresponding characters from the standard, in the same sequence as needed to represent the identified entity. The syntax for UID1, UID2, … is specified in clause 6.5. A Comma (optionally followed by a Space character) separates the UIDs, and a pair of Angle Brackets encloses the whole sequence of UIDs.

Examples of such sequences are:

a composite sequence containing a base character plus one or more combining characters

a sequence of characters representing a conjunct

a sequence of characters representing a digraph or ligature

a sequence of standalone characters, or

a sequence of any mix of the above.

When multiple sequences may be used to represent the same entity, each such sequence is considered a separate USI. Either a choice must be made as to which of these sequences is needed, or distinguishing entity names should be assigned to differentiate between them.

Some examples:

Entity name	Sequence	USI
Philippine NG	(N G)	<004E, 0047>
Latin Small Letter U With Macron And Tilde	(u combining macron combining tilde)	<0075, 0304, 0303>
Malayalam SHWA	(sha virama va)	<0D36, 0D4D, 0D35>
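The USI values in the table above can be produced and read back mechanically. The following is a minimal sketch under stated assumptions: the function names `format_usi` and `parse_usi` are hypothetical, the four-digit form is used for code positions up to FFFF and the eight-digit form otherwise, and the UID pattern follows one reading of the clause 6.5 notation (braces taken as optional elements).

```python
import re

# Assumed reading of the clause 6.5 notation: optional U/u prefix, then an
# optional "+" with four hex digits or an optional "-" with eight hex digits.
_UID = re.compile(r"[Uu]?(?:\+?([0-9A-Fa-f]{4})|-?([0-9A-Fa-f]{8}))")


def format_usi(text):
    """Build a USI string from the characters of text, in order."""
    uids = []
    for ch in text:
        cp = ord(ch)
        uids.append(f"{cp:04X}" if cp <= 0xFFFF else f"{cp:08X}")
    return "<" + ", ".join(uids) + ">"


def parse_usi(usi):
    """Return the list of code positions named by a USI string."""
    if not (usi.startswith("<") and usi.endswith(">")):
        raise ValueError(f"missing angle brackets: {usi!r}")
    code_positions = []
    for part in usi[1:-1].split(","):
        # A comma may optionally be followed by one space character.
        if part.startswith(" "):
            part = part[1:]
        match = _UID.fullmatch(part)
        if match is None:
            raise ValueError(f"bad UID {part!r} in {usi!r}")
        code_positions.append(int(match.group(1) or match.group(2), 16))
    return code_positions
```

For instance, `format_usi("NG")` yields the Philippine NG entry of the table, and `parse_usi("<0D36, 0D4D, 0D35>")` recovers the three code positions of the Malayalam entry.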

Repertoires including Uniquely Identified Sequences:

In addition to the unique character identifiers from the standard, a repertoire definition may include entities represented by unique sequence identifiers as defined above -- for example, to specify a Lithuanian repertoire. Such a repertoire can be defined in any document, for example in a National Standard, or in a standard that defines all the possible sequences to represent all of Devanagari or Thai (including the specific valid conjuncts). When sufficient justification exists, such a repertoire may be proposed for inclusion in ISO/IEC 10646 as "an identified collection". To accommodate such a request, the definition of "collections" in the standard should be enhanced to specifically recognize the possibility of including Uniquely Identified Sequences in a collection.

Note: We have to keep in mind that 10646 collections are the only current standardized means of being able to identify repertoires which are subsets of 10646.

The above proposals should meet the stated requirements for repertoire definitions in document L2/00-089, and other such requirements.
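One way to picture a repertoire that mixes single characters and uniquely identified sequences is as a set of code position tuples, where a single character is a tuple of length one. The sketch below is illustrative only: the repertoire shown is a made-up fragment, not an actual Lithuanian repertoire, and the name `covers` and the greedy longest-match strategy are assumptions of this example rather than anything the proposal specifies.

```python
# Hypothetical fragment of a repertoire: two single characters and one
# sequence (u + combining macron + combining tilde).
REPERTOIRE = {
    (0x006B,),                 # k
    (0x0075,),                 # u
    (0x0075, 0x0304, 0x0303),  # <0075, 0304, 0303>
}


def covers(text, repertoire):
    """True if text splits, left to right, into entities of the repertoire.

    Uses greedy longest match, which is simple but can reject a text that a
    backtracking search would accept.
    """
    code_positions = tuple(ord(ch) for ch in text)
    longest = max((len(entity) for entity in repertoire), default=0)
    i = 0
    while i < len(code_positions):
        for size in range(min(longest, len(code_positions) - i), 0, -1):
            if code_positions[i:i + size] in repertoire:
                i += size
                break
        else:
            return False  # no entity of the repertoire starts at position i
    return True
```

With this fragment, the text k followed by u, combining macron and combining tilde is covered, whereas u followed by a combining macron alone is not, because that sequence is not listed as an entity.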

Naming of entities

Single names (as opposed to sequences of names) for entities that are represented by sequences will remain outside the scope of Unicode (and ISO/IEC 10646). A sequence of standardized names, corresponding to the elements of a Unique Sequence Identifier, may be used to reference a single name assigned by the referencing document, so as to make the correspondence unique.

Note: Situations may arise when an entity may be represented using more than one sequence -- for example, a multiple-accented character may be expressed as a sequence of an already encoded composed character and another combining accent, or as a completely decomposed sequence. The USI-s for these sequences will be different. Different entity names will be necessary to be able to reference the correct USI.
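The situation in the note can be made concrete with the u with macron and tilde example. The sketch below uses Python's standard `unicodedata` module; the helper name `usi` is hypothetical, and canonical decomposition (NFD) is used here only to show that the two sequences represent the same entity. It is not part of the identification mechanism proposed in this document.

```python
import unicodedata


def usi(text):
    """Format the code positions of text as a USI (four-digit form, BMP only)."""
    return "<" + ", ".join(f"{ord(ch):04X}" for ch in text) + ">"


# An already encoded composed character (u with macron) plus a combining tilde.
partly_composed = "\u016B\u0303"
# The completely decomposed sequence: u, combining macron, combining tilde.
fully_decomposed = "\u0075\u0304\u0303"

usi_a = usi(partly_composed)   # "<016B, 0303>"
usi_b = usi(fully_decomposed)  # "<0075, 0304, 0303>"

# The two USIs differ, yet canonical decomposition maps one sequence onto the other.
same_entity = unicodedata.normalize("NFD", partly_composed) == fully_decomposed
```

Because `usi_a` and `usi_b` are different strings, a repertoire that lists only one of them would need a distinguishing entity name, or an explicit statement of which sequence is intended.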

Reference Documents:

1. L2/99-349 Proposal to add Lithuanian accented letters to ISO/IEC 10646-1, SC2/WG2 N2075R, 1999-09-09

2. L2/00-089 Identification of decomposed characters in ISO/IEC 10646-1, Kolehmainen, Küster, SC2/WG2 N2189, 2000-03-14

3. L2/00-078 Implications of Normalization on Character Encoding (for addition to principles and procedures); Mark Davis, SC2/WG2 N2176, 2000-03-07

4. Clause 6.5 Short identifiers for characters, ISO/IEC 10646-1: 2000

5. ISO/IEC 10646-1: 2000 Annex A, Collections of Graphic Characters for Subsets