Code Point Name/Label Options

L2/08-423

Re: Code Point Name/Label Options

From: Mark Davis

Date: 2008-11-06

URL: http://docs.google.com/Doc?id=dfqr8rd5_357ftxjchgb

Here are the three options we discussed in the meeting.

Option A. Code Point Label (defined in [Whistler, L2/08-382]).

Option B. Define a Code Point Name property: Code_Point_Name (short name: CPName). This is a derived property defined in the same way as the Code Point Label in [Whistler, L2/08-382].

Option C. Expand the Name property to also cover code points (with values as defined in Ken's document) that had null values in U5.1.

===

In each of these options, the value would be as in [Whistler, L2/08-382], with the exception discussed in the meeting for the C0 controls.

Construction of Code Point Names/Labels

Type	Value (NNNN represents the code point)
C0 Controls	Field 10 of UnicodeData.txt without parentheticals, e.g., FORM FEED
C1 Controls	control-NNNN
Reserved	reserved-NNNN
Noncharacter	noncharacter-NNNN
Private-Use	private-use-NNNN
Surrogate	surrogate-NNNN
Others	Field 1 of UnicodeData or constructed values for Hangul Syllables or CJK Ideographs
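
The table above can be read as a small decision procedure. The following Python sketch is one way to express it, under stated assumptions: it classifies code points using the General_Category values from the standard unicodedata module, so results for unassigned code points depend on the Unicode version bundled with the interpreter. The names code_point_label, is_noncharacter, and C0_FIELD_10 are illustrative and not part of the proposal. Because unicodedata does not expose field 10 of UnicodeData.txt, only a few C0 values are copied in by hand.

```python
import re
import unicodedata

# Hand-copied excerpt of field 10 of UnicodeData.txt for a few C0 controls.
# Illustrative only: a real implementation would read the whole file.
C0_FIELD_10 = {
    0x0000: "NULL",
    0x000A: "LINE FEED (LF)",
    0x000C: "FORM FEED (FF)",
    0x000D: "CARRIAGE RETURN (CR)",
}

# General_Category values that map directly to a label prefix.
PREFIX_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def hex_digits(cp: int) -> str:
    """Format a code point with at least 4 hexadecimal digits (4 to 6 in practice)."""
    return f"{cp:04X}"


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int) -> str:
    """Return the name/label for a code point, following the table above."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp <= 0x001F:
        # C0 controls: field 10 without parentheticals.
        # Raises KeyError for C0 controls missing from the excerpt.
        return re.sub(r"\s*\([^)]*\)", "", C0_FIELD_10[cp])
    category = unicodedata.category(chr(cp))
    if category in PREFIX_BY_CATEGORY:
        return f"{PREFIX_BY_CATEGORY[category]}-{hex_digits(cp)}"
    if category == "Cn":
        # Cn covers both noncharacters and reserved (unassigned) code points.
        kind = "noncharacter" if is_noncharacter(cp) else "reserved"
        return f"{kind}-{hex_digits(cp)}"
    # Others: field 1 or a rule-derived name, both available via unicodedata.
    return unicodedata.name(chr(cp))
```

For example, code_point_label(0x000C) gives "FORM FEED" and code_point_label(0x009F) gives "control-009F".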

===

Changes if we do option C.

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

The detailed specification of the Unicode character names, including rules for derivation of some ranges of characters, is given in Section 4.8, "Name -- Normative". That section also describes the relationship between the normative value of the Name property and the contents of the corresponding data field in UnicodeData.txt in the Unicode Character Database.

[[Incorporate the following text in Section 4.8, "Name -- Normative", as a subsection, with appropriate editorial adjustments to other existing text in that section. ]]

Unicode Code Point Name

The Name property (short alias: "na") is a string property, defined as follows:

For Hangul syllables, the name is derived by rule, as specified in Section 3.12, under "Hangul Syllable Name Generation", making use of the values of the Jamo_Short_Name property.
For ideographs, the name is derived by rule, by concatenating the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or another prefix as specified, e.g. "TANGUT IDEOGRAPH-") with the code point, expressed in hexadecimal, with the usual 4 to 6 digit convention. The exact ranges subject to these name derivations are specified by a name range convention used in field 1 of UnicodeData.txt.
For other Graphic and Format characters, the name is as listed in field 1 of UnicodeData.txt.
For C0 control codes (U+0000..U+001F), the name is the one given in ISO 6429:1992, as listed in field 10 of UnicodeData.txt, with any parentheticals (such as "(FF)") removed.
For all other Unicode code points, the value of the UCD Name property is constructed by combining a prefix with the code point value, expressed in hexadecimal, with the usual 4 to 6 digit convention. The prefix is the lowercase form of the Code Point Type (control, reserved, noncharacter, private-use, or surrogate) followed by "-". For example: "control-009F", "surrogate-D800".
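
The rule-derived names in the list above can be sketched in Python as follows. The Hangul part uses the arithmetic decomposition from Section 3.12 together with the Jamo_Short_Name values. The helper names (hex_digits, hangul_syllable_name, ideograph_name, constructed_name) are illustrative. As a simplifying assumption, the ideograph helper takes the prefix as a parameter instead of looking up the applicable ranges in UnicodeData.txt.

```python
# Constants for the Hangul syllable block.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # syllables per leading consonant (588)
S_COUNT = L_COUNT * N_COUNT   # total precomposed syllables (11172)

# Jamo_Short_Name values for leading consonants, vowels, trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

# Prefixes allowed for constructed (non-graphic, non-format) names.
CONSTRUCTED_PREFIXES = {"control", "reserved", "noncharacter",
                        "private-use", "surrogate"}


def hex_digits(cp: int) -> str:
    """Format a code point with at least 4 hexadecimal digits (4 to 6 in practice)."""
    return f"{cp:04X}"


def hangul_syllable_name(cp: int) -> str:
    """Derive the name of a precomposed Hangul syllable by rule."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])


def ideograph_name(cp: int, prefix: str = "CJK UNIFIED IDEOGRAPH-") -> str:
    """Concatenate an ideograph prefix with the hexadecimal code point."""
    return prefix + hex_digits(cp)


def constructed_name(prefix: str, cp: int) -> str:
    """Build names such as 'control-009F' from a type prefix and code point."""
    if prefix not in CONSTRUCTED_PREFIXES:
        raise ValueError(f"unexpected prefix: {prefix!r}")
    return f"{prefix}-{hex_digits(cp)}"
```

For instance, hangul_syllable_name(0xD4DB) yields "HANGUL SYLLABLE PWILH", and constructed_name("surrogate", 0xD800) yields "surrogate-D800".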

When displayed in mixed contexts, to emphasize the distinction between graphic/format code point names and others, the latter are often displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.

Note that the Unicode Name property values are unique for all code points. Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

The Name property values for all but reserved code points will not be changed. The Name property values for reserved code points will change if a character is assigned to the code point. For more information, see the Unicode Encoding Stability Policies.

As a corollary to this specification, it should be noted that the value of field 1 (the string of characters between the semicolon separators) is to be taken as the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables. All other values that occur in field 1 are labels that serve other functions in the generation of names lists and charts, or that label abbreviated ranges of property definitions, but they do not constitute values of the UCD Name property per se.

For any encoded character, the term "Character name" refers to the Code Point Name.

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

See Table 2-3, "Types of Code Points" for a summary of the meaning and use of each class.
For Noncharacter, see also D14 Noncharacter.
For Reserved, see also D15 Reserved code point.
For Private-Use, see also D49 Private-use code point.
For Surrogate, see also D71 High-surrogate code point and D73 Low-surrogate code point.
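
As a rough illustration of D10a, the seven classes can be modeled as an enumeration and derived from General_Category, following the correspondence summarized in Table 2-3. This is a sketch under assumptions: the names CodePointType and code_point_type are hypothetical, and the split between Reserved and assigned code points reflects the Unicode version shipped with the Python interpreter rather than any particular version of the standard.

```python
import unicodedata
from enum import Enum


class CodePointType(Enum):
    GRAPHIC = "Graphic"
    FORMAT = "Format"
    CONTROL = "Control"
    PRIVATE_USE = "Private-Use"
    SURROGATE = "Surrogate"
    NONCHARACTER = "Noncharacter"
    RESERVED = "Reserved"


def code_point_type(cp: int) -> CodePointType:
    """Classify a code point into one of the seven Code Point Types."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return CodePointType.CONTROL
    if category in ("Cf", "Zl", "Zp"):
        return CodePointType.FORMAT
    if category == "Co":
        return CodePointType.PRIVATE_USE
    if category == "Cs":
        return CodePointType.SURROGATE
    if category == "Cn":
        # Cn is shared by noncharacters and reserved code points.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return CodePointType.NONCHARACTER
        return CodePointType.RESERVED
    # Remaining categories (L, M, N, P, S, Zs) are Graphic.
    return CodePointType.GRAPHIC
```

The lowercase form of a non-graphic, non-format type's value (for example CodePointType.PRIVATE_USE.value.lower(), which is "private-use") matches the prefix used in constructed names.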

[[The current stability policy is:]]

Once a character is encoded, its character name will not be changed.

[[A request should be made to the officers to extend this to:]]

The Unicode Name Property Value for any non-reserved code point will not be changed. In particular, once a character is encoded its name will not be changed.