Appendix A
Notational Conventions
This appendix describes the typographic conventions, the extended BNF, and the conventions for describing rendering rules that are used throughout this core specification.
#A.1 Typographic Conventions
#A.1.1 Code Points
In running text, an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
- U+0416 is the Unicode code point for the character named CYRILLIC CAPITAL LETTER ZHE.
The U+ may be omitted for brevity in tables or when denoting ranges. The U+ must be omitted when this code point convention is used in rule NR2, for cases where characters have names algorithmically derived from their code points. See “Unicode Name Property” in Section 4.8, Name.
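The padding rule above can be sketched in a few lines of Python. This is only an illustration: the function name `format_code_point` and its `prefix` parameter are hypothetical and not part of the standard.

```python
def format_code_point(cp: int, prefix: bool = True) -> str:
    """Render a code point in the U+n notation (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # Pad to at least four uppercase hex digits; longer values are
    # written without any extra leading zeros.
    digits = f"{cp:04X}"
    return f"U+{digits}" if prefix else digits


examples = [0x1, 0x12, 0x123, 0x1234, 0x12345, 0x102345]
print([format_code_point(cp) for cp in examples])
# The prefix can be dropped, as in tables or ranges.
print(format_code_point(0x0416, prefix=False))
```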
A range of Unicode code points is expressed as U+xxxx–U+yyyy or U+xxxx..U+yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the en dash or two dots indicate a contiguous range inclusive of the endpoints. For ranges involving supplementary characters, the code points in the ranges are expressed with five or six hexadecimal digits.
- The range U+0900–U+097F contains 128 Unicode code points.
- The Plane 16 private-use characters are in the range U+100000..U+10FFFD.
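Because ranges include both endpoints, the number of code points in a range is last − first + 1. The following sketch parses the two range notations and checks the counts in the examples above; `parse_range` is a hypothetical helper name, and the sketch assumes both endpoints are written with four to six uppercase hexadecimal digits.

```python
import re

# Either separator named in the text: an en dash or two dots.
# The U+ prefix is optional, since it may be omitted when denoting ranges.
_RANGE = re.compile(
    r"(?:U\+)?([0-9A-F]{4,6})(?:–|\.\.)(?:U\+)?([0-9A-F]{4,6})"
)


def parse_range(text: str) -> range:
    """Return the inclusive code point range written as `text`."""
    match = _RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a code point range: {text!r}")
    first, last = (int(group, 16) for group in match.groups())
    if first > last or last > 0x10FFFF:
        raise ValueError(f"invalid endpoints in {text!r}")
    # range() excludes its stop value, so add 1 to include the last code point.
    return range(first, last + 1)


print(len(parse_range("U+0900–U+097F")))       # 128
print(len(parse_range("U+100000..U+10FFFD")))  # 65534
```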
#A.1.2 Character Names
In running text, a formal Unicode name is shown in small capitals (for example, GREEK SMALL LETTER MU), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a non-English word (for example, the Welsh word ynghyd).
For more information on Unicode character names, see Section 4.8, Name.
For notational conventions used in the code charts, see Section 24.1, Character Names List.
#A.1.3 Character Blocks
When referring to the normative names of character blocks in the text of the standard, the character block name is titlecased and is used with the term “block.” For example:
the Latin Extended-B block
Optionally, an exact range for the character block may also be cited:
the Alphabetic Presentation Forms block (U+FB00..U+FB4F)
These references to normative character block names should not be confused with the headers used throughout the text of the standard, particularly in the block description chapters, to refer to particular ranges of characters. Such headers may be abbreviated in various ways and may refer to subranges within character blocks or ranges that cross character block boundaries. For example:
Latin Ligatures: U+FB00–U+FB06
The definitive list of normative character block names is Blocks.txt in the Unicode Character Database.
#A.1.4 Sequences
A sequence of two or more code points may be represented by a comma-delimited list, set off by angle brackets. For this purpose, angle brackets consist of U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN. Spaces are optional after the comma, and U+ notation for the code point is also optional—for example, “<U+0061, U+0300>”.
When the usage is clear from the context, a sequence of characters may be represented with generic short names, as in “<a, grave>”, or the angle brackets may be omitted.
In contrast to sequences of code points, a sequence of one or more code units may be represented by a list set off by angle brackets, but without comma delimitation or U+ notation. For example, the notation “<nn nn nn nn>” represents a sequence of bytes, as for the UTF-8 encoding form of a Unicode character. The notation “<nnnn nnnn>” represents a sequence of 16-bit code units, as for the UTF-16 encoding form of a Unicode character.
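To make the contrast between the two notations concrete, the sketch below renders the same text as a code point sequence and as UTF-8 and UTF-16 code unit sequences. The helper names are hypothetical, and the use of uppercase hexadecimal digits for code units is an assumption made for consistency with the code point notation.

```python
def code_point_sequence(text: str) -> str:
    """Comma-delimited code points in angle brackets, with U+ notation."""
    return "<" + ", ".join(f"U+{ord(ch):04X}" for ch in text) + ">"


def utf8_units(text: str) -> str:
    """Space-separated bytes in angle brackets, no commas and no U+."""
    return "<" + " ".join(f"{byte:02X}" for byte in text.encode("utf-8")) + ">"


def utf16_units(text: str) -> str:
    """Space-separated 16-bit code units in angle brackets."""
    data = text.encode("utf-16-be")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "<" + " ".join(f"{unit:04X}" for unit in units) + ">"


print(code_point_sequence("a\u0300"))  # <U+0061, U+0300>
print(utf8_units("\u0416"))            # <D0 96>
print(utf16_units("\U00010000"))       # <D800 DC00>
```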
#A.1.5 Properties and Property Values
The names of properties and property values appear in titlecase, with words connected by an underscore—for example, General_Category or Uppercase_Letter. In some instances, short names are used, such as gc = Lu, which is equivalent to General_Category = Uppercase_Letter. Long and short names for all properties and property values are defined in the Unicode Character Database; see also Section 3.5, Properties.
Occasionally, and especially when discussing character properties that have single words as names, such as age and block, the names appear in lowercase italics.
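As an illustration of the long and short naming forms, Python's standard `unicodedata.category` function returns the short General_Category value for a character. The small table of long names below is a hand-written excerpt chosen for this example, not a complete list, and `describe_gc` is a hypothetical helper.

```python
import unicodedata

# Excerpt only: short General_Category values mapped to their long names.
GC_LONG_NAMES = {
    "Lu": "Uppercase_Letter",
    "Ll": "Lowercase_Letter",
    "Mn": "Nonspacing_Mark",
    "Nd": "Decimal_Number",
}


def describe_gc(ch: str) -> tuple:
    """Return the short and long forms of a character's General_Category."""
    short = unicodedata.category(ch)
    # Fall back to the short value when the excerpt has no long name for it.
    long_name = GC_LONG_NAMES.get(short, short)
    return f"gc = {short}", f"General_Category = {long_name}"


print(describe_gc("A"))       # ('gc = Lu', 'General_Category = Uppercase_Letter')
print(describe_gc("\u0300"))  # ('gc = Mn', 'General_Category = Nonspacing_Mark')
```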
#A.1.6 Miscellaneous
Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.
Phonetic transcriptions are shown between square brackets, using the International Phonetic Alphabet. (Full details on the IPA can be found on the International Phonetic Association’s website, https://www.internationalphoneticassociation.org/.)
A leading asterisk is used to represent an incorrect or nonoccurring linguistic form.
In this specification, the word “Unicode” when used alone as a noun refers to the Unicode Standard.
Unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
The term byte, as used in this standard, always refers to a unit of eight bits. This corresponds to the use of the term octet in some other standards.
#A.1.7 Operators
Operators used in this standard are listed in Table A-1.
Symbol | Meaning |
---|---|
→ | is transformed to, or behaves like |
↛ | is not transformed to |
¬ | logical not |
#A.2 Extended BNF
The Unicode Standard and technical reports use an extended BNF format for describing syntax. This format uses elements from the regular expression syntax specified in Unicode Technical Standard #18, “Unicode Regular Expressions”; however, a BNF is not a regular expression and may be interpreted differently even when it looks like one. Because conventions for BNF vary, Table A-2 lists the notation used here.
Symbols | Meaning |
---|---|
x := ... | production rule |
x y | the sequence consisting of x then y |
x* | zero or more occurrences of x |
x? | zero or one occurrence of x |
x+ | one or more occurrences of x |
x \| y | either x or y |
( x ) | for grouping |
{ x } | equivalent to (x)? |
"abc" | string literals (“_” is sometimes used to denote space for clarity) |
'abc' | string literals (alternative form) |
sot | start of text |
eot | end of text |
\u{HHHHHH} | Unicode code points within string literals or character classes. Between one and six hexadecimal digits; maximum \u{10FFFF}. |
\uHHHH | Unicode BMP code points within string literals or character classes. Exactly four hexadecimal digits. |
U+HHHHHH | Unicode code point literal: equivalent to “\u{HHHHHH}”. Between four and six hexadecimal digits; maximum U+10FFFF. |
U-00HHHHHH | Unicode code point literal: equivalent to “\u{HHHHHH}”. Exactly six hexadecimal digits after the initial two zeroes; maximum U+10FFFF. This format was used in ISO 10646 but is now obsolete. |
H | Hexadecimal digit, 0-9 or A-F |
[…], \p{…} | code point or character class (syntax below) |
In other environments, such as programming languages or markup, alternative notation for sequences of code points or code units may be used.
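The four code point notations in Table A-2 differ only in their prefix and in the number of hexadecimal digits they allow. A minimal Python sketch of a parser for them follows; the name `parse_code_point_literal` is hypothetical, and the sketch assumes uppercase hexadecimal digits only, following the table's definition of H.

```python
import re

_LITERAL = re.compile(
    r"\\u\{([0-9A-F]{1,6})\}"   # \u{HHHHHH}: one to six digits
    r"|\\u([0-9A-F]{4})"        # \uHHHH: exactly four digits (BMP only)
    r"|U\+([0-9A-F]{4,6})"      # U+HHHHHH: four to six digits
    r"|U-00([0-9A-F]{6})"       # U-00HHHHHH: six digits after two zeroes
)


def parse_code_point_literal(text: str) -> int:
    """Return the code point denoted by one of the Table A-2 notations."""
    match = _LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not a code point literal: {text!r}")
    # Exactly one alternative matched; take its captured digits.
    digits = next(group for group in match.groups() if group is not None)
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"beyond U+10FFFF: {text!r}")
    return code_point


for literal in (r"\u{A980}", r"\u0030", "U+10FFFF", "U-0010FFFF"):
    print(literal, "->", hex(parse_code_point_literal(literal)))
```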
#A.2.1 Character Classes
A code point class is a set of code points. When the code points are all assigned characters, it can also be referred to as a character class. Its specification can be based on any of the following:
- A literal code point or a range of literal code points.
- A set of code points having a given value for a given Unicode character property, as defined in the Unicode Character Database (see PropertyAliases.txt and PropertyValueAliases.txt).
- Set operations on character classes.
Further extensions to this specification of character classes are used in some Unicode Standard Annexes and Unicode Technical Reports. Such extensions are described in those documents, as appropriate.
A partial formal BNF syntax for character classes as used in this standard is given by the following:
CHARACTER_CLASS := '[' COMPLEMENT? SET ']' | '\p{' PROP_SPEC '}'
COMPLEMENT := '^'
SET := ITEM (SET_EXTEND)*
ITEM := LITERAL (RANGE_OPERATOR LITERAL)? | CHARACTER_CLASS
RANGE_OPERATOR := '-' | '..'
SET_EXTEND := SET_OPERATOR CHARACTER_CLASS | ','? ITEM
SET_OPERATOR := '--'
PROP_SPEC := PROP_NAME (RELATION PROP_VALUE)?
RELATION := '=' | '≠'
If COMPLEMENT is specified, the resulting code point set is the set of all Unicode code points (U+0000..U+10FFFF) except the code points given by SET. A LITERAL can be a Unicode code point escape sequence, a Unicode code point literal, or a character itself. The operator “--” indicates set difference (older documents may use “-”). A PROP_NAME must be a valid Unicode property name or alias. A PROP_VALUE must be a valid property value for the PROP_NAME it is used with. If a PROP_NAME is used by itself, without a RELATION and PROP_VALUE, the property must be a Boolean property, the relation is assumed to be “=” and the value to be True.
In prose where the context makes clear that a property-based character class is being discussed, \p{PROP_NAME=PROP_VALUE} may be simplified to PROP_NAME=PROP_VALUE.
Whenever any character could be interpreted as a syntax character, it must be escaped. If a space character is used as a literal, it is escaped. The interpretation of spaces differs from that in regular expressions, so that in the examples below spaces have to be removed in order to obtain equivalent regular expressions. Examples are found in Table A-3.
Syntax | Matches |
---|---|
[a-z] | English lowercase letters |
[a-z -- c] | English lowercase letters except for c |
[0-9] | European decimal digits |
[\u0030-\u0039] | (same as above, using Unicode escapes) |
[0-9 A-F a-f] | hexadecimal digits |
[\p{gc=Letter} \p{gc=Nonspacing_Mark}] | all letters and nonspacing marks |
[\p{gc=L} \p{gc=Mn}] | (same as above, using abbreviated notation) |
[^\p{gc=Unassigned}] | all assigned Unicode characters |
[\u{A980}-\u{A9DF} -- \p{gc=Unassigned}] | all assigned characters in the main Javanese range |
[\p{Alphabetic}] | all alphabetic characters |
[^\p{Line_Break=Infix_Numeric}] | all code points that do not have the line break property of Infix_Numeric |
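Several rows of Table A-3 can be modeled directly with Python sets, which makes the meaning of juxtaposition (union), “--” (set difference), and “^” (complement) explicit. This is a sketch rather than a parser: the helper names are hypothetical, only the General_Category property is covered, and the property data comes from Python's `unicodedata` module, whose Unicode version depends on the Python release.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF


def literal_range(first: str, last: str) -> set:
    """[x-y]: every code point from x through y, inclusive."""
    return set(range(ord(first), ord(last) + 1))


def gc_class(short_value: str, universe=range(MAX_CODE_POINT + 1)) -> set:
    """Code points in `universe` whose gc short name starts with short_value."""
    return {
        cp for cp in universe
        if unicodedata.category(chr(cp)).startswith(short_value)
    }


def complement(members: set) -> set:
    """[^...]: all Unicode code points except those in `members`."""
    return set(range(MAX_CODE_POINT + 1)) - members


# [a-z -- c]
lower_except_c = literal_range("a", "z") - literal_range("c", "c")
# [0-9 A-F a-f]: juxtaposed items form a union
hex_digits = literal_range("0", "9") | literal_range("A", "F") | literal_range("a", "f")
# [\u{A980}-\u{A9DF} -- \p{gc=Unassigned}]
javanese_range = set(range(0xA980, 0xA9DF + 1))
javanese_assigned = javanese_range - gc_class("Cn", javanese_range)
# [^0-9]
non_digits = complement(literal_range("0", "9"))

print(len(lower_except_c), len(hex_digits), len(non_digits))  # 25 22 1114102
```

The size of `javanese_assigned` is deliberately not printed, because it depends on which Unicode version the running Python ships with.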
For more information about character classes, see Unicode Technical Standard #18, “Unicode Regular Expressions.”
#A.3 Rendering
A figure such as Figure A-1 depicts how a sequence of characters is typically rendered.
The sequence under discussion is depicted on the left of the arrow, using representative glyphs and code points below them. A possible rendering of that sequence is depicted on the right side of the arrow.