Unicode Properties in Character Proposals
Characters in the Unicode Standard have a number of properties.
Properties are used to determine character behavior in software. For example,
the case property—whether a character is uppercase or lowercase—will
affect how the character is used in software that performs capitalization of
words in English. A character's properties may identify whether it is a letter,
a number, a mark of punctuation, whether it belongs to a script that runs right
to left or left to right, and so forth. These properties are used for various
computer processes, such as capitalization, searching, and spell-checking. If
properties are incorrectly identified, text that is pasted into a document may
get reversed, the cursor may not work as expected, or text may not lay out
correctly on a page.
Some of these properties are obvious and easily discovered, and some are not.
Some properties are automatically assigned (such as Derived Age, which tells
when a character was added to the standard); others are easily assigned because
they are implicit in the character name or in other information easily supplied
by a proposal author. For general information on character properties, see
Chapter 4, Character Properties, in the Unicode Standard.
For reference, a more-or-less complete list of properties can be found online
here:
https://www.unicode.org/reports/tr44/#Properties
Property information must be supplied at the time new characters are
published in the standard. The questions and discussion below have
been developed to get proposal authors and committee members thinking about this
issue. For each character in a proposal, the proposal author should consider
the character in context, and answer questions about how the character interacts
with other characters.
The most basic information required about characters includes name, code point,
and other identity information, such as whether a character goes by more than
one name, or can be cross-referenced to another character. This information is
included in the names list of a proposal, accompanying glyphs of the proposed
characters.
Code points: The code points are typically assigned by the
standards committees (WG2 and UTC), but if the proposal is for an entire script,
it is probably already on the roadmap, and therefore a particular range of
code points may already have been pre-selected. In other cases, those proposing
characters can make recommendations about where the characters should be
encoded, but it isn't necessary to do so.
Names: If there are alternative names for a character or characters in
the proposal, these should also be discussed, as well as other information about
the meanings of names, and similarities in behavior to other characters that are
already encoded in the standard.
A sample listing of code points and names is the following:
0391 GREEK CAPITAL LETTER ALPHA
0392 GREEK CAPITAL LETTER BETA
0393 GREEK CAPITAL LETTER GAMMA
0394 GREEK CAPITAL LETTER DELTA
Each character is assigned a "General Category". The general category should
be specified in a separate "Character Properties" section of a proposal.
The general category properties are documented in the Unicode Character
Database (UCD). Typical categories include things such as "letter", "combining
mark", "symbol" and so forth. The preferred format for listing the character
properties is that found in the file https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt. The following line is
the UnicodeData.txt entry for Greek uppercase gamma:
0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;
One of the easiest ways to provide character properties is to find a similar
character that is already encoded, and copy its properties, inserting the
appropriate code point and name, and other changes as applicable. The listing of
all the characters and their properties is located in the file https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt.
Note: If the character property information is still puzzling, then describe
the character's use, answering the questions in the Appendix for each character.
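The copy-and-adapt approach described above can be sketched in a few lines of Python. This is only an illustration: the function name entry_from_template, the placeholder code points "XXXX" and "YYYY", and the name EXAMPLE CAPITAL LETTER GA are all hypothetical and do not refer to any encoded or proposed character.

```python
def entry_from_template(template_line, code_point, name, overrides=None):
    """Copy a UnicodeData.txt-style line, replacing code point and name.

    overrides maps a zero-based field index to a new value, for the
    "other changes as applicable" (for example a different case mapping).
    """
    fields = template_line.strip().split(";")
    fields[0] = code_point
    fields[1] = name
    for index, value in (overrides or {}).items():
        fields[index] = value
    return ";".join(fields)


# Template: the existing entry for GREEK CAPITAL LETTER GAMMA.
template = "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"

# Hypothetical new uppercase letter whose lowercase partner is "YYYY"
# (field 13 is the simple lowercase mapping).
draft = entry_from_template(
    template, "XXXX", "EXAMPLE CAPITAL LETTER GA", {13: "YYYY"}
)
print(draft)
```

A draft produced this way is only a starting point; every copied field still needs to be checked against the actual behavior of the proposed character.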
The fields in UnicodeData.txt, separated by semicolons, comprise the
following categories (given below with the values from the example above):
Code point: 0393
Name: GREEK CAPITAL LETTER GAMMA
General Category: Lu (for Letter uppercase)
Canonical Combining Class: 0 (this category provides information as to
where a given character is placed in relation to another character, for
example where a diacritic is placed; gamma is a spacing character, and as
such it receives combining class 0)
Bidirectional Class: L (for strong left-to-right directionality of the
script)
Decomposition Type/Decomposition Mapping: (left blank as there is no
decomposition into other characters)
Numeric Type and Numeric Value: (the three numeric fields are left blank as
this is not a number; if it were, its decimal, digit, and numeric values
would be included here, as applicable)
Bidi Mirrored: N (this character is not mirrored)
Unicode 1 Name: (left blank as there was no Unicode 1.0 name)
ISO Comment: (left blank as there is no ISO comment)
Simple Uppercase Mapping: (left blank since this is already uppercase)
Simple Lowercase Mapping: 03B3 (the code point for GREEK SMALL LETTER
GAMMA, the lowercase form to which this character maps)
Simple Titlecase Mapping: (left blank as no Unicode titlecase character
for uppercase gamma is encoded)
The above properties in UnicodeData.txt are documented in UAX #44, "Unicode Character Database": https://www.unicode.org/reports/tr44/#Properties
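As a rough sketch, a UnicodeData.txt line can be split into its 15 semicolon-separated fields and checked field by field. The attribute names below (code_point, numeric_decimal, and so on) are informal labels chosen for this example, not official property names; see UAX #44 for the authoritative field descriptions.

```python
from typing import NamedTuple


class UnicodeDataEntry(NamedTuple):
    # Informal labels for the 15 fields of a UnicodeData.txt line.
    code_point: str
    name: str
    general_category: str
    canonical_combining_class: str
    bidi_class: str
    decomposition: str
    numeric_decimal: str
    numeric_digit: str
    numeric_value: str
    bidi_mirrored: str
    unicode_1_name: str
    iso_comment: str
    simple_uppercase: str
    simple_lowercase: str
    simple_titlecase: str


def parse_unicode_data_line(line: str) -> UnicodeDataEntry:
    """Split one line into named fields, rejecting malformed lines."""
    fields = line.strip().split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UnicodeDataEntry(*fields)


entry = parse_unicode_data_line(
    "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"
)
for label, value in entry._asdict().items():
    print(f"{label}: {value or '(blank)'}")
```

Counting fields in this way is a cheap check that a hand-edited line in a proposal has not gained or lost a semicolon.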
For a useful Excel spreadsheet that shows the Unicode character properties with informational notes, see SIL's "Unicode Character Properties
Excel Workbook" at
http://scripts.sil.org/ExcelUnicodeData.
Line breaking behavior affects how lines of text fit into a graphic window.
As a window is re-sized to be narrower, the words are made to wrap automatically
to the next line. Specific line breaking properties affect how characters behave
at the ends and beginnings of lines, as the line ends change. For example, in
the expression "$ .01" the dollar sign should stay with the following number
when it occurs at the end of a line, even though a space intervenes; $ on one
line and .01 on the next wouldn't typically be allowed. Closing punctuation
marks such as ")" would typically not be allowed as the first character on a
line. Defining "line breaking" for characters used in
historic scripts may seem anachronistic, but you will need to consider
how a modern edition may lay out an ancient text on a page or in a text window
on a computer.
As with the character properties, information on line-breaking should be
included in a separate section of a proposal for new characters.
Determining line breaking can be tricky, but many characters simply behave "just
like" some other characters. One way to determine the line breaking property is
to determine if there is a character already in the standard that behaves
similarly, or identically, to the given character in terms of line breaking, and
to use the line breaking properties of the already encoded character, as given
in UAX #14, Unicode Line Breaking Algorithm.
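Looking up the line breaking class of an already encoded character can be sketched as below. Python's standard library does not expose this property, so the example parses a small hand-copied excerpt in the style of the UCD file LineBreak.txt (code point or range, semicolon, class, optional "#" comment). The three excerpt lines are an assumption made for illustration and should be verified against the published file; the function names are hypothetical.

```python
# Hand-copied excerpt for illustration only; verify against LineBreak.txt.
EXCERPT = """
0024;PR # DOLLAR SIGN
0029;CP # RIGHT PARENTHESIS
0030..0039;NU # DIGIT ZERO..DIGIT NINE
"""


def parse_line_break_data(text):
    """Return a list of (first, last, line_break_class) tuples."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code_range, value = (part.strip() for part in line.split(";"))
        first, _, last = code_range.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), value))
    return ranges


def line_break_class(ch, ranges):
    """Return the class for ch, or None if the excerpt does not cover it."""
    cp = ord(ch)
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return None


ranges = parse_line_break_data(EXCERPT)
for ch in "$)7":
    print(f"U+{ord(ch):04X} -> {line_break_class(ch, ranges)}")
```

In a proposal, the class found for the similar existing character would then be listed for the new character, together with a note on any differences in behavior.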
Another way to determine line breaking is to describe the line breaking
properties of the characters, based on responses to the following questions:
- Can it appear at the end of a line? Beginning of a line?
- Does it have special or unusual behavior near the ends of lines? If so,
describe the special behavior.
- Does it come between letters and prevent a break between them at the end of
a line? Or can the surrounding characters be broken across lines even when
this character comes before or after them?
- Is the character a math or technical operator? A "technical operator" would
be a character that acts like a math operator but in non-math contexts, for
example in a programming language, a grammar, or other semantic notation.
- If it is a math operator, is it binary or unary, or other?
- Does it have the "math" property, or not?
- Does it stretch or change in appearance depending on context (e.g., like
summation or integrals)?
- Does the character belong to any of the special
categories, such as hyphen, dash, or diacritic? These categories are special
because they are used for determining other kinds of character behavior. A
verbal description of how a given character behaves is advised for such special
categories.
Characters are often ordered in relation to other characters. For symbols,
the default order in which they happen to appear in the standard often doesn't
matter very much. However, for characters that are part of an alphabet or
syllabary, the default order is often quite important. If you are proposing a
whole script, the binary order (the order in which the characters are
listed in the standard) of the proposal is often taken as the first
approximation of an expected ordering. If there are reasons why the binary order
differs from the expected "native" ordering, these should be justified and
spelled out in a separate section of the character proposal. Otherwise, the
characters in the proposal should simply be laid out in a logical, expected
ordering. A simple listing of the characters in the expected order is
recommended, such as the following for Kaithi consonants: ka, kha, ga, gha, etc.
If two or more orders occur with some frequency (for example, there might be
differences in how characters are ordered depending on the language being sorted),
it is helpful to discuss such differences, and include both orders in the text
of the proposal.
If you are proposing additional characters in a script that is already
encoded, show where the characters should be sorted in relation to the
other characters already encoded. For example, if new syllables in the Vai or Yi
scripts are to be added, their binary order (where they are encoded) may be
very different from where they should occur in the expected native syllabic order.
The proposal needs to specify exactly where the characters should be
interpolated in sorting.
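The difference between binary order and an expected native order can be illustrated with a short sketch. The four-letter alphabet below is invented for the example: it assumes that a letter encoded later (here U+00E4) is expected to sort directly after "a". The helper name native_sort_key is hypothetical, and real collation would normally use the Unicode Collation Algorithm rather than a simple rank table.

```python
def native_sort_key(expected_order):
    """Build a sort key from a list of characters in expected order."""
    rank = {ch: i for i, ch in enumerate(expected_order)}

    def key(word):
        # Compare words character by character using the native ranks.
        return [rank[ch] for ch in word]

    return key


# Invented alphabet: U+00E4 is encoded after "c" but sorts after "a".
expected_order = ["a", "\u00e4", "b", "c"]
words = ["ba", "\u00e4b", "ab", "c"]

binary_order = sorted(words)  # plain code point order
native_order = sorted(words, key=native_sort_key(expected_order))

print("binary:", binary_order)
print("native:", native_order)
```

A listing of the expected order in the proposal plays the role of expected_order here: it tells implementers where the new characters are to be interpolated.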
For historic scripts, particularly those that are still not fully understood,
it may be difficult to specify the ordering. In this case, provide your best
guess, but it is advisable to rely, if possible, on the order given in available
standard handbooks or dictionaries.
Besides the primary and secondary order issues for the letters and digits,
the proposal author also needs to provide some information about how "special"
characters behave—whether they are simply ignored for collation, or have some
special order. Special characters might include symbols, punctuation,
and so on. That is often the hardest part of coming up with collation table
assignments. It may help to think in terms of whether such symbols might behave
like other characters already encoded.
For a technical overview of sorting behavior see the introductory portions of
UTS #10, The Unicode Collation Algorithm, especially sections 1.0, 1.1,
1.8, and 1.9.
In a section of the proposal, include a comment on the potential use of a
character in identifiers. Identifiers are strings of letters, numbers, or
symbols used as names, for example domain names (such as "paypal.com") or
variables in programming languages.
The questions below can assist you in providing the necessary information for
the Unicode Technical Committee.
- Can the character be used in identifiers? Normally only modern-use letters,
marks, and numbers are permitted in identifiers (used, for example, in
programming languages, user names, international domain names, etc.).
- Is it a character in customary modern use, e.g. commonly used in
newspapers, magazines, and so on in one or more living languages?
- If the character is not a letter, mark, or number, but is deemed necessary
to be in identifiers, provide justifications.
- If allowable in identifiers, can the character start an identifier, or would
it only be used as a non-first character? (Most characters that are allowed in
identifiers can be the first character.) Any special handling or considerations
should be spelled out. For example, a part number like "X2b-31c" and a model
number like "325i" are identifiers.
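One way to see how an existing, similar character behaves in identifiers is to test it against a programming language's identifier rules. The sketch below uses Python's own rules through str.isidentifier(), which is only one identifier profile among many and is an assumption of this example; domain names and part numbers follow different rules. The function name identifier_profile is hypothetical.

```python
import unicodedata


def identifier_profile(ch):
    """Report how Python's identifier syntax treats a single character."""
    return {
        "category": unicodedata.category(ch),
        # Can the character stand alone as the first character?
        "can_start": ch.isidentifier(),
        # Can it follow an ordinary letter?
        "can_continue": ("a" + ch).isidentifier(),
    }


for ch in ["a", "2", "-", "\u0301"]:
    print(f"U+{ord(ch):04X}", identifier_profile(ch))
```

Under these rules a digit or a combining mark can continue an identifier but not start one, while a hyphen is excluded altogether, which is why "X2b-31c" would need special consideration in such a profile.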
Bidirectional text is text, such as mixed Hebrew or Arabic and English, in
which parts run left to right and other parts run right to left. In the context
of bidirectional text, how do the characters behave?
Characters need to have their directionality specified as either L ("Left"), R
("Right"), AL ("Right to Left Arabic"), or neutral; and some discussion of this
may be required in the proposal, for each such character.
Note that the directionality of "R" applies to strong directional characters
for most Right-to-Left scripts, such as the Hebrew alphabet and related
punctuation. The directionality "AL" is a special strong Right-to-Left
direction, used only for Arabic, Thaana, and Syriac alphabets and most
punctuation specific to those scripts.
Also, special symbols need to be compared to the behavior of other special
symbols in bidi, and the directional class of numbers needs to be specified.
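The bidirectional classes of already encoded characters can be inspected with Python's standard unicodedata module, which may help when looking for a comparable character. The sample characters below were chosen for this illustration; the values reported depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata

samples = [
    "A",       # Latin letter
    "\u05D0",  # HEBREW LETTER ALEF
    "\u0627",  # ARABIC LETTER ALEF
    "\u0780",  # THAANA LETTER HAA
    "1",       # European digit
    "\u0661",  # ARABIC-INDIC DIGIT ONE
    "(",       # neutral punctuation
    " ",       # white space
]

bidi_classes = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch, bidi_class in bidi_classes.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {bidi_class}")
```

Note that digits and symbols receive classes of their own (such as EN, AN, ON, and WS) rather than one of the strong classes L, R, or AL.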
Shaping behavior refers to changes in a character's shape based
on context, such as whether it appears at the beginning, middle, or end of a
word. In Arabic, almost all letters have special requirements for how they
appear depending on positional context, and are divided into various shaping
classes. If you are working on Arabic or scripts with similarly complex shaping
behavior, see UAX #9, The Unicode Bidirectional Algorithm, as well as Chapter
8, Middle Eastern Scripts in the Unicode Standard.
In scripts such as Arabic, the shaping classes and behavior need to be
explicitly determined for each such letter:
- Is there an Arabic letter with similar or identical shaping behavior?
- Does it belong to an existing shaping class?
- Would the character normally be mirrored if used in right-to-left text?
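For the mirroring question, the mirrored property of a comparable existing character can be checked with Python's unicodedata module, as in the minimal sketch below. The helper name is_mirrored is hypothetical. Shaping classes are not exposed by the standard library; in the Unicode Character Database they are recorded in ArabicShaping.txt.

```python
import unicodedata


def is_mirrored(ch):
    """True if the character has the Bidi_Mirrored property."""
    return bool(unicodedata.mirrored(ch))


for ch in ["(", "[", "<", "A", "\u0628"]:  # U+0628 is ARABIC LETTER BEH
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {is_mirrored(ch)}")
```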
The addition of CJK ideographs is usually handled by the Ideographic
Rapporteur Group (IRG), but in rare cases, a proposal for CJK characters may be
presented to the UTC. If the character is a CJK ideograph, most of its
properties should be identical to those of all other CJK ideographs, so a whole
set of questions is already answered. However, some questions remain:
- Does it have special numerical significance?
- Is it some kind of variant of an existing CJK character?
CJK characters also need a lot of associated data, as specified in the
Unihan documentation. See UAX #38, Unicode Han Database, for details.
Answering the questions below will provide basic information to allow the
Unicode Technical Committee members to determine a character's properties.
Provide a description of each character's use, with examples if possible.
A. Some scripts have case, and if so, it will be necessary to know:
- Is it uppercase, lowercase, or uncased? If uppercase or lowercase, what are
the case mappings? (These mappings refer to a property that
identifies the other element of a case pair, for example the uppercase
mapping of "m" is "M".) Uppercase and titlecase characters must have
lowercase mappings.
- Is it a titlecase digraph, e.g., the Unicode character U+01F2 LATIN CAPITAL
LETTER D WITH SMALL LETTER Z (which looks like "Dz")?
- Does it have complex or non-standard case mapping behavior? (e.g., Turkish
dotless i)
B. Is the character an ordinary letter of an alphabet or syllabary (non-CJK
ideograph)? Or is it a stand-alone symbol? (For CJK ideographs, see the special
section above.)
C. Is the character a white-space character, or does it cause visible
separation between other characters?
D. Does the character have a numeric value? If so, is it a decimal digit, or
is it a "digit" of some other non-decimal numbering system?
- If the character is a true decimal digit (i.e., it forms decimal radix
numbers like European numbers), then the General_Category value is Nd and
all three numeric fields should have a numeric value filled in (for example,
for CHAKMA DIGIT NINE, the General_Category is Nd and 9 is inserted in the
three numeric fields: 1113F;CHAKMA DIGIT NINE;Nd;0;L;;9;9;9;N;;;;;)
- If the character is any other kind of number, even if it has a numeric
value from 1 through 9, then the General_Category value is No (or Nl), and
only the third numeric field should be filled in (for example, AEGEAN NUMBER
NINE is not a decimal radix number, so it is No and 9 appears only in the
third numeric field: 1010F;AEGEAN NUMBER NINE;No;0;L;;;;9;N;;;;;).
E. Is it a "base letter" or does it combine with letters or symbols?
F. If it is a combining character:
- How does it combine? Above? Below? After? Are there particularly strong
restrictions on how it is displayed, such as being centered or to the
left/right of a base character?
- Does it bind very tightly to letters, such as some vowel signs do?
- Is it completely non-spacing, or does it combine but also have spacing
characteristics?
G. If this is a punctuation character:
- Is it terminal punctuation (i.e., ending a clause or a sentence)?
- Is it paired with anything else? For example, "(" is paired with ")",
"[" is paired with "]".
- Does it separate words? If so, does it occur exclusively before or after
words?
- Does it occur within words?
- Does it occur within (as opposed to at the end of) sentences?
- Can it appear at the end of a line? Beginning of a line?
- Does it come between letters and prevent a break between them at the end of
a line?
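When answering questions A, D, and F above, it can help to look at how comparable encoded characters are already classified. The following sketch uses Python's unicodedata module to report the general category, combining class, the three numeric fields, and case mappings. The function name describe is hypothetical, the results depend on the Unicode version bundled with the interpreter, and str.upper() and str.lower() apply full case mappings, which may differ from the simple mappings in UnicodeData.txt.

```python
import unicodedata


def describe(ch):
    """Collect a few properties of an encoded character."""
    return {
        "name": unicodedata.name(ch),
        "general_category": unicodedata.category(ch),
        "combining_class": unicodedata.combining(ch),
        # Decimal, digit, and numeric values; None where a field is blank.
        "numeric_fields": (
            unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None),
        ),
        "upper": ch.upper(),
        "lower": ch.lower(),
    }


samples = [
    "\u0393",      # GREEK CAPITAL LETTER GAMMA
    "\u01F2",      # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
    "\u0301",      # COMBINING ACUTE ACCENT
    "\U0001113F",  # CHAKMA DIGIT NINE
    "\U0001010F",  # AEGEAN NUMBER NINE
]

for ch in samples:
    info = describe(ch)
    print(f"U+{ord(ch):04X}", info["name"], info["general_category"],
          info["combining_class"], info["numeric_fields"])
```

The two number examples show the distinction drawn in question D: the decimal digit has all three numeric fields filled in, while the non-decimal number has only the third.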