L2/07-115
Unicode Properties in Character Proposals (draft)
This is a draft of a document concerning character properties which was requested by UTC in action item 110-A99:
110-A99 Make up a set of questions for determining character properties, particularly punctuation. (Cf. 110-A098)
Introduction
Characters in The Unicode Standard have a number of properties, some of which are obvious and easily discovered, and some of which are not. Some properties are assigned automatically (such as Derived Age); others are easy to assign because they are implicit in the character name or in other information that a proposal author can readily supply. For general information on character properties, see The Unicode Standard, Chapter 4 (PDF).
For reference, a more-or-less complete list of properties can be found online here:
http://unicode.org/Public/UNIDATA/UCD.html#Properties
The questions and discussion below have been developed to get proposal authors and committee members thinking about, and providing in proposals, the property information that will be needed at the time new characters are published in the standard. For each character in a proposal, the proposal author should think about the character in context, and answer questions about how the character interacts with other characters.
Basic Information
The most basic information required about characters includes Name, Codepoint, and other identity information, such as whether a character goes by more than one name, or can be cross-referenced to another character.
The codepoints are typically assigned by the committees (WG2 and UTC), but if the proposal is for an entire script, it is probably already on the roadmap, and therefore a particular range of codepoints may already have been pre-selected. In other cases, those proposing characters can make recommendations about where the characters should be encoded, but it isn't necessary to do so.
If there are alternative names for a character or characters in the proposal, these should also be discussed, as well as other information about the meanings of names, and similarities in behavior to other characters that are already encoded in the standard.
General Category and Other Properties
Each character is assigned a "General Category". These are documented in the Unicode Character Database (UCD). Typical categories include things such as "letter", "combining mark", "symbol" and so forth. This category must be specified, or suggested, for each character in a proposal.
The file UnicodeData.txt contains a number of properties in a specified layout. It is most helpful if a proposal includes, for each proposed character, a line emulating an entry in that file. For example, the following line is the UnicodeData.txt entry for Greek uppercase Gamma:
0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;
The properties in UnicodeData.txt are documented here: http://www.unicode.org/Public/UNIDATA/UCD.html
The discussion below relates to these properties as well as other extended properties that are documented in other files.
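As an aid to checking a drafted entry, the following is a minimal sketch that splits a UnicodeData.txt-style line into its fifteen semicolon-separated fields. The Python key names and the helper parse_unicode_data_line are illustrative assumptions, not official identifiers; the field order follows the UnicodeData.txt layout.

```python
# Field order follows the UnicodeData.txt layout; the key names below are
# illustrative labels chosen for this sketch, not official identifiers.
FIELDS = (
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
)


def parse_unicode_data_line(line: str) -> dict:
    """Split one UnicodeData.txt-style line into a field-name -> value dict."""
    parts = line.strip().split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} fields, found {len(parts)}: {line!r}"
        )
    return dict(zip(FIELDS, parts))


entry = parse_unicode_data_line(
    "0393;GREEK CAPITAL LETTER GAMMA;Lu;0;L;;;;;N;;;;03B3;"
)
for field, value in entry.items():
    # Empty fields are shown explicitly so that omissions are easy to spot.
    print(f"{field:28} {value or '(empty)'}")
```

A proposal author could run each drafted line through such a check to catch a missing or extra semicolon before submission.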
Some scripts have case (A/a). For characters in such a script, it will be necessary to know:
- Is it uppercase, lowercase, or uncased? If it is uppercase or lowercase, what are the case mappings? Note that uppercase and titlecase characters must have lowercase mappings.
- Is it a titlecase digraph, e.g. "Dz"?
- What is its case-pair mapping, if any?
- Does it have complex or non-standard case mapping behavior? (e.g., Turkish dotless i)
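When answering the case questions, it can help to look at how comparable characters that are already encoded behave. The following is a minimal sketch using Python's standard unicodedata module and built-in string methods; the helper name case_summary is an assumption made for illustration, and the results reflect whichever Unicode version the Python installation ships with.

```python
import unicodedata


def case_summary(ch: str) -> dict:
    """Report the General Category and default case mappings of one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # Lu, Ll, Lt, Lo, ...
        "upper": ch.upper(),
        "lower": ch.lower(),
    }


gamma = case_summary("\u0393")      # GREEK CAPITAL LETTER GAMMA
dz = case_summary("\u01F2")         # titlecase digraph "Dz"
dotless_i = case_summary("\u0131")  # LATIN SMALL LETTER DOTLESS I

for summary in (gamma, dz, dotless_i):
    print(summary)

# Python's mappings are locale-independent, so the Turkish pairing of
# dotless i with capital I is only visible in one direction here.
print("I".lower() == "i")
```

Because these default mappings ignore language, a character with behavior like that of Turkish dotless i still needs its special handling described in prose in the proposal.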
Can the character be used in identifiers, such as programming language variables, user names, or international domain names? Normally only modern-use letters, marks, and numbers are permitted in identifiers.
- Is it a character in customary modern use, e.g. commonly used in newspapers, magazines, and so on in one or more living languages?
- If the character is not a letter, mark, or number, does it need to be in identifiers? If so, provide justifications.
If allowable in identifiers, can it start an identifier, or would it only be used as a non-first character? (Most characters that are allowed in identifiers can be the first character.) Any special handling or considerations should be spelled out.
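One way to get a feel for the difference between starting and continuing an identifier is to test comparable existing characters against a programming language's identifier rules. The sketch below uses Python's own str.isidentifier(), which is only one profile of identifier syntax and is an assumption made for illustration; the normative Unicode data is the ID_Start and ID_Continue properties. The helper name identifier_status is hypothetical.

```python
def identifier_status(ch: str) -> str:
    """Classify a character by Python's identifier rules (one possible profile)."""
    if ch.isidentifier():
        return "start"         # may begin an identifier
    if ("a" + ch).isidentifier():
        return "continue"      # allowed, but only after the first character
    return "not allowed"


samples = {
    "GREEK CAPITAL LETTER GAMMA": "\u0393",
    "COMBINING ACUTE ACCENT": "\u0301",
    "DIGIT THREE": "3",
    "EXCLAMATION MARK": "!",
}
for name, ch in samples.items():
    print(f"{name:28} {identifier_status(ch)}")
```

Letters typically land in the first group, while marks and digits land in the second, which matches the observation above that most, but not all, identifier characters can come first.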
Is the character an ordinary letter of an alphabet or syllabary (that is, not a CJK ideograph)? Or is it a stand-alone symbol? (For CJK ideographs, see the special section below.)
Is the character a white-space character, or does it cause visible separation between other characters?
Does the character have a numeric value?
- Is it a decimal digit?
- Is it a "digit" of some non-decimal numbering system?
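The distinction between decimal digits, other digits, and characters that merely have a numeric value can be seen in characters that are already encoded. The following is a minimal sketch using Python's standard unicodedata module; the helper name numeric_profile is an assumption made for illustration.

```python
import unicodedata


def numeric_profile(ch: str) -> dict:
    """Return the numeric properties of a character, or None where absent."""
    return {
        "category": unicodedata.category(ch),
        "decimal": unicodedata.decimal(ch, None),  # decimal digits only
        "digit": unicodedata.digit(ch, None),      # e.g. superscript digits
        "numeric": unicodedata.numeric(ch, None),  # any numeric value
    }


arabic_three = numeric_profile("\u0663")   # ARABIC-INDIC DIGIT THREE
superscript_two = numeric_profile("\u00B2")  # SUPERSCRIPT TWO
one_half = numeric_profile("\u00BD")       # VULGAR FRACTION ONE HALF
roman_eight = numeric_profile("\u2167")    # ROMAN NUMERAL EIGHT

for profile in (arabic_three, superscript_two, one_half, roman_eight):
    print(profile)
```

A proposed character that behaves like one of these samples would normally receive the same pattern of filled and empty numeric fields.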
Is it a "base letter" or does it combine with letters or symbols?
If it is a combining character:
- How does it combine? Above? Below? After?
- Does it bind very tightly to letters, such as some vowel signs do?
- Is it completely non-spacing, or does it combine but also have spacing characteristics?
- Does it sit above or below, and are there particularly strong restrictions on how it is displayed, such as being centered or to the left/right of a base character?
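The answers to the combining questions mostly surface as two values: the General Category (Mn for non-spacing marks, Mc for spacing combining marks) and the canonical combining class. As a minimal sketch, the values for some existing marks can be inspected with Python's standard unicodedata module; the helper name mark_profile is an assumption made for illustration.

```python
import unicodedata


def mark_profile(ch: str) -> tuple:
    """Return (General Category, canonical combining class) for a character."""
    return (unicodedata.category(ch), unicodedata.combining(ch))


acute_above = mark_profile("\u0301")  # COMBINING ACUTE ACCENT
dot_below = mark_profile("\u0323")    # COMBINING DOT BELOW
vowel_sign = mark_profile("\u093E")   # DEVANAGARI VOWEL SIGN AA

print("above, non-spacing:", acute_above)
print("below, non-spacing:", dot_below)
print("spacing vowel sign:", vowel_sign)
```

Comparing a proposed mark with an encoded mark that attaches in the same position is usually the quickest way to arrive at a suggested combining class.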
If this is a punctuation character:
- Is it terminal punctuation? (That is, does it end a clause or sentence?)
- Is it paired with anything else, e.g. () []?
- Does it separate words, clauses, sentences, or other units of writing?
- Does it occur within words? If so, which characters does it behave like in http://www.unicode.org/reports/tr29/#Default_Word_Boundaries?
- Does it occur within (as opposed to at the end of) sentences? If not, which characters does it behave like in http://www.unicode.org/reports/tr29/#Default_Sentence_Boundaries?
- Does it come only after words, before words, within words?
- Can it appear at the end of a line? Beginning of a line?
- Does it come between letters and cause them to not be breakable at the end of a line?
Line breaking behavior can be tricky, but many characters simply behave "just like" some other characters. Is there a character already in the standard that behaves similarly, or identically, to this character in terms of line breaking?
To determine proper line-breaking behavior, one can think of a line of text in a graphic window. As a window is re-sized to be narrower, and words are made to automatically wrap to the next line, how does this character behave?
- What characters does it behave like in http://www.unicode.org/reports/tr14/#DescriptionOfProperties?
- Can it appear at the end of a line? Beginning of a line?
- Does it come between letters and cause them to not be breakable at the end of a line? Or can surrounding characters be broken across the line even when this character is before/after?
- Does it have special or unusual behavior near the ends of lines?
- If applicable, describe the special behavior.
Can the character be normalized to (or mapped to) another character, or some combination of other characters, either already-encoded, or not-yet encoded in the standard?
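To illustrate what it means for a character to normalize or map to others, the following minimal sketch applies the four normalization forms to a few existing characters using Python's standard unicodedata module. The helper name normalization_forms is an assumption made for illustration.

```python
import unicodedata


def normalization_forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


e_acute = normalization_forms("\u00E9")   # LATIN SMALL LETTER E WITH ACUTE
ohm_sign = normalization_forms("\u2126")  # OHM SIGN
fi_ligature = normalization_forms("\uFB01")  # LATIN SMALL LIGATURE FI

# Canonical decomposition: base letter plus combining mark.
print([f"U+{ord(c):04X}" for c in e_acute["NFD"]])
# Singleton canonical mapping to GREEK CAPITAL LETTER OMEGA.
print([f"U+{ord(c):04X}" for c in ohm_sign["NFC"]])
# Compatibility mapping only: unchanged by NFC, expanded by NFKC.
print(fi_ligature["NFC"] == "\uFB01", fi_ligature["NFKC"])
# The raw decomposition field as it appears in UnicodeData.txt.
print(unicodedata.decomposition("\uFB01"))
```

The three samples correspond to the three common cases a proposal may need to address: a canonical decomposition, a canonical equivalence to a single existing character, and a compatibility mapping.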
Is the character a math or technical operator?
- If so, is it binary or unary, or other?
- Does it have the "math" property? Or not?
- Does it stretch or change in appearance depending on context (e.g., like summation or integrals)?
In the context of bidirectional text, how does the character behave? For right-to-left characters, the main issue is whether the directionality is "R" or "AL". Symbols need to have their directionality specified as L, R, AL, or neutral; and some discussion of this may be required in the proposal, for each such symbol.
Also, special symbols need to be compared to the behavior of other special symbols in bidi, and the directional class of numbers needs to be specified.
What about shaping behavior? In scripts such as Arabic, the shaping classes and behavior need to be explicitly determined for each such letter.
- Is there an Arabic or Hebrew letter with similar or identical shaping behavior?
- Does it belong to an existing shaping class?
- Would the character normally be mirrored if used in right-to-left text?
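For the bidirectional and mirroring questions above, the values assigned to comparable existing characters are a useful reference point. The following is a minimal sketch using Python's standard unicodedata module, which exposes the bidirectional class and the mirrored flag but not the Arabic shaping classes; the helper name bidi_profile is an assumption made for illustration.

```python
import unicodedata


def bidi_profile(ch: str) -> tuple:
    """Return (bidirectional class, mirrored flag) for a character."""
    return (unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch)))


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "HEBREW LETTER ALEF": "\u05D0",
    "ARABIC LETTER ALEF": "\u0627",
    "DIGIT ONE": "1",
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "LEFT PARENTHESIS": "(",
}
profiles = {name: bidi_profile(ch) for name, ch in samples.items()}
for name, profile in profiles.items():
    print(f"{name:26} {profile}")
```

The two digits show why the directional class of numbers has to be stated explicitly: European and Arabic-Indic digits carry different classes even though both are decimal digits.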
Should the character belong to any of the special categories, such as hyphen, dash, diacritic?
Special Considerations for CJK Additions
Addition of CJK ideographs is usually handled by the Ideographic Rapporteur Group (IRG), but in rare cases, a proposal for CJK characters may be presented to UTC. If the character is a CJK ideograph, it should be assigned most of the same properties as all other CJK ideographs, so a whole set of questions is already answered. However, there are some other questions:
- Does it have special numerical significance?
- Is it some kind of variant of an existing CJK character?
CJK characters will also need a considerable amount of associated data, as specified in the Unihan documentation. See: http://www.unicode.org/reports/tr38/ and http://www.unicode.org/Public/UNIDATA/Unihan.html for details.
Collation and Ordering Issues
Characters are often ordered in relation to other characters. For symbols, the default ordering often doesn't matter very much. However, for characters that are part of an alphabet or syllabary, the default order is often quite important. If you are proposing a whole script, the binary order of the proposal is often taken as the first approximation of an expected ordering. If there are reasons why the binary order differs from the expected "native" ordering, these should be justified and spelled out. Otherwise, the characters in the proposal should simply be laid out in a logical, expected ordering. If two or more orders occur with some frequency, it is helpful to discuss their differences, and include both orders in the text of the proposal.
If you are proposing additional characters in a script that is already encoded, it is necessary to show where the characters should be sorted in relation to the other characters already encoded. For example, if new syllables in the Vai or Yi scripts are to be added, their binary order (where they are encoded) may be very different from where they should occur in the syllabic order. The proposal needs to specify exactly where the characters should be interpolated in sorting.
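The gap between binary order and expected order can be shown with a small example. The sketch below is an illustration under stated assumptions, not the Unicode Collation Algorithm: it assumes a hypothetical alphabet in which U+025B LATIN SMALL LETTER OPEN E sorts immediately after "e", and the helper name make_sort_key is invented for this example.

```python
def make_sort_key(alphabet: str):
    """Build a sort key from an explicit alphabet order (hypothetical tailoring)."""
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word: str) -> list:
        # Characters outside the alphabet sort after it, by code point.
        return [
            (rank[ch], 0) if ch in rank else (len(rank), ord(ch))
            for ch in word
        ]

    return key


# Hypothetical alphabet: open e (U+025B) is interpolated right after "e".
ALPHABET = "abcde\u025Bfghijklmnopqrstuvwxyz"
words = ["fa", "\u025Bba", "eba"]

binary_order = sorted(words)  # plain code point order
expected_order = sorted(words, key=make_sort_key(ALPHABET))

print(binary_order)
print(expected_order)
```

In binary order the word beginning with open e falls after "fa", because U+025B is encoded far from the basic Latin letters; the explicit alphabet places it between "eba" and "fa". This is the kind of interpolation a proposal needs to spell out.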
Besides the primary and secondary order issues for the letters and digits, the proposal author also needs to provide some information about how "special" characters behave: whether they are simply ignored for collation, or have some special order. That is often the hardest part of coming up with collation table assignments. It may help to think in terms of how such symbols might behave like other characters already encoded.
Draft date: 2007-05-03