Technical Reports |
Version | 1 (draft1) |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2003-07-18 |
This Version | http://www.unicode.org/reports/tr31/tr31-1.html |
Previous Version | n/a |
Latest Version | http://www.unicode.org/reports/tr31/ |
This document describes specifications for recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax. It incorporates the Identifier section of Unicode 4.0 (somewhat reorganized) and a new section on the use of Unicode in patterns. As a part of the latter, it presents recommended new properties for addition to the Unicode Character Database.
Feedback is requested both on the text of the new pattern section and on the contents of the proposed properties.
This document has been approved by the Unicode Technical Committee for public review as a Proposed Draft Unicode Technical Report. Making this document available for public review does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
A common task facing an implementer of the Unicode Standard is the provision of a parsing and/or lexing engine for identifiers. To assist in the standard treatment of identifiers in Unicode character-based parsers, a set of specifications is provided here as a recommended default for the definition of identifier syntax. These guidelines are no more complex than current rules in the common programming languages, except that they include more characters of different types.
In addition, this document provides a proposed definition of a set of properties for use in defining stable pattern syntax: syntax that is stable over future versions of the Unicode Standard.
Note to reviewers: Section 2 would eventually supersede Section 5.15 Identifiers from The Unicode Standard 4.0.
The formal syntax provided here is intended to capture the general intent that an identifier consists of a string of characters that begins with a letter or an ideograph, and then includes any number of letters, ideographs, digits, or underscores. Each programming language standard has its own identifier syntax; different programming languages have different conventions for the use of certain characters from the ASCII range ($, @, #, _) in identifiers. To extend such a syntax to cover the full behavior of a Unicode implementation, implementers need only combine these specific rules with the sample syntax provided here.
<identifier> := <identifier_start> (<identifier_start> | <identifier_extend>)*
Identifiers are defined by the following sets of character categories from the Unicode Character Database.
Syntactic Class | Properties | Coverage |
---|---|---|
<identifier_start> | General Category = L or Nl, or Other_ID_Start = true | Uppercase letter, lowercase letter, titlecase letter, modifier letter, other letter, letter number, stability extensions |
<identifier_extend> | General Category = Mn, Mc, Nd, Pc, or Cf | Nonspacing mark, spacing combining mark, decimal number, connector punctuation, formatting code |
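As an illustration only, the grammar and table above can be sketched in Python using the standard `unicodedata` module. The function names are hypothetical, and the classification reflects whichever version of the Unicode Character Database ships with the running Python, not necessarily Unicode 4.0. Because `unicodedata` does not expose Other_ID_Start, the sketch takes that set as a caller-supplied parameter (an assumption of this example).

```python
import unicodedata

# General Category values that make up <identifier_extend> in the table above.
EXTEND_CATEGORIES = frozenset({"Mn", "Mc", "Nd", "Pc", "Cf"})


def is_identifier_start(ch, other_id_start=frozenset()):
    """True if ch is in <identifier_start>: gc = L*, gc = Nl, or Other_ID_Start."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nl" or ch in other_id_start


def is_identifier_extend(ch):
    """True if ch is in <identifier_extend>: gc = Mn, Mc, Nd, Pc, or Cf."""
    return unicodedata.category(ch) in EXTEND_CATEGORIES


def is_identifier(text, other_id_start=frozenset()):
    """Match <identifier_start> (<identifier_start> | <identifier_extend>)*."""
    if not text or not is_identifier_start(text[0], other_id_start):
        return False
    return all(
        is_identifier_start(ch, other_id_start) or is_identifier_extend(ch)
        for ch in text[1:]
    )
```

Under this sketch an underscore may continue an identifier but not start one, because U+005F has General Category Pc; a language that wants C-style leading underscores would tailor the start set accordingly.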
The innovations in the identifier syntax to cover the Unicode Standard include the following:
Combining marks must be accounted for in identifier syntax. A composed character sequence consisting of a base character followed by any number of combining marks must be valid for an identifier. This requirement results from the requirement for combining marks in the representation of many languages, and the conformance rules in Chapter 3 regarding interpretation of canonical-equivalent character sequences.
Enclosing combining marks (for example, U+20DD..U+20E0) are excluded from the syntactic definition of <identifier_extend>, because the composite characters that result from their composition with letters (for example, U+24B6 CIRCLED LATIN CAPITAL LETTER A) are themselves not normally considered valid constituents of these identifiers.
The Unicode characters that are used to control joining behavior, bidirectional ordering control, and alternative formats for display are explicitly defined as not affecting breaking behavior. Unlike space characters or other delimiters, they do not serve to indicate word, line, or other unit boundaries. Accordingly, they should normally be ignored for the purposes of identifier definition. Implementations that cannot ignore characters in identifiers should exclude these characters.
Specific identifier syntaxes can be treated as tailorings of the generic syntax based on character properties. For example, SQL identifiers allow an underscore as an identifier part (but not as an identifier start); C identifiers allow an underscore as either an identifier part or an identifier start. Specific languages may also want to exclude the characters that have a decomposition_type other than canonical or none, or to exclude some subset of those, such as those with a decomposition_type equal to font.
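A tailoring that excludes characters by decomposition type might look like the following sketch. It assumes Python's `unicodedata.decomposition`, which returns the raw decomposition mapping with any compatibility tag in angle brackets; the helper names and the default excluded set are hypothetical.

```python
import unicodedata


def decomposition_type(ch):
    """Return 'none', 'canonical', or the compatibility tag such as 'font'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        # Compatibility mappings look like '<font> 0041'.
        return mapping[1:mapping.index(">")]
    return "canonical"


def passes_decomposition_filter(text, excluded_types=frozenset({"font"})):
    """Reject text containing any character whose decomposition type is excluded."""
    return all(decomposition_type(ch) not in excluded_types for ch in text)
```

A language that wants to exclude every compatibility character could instead reject any type other than "none" or "canonical".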
For programming language identifiers, normalization has a number of important implications. For a discussion of these issues, see Annex 7: Programming Language Identifiers in UAX #15, Unicode Normalization Forms [UAX15].
Note to reviewers: Would it be better to move that section into this UTR? Comments?
Unicode General Category values are kept as stable as possible, but they
can change across versions of the Unicode Standard. The Other_ID_Start
property contains a small list of characters that qualified as <identifier_start>
characters in some previous version of Unicode solely on the basis of their
General Category properties, but that no longer qualify in the current
version. In Unicode 4.0, this list consists of four characters:
The Other_ID_Start property is thus designed to ensure that the Unicode identifier specification is backward compatible: Any sequence of characters that qualified as an identifier in some version of Unicode will continue to qualify as an identifier in future versions.
The downside of working with the syntactic classes defined above is the storage space needed for the detailed definitions, plus the fact that each new version of the Unicode Standard adds new characters that an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.
One method to address this problem is to turn the question around. Instead of defining the set of code points that are allowed, define a small, fixed set of code points that are reserved for syntactic use and allow everything else (including unassigned code points) as part of an identifier. All parsers written to this specification would behave the same way for all versions of the Unicode Standard, because the classification of code points is fixed forever.
The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML 1.1 specification by the W3C [XML1.1].
By increasing the set of disallowed characters, a reasonably intuitive recommendation for identifiers can be achieved. This approach uses the full specification of identifier classes, as of a particular version of the Unicode Standard, and permanently disallows any characters not recommended in that version for inclusion in identifiers. All code points unassigned as of that version would be allowed in identifiers, so that any future additions to the standard would already be accounted for. This approach ensures both upwardly compatible identifier stability and a reasonable division of characters into those that do and do not make human sense as part of identifiers.
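The compromise described above can be sketched as a single per-character test. This sketch assumes that the Unicode database bundled with the running Python stands in for the frozen baseline version; a real implementation would ship a fixed table so that its behavior never changes. The function name is hypothetical.

```python
import unicodedata

# Categories allowed after the first character in the identifier table.
EXTEND_CATEGORIES = frozenset({"Mn", "Mc", "Nd", "Pc", "Cf"})


def allowed_in_frozen_identifier(ch):
    """Allow identifier characters of the baseline version plus unassigned code points."""
    category = unicodedata.category(ch)
    if category == "Cn":
        # Unassigned in the baseline: permitted, so future additions already work.
        # Note that Cn also covers noncharacters; a real table might carve those out.
        return True
    return category.startswith("L") or category == "Nl" or category in EXTEND_CATEGORIES
```

Every assigned character that fails this test stays disallowed permanently, which is what keeps the classification stable across versions.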
Some additional extensions to the list of disallowed code points can be made to further constrain “unnatural” identifiers. For example, one could include unassigned code points in blocks of characters set aside for future encoding as symbols, such as mathematical operators.
With or without such fine-tuning, such a compromise approach still incurs the expense of implementing large lists of code points. While they no longer change over time, it is a matter of choice whether the benefit of enforcing somewhat word-like identifiers justifies their cost.
Alternatively, one can use the properties described below, and allow all sequences of characters to be identifiers that are neither pattern syntax nor pattern whitespace. This has the advantage of simplicity and small tables, but allows many more “unnatural” identifiers.
There are many circumstances where software interprets patterns that are a mixture of literal characters, whitespace, and syntax characters. Examples include regular expressions, Java collation rules, Excel or ICU number formats, and many others. These patterns have been very limited in the past, and forced to use clumsy combinations of ASCII characters for their syntax. As Unicode becomes ubiquitous, some of these will start to use non-ASCII characters for their syntax: first as more readable optional alternatives, then eventually as the standard syntax.
For forwards and backwards compatibility, it is very advantageous to have a fixed set of whitespace and syntax code points for use in patterns. This follows the recommendations that the Unicode Consortium made regarding completely stable identifiers, and the practice that is seen in XML 1.1 [XML1.1]. (In particular, the consortium committed to not allocating characters suitable for identifiers in the range 2190..2BFF, which is being used by XML 1.1.)
With a fixed set of whitespace and syntax code points, a pattern language can then have a policy requiring all possible syntax characters (even ones currently unused) to be quoted if they are literals. By using this policy, it preserves the freedom to extend the syntax in the future by using those characters. Past patterns on future systems will always work; future patterns on past systems will signal an error instead of silently producing the wrong results.
Example:
In version 1.3 of program X, '≈' is a reserved syntax character; that is, it doesn't perform an operation, but you have to quote it. In version 1.4, '≈' gets a real meaning: for example, it uppercases the subsequent characters. In this example, '\' quotes the next character; that is, it causes the character to be treated as a literal instead of a syntax character.
- The pattern abc...\≈...xyz works on both version 1.3 and 1.4, and refers to the literal character since it is quoted in both cases.
- The pattern abc...≈...xyz works on version 1.4 and uppercases the following characters. On version 1.3, the engine (rightfully) has no idea what to do with ≈. Rather than silently fail (by ignoring ≈ or turning it into a literal), it has the opportunity to signal an error.
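The quoting policy in this example can be sketched as a small scanner. Everything here is hypothetical: the reserved set is a simplified stand-in (the ASCII punctuation ranges plus 2190..2BFF), and the `operators` argument models which reserved characters a given engine version has assigned a meaning to.

```python
class PatternError(ValueError):
    """Raised when a pattern uses a reserved character the engine does not know."""


def is_reserved_syntax(ch):
    """Simplified stand-in for a fixed set of syntax code points."""
    cp = ord(ch)
    return (
        0x21 <= cp <= 0x2F or 0x3A <= cp <= 0x40
        or 0x5B <= cp <= 0x60 or 0x7B <= cp <= 0x7E
        or 0x2190 <= cp <= 0x2BFF
    )


def scan(pattern, operators):
    """Split a pattern into ('lit', ch) and ('op', ch) tokens."""
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # A backslash quotes the next character, whatever it is.
            if i + 1 >= len(pattern):
                raise PatternError("dangling backslash")
            tokens.append(("lit", pattern[i + 1]))
            i += 2
            continue
        if is_reserved_syntax(ch):
            if ch not in operators:
                # Reserved but meaningless in this version: signal an error
                # instead of silently ignoring it or treating it as a literal.
                raise PatternError(f"unquoted reserved character U+{ord(ch):04X}")
            tokens.append(("op", ch))
        else:
            tokens.append(("lit", ch))
        i += 1
    return tokens


def quote_literal(text):
    """Quote every reserved character so the text survives future syntax changes."""
    return "".join("\\" + ch if is_reserved_syntax(ch) else ch for ch in text)
```

Calling `scan` with an empty `operators` set models version 1.3, and calling it with `{"≈"}` models version 1.4: a quoted ≈ scans as a literal in both, while an unquoted ≈ raises `PatternError` in the former and becomes an operator in the latter.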
This document provides a recommended set of code points that can be used for such pattern whitespace and syntax characters. Particular pattern languages may, of course, override these recommendations (for example, adding or removing other characters for compatibility in ASCII). But providing a list of these in UCD properties gives pattern languages a stable, common basis for future expansion.
For stability, the property values will be absolutely invariant; not changing with successive versions of Unicode. Of course, this doesn't limit the ability of the Unicode Standard to add more symbol or whitespace characters, but the syntax and whitespace characters recommended for use in patterns would not change.
When generating rules or patterns, all whitespace and syntax code points that are to be literals would require quoting. For readability, it is recommended practice to quote or escape all whitespace and default ignorable code points as well. That is,
The two pattern properties proposed for the next appropriate version of the UCD are Pattern_White_Space and Pattern_Syntax. The contents are presented here for review; they would be removed once incorporated into the [UCD]. The contents were derived as follows:
The proposed Pattern_White_Space characters were originally derived from White_Space by removing some characters that appeared inappropriate for patterns, and adding LRM and RLM. However, once we settle on their contents, they would be immutable from then on.
The LRM and RLM are added so as to allow easier use of Arabic and Hebrew in Patterns. For example, a rule like:
X / W => Y* / Z ;
becomes almost unreadable when some of the W..Z are right-to-left (RTL) characters (e.g. Arabic or Hebrew) and others are left-to-right (LTR) characters. However, by surrounding the RTL strings by LRM (or the LTR characters by RLM), the rules can be made readable.
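As a rough sketch of this technique, a pattern generator could wrap each right-to-left operand in LRM. The example assumes that operands are separated by single spaces and uses the Bidi_Class values R and AL from Python's `unicodedata` to detect RTL text; the function names are hypothetical.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def is_rtl(token):
    """True if the token contains a strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def wrap_rtl_operands(rule):
    """Surround each space-separated RTL token with LRM."""
    return " ".join(LRM + tok + LRM if is_rtl(tok) else tok for tok in rule.split(" "))
```

Because LRM is a proposed Pattern_White_Space character, a parser that skips pattern whitespace would read the wrapped rule the same way as the original.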
Note to reviewers: A tighter definition would result from removing all compatibility characters, leaving only U+0009..U+000D, U+0020, U+0085, U+200E..U+200F, and U+2028..U+2029. Would it be better to have this narrower definition? Should U+3000 be at least retained?
The proposed Pattern_Syntax code points were derived from the following set, then some script-specific characters were removed, along with some other characters that appeared inappropriate for patterns.
[[:gc=s:] | [:gc=p:] | [\u2190-\u2BFF]]
0009..000D ; Pattern_White_Space # <CHARACTER TABULATION>..<CARRIAGE RETURN (CR)>
0020 ; Pattern_White_Space # SPACE
0085 ; Pattern_White_Space # <NEXT LINE (NEL)>
00A0 ; Pattern_White_Space # NO-BREAK SPACE
2000..200A ; Pattern_White_Space # EN QUAD..HAIR SPACE
200E..200F ; Pattern_White_Space # LEFT-TO-RIGHT MARK..RIGHT-TO-LEFT MARK
2028 ; Pattern_White_Space # LINE SEPARATOR
2029 ; Pattern_White_Space # PARAGRAPH SEPARATOR
202F ; Pattern_White_Space # NARROW NO-BREAK SPACE
205F ; Pattern_White_Space # MEDIUM MATHEMATICAL SPACE
3000 ; Pattern_White_Space # IDEOGRAPHIC SPACE

# Latin-1
0021..002F ; Pattern_Syntax # EXCLAMATION MARK..SOLIDUS
003A..0040 ; Pattern_Syntax # COLON..COMMERCIAL AT
005B..0060 ; Pattern_Syntax # LEFT SQUARE BRACKET..GRAVE ACCENT
007B..007E ; Pattern_Syntax # LEFT CURLY BRACKET..TILDE
00A1..00A7 ; Pattern_Syntax # INVERTED EXCLAMATION MARK..SECTION SIGN
00A9 ; Pattern_Syntax # COPYRIGHT SIGN
00AB..00AC ; Pattern_Syntax # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK..NOT SIGN
00AE ; Pattern_Syntax # REGISTERED SIGN
00B0..00B1 ; Pattern_Syntax # DEGREE SIGN..PLUS-MINUS SIGN
00B6..00B7 ; Pattern_Syntax # PILCROW SIGN..MIDDLE DOT
00BB ; Pattern_Syntax # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
00BF ; Pattern_Syntax # INVERTED QUESTION MARK
00D7 ; Pattern_Syntax # MULTIPLICATION SIGN
00F7 ; Pattern_Syntax # DIVISION SIGN

# General punctuation, may include currently unassigned code points
2010..2027 ; Pattern_Syntax # HYPHEN..HYPHENATION POINT
2030..205E ; Pattern_Syntax # PER MILLE SIGN..<unassigned>

# Whole blocks
# Arrows, Mathematical Operators, Miscellaneous Technical,
# Control Pictures, Optical Character Recognition
# Enclosed Alphanumerics, Box Drawing, Block Elements,
# Geometric Shapes, Miscellaneous Symbols, Dingbats
# Miscellaneous Mathematical Symbols-A, Supplemental Arrows-A,
# Braille Patterns, Supplemental Arrows-B, Miscellaneous Mathematical Symbols-B,
# Supplemental Mathematical Operators, Miscellaneous Symbols and Arrows
# NOTE: may include currently unassigned code points
2190..2BFF ; Pattern_Syntax # LEFTWARDS ARROW..<unassigned-2BFF>

# CJK Symbols and Punctuation
3001..3003 ; Pattern_Syntax # IDEOGRAPHIC COMMA..DITTO MARK
3008..3020 ; Pattern_Syntax # LEFT ANGLE BRACKET..POSTAL MARK FACE
3030 ; Pattern_Syntax # WAVY DASH

# Arabic Presentation Forms-A (should have been encoded elsewhere)
FD3E..FD3F ; Pattern_Syntax # ORNATE LEFT PARENTHESIS..ORNATE RIGHT PARENTHESIS

# CJK Compatibility Forms
FE45..FE46 ; Pattern_Syntax # SESAME DOT..WHITE SESAME DOT
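For illustration, the proposed property values listed above can be transcribed into fixed range tables. The following sketch does so and also shows the simple identifier definition mentioned earlier, in which any non-empty sequence containing neither pattern syntax nor pattern whitespace is an identifier. The function names are hypothetical, and the ranges mirror this draft only; they could change before the properties are finalized.

```python
# (first, last) code point ranges transcribed from the draft listing.
PATTERN_WHITE_SPACE = (
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x2000, 0x200A), (0x200E, 0x200F), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

PATTERN_SYNTAX = (
    (0x0021, 0x002F), (0x003A, 0x0040), (0x005B, 0x0060), (0x007B, 0x007E),
    (0x00A1, 0x00A7), (0x00A9, 0x00A9), (0x00AB, 0x00AC), (0x00AE, 0x00AE),
    (0x00B0, 0x00B1), (0x00B6, 0x00B7), (0x00BB, 0x00BB), (0x00BF, 0x00BF),
    (0x00D7, 0x00D7), (0x00F7, 0x00F7), (0x2010, 0x2027), (0x2030, 0x205E),
    (0x2190, 0x2BFF), (0x3001, 0x3003), (0x3008, 0x3020), (0x3030, 0x3030),
    (0xFD3E, 0xFD3F), (0xFE45, 0xFE46),
)


def _in_ranges(ch, ranges):
    """True if the code point of ch falls inside any (first, last) range."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ranges)


def is_pattern_white_space(ch):
    return _in_ranges(ch, PATTERN_WHITE_SPACE)


def is_pattern_syntax(ch):
    return _in_ranges(ch, PATTERN_SYNTAX)


def is_pattern_identifier(text):
    """Non-empty text with no pattern syntax or pattern whitespace characters."""
    return bool(text) and not any(
        is_pattern_syntax(ch) or is_pattern_white_space(ch) for ch in text
    )
```

Because the tables never change, this check gives the same answer under every future version of the standard, at the cost of accepting many "unnatural" identifiers. Note that under these draft ranges the underscore (U+005F) counts as pattern syntax.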
Note to Reviewers: should the above Arabic Presentation Forms-A and CJK Compatibility Forms be retained?
TBD.
[Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[UCD] | Unicode Character Database. http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files. |
[Unicode] | The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1. |
[UAX15] | UAX #15, Unicode Normalization Forms |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
[XML1.1] | Extensible Markup Language (XML) 1.1 http://www.w3.org/TR/xml11/ |
The following summarizes modifications from the previous version of this document.
1 |
|
Copyright © 2000-2003 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.