Latest Version: | http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/posix_classes.html |
Previous Versions: | http://oss.software.ibm.com/cvs/icu/icuhtml/design/ |
Last updated: | 2003-04-29, MED |
The POSIX-style property names are not well specified, and don't map well to the broader types of characters available in Unicode/10646. For example, there is no provision for titlecase, nor for a distinction between symbols and punctuation. The POSIX categories aren't set up to make distinctions among combining marks, nor among many of the other Unicode properties.
HT | U+0009 <CHARACTER TABULATION> |
LF | U+000A <LINE FEED> |
VT | U+000B <LINE TABULATION> |
FF | U+000C <FORM FEED> |
CR | U+000D <CARRIAGE RETURN> |
IS4 | U+001C <INFORMATION SEPARATOR FOUR> |
IS3 | U+001D <INFORMATION SEPARATOR THREE> |
IS2 | U+001E <INFORMATION SEPARATOR TWO> |
IS1 | U+001F <INFORMATION SEPARATOR ONE> |
SP | U+0020 SPACE |
LL | U+005F (_) LOW LINE |
NEL | U+0085 <NEXT LINE> |
ZWSP | U+200B ZERO WIDTH SPACE |
However, many programs use the POSIX-style properties, so for compatibility it is best to come up with a uniform set of recommendations for how they should be interpreted in a Unicode context. This also relates to Java, since many of the methods on Character ultimately derive from trying to match some of the POSIX categories.
The following compares current Perl, ICU, Java, Windows, and the POSIX spec, and tries to derive a recommendation for the best definition, given the way people use the properties in practice. Note that these are only current snapshots, since those environments may change their definitions, especially as they upgrade beyond Unicode 3.x.
The main open issues are:
Feedback is welcome, at mailto:icu@oss.software.ibm.com.
Perl | ICU | Java | Windows | Recommended | Comments |
---|---|---|---|---|---|
punct P | u_charType gc=P | getType gc=P | iswpunct IsPunctuation gc=P | \p{gc=P} | For a better match to the POSIX locale, add \p{gc=S}. Not recommended generally, due to the confusion of having punct include non-punctuation marks. |
alpha gc=L or M | u_isalpha gc=L u_isUAlphabetic | isLetter gc=L | iswalpha IsLetter gc=L | \p{Alphabetic} | Alphabetic includes more than gc = Letter. Note that marks (Me, Mn, Mc) are required for words of many languages. While they could be applied to non-alphabetics, their principal use is on alphabetics. See DerivedCoreProperties for Alphabetic, also DerivedGeneralCategory. |
lower gc=Ll | u_islower gc=Ll u_isULowercase | isLowerCase gc=Ll | iswlower IsLower gc=Ll | \p{Lowercase} | Lowercase includes more than gc = Lowercase_Letter (Ll). See DerivedCoreProperties. For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong. |
upper gc=Lu | u_isupper gc=Lu u_isUUppercase | isUpperCase gc=Lu | iswupper IsUpper gc=Lu | \p{Uppercase} | Uppercase includes more than gc = Uppercase_Letter (Lu). For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong. |
digit gc=Nd \d | u_isdigit gc=Nd | isDigit gc=Nd | iswdigit IsDigit gc=Nd | \p{gc=Nd} | Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0+, this is the same as gc = Decimal_Number (Nd). See DerivedNumericType. For strict POSIX, intersect recommendation with {ASCII}. |
xdigit 0..9, A..F, a..f | u_getIntPropertyValue UCHAR_ASCII_HEX_DIGIT UCHAR_HEX_DIGIT | digit != -1 gc=Nd a-f, A-F | ∅ | \p{gc=Nd} a-f, A-F, ａ-ｆ, Ａ-Ｆ | The A-F are upper & lower, narrow and fullwidth. The POSIX spec requires that xdigit contains digit. This also matches Java. For strict POSIX, intersect recommendation with {ASCII}. |
alnum gc=L or M or N | u_isalnum gc=L or Nd | isLetterOrDigit gc=L or Nd | iswalnum IsLetterOrDigit gc=L or Nd | \p{alpha} \p{digit} | Simple combination of other properties |
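
The punct, digit, and xdigit recommendations above depend only on the General Category, so they can be sketched with Python's standard `unicodedata` module. This is a minimal sketch, not ICU's implementation: the function names and the `posix_compat` / `strict_posix` flags are hypothetical, and the results follow whatever Unicode version the running interpreter ships. The alpha, lower, and upper recommendations need derived properties (Alphabetic, Lowercase, Uppercase) that `unicodedata` does not expose, so they are not shown.

```python
import unicodedata

# ASCII and fullwidth hex letters: a-f, A-F, U+FF41..U+FF46, U+FF21..U+FF26.
HEX_LETTERS = (
    frozenset("abcdefABCDEF")
    | frozenset(chr(cp) for cp in range(0xFF21, 0xFF27))
    | frozenset(chr(cp) for cp in range(0xFF41, 0xFF47))
)


def is_punct(ch, posix_compat=False):
    """Recommended punct is gc=P; posix_compat (hypothetical flag) adds gc=S."""
    gc = unicodedata.category(ch)
    return gc.startswith("P") or (posix_compat and gc.startswith("S"))


def is_digit(ch, strict_posix=False):
    """Recommended digit is gc=Nd; strict_posix intersects with ASCII."""
    if strict_posix and not ch.isascii():
        return False
    return unicodedata.category(ch) == "Nd"


def is_xdigit(ch, strict_posix=False):
    """Recommended xdigit is gc=Nd plus narrow and fullwidth a-f, A-F."""
    if strict_posix and not ch.isascii():
        return False
    return unicodedata.category(ch) == "Nd" or ch in HEX_LETTERS
```

With these definitions, `is_punct("$")` is false unless `posix_compat=True` is passed, and `is_digit("\u2163")` (ROMAN NUMERAL FOUR, gc=Nl) is false, matching the comments in the table.
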
Perl | ICU | Java | Windows | Recommended | Comments |
---|---|---|---|---|---|
cntrl gc=C | u_isISOControl gc=Cc u_iscntrl | isISOControl gc=Cc | iswcntrl IsControl gc=Cc | \p{gc=Control} | |
graph gc=L or M or N or P or S or Co | ∅ | ∅ | iswgraph not in .NET ?? | All but: [\p{space} \p{gc=Cc} \p{gc=Cs} \p{gc=Cn}] | Perl is the same as excluding: Z, Cc, Cf, Cs, Cn. POSIX: includes alpha, digit, punct, excludes cntrl |
print gc=graph + Zs | u_isprint All but gc=C | ∅ | iswprint not in .NET ?? | \p{graph} \p{space} | POSIX: includes graph, <space> |
Perl | ICU | Java | Windows | Recommended | Comments |
---|---|---|---|---|---|
space Z or HT..CR \s | u_isWhitespace = Java_isWhitespace + NEL u_isJavaSpaceChar u_isspace u_isUWhiteSpace | isWhitespace Z + HT..CR, IS4..IS1 - no-break-whitespace isSpaceChar isSpace | iswspace IsWhiteSpace gc=Z or HT..CR, NEL | \p{Whitespace} | See Whitespace_Comparison for a comparison of WhiteSpace to Z, and for no-break-whitespace. See PropList for the definition of Whitespace (also in U3.1, U3.2). Note: ZWSP, while a Z character, is for line break control and should not be included. |
blank gc=Zl or Zp or HT, SP | ∅ | ∅ | ∅ | \p{Whitespace} - [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Zl} \p{gc=Zp}] | "horizontal" whitespace. POSIX: Space, Tab,... |
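
The recommended space and blank sets, together with the graph and print recommendations that are built on space, can be sketched as follows. This is a sketch under stated assumptions: the White_Space literal is copied from the comparison list in this document (a Unicode 4.0-era snapshot), the function names are hypothetical, and General Category values come from the interpreter's own Unicode data.

```python
import unicodedata

# White_Space as listed in this document's comparison. Later Unicode
# versions removed U+180E from the property, so adjust the literal when
# targeting current data.
WHITE_SPACE = frozenset(
    "\t\n\x0b\x0c\r \x85\xa0\u1680\u180e\u2028\u2029\u202f\u205f\u3000"
) | frozenset(chr(cp) for cp in range(0x2000, 0x200B))  # U+2000..U+200A

# LF, VT, FF, CR, NEL: the named characters removed to form blank.
VERTICAL_CONTROLS = frozenset("\n\x0b\x0c\r\x85")


def is_space(ch):
    """Recommended space: \\p{Whitespace}. ZWSP (U+200B) is not included."""
    return ch in WHITE_SPACE


def is_blank(ch):
    """Recommended blank: Whitespace minus LF, VT, FF, CR, NEL, gc=Zl, gc=Zp."""
    if not is_space(ch) or ch in VERTICAL_CONTROLS:
        return False
    return unicodedata.category(ch) not in ("Zl", "Zp")


def is_graph(ch):
    """Recommended graph: everything but space, gc=Cc, gc=Cs, gc=Cn."""
    return not is_space(ch) and unicodedata.category(ch) not in ("Cc", "Cs", "Cn")


def is_print(ch):
    """Recommended print: graph plus space (read literally, so TAB and LF count)."""
    return is_graph(ch) or is_space(ch)
```

Note that format characters (gc=Cf) such as U+00AD SOFT HYPHEN remain in graph under this reading, since only Cc, Cs, and Cn are excluded.
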
Perl | ICU | Java | Windows | Recommended | Comments |
---|---|---|---|---|---|
word L or M or N or "_" \w | see below | see below | see below | \p{alpha} \p{digit} \p{gc=Pc} | This is only an approximation to Word Boundaries (see below). The gc=Pc is added in for programming language identifiers, thus adding "_". |
\X | BreakIterator (ICU4C) ubrk.h BreakIterator (ICU4J) | BreakIterator | BreakIterator | Default Grapheme Cluster Boundaries | See UAX #29: Text Boundaries, also GraphemeClusterBreakTest.html. Other functions are used for programming language identifier boundaries. |
\b | BreakIterator (ICU4C) ubrk.h BreakIterator (ICU4J) | BreakIterator | BreakIterator | Default Word Boundaries | If there is a requirement that \b align with \w, then it would use the approximation above instead. See UAX #29: Text Boundaries, also WordBreakTest.html. Other functions are used for programming language identifier boundaries. |
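
To illustrate why the word class is only an approximation to word boundaries, here is a small sketch that splits text into runs of word characters. It assumes a stand-in for \p{alpha}: the standard `unicodedata` module has no Alphabetic property, so gc=L, gc=M, and gc=Nl are used instead, which is closer to the Perl column than to the recommendation. The names `is_word_char` and `word_runs` are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_word_char(ch):
    """Approximate \\p{alpha} \\p{digit} \\p{gc=Pc}, with alpha taken as L, M, Nl."""
    gc = unicodedata.category(ch)
    return gc[0] in "LM" or gc in ("Nl", "Nd", "Pc")


def word_runs(text):
    """Return the maximal runs of word characters, in order of appearance."""
    return [
        "".join(group)
        for is_word, group in groupby(text, key=is_word_char)
        if is_word
    ]
```

For example, `word_runs("foo_bar baz-qux")` gives `['foo_bar', 'baz', 'qux']`, because LOW LINE is gc=Pc while HYPHEN-MINUS is gc=Pd. On the other hand, `word_runs("can't")` gives `['can', 't']`, whereas the default word boundaries of UAX #29 are designed to keep such a word together. That difference is the reason \b is recommended to follow the default word boundaries rather than \w.
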
In White_Space, but not in Z:
U+0009 # <CHARACTER TABULATION (HT)>
U+000A # <LINE FEED (LF)>
U+000B # <LINE TABULATION (VT)>
U+000C # <FORM FEED (FF)>
U+000D # <CARRIAGE RETURN (CR)>
U+0085 # <NEXT LINE (NEL)>

Not in White_Space, but in Z:
U+200B # ZERO WIDTH SPACE

In both White_Space and Z:
U+0020 # SPACE
U+00A0 # NO-BREAK SPACE
U+1680 # OGHAM SPACE MARK
U+180E # MONGOLIAN VOWEL SEPARATOR
U+2000 # EN QUAD
U+2001 # EM QUAD
U+2002 # EN SPACE
U+2003 # EM SPACE
U+2004 # THREE-PER-EM SPACE
U+2005 # FOUR-PER-EM SPACE
U+2006 # SIX-PER-EM SPACE
U+2007 # FIGURE SPACE
U+2008 # PUNCTUATION SPACE
U+2009 # THIN SPACE
U+200A # HAIR SPACE
U+2028 # LINE SEPARATOR
U+2029 # PARAGRAPH SEPARATOR
U+202F # NARROW NO-BREAK SPACE
U+205F # MEDIUM MATHEMATICAL SPACE
U+3000 # IDEOGRAPHIC SPACE
No-break-whitespace is defined to be \p{dt=nb}&\p{Whitespace}. In Unicode 4.0, it consists of the following 3 characters:
NBSP | U+00A0 NO-BREAK SPACE |
NNBSP | U+202F NARROW NO-BREAK SPACE |
FSP | U+2007 FIGURE SPACE |
See DerivedDecompositionType for nb (nobreak) values for dt (decomposition type).
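
The definition \p{dt=nb}&\p{Whitespace} can be checked mechanically. The sketch below assumes the White_Space set listed in this document and uses `unicodedata.decomposition`, which reports the decomposition type as a `<noBreak>` tag prefix; the helper names are hypothetical.

```python
import unicodedata

# White_Space as listed in this document's comparison (Unicode 4.0-era snapshot).
WHITE_SPACE = frozenset(
    "\t\n\x0b\x0c\r \x85\xa0\u1680\u180e\u2028\u2029\u202f\u205f\u3000"
) | frozenset(chr(cp) for cp in range(0x2000, 0x200B))  # U+2000..U+200A


def is_no_break_whitespace(ch):
    # \p{dt=nb} & \p{Whitespace}
    return ch in WHITE_SPACE and unicodedata.decomposition(ch).startswith("<noBreak>")


def describe(ch):
    # Control characters have no name, so fall back to a placeholder.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}"


NO_BREAK_WHITESPACE = sorted(ch for ch in WHITE_SPACE if is_no_break_whitespace(ch))

for ch in NO_BREAK_WHITESPACE:
    print(describe(ch))
```

Run against the interpreter's Unicode data, this prints the same three characters as the table: U+00A0, U+2007, and U+202F. Other characters with dt=nb are not in White_Space and so are filtered out by the intersection.
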
Abbr. |
Description |
---|---|
Lu | Letter, Uppercase |
Ll | Letter, Lowercase |
Lt | Letter, Titlecase |
Lm | Letter, Modifier |
Lo | Letter, Other |
Mn | Mark, Non-Spacing |
Mc | Mark, Spacing Combining |
Me | Mark, Enclosing |
Nd | Number, Decimal |
Nl | Number, Letter |
No | Number, Other |
Pc | Punctuation, Connector |
Pd | Punctuation, Dash |
Ps | Punctuation, Open |
Pe | Punctuation, Close |
Pi | Punctuation, Initial quote (may behave like Ps or Pe depending on usage) |
Pf | Punctuation, Final quote (may behave like Ps or Pe depending on usage) |
Po | Punctuation, Other |
Sm | Symbol, Math |
Sc | Symbol, Currency |
Sk | Symbol, Modifier |
So | Symbol, Other |
Zs | Separator, Space |
Zl | Separator, Line |
Zp | Separator, Paragraph |
Cc | Other, Control |
Cf | Other, Format |
Cs | Other, Surrogate |
Co | Other, Private Use |
Cn | Other, Not Assigned (no characters in the file have this property) |
Below is a table from the POSIX requirements. The effect of this is:
In Class (rows) / Can Also Belong To (columns) | upper | lower | alpha | digit | space | cntrl | punct | graph | print | xdigit | blank |
---|---|---|---|---|---|---|---|---|---|---|---|
upper | | - | A | x | x | x | x | A | A | - | x |
lower | - | | A | x | x | x | x | A | A | - | x |
alpha | - | - | | x | x | x | x | A | A | - | x |
digit | x | x | x | | x | x | x | A | A | A | x |
space | x | x | x | x | | - | * | * | * | x | - |
cntrl | x | x | x | x | - | | x | x | x | x | - |
punct | x | x | x | x | - | x | | A | A | x | - |
graph | - | - | - | - | - | x | - | | A | - | - |
print | - | - | - | - | - | x | - | - | | - | - |
xdigit | - | - | - | - | x | x | x | A | A | | x |
blank | x | x | x | x | A | - | * | * | * | x | |
Explanation of codes:
A | Automatically included |
- | Permitted |
x | Mutually exclusive |
* | See the note below |
The <space>, which is part of the space and blank classes, cannot belong to punct or graph, but shall automatically belong to the print class. Other space or blank characters can be classified as any of punct, graph, or print.
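
The table can be turned into a mechanical check on any proposed set of class definitions. The sketch below is an illustration, not part of the POSIX text: it reads A as "membership in the row class requires membership in the column class" and x as "the two classes are mutually exclusive", and it leaves the - and * entries unchecked. The names `CLASSES`, `ROWS`, and `violations` are hypothetical.

```python
CLASSES = ["upper", "lower", "alpha", "digit", "space", "cntrl",
           "punct", "graph", "print", "xdigit", "blank"]

# One string per row of the table, one code per column in CLASSES order.
# "." marks the diagonal (a class compared with itself).
ROWS = {
    "upper":  ".-AxxxxAA-x",
    "lower":  "-.AxxxxAA-x",
    "alpha":  "--.xxxxAA-x",
    "digit":  "xxx.xxxAAAx",
    "space":  "xxxx.-***x-",
    "cntrl":  "xxxx-.xxxx-",
    "punct":  "xxxx-x.AAx-",
    "graph":  "-----x-.A--",
    "print":  "-----x--.--",
    "xdigit": "----xxxAA.x",
    "blank":  "xxxxA-***x.",
}


def violations(classes):
    """List the table entries broken by one character's set of classes."""
    found = []
    for row in CLASSES:
        if row not in classes:
            continue
        for col, code in zip(CLASSES, ROWS[row]):
            if code == "A" and col not in classes:
                found.append(f"{row} requires {col}")
            elif code == "x" and col in classes:
                found.append(f"{row} excludes {col}")
    return found
```

For instance, `violations({"upper", "alpha", "graph", "print"})` returns an empty list, while a character classified as both digit and alpha is reported, which is why the strict POSIX variants of the recommendations intersect with {alpha} or {ASCII}.
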
TR 14652 specifies the contents of each value only as a long list of characters, not as a boolean combination of Unicode properties, and it is badly out of date, so it is a bit hard to compare. However, there are two generated comparison files, based on Unicode version.
Some comments based on those comparisons:
[snip 1,2]
Note: With both blank and space, this emphasizes to me again that
U+200B ZERO WIDTH SPACE should not be in Zs; it should be in
Cf. We correct that in the White_Space property, but it will continue to be a source
of confusion unless it is removed from Zs.