L2/03-139

Recommendations for POSIX-Style Properties

Latest Version:	http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/posix_classes.html
Previous Versions:	http://oss.software.ibm.com/cvs/icu/icuhtml/design/
Last updated:	2003-04-29, MED

The POSIX-style property names are not well specified, and don't map well to the broader types of characters available in Unicode/10646. For example, there is no provision for titlecase, nor for a distinction between symbols and punctuation. The POSIX categories aren't set up to make distinctions among combining marks, nor among many of the other Unicode properties.

**Character Mnemonics**
HT	U+0009 <CHARACTER TABULATION>
LF	U+000A <LINE FEED>
VT	U+000B <LINE TABULATION>
FF	U+000C <FORM FEED>
CR	U+000D <CARRIAGE RETURN>
IS4	U+001C <INFORMATION SEPARATOR FOUR>
IS3	U+001D <INFORMATION SEPARATOR THREE>
IS2	U+001E <INFORMATION SEPARATOR TWO>
IS1	U+001F <INFORMATION SEPARATOR ONE>
SP	U+0020 SPACE
LL	U+005F (_) LOW LINE
NEL	U+0085 <NEXT LINE>
ZWSP	U+200B ZERO WIDTH SPACE

However, many programs use the POSIX-style properties, so for compatibility it is best to come up with a uniform set of recommendations for how they should be interpreted in a Unicode context. This also relates to Java, since many of the methods on Character ultimately derive from trying to match some of the POSIX categories.

The following compares current Perl, ICU, Java, Windows, and the POSIX spec, and tries to derive a recommendation for the best definition, given the way people use the properties in practice. Note that these are only current snapshots, since those environments may change their definitions, especially as they upgrade beyond Unicode 3.x.

Open Issues:

The main open issues are:

Alphabetic does not include the following characters, which TR 14652 does include in alpha. This is not a formal requirement, but certainly the first two should be in alpha (and in Alphabetic as well):
1. U+309B # (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
2. U+309C # (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
3. U+30FB # (・) KATAKANA MIDDLE DOT
ZWSP, while a Z character, is for line break control and is not included in Whitespace. However, the fact that it is still in Z is very misleading. The recommendation is to change the General_Category to Cf to accurately reflect its status.

Feedback is welcome, at mailto:icu@oss.software.ibm.com.

Comparison Table

Notes:

The Java function names are methods on Character, unless otherwise noted.
The Windows column lists CRT function, then .NET System.Char method, then equivalence.
The normal property abbreviations are used, e.g. gc for General_Category. See PropertyAliases and PropertyValueAliases for the abbreviations.
For values of the General Category, see General_Categories.
For the boolean combinations, & means intersection of sets, - means set difference, [...] are for grouping, and adjacency is union. Thus
- [\p{Lowercase}\p{gc=Lt}] & {alpha} - [A-F] means the set (((Lowercase ∪ Lt) ∩ alpha) ⊝ {A B C D E F})
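As a rough illustration of this notation, the expression above can be evaluated with ordinary Python sets. This is only a sketch under stated assumptions: Python's `unicodedata` module exposes General_Category but not the Lowercase or Alphabetic properties, so `gc=Ll` and `gc=L or M` are used as stand-ins, and the hypothetical helper `gc()` scans only a small range (`LIMIT`) to keep the demo fast.

```python
import unicodedata

LIMIT = 0x250  # assumption: scan only U+0000..U+024F to keep the demo fast


def gc(*prefixes):
    """Characters below LIMIT whose General_Category starts with any prefix."""
    return {
        chr(cp)
        for cp in range(LIMIT)
        if unicodedata.category(chr(cp)).startswith(prefixes)
    }


lowercase = gc("Ll")       # stand-in for \p{Lowercase} (assumption)
titlecase = gc("Lt")       # \p{gc=Lt}
alpha = gc("L", "M")       # stand-in for {alpha} (assumption)
hex_upper = set("ABCDEF")  # [A-F]

# Adjacency is union (|), & is intersection, - is set difference.
# Explicit parentheses mirror the grouping given in the note above.
result = ((lowercase | titlecase) & alpha) - hex_upper
```

The parentheses are written out because Python binds `-` more tightly than `&`, which differs from the left-to-right reading used in this document.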

Perl	ICU	Java	Windows	Recommended	Comments
punct P	u_charType gc=P	getTypegc=P	iswpunct IsPunctuation gc=P	\p{gc=P}	For a better match to the POSIX locale, add \p{gc=S}. Not recommended generally, due to the confusion of having punct include non-punctuation marks.
alpha gc=L or M	u_isalpha gc=L u_isUAlphabetic Alphabetic=true	isLetter gc=L	iswalpha IsLetter gc=L	\p{Alphabetic}	Alphabetic includes more than gc = Letter. Note that marks (Me, Mn, Mc) are required for words of many languages. While they could be applied to non-alphabetics, their principal use is on alphabetics. See DerivedCoreProperties for Alphabetic, also DerivedGeneralCategory.
lower gc=Ll	u_islower gc=Ll u_isULowercase Lowercase=true	isLowerCase gc=Ll	iswlower IsLower gc=Ll	\p{Lowercase}	Lowercase includes more than gc = Lowercase_Letter (Ll). See DerivedCoreProperties. For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong.
upper gc=Lu	u_isupper gc=Lu u_isUUppercase Uppercase=true	isUpperCase gc=Lu	iswupper IsUpper gc=Lu	\p{Uppercase}	Uppercase includes more than gc = Uppercase_Letter (Lu). For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong.
digit gc=Nd \d	u_isdigit gc=Nd	isDigit gc=Nd	iswdigit IsDigit gc=Nd	\p{gc=Nd}	Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0+, this is the same as gc = Decimal_Number (Nd). See DerivedNumericType. For strict POSIX, intersect recommendation with {ASCII}.
xdigit 0..9, A..F, a..f	u_getIntPropertyValue UCHAR_ASCII_HEX_DIGIT 0-9 A-F a-f UCHAR_HEX_DIGIT adds fullwidth	digit != -1 gc=Nd a-f, A-F	∅	\p{gc=Nd} a-f, A-F, ａ-ｆ, Ａ-Ｆ	The A-F are upper & lower, narrow and fullwidth. The POSIX spec requires that xdigit contains digit. This also matches Java. For strict POSIX, intersect recommendation with {ASCII}
alnum gc=L or M or N	u_isalnum gc=L or Nd	isLetterOrDigit gc=L or Nd	iswalnum IsLetterOrDigit gc=L or Nd	\p{alpha} \p{digit}	Simple combination of other properties
Perl	ICU	Java	Windows	Recommended	Comments
cntrl gc=C	u_isISOControl gc=Cc u_iscntrl gc=Cc or Cf or Zl or Zp	isISOControl gc=Cc	iswcntrl IsControl gc=Cc	\p{gc=Control}
graph gc=L or M or N or P or S or Co	∅	∅	iswgraph not in .NET ??	All but: [\p{space} \p{gc=Cc} \p{gc=Cs} \p{gc=Cn}]	Perl is the same as excluding: Z, Cc, Cf, Cs, Cn. POSIX: includes alpha, digit, punct, excludes cntrl
print gc=graph + Zs	u_isprint All but gc=C	∅	iswprint not in .NET ??	\p{graph} \p{space}	POSIX: includes graph, <space>
Perl	ICU	Java	Windows	Recommended	Comments
space Z or HT..CR \s	u_isWhitespace = Java_isWhitespace + NEL u_isJavaSpaceChar = Java_isSpaceChar u_isspace = Z + HT..CR, IS4..IS1 u_isUWhiteSpace Unicode White_Space	isWhitespace Z + HT..CR, IS4..IS1 - no-break-whitespace isSpaceChar gc=Z isSpace HT..CR (deprecated)	iswspace IsWhiteSpace gc=Z or HT..CR, NEL	\p{Whitespace}	See Whitespace_Comparison for a comparison of WhiteSpace to Z, and for no-break-whitespace. See PropList for the definition of Whitespace (also in U3.1, U3.2). Note: ZWSP, while a Z character, is for line break control and should not be included.
blank gc=Zl or Zp or HT, SP	∅	∅	∅	\p{Whitespace} - [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL} \p{gc=Zl} \p{gc=Zp}]	"horizontal" whitespace. POSIX: Space, Tab,...
Perl	ICU	Java	Windows	Recommended	Comments
word L or M or N or "_" \w	see below	see below	see below	\p{alpha} \p{digit} \p{gc=Pc}	This is only an approximation to Word Boundaries (see below). The gc=Pc is added in for programming language identifiers, thus adding "_".
\X	BreakIterator (ICU4C) ubrk.h BreakIterator (ICU4J)	BreakIterator	BreakIterator	Default Grapheme Cluster Boundaries	See UAX #29: Text Boundaries, also GraphemeClusterBreakTest.html. Other functions are used for programming language identifier boundaries.
\b	BreakIterator (ICU4C) ubrk.h BreakIterator (ICU4J)	BreakIterator	BreakIterator	Default Word Boundaries	If there is a requirement that \b align with \w, then it would use the approximation above instead. See UAX #29: Text Boundaries, also WordBreakTest.html. Other functions are used for programming language identifier boundaries.
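The Recommended column can be sketched as a handful of predicates. This is an approximation under stated assumptions, not a reference implementation: Python's `unicodedata` has no Alphabetic or White_Space property, so `is_alpha` falls back to gc=L or Nl, and `WHITE_SPACE` is hard-coded from the Whitespace Comparison list in this document (later Unicode versions may differ). All function names are hypothetical.

```python
import unicodedata

# White_Space as listed in this document's Whitespace Comparison section.
WHITE_SPACE = (
    set(map(chr, range(0x0009, 0x000E)))      # HT..CR
    | {"\u0085", "\u0020", "\u00A0", "\u1680", "\u180E"}
    | set(map(chr, range(0x2000, 0x200B)))    # EN QUAD..HAIR SPACE
    | {"\u2028", "\u2029", "\u202F", "\u205F", "\u3000"}
)
# LF, VT, FF, CR, NEL, gc=Zl, gc=Zp: removed from White_Space to get blank.
VERTICAL = {"\n", "\x0b", "\x0c", "\r", "\u0085", "\u2028", "\u2029"}
# a-f, A-F, narrow and fullwidth.
HEX_LETTERS = (
    set("abcdefABCDEF")
    | set(map(chr, range(0xFF21, 0xFF27)))
    | set(map(chr, range(0xFF41, 0xFF47)))
)


def gc(ch):
    return unicodedata.category(ch)


def is_punct(ch):
    return gc(ch).startswith("P")


def is_alpha(ch):
    # Assumption: gc=L or Nl approximates \p{Alphabetic}.
    return gc(ch).startswith("L") or gc(ch) == "Nl"


def is_digit(ch):
    return gc(ch) == "Nd"


def is_xdigit(ch):
    return is_digit(ch) or ch in HEX_LETTERS


def is_alnum(ch):
    return is_alpha(ch) or is_digit(ch)


def is_cntrl(ch):
    return gc(ch) == "Cc"


def is_space(ch):
    return ch in WHITE_SPACE


def is_blank(ch):
    return ch in WHITE_SPACE and ch not in VERTICAL


def is_word(ch):
    return is_alnum(ch) or gc(ch) == "Pc"
```

For strict POSIX behavior, a caller could additionally require `ord(ch) < 0x80` for digit and xdigit, as the Comments column suggests.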

References:

The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2003 Edition, "Locale" chapter
- http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap07.html
Related Perl Links
ICU Links
- http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unicode/uchar.h (latest version)
- http://oss.software.ibm.com/icu/apiref/uchar_8h.html
Java Links
- http://java.sun.com/j2se/1.4.1/docs/api/java/lang/Character.html
TR 14652
- http://anubis.dkuug.dk/jtc1/sc22/wg20/docs/n972-14652ft.pdf
Unicode:

ICU Background

ICU4J ("for Java") implements the same methods in its UCharacter class as the JDK's Character class, with the same method names and semantics, only extended to all of Unicode (0..0x10ffff) and up to date with the latest Unicode version (current snapshot at Unicode 4). There are also additional methods for more Unicode-defined properties.
ICU4C ("for C/C++") provides C functions with more C-style names but the same semantics as ICU4J/JDK. There is no full definition of the C APIs for more than ASCII anyway (which is why we are having this discussion), so they were made parallel with Java for ease of porting.
A good link on ICU properties, although not quite up to date right now (still on ICU 2.4 level), is the Properties chapter in the User Guide: http://oss.software.ibm.com/icu/userguide/properties.html

Whitespace Comparison:

In White_Space, but not in Z:
        U+0009  # <CHARACTER TABULATION (HT)>
        U+000A  # <LINE FEED (LF)>
        U+000B  # <LINE TABULATION (VT)>
        U+000C  # <FORM FEED (FF)>
        U+000D  # <CARRIAGE RETURN (CR)>
        U+0085  # <NEXT LINE (NEL)>
Not in White_Space, but in Z:
        U+200B  # ZERO WIDTH SPACE
In both White_Space and Z:
        U+0020  # SPACE
        U+00A0  # NO-BREAK SPACE
        U+1680  # OGHAM SPACE MARK
        U+180E  # MONGOLIAN VOWEL SEPARATOR
        U+2000  # EN QUAD
        U+2001  # EM QUAD
        U+2002  # EN SPACE
        U+2003  # EM SPACE
        U+2004  # THREE-PER-EM SPACE
        U+2005  # FOUR-PER-EM SPACE
        U+2006  # SIX-PER-EM SPACE
        U+2007  # FIGURE SPACE
        U+2008  # PUNCTUATION SPACE
        U+2009  # THIN SPACE
        U+200A  # HAIR SPACE
        U+2028  # LINE SEPARATOR
        U+2029  # PARAGRAPH SEPARATOR
        U+202F  # NARROW NO-BREAK SPACE
        U+205F  # MEDIUM MATHEMATICAL SPACE
        U+3000  # IDEOGRAPHIC SPACE

No-break-whitespace is defined to be \p{dt=nb}&\p{Whitespace}. In Unicode 4.0, it consists of the following 3 characters:

NBSP	U+00A0 NO-BREAK SPACE
NNBSP	U+202F NARROW NO-BREAK SPACE
FSP	U+2007 FIGURE SPACE

See DerivedDecompositionType for nb (nobreak) values for dt (decomposition type).
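The definition \p{dt=nb}&\p{Whitespace} can be checked with a short sketch. It assumes the White_Space list given above (hard-coded here, since Python's `unicodedata` does not expose that property) and relies on `unicodedata.decomposition`, which reports the decomposition type as a `<noBreak>` tag.

```python
import unicodedata

# White_Space as listed in the Whitespace Comparison above (assumption:
# hard-coded from this document rather than read from PropList).
WHITE_SPACE = (
    set(map(chr, range(0x0009, 0x000E)))      # HT..CR
    | {"\u0085", "\u0020", "\u00A0", "\u1680", "\u180E"}
    | set(map(chr, range(0x2000, 0x200B)))    # EN QUAD..HAIR SPACE
    | {"\u2028", "\u2029", "\u202F", "\u205F", "\u3000"}
)


def is_nobreak(ch):
    """True if the character's decomposition type is nb (noBreak)."""
    return unicodedata.decomposition(ch).startswith("<noBreak>")


# \p{dt=nb} & \p{Whitespace}
no_break_whitespace = {ch for ch in WHITE_SPACE if is_nobreak(ch)}

for ch in sorted(no_break_whitespace):
    print(f"U+{ord(ch):04X}")
```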

General Categories

Abbr.	Description
Lu	Letter, Uppercase
Ll	Letter, Lowercase
Lt	Letter, Titlecase
Lm	Letter, Modifier
Lo	Letter, Other
Mn	Mark, Non-Spacing
Mc	Mark, Spacing Combining
Me	Mark, Enclosing
Nd	Number, Decimal
Nl	Number, Letter
No	Number, Other
Pc	Punctuation, Connector
Pd	Punctuation, Dash
Ps	Punctuation, Open
Pe	Punctuation, Close
Pi	Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf	Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po	Punctuation, Other
Sm	Symbol, Math
Sc	Symbol, Currency
Sk	Symbol, Modifier
So	Symbol, Other
Zs	Separator, Space
Zl	Separator, Line
Zp	Separator, Paragraph
Cc	Other, Control
Cf	Other, Format
Cs	Other, Surrogate
Co	Other, Private Use
Cn	Other, Not Assigned (no characters in the file have this property)

Posix Requirements

Below is a table from the Posix requirements. The effect of this is:

alpha, digit, punct, cntrl are all disjoint
space, cntrl, blank are pairwise disjoint with any of alpha, digit, xdigit
alpha includes upper, lower
graph includes alpha, digit, punct
print includes graph
xdigit includes digit

Table: Valid Character Class Combinations

	Can Also Belong To
In Class	upper	lower	alpha	digit	space	cntrl	punct	graph	print	xdigit	blank
upper		-	A	x	x	x	x	A	A	-	x
lower	-		A	x	x	x	x	A	A	-	x
alpha	-	-		x	x	x	x	A	A	-	x
digit	x	x	x		x	x	x	A	A	A	x
space	x	x	x	x		-	*	*	*	x	-
cntrl	x	x	x	x	-		x	x	x	x	-
punct	x	x	x	x	-	x		A	A	x	-
graph	-	-	-	-	-	x	-		A	-	-
print	-	-	-	-	-	x	-	-		-	-
xdigit	-	-	-	-	x	x	x	A	A		x
blank	x	x	x	x	A	-	*	*	*	x

Explanation of codes:

A	Automatically included; see text.
-	Permitted.
x	Mutually-exclusive.
*	See note 2.

Note 2: The <space>, which is part of the space and blank classes, cannot belong to punct or graph, but shall automatically belong to the print class. Other space or blank characters can be classified as any of punct, graph, or print.
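The disjointness and inclusion requirements summarized above can be tested mechanically against any candidate set of definitions. The following is a sketch only: the class definitions approximate the recommendations in this document using General_Category (Alphabetic, Uppercase and Lowercase are not available in Python's `unicodedata`), the scan is limited to the BMP for speed, and the names `build_classes` and `violations` are hypothetical.

```python
import unicodedata
from itertools import combinations

BMP = [chr(cp) for cp in range(0x10000)]  # assumption: BMP only, for speed

WHITE_SPACE = (
    set(map(chr, range(0x0009, 0x000E))) | {"\u0085", "\u0020", "\u00A0"}
    | {"\u1680", "\u180E"} | set(map(chr, range(0x2000, 0x200B)))
    | {"\u2028", "\u2029", "\u202F", "\u205F", "\u3000"}
)
VERTICAL = {"\n", "\x0b", "\x0c", "\r", "\u0085", "\u2028", "\u2029"}
HEX_LETTERS = (
    set("abcdefABCDEF")
    | set(map(chr, range(0xFF21, 0xFF27)))
    | set(map(chr, range(0xFF41, 0xFF47)))
)


def by_gc(*prefixes):
    return {ch for ch in BMP if unicodedata.category(ch).startswith(prefixes)}


def build_classes():
    c = {
        "upper": by_gc("Lu"),        # approximates \p{Uppercase}
        "lower": by_gc("Ll"),        # approximates \p{Lowercase}
        "alpha": by_gc("L", "Nl"),   # approximates \p{Alphabetic}
        "digit": by_gc("Nd"),
        "punct": by_gc("P"),
        "cntrl": by_gc("Cc"),
        "space": set(WHITE_SPACE),
    }
    c["blank"] = c["space"] - VERTICAL
    c["xdigit"] = c["digit"] | HEX_LETTERS
    c["graph"] = set(BMP) - c["space"] - by_gc("Cc", "Cs", "Cn")
    c["print"] = c["graph"] | c["space"]
    return c


def violations(c):
    """Return a list of POSIX requirements (as listed above) that c breaks."""
    found = []
    for a, b in combinations(["alpha", "digit", "punct", "cntrl"], 2):
        if c[a] & c[b]:
            found.append(f"{a} overlaps {b}")
    for a in ["space", "cntrl", "blank"]:
        for b in ["alpha", "digit", "xdigit"]:
            if c[a] & c[b]:
                found.append(f"{a} overlaps {b}")
    includes = [("alpha", "upper"), ("alpha", "lower"), ("graph", "alpha"),
                ("graph", "digit"), ("graph", "punct"), ("print", "graph"),
                ("xdigit", "digit")]
    for sup, sub in includes:
        if not c[sub] <= c[sup]:
            found.append(f"{sup} does not include {sub}")
    return found


classes = build_classes()
problems = violations(classes)

# A deliberately broken variant: letting alpha absorb digit is reported.
bad = dict(classes, alpha=classes["alpha"] | classes["digit"])
bad_problems = violations(bad)
```

Because General_Category partitions the code space, the gc-based classes are disjoint by construction; the checker is more useful when a definition mixes in other properties or hand-written lists.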

TR 14652

TR 14652 only provides the contents of each value as a long list of characters, not as a boolean combination of Unicode properties, and it is well out of date, so it is hard to compare directly. However, there are two generated comparison files, based on Unicode version.

Some comments based on those comparisons:

[snip 1,2]

lower. Quite a number of items in the TR look like they must have been typos:
U+01F1 # (Ǳ) LATIN CAPITAL LETTER DZ
U+03E2 # (Ϣ) COPTIC CAPITAL LETTER SHEI
U+03E4 # (Ϥ) COPTIC CAPITAL LETTER FEI
U+03E6 # (Ϧ) COPTIC CAPITAL LETTER KHEI
U+03E8 # (Ϩ) COPTIC CAPITAL LETTER HORI
U+03EA # (Ϫ) COPTIC CAPITAL LETTER GANGIA
U+03EC # (Ϭ) COPTIC CAPITAL LETTER SHIMA
U+03EE # (Ϯ) COPTIC CAPITAL LETTER DEI
...
alpha. There is one place where the proposal misses a couple of characters in the ISO 14652 alpha:

In ISO_14652_alpha, but not in Alphabetic + gc=M:
U+309B..U+309C # (゛..゜) KATAKANA-HIRAGANA VOICED SOUND MARK
U+30FB # (・) KATAKANA MIDDLE DOT

It strongly looks like these 3 ought to be in Alphabetic, if that is the case. Also, the text says:

"alpha - Define characters to be classified as used to spell out the words for natural languages; such as letters, syllabic or ideographic characters."

Unclear whether this should include characters like the Hebrew Punctuation Gerish, which are parts of words.
space. For the ASCII range (00..7F), the POSIX standard only has:

space <tab>;<newline>;<vertical-tab>;<form-feed>;\
<carriage-return>;<space>

It seems very surprising for the TR to introduce

U+0008 # <BACKSPACE>

The other differences in the TR (other than being out of date) appear to be that it excludes the non-breaking spaces:

U+00A0 # ( ) NO-BREAK SPACE
U+2007 # ( ) FIGURE SPACE
U+202F # ( ) NARROW NO-BREAK SPACE

Very hard to say whether these should be in or out, since the POSIX standard gives little guidance. And if they are out, the question is whether they should be correspondingly in graph.
punct. The TR includes both gc=P and gc=S. A reasonable choice, given the way POSIX deals with them in ASCII. But note that this is counter to how Java, Windows, and Perl deal with them.
cntrl. This matches exactly. However, if one took the same approach with this as with punct, then gc=Cf characters might be included, and perhaps also Zp, Zl.
graph. We have:

In ISO_14652_graph, but not in All - gc=Cc, Cs, Cn, or Z:
U+00A0 # ( ) NO-BREAK SPACE
U+2000..U+200B # ( ..) EN QUAD..ZERO WIDTH SPACE
U+2028..U+2029 # ( .. ) LINE SEPARATOR..PARAGRAPH SEPARATOR
U+3000 # (　) IDEOGRAPHIC SPACE
Total: 16

which sort of lines up with what is done with space, but not really. Again, unclear whether Cfs ought to be in graph or in cntrl or neither. The TR also excludes private use and some other characters that seem reasonable to include.
xdigit. The narrow interpretation is just [0-9A-Fa-f]. But xdigit is a superset of digit, and so includes the wide digits ０-９; for consistency, the wide letters ａ-ｆ, Ａ-Ｆ should be in as well. In many ways, the broader definition is also more useful. If you want to narrow a broader definition, it is easy to, say, mask the broader one with U+0000..U+007F. To broaden a narrow definition, on the other hand, requires a hard-coded list.

Philosophically, it is odd to have A١٢B as a hex number (with Arabic numerals), but for that matter it is odd to have 2١٢3 as a decimal number. In either case, it doesn't hurt much, and if any individual client wanted to impose masking on top of that, it would be easy.
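The masking described here is a one-line filter on top of the broad definition. The sketch below is illustrative only: the function names are hypothetical, and `parse_hex` is an assumed helper showing how a mixed-script string such as A١٢B would be read under the broad definition.

```python
import unicodedata

# a-f, A-F, narrow and fullwidth (the hard-coded part of the broad definition).
HEX_LETTERS = (
    set("abcdefABCDEF")
    | set(map(chr, range(0xFF21, 0xFF27)))   # fullwidth A-F
    | set(map(chr, range(0xFF41, 0xFF47)))   # fullwidth a-f
)


def is_xdigit_broad(ch):
    """Recommended xdigit: \\p{gc=Nd} plus narrow and fullwidth a-f, A-F."""
    return unicodedata.category(ch) == "Nd" or ch in HEX_LETTERS


def is_xdigit_posix(ch):
    """Strict POSIX xdigit: the broad set masked with U+0000..U+007F."""
    return ord(ch) <= 0x7F and is_xdigit_broad(ch)


def parse_hex(text):
    """Hypothetical helper: value of a hex string under the broad definition."""
    value = 0
    for ch in text:
        if not is_xdigit_broad(ch):
            raise ValueError(f"not a hex digit: U+{ord(ch):04X}")
        if unicodedata.category(ch) == "Nd":
            digit = unicodedata.digit(ch)
        else:
            # NFKC folds fullwidth letters to their ASCII counterparts.
            digit = int(unicodedata.normalize("NFKC", ch), 16)
        value = value * 16 + digit
    return value
```

Going the other way, from the narrow set to the broad one, would require carrying the `HEX_LETTERS` list and the Nd test explicitly, which is the asymmetry noted above.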
blank. Surprising again that the TR deviates from the POSIX standard over ASCII by excluding <tab>. POSIX has:

blank <space>;<tab>

And for it to introduce the following is very odd.

U+0008 # <BACKSPACE>
print. This is only defined in a comment:
% "print" is by default "graph", and the <space> character
Yet the standard says that print may include other space characters, so it is not required that the only difference be <space>.

Note: With both blank and space, this emphasizes to me again that U+200B # () ZERO WIDTH SPACE should not be in Zs; it should be in Cf. We correct that in White_Space property, but it will continue to be a source of confusion unless it is removed from Zs.