L2/03-139

Recommendations for POSIX-Style Properties

Latest Version: http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/posix_classes.html
Previous Versions: http://oss.software.ibm.com/cvs/icu/icuhtml/design/
Last updated: 2003-04-29, MED

The POSIX-style property names are not well specified, and don't really map well to the broader types of characters available in Unicode/10646. For example, there is no provision for titlecase, nor for a distinction between symbols and punctuation. The POSIX categories aren't really set up to make distinctions among combining marks, nor among many of the other Unicode properties.

Character Mnemonics
HT U+0009 <CHARACTER TABULATION>
LF U+000A <LINE FEED>
VT U+000B <LINE TABULATION>
FF U+000C <FORM FEED>
CR U+000D <CARRIAGE RETURN>
IS4 U+001C <INFORMATION SEPARATOR FOUR>
IS3 U+001D <INFORMATION SEPARATOR THREE>
IS2 U+001E <INFORMATION SEPARATOR TWO>
IS1 U+001F <INFORMATION SEPARATOR ONE>
SP U+0020 SPACE
LL U+005F (_) LOW LINE
NEL U+0085 <NEXT LINE>
ZWSP U+200B ZERO WIDTH SPACE

However, many programs use the POSIX-style properties, so for compatibility it is best to come up with a uniform set of recommendations for how they should be interpreted in a Unicode context. This also relates to Java, since many of the methods on Character ultimately derive from trying to match some of the POSIX categories.

The following compares current Perl, ICU, Java, Windows, and the POSIX spec, and tries to derive a recommendation for the best definition, given the way people use the properties in practice. Note that these are only current snapshots, since those environments may change their definitions, especially as they upgrade beyond Unicode 3.x.

Open Issues:

The main open issues are:

  1. Alphabetic does not include the following characters, which TR 14652 includes in alpha. Matching the TR is not a formal requirement, but certainly the first two should be in alpha (and also in Alphabetic!):
    1. U+309B # (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
    2. U+309C # (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
    3. U+30FB # (・) KATAKANA MIDDLE DOT
  2. ZWSP, while a Z character, is for line break control and is not included in Whitespace. However, the fact that it is still in Z is very misleading. The recommendation is to change the General_Category to Cf to accurately reflect its status.

Feedback is welcome, at mailto:icu@oss.software.ibm.com.

Comparison Table

Notes:

Perl ICU Java Windows Recommended Comments
punct
P
u_charType
gc=P
getType
gc=P
iswpunct
IsPunctuation

gc=P
For a better match to the POSIX locale, add \p{gc=S}. Not recommended generally, due to the confusion of having punct include non-punctuation marks.
alpha
gc=L or M
u_isalpha
gc=L

u_isUAlphabetic
Alphabetic=true

isLetter
gc=L
iswalpha
IsLetter
gc=L
Alphabetic includes more than gc = Letter. Note that marks (Me, Mn, Mc) are required for words of many languages. While they could be applied to non-alphabetics, their principal use is on alphabetics. See DerivedCoreProperties for Alphabetic, also DerivedGeneralCategory.
lower
gc=Ll
u_islower
gc=Ll

u_isULowercase
Lowercase=true

isLowerCase
gc=Ll
iswlower
IsLower

gc=Ll
Lowercase includes more than gc = Lowercase_Letter (Ll). See DerivedCoreProperties.

For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong.

upper
gc=Lu
u_isupper
gc=Lu

u_isUUppercase
Uppercase=true

isUpperCase
gc=Lu
iswupper
IsUpper

gc=Lu
Uppercase includes more than gc = Uppercase_Letter (Lu).

For strict POSIX, intersect recommendation with {alpha}. One may also add Lt, although it logically doesn't belong.

digit
gc=Nd

\d

u_isdigit
gc=Nd
isDigit
gc=Nd
iswdigit
IsDigit

gc=Nd
Non-decimal numbers (like Roman numerals) are normally excluded. In U4.0+, this is the same as gc = Decimal_Number (Nd). See DerivedNumericType

For strict POSIX, intersect recommendation with {ASCII}

xdigit
0..9, A..F, a..f
u_getIntPropertyValue

UCHAR_ASCII_HEX_DIGIT
0-9 A-F a-f

UCHAR_HEX_DIGIT
adds fullwidth

digit != -1
gc=Nd
a-f, A-F
The A-F are upper & lower, narrow and fullwidth. The POSIX spec requires that xdigit contains digit.

This also matches Java.

For strict POSIX, intersect recommendation with {ASCII}

alnum
gc=L or M or N
u_isalnum
gc=L or Nd
isLetterOrDigit
gc=L or Nd
iswalnum
IsLetterOrDigit
gc=L or Nd
Simple combination of other properties
Perl ICU Java Windows Recommended Comments
cntrl
gc=C
u_isISOControl
gc=Cc

u_iscntrl
gc=Cc or Cf or Zl or Zp

isISOControl
gc=Cc
iswcntrl
IsControl

gc=Cc
 
graph
gc=L or M or N or P or S or Co
iswgraph
not in .NET
??
Perl is the same as excluding: Z, Cc, Cf, Cs, Cn.

POSIX: includes alpha, digit, punct, excludes cntrl

print
gc=graph + Zs
u_isprint
All but gc=C
iswprint
not in .NET
??
POSIX: includes graph, <space>
Perl ICU Java Windows Recommended Comments
space
Z or HT..CR

\s

u_isWhitespace
=
Java_isWhitespace + NEL

u_isJavaSpaceChar
=
Java_isSpaceChar

u_isspace
= Z +  HT..CR, IS4..IS1

u_isUWhiteSpace
Unicode White_Space

isWhitespace
Z +  HT..CR, IS4..IS1
- no-break-whitespace

isSpaceChar
gc=Z

isSpace
HT..CR
(deprecated)

iswspace
IsWhiteSpace

gc=Z or HT..CR, NEL
See Whitespace_Comparison for a comparison of WhiteSpace to Z, and for no-break-whitespace.

See PropList for the definition of Whitespace (also in U3.1, U3.2)

Note: ZWSP, while a Z character, is for line break control and should not be included.

blank
gc=Zl or Zp or HT, SP
"horizontal" whitespace.

POSIX: Space, Tab,...

Perl ICU Java Windows Recommended Comments
word
L or M or N or "_"

\w

see below see below see below This is only an approximation to Word Boundaries (see below). The gc=Pc is added in for programming language identifiers, thus adding "_".
\X BreakIterator (ICU4C)
ubrk.h
BreakIterator (ICU4J)
BreakIterator BreakIterator See UAX #29: Text Boundaries, also GraphemeClusterBreakTest.html

Other functions are used for programming language identifier boundaries.

\b BreakIterator (ICU4C)
ubrk.h
BreakIterator (ICU4J)
BreakIterator BreakIterator If there is a requirement that \b align with \w, then it would use the approximation above instead. See UAX #29: Text Boundaries, also WordBreakTest.html.

Other functions are used for programming language identifier boundaries.
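
As a rough illustration, the gc-based entries from the table can be sketched in Python with the standard unicodedata module. This assumes the values read from the Recommended column are gc=P for punct, gc=Nd for digit, gc=Cc for cntrl, and gc=Z or HT..CR, NEL for space; the function names and the posix_compat flag are hypothetical, not part of Perl, ICU, Java, or Windows. Python ships a much newer Unicode database than the 3.x/4.0 data discussed here, so results for individual characters may differ.

```python
import unicodedata


def gc(ch):
    """Two-letter General_Category abbreviation of a single character."""
    return unicodedata.category(ch)


def is_punct(ch, posix_compat=False):
    """gc=P; with posix_compat=True (hypothetical flag), gc=S is added too."""
    cat = gc(ch)
    return cat.startswith("P") or (posix_compat and cat.startswith("S"))


def is_digit(ch):
    """gc=Nd only, so Roman numerals and other non-decimal numbers are out."""
    return gc(ch) == "Nd"


def is_cntrl(ch):
    """gc=Cc."""
    return gc(ch) == "Cc"


def is_space(ch):
    """gc=Z or HT..CR, NEL, with ZWSP explicitly excluded as the note asks."""
    if ch == "\u200b":  # ZERO WIDTH SPACE: line break control, not whitespace
        return False
    return gc(ch).startswith("Z") or "\t" <= ch <= "\r" or ch == "\u0085"
```

For example, is_punct("+") is false because "+" is a symbol (gc=Sm), while is_punct("+", posix_compat=True) is true, which mirrors the trade-off described in the punct comments.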

References:

ICU Background

Whitespace Comparison:

In White_Space, but not in Z:
        U+0009  # <CHARACTER TABULATION (HT)>
        U+000A  # <LINE FEED (LF)>
        U+000B  # <LINE TABULATION (VT)>
        U+000C  # <FORM FEED (FF)>
        U+000D  # <CARRIAGE RETURN (CR)>
        U+0085  # <NEXT LINE (NEL)>
Not in White_Space, but in Z:
        U+200B  # ZERO WIDTH SPACE
In both White_Space and Z:
        U+0020  # SPACE
        U+00A0  # NO-BREAK SPACE
        U+1680  # OGHAM SPACE MARK
        U+180E  # MONGOLIAN VOWEL SEPARATOR
        U+2000  # EN QUAD
        U+2001  # EM QUAD
        U+2002  # EN SPACE
        U+2003  # EM SPACE
        U+2004  # THREE-PER-EM SPACE
        U+2005  # FOUR-PER-EM SPACE
        U+2006  # SIX-PER-EM SPACE
        U+2007  # FIGURE SPACE
        U+2008  # PUNCTUATION SPACE
        U+2009  # THIN SPACE
        U+200A  # HAIR SPACE
        U+2028  # LINE SEPARATOR
        U+2029  # PARAGRAPH SEPARATOR
        U+202F  # NARROW NO-BREAK SPACE
        U+205F  # MEDIUM MATHEMATICAL SPACE
        U+3000  # IDEOGRAPHIC SPACE

No-break-whitespace is defined to be \p{dt=nb}&\p{Whitespace}. In Unicode 4.0, it consists of the following 3 characters:

NBSP U+00A0 NO-BREAK SPACE
NNBSP U+202F NARROW NO-BREAK SPACE
FSP U+2007 FIGURE SPACE

 See DerivedDecompositionType for nb (nobreak) values for dt (decomposition type).
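
The definition \p{dt=nb}&\p{Whitespace} can be checked with a short Python sketch. The White_Space set below is hard-coded from the comparison lists above (Python's standard library does not expose the White_Space property directly), and the names WHITE_SPACE, is_no_break, and NO_BREAK_WHITESPACE are chosen here for illustration only.

```python
import unicodedata

# White_Space code points, copied from the comparison lists above.
WHITE_SPACE = (
    list(range(0x0009, 0x000E))        # HT..CR
    + [0x0020, 0x0085, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))      # EN QUAD..HAIR SPACE
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_no_break(cp):
    """True if the code point has decomposition type <noBreak> (dt=nb)."""
    return unicodedata.decomposition(chr(cp)).startswith("<noBreak>")


# Intersection of dt=nb with the White_Space list.
NO_BREAK_WHITESPACE = [cp for cp in WHITE_SPACE if is_no_break(cp)]

for cp in NO_BREAK_WHITESPACE:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

The intersection matters because dt=nb alone also covers non-whitespace characters such as U+2011 NON-BREAKING HYPHEN.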

General Categories

Abbr.

Description

Lu Letter, Uppercase
Ll Letter, Lowercase
Lt Letter, Titlecase
Lm Letter, Modifier
Lo Letter, Other
Mn Mark, Non-Spacing
Mc Mark, Spacing Combining
Me Mark, Enclosing
Nd Number, Decimal
Nl Number, Letter
No Number, Other
Pc Punctuation, Connector
Pd Punctuation, Dash
Ps Punctuation, Open
Pe Punctuation, Close
Pi Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
Pf Punctuation, Final quote (may behave like Ps or Pe depending on usage)
Po Punctuation, Other
Sm Symbol, Math
Sc Symbol, Currency
Sk Symbol, Modifier
So Symbol, Other
Zs Separator, Space
Zl Separator, Line
Zp Separator, Paragraph
Cc Other, Control
Cf Other, Format
Cs Other, Surrogate
Co Other, Private Use
Cn Other, Not Assigned (no characters in the file have this property)
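
For readers who want to look these values up, here is a minimal Python sketch using the standard unicodedata module. The MAJOR_CLASS mapping and the describe helper are illustrative names, not part of any of the libraries compared above.

```python
import unicodedata

# Major class for the first letter of a General_Category abbreviation.
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def describe(ch):
    """Return (abbreviation, major class) for one character."""
    abbr = unicodedata.category(ch)
    return abbr, MAJOR_CLASS[abbr[0]]
```

An expression such as gc=P in the comparison table then corresponds to describe(ch)[1] == "Punctuation", and gc=Nd to describe(ch)[0] == "Nd".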

POSIX Requirements

Below is a table from the POSIX requirements. The effect of this is:

Table: Valid Character Class Combinations

 

Rows: In Class. Columns: Can Also Belong To.

In Class  upper  lower  alpha  digit  space  cntrl  punct  graph  print  xdigit blank
upper            -      A      x      x      x      x      A      A      -      x
lower     -             A      x      x      x      x      A      A      -      x
alpha     -      -             x      x      x      x      A      A      -      x
digit     x      x      x             x      x      x      A      A      A      x
space     x      x      x      x             -      *      *      *      x      -
cntrl     x      x      x      x      -             x      x      x      x      -
punct     x      x      x      x      -      x             A      A      x      -
graph     -      -      -      -      -      x      -             A      -      -
print     -      -      -      -      -      x      -      -             -      -
xdigit    -      -      -      -      x      x      x      A      A             x
blank     x      x      x      x      A      -      *      *      *      x

  1. Explanation of codes:

    A  Automatically included; see text.
    -  Permitted.
    x  Mutually-exclusive.
    *  See note 2.
  2. The <space>, which is part of the space and blank classes, cannot belong to punct or graph, but shall automatically belong to the print class. Other space or blank characters can be classified as any of punct, graph, or print.
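
One way to use the table above is as a consistency check on a proposed set of class definitions. The following Python sketch encodes the table rows and reports any character whose memberships break an "A" (automatically included) or "x" (mutually-exclusive) entry. The names TABLE, violations, and c_locale_classes are hypothetical; the "*" entries are treated as permitted, and note 2 is not checked. The ASCII definitions used as sample input are the usual C-locale ones and are an assumption of this sketch, not something the table itself specifies.

```python
import string

CLASSES = ["upper", "lower", "alpha", "digit", "space", "cntrl",
           "punct", "graph", "print", "xdigit", "blank"]

# One row per class, one code per column in CLASSES order ("." = diagonal).
TABLE = {
    "upper":  ". - A x x x x A A - x",
    "lower":  "- . A x x x x A A - x",
    "alpha":  "- - . x x x x A A - x",
    "digit":  "x x x . x x x A A A x",
    "space":  "x x x x . - * * * x -",
    "cntrl":  "x x x x - . x x x x -",
    "punct":  "x x x x - x . A A x -",
    "graph":  "- - - - - x - . A - -",
    "print":  "- - - - - x - - . - -",
    "xdigit": "- - - - x x x A A . x",
    "blank":  "x x x x A - * * * x .",
}


def violations(membership):
    """membership maps a character to the set of class names it belongs to."""
    problems = []
    for ch, classes in membership.items():
        for cls in classes:
            for other, code in zip(CLASSES, TABLE[cls].split()):
                if code == "A" and other not in classes:
                    problems.append((ch, cls, other, "missing"))
                elif code == "x" and other in classes:
                    problems.append((ch, cls, other, "exclusive"))
    return problems


def c_locale_classes(ch):
    """Assumed C-locale classes for one ASCII character."""
    cp, result = ord(ch), set()
    if ch in string.ascii_uppercase:
        result |= {"upper", "alpha"}
    if ch in string.ascii_lowercase:
        result |= {"lower", "alpha"}
    if ch in string.digits:
        result.add("digit")
    if ch in string.hexdigits:
        result.add("xdigit")
    if ch in string.whitespace:
        result.add("space")
    if ch in " \t":
        result.add("blank")
    if ch in string.punctuation:
        result.add("punct")
    if cp < 0x20 or cp == 0x7F:
        result.add("cntrl")
    if 0x21 <= cp <= 0x7E:
        result.add("graph")
    if 0x20 <= cp <= 0x7E:
        result.add("print")
    return result
```

Running violations over {chr(cp): c_locale_classes(chr(cp)) for cp in range(128)} returns an empty list, while a definition that puts a digit into alpha is reported as an "exclusive" violation.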

TR 14652

TR 14652 only provides the contents of each value with a long list of characters, not based upon a boolean combination of Unicode properties. And it is way out of date. So it is a bit hard to compare. However, there are two generated comparison files, based on Unicode version.

Some comments based on those comparisons:

[snip 1,2]

  1. lower. Quite a number of items in the TR just look like they must have been typos:
     U+01F1 #  (DZ) LATIN CAPITAL LETTER DZ
     U+03E2 #  (Ϣ) COPTIC CAPITAL LETTER SHEI
     U+03E4 #  (Ϥ) COPTIC CAPITAL LETTER FEI
     U+03E6 #  (Ϧ) COPTIC CAPITAL LETTER KHEI
     U+03E8 #  (Ϩ) COPTIC CAPITAL LETTER HORI
     U+03EA #  (Ϫ) COPTIC CAPITAL LETTER GANGIA
     U+03EC #  (Ϭ) COPTIC CAPITAL LETTER SHIMA
     U+03EE #  (Ϯ) COPTIC CAPITAL LETTER DEI
    ...
  2. alpha. There is one place where the proposal misses a couple of characters in the ISO 14652 alpha:

    In ISO_14652_alpha, but not in Alphabetic + gc=M:
     U+309B..U+309C #  (゛..゜) KATAKANA-HIRAGANA VOICED SOUND MARK
     U+30FB #  (・) KATAKANA MIDDLE DOT

    It strongly looks like these 3 ought to be in Alphabetic, if that is the case. Also, the text says:

    "alpha - Define characters to be classified as used to spell out the words for natural languages; such as letters, syllabic or ideographic characters."

    Unclear whether this should include characters like the Hebrew Punctuation Gerish, which are parts of words.
  3. space. For the ASCII range (00..7F), the POSIX standard only has:

    space    <tab>;<newline>;<vertical-tab>;<form-feed>;\
             <carriage-return>;<space>

    It seems very surprising for the TR to introduce
     
    U+0008 # <BACKSPACE>

    The other differences in the TR (other than being out of date) appear to be that it excludes the non-breaking spaces:

     U+00A0 #  ( ) NO-BREAK SPACE
     U+2007 #  ( ) FIGURE SPACE
     U+202F #  ( ) NARROW NO-BREAK SPACE

    Very hard to say whether these should be in or out, since the POSIX standard gives little guidance. And if they are out, the question is whether they should be correspondingly in graph.
  4. punct. The TR includes both gc=P and gc=S. A reasonable choice, given the way POSIX deals with them in ASCII. But note that this is counter to how Java, Windows, and Perl deal with them.
  5. cntrl. This matches exactly. However, if one took the same approach with this as with punct, then gc=Cf characters might be included, and perhaps also Zp, Zl.
  6. graph. We have:

    In ISO_14652_graph, but not in All - gc=Cc, Cs, Cn, or Z:
     U+00A0 #  ( ) NO-BREAK SPACE
     U+2000..U+200B #  ( ..​) EN QUAD..ZERO WIDTH SPACE
     U+2028..U+2029 #  LINE SEPARATOR..PARAGRAPH SEPARATOR
     U+3000 #  ( ) IDEOGRAPHIC SPACE
    Total: 16

    which sort of lines up with what is done with space, but not really. Again, unclear whether Cfs ought to be in graph or in cntrl or neither. The TR also excludes private use and some other characters that seem reasonable to include.
  7. xdigit. The narrow interpretation is just [0-9A-Fa-f]. But xdigit is a superset of digit, and so includes the fullwidth digits (U+FF10..U+FF19); for consistency, the fullwidth letters a-f, A-F should be in as well. In many ways, the broader definition is also more useful. If you want to narrow a broader definition, it is easy to, say, mask the broader one with U+0000..U+007F. To broaden a narrow definition, on the other hand, requires a hard-coded list.

    Philosophically, it is odd to have A١٢B as a hex number (with Arabic numerals), but for that matter it is odd to have 2١٢3 as a decimal number. In either case, it doesn't hurt much, and if any individual client wanted to impose masking on top of that it would be easy.
  8. blank. Surprising again that the TR deviates from the POSIX standard over ASCII by excluding <tab>. POSIX has:

    blank    <space>;<tab>

    And for it to introduce the following is very odd.

     U+0008 # <BACKSPACE>
  9. print. This is only defined in a comment:
    % "print" is by default "graph", and the <space> character
    Yet the standard says that print may include other space characters, so it is not required that the only difference be <space>.
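
The broad-versus-narrow argument for xdigit in item 7 can be sketched in Python as follows. The function names are hypothetical, and the broad definition here is one reading of the text: gc=Nd plus a-f, A-F in both narrow and fullwidth forms. The strict variant shows the masking with U+0000..U+007F.

```python
import unicodedata


def is_xdigit_broad(ch):
    """gc=Nd plus A-F and a-f, narrow and fullwidth (assumed broad reading)."""
    cp = ord(ch)
    return (
        unicodedata.category(ch) == "Nd"
        or ch in "abcdefABCDEF"
        or 0xFF21 <= cp <= 0xFF26   # FULLWIDTH LATIN CAPITAL LETTER A..F
        or 0xFF41 <= cp <= 0xFF46   # FULLWIDTH LATIN SMALL LETTER A..F
    )


def is_xdigit_strict(ch):
    """Narrow the broad definition by masking with U+0000..U+007F."""
    return ord(ch) <= 0x7F and is_xdigit_broad(ch)
```

Going the other way, from [0-9A-Fa-f] to the broad set, needs the explicit fullwidth ranges and a gc lookup, which is the hard-coded list the text warns about.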

Note: With both blank and space, this emphasizes to me again that U+200B ZERO WIDTH SPACE should not be in Zs; it should be in Cf. We correct that in the White_Space property, but it will continue to be a source of confusion unless it is removed from Zs.