From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Apr 21 2003 - 21:30:58 EDT
The POSIX/C-style property names (punct, alpha, lower, upper, digit, xdigit,
alnum, cntrl, graph, print, space, blank) are not well specified, and don't
really map well to the broader types of characters available in
Unicode/10646. For example, there is no provision for titlecase, nor for a
distinction between symbols and punctuation. These categories are not set up
to make distinctions among combining marks, nor to reflect many of the other
Unicode properties.
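As a rough illustration of the gap (a minimal sketch, not part of the proposal itself), Python's standard unicodedata module reports the Unicode General_Category of a character. The sample characters and the helper name general_category are chosen here purely for illustration. In the C/POSIX locale, both "+" and "!" fall under punct, whereas Unicode separates them into a symbol and a punctuation category, and the titlecase letter and the combining mark have no POSIX class of their own.

```python
import unicodedata

# Sample characters picked for illustration only; they do not come from
# the original message.
samples = {
    "uppercase letter": "A",
    "titlecase letter": "\u01C5",  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    "math symbol": "+",
    "punctuation": "!",
    "combining mark": "\u0301",    # COMBINING ACUTE ACCENT
}


def general_category(ch):
    """Return the two-letter Unicode General_Category code for one character."""
    return unicodedata.category(ch)


if __name__ == "__main__":
    for label, ch in samples.items():
        # Print the label, the code point, and its General_Category code.
        print(f"{label:18} U+{ord(ch):04X} {general_category(ch)}")
```

The categories Lu, Lt, Sm, Po, and Mn are distinct in Unicode, while the POSIX class list has only upper and punct to cover them.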
However, many programs use the POSIX-style properties, so for compatibility
it is best to come up with a uniform set of recommendations for how they
should be interpreted in a Unicode context. This also relates to Java, since
many of the methods on Character ultimately derive from trying to match some
of the POSIX categories.
The following compares current Perl, ICU, Java, Windows, and the POSIX spec,
and tries to derive a recommendation for the best definition, given the way
people use the properties in practice. Note that these are only current
snapshots, since those environments may change their definitions, especially
as they upgrade beyond Unicode 3.x.
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/posix_classes.html
Feedback is welcome.
Mark