Re: Character properties

From: Mark Davis (mark@macchiato.com)
Date: Wed Oct 11 2000 - 11:26:43 EDT

Next message: John Jenkins: "Re: Microsoft Office 2001 Mac"
Previous message: Mark Leisher: "Re: .bdf file format"
Maybe in reply to: Marcin 'Qrczak' Kowalczyk: "Character properties"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Character properties"

Here is my take on the way Unicode general categories should be mapped to
POSIX ones.

1. As a reminder, the Unicode General Categories are:

L* (letters): Lu, Ll, Lt, Lm, Lo
M* (marks): Mn, Mc, Me
N* (numbers): Nd, Nl, No
P* (punctuation): Pc, Pd, Ps, Pe, Pi, Pf, Po
S* (symbols): Sm, Sc, Sk, So
Z* (separators): Zs, Zl, Zp
C* (others): Cc, Cf, Cs, Co, Cn

(short descriptions are on
http://www.unicode.org/Public/UNIDATA/UnicodeData.html#General Category;
longer ones in The Unicode Standard, Version 3.0)

2. TAB, CR, LF, FF, NL (0085) are assigned Cc in the Unicode Character
Database. For our purposes, treat them as separate (new) values:

Zt TAB
Zb CR, LF, FF, NL

For a full discussion of newline characters, see
http://www.unicode.org/unicode/reports/tr13/
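
The split in step 2 can be sketched in Haskell as follows. This is only an illustration: the type ExtCategory, its constructor Std, and the function extCategory are names made up here, and the sketch assumes the generalCategory function from Data.Char as the source of the standard values.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- General category extended with the two pseudo-values from step 2.
-- Zt and Zb are not real Unicode categories; they are carved out of Cc.
data ExtCategory
  = Zt                   -- TAB
  | Zb                   -- CR, LF, FF, NL (U+0085)
  | Std GeneralCategory  -- every other character keeps its own category
  deriving (Eq, Show)

-- Hypothetical helper: classify a character, checking the special
-- cases before falling back to the Unicode Character Database value.
extCategory :: Char -> ExtCategory
extCategory c
  | c == '\t'             = Zt
  | c `elem` "\r\n\f\x85" = Zb
  | otherwise             = Std (generalCategory c)
```

Characters that are not listed in step 2, such as vertical tab, stay in Cc under this sketch.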

3. Co is Private Use. Depending on the conventions for private use
characters active in the current system, these would be remapped to other
values appropriately. If they are clearly unassigned, they should be treated
as Cn. If their status is simply unknown, the safest choice is probably to
treat them as Lo.
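
One possible way to express the three cases of step 3, again as a sketch only. PrivateUsePolicy, its constructors, and effectiveCategory are hypothetical names; how a real system learns its private use conventions is outside what this message specifies.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- What the current system knows about its private use characters.
data PrivateUsePolicy
  = KnownAs (Char -> GeneralCategory)  -- conventions are known: remap
  | Unassigned                         -- clearly unassigned: treat as Cn
  | Unknown                            -- status unknown: treat as Lo

-- Category to use for classification; only Co characters are affected.
effectiveCategory :: PrivateUsePolicy -> Char -> GeneralCategory
effectiveCategory policy c =
  case generalCategory c of
    PrivateUse -> case policy of
      KnownAs remap -> remap c
      Unassigned    -> NotAssigned
      Unknown       -> OtherLetter
    other -> other
```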

4. We then get the following POSIX assignments (notes below).

Uppercase: Lu, Lt
Lowercase: Ll
Alpha: L*, M*
Graph: L*, M*, N*, P*, S*
Print: Graph, Zs
Space: Z*
Blank: Zs, Zt
Control: Cc, Cf
Punctuation: P*, S*
Digit: Nd
Xdigit: 0-9, A-F, a-f

Notes:
a. The POSIX categories don’t make fine enough distinctions. The
recommendations here are based on the expected usage patterns for the
functions based on these categories.

b. It is probably better in POSIX to treat Lt as if it were Lu. They are
cased letters, and closer to Lu than Ll.

c. isAlpha is most likely used to determine words. If combining marks are
excluded, then perfectly valid words would look like they were broken in
two. For example, clients wouldn’t want the Arabic word “يونِكود” broken
into two, just because of the KASRA. Generally, combining marks should take
on the characteristics of the preceding base character. Since the majority
of the time they are applied to letters, the best treatment in POSIX would
be as isAlpha.
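
For concreteness, the table in step 4 could be written in Haskell roughly as below. This is a sketch under assumptions, not a reference implementation: all the posix* names and helpers are invented for illustration, Z* is read as including the pseudo-values Zt and Zb from step 2, the Zt/Zb characters are taken out of Cc (so TAB is not Control here), and the private use remapping of step 3 is left out.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Pseudo-categories from step 2 (not real Unicode values).
isZt, isZb :: Char -> Bool
isZt c = c == '\t'                -- TAB
isZb c = c `elem` "\r\n\f\x85"    -- CR, LF, FF, NL

-- True if c is outside Zt/Zb and its general category is in the list.
inCats :: [GeneralCategory] -> Char -> Bool
inCats gcs c = not (isZt c || isZb c) && generalCategory c `elem` gcs

letters, marks, numbers, puncts, symbols, separators :: [GeneralCategory]
letters    = [UppercaseLetter, LowercaseLetter, TitlecaseLetter,
              ModifierLetter, OtherLetter]
marks      = [NonSpacingMark, SpacingCombiningMark, EnclosingMark]
numbers    = [DecimalNumber, LetterNumber, OtherNumber]
puncts     = [ConnectorPunctuation, DashPunctuation, OpenPunctuation,
              ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation]
symbols    = [MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol]
separators = [Space, LineSeparator, ParagraphSeparator]

posixUpper, posixLower, posixAlpha, posixGraph, posixPrint :: Char -> Bool
posixUpper   = inCats [UppercaseLetter, TitlecaseLetter]  -- Lu, Lt
posixLower   = inCats [LowercaseLetter]                   -- Ll
posixAlpha   = inCats (letters ++ marks)                  -- L*, M*
posixGraph   = inCats (letters ++ marks ++ numbers ++ puncts ++ symbols)
posixPrint c = posixGraph c || inCats [Space] c           -- Graph, Zs

posixSpace, posixBlank, posixControl, posixPunct :: Char -> Bool
posixSpace c = inCats separators c || isZt c || isZb c    -- Z* (assumed)
posixBlank c = inCats [Space] c || isZt c                 -- Zs, Zt
posixControl = inCats [Control, Format]                   -- Cc, Cf
posixPunct   = inCats (puncts ++ symbols)                 -- P*, S*

posixDigit, posixXDigit :: Char -> Bool
posixDigit    = inCats [DecimalNumber]                    -- Nd
posixXDigit c = c `elem` (['0' .. '9'] ++ ['A' .. 'F'] ++ ['a' .. 'f'])
```

With these definitions the KASRA (U+0650, category Mn) satisfies posixAlpha, so a word test built on posixAlpha would not split the Arabic example in note c.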

Mark

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: "Unicode List" <unicode@unicode.org>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Wednesday, October 04, 2000 18:33
Subject: Re: Character properties

> Marcin Kowalczyk asked about character properties, in a thread that
> wandered off into a discussion of digits in particular.
>
> > I am trying to improve character properties handling in the language
> > Haskell. What should the following functions return, i.e. what is
> > most standard/natural/preferred mapping between Unicode character
> > categories and predicates like isalpha etc.? What else should be
> > provided?
>
> My suggestion is that you also look at the informative data file,
> PropList.txt, which provides a number of suggestions regarding some
> class definitions for these character property predicates.
>
> It is quite clear that many important character properties cannot
> be deduced from the General Category values in UnicodeData.txt alone.
> And that is why I provided further suggestions, based upon Sybase
> implementations of Unicode character properties, in PropList.txt.
>
> > Here are definitions that I use currently:
> >
> > isControl = c < ' ' || c >= '\x7F' && c <= '\x9F'
>
> This is fine if isControl is aimed at the ISO control codes associated
> with the ISO 2022 framework. However, Unicode introduces a number
> of other control functions encoded with characters, and it depends
> on what you want the property API to be sensitive to. An obvious
> example is the set of bidirectional format control characters.
>
> > isPrint = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]
>
> It probably isn't a good idea to include Co (Other, private use) in
> the exclusion set for isPrint. In most typical usage, if a user-defined
> character is assigned, it will be a printable character.
>
> > isSpace = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
>
> You need to decide whether this is for space per se or for whitespace
> (as you have defined it). Depending on your system, you may have to
> add U+0085 as well.
>
> > isGraph = isPrint c && not (isSpace c)
> > isPunct = isGraph c && not (isAlphaNum c)
>
> This is closer to a definition of something like isSymbol, rather
> than isPunct. It depends on what you want the isPunct function to be
> doing for you.
>
> > isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]
>
> This is definitely wrong. See isAlpha below, which has the same problem.
> The issue is that many scripts have combining characters which are fully
> alphabetic. Their General Category is typically Mc. You cannot omit those
> from an isAlpha or isAlphaNum and get the right results.
>
> > isHexDigit = isDigit c || c >= 'A' && c <= 'F' || c >= 'a' && c <= 'f'
> > isDigit = c >= '0' && c <= '9'
>
> Others pointed out the problem with this: isASCIIDigit <> isDigit.
>
> > isOctDigit = c >= '0' && c <= '7'
> > isAlpha = category is one of [Lu,Ll,Lt,Lm,Lo]
>
> This defines the "letters" of Unicode (actually, letters, syllables,
> and ideographs), but omits all the alphabetic combining marks. See
> PropList.txt for a suggested correct list. isAlpha is not derivable
> from General Category values.
>
> > isUpper = category is one of [Lu,Lt]
> > isLower = category is Ll
> > isLatin1 = c <= '\xFF'
> > isAscii = c < '\x80'
> >
> > isDigit intentionally recognizes ASCII digits only. IMHO it's more
> > often needed and this is what the Haskell 98 Report says. (But I
> > don't follow the report in some other cases.)
> >
> > Titlecase could be handled too. Even then I think that isUpper should
> > be True for titlecase letters (so it's usable for testing if the first
> > letter of a word is uppercase), and there should be a separate function
> > for category Lu only (for testing if all characters are uppercase).
>
> See UTR #21, Case Mappings, for guidelines on the case properties. There
> is a difference between detecting case properties for characters
> (including the compatibility titlecase digraphs) and detecting case
> properties for strings.
>
> --Ken
>
> >
> > --
> > __("< Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
> > \__/
> > ^^ SYGNATURA ZASTĘPCZA
> > QRCZAK

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT