Re: Character properties

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 04 2000 - 21:48:17 EDT

Next message: Sandeep Krishna: "Re: do all browsers support UTF-8 encoding???"
Previous message: Kenneth Whistler: "Re: U+007E is informatively Sm?"
Maybe in reply to: Marcin 'Qrczak' Kowalczyk: "Character properties"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Character properties"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Character properties"

Marcin Kowalczyk asked about character properties, in a thread that
wandered off into a discussion of digits in particular.

> I am trying to improve character properties handling in the language
> Haskell. What should the following functions return, i.e. what is
> most standard/natural/preferred mapping between Unicode character
> categories and predicates like isalpha etc.? What else should be
> provided?

My suggestion is that you also look at the informative data file,
PropList.txt, which provides a number of suggestions regarding some
class definitions for these character property predicates.

It is quite clear that many important character properties cannot
be deduced from the General Category values in UnicodeData.txt alone.
And that is why I provided further suggestions, based upon Sybase
implementations of Unicode character properties, in PropList.txt.

> Here are definitions that I use currently:
>
> isControl = c < ' ' || c >= '\x7F' && c <= '\x9F'

This is fine if isControl is aimed at the ISO control codes associated
with the ISO 2022 framework. However, Unicode introduces a number
of other control functions encoded with characters, and it depends
on what you want the property API to be sensitive to. An obvious
example is the set of bidirectional format control characters.
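
A minimal sketch of the two readings, assuming GHC's Data.Char and its
generalCategory function. The names isIsoControl and isControlOrFormat
are hypothetical, chosen only for illustration:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- ISO control codes only: C0 (U+0000..U+001F), DEL, and C1 (U+0080..U+009F).
-- This is the quoted definition with explicit parentheses.
isIsoControl :: Char -> Bool
isIsoControl c = c < ' ' || (c >= '\x7F' && c <= '\x9F')

-- Hypothetical wider predicate: also accepts format controls (Cf),
-- which include the bidirectional format characters such as U+202E.
isControlOrFormat :: Char -> Bool
isControlOrFormat c = generalCategory c `elem` [Control, Format]
```

Which of the two an API should expose depends on whether callers care
only about the ISO 2022 control codes or about every character that acts
as a control function.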

> isPrint = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]

It probably isn't a good idea to include Co (Other, private use) in
the exclusion set for isPrint. In most typical usage, if a user-defined
character is assigned, it will be a printable character.
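
A sketch of the adjusted predicate, assuming GHC's Data.Char, where Co is
spelled PrivateUse and Cn is spelled NotAssigned. The name isPrintable is
hypothetical:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The quoted exclusion set [Zl,Zp,Cc,Cf,Cs,Co,Cn] with Co (PrivateUse)
-- taken out, so private use characters count as printable.
isPrintable :: Char -> Bool
isPrintable c =
  generalCategory c `notElem`
    [ LineSeparator
    , ParagraphSeparator
    , Control
    , Format
    , Surrogate
    , NotAssigned
    ]
```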

> isSpace = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]

You need to decide whether this is for space per se or for whitespace
(as you have defined it). Depending on your system, you may have to
add U+0085 (NEL, next line) as well.
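
A sketch of the whitespace reading with U+0085 added, assuming GHC's
Data.Char. The name isWhitespace is hypothetical, and whether U+0085
belongs in the set is a per-system decision:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The quoted isSpace definition plus U+0085.
-- Space, LineSeparator, ParagraphSeparator are Zs, Zl, Zp.
isWhitespace :: Char -> Bool
isWhitespace c =
  c `elem` "\t\n\r\f\v\x85"
    || generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```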

> isGraph = isPrint c && not (isSpace c)
> isPunct = isGraph c && not (isAlphaNum c)

This is closer to a definition of something like isSymbol, rather
than isPunct. It depends on what you want the isPunct function to be
doing for you.
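
One possible way to keep the two notions apart, sketched with the
isPunctuation (P* categories) and isSymbol (S* categories) functions
from GHC's Data.Char. The Mark type and classify are hypothetical names,
and this split is only one option among several:

```haskell
import Data.Char (isPunctuation, isSymbol)

-- Hypothetical three-way classification for illustration.
data Mark = PunctuationMark | SymbolMark | Neither
  deriving (Eq, Show)

-- Punctuation and symbols are separated by General Category
-- instead of being lumped together as "graphic but not alphanumeric".
classify :: Char -> Mark
classify c
  | isPunctuation c = PunctuationMark
  | isSymbol c = SymbolMark
  | otherwise = Neither
```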

> isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]

This is definitely wrong. See isAlpha below, which has the same problem.
The issue is that many scripts have combining characters which are fully
alphabetic. Their General Category is typically Mc. You cannot omit those
from an isAlpha or isAlphaNum and get the right results.

> isHexDigit = isDigit c || c >= 'A' && c <= 'F' || c >= 'a' && c <= 'f'
> isDigit = c >= '0' && c <= '9'

Others pointed out the problem with this: isASCIIDigit is not the same
thing as isDigit.
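
To make the distinction concrete, here is a sketch assuming GHC's
Data.Char, whose isDigit accepts ASCII digits only. The name
isUnicodeDigit is hypothetical:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isDigit)

-- Any character with General Category Nd (decimal digit number),
-- for example U+0663 ARABIC-INDIC DIGIT THREE.
isUnicodeDigit :: Char -> Bool
isUnicodeDigit c = generalCategory c == DecimalNumber

-- The two predicates agree on ASCII and differ elsewhere.
digitPredicatesDiffer :: Char -> Bool
digitPredicatesDiffer c = isDigit c /= isUnicodeDigit c
```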

> isOctDigit = c >= '0' && c <= '7'
> isAlpha = category is one of [Lu,Ll,Lt,Lm,Lo]

This defines the "letters" of Unicode (actually, letters, syllables,
and ideographs), but omits all the alphabetic combining marks. See
PropList.txt for a suggested correct list. isAlpha is not derivable
from General Category values.
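
A sketch of a table-driven isAlpha, assuming GHC's Data.Char. The names
isLetterOnly, Ranges, sampleExtra, and isAlphaWith are hypothetical, and
sampleExtra is a hand-picked illustration (three Devanagari vowel signs
with General Category Mc), not the contents of PropList.txt. A real
implementation would load the ranges from the data file:

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Letters only: the quoted isAlpha definition (Lu, Ll, Lt, Lm, Lo).
isLetterOnly :: Char -> Bool
isLetterOnly c =
  generalCategory c `elem`
    [ UppercaseLetter
    , LowercaseLetter
    , TitlecaseLetter
    , ModifierLetter
    , OtherLetter
    ]

-- Inclusive code point ranges of additional alphabetic characters.
type Ranges = [(Char, Char)]

-- Illustrative sample only: U+093E..U+0940, Devanagari vowel signs.
sampleExtra :: Ranges
sampleExtra = [('\x093E', '\x0940')]

-- Letters plus whatever the supplied table marks as alphabetic.
isAlphaWith :: Ranges -> Char -> Bool
isAlphaWith extra c =
  isLetterOnly c || any (\(lo, hi) -> lo <= c && c <= hi) extra
```

Passing the table as an argument keeps the predicate independent of any
particular version of the data file.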

> isUpper = category is one of [Lu,Lt]
> isLower = category is Ll
> isLatin1 = c <= '\xFF'
> isAscii = c < '\x80'
>
> isDigit intentionally recognizes ASCII digits only. IMHO it's more
> often needed and this is what the Haskell 98 Report says. (But I
> don't follow the report in some other cases.)
>
> Titlecase could be handled too. Even then I think that isUpper should
> be True for titlecase letters (so it's usable for testing if the first
> letter of a word is uppercase), and there should be a separate function
> for category Lu only (for testing if all characters are uppercase).

See UTR #21, Case Mappings, for guidelines on the case properties. There
is a difference between detecting case properties for characters
(including the compatibility titlecase digraphs) and detecting case
properties for strings.
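
A rough sketch of that difference, assuming GHC's Data.Char and its
simple (one-to-one) toUpper mapping. The names isUpperOrTitleChar and
isUpperString are hypothetical, and the string test shown is only one
plausible definition, not the one given in UTR #21:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, toUpper)

-- Character-level test: Lu or Lt, as in the quoted isUpper.
-- True for titlecase digraphs such as U+01C5.
isUpperOrTitleChar :: Char -> Bool
isUpperOrTitleChar c =
  generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

-- String-level test: uppercasing leaves every character unchanged.
-- A titlecase digraph fails this, since its uppercase form differs.
isUpperString :: String -> Bool
isUpperString = all (\c -> toUpper c == c)
```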

--Ken

>
> --
> __("< Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
> \__/
> ^^ SYGNATURA ZASTĘPCZA
> QRCZAK


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT