Re: Character properties

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Sun Oct 08 2000 - 04:43:03 EDT


Wed, 4 Oct 2000 18:48:17 -0700 (PDT), Kenneth Whistler <kenw@sybase.com> wrote:

> It is quite clear that many important character properties cannot
> be deduced from the General Category values in UnicodeData.txt alone.

What a pity. Especially as it does work for some properties and I
would like to avoid having too many arbitrary data sources.

> > isControl = c < ' ' || c >= '\x7F' && c <= '\x9F'
>
> This is fine if isControl is aimed at the ISO control codes associated
> with the ISO 2022 framework. However, Unicode introduces a number
> of other control functions encoded with characters, and it depends
> on what you want the property API to be sensitive to. An obvious
> example is the set of bidirectional format control characters.

The precise meaning is to be decided too.

I think that isControl should be more or less the complement of
isPrint, modulo unassigned characters and surrogates. Together they
should tell which characters programs like ls may output unescaped
(GNU ls uses isprint), or which are legal in the source of some
languages or text file formats. isPrint would cover characters that
are definitely safe for output; isControl would cover ones that should
not occur in plain text and should always be filtered out in some way
before display (unless handled explicitly, like \n \t \f). For
characters in neither class, it depends on the application which side
it wants to err on.
I'm not sure if this makes sense.

On the linux-utf8 mailing list I got conflicting responses about
    U+2028 LINE SEPARATOR
    U+2029 PARAGRAPH SEPARATOR
Should they be plain control characters, or ones in the "third" class
without clear status?

> > isPrint = category is other than [Zl,Zp,Cc,Cf,Cs,Co,Cn]
>
> It probably isn't a good idea to include Co (Other, private use) in
> the exclusion set for isPrint. In most typical usage, if a user-defined
> character is assigned, it will be a printable character.

I was told the same on linux-utf8, and for Cf as well. Cf surprised
me, and I was told that programs like ls should not avoid outputting
Cf characters. Hmm...
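
A minimal sketch of the three classes discussed above (safe to print,
must be filtered, up to the application), assuming a current GHC whose
Data.Char exports generalCategory. The names CharClass and classify are
hypothetical. Treating Cf and Co as printable follows the advice quoted
here, and putting Zl and Zp in the undecided class is only one of the
two conflicting options, not a settled rule.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Hypothetical three-way split; none of these names come from a library.
data CharClass = SafeToPrint | MustFilter | AppDecides
  deriving (Eq, Show)

classify :: Char -> CharClass
classify c = case generalCategory c of
  Control            -> MustFilter   -- Cc, includes \n \t \f unless handled earlier
  Surrogate          -> AppDecides   -- Cs
  NotAssigned        -> AppDecides   -- Cn
  LineSeparator      -> AppDecides   -- Zl, status disputed (assumption)
  ParagraphSeparator -> AppDecides   -- Zp, status disputed (assumption)
  _                  -> SafeToPrint  -- everything else, including Cf and Co
```

A caller such as an ls-like program would escape MustFilter characters
unconditionally and pick its own policy for AppDecides.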

> > isSpace = one of "\t\n\r\f\v" || category is one of [Zs,Zl,Zp]
>
> You need to decide whether this is for space per se or for whitespace
> (as you have defined it).

I think whitespace: places where it is safe to break a line between
words, or stuff allowed between identifiers in some file formats or
programming languages (those which say "any Unicode whitespace
character", e.g. Haskell source).

I was told that I should exclude
    U+00A0 NO-BREAK SPACE
    U+202F NARROW NO-BREAK SPACE
because the predicate is used for line breaking. They are excluded
from is[w]space in the newest glibc.
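
A minimal sketch of that whitespace predicate: the original definition
with the two no-break spaces removed. It assumes a current GHC whose
Data.Char exports generalCategory; isBreakableSpace and noBreakSpaces
are hypothetical names, and the exclusion list holds only the two
characters named above.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- The two characters named above; any further exclusions are not covered.
noBreakSpaces :: [Char]
noBreakSpaces = ['\x00A0', '\x202F']

-- Hypothetical predicate: ASCII layout characters plus Zs, Zl, Zp,
-- minus the no-break spaces.
isBreakableSpace :: Char -> Bool
isBreakableSpace c
  | c `elem` "\t\n\r\f\v"  = True
  | c `elem` noBreakSpaces = False
  | otherwise =
      generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]
```

U+0085 is in category Cc, so this sketch does not treat it as
whitespace.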

> Depending on your system, you may have to add U+0085 as well.

I have never heard about U+0085 being used anywhere... What is it for?

> > isGraph = isPrint c && not (isSpace c)
> > isPunct = isGraph c && not (isAlphaNum c)
>
> This is closer to a definition of something like isSymbol, rather
> than isPunct.

I was told the same on linux-utf8, and thus now I have separate
isPunct and isSymbol (unlike the standard C library, which puts
both into is[w]punct).
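
One way such a split could look, as a sketch only: it assumes the
dividing line is the P* versus S* General Categories, which is not
stated here. It also assumes a current GHC whose Data.Char exports
generalCategory. The names isPunctCat and isSymbolCat are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Punctuation: categories Pc, Pd, Ps, Pe, Pi, Pf, Po.
isPunctCat :: Char -> Bool
isPunctCat c = generalCategory c `elem`
  [ ConnectorPunctuation, DashPunctuation, OpenPunctuation
  , ClosePunctuation, InitialQuote, FinalQuote, OtherPunctuation ]

-- Symbols: categories Sm, Sc, Sk, So.
isSymbolCat :: Char -> Bool
isSymbolCat c = generalCategory c `elem`
  [ MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol ]
```

Under this split '+' and '$' count as symbols and '!' and '(' as
punctuation, while C's ispunct accepts all four.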

> > isAlphaNum = category is one of [Lu,Ll,Lt,Nd,Nl,No,Lm,Lo]
>
> This is definitely wrong. See isAlpha below, which has the same problem.

This seems to be the biggest problem (and the only real one): the
number of exceptions to any category-based predicate is large.

> The issue is that many scripts have combining characters which are
> fully alphabetic. Their General Category is typically Mc. You cannot
> omit those from an isAlpha or isAlphaNum and get the right results.

IMHO isAlpha[Num] should tell which characters form words to be
used as identifiers in various contexts. This is one of the predicates
that matter for Haskell source itself, not only for its library.

I quickly wrote Perl programs to compare PropList's Alphabetic +
Ideographic with subsets derived from categories. Even when based on
categories L* + Mc + Nl, the exception list is still large: excluded
are twenty Lm characters, two Mc characters, and 229 out of 447 Mn
characters - nearly half! European accents are excluded, but many
marks from scripts that I don't know at all are included. It is not
obvious why characters like
    U+073F SYRIAC RWAHA
    U+0902 DEVANAGARI SIGN ANUSVARA
are included, and
    U+0742 SYRIAC RUKKAKHA
    U+093C DEVANAGARI SIGN NUKTA
are excluded.

I still don't know how to do it in an elegant way.
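
The comparison described above can be sketched in Haskell as well. This
is not the Perl code, only an illustration under assumptions: a current
GHC whose Data.Char exports generalCategory, and a reference predicate
supplied by the caller (for instance one built from PropList, which
this sketch does not parse). The names alphaByCategory and exceptions
are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Category-based approximation of "alphabetic": L* + Mc + Nl.
alphaByCategory :: Char -> Bool
alphaByCategory c = generalCategory c `elem`
  [ UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter
  , OtherLetter, SpacingCombiningMark, LetterNumber ]

-- Characters from the given list on which the approximation disagrees
-- with the caller's reference predicate, in either direction.
exceptions :: (Char -> Bool) -> [Char] -> [Char]
exceptions reference = filter (\c -> alphaByCategory c /= reference c)
```

With a reference built from real property data, length (exceptions ref
['\x0000'..'\xFFFF']) gives the size of the exception table that a
category-based isAlpha would have to carry.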

> Others pointed out the problem with this: isASCIIDigit <> isDigit.

OK, this is fixed.

Perhaps there are important character classes that I omitted altogether.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK


