Re: Regular expressions in Unicode (Was: Ethiopic text)

From: Hallvard B Furuseth (h.b.furuseth@usit.uio.no)
Date: Thu Mar 12 1998 - 22:41:47 EST


Kenneth Whistler writes (but not in this order):

> Sometimes *lack* of power is important. (...)

Absolutely. And yes, applications should provide a simpler way than
regexps, such as a variant of wildcard syntax - however they spell it.
But I want regexps too.

> using "L" to
> stand for any letter, "N" for any digit, "?" for any character, etc.

Oops - now you are almost back to the beginning of this thread: Problems
with locale dependencies:-)

> Or the LIKE clause in a typical SQL implementation, which uses (...)
> [a-f] for a range,

and now you *are* back to the beginning: How to express ranges and what
they are supposed to mean, which is what I wondered most about in
Unicode regexps.

> We need to divorce the problem of what it means
> to specify a range from the particular encoding contingencies.
> (...)

Well said. I wish I had started this thread that way -- the answer
could probably be specialized and inserted in regexps afterwards.
Remove "regexp" from my posts and add my concern about greater risk of
user errors, and we seem to be saying much the same thing.

> Actually, what I had in mind was more along the lines of a serious,
> holistic analysis of what string pattern matching means in a universal
> character set context, accompanied by some thinking about how layering
> abstractions for pattern matching could result in different levels
> (implemented differently), depending on application needs.

Ouch. You just pointed out yet another of my implicit assumptions: I've
been thinking partly of already-existing code which must be upgraded to
use Unicode, and I lumped this together with other Unicode applications.
For *new* code, this approach could help a lot.

Anyway, on to regexps and disagreements:

> (Lycos finds "aardvark" just fine, but barfs on "aardv??k" or
> "aardv*k".)

One point needs clarification: "Wildcard syntax" in whatever form, like
        `?' = 1 char,
        `*' = 0 or more chars,
is *not* a subset of any regexp syntax. In regexps, `*' means `0 or
more of the *previous* char or group'. (And `.' is usually `1 char').
So:
        wildcard syntax `?' --> regexp syntax `.'
        wildcard syntax `*' --> regexp syntax `.*'
Thus, read as a regexp, "aardv*k" matches "aardvvvvvvk" but not
"aardvaaaaak".
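
A minimal sketch of that translation, in Python. The helper name
wildcard_to_regexp is hypothetical, and it assumes that only `?' and `*'
are special in the wildcard syntax; everything else is taken literally.

```python
import re


def wildcard_to_regexp(pattern: str) -> str:
    """Translate `?`/`*` wildcard syntax into an equivalent regexp string.

    Assumption: `?` and `*` are the only wildcard metacharacters.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # wildcard `?` -> exactly one char
        elif ch == "*":
            parts.append(".*")           # wildcard `*` -> zero or more chars
        else:
            parts.append(re.escape(ch))  # anything else is a literal
    return "".join(parts)


def matches_as_regexp(pattern: str, text: str) -> bool:
    """Treat `pattern` as a regexp and require it to match all of `text`."""
    return re.fullmatch(pattern, text) is not None


def matches_as_wildcard(pattern: str, text: str) -> bool:
    """Treat `pattern` as a wildcard and require it to match all of `text`."""
    # DOTALL lets `.` match a newline too, so `?` really means "any one char".
    translated = wildcard_to_regexp(pattern)
    return re.fullmatch(translated, text, re.DOTALL) is not None


if __name__ == "__main__":
    for text in ("aardvvvvvvk", "aardvaaaaak"):
        print(text,
              "regexp:", matches_as_regexp("aardv*k", text),
              "wildcard:", matches_as_wildcard("aardv*k", text))
```

The same string "aardv*k" gives different answers for "aardvaaaaak"
depending on which syntax the application assumes, which is exactly the
confusion described here. Python's standard fnmatch.translate does a
similar job for shell-style patterns.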

This is a common misunderstanding, and one reason (other than laziness)
that some UNIX applications choose not to provide wildcard syntax. They
want regexps, and they don't want users to be confused about wildcard
vs. regexp syntax. Whether that's a good choice is left as an exercise
for the reader.

>> I'm a UNIX programmer, but also a UNIX user. (...)

What I meant to say is that a lot of us who do know and use regexps are
not about to give them up - either as programmers or as users - because
they are so useful to us even as *users*.

> Tautologous, I'm afraid. Any UNIX user who uses a regexp, is, as I
> see it, by definition a UNIX programmer.

Only if you also mean that someone who uses MS Word's "^#" and "^$"
(digit and letter) is a Word programmer. Except for that, I agree with
a lot of what you say -- and your definition does give just the right
feel about one kind of user I'm thinking about.

-- 
Hallvard


