RE: Regular expressions in Unicode (Was: Ethiopic text)

From: Jeroen Hellingman (etmjehe@genesis.etm.ericsson.se)
Date: Mon Mar 16 1998 - 01:10:17 EST


Gianni wrote:

> I'm confused about this discussion.
>
> Regular expressions translate themselves to state machines. State
> machines can be used on unicode strings just like any other encoding. I
> have most of the makings of a lexical scanner generator for Unicode that I
> wrote years ago.
>
> Syntax like "*" and "?" have very generic meaning and work fine with
> Unicode and they translate to a set of states and transitions.
>
> What am I missing ?
        

Well, as long as you use only '*' and '?', there will be no problem.
The problem is with the semantics of range specifications such as
[A-Z] or [A-Å] (note the A-circle), which can have different meanings
depending on whether you use an English or a Danish locale: in Danish,
Å is the last letter of the alphabet, while in English it is usually
treated as an accented A.
Ranges are handled internally as sets, and a state machine can handle
them perfectly well; the issue is what should go in the set.
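
To make this concrete, here is a minimal sketch in Python of how the same
range [A-Å] could expand to three different sets. The two collation
tables are simplified, hand-written assumptions for illustration (they are
not real locale data), and the function names are hypothetical.

```python
# Sketch: expand a range such as [A-Å] into an explicit set of characters.
# Escapes are used so the precomposed letters are unambiguous:
# \u00C6 = Æ, \u00D8 = Ø, \u00C5 = Å.

# Assumed Danish order: Æ, Ø, Å follow Z, with Å last.
DANISH_ORDER = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8\u00C5")

# Assumed English-style order: Å sorts next to A, as an accented A.
ENGLISH_ORDER = list("A\u00C5BCDEFGHIJKLMNOPQRSTUVWXYZ\u00C6\u00D8")


def range_by_collation(lo, hi, order):
    """Return the characters from lo to hi inclusive, by position in order."""
    start, end = order.index(lo), order.index(hi)
    return set(order[start:end + 1])


def range_by_code_point(lo, hi):
    """Return the characters from lo to hi inclusive, by Unicode code point."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}


danish = range_by_collation("A", "\u00C5", DANISH_ORDER)
english = range_by_collation("A", "\u00C5", ENGLISH_ORDER)
raw = range_by_code_point("A", "\u00C5")

print(len(danish))   # 29: the whole assumed Danish uppercase alphabet
print(len(english))  # 2: only A and Å
print(len(raw))      # 133: U+0041..U+00C5, including '[', 'a', digits-free punctuation
```

Once one of these sets has been chosen, the state machine simply gets a
transition on "any member of the set"; which of the three sets is the
right one is the open question.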

Jeroen.

+---- Jeroen Hellingman ---------------------------------------------------------+
| work: Ericsson Telecommunicatie B.V., Ericssonstraat 2, Rijen, The Netherlands |
| Department ETM/RPU, Room 17116 |
| Tel: +31 161 242022 (834 2022), E-mail: <etmjehe@etm.ericsson.se> |
| home: Aletta Jacobsstraat 5, 3404 XD IJsselstein, The Netherlands |
| Tel: +31 30 6875444, E-mail: <jehe@kabelfoon.nl> |
| Homepage: <http://members.tripod.com/~jhellingman> |
+--------------------------------------------------------------------------------+


