Gianni wrote:
> I'm confused about this discussion.
>
> Regular expressions translate themselves to state machines. State
> machines can be used on unicode strings just like any other encoding. I
> have most of the makings of a lexical scanner generator for Unicode that I
> wrote years ago.
>
> Syntax like "*" and "?" has a very generic meaning and works fine with
> Unicode, and it translates to a set of states and transitions.
>
> What am I missing?
Well, as long as you use only '*' and '?', there is no problem.
The problem lies in the semantics of range specifications such as [A-Z]
or [A-Å] (note the A-ring), which can mean different things depending on
whether you use an English or a Danish locale.
Ranges are handled internally as sets, and a state machine can handle them
perfectly well; the issue is what should go into the set.
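To make the ambiguity concrete, here is a minimal sketch in Python that expands [A-Å] in two ways: by raw code point order, and by position in the Danish uppercase alphabet. The helper names (range_by_code_point, range_by_alphabet) and the hard-coded DANISH_UPPER string are assumptions made for this illustration; they are not part of any regex library, and a real implementation would take its ordering from locale or collation data.

```python
# Illustration only: two possible readings of the range [A-Å].
A_RING = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# Assumed for this sketch: the Danish uppercase alphabet in its
# conventional order, where Æ, Ø and Å follow Z.
DANISH_UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c6\u00d8\u00c5"


def range_by_code_point(lo, hi):
    """Expand lo-hi as every character between the two code points."""
    return {chr(c) for c in range(ord(lo), ord(hi) + 1)}


def range_by_alphabet(lo, hi, alphabet):
    """Expand lo-hi as every letter between lo and hi in the given alphabet."""
    start, end = alphabet.index(lo), alphabet.index(hi)
    return set(alphabet[start:end + 1])


code_point_set = range_by_code_point("A", A_RING)
danish_set = range_by_alphabet("A", A_RING, DANISH_UPPER)

# Characters that only the code point reading admits, and vice versa.
only_code_point = code_point_set - danish_set
only_danish = danish_set - code_point_set
```

Under the code point reading the set picks up lowercase letters and punctuation such as '[' but leaves out Æ and Ø, while the alphabet reading yields exactly the 29 Danish uppercase letters. Either set can be compiled into state machine transitions; the two machines simply accept different strings.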
Jeroen.
+---- Jeroen Hellingman ---------------------------------------------------------+
| work: Ericsson Telecommunicatie B.V., Ericssonstraat 2, Rijen, The Netherlands |
| Department ETM/RPU, Room 17116 |
| Tel: +31 161 242022 (834 2022), E-mail: <etmjehe@etm.ericsson.se> |
| home: Aletta Jacobsstraat 5, 3404 XD IJsselstein, The Netherlands |
| Tel: +31 30 6875444, E-mail: <jehe@kabelfoon.nl> |
| Homepage: <http://members.tripod.com/~jhellingman> |
+--------------------------------------------------------------------------------+