From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Jan 25 2005 - 06:51:00 CST
Martin Duerst wrote:
>
> What I would expect such an Unicode-enabled version of flex to do
> is to have something similar to <<EOF>>, let's call it <<NONCHAR>>
> for the moment. <<NONCHAR>> would match shortest non-UTF-8 byte
> sequences. The typical use would be for a grammar to have a single
> rule matching <<NONCHAR>>, e.g. like so:
>
> <<NONCHAR>> fprintf(stderr, "Illegal UTF-8 input.\n"); exit(1);
>
Yes; and to go with this I would expect any regex operators to be
defined in terms of characters, so '.' means 'any (well-formed)
character' and does not match ill-formed byte sequences. So the usual
introductory example for flex would include two catch-all rules, one for
chars '.' and one for non-chars '<<NONCHAR>>'.
For <<NONCHAR>> I nominate ☠ (\u2620, skull and crossbones). So the
last lines of the flex spec read:
. copy to stdout
☠ fprintf(stderr...(as above)...
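
Since neither <<NONCHAR>> nor a character-level '.' exists in flex today, here is a rough sketch in plain C of the distinction the two rules would draw. It is only an illustration under my own assumptions: the function names utf8_char_len and utf8_valid_prefix are made up for this example, and the byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences. It reports where ill-formed input begins; it does not try to delimit the "shortest" ill-formed sequence the way a real <<NONCHAR>> would have to.

```c
#include <stddef.h>

/* Hypothetical helper: returns the byte length (1-4) of the well-formed
 * UTF-8 character at the start of s, or 0 if the bytes there are
 * ill-formed or truncated. This is roughly what a character-level '.'
 * would have to recognise (ignoring flex's special case for newline). */
size_t utf8_char_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;

    unsigned char b0 = s[0];
    if (b0 <= 0x7F)
        return 1;                          /* ASCII */

    size_t len;
    unsigned char lo = 0x80, hi = 0xBF;    /* allowed range for byte 2 */

    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;                           /* C0, C1 would be overlong */
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;         /* reject overlong forms */
        if (b0 == 0xED) hi = 0x9F;         /* reject surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;         /* reject overlong forms */
        if (b0 == 0xF4) hi = 0x8F;         /* reject > U+10FFFF */
    } else {
        return 0;                          /* stray trail byte or invalid lead */
    }

    if (n < len)
        return 0;                          /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return len;
}

/* Hypothetical helper: returns the number of leading bytes of buf that
 * form whole well-formed characters. A result smaller than n means
 * the <<NONCHAR>> rule would fire at that offset. */
size_t utf8_valid_prefix(const unsigned char *buf, size_t n)
{
    size_t pos = 0;
    while (pos < n) {
        size_t len = utf8_char_len(buf + pos, n - pos);
        if (len == 0)
            break;
        pos += len;
    }
    return pos;
}
```

A caller would copy the first utf8_valid_prefix(buf, n) bytes to stdout and, if that is less than n, print the "Illegal UTF-8 input." message and exit, which mirrors the two catch-all rules above.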
-gregg