From: Tom Emerson (tree@basistech.com)
Date: Wed Apr 20 2005 - 18:27:30 CST
Tex Texin writes:
> Tom, you are right it is the latter, Unicoded identifiers and such. I'll
> look at the Python docs, thanks for the tip.
Note that Python does not allow Unicode identifiers; it only supports
Unicode strings. Java is probably your best template for dealing with
Unicode identifiers.
The big problem with a fully Unicode-enabled lexer (i.e., one that
uses UTF-16 or UTF-32 internally) is the sheer size of the lookup
tables: instead of an alphabet of fewer than 100 characters, you end up
with one of tens of thousands of characters. Ye Olde direct index
falls apart in the presence of these sparse tables. My two IUC
presentations (24 and whatever number happened in Dublin) talk about
some methods for dealing with these issues: unfortunately you end up
paying for the smaller tables with a non-trivial speed hit, unless you
are very careful.
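To make the size/speed trade-off concrete, here is a minimal sketch of
one common way to compress a sparse table, a two-stage lookup. It is
not necessarily the method from the presentations. The names
char_class, build_two_stage, and lookup, the three character classes,
and the 256-entry block size are all assumptions made for illustration.

```python
import unicodedata

BLOCK_SIZE = 256  # assumed block size; a real lexer would tune this


def char_class(cp):
    """Hypothetical classifier: 1 = identifier start, 2 = identifier part, 0 = other."""
    cat = unicodedata.category(chr(cp))
    if cat in ("Lu", "Ll", "Lt", "Lm", "Lo", "Nl"):
        return 1
    if cat in ("Mn", "Mc", "Nd", "Pc"):
        return 2
    return 0


def build_two_stage(limit=0x110000):
    """Build (index, blocks) covering code points 0 .. limit-1.

    index[cp // BLOCK_SIZE] names a block; identical blocks are stored once.
    limit is assumed to be a multiple of BLOCK_SIZE.
    """
    index = []   # stage 1: one entry per run of BLOCK_SIZE code points
    blocks = []  # stage 2: the distinct blocks only
    seen = {}    # block contents -> position in blocks
    for start in range(0, limit, BLOCK_SIZE):
        block = bytes(char_class(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def lookup(table, cp):
    """Two dependent memory reads instead of the single read of a direct index."""
    index, blocks = table
    return blocks[index[cp // BLOCK_SIZE]][cp % BLOCK_SIZE]
```

The extra indirection in lookup is where the speed hit comes from: the
table shrinks because large runs of code points share one block, but
every character now costs two dependent reads rather than one.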
UTF-8 is one solution to the problem, since the lexer's alphabet
shrinks back to 256 byte values, though the depth of the automaton
increases (a single character can take up to four bytes) and you may
end up having to convert your existing UTF-16/32 buffers to UTF-8 for
lexing, then back again, dealing all the while with returning correct
offsets during error processing. PCRE, for example, works in UTF-8, so
if you want to use it on a UTF-16 buffer you need to convert both
ways. A RPITA.
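The offset bookkeeping can be sketched as follows. This is only an
illustration under stated assumptions: the input is well-formed text
(no lone surrogates), and the function names utf8_with_offset_map and
utf8_to_utf16_offset are hypothetical, not part of PCRE or any other
library.

```python
import bisect


def utf8_with_offset_map(text):
    """Encode text as UTF-8 and record where each character starts.

    Returns (data, byte_starts, unit_starts). For character i,
    byte_starts[i] is its UTF-8 byte offset and unit_starts[i] is its
    UTF-16 code-unit offset. A final sentinel entry marks end of input.
    """
    byte_starts = []
    unit_starts = []
    byte_pos = 0
    unit_pos = 0
    for ch in text:
        byte_starts.append(byte_pos)
        unit_starts.append(unit_pos)
        byte_pos += len(ch.encode("utf-8"))
        # Characters outside the BMP need a surrogate pair in UTF-16.
        unit_pos += 2 if ord(ch) > 0xFFFF else 1
    byte_starts.append(byte_pos)
    unit_starts.append(unit_pos)
    return text.encode("utf-8"), byte_starts, unit_starts


def utf8_to_utf16_offset(byte_offset, byte_starts, unit_starts):
    """Map a UTF-8 byte offset (e.g. from a lexer error) to a UTF-16 offset.

    An offset in the middle of a multi-byte sequence is mapped to the
    start of the character that contains it.
    """
    i = bisect.bisect_right(byte_starts, byte_offset) - 1
    return unit_starts[i]
```

Keeping both offset lists costs memory proportional to the input, which
is part of the overhead of converting just for lexing. A lexer that
only reports an occasional error could instead recompute the offset on
demand by rescanning the buffer from the start.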
-tree
--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"