From: Tex Texin (tex@i18nguy.com)
Date: Wed Apr 20 2005 - 18:33:22 CST
All true, which is why I am looking for an existing implementation...
;-)
Tom Emerson wrote:
>
> Tex Texin writes:
> > Tom, you are right it is the latter, Unicoded identifiers and such. I'll
> > look at the Python docs, thanks for the tip.
>
> Note that Python does not allow Unicode identifiers: just Unicode
> string support. Java is probably your best template for dealing with
> Unicode identifiers.
>
> The big problem with a fully-Unicode enabled lexer (i.e., one that is
> using UTF-16 or UTF-32 internally) is the sheer size of the lookup
> tables: instead of an alphabet of fewer than 100 characters, you end up
> with one with tens of thousands of characters. Ye Olde direct index
> falls apart in the presence of these sparse tables. My two IUC
> presentations (24 and whatever number happened in Dublin) talk about
> some methods for dealing with these issues: unfortunately you end up
> trading off size for a non-trivial speed hit, unless you are very
> careful.
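
[Inline note: a minimal sketch of one common way to shrink such sparse
tables, a two-stage lookup that stores each distinct 256-entry block only
once. It is not taken from the IUC presentations mentioned above. The names
toy_class, build_two_stage and lookup are hypothetical, and the character
classes are a toy approximation, not a real identifier definition. The cost
is the extra indirection on every lookup, which is the size-for-speed trade
described above.]

```python
# Hypothetical sketch: two-stage table for per-code-point lexer classes.
BLOCK_SIZE = 256              # code points per second-stage block
MAX_CODE_POINT = 0x10FFFF


def toy_class(cp):
    """Toy character classes (an assumption, not a real lexer spec)."""
    ch = chr(cp)
    if ch.isalpha() or ch == "_":
        return 1              # may start an identifier
    if ch.isdigit():
        return 2              # digit
    if ch.isspace():
        return 3              # whitespace
    return 0                  # everything else


def build_two_stage(classify):
    """Build (stage1, stage2), storing identical blocks only once."""
    stage1 = []               # block number -> index into stage2
    stage2 = []               # distinct blocks of BLOCK_SIZE class values
    seen = {}                 # block contents -> index in stage2
    for start in range(0, MAX_CODE_POINT + 1, BLOCK_SIZE):
        block = tuple(classify(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, cp):
    """Two indexed loads instead of one direct index."""
    return stage2[stage1[cp >> 8]][cp & 0xFF]


STAGE1, STAGE2 = build_two_stage(toy_class)
```

Large unassigned ranges collapse into a single shared block, so STAGE2
holds far fewer blocks than STAGE1 has entries.
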
>
> UTF-8 is a solution to the problem, though the depth of the automata
> increases and you may end up having to convert your existing UTF-16/32
> buffers to UTF-8 for lexing, then back again, dealing all the while
> with returning correct offsets during error processing. PCRE, for
> example, works in UTF-8, so if you want to use it on a UTF-16 buffer
> you need to convert both ways. A RPITA.
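
[Inline note: a rough sketch of the offset bookkeeping described above,
assuming the lexer ran over valid UTF-8 and the caller wants error positions
in UTF-16 code units. The function name utf8_to_utf16_offset is
hypothetical, and this is not how PCRE or any particular library does it.]

```python
def utf8_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Map a byte offset in valid UTF-8 data to a UTF-16 code-unit offset."""
    units = 0
    for b in data[:byte_offset]:
        if (b & 0xC0) == 0x80:
            continue          # continuation byte: character already counted
        # Lead bytes >= 0xF0 start 4-byte sequences, i.e. code points above
        # U+FFFF, which take a surrogate pair (two code units) in UTF-16.
        units += 2 if b >= 0xF0 else 1
    return units
```

This scans from the start of the buffer on every call, so a lexer that
reports many positions would probably want to cache checkpoints instead.
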
>
> -tree
>
> --
> Tom Emerson Basis Technology Corp.
> Software Architect http://www.basistech.com
> "Beware the lollipop of mediocrity: lick it once and you suck forever"
--
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------