From: Hans Aberg (haberg@math.su.se)
Date: Wed Apr 20 2005 - 10:37:26 CST
At 10:23 -0400 2005/04/20, Frank Yung-Fong Tang wrote:
>I think one question we need to first answer is how do you define an
>
>Unicode Enabled Lexer
>
>I don't have a good answer. But I think it should at least include
>the following
>
>1. Have the ability to scan UTF-8 (and/or UTF-16) input files
Any lexer generator that admits full 8-bit bytes, both in its source
file and in the input it scans, has this property. For example, if you
feed Flex a UTF-8 .l file, the parts written in the Flex language
itself must of course be 7-bit ASCII, as that is how the language
defines it. But in string rules "...", if 8-bit bytes are admitted,
you could put in a UTF-8 string, and the generated lexer would match
it literally, byte for byte. In a UTF-8 editor, you would just see
the Unicode character string.
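To make the point about literal matching concrete, here is a minimal
sketch in Python rather than Flex. The function name match_literal and
the sample strings are made up for illustration; the only claim is
that a UTF-8 string literal is an ordinary byte sequence, so an
8-bit-clean matcher needs no Unicode knowledge to recognize it.

```python
# The literal as it would sit between the quotes of a string rule:
# "na\u00efve" is n, a, i-with-diaeresis, v, e; the accented letter
# takes two bytes in UTF-8, so the literal is six bytes long.
literal = "na\u00efve".encode("utf-8")


def match_literal(data: bytes, pos: int, lit: bytes) -> int:
    """Return the offset just past lit if it occurs at pos, else -1."""
    if data.startswith(lit, pos):
        return pos + len(lit)
    return -1


# The scanner input, also UTF-8. The comparison is purely byte-wise.
source = "a na\u00efve lexer".encode("utf-8")
end = match_literal(source, 2, literal)
```

Offsets here count bytes, not characters, which is also what a
byte-oriented generated lexer would report as the match length.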
>2. Have the ability to return token in one or more transformation
>format of Unicode
I am not sure what you have in mind here: the Flex-generated lexer
typically just returns an int, if anything. Other semantic data is
returned imperatively in some state variable. One does that by hand,
by writing explicit rules. The default rule "." would not work under
UTF-8 to match any Unicode character, since it matches a single byte,
so some extension might be needed.
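As an illustration of what such an extension would have to expand "."
into, here is a sketch of a byte-level regular expression that matches
exactly one well-formed UTF-8 encoded character other than newline.
It uses Python's re module on bytes, not Flex; the names UTF8_CHAR and
char_length are hypothetical. The byte ranges follow the standard
UTF-8 encoding table, excluding overlong forms and surrogates.

```python
import re

# One UTF-8 encoded character, expressed purely in terms of bytes.
UTF8_CHAR = re.compile(
    rb"[\x00-\x09\x0B-\x7F]"                # 1 byte, newline excluded like "."
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2 bytes: U+0080..U+07FF
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3 bytes, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # 4 bytes, up to U+10FFFF
)


def char_length(data: bytes, pos: int = 0) -> int:
    """Number of bytes in the character starting at pos, or 0 if none."""
    m = UTF8_CHAR.match(data, pos)
    return m.end() - pos if m else 0


# A plain byte-level "." consumes only the first byte of a character.
one_byte_dot = re.compile(rb".")
euro = "\u20ac".encode("utf-8")  # three bytes: E2 82 AC
```

On the euro sign the byte-level dot stops after one byte, while the
expanded pattern consumes all three.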
>3. Have the ability to handle some set of Unicode regular expression features
>4. Have the ability to support programming language specific Unicode
>'escape' sequence. ( \uHHHH, &#ddddd; &#xxxxx; \HHHHH , etc) The
>lexer may not support it directly, but it should be able to let the
>Lexer caller to define a way to deal with it.
These are the extensions I addressed for Flex, i.e., translating
Unicode character classes into byte regular expressions that match
the corresponding encoded characters if the lexer input is in UTF-8
or UTF-32.
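The translation idea can be sketched as follows. This is not the Flex
extension itself, only a naive Python illustration for the UTF-8 case;
the function name class_to_byte_regex is made up. It encodes every
code point of the class, groups the encodings by their leading bytes,
and emits one alternative per group with a byte class for the final
byte. It assumes the caller passes no surrogate code points, which
cannot be encoded in UTF-8.

```python
import re
from collections import defaultdict


def class_to_byte_regex(code_points) -> bytes:
    """Byte-level regex matching the UTF-8 form of any given code point."""
    by_prefix = defaultdict(list)
    for cp in sorted(set(code_points)):
        encoded = chr(cp).encode("utf-8")
        # Key: all bytes but the last; value: the possible last bytes.
        by_prefix[encoded[:-1]].append(encoded[-1])
    alternatives = []
    for prefix, last_bytes in by_prefix.items():
        head = b"".join(b"\\x%02X" % b for b in prefix)
        tail = b"".join(b"\\x%02X" % b for b in last_bytes)
        alternatives.append(head + b"[" + tail + b"]")
    return b"|".join(alternatives)


# Example class: Greek small letters alpha..omega, U+03B1..U+03C9.
greek_lower = re.compile(class_to_byte_regex(range(0x3B1, 0x3CA)))
```

For this class the result has two alternatives, one for lead byte CE
and one for lead byte CF. A real implementation would emit byte ranges
instead of listing every final byte, but the matching behaviour is the
same.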
>5. Use some Unicode based String data type as primitive datatype to
>return the result in the token.[?]
Again, it is unclear what you mean here, as the lexer just returns
the int token values indicated by hand in the rule actions.
More advanced Unicode support might involve recognizing common
Unicode character classes. For example, one might want to recognize
letters, so that identifiers built from Unicode letters are easy to
admit.
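As a small illustration of the letter class, here is a sketch in
Python that works on already decoded text rather than on bytes. It
treats any character whose Unicode general category starts with "L"
as a letter, using the standard unicodedata module; the helper names
is_letter and scan_identifier are hypothetical, and a real identifier
rule would normally also allow digits and underscores after the first
character.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True if ch is in one of the letter categories Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")


def scan_identifier(text: str, pos: int = 0) -> str:
    """Return the longest run of letters starting at pos (may be empty)."""
    end = pos
    while end < len(text) and is_letter(text[end]):
        end += 1
    return text[pos:end]


# "gr\u00f6\u00dfe" is a German word with two non-ASCII letters.
ident = scan_identifier("gr\u00f6\u00dfe = 1")
```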
-- Hans Aberg