Re: Unicode lexer

From: Hans Aberg (haberg@math.su.se)
Date: Wed Apr 20 2005 - 10:37:26 CST

Next message: Mark Davis: "Re: String name and Character Name"

Previous message: Peter Constable: "RE: Unicode Bloopers"
In reply to: Frank Yung-Fong Tang: "Re: Unicode lexer"
Next in thread: Tex Texin: "Re: Unicode lexer"
Reply: Tex Texin: "Re: Unicode lexer"

At 10:23 -0400 2005/04/20, Frank Yung-Fong Tang wrote:
>I think one question we need to first answer is how do you define an
>
>Unicode Enabled Lexer
>
>I don't have a good answer. But I think it should at least include
>the following
>
>1. Have the ability to scan UTF-8 (and/or UTF-16) input files

Any lexer generator that admits full 8-bit bytes in both its source
file and its scanned input has this property. For example, if you feed
Flex a UTF-8 .l file, the Flex language part must of course be 7-bit
ASCII, as that is how the language defines it. But in string rules
"...", if 8-bit bytes are admitted, you can put in a UTF-8 string, and
the generated lexer will match it literally, byte for byte. In a UTF-8
editor, you would just see the Unicode character string.
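
A minimal sketch of that point, in Python rather than Flex and purely
for illustration: once a literal is encoded in UTF-8 it is just a fixed
byte sequence, so a byte-oriented matcher can find it without knowing
anything about Unicode. The function name find_literal is hypothetical.

```python
def find_literal(literal: str, data: bytes) -> int:
    """Return the byte offset of `literal` in UTF-8 encoded `data`, or -1."""
    needle = literal.encode("utf-8")  # "ï" becomes the two bytes C3 AF
    return data.find(needle)          # plain byte comparison, no decoding


source = "x = naïve;".encode("utf-8")
offset = find_literal("naïve", source)  # counted in bytes, not characters
```

Note that offsets and lengths in such a matcher count bytes, so a token
containing "ï" is one byte longer than its character count suggests.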

>2. Have the ability to return token in one or more transformation
>format of Unicode

I am not sure what you have in mind here: the Flex-generated lexer
typically just returns an int, if anything. Other semantic data is
returned imperatively in some state variable. One does that by hand,
by writing explicit rules. The default rule "." matches a single
byte, so under UTF-8 it would not match an arbitrary Unicode
character, and some extension might be needed.
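
To illustrate why a byte-wise "." falls short, here is a sketch in
Python using its re module on bytes (an assumption for illustration,
not Flex itself). The name UTF8_CHAR is hypothetical; the byte ranges
follow the well-formed UTF-8 sequences defined by the Unicode Standard.

```python
import re

# One well-formed UTF-8 encoded character, expressed over bytes.
UTF8_CHAR = re.compile(
    rb"[\x00-\x7F]"                          # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"              # 2 bytes
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"          # 3 bytes, no overlong forms
    rb"|[\xE1-\xEC\xEE-\xEF][\x80-\xBF]{2}"  # 3 bytes
    rb"|\xED[\x80-\x9F][\x80-\xBF]"          # 3 bytes, no surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"       # 4 bytes, no overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"           # 4 bytes
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"       # 4 bytes, up to U+10FFFF
)

data = "aé€😀".encode("utf-8")
lengths = [len(m.group()) for m in UTF8_CHAR.finditer(data)]  # [1, 2, 3, 4]

# A byte-wise "." only consumes the first byte of a multi-byte character.
first = re.match(rb".", "é".encode("utf-8")).group()  # b"\xc3"
```

Unlike the Flex ".", this pattern also matches the newline byte; a real
rule would probably exclude it from the first alternative.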

>3. Have the ability to handle some set of Unicode regular expression features
>4. Have the ability to support programming language specific Unicode
>'escape' sequence. ( \uHHHH, &#ddddd; &#xxxxx; \HHHHH , etc) The
>lexer may not support it directly, but it should be able to let the
>Lexer caller to define a way to deal with it.

These are the extensions I addressed for Flex, i.e., translating
Unicode character classes into byte regular expressions that match
the encoded characters when the lexer input is in UTF-8 or UTF-32.
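
One common way to do such a translation for UTF-8 is to split a code
point range until each piece is a simple product of per-byte ranges.
The Python sketch below shows that idea only; it is not necessarily
how the Flex extension does it, it does not cover UTF-32, and it does
not filter out surrogates. All function names are hypothetical.

```python
import re


def encode_utf8(cp):
    """Encode one code point by hand, so any value up to 0x10FFFF works."""
    if cp < 0x80:
        return [cp]
    if cp < 0x800:
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]


def byte_sequences(lo, hi):
    """Yield lists of (min_byte, max_byte) pairs that together cover lo..hi."""
    # Split where the encoded length changes (1, 2, 3 and 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            yield from byte_sequences(lo, limit)
            yield from byte_sequences(limit + 1, hi)
            return
    if hi < 0x80:  # a pure ASCII range needs no further splitting
        yield [(lo, hi)]
        return
    # Split until the range is a plain product of per-byte ranges.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask:
                yield from byte_sequences(lo, lo | mask)
                yield from byte_sequences((lo | mask) + 1, hi)
                return
            if (hi & mask) != mask:
                yield from byte_sequences(lo, (hi & ~mask) - 1)
                yield from byte_sequences(hi & ~mask, hi)
                return
    yield list(zip(encode_utf8(lo), encode_utf8(hi)))


def utf8_regex(lo, hi):
    """Return a byte-level regular expression for code points lo..hi."""
    def byte_class(a, b):
        return f"\\x{a:02X}" if a == b else f"[\\x{a:02X}-\\x{b:02X}]"
    return "|".join("".join(byte_class(a, b) for a, b in seq)
                    for seq in byte_sequences(lo, hi))


# Greek and Coptic block, U+0370..U+03FF:
greek = utf8_regex(0x370, 0x3FF)  # \xCD[\xB0-\xBF]|[\xCE-\xCF][\x80-\xBF]
```

A named class such as "letters" would then be the alternation of these
expressions over all of its code point ranges.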

>5. Use some Unicode based String data type as primitive datatype to
>return the result in the token.[?]

Again, it is unclear what you mean here, as the lexer just returns
the int token values indicated by hand in the rule actions.

More advanced Unicode support might involve support for recognizing
common Unicode character classes. For example, one might want to
recognize letters, so that one can easily admit identifiers using
letters.
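
As a rough sketch of what recognizing letters could look like, the
Python fragment below uses the general categories from the standard
unicodedata module. The identifier rule chosen here (a letter followed
by letters or decimal digits) and the names is_letter and
scan_identifier are assumptions made for the example.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True for any character in a Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def scan_identifier(text: str, pos: int = 0) -> str:
    """Return the identifier starting at `pos`, or "" if there is none."""
    if pos >= len(text) or not is_letter(text[pos]):
        return ""
    end = pos + 1
    while end < len(text) and (
        is_letter(text[end]) or unicodedata.category(text[end]) == "Nd"
    ):
        end += 1
    return text[pos:end]


name = scan_identifier("größe2 = 1")
```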

-- 
   Hans Aberg
