Scanning & FLEX
CPSC 388, Ellen Walker, Hiram College
Scanning (review)
• Input: characters from the source code
• Output: Tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM
• Augment with string or numeric value
Token Class (partial)
class Token {
public:
  TokenType tokenval;
  string tokenchars;
  double numval;
};
GetToken(): A scanning function
• Token *getToken(istream &sin)
– Read characters from sin until a complete token is extracted, return the token
– Usually called by the parser
– Note: version in the book uses global variables and returns only the token type
Using GetToken (Review)
Token *myToken = getToken(cin);
while (myToken != NULL) {
  // process the token
  switch (myToken->tokenval) {
    // cases for each token type
  }
  myToken = getToken(cin);
}
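The loop above can be tried out with a small stand-in scanner. This is only a sketch under assumptions: the three token kinds, the ownership rule (caller deletes the token), and the helper scanAll are hypothetical, and this getToken recognizes only letter-only identifiers, integers, and single-character symbols.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>    // EOF
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical token kinds; the slides name many more (IF, PLUS, ...).
enum TokenType { ID, NUM, SYMBOL };

class Token {
public:
    TokenType tokenval;      // which kind of token this is
    std::string tokenchars;  // the characters that were matched
    double numval;           // numeric value (only meaningful for NUM)
};

// Read characters from sin until one complete token is extracted.
// Returns nullptr at end of input; the caller owns the returned Token.
Token *getToken(std::istream &sin) {
    int c = sin.get();
    while (c != EOF && std::isspace(c)) c = sin.get();  // skip white space
    if (c == EOF) return nullptr;

    // Start as a one-character symbol, then upgrade to ID or NUM if needed.
    Token *t = new Token{SYMBOL, std::string(1, static_cast<char>(c)), 0.0};
    if (std::isalpha(c)) {
        t->tokenval = ID;
        while (std::isalpha(sin.peek())) t->tokenchars += static_cast<char>(sin.get());
    } else if (std::isdigit(c)) {
        t->tokenval = NUM;
        while (std::isdigit(sin.peek())) t->tokenchars += static_cast<char>(sin.get());
        t->numval = std::stod(t->tokenchars);
    }
    assert(!t->tokenchars.empty());  // every token has at least one character
    return t;
}

// Drive the scanner the way the parser loop does and collect the lexemes.
std::vector<std::string> scanAll(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> lexemes;
    Token *myToken = getToken(in);
    while (myToken != nullptr) {
        lexemes.push_back(myToken->tokenchars);
        delete myToken;
        myToken = getToken(in);
    }
    return lexemes;
}
```

Calling scanAll("for (int i = 0") yields the six lexemes for, (, int, i, =, 0. A real scanner would also map "for" to a FOR token rather than a generic ID.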
Result of GetToken (Review)
for (int i = 0 ; i < 100 ; i++){
TokenType: FOR
TokenType: LPAREN
Regular Expressions for Common Tokens
• Special characters: (the characters)
• Identifier: [a-zA-Z][a-zA-Z_]*
• Numbers:
– Int: [1-9][0-9]*
– Float: [1-9][0-9]*(|(.[0-9]*))
– Scientific: [1-9][0-9]*(|(.[0-9]*))(E|e)(+|-| )[1-9][0-9]*
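These patterns can be checked with std::regex. The sketch below assumes a translation into ECMAScript syntax: the slide's empty alternative "(|x)" is written "(x)?", the decimal point is escaped, and the sign is written [+-]?. The constant names and the matches helper are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The slide's patterns rewritten in std::regex (ECMAScript) syntax.
const std::regex kIdentifier("[a-zA-Z][a-zA-Z_]*");
const std::regex kInt("[1-9][0-9]*");
const std::regex kFloat("[1-9][0-9]*(\\.[0-9]*)?");
const std::regex kScientific("[1-9][0-9]*(\\.[0-9]*)?[Ee][+-]?[1-9][0-9]*");

// True when the whole string (not just a prefix) matches the pattern.
bool matches(const std::string &s, const std::regex &re) {
    return std::regex_match(s, re);
}
```

For example, matches("count_x", kIdentifier) and matches("6.02E+23", kScientific) are true. As written, the patterns reject a lone "0" and identifiers that contain digits, such as "x1".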
Reg. Exp. For Comments
• Comment to end of line
– //[^\n]* (last part: (all chars except \n)* )
• /*…*/ comment
– ab (~b|b~a)*b?ba <--- ab … ba
– /\* (~\* | \*~/)*(\*)? \*/ <--- needs escapes!
– Does not require matching of “inner” /**/
Comments in Practice
• Often handled by “ad-hoc” methods
• Scanner simply loops to ignore characters from /* to */
– If character is not ‘*’, ignore it
– Else if next character is not ‘/’, ignore it
– Else ignore “*/” and return to scanning normally
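The ad-hoc loop might look like the sketch below. It assumes the opening "/*" has already been consumed. The names skipBlockComment and afterComment are hypothetical. One detail goes beyond the bullets: when the character after a '*' is not '/', it is re-examined rather than discarded, so that "**/" still closes the comment.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Assumes the opening "/*" has already been consumed.
// Discards characters up to and including the closing "*/".
// Returns false if the input ends before the comment is closed.
bool skipBlockComment(std::istream &in) {
    const int eof = std::char_traits<char>::eof();
    int c = in.get();
    while (c != eof) {
        if (c != '*') {
            c = in.get();               // not '*': ignore it
        } else {
            c = in.get();               // look at the character after '*'
            if (c == '/') return true;  // saw "*/": resume normal scanning
            // otherwise keep c: it may itself be the '*' that starts "*/"
        }
    }
    return false;
}

// Convenience wrapper for trying the function on a string: returns what is
// left after the comment, or "<unterminated>" if no "*/" was found.
std::string afterComment(const std::string &body) {
    std::istringstream in(body);
    if (!skipBlockComment(in)) return "<unterminated>";
    std::string rest;
    std::getline(in, rest, '\0');  // read everything that remains
    return rest;
}
```

With this sketch, afterComment(" a **/ rest") returns " rest", and an input with no closing "*/" is reported as unterminated.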
Delimiters and Ambiguity
• Comments are not totally ignored!
– “fo/**/r” is not the keyword “for” !
• Principle of longest substring (“maximal munch”)
– “fork” is not “for” followed by “k”
• Disallow keywords as identifiers
– Scan identifier, then look it up instead of including keywords explicitly in language
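Maximal munch and keyword lookup can be sketched together as follows. The keyword table here is a hypothetical four-word sample, and scanWord and isKeyword are illustrative names. The scanner first takes the longest run of letters, and only then asks whether that whole lexeme is reserved.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical reserved-word table; a real language would list all keywords.
const std::unordered_set<std::string> kKeywords = {"if", "then", "else", "for"};

// Scan the longest run of letters starting at pos (maximal munch),
// advance pos past it, and return the lexeme.
std::string scanWord(const std::string &src, std::size_t &pos) {
    std::size_t start = pos;
    while (pos < src.size() && std::isalpha(static_cast<unsigned char>(src[pos]))) ++pos;
    return src.substr(start, pos - start);
}

// Classify after scanning: a word is a keyword only if the whole lexeme
// appears in the table.
bool isKeyword(const std::string &lexeme) {
    return kKeywords.count(lexeme) > 0;
}
```

Scanning "fork" returns the single lexeme "fork", which is not in the table and so is an identifier. Scanning "for(" stops at the parenthesis and returns "for", which is a keyword.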
FORTRAN’s mistakes
• Ignored white space (no delimiters)
– DO99I=1.2 (DO99I = 1.2) vs.
– DO99I=1,2 (DO 99 I = 1 , 2)
• No reserved words
– IF(IF.EQ.0)THENTHEN=17
• Result: arbitrary backtracking (or lookahead) needed!
TINY Lexemes
• Reserved words: if, then, else, end, repeat, until, read, write
• Symbols: +, -, *, /, =, <, (, ), ;, :=
• Other: number (integer only), identifier (letters only)
• Comment: {…}
• Principle of longest substring holds
TINY DFA
[Figure: DFA for the TINY scanner. States shown: start, inid, inum, com, done. Edge labels shown: letter, digit, [ !letter ], [ !digit ], space, {, }, ~}, :, =, punctuation.]
Using the TINY DFA
• Implement DFA directly or with a table
• Each call to gettoken() starts at the current point of the string, scans until no transition is possible.
• If final state is reached, return the token determined by the link to the final state. Otherwise, report an error.
• Characters in [ ] are not consumed
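A direct implementation of such a DFA might look like the sketch below. It assumes a simplified token set (identifier, number, :=, any other single character as a symbol). The state names INNUM, INASSIGN and INCOMMENT, the Kind enum, and the Tok struct are assumptions made for illustration. Characters in [ ] are "not consumed" by backing pos up one position.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum State { START, INID, INNUM, INASSIGN, INCOMMENT, DONE };
enum Kind { ID, NUM, ASSIGN, SYMBOL, ERROR, ENDFILE };

struct Tok { Kind kind; std::string text; };

// One call scans one token from src starting at pos and advances pos.
Tok getToken(const std::string &src, std::size_t &pos) {
    State state = START;
    Tok tok{ENDFILE, ""};
    while (state != DONE) {
        if (pos >= src.size()) {  // end of input closes any open token
            if (state == INASSIGN || state == INCOMMENT) tok.kind = ERROR;
            break;
        }
        char c = src[pos++];      // consume by default
        unsigned char u = static_cast<unsigned char>(c);
        bool keep = true;         // does c become part of the lexeme?
        switch (state) {
        case START:
            if (std::isalpha(u)) { state = INID; tok.kind = ID; }
            else if (std::isdigit(u)) { state = INNUM; tok.kind = NUM; }
            else if (c == ':') { state = INASSIGN; }
            else if (c == '{') { state = INCOMMENT; keep = false; }
            else if (std::isspace(u)) { keep = false; }
            else { state = DONE; tok.kind = SYMBOL; }
            break;
        case INID:                // [ !letter ]: end the token, do not consume
            if (!std::isalpha(u)) { --pos; keep = false; state = DONE; }
            break;
        case INNUM:               // [ !digit ]: end the token, do not consume
            if (!std::isdigit(u)) { --pos; keep = false; state = DONE; }
            break;
        case INASSIGN:
            state = DONE;
            if (c == '=') tok.kind = ASSIGN;
            else { --pos; keep = false; tok.kind = ERROR; }  // lone ':'
            break;
        case INCOMMENT:
            keep = false;         // comment text is discarded
            if (c == '}') state = START;
            break;
        case DONE:
            break;
        }
        if (keep) tok.text += c;
    }
    return tok;
}
```

On the input "x:=12{c}+", successive calls return ID "x", ASSIGN ":=", NUM "12", SYMBOL "+", and then ENDFILE.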
DFA pseudocode
state = start_state
while (chars available) {
  last_state = state
  state = next_state(next_char, state)
  if (state == null) return final(last_state)
}
return final(state)
LEX (FLEX)
• FLEX generates a scanner automatically!
– Input: description of regular expression for each token, optional additional code
– Output: lex.yy.c - includes function yylex() for scanning (like gettoken)
DFA Pseudocode
state = initial-state
while (chars in string) {
  c = next char from string
  state = next_state[state][c]
}
if final[state] return ACCEPT
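The table-driven loop can be made concrete with a small example. The DFA below is not from the slides: it is an assumed illustration that accepts unsigned numbers with an optional fraction, such as "12" or "3.14". To keep the table small, it is indexed by character class rather than by raw character, and the array is named final_state.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Character classes (table columns) and states (table rows).
enum CharClass { DIGIT, DOT, OTHER, NUM_CLASSES };
enum { S_START, S_INT, S_DOT, S_FRAC, S_ERROR, NUM_STATES };

// next_state[state][class] gives the state reached on that input class.
const int next_state[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ {S_INT,   S_ERROR, S_ERROR},
    /* S_INT   */ {S_INT,   S_DOT,   S_ERROR},
    /* S_DOT   */ {S_FRAC,  S_ERROR, S_ERROR},
    /* S_FRAC  */ {S_FRAC,  S_ERROR, S_ERROR},
    /* S_ERROR */ {S_ERROR, S_ERROR, S_ERROR},
};

// final_state[state] is true for accepting states.
const bool final_state[NUM_STATES] = {false, true, false, true, false};

CharClass classify(char c) {
    if (std::isdigit(static_cast<unsigned char>(c))) return DIGIT;
    return c == '.' ? DOT : OTHER;
}

// Run the DFA over the whole string and report whether it ends in a final state.
bool accepts(const std::string &input) {
    int state = S_START;
    for (char c : input) {
        state = next_state[state][classify(c)];
        assert(state >= 0 && state < NUM_STATES);  // table entries are valid states
    }
    return final_state[state];
}
```

Here accepts("3.14") is true, while accepts("3.") and accepts("1a") are false. The explicit S_ERROR row plays the role of the missing transitions in the diagram form.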
Parts of a LEX file
• Definitions
– Code for the top of the file, and definitions of expressions such as “digit”
– All code between %{ and %} is copied directly
• Rules
– { expression } {code when recognized}
• Auxiliary Routines
– Define additional functions here (including main)
Predefined items
• yylex() - lex scanning routine (like getToken) - generated by FLEX
• yytext - current string (a character array, not a C++ string class)
• input() - get a char from flex input (named yyinput() when the scanner is compiled as C++)
• ECHO - print yytext to yyout
Example: Definitions
%{
/* add line numbers to text and print */
#include <iostream>
int lineno = 1;
%}
line .*\n
%%
Example: Rules & Aux. Code
{line} { std::cout << lineno++ << " " << yytext; }
%%
int main() { yylex(); return 0; }
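To see what this specification does, here is a plain C++ sketch of roughly the same behavior, written without FLEX. The function numberLines is a hypothetical helper. It assumes that text after the last newline matches no rule and is therefore echoed unchanged, which is what flex's default action does.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Prefix every newline-terminated line with its line number, as the
// {line} rule does; trailing text without a newline is copied as-is.
std::string numberLines(const std::string &input) {
    std::istringstream in(input);
    std::ostringstream out;
    int lineno = 1;
    std::string line;
    while (std::getline(in, line)) {
        if (in.eof()) {   // no '\n' ended this piece: the rule would not match
            out << line;
            break;
        }
        out << lineno++ << " " << line << '\n';
    }
    return out.str();
}
```

For the two-line input "a\nb\n" this produces "1 a\n2 b\n".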
Using the Scanner
• First, create the code
– flex test.lex
• Next, compile the program
– g++ lex.yy.c -o test -lfl
• Finally, scan the input file
– ./test < input_file