Scanning & FLEX, CPSC 388, Ellen Walker, Hiram College

Upload: eustacia-kennedy

Post on 03-Jan-2016

Page 1: Scanning & FLEX CPSC 388 Ellen Walker Hiram College

Scanning & FLEX

CPSC 388
Ellen Walker
Hiram College

Page 2:

Scanning (review)

• Input: characters from the source code

• Output: Tokens
– Keywords: IF, THEN, ELSE, FOR …
– Symbols: PLUS, LBRACE, SEMI …
– Variable tokens: ID, NUM (augmented with their string or numeric value)

Page 3:

Token Class (partial)

class Token {
public:
    TokenType tokenval;   // kind of token (IF, PLUS, ID, NUM, ...)
    string tokenchars;    // characters that make up the token
    double numval;        // numeric value, used for NUM tokens
};

Page 4:

GetToken(): A scanning function

• Token *getToken(istream &sin)
– Read characters from sin until a complete token is extracted, then return the token
– Usually called by the parser
– Note: the version in the book uses global variables and returns only the token type
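A minimal sketch of such a scanning function, under stated assumptions: the token kinds (ID, NUM, PLUS, SEMI, ERROR) are a hypothetical subset, and the function returns a std::unique_ptr rather than the raw Token* shown on the slide so that the caller does not have to delete anything. The helper demoGetToken is illustrative only.

```cpp
#include <cassert>
#include <cctype>
#include <istream>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical token kinds; the slides list more (IF, LBRACE, ...).
enum class TokenType { ID, NUM, PLUS, SEMI, ERROR };

struct Token {
    TokenType tokenval = TokenType::ERROR;  // kind of token
    std::string tokenchars;                 // characters of the lexeme
    double numval = 0.0;                    // value, meaningful only for NUM
};

// Reads characters from sin until one complete token is extracted.
// Returns nullptr at end of input. std::unique_ptr replaces the raw Token*
// of the slides (an assumption of this sketch, to avoid manual delete).
std::unique_ptr<Token> getToken(std::istream &sin) {
    while (std::isspace(sin.peek())) sin.get();  // skip white space
    if (sin.peek() == std::istream::traits_type::eof()) return nullptr;

    auto tok = std::make_unique<Token>();
    const char first = static_cast<char>(sin.get());
    tok->tokenchars = first;
    const unsigned char uc = static_cast<unsigned char>(first);

    if (std::isalpha(uc)) {
        // peek() inspects the next character without consuming it, so the
        // character that ends the identifier stays in the stream.
        while (std::isalpha(sin.peek()))
            tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = TokenType::ID;
    } else if (std::isdigit(uc)) {
        while (std::isdigit(sin.peek()))
            tok->tokenchars += static_cast<char>(sin.get());
        tok->tokenval = TokenType::NUM;
        tok->numval = std::stod(tok->tokenchars);
    } else if (first == '+') {
        tok->tokenval = TokenType::PLUS;
    } else if (first == ';') {
        tok->tokenval = TokenType::SEMI;
    }  // anything else keeps the default ERROR
    return tok;
}

// Usage sketch: scan a short string held in a string stream.
void demoGetToken() {
    std::istringstream in("abc + 42;");
    assert(getToken(in)->tokenchars == "abc");
    assert(getToken(in)->tokenval == TokenType::PLUS);
    assert(getToken(in)->numval == 42.0);
    assert(getToken(in)->tokenval == TokenType::SEMI);
    assert(getToken(in) == nullptr);
}
```

A parser would call getToken repeatedly on the same stream until it returns a null pointer.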

Page 5:

Using GetToken (Review)

Token *myToken = getToken(cin);
while (myToken != NULL) {
    // process the token
    switch (myToken->tokenval) {
        // cases for each token type
    }
    myToken = getToken(cin);
}

Page 6:

Result of GetToken (Review)

for (int i = 0 ; i < 100 ; i++){

TokenType: FOR

TokenType: LPAREN

Page 7:

Regular Expressions for Common Tokens

• Special characters: (the characters)

• Identifier: [a-zA-Z][a-zA-Z_]*

• Numbers:
– Int: [1-9][0-9]*
– Float: [1-9][0-9]*(|(\.[0-9]*))
– Scientific: [1-9][0-9]*(|(\.[0-9]*))(E|e)(\+|-|)[1-9][0-9]*
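The patterns above can be tried out with the standard <regex> library. This is a sketch, not part of the course code: the patterns are transcribed into ECMAScript syntax (the std::regex default), the literal '.' and '+' are escaped, and the empty alternative "(|x)" is written as "(x)?". The names kIdentifier, kInt, kFloat, kScientific, matchesWhole, and demoPatterns are assumptions made for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Slide patterns in ECMAScript syntax; names are hypothetical.
const std::regex kIdentifier(R"([a-zA-Z][a-zA-Z_]*)");
const std::regex kInt(R"([1-9][0-9]*)");
const std::regex kFloat(R"([1-9][0-9]*(\.[0-9]*)?)");
const std::regex kScientific(R"([1-9][0-9]*(\.[0-9]*)?[Ee][+-]?[1-9][0-9]*)");

// True when the whole string (not just a prefix) matches the pattern.
bool matchesWhole(const std::string &text, const std::regex &pattern) {
    return std::regex_match(text, pattern);
}

// Usage sketch: a few sample lexemes checked against each pattern.
void demoPatterns() {
    assert(matchesWhole("count_x", kIdentifier));
    assert(!matchesWhole("x1", kIdentifier));   // digits are not in the class
    assert(matchesWhole("42", kInt));
    assert(!matchesWhole("042", kInt));         // leading zero is rejected
    assert(matchesWhole("3.14", kFloat));
    assert(matchesWhole("6.02E+23", kScientific));
    assert(!matchesWhole("6.02E", kScientific));  // exponent digits required
}
```

Note that, as written on the slide, the Int pattern does not match a lone "0" and the Identifier pattern does not allow digits.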

Page 8:

Reg. Exp. For Comments

• Comment to end of line
– //[^\n]* (last part: (all chars except \n)* )

• /*…*/ comment
– ab (~b|b~a)*b?ba <--- ab … ba
– /\* (~\* | \*~/)*(\*)? \*/ <--- needs escapes!
– Does not require matching of “inner” /**/

Page 9:

Comments in Practice

• Often handled by “ad-hoc” methods

• Scanner simply loops to ignore characters from /* to */
– If character is not ‘*’, ignore it
– Else if next character is not ‘/’, ignore it
– Else ignore “*/” and return to scanning normally
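The ad-hoc loop can be sketched as follows. The function name skipBlockComment and its contract (the opening "/*" has already been consumed, and the return value reports whether the comment was closed) are assumptions for illustration. One detail the bullets gloss over is handled explicitly: after a '*', a following '*' must be examined again rather than ignored, so that "**/" still ends the comment.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Skips the body of a /* ... */ comment. Assumes the opening "/*" has
// already been consumed. Returns true if the closing "*/" was found, false
// if the input ended first (an unterminated comment).
bool skipBlockComment(std::istream &in) {
    char c;
    while (in.get(c)) {
        if (c != '*') continue;             // not '*': ignore it
        while (in.peek() == '*') in.get();  // a run of '*' may precede '/'
        if (in.peek() == '/') {             // "*/": consume it and stop
            in.get();
            return true;
        }
        // '*' followed by anything else: the loop goes on and ignores it
    }
    return false;
}

// Usage sketch: the stream is positioned just after an opening "/*".
void demoSkipComment() {
    std::istringstream in(" a ** b **/rest");
    assert(skipBlockComment(in));
    std::string rest;
    in >> rest;
    assert(rest == "rest");  // scanning resumes right after the comment

    std::istringstream open(" never closed *");
    assert(!skipBlockComment(open));
}
```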

Page 10:

Delimiters and Ambiguity

• Comments are not totally ignored!
– “fo/**/r” is not the keyword “for”!

• Principle of longest substring (“maximal munch”)
– “fork” is not “for” followed by “k”

• Disallow keywords as identifiers
– Scan an identifier, then look it up in a table of keywords, instead of giving each keyword its own pattern in the scanner
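Both ideas, longest substring and keyword lookup, fit in a few lines. The sketch below assumes a hypothetical scanWord function and a four-entry reserved-word table; neither comes from the slides.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>

enum class TokenType { IF, THEN, ELSE, FOR, ID };

// Reads the longest run of letters starting at pos (maximal munch), then
// looks the lexeme up in the reserved-word table. pos is advanced past it.
TokenType scanWord(const std::string &src, std::size_t &pos,
                   std::string &lexeme) {
    static const std::unordered_map<std::string, TokenType> kReserved = {
        {"if", TokenType::IF},     {"then", TokenType::THEN},
        {"else", TokenType::ELSE}, {"for", TokenType::FOR}};

    const std::size_t start = pos;
    while (pos < src.size() &&
           std::isalpha(static_cast<unsigned char>(src[pos]))) {
        ++pos;
    }
    lexeme = src.substr(start, pos - start);

    const auto it = kReserved.find(lexeme);
    return it == kReserved.end() ? TokenType::ID : it->second;
}

// Usage sketch covering the examples from the slide.
void demoScanWord() {
    std::string lexeme;
    std::size_t pos = 0;
    assert(scanWord("fork", pos, lexeme) == TokenType::ID && lexeme == "fork");
    pos = 0;
    assert(scanWord("for(", pos, lexeme) == TokenType::FOR && pos == 3);
    pos = 0;
    assert(scanWord("fo/**/r", pos, lexeme) == TokenType::ID && lexeme == "fo");
}
```

Because the keyword check happens only after the whole identifier has been read, "fork" can never be split into "for" and "k".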

Page 11:

FORTRAN’s mistakes

• Ignored white space (no delimiters)
– DO99I=1.2 (DO99I = 1.2) vs.
– DO99I=1,2 (DO 99 I = 1 , 2)

• No reserved words
– IF(IF.EQ.0)THENTHEN=17

• Result: arbitrary backtracking (or lookahead) needed!

Page 12:

TINY Lexemes

• Reserved words: if, then, else, end, repeat, until, read, write

• Symbols: +, -, *, /, =, <, (, ), ;, :=

• Other: number (integer only), identifier (letters only)

• Comment: {…}

• Principle of longest substring holds

Page 13:

TINY DFA

[Figure: state diagram of the TINY scanner DFA. States: start, inid, inum, com, done. Edge labels: letter, [ !letter ], digit, [ !digit ], space, {, }, ~}, :, =, punctuation.]

Page 14:

Using the TINY DFA

• Implement DFA directly or with a table

• Each call to getToken() starts at the current point of the string and scans until no transition is possible.

• If a final state is reached, return the token determined by the transition into the final state. Otherwise, report an error.

• Characters in [ ] are not consumed
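A direct (switch-based) implementation of a TINY-style DFA might look like the sketch below. The state names, the token kinds, the single SYMBOL kind for all one-character symbols, and the use of '\0' to stand for end of input are all assumptions of this sketch; reserved-word lookup is left out. The consume flag models the bracketed transitions, whose characters are looked at but not consumed.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class State { START, INID, INNUM, INCOMMENT, INASSIGN, DONE };
enum class TokenType { ID, NUM, ASSIGN, SYMBOL, ENDFILE, ERROR };

struct Token {
    TokenType type;
    std::string text;
};

// Scans one token starting at src[pos]; pos ends just past the token.
// '\0' stands in for end of input (an assumption of this sketch).
Token nextToken(const std::string &src, std::size_t &pos) {
    State state = State::START;
    Token tok{TokenType::ERROR, ""};
    while (state != State::DONE) {
        const char c = pos < src.size() ? src[pos] : '\0';
        const unsigned char uc = static_cast<unsigned char>(c);
        bool consume = true;  // false on the bracketed [ ] transitions
        bool keep = true;     // false when c is not part of the lexeme
        switch (state) {
        case State::START:
            if (std::isalpha(uc)) state = State::INID;
            else if (std::isdigit(uc)) state = State::INNUM;
            else if (c == ':') state = State::INASSIGN;
            else if (c == '{') { state = State::INCOMMENT; keep = false; }
            else if (std::isspace(uc)) keep = false;  // stay in START
            else {
                state = State::DONE;
                if (c == '\0') { tok.type = TokenType::ENDFILE; consume = false; }
                else if (std::string("+-*/=<();").find(c) != std::string::npos)
                    tok.type = TokenType::SYMBOL;
                // any other character is consumed as an ERROR token
            }
            break;
        case State::INID:
            if (!std::isalpha(uc)) { tok.type = TokenType::ID; consume = false; state = State::DONE; }
            break;
        case State::INNUM:
            if (!std::isdigit(uc)) { tok.type = TokenType::NUM; consume = false; state = State::DONE; }
            break;
        case State::INASSIGN:
            if (c == '=') tok.type = TokenType::ASSIGN;
            else consume = false;  // lone ':' is reported as ERROR
            state = State::DONE;
            break;
        case State::INCOMMENT:
            keep = false;
            if (c == '}') state = State::START;
            else if (c == '\0') { tok.type = TokenType::ENDFILE; consume = false; state = State::DONE; }
            break;
        case State::DONE:
            break;
        }
        if (consume) {
            if (keep) tok.text += c;
            ++pos;
        }
    }
    return tok;
}

// Usage sketch: "xy12" splits into an ID and a NUM because the digit that
// ends the identifier is not consumed.
void demoTinyScanner() {
    const std::string src = "xy12 := 7 {note};";
    std::size_t pos = 0;
    Token t = nextToken(src, pos);
    assert(t.type == TokenType::ID && t.text == "xy");
    t = nextToken(src, pos);
    assert(t.type == TokenType::NUM && t.text == "12");
    t = nextToken(src, pos);
    assert(t.type == TokenType::ASSIGN && t.text == ":=");
}
```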

Page 15:

DFA pseudocode

state = start_state
while (chars available) {
    last_state = state;
    state = next_state(next_char, state);
    if (state == null) return final(last_state);
}
return final(state);
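The pseudocode translates almost line for line into C++. The sketch below assumes a hypothetical three-state DFA (0 = start, 1 = in identifier, 2 = in number), uses -1 in place of the null state, and reports tokens as plain strings; none of these choices come from the slides.

```cpp
#include <cassert>
#include <cctype>
#include <string>

constexpr int kNoState = -1;  // plays the role of "null" in the pseudocode

// Hypothetical DFA: 0 = start, 1 = in identifier, 2 = in number.
int nextState(char c, int state) {
    const unsigned char uc = static_cast<unsigned char>(c);
    if ((state == 0 || state == 1) && std::isalpha(uc)) return 1;
    if ((state == 0 || state == 2) && std::isdigit(uc)) return 2;
    return kNoState;
}

// Maps a state to the token it accepts, or "error" for a non-final state.
std::string finalToken(int state) {
    if (state == 1) return "ID";
    if (state == 2) return "NUM";
    return "error";
}

// Scans from pos until no transition is possible, then reports the token
// for the last state reached. pos is left on the first unconsumed character.
std::string scanOne(const std::string &input, std::size_t &pos) {
    int state = 0;
    while (pos < input.size()) {
        const int next = nextState(input[pos], state);
        if (next == kNoState) return finalToken(state);
        state = next;
        ++pos;
    }
    return finalToken(state);
}

// Usage sketch: two calls split "abc123" into an ID and a NUM.
void demoScanOne() {
    const std::string input = "abc123";
    std::size_t pos = 0;
    assert(scanOne(input, pos) == "ID" && pos == 3);
    assert(scanOne(input, pos) == "NUM" && pos == 6);
}
```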

Page 16:

LEX (FLEX)

• FLEX generates a scanner automatically!
– Input: a regular expression describing each token, plus optional additional code
– Output: lex.yy.c, which includes the function yylex() for scanning (like getToken)

Page 17:

DFA Pseudocode

state = initial-state
while (chars in string) {
    c = next char from string
    state = next_state[state][c]
}
if final[state] return ACCEPT, else return REJECT
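A table-driven version of this loop, as a sketch: the DFA below recognizes the Int pattern [1-9][0-9]* and is an assumption chosen for illustration. To keep the table small it is indexed by character class rather than by the raw character c, and an explicit dead state absorbs every invalid input.

```cpp
#include <cassert>
#include <string>

// Hypothetical DFA for [1-9][0-9]*.
// States: 0 = start, 1 = in number (final), 2 = dead (error sink).
constexpr int kNumStates = 3;
enum CharClass { ZERO, NONZERO_DIGIT, OTHER, kNumClasses };

CharClass classify(char c) {
    if (c == '0') return ZERO;
    if (c >= '1' && c <= '9') return NONZERO_DIGIT;
    return OTHER;
}

// kNextState[state][class] gives the state after reading one character.
constexpr int kNextState[kNumStates][kNumClasses] = {
    /* start  */ {2, 1, 2},
    /* number */ {1, 1, 2},
    /* dead   */ {2, 2, 2},
};
constexpr bool kFinal[kNumStates] = {false, true, false};

// Runs the DFA over the whole string and reports ACCEPT (true) or REJECT.
bool accepts(const std::string &text) {
    int state = 0;
    for (char c : text) {
        state = kNextState[state][classify(c)];
        assert(state >= 0 && state < kNumStates);  // table entries stay in range
    }
    return kFinal[state];
}

// Usage sketch.
void demoAccepts() {
    assert(accepts("120"));
    assert(!accepts("012"));  // leading zero falls into the dead state
    assert(!accepts(""));     // the start state is not final
    assert(!accepts("12a"));
}
```

Changing the language recognized means changing only the two tables; the loop in accepts stays the same.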

Page 18:

Parts of a LEX file

• Definitions
– Code for the top of the file, and definitions of named expressions such as “digit”
– All code between %{ and %} is copied directly into the output

• Rules
– { expression } {code when recognized}

• Auxiliary Routines
– Define additional functions here (including main)

Page 19:

Predefined items

• yylex() - lex scanning routine (like getToken), generated by FLEX

• yytext - current string (a character array, not a C++ string class)

• input() - get a char from flex input (called yyinput() when the scanner is compiled as C++)

• ECHO - print yytext to yyout

Page 20:

Example: Definitions

%{
/* add line numbers to text and print */
#include <iostream>
int lineno = 1;
%}
line .*\n
%%

Page 21:

Example: Rules & Aux. Code

{line} { std::cout << lineno++ << " " << yytext; }
%%
int main() { yylex(); return 0; }

Page 22:

Using the Scanner

• First, create the code
– flex test.lex

• Next, compile the program
– g++ lex.yy.c -o test -lfl

• Finally, scan the input file
– ./test < input_file