![Page 1: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/1.jpg)
Lexical Analysis
Textbook: Modern Compiler Design, Chapter 2.1
![Page 2: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/2.jpg)
A motivating example
• Create a program that counts the number of lines in a given input text file
![Page 3: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/3.jpg)
Solution
%{
int num_lines = 0;
%}
%%
\n      ++num_lines;
.       ;
%%
int main() { yylex(); printf("# of lines = %d\n", num_lines); return 0; }
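For comparison, the same counter can be written by hand in plain C. This is only a sketch of what the two rules amount to, not the code that flex generates; the function name `count_lines` and the choice of a NUL-terminated string as input are assumptions made for the illustration.

```c
#include <stddef.h>

/* Hypothetical helper: counts '\n' characters in a NUL-terminated text,
   mirroring the two rules of the specification above. */
size_t count_lines(const char *text)
{
    size_t num_lines = 0;
    for (; *text != '\0'; ++text) {
        if (*text == '\n')
            ++num_lines;        /* rule 1: \n   ++num_lines; */
        /* rule 2: .   ;   (every other character is ignored) */
    }
    return num_lines;
}
```

Calling `count_lines("a\nb\n")` counts two newlines; a text without a trailing newline has its last line left uncounted, exactly as in the flex version.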
![Page 4: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/4.jpg)
Solution
[Figure: state diagram for the specification above, with states labelled initial, newline (entered on \n), and other (action ;).]
![Page 5: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/5.jpg)
Outline
• Roles of lexical analysis
• What is a token
• Regular expressions and regular descriptions
• Lexical analysis
• Automatic Creation of Lexical Analysis
• Error Handling
![Page 6: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/6.jpg)
Basic Compiler Phases
Source program (string)
→ lexical analysis → Tokens
→ syntax analysis → Abstract syntax tree
→ semantic analysis → Annotated abstract syntax tree
→ Back-End → Fin. Assembly
(Lexical, syntax, and semantic analysis together form the Front-End.)
![Page 7: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/7.jpg)
Example Tokens
Type     Examples
ID       foo  n_14  last
NUM      73  00  517  082
REAL     66.1  .5  10.  1e67  5.5e-10
IF       if
COMMA    ,
NOTEQ    !=
LPAREN   (
RPAREN   )
![Page 8: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/8.jpg)
Example Non Tokens
Type                     Examples
comment                  /* ignored */
preprocessor directive   #include <foo.h>
                         #define NUMS 5, 6
macro                    NUMS
whitespace               \t \n \b
![Page 9: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/9.jpg)
Example
float match0(char *s) /* find a zero */
{
    if (!strncmp(s, "0.0", 3))
        return 0.;
}
FLOAT ID(match0) LPAREN CHAR DEREF ID(s) RPAREN LBRACE IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN RETURN REAL(0.0) SEMI RBRACE EOF
![Page 10: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/10.jpg)
Lexical Analysis (Scanning)
• input
– program text (file)
• output
– sequence of tokens
• Read input file
• Identify language keywords and standard identifiers
• Handle include files and macros
• Count line numbers
• Remove whitespace
• Report illegal symbols
• [Produce symbol table]
![Page 11: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/11.jpg)
Why Lexical Analysis
• Simplifies the syntax analysis
– And language definition
• Modularity
• Reusability
• Efficiency
![Page 12: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/12.jpg)
What is a token?
• Defined by the programming language
• Can be separated by spaces
• Smallest units
• Defined by regular expressions
![Page 13: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/13.jpg)
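A simplified scanner for C
Token nextToken()
{
    int c;                          /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;            /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();              /* look one character ahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:  ungetc(c, stdin); /* not part of this token: push it back */
                  return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}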
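A self-contained variant of the hand-written scanner may make the lookahead step easier to follow. The sketch below is an assumption-laden illustration rather than the slide's code: it scans a NUL-terminated string through a hypothetical `cursor` instead of using `getchar`/`ungetc`, and it adds two token values, `EndOfInput` and `Illegal`, that do not appear in the original.

```c
typedef enum { SemiColumn, Plus, PlusPlus, PlusEqual, EndOfInput, Illegal } Token;

/* Hypothetical input source: a cursor into a NUL-terminated string. */
static const char *cursor = "";

void set_input(const char *text) { cursor = text; }

Token next_token(void)
{
    for (;;) {
        char c = *cursor;
        if (c == '\0')
            return EndOfInput;          /* never move past the terminator */
        ++cursor;
        switch (c) {
        case ' ':
            continue;                   /* skip blanks */
        case ';':
            return SemiColumn;
        case '+':
            /* One character of lookahead decides between +, ++ and +=.
               Peeking at *cursor replaces the getchar/ungetc pair. */
            if (*cursor == '+') { ++cursor; return PlusPlus; }
            if (*cursor == '=') { ++cursor; return PlusEqual; }
            return Plus;
        default:
            return Illegal;             /* anything this sketch does not know */
        }
    }
}
```

After `set_input("+++")`, successive calls return `PlusPlus`, `Plus`, and then `EndOfInput`: the scanner always prefers the longer operator when the lookahead character allows it.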
![Page 14: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/14.jpg)
Regular Expressions
Basic patterns   Matching
x   The character x
.   Any character except newline
[xyz] Any of the characters x, y, z
R? An optional R
R* Zero or more occurrences of R
R+ One or more occurrences of R
R1R2 R1 followed by R2
R1|R2 Either R1 or R2
(R) R itself
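The operators in the table can be tried out with the POSIX `regcomp`/`regexec` functions, whose extended syntax uses the same notation for `?`, `*`, `+`, `|`, `[...]` and grouping. The helper name `matches` and the 256-byte pattern buffer are assumptions of this sketch; details such as whether `.` matches a newline depend on the flags passed to `regcomp` and may differ from lex.

```c
#include <regex.h>
#include <stdio.h>

/* Returns 1 if the whole of text matches pattern, 0 otherwise
   (also 0 if the pattern is too long or fails to compile). */
int matches(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int ok;
    /* Wrap the pattern as ^(pattern)$ so that it must cover all of text. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return 0;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For instance, `matches("ab?", "a")` and `matches("(ab)+", "abab")` succeed, while `matches("ab+", "a")` fails because `+` requires at least one occurrence.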
![Page 15: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/15.jpg)
Escape characters in regular expressions
• \ converts a single operator into text
– a\+
– (a\+\*)+
• Double quotes surround text
– "a+*"+
• Aesthetically ugly
• But standard
![Page 16: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/16.jpg)
Regular Descriptions
• EBNF where non-terminals are fully defined before first use
letter → [a-zA-Z]
digit → [0-9]
underscore → _
letter_or_digit → letter | digit
underscored_tail → underscore letter_or_digit+
identifier → letter letter_or_digit* underscored_tail*
• Token description
– A token name
– A regular expression
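A regular description of this kind translates almost line by line into a recognizer. The following is a minimal sketch, assuming that an identifier may end in zero or more underscored tails and that the C locale is in effect, so that `isalpha` and `isalnum` correspond to [a-zA-Z] and [a-zA-Z0-9]; the function name `identifier_length` is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

static int is_letter(char c)          { return isalpha((unsigned char)c) != 0; }
static int is_letter_or_digit(char c) { return isalnum((unsigned char)c) != 0; }

/* Length of the longest identifier starting at s, or 0 if there is none. */
size_t identifier_length(const char *s)
{
    size_t i = 0;
    if (!is_letter(s[i]))                   /* identifier starts with a letter */
        return 0;
    ++i;
    while (is_letter_or_digit(s[i]))        /* letter_or_digit* */
        ++i;
    /* underscored_tail: '_' followed by at least one letter or digit */
    while (s[i] == '_' && is_letter_or_digit(s[i + 1])) {
        i += 2;
        while (is_letter_or_digit(s[i]))
            ++i;
    }
    return i;
}
```

With these rules `n_14` is a complete identifier of length 4, whereas in `foo_` and `foo__bar` only `foo` is matched, because an underscore must be followed by a letter or digit.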
![Page 17: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/17.jpg)
The Lexical Analysis Problem
• Given
– A set of token descriptions
– An input string
• Partition the string into tokens (class, value)
• Ambiguity resolution
– The longest matching token
– Between two equal-length tokens, select the first
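The two ambiguity rules can be sketched directly: try every token description at the current position, keep the longest match, and let the earlier description win a tie. The token classes `While`, `Id` and `Num`, the matcher functions, and the name `classify` are all assumptions chosen for this illustration; a generated scanner reaches the same result with a single automaton rather than by trying each rule in turn.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { NoToken, While, Id, Num } TokenClass;

/* Each matcher returns the length of its longest match at s, or 0. */
static size_t match_while(const char *s)
{
    return strncmp(s, "while", 5) == 0 ? 5 : 0;
}

static size_t match_id(const char *s)
{
    size_t i = 0;
    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;
    while (isalnum((unsigned char)s[i]) || s[i] == '_')
        ++i;
    return i;
}

static size_t match_num(const char *s)
{
    size_t i = 0;
    while (isdigit((unsigned char)s[i]))
        ++i;
    return i;
}

/* Rules in declaration order: earlier rules win ties. */
static const struct { TokenClass cls; size_t (*match)(const char *); } rules[] = {
    { While, match_while }, { Id, match_id }, { Num, match_num },
};

/* Picks the longest match; on equal length the earliest rule is kept. */
TokenClass classify(const char *s, size_t *length)
{
    TokenClass best = NoToken;
    size_t r;
    *length = 0;
    for (r = 0; r < sizeof rules / sizeof rules[0]; ++r) {
        size_t len = rules[r].match(s);
        if (len > *length) {    /* strictly longer only, so ties keep the first */
            *length = len;
            best = rules[r].cls;
        }
    }
    return best;
}
```

On `while(` both `While` and `Id` match five characters and the earlier rule wins; on `whilex` the identifier rule matches six characters and therefore takes precedence.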
![Page 18: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/18.jpg)
A Flex specification of C Scanner
Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]       { ; }
[\n]        { line_count++; }
";"         { return SemiColumn; }
"++"        { return PlusPlus; }
"+="        { return PlusEqual; }
"+"         { return Plus; }
"while"     { return While; }
{Letter}({Letter}|{Digit})*   { return Id; }
"<="        { return LessOrEqual; }
"<"         { return LessThen; }
![Page 19: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/19.jpg)
Flex
• Input
– regular expressions and actions (C code)
• Output
– A scanner program that reads the input and applies actions when an input regular expression is matched
regular expressions → flex → scanner
input program → scanner → tokens
![Page 20: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/20.jpg)
Naïve Lexical Analysis
![Page 21: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/21.jpg)
Automatic Creation of Efficient Scanners
• Naïve approach on regular expressions (dotted items)
• Construct a nondeterministic finite automaton over items
• Convert it to a deterministic automaton
• Minimize the resultant automaton
• Optimize (compress) representation
![Page 22: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/22.jpg)
Dotted Items
![Page 23: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/23.jpg)
Example
• T → a+ b+
• Input 'aab'
• After parsing aa
– T → a+ • b+
![Page 24: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/24.jpg)
Item Types
• Shift item: the dot is in front of a basic pattern
– A → (ab)+ • c (de|fe)*
• Reduce item: the dot is at the end of the right-hand side
– A → (ab)+ c (de|fe)* •
• Basic item
– A shift or reduce item
![Page 25: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/25.jpg)
Character Moves
• For shift items character moves are simple
T → α • c β   --c-->   T → α c • β
Digit → • [0-9]   --7-->   Digit → [0-9] •
![Page 26: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/26.jpg)
Moves
• For non-shift items the situation is more complicated
• What character do we need to see?
• Where are we in the matching?
T → • a*
T → • (a*)
![Page 27: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/27.jpg)
Moves for Repetitions
• Where can we get from T → α • (R)* β ?
• If R occurs zero times: T → α (R)* • β
• If R occurs one or more times: T → α (• R)* β
– When R ends: T → α (R •)* β
– From there, either T → α (R)* • β or back to T → α (• R)* β
![Page 28: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/28.jpg)
Moves
![Page 29: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/29.jpg)
I → [0-9]+
F → [0-9]*'.'[0-9]+
Input '3.1;'
I → • ([0-9])+
I → (• [0-9])+
I → ([0-9] •)+
I → (• [0-9])+     I → ([0-9])+ •
![Page 30: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/30.jpg)
I → [0-9]+
F → [0-9]*'.'[0-9]+
Input '3.1;'
F → • ([0-9])*'.'([0-9])+
F → (• [0-9])*'.'([0-9])+     F → ([0-9])* • '.'([0-9])+
F → ([0-9] •)*'.'([0-9])+
F → (• [0-9])*'.'([0-9])+     F → ([0-9])* • '.'([0-9])+
F → ([0-9])*'.' • ([0-9])+
F → ([0-9])*'.' (• [0-9])+
F → ([0-9])*'.' ([0-9] •)+
F → ([0-9])*'.' (• [0-9])+     F → ([0-9])*'.' ([0-9])+ •
![Page 31: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/31.jpg)
Concurrent Search
• How to scan multiple token classes in a single run?
![Page 32: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/32.jpg)
I → [0-9]+
F → [0-9]*'.'[0-9]+
Input '3.1;'
I → • ([0-9])+     F → • ([0-9])*'.'([0-9])+
I → (• [0-9])+     F → (• [0-9])*'.'([0-9])+     F → ([0-9])* • '.'([0-9])+
I → ([0-9] •)+     F → ([0-9] •)*'.'([0-9])+
I → (• [0-9])+     I → ([0-9])+ •     F → (• [0-9])*'.'([0-9])+     F → ([0-9])* • '.'([0-9])+
F → ([0-9])*'.' • ([0-9])+
![Page 33: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/33.jpg)
The Need for Backtracking
• A simple-minded solution may require unbounded backtracking
T1 → a+ ";"
T2 → a
• Quadratic behavior
• Does not occur in practice
• A linear solution exists
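The quadratic behavior can be made concrete by counting how many characters a backtracking scanner reads on a run of n letters `a` that contains no `;`. The sketch below only simulates the counting, under the assumption that the scanner restarts right after every emitted token and also reads one end marker per attempt; the function name `chars_examined` is hypothetical.

```c
#include <stddef.h>

/* Simulates scanning a run of n 'a' characters (no ';') with
   T1 = a+ ";" and T2 = a, restarting after every token, and
   returns how many characters were read in total. */
size_t chars_examined(size_t n)
{
    size_t examined = 0;
    size_t start = 0;
    while (start < n) {
        /* Read ahead hoping for ';'. Only the first 'a' is an accepted
           match (T2), but the scanner cannot know that T1 will fail
           until it has seen the end of the run. */
        size_t pos = start;
        while (pos < n) {
            ++pos;
            ++examined;
        }
        ++examined;         /* the end marker that finally rules out T1 */
        start += 1;         /* emit T2 for one 'a' and back up to just after it */
    }
    return examined;
}
```

Under these assumptions the count is n(n+1)/2 + n, so doubling the input length roughly quadruples the work: 14 reads for n = 4 and 44 reads for n = 8.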
![Page 34: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/34.jpg)
A Non-Deterministic Finite State Machine
• Add a production S' → T1 | T2 | … | Tn
• Construct an NDFA over the items
– Initial state S' → • (T1 | T2 | … | Tn)
– For every character move, construct a character transition
<T → α • c β, c> ↦ T → α c • β
– For every ε-move, construct an ε-transition
– The accepting states are the reduce items
– Accepts the language defined by the Ti
![Page 35: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/35.jpg)
Moves
![Page 36: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/36.jpg)
I → [0-9]+
F → [0-9]*'.'[0-9]+
[Figure: NDFA over the dotted items of I and F. The initial item is S → • (I|F); ε-transitions connect it to the items of I and F, and the character transitions are labelled [0-9] and ".".]
![Page 37: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/37.jpg)
Efficient Scanners
• Construct a deterministic finite automaton
– Every state is a set of items
– Every transition is followed by an ε-closure
– When a set contains two reduce items, select the one declared first
• Minimize the resultant automaton
– Rejecting states are initially indistinguishable
– Accepting states of the same token are indistinguishable
• Exponential worst-case complexity
– Does not occur in practice
• Compress representation
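The core of the construction, a character move followed by an ε-closure, can be sketched with sets of NDFA states stored as bit masks. Everything here is an assumption made for illustration: states are numbered from 0, bit s of a `StateSet` stands for state s, and the arrays `eps` and `move` give the successors of each state under an ε-move and under one fixed input character.

```c
#include <stdint.h>

typedef uint32_t StateSet;      /* bit s is set when state s is in the set */

/* eps[s] is the set of states reachable from s by a single epsilon move.
   The closure is computed by adding successors until nothing changes. */
StateSet epsilon_closure(StateSet set, const StateSet eps[], int num_states)
{
    StateSet closure = set;
    int changed = 1;
    while (changed) {
        int s;
        changed = 0;
        for (s = 0; s < num_states; ++s) {
            if ((closure >> s) & 1u) {
                StateSet added = eps[s] & ~closure;
                if (added) {
                    closure |= added;
                    changed = 1;
                }
            }
        }
    }
    return closure;
}

/* One DFA transition: move every member on the character, then close.
   move[s] is the set of states reachable from s on that character. */
StateSet dfa_step(StateSet set, const StateSet move[],
                  const StateSet eps[], int num_states)
{
    StateSet next = 0;
    int s;
    for (s = 0; s < num_states; ++s)
        if ((set >> s) & 1u)
            next |= move[s];
    return epsilon_closure(next, eps, num_states);
}
```

Each distinct `StateSet` value reached this way becomes one state of the deterministic automaton, which is why the number of states can grow exponentially in the worst case.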
![Page 38: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/38.jpg)
I → [0-9]+
F → [0-9]*'.'[0-9]+
[Figure: DFA obtained from the items of I and F. Each state is a set of items, starting from the closure of S → • (I|F); transitions are labelled [0-9], ".", [^0-9] and [^0-9.], and all remaining characters lead to the Sink state.]
![Page 39: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/39.jpg)
A Linear-Time Lexical Analyzer
IMPORT Input Char [1..];
Set Read Index To 1;

Procedure Get_Next_Token;
    set Start of token to Read Index;
    set End of last token to uninitialized;
    set Class of last token to uninitialized;
    set State to Initial;
    while State /= Sink:
        set Ch to Input Char[Read Index];
        set State to Next State[State, Ch];
        if accepting(State):
            set Class of last token to Class(State);
            set End of last token to Read Index;
        set Read Index to Read Index + 1;
    set token.class to Class of last token;
    set token.repr to Input Char[Start of token .. End of last token];
    set Read Index to End of last token + 1;
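The procedure can be sketched in C for the two token classes I → [0-9]+ and F → [0-9]*'.'[0-9]+. The state names, the `Token` struct, and the use of a NUL-terminated string (whose terminator always leads to the sink state) are assumptions of this sketch, not part of the original procedure.

```c
#include <stddef.h>

enum { SINK, S1, S2, S3, S4 };              /* DFA states; S1 is initial */
typedef enum { NONE, CLASS_I, CLASS_F } TokenClass;

typedef struct { TokenClass cls; size_t start; size_t length; } Token;

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Transition function of the assumed DFA for I and F. */
static int next_state(int state, char ch)
{
    switch (state) {
    case S1: case S2: return is_digit(ch) ? S2 : ch == '.' ? S3 : SINK;
    case S3: case S4: return is_digit(ch) ? S4 : SINK;
    default:          return SINK;
    }
}

static TokenClass class_of(int state)
{
    return state == S2 ? CLASS_I : state == S4 ? CLASS_F : NONE;
}

/* Scans one token starting at *read_index; input must be NUL-terminated.
   Returns cls == NONE and length == 0 if no token starts there. */
Token get_next_token(const char *input, size_t *read_index)
{
    Token token = { NONE, 0, 0 };
    size_t pos = *read_index;
    int state = S1;
    token.start = *read_index;
    while (state != SINK) {
        state = next_state(state, input[pos]);
        ++pos;
        if (class_of(state) != NONE) {      /* remember the last accepting point */
            token.cls = class_of(state);
            token.length = pos - token.start;
        }
    }
    *read_index = token.start + token.length;   /* back up to end of last token */
    return token;
}
```

On the input `3.1;` the first call returns class F with length 3 and leaves the read index at the `;`. A second call finds no token there and returns `NONE` with length 0, so a caller would have to report the character and skip it.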
![Page 40: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/40.jpg)
Scanning “3.1;”
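input        state   next state   last token
• 3.1;       1       2            I
3 • .1;      2       3            I
3. • 1;      3       4            F
3.1 • ;      4       Sink         F
DFA (state 2 accepts I, state 4 accepts F):
1: [0-9] → 2, "." → 3, [^0-9.] → Sink
2: [0-9] → 2, "." → 3, [^0-9.] → Sink
3: [0-9] → 4, [^0-9] → Sink
4: [0-9] → 4, [^0-9] → Sink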
![Page 41: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/41.jpg)
Scanning “aaa”
input        state   next state   last token
• aaa$       1       2            T2
a • aa$      2       4            T2
aa • a$      4       4            T2
aaa • $      4       Sink         T2
T1 → a+ ";"
T2 → a
DFA (state 2 accepts T2, state 3 accepts T1):
1: [a] → 2, [^a] → Sink
2: [a] → 4, ";" → 3, [^a;] → Sink
3: [.\n] → Sink
4: [a] → 4, ";" → 3, [^a;] → Sink
![Page 42: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/42.jpg)
Error Handling
• Illegal symbols
• Common errors
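A common way to handle illegal symbols is to report each one with its line number, skip it, and continue scanning so that later errors are still found. The sketch below assumes a toy alphabet and a hypothetical function `report_illegal_symbols`; neither comes from the slides.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical alphabet: the characters this toy language allows. */
static int is_legal(char c)
{
    return c != '\0' &&
           strchr("abcdefghijklmnopqrstuvwxyz0123456789+;<= \t\n", c) != NULL;
}

/* Reports every illegal character with its line number, skips it,
   and returns the number of errors found. */
int report_illegal_symbols(const char *text, FILE *out)
{
    int line = 1, errors = 0;
    for (; *text != '\0'; ++text) {
        if (*text == '\n') {
            ++line;                 /* keep the line count for messages */
        } else if (!is_legal(*text)) {
            fprintf(out, "line %d: illegal character '%c'\n", line, *text);
            ++errors;
        }
    }
    return errors;
}
```

Skipping the offending character instead of stopping lets a single run list all illegal symbols in the file.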
![Page 43: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/43.jpg)
Missing
• Creating a lexical analysis by hand
• Table compression
• Symbol Tables
• Handling Macros
• Start states
• Nested comments
![Page 44: Lexical Analysis Textbook:Modern Compiler Design Chapter 2.1](https://reader036.vdocuments.net/reader036/viewer/2022081503/56649d7d5503460f94a603da/html5/thumbnails/44.jpg)
Summary
• For most programming languages lexical analyzers can be easily constructed automatically
• Exceptions:
– Fortran
– PL/1
• Lex/Flex/Jlex are useful beyond compilers