compiler construction lent term 2021 lecture 2 : lexical
TRANSCRIPT
1
Compiler Construction
Lent Term 2021
Lecture 2 : Lexical analysis
Timothy G. Griffin
Computer Laboratory
University of Cambridge
• Recall regular expressions
• Recall Finite Automata
• Recall NFA to DFA transformation
• What is the “lexing problem”?
• How DFAs are used to solve the lexing
problem?
1
What problem are we solving?
if m = 0 then 1 else if m = 1 then 1 else fib (m - 1) + fib (m -2)
Translate a sequence of characters
into a sequence of tokens
type token =
| INT of int| IDENT of string | LPAREN | RPAREN
| ADD | SUB | EQUAL | IF | THEN | ELSE
| …
IF, IDENT “m”, EQUAL, INT 0, THEN, INT 1, ELSE, IF,
IDENT “m”, EQUAL, INT 1, THEN, INT 1, ELSE, IDENT “fib”,
LPAREN, IDENT “m”, SUB, INT 1, RPAREN, ADD,
IDENT “fib”, LPAREN, IDENT “m”, SUB, INT 2, RPAREN
implemented with some data type
2
)(||||| * aeeeeeae
)()(
)()(
}{)(
)}(),(|{)(
)()()(
}{)(
}{)(
{})(
)(
0
*
1
0
22112121
2121
*
n
n
nn
eMeM
eeMeM
eM
eMweMwwweeM
eMeMeeM
aaM
M
M
eM
alphabet over sexpressionRegular e
3
Regular Expression (RE) Examples
},,,
,,,,,{
))(( *
aaaabbbbabbbaabb
ababbaaabbbaabbaabbabb
abbbaM
},,
,,,
,,,{
))(( *
M
4
Review of Finite Automata (FA)
),,,,( 0 FqQM
states :Q alphabet :
statestart : Q0 q states final :QF
(DFA)FA ticdeterminisfor
Qa)(q,,, aQq
(NFA)FA nisticnondetermifor
Qa)(q,}),{(, aQq
5
NFA Example
)cbbcaabM(a ** **
acceptingNFA An
c
c
a
b
a
b
start
6
A bit of notation
},|{)(
and ),( if
0
322131
qqFqwML
qqqaqqq
w
waw
ε
For deterministic FA.
For nondeterministic FA.
},|{)(
and),(q if
and),( if
0
321231
321231
qqFqwML
qqaqqq
qqqqqq
w
waw
ww
ε
7
Review of RE -> NFA
e startq finalq
A regular
expression.
A nondeterministic
FA accepting M(e) with
a single final state.
The construction is done by induction on
the structure of e.
)(eN
8
Review of RE -> NFA
0q 1q)(N
0q 1q)(N
0q 1q)(aNa
9
Review of RE -> NFA
)( 21 eeN
)( 1eN
)( 2eN
10
Review of RE -> NFA
)( 21eeN
)( 2eN)( 1eN
11
Review of RE -> NFA
)( *eN
)(eN
12
))(( *abbbaN
a
b
ab b
0 1
2 3
6
54
10 9 8
7
start
13
Review of NFA -> DFA
),,,,( 0 FqQM
),,,,( ''
0
''' FqQM
}|{
}{
})|),('({),(
}',|'{)(
}|{
'
0
'
0
'
'
FSQSF
qclosureq
SqaqqclosureaS
qqSqQqSclosure
QSSQ
14
?)( compute wedo How Sclosure
resultreturn
stackon u push
result :result then
result if
each for
stack theoff pop
empty not stack while
:result
stack a onto of elements allpush
:)(
{u}
u
)(q,u
q
S
S
Sclosure
Look familiar?
It’s just a version of
transitive closure!
15
)))((( *abbbaNDFA
10,7,6
,5,4,2,1
7,4
,2,1,0
8,7,6
,4,3,2,1
9,7,6
5,4,2,1b
a
ba
startb
b
b
aa
a
7,6
5,4,2,1
16
17
Traditional Regular Language Problem
?)( is , and Given eLwwe
Solution : construct NFA from e, then DFA, then run
the DFA on w.
But is this a solution to the “lexing problem?
No!
17
Something closer to the “lexing problem”
.
The expressions are ordered by priority. Why?
Is “if” a variable or a keyword? Need priority to
resolve ambiguity (so “if” matched keyword RE
before identifier RE.
We need to do a longest match. Why?
Is “ifif” a variable or two “if” keywords?
w
)w(i),w,(i),w,(i nn,21 ...21
keee ,, 21 and
find
Given
so that
n1 www=w ...2 )L(ewjijii
and what else?
and
18
19
Define Tokens with Regular Expressions (Finite
Automata)
Keyword: if
1i
2f
3
1i
2f
3
0
-{f}
-{i}
This FA is really shorthand for:
“dead state” 19
20
Define Tokens with Regular Expressions (Finite
Automata)
Keyword:
if1
i2
f3 KEY(IF)
Keyword:
then1
t2
h3
KEY(then)
5
e
n4
Regular Expression Finite Automata Token
Identifier:
[a-zA-Z][a-zA-Z0-9]*1 2
[a-zA-Z]
[a-zA-Z0-9]
ID(s)
20
21
Define Tokens with Regular Expressions (Finite
Automata)
Regular Expression Finite Automata Token
number:
[0-9][0-9]*1 2
[0-9]
[0-9]
NUM(n)
real:
([0-9]+ ‘.’ [0-9]*)
| ([0-9]* ‘.’ [0-9]+)
1
3
[0-9] NUM(n) 2
[0-9]
[0-9]
.
4
.
[0-9]5
[0-9]21
22
No Tokens for “White-Space”
White-space with one line
comments starting with %
1
3
%2
[A-za-z0-9’ ‘]
4
\n
\t
\n‘ ‘
22
Constructing a Lexer
an ordered list of regular expressions
Highest priority first, lowest last
INPUT: keee ,, 21
keeee 21for NFA
23
priority.highest
of with theassociated
state finaleach DFA with
ie
Constructing a Lexer
(1) Keyword : then
(2) Ident : [a-z][a-z]*
(2) White-space: ‘ ‘
24
1t
2:IDh
3:ID
5:THEN
e
n
4:ID
7:W
‘ ‘
6:ID
State 5 could accept
either an ID or
the keyword “then”.
The priority rules
eliminates this
ambiguity and
associates state 5
with the keyword.
25
What about longest match?
1t
2:IDh
3:ID
5:THEN
e
n
4:ID
7:W
‘ ‘
6:ID
|then thenx$ 1 0
t|hen thenx$ 2 2
th|en thenx$ 3 3
the|n thenx$ 4 4
then| thenx$ 5 5
then |thenx$ 0 5 EMIT KEY(THEN)
then| thenx$ 1 0 RESET
then |thenx$ 7 7
then t|henx$ 0 7 EMIT WHITE(‘ ‘)
then |thenx$ 1 0 RESET
then t|henx$ 2 2
then th|enx$ 3 3
then the|nx$ 4 4
then then|x$ 5 5
then thenx|$ 6 6
then thenx$| 0 6 EMIT ID(thenx)
Start in initial state,
Repeat:
(1) read input until dead state is
reached. Emit token associated
with last accepting state.
(2) reset state to start state
| = current position, $ = EOF
Input
current state
last accepting state