Compiler Construction
Class Notes
Reg Dodds
Department of Computer Science
University of the Western Cape
© 2006, 2017 Reg Dodds
March 22, 2017
Introduction
• What is a Compiler?
• What is an Interpreter?
• Why Compiler Construction?
• What languages?
• An example of a very simple compilation.
• Why write a compiler?
• Layout of a compiler.
1
What is interpretation?
• Let L ∈ L be a programming language, with
L = {Fortran, Lisp, Algol, COBOL, PL/1, BASIC, APL, SNOBOL, Pascal, C, C++, Ada, SQL,
Java, ML, Haskell, · · ·}.
• IL is an interpreter for a program pL ∈ L, and
input ∈ A∗ is data, where A is usually called
a character set and A∗ is its Kleene-closure from
which IL computes output data output ∈ A∗.
• The execution of the interpreter may abort and
lead to an error condition:
IL : L × A∗ → A∗ ∪ {error}
(pL, input) is interpreted by IL to give output ∪ {error}, which we
may also write as:
IL(pL, input) = output ∪ {error}
A single process takes place: the source program
is directly interpreted.
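The mapping IL(pL, input) = output ∪ {error} can be made concrete with a toy interpreter. The three-instruction language below is invented for illustration and is not from the notes:

```python
# A toy interpreter I_L for a hypothetical mini-language: given a
# program p_L (a list of instructions) and its input, it produces
# output or the special value "error", as in
# I_L(p_L, input) = output ∪ {error}.
def interpret(program, data):
    acc = 0
    for instruction in program:
        op, _, arg = instruction.partition(" ")
        if op == "load":                  # load input | load <n>
            acc = int(data) if arg == "input" else int(arg)
        elif op == "add":                 # add <n>
            acc += int(arg)
        else:                             # undefined instruction: abort
            return "error"
    return str(acc)

print(interpret(["load input", "add 19"], "41"))   # -> 60
print(interpret(["halt?"], "41"))                  # -> error
```

A single process takes place: the program is executed directly on its input, with no translated form ever produced.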
2
Making interpreters efficient
• In a production quality interpreter it is advan-
tageous to produce some sort of compact inter-
pretable code by a process that is similar to com-
pilation, once, and then subsequently reinterpret
this compact code repeatedly.
This process is used by Java and many interpreters
for BASIC such as GWBasic. Typically a command-
line interface interprets the command directly.
An even better idea is to compile blocks of code in-
crementally, directly to executable machine code.
When a block is altered its corresponding code is
replaced with new code.
Interpreters often have direct access to the original
source code—this is very useful for finding errors
in the source program. Stepping mechanisms that
move line-by-line through the source are easily im-
plemented with interpreters.
3
One view of a compiler
• When compiling is involved, two processes are ap-
plied to execute a source program.
• A compiler CL for a language L translates a syn-
tactically correct source program pL ∈ L into
equivalent machine code.
Source program → [ Compiler ] → machine language
• Examples:
– A source program in C++ is translated into
Mips machine code.
– Visual Basic source code is compiled into Intel-
x86 machine code.
– A Java source program is translated into JVM
byte code.
4
Execution of machine language
• The machine code produced by the compiler is
somehow executed by hardware.
• Hardware may be emulated by microcode, or it
may be hardwired.
• Some instructions may be entirely executable by
hardware.
• Certain instructions may be emulated by microcode.
• The user is usually not aware that some of the
machine code instructions, or even all of them, are
being emulated.
• On some machines the machine instruction set may
change dynamically, depending on the application.
• It is likely that compiled machine code, on any
particular machine, runs faster than code running
on an interpreter on the same machine.
5
What is compilation?
• The source program pL ∈ L is first translated by a
compilerCL into an equivalent machine executable
program pM .
• Next pM is interpreted, or executed, by a machine
plus its input to create output and/or an error.
• To run a program: (1) it is compiled and (2) then
it is executed. CL(pL) = pM ∪ {error}, if there
are no compilation errors then the second step may
be invoked:IM (pM , input)=output ∪ {error}.
• Notice that the interpreter IL has now become IM, which is perhaps hardware.
• Interpreters and computers are different realiza-
tions of computing machines.
• Sun’s picoJava chip or the Java Virtual Machine
on your computer can be used interchangeably to
run the same byte code program pM .
6
Java source program
public class simple {
  public static void main (String argv[]) {
    int a;
    a = 41;
    a = a + 19;
  }
}
7
Java byte code
Compiled from simple.java
public class simple extends java.lang.Object{
public static void main (java.lang.String[ ]);
public simple();
}
Method void main(java.lang.String[ ]);
0 bipush 41
2 istore 1
3 iload 1
4 bipush 19
6 iadd
7 istore 1
8 return
Method simple()
0 aload 0
1 invokespecial #12 <Method java.lang.Object.<init>()>
4 return
Note there is a main method and a constructor method.
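The way main's byte code manipulates the operand stack and local variable 1 (the variable a) can be simulated in a few lines. This is a sketch of the stack discipline only, not a real JVM:

```python
# Simulate the four instructions used in the listing above on a tiny
# stack machine with numbered local-variable slots.
def run(code):
    stack, local = [], {}
    for instr in code:
        op, *arg = instr.split()
        if op == "bipush":                 # push a byte constant
            stack.append(int(arg[0]))
        elif op == "istore":               # pop into a local variable slot
            local[int(arg[0])] = stack.pop()
        elif op == "iload":                # push a local variable
            stack.append(local[int(arg[0])])
        elif op == "iadd":                 # pop two operands, push the sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "return":
            break
    return local

locals_after = run(["bipush 41", "istore 1", "iload 1",
                    "bipush 19", "iadd", "istore 1", "return"])
print(locals_after)    # {1: 60}
```

Local slot 1 ends up holding 60, exactly the final value of a in the source program.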
8
Overview of course
• Programs related to compilers.
• The compilation process: phases, intermediate code,
structures.
• Bootstrapping and transfer, T-diagrams, Louden’s
TINY and TM.
• SEPL: interpreter, emulator, compiler.
9
Programs related to compilers
(Louden p 4-6)
• interpreters
• assemblers
• linkers
• loaders
• preprocessors
• editors
• debuggers
• profilers
• project managers—SCCS and RCS
10
The compilation process
(Louden p 7, 8-14)
• Phases and the intermediate code they produce:
  – source code
  – scanner (lexical analyser) → tokens
  – syntax analyser → abstract syntax tree
  – semantic analyser → annotated syntax tree
  – intermediate code optimizer → intermediate code
  – code generator → target code
  – target code optimizer → optimized code
  – linker-loader → executable code
• Structures
– literals
– symbol table
– error handler
– temporary files
11
Bootstrapping and transfer of programming languages
(Louden ¶1.6, p 18-21)
• T-diagrams—next slide.
• Pascal in 1970 on CDC 6600.
• P-code compiler for Pascal in 1973.
• A P-code emulator written in Algol 60, and one in Fortran, led to widespread usage of Pascal. (Why?)
12
T-diagrams
• A T-diagram represents Source language being
run in Host code to produce Target language.
[ Source → Target / Host ]   (a translator from Source to Target, written in Host code)
• Let two compilers run on the same host machine.
One compiler translates from language Start into
an intermediate language IL and the other com-
piler translates from IL into language Final.
[ Start → IL / Host ]  +  [ IL → Final / Host ]  ⇒  [ Start → Final / Host ]
We have produced a system that can compile from
Start into Final.
13
T-diagrams
• One compiler for Pascal creates P-code, but runs
on machine M.
• Another processor running on M can generate code
for machine N.
[ Pascal → P-code / M ]  +  [ M → N / M ]  ⇒  [ Pascal → P-code / N ]
• We have produced a system that can compile from
Pascal into P-code on a new machine.
14
T-diagrams
• define the SEPL language.
• write an interpreter for it.
• develop a machine emulator—or use an available
one.
• develop a compiler that compiles to our machine's machine code.
• add an optimizing phase to the compiler.
• alter the compiler to produce code for another ma-
chine.
16
Students' Educational Programming Language (SEPL)
Various projects lie ahead.
• Define the SEPL language—Louden calls his ‘TINY’
• Develop its syntax and informal semantics.
• Write an interpreter for it using flex/lex and
bison/yacc.
• Decide on target machine.
• Develop a machine emulator for the target or use
a real machine.
• Develop a compiler that produces executable code.
• Introduce optimization phase—not really enough
time.
• How much time required to produce the compiler?
17
Scanning—Lexical analysis
(Louden Chapter 2)
• tokens from lexemes—this is done quite well by flex.
• regular expressions (Louden p 38).
• extension of notation for regular expressions—does not give the notation any more power, but simplifies its practical use.
• regular expressions are widely used: flex, vim, sed, emacs, python, bash, tcl/Tk, grep, awk, perl, etc.
• regular expressions and FSAs (Louden p 47–).
• DFSA–FSA relationship (Louden p 46–72).
• minimization of number of states.
• Louden’s TINY scanner: gives insight into the direct connection between an FSA and a scanner. (Louden ¶2.5)
• application of flex for scanning—lexical analysis.
18
Context-free languages (CFLs) and syntax analysis
(Louden Chapter 3)
• Syntax analysers are based on CFLs.
list of tokens → [ syntax analyser ] → abstract syntax tree
• syntaxtree = analyse();
19
Parse trees
• have dynamic structure.
• recursive structure.
• tree keeps track of attributes such as:
types,
scope,
liveness,
nesting and
values.
  assignment
  ├── subscript expression : integer
  │   ├── id a : integer[]
  │   └── id i : integer
  └── number 6 : integer
e.g. a[i] = 6;
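Such an attributed tree is easy to sketch as a data structure; the class and attribute names below are illustrative, not Louden's:

```python
# A sketch of a tree node that records attributes such as type.
class Node:
    def __init__(self, kind, children=(), **attrs):
        self.kind = kind
        self.children = list(children)
        self.attrs = attrs

# The tree for  a[i] = 6;  -- each node carries its type attribute.
tree = Node("assignment",
            [Node("subscript-expression",
                  [Node("id", name="a", type="integer[]"),
                   Node("id", name="i", type="integer")],
                  type="integer"),
             Node("number", value=6, type="integer")])

print(tree.children[0].attrs["type"])   # integer
```

The recursive structure mirrors the recursive grammar: each child is itself a complete subtree.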
20
Context-free grammars (CFGs)
(Louden ¶3.2)
• Formally a CFG is a four-tuple G = (N, T, P, S) where N and T are alphabets, N is the set of non-terminals—or variables—and T is the set of terminals, P ⊆ N × (N ∪ T )∗ is the set of production rules and S ∈ N is the start symbol.
• Example:
N = {exp, op},
T = {number, +, −, ∗},
P = {exp → exp op exp | (exp) | number,
     op → + | − | ∗} and
S = exp
• Note that number is treated as a token.
• The source string (117 − 17) ∗ 5 is first tokenized to (number − number) ∗ number before it is analysed.
• P1 = {E → E O E | (E) | n, O → + | − | ∗} is a set of productions not different from P.
21
Derivations
• sentential form: any string ∈ (N ∪ T )∗ derived
from S, the start symbol.
• direct derivation: one production is applied to a
part of a sentential form, matching a non-terminal
and replacing it with the right-hand side of a
production for that non-terminal.
• Example:
The production exp → (exp) can be applied to
bring about the direct derivation
exp ∗ number ⇒ (exp) ∗ number.
• derivation: a chain of direct derivations is
applied one after the other to transform the
sentential form s0 to another sentential form sn.
It is written as s0 ∗⇒ sn.
• language: all strings s ∈ T ∗ that can be derived
from the start symbol S, symbolically:
L(G) = {s ∈ T ∗ | S ∗⇒ s}.
22
Derivation: exp ∗⇒ (number − number) ∗ number

(1) exp ⇒ exp op exp                    [exp → exp op exp]
(2)     ⇒ exp op number                 [exp → number]
(3)     ⇒ exp ∗ number                  [op → ∗]
(4)     ⇒ (exp) ∗ number                [exp → (exp)]
(5)     ⇒ (exp op exp) ∗ number         [exp → exp op exp]
(6)     ⇒ (exp op number) ∗ number      [exp → number]
(7)     ⇒ (exp − number) ∗ number       [op → −]
(8)     ⇒ (number − number) ∗ number    [exp → number]
23
language, sentence, examples
• language: all strings s ∈ T ∗ that can be derived from the start
symbol S, symbolically: L(G) = {s ∈ T ∗ | S ∗⇒ s}
• sentence: the elements of the language L(G), s ∈ L(G), are known
as sentences.

Example: G = ({E}, {a, (, )}, {E → (E) | a}, E)
E → a, i.e. E ⇒ a, i.e. E ∗⇒ a, so a ∈ L(G). Similarly
E ⇒ (E) ⇒ (a), i.e. E ∗⇒ (a), and E ⇒ (E) ⇒ ((E)) ⇒ ((a)),
i.e. E ∗⇒ ((a)).

Theorem: E ∗⇒ (^n a )^n, ∀n ∈ N0, where (^n a )^n means n opening
parentheses, an a, and then n closing parentheses.

Proof: Using induction.
P0: E ∗⇒ (^0 a )^0 = a, since E → a.
P1: E ∗⇒ (a), because E ⇒ (E) ⇒ (a).
Pk ⇒ Pk+1: Assume that Pk holds, i.e. E ∗⇒ (^k a )^k. Now E → (E),
in other words E ⇒ (E) ∗⇒ ((^k a )^k) ≡ (^(k+1) a )^(k+1), and so
E ∗⇒ (^n a )^n, ∀n ∈ N0,
i.e. L(G) = {(^n a )^n | n ∈ N0}.
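The theorem can be checked mechanically by replaying the derivation; a quick sketch:

```python
# L(G) = {(^n a )^n | n in N0} for E -> (E) | a: unfold E -> (E)
# n times, then finish with E -> a.
def sentence(n):
    s = "E"
    for _ in range(n):
        s = s.replace("E", "(E)")   # apply E -> (E)
    return s.replace("E", "a")      # apply E -> a

print([sentence(n) for n in range(4)])   # ['a', '(a)', '((a))', '(((a)))']
```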
24
Examples
Problem with an empty base

If P = {E → (E)} then L(G) = { } = ∅. The language is empty because
it is impossible to form the bases P0 or P1. Since the base does not
exist an infinite regress ensues.

However, we can prove that E ∗⇒ (^n E )^n, but this is of little
value, since E cannot be reduced to a terminal.

CFL using regular expressions

If P = {E → E + a | a}, then L(G) = L(a(+a)∗),
where a(+a)∗ ≡ {a, a + a, a + a + a, . . .}.
25
An if -statement
G = ({statement, if -statement, expression},
     {0, 1, if, else, other},
     {statement → if -statement | other,
      if -statement → if (expression) statement
                    | if (expression) statement else statement,
      expression → 0 | 1},
     statement)

and L(G) =

{ other, if (0) other, if (1) other,
  if (0) other else other, if (1) other else other,
  if (0) if (0) other, if (1) if (0) other,
  if (0) if (1) other, if (1) if (1) other,
  if (0) if (0) other else other,
  if (1) if (0) other else other,
  if (0) if (1) other else other,
  if (1) if (1) other else other, . . . }
26
The use of ε
Consider the grammar—we only show the productions P:

{statement → if -statement | other,
 if -statement → if (expression) statement
               | if (expression) statement else statement,
 expression → 0 | 1}

It may be rewritten using an ε-grammar as follows:

{statement → if -statement | other,
 if -statement → if (expression) statement else-part,
 else-part → else statement | ε,
 expression → 0 | 1}

ε is also useful for lists:

list → statement ; list | statement
statement → s

This generates the language L(G) = {s, s; s, s; s; s, . . .} ≡ s+.
It is rewritten using ε as follows:

list → non-ε-list | ε
non-ε-list → statement ; non-ε-list | statement
statement → s
27
Left- and right recursion
The regular language a+ is represented as follows with
left recursive productions: A → Aa | a. a ∈ L(G)
since A → a, thus A ∗⇒ a; but also A → Aa, so A ∗⇒ aa,
and A may again be replaced in A → Aa, so that A ∗⇒ aaa.
It is simple to prove with mathematical induction that L(G) = a+.
Our notation is rather informal: the set represented by
a+ would more exactly be written L(a+),
which represents the set {a, aa, aaa, . . .}.
Similarly we can prove that a grammar using the right
recursive productions A → aA | a generates the same
language.
How is a∗ represented?
A → Aa | ε or using A → aA | ε
What is L(G) for the grammar with the productions
A → (A)A | ε?
28
Parse trees and abstract syntax trees (ASTs)
• It is convenient to distinguish between a parse tree
and an abstract syntax tree.
• An abstract syntax tree is often called a syntax
tree.
• A parse tree contains all the information concerning the syntactic structure of the derivation.
• Consider the parse tree and its corresponding stripped
down (abstract) syntax tree generated by the
derivation on the next slide.
• Syntax trees usually show the actual values at the
terminals and not merely the tokens.
29
Right derivation for exp ∗⇒ (number − number) ∗ number

The derivation below is executed in a determinate order. The
rightmost non-terminal is replaced in each step until no more
non-terminals remain.

(1) exp ⇒ exp op exp                    [exp → exp op exp]
(2)     ⇒ exp op number                 [exp → number]
(3)     ⇒ exp ∗ number                  [op → ∗]
(4)     ⇒ (exp) ∗ number                [exp → (exp)]
(5)     ⇒ (exp op exp) ∗ number         [exp → exp op exp]
(6)     ⇒ (exp op number) ∗ number      [exp → number]
(7)     ⇒ (exp − number) ∗ number       [op → −]
(8)     ⇒ (number − number) ∗ number    [exp → number]
64
Parse tree and syntax tree for the derivation
exp ∗⇒ (29 - 11) * 47

• Parse tree for (29 - 11) * 47:

  exp
  ├── exp
  │   ├── (
  │   ├── exp
  │   │   ├── exp ── number 29
  │   │   ├── op ── −
  │   │   └── exp ── number 11
  │   └── )
  ├── op ── ∗
  └── exp ── number 47

• Syntax tree for (29 - 11) * 47:

  ∗
  ├── −
  │   ├── 29
  │   └── 11
  └── 47
65
Parse tree for right derivation of
exp ∗⇒ (number − number) ∗ number

  exp
  ├── exp
  │   ├── (
  │   ├── exp
  │   │   ├── exp ── number
  │   │   ├── op ── −
  │   │   └── exp ── number
  │   └── )
  ├── op ── ∗
  └── exp ── number
65
Leftmost derivation for exp ∗⇒ (number − number) ∗ number

The derivation below is executed in a determinate order. The leftmost
non-terminal of the sentential form is replaced each time until there
are no more non-terminals.

(1) exp ⇒ exp op exp                    [exp → exp op exp]
(2)     ⇒ (exp) op exp                  [exp → (exp)]
(3)     ⇒ (exp op exp) op exp           [exp → exp op exp]
(4)     ⇒ (number op exp) op exp        [exp → number]
(5)     ⇒ (number − exp) op exp         [op → −]
(6)     ⇒ (number − number) op exp      [exp → number]
(7)     ⇒ (number − number) ∗ exp       [op → ∗]
(8)     ⇒ (number − number) ∗ number    [exp → number]
66
A parse tree for the derivation of
exp ∗⇒ (number − number) ∗ number

  exp
  ├── exp
  │   ├── (
  │   ├── exp
  │   │   ├── exp ── number
  │   │   ├── op ── −
  │   │   └── exp ── number
  │   └── )
  ├── op ── ∗
  └── exp ── number
67
Right and left derivations for number + number

A left derivation:

(1) exp ⇒ exp op exp
(2)     ⇒ number op exp
(3)     ⇒ number + exp
(4)     ⇒ number + number

  exp
  ├── exp ── number
  ├── op ── +
  └── exp ── number
68
Rightmost derivation
A rightmost derivation for number + number:

(1) exp ⇒ exp op exp
(2)     ⇒ exp op number
(3)     ⇒ exp + number
(4)     ⇒ number + number

  exp
  ├── exp ── number
  ├── op ── +
  └── exp ── number
69
Ambiguous grammars
The grammar with

P = { exp → exp op exp | (exp) | number,
      op → + | − | ∗ }

is ambiguous because some strings—for example
number − number ∗ number—have two different parse trees. Each such
string therefore also has two different leftmost (and rightmost)
derivations, because each parse tree has a unique leftmost derivation.

  exp
  ├── exp
  │   ├── exp ── number
  │   ├── op ── −
  │   └── exp ── number
  ├── op ── ∗
  └── exp ── number
and now the other tree.
70
Ambiguous grammars
A different parse tree for number − number ∗ number:

  exp
  ├── exp ── number
  ├── op ── −
  └── exp
      ├── exp ── number
      ├── op ── ∗
      └── exp ── number

Ambiguous: if two different parse trees can be derived for the same
string from a given grammar, then the grammar is ambiguous.

It is preferable to use an unambiguous grammar for defining a
computing language.

Ambiguity can be eliminated in two ways: the grammar can be altered
so that it becomes unambiguous, or—the way bison/yacc does
it—precedence rules or association rules can be applied where there
are ambiguities.
71
The dangling else problem (Louden p. 120–123)

The string if (0) if (1) other else other has two parse trees. This
is the dangling else problem.

With the else bound to the inner if:

  statement
  └── if-statement
      ├── if ( exp 0 )
      └── statement
          └── if-statement
              ├── if ( exp 1 )
              ├── statement ── other
              ├── else
              └── statement ── other

With the else bound to the outer if:

  statement
  └── if-statement
      ├── if ( exp 0 )
      ├── statement
      │   └── if-statement
      │       ├── if ( exp 1 )
      │       └── statement ── other
      ├── else
      └── statement ── other
72
The dangling else problem
The C code
if (x != 0)
    if (y == 1/x) OK = TRUE;
    else z = 1/x;

could have had two interpretations:

if (x != 0) {                      if (x != 0) {
    if (y == 1/x) OK = TRUE;           if (y == 1/x) OK = TRUE;
}                                      else z = 1/x;
else z = 1/x;                      }

C disambiguates if with the most closely nested rule, which resolves
the ambiguity in favour of the right-hand interpretation: the else
binds to the nearest unmatched if.
The grammar rules may be adapted as follows:
if -statement → matched | unmatched
matched → if (exp) matched else matched
        | other
unmatched → if (exp) if -statement
          | if (exp) matched else unmatched
exp → 0 | 1
The next slide shows the unambiguous parse tree.
73
An unambiguous grammar for C’s if-statement
if -statement → matched | unmatched
matched → if (exp) matched else matched
        | other
unmatched → if (exp) if -statement
          | if (exp) matched else unmatched
exp → 0 | 1

  if-statement
  └── unmatched
      ├── if ( exp 0 )
      └── if-statement
          └── matched
              ├── if ( exp 1 )
              ├── matched ── other
              ├── else
              └── matched ── other
74
Representations of syntax: BNF
• BNF—Backus-Naur form.
The metasymbol ::= is used like → in production
rules, and | separates alternatives. Angle brackets <
and > delimit non-terminals. Terminals are written
in plain text, or in bold face.
The code below defines a <program>:
<program> ::= program
<declaration-list>
begin
<statement-list>
end.
A program starts with program, and is followed
by a list of declarations, then a begin, and a list
of statements terminated with end and a full stop.
• EBNF—Extended BNF. BNF was made more con-
venient to use by extending it slightly.
75
Representations of syntax: EBNF
• EBNF—Extended BNF.
Put optional items inside brackets [ and ],
<if-statement> ::=
if <boolean> then <statement-list>
[else <statement-list>]
end if ;
Repetition is done using braces, { and }.
<identifier> ::=
<letter> { <letter> | <digit> }
An <identifier> is a word that starts with a
letter and is followed by any number of letters or digits.

<statement-list> ::=
<statement> { ; <statement> }

A <statement-list> is a <statement> or a
list of <statement>s separated by semicolons.
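EBNF repetition translates directly into iteration. A sketch of the <identifier> rule, using Python's letter and digit predicates as stand-ins for <letter> and <digit>:

```python
# <identifier> ::= <letter> { <letter> | <digit> }
# One letter, then any number of letters or digits.
def is_identifier(word):
    return (len(word) > 0
            and word[0].isalpha()
            and all(c.isalnum() for c in word[1:]))

print(is_identifier("x42"))   # True
print(is_identifier("4x"))    # False
```

The braces in the EBNF become the loop over `word[1:]`; the mandatory first <letter> becomes the explicit check on `word[0]`.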
76
Representations of syntax: EBNF
• tramline diagrams—used by Wirth for Pascal, and
for ANS Fortran.
• two-level grammar—Algol 68.
• etc.
77
The Chomsky hierarchy (Louden p. 131)
Chomsky type: Description

3: Regular languages. Let A ∈ N and α ∈ T ; then productions in the
grammar have the form A → α or A → Aα—or alternatively the recursion
may be on the right: A → αA. Only one kind of recursion, left or
right, may be present—otherwise G is merely context free.

2: Context-free languages. Let A ∈ N and γ ∈ (N ∪ T )∗ and A → γ.
In a context-free language A can always be replaced in any context
by γ.

1: Context-sensitive languages. If the production A → γ is in a
context-sensitive language, then it may be applied only in a
predetermined context, i.e. A may produce γ only if A appears in a
given context, e.g. αAβ → αγβ, where α ≠ ε. Such a rule is context
sensitive. An example of context sensitivity is the restriction that
variables must be declared before they may be used.

0: Phrase-structure grammars are the most powerful.
79
Top-down parsing(Louden Chapter 4)
• Recursive-descent
• LL(1) parsing
• first and follow sets
• Error recovery in
top-down-parsers
80
Top-down parsing
• A top-down parser executes a leftmost derivation. It starts from
the start symbol and works its way down to the terminals in the form
of tokens.
• Predictive parser: attempts to forecast the next construction by
using lookahead tokens.
• Backtracking parser: attempts different possibilities for parsing
the known input, and backs up when it hits dead ends.
  – Slower than predictive parsers.
  – May use exponential time.
  – More powerful.
• Recursive-descent parsing is usually applied in hand-written
compilers—Wirth’s compilers often use RD parsers. Your 1st-year
compiler was RD.
• LL(1) parsing: the first L means the input is read from left to
right; the second L means the derivation is leftmost. The 1 means
that only one token is used to predict the progress of the parser.
81
LL(1) parsing
• LL(1) parsers work from left to right through the
input and follow a leftmost derivation that uses
one lookahead token.
• Viable-prefix property—in such languages it is very easy to see
that there is an error as soon as the lookahead token does not
correspond with what we expect. The viable prefix corresponds to
first.
• LL(k) parsers are also possible where k > 1.
More difficult to see errors.
• first and follow sets derived from the grammar
are used to construct the tables that will be used
for LL(1) parsing.
82
first and follow sets
• The set first(X), where X is a terminal or ε, is simply {X}.
• Suppose X is a nonterminal; then first(X) is the set of all
terminals x such that X ∗⇒ xβ, where β ∈ (N ∪ T )∗ and may be ε.
• ε ∈ first(X) if X ∗⇒ ε.
• In other words first is the set of leading terminals of the
sentential forms derivable from X.
• The definition may be altered to accommodate LL(k) parsers by
replacing x with strings of k terminals, or with shorter strings
when X derives fewer than k terminals. (See also Louden p. 168)
83
first sets
• In the grammar for arithmetic expressions:
exp → exp addop term | term
addop → + | −
term → term mulop factor | factor
mulop → ∗
factor → ( exp ) | number
• first(addop) = { +, −}
• first(mulop) = { ∗ }
• first(exp) = { (, number}
• first(term) = { (, number}
• first(factor) = { (, number}
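These first sets can also be computed mechanically by iterating until nothing changes; the left recursion is harmless because the sets only ever grow. A sketch for this (ε-free) grammar:

```python
# Compute first sets by iterating to a fixed point. Because this
# grammar has no ε-productions, first of a right-hand side is just
# first of its leading symbol.
GRAMMAR = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["number"]],
}

def first_sets(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                head = rhs[0]
                add = first[head] if head in grammar else {head}
                if not add <= first[a]:
                    first[a] |= add
                    changed = True
    return first

print(sorted(first_sets(GRAMMAR)["exp"]))   # ['(', 'number']
```

The result agrees with the sets listed above.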
84
first in the grammar for an if -statement
G = ({statement, if -statement, else-part, expression},
     {0, 1, if, else, rest},
     {statement → if -statement | rest,
      if -statement → if (expression) statement else-part,
      else-part → else statement | ε,
      expression → 0 | 1},
     statement)
first(statement) = {if, rest}
first(expression) = {0, 1}
first(if -statement) = {if}
first(else-part) = {else, ε}
85
Basic LL(1) parsing(Louden p. 152)
• LL(1) parsers use a push-down stack rather than backtracking from
recursive procedure calls.
• Consider S → ( S ) S | ε
• Initialize the stack to $ S.
• Parse actions:

      Parsing stack   Input   Action
   1  $ S             ()$     S → (S)S
   2  $ S ) S (       ()$     match
   3  $ S ) S          )$     S → ε
   4  $ S )            )$     match
   5  $ S               $     S → ε
   6  $                 $     accept

• Two actions:
1. Replace A ∈ N at the top of the stack by α, where A → α and
α ∈ (N ∪ T )∗, and
2. Match the token on top of the stack with the next input token.
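The two actions and the table fit in a few lines; a sketch, with $ marking both the stack bottom and the end of input:

```python
# Table-driven LL(1) parsing for S -> ( S ) S | ε.  The table says:
# on '(' expand S -> ( S ) S; on ')' or '$' expand S -> ε.
TABLE = {
    ("S", "("): ["(", "S", ")", "S"],
    ("S", ")"): [],            # S -> ε
    ("S", "$"): [],            # S -> ε
}

def parse(tokens):
    tokens = list(tokens) + ["$"]
    stack = ["$", "S"]
    i = 0
    while stack:
        top = stack.pop()
        look = tokens[i]
        if top == look:                      # match (also consumes $)
            i += 1
        elif (top, look) in TABLE:           # replace A by its RHS,
            stack.extend(reversed(TABLE[(top, look)]))  # pushed reversed
        else:
            return False                     # parse error
    return i == len(tokens)

print(parse("()"))      # True
print(parse("(()"))     # False
```

Pushing the right-hand side reversed is what makes its leftmost symbol appear on top of the stack, ready to be matched first.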
86
LL(1) parsing
• Parse actions:

      Parsing stack   Input   Action
   1  $ S             ()$     S → (S)S
   2  $ S ) S (       ()$     match
   3  $ S ) S          )$     S → ε
   4  $ S )            )$     match
   5  $ S               $     S → ε
   6  $                 $     accept

• At step 1 the stack contains $ S and the input is ()$.
• Apply rule S → (S)S.
• The right-hand side is pushed item-by-item onto the stack so that
it appears reversed.
• At step 2 the ( on top of the stack matches the token at the start
of the input, so both are removed.
87
LL(1)-recursion-free productions for arithmetic(Louden p.160)
exp → term exp′
exp′ → addop term exp′ | ε
addop → + | −
term → factor term′
term′ → mulop factor term′ | ε
mulop → ∗
factor → ( exp ) | number
88
Parse tree and syntax tree for 3-4-5(Louden p. 161)
• The parse tree for the expression 3 − 4 − 5 does
not represent the left associativity of subtraction.
• The parser should still construct the left associa-
tive syntax tree.
1. The value 3 must be passed up to the root exp.
2. The root exp hands 3 down to exp′, which subtracts 4 from it.
3. The resulting −1 is passed down to the next exp′,
4. which subtracts 5 yielding −6,
5. which is passed to the next exp′.
6. The rightmost exp′ has an ε child and finally passes the −6 back
to the root exp.
87
Building the syntax tree with anLL(1)-grammar
Implement exp → term exp′ as follows (exp′ is written exp_prime so
that the name is legal C):

void exp(void)
{ term(); exp_prime(); }

To compute the value of the expression it is rewritten as:

int exp(void)
{ int temp;
  temp = term();
  return exp_prime(temp);
}
88
Code for arithmetic
The code for exp′ → addop term exp′ | ε is

void exp_prime(void) {
  switch (token) {
  case '+': match('+'); term(); exp_prime(); break;
  case '-': match('-'); term(); exp_prime(); break;
  }
}

To compute the value of the expression it could be rewritten as:

int exp_prime(int val) {
  switch (token) {
  case '+': match('+'); val += term(); return exp_prime(val);
  case '-': match('-'); val -= term(); return exp_prime(val);
  default: return val;
  }
}

Note that exp′ requires a parameter passed from exp.
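The parameter-passing version can be run as a sketch in Python, with exp′ written as a loop over addops (equivalent to the tail recursion here) and tokens as (kind, value) pairs; all names are illustrative:

```python
# exp -> term exp'  and  exp' -> addop term exp' | ε, with the value
# threaded through exp' exactly like the parameter `val` above.
def parse_exp(tokens):
    pos = 0
    def look():
        return tokens[pos][0] if pos < len(tokens) else "$"
    def match(kind):
        nonlocal pos
        assert look() == kind, "parse error"
        tok = tokens[pos]
        pos += 1
        return tok
    def term():
        return match("number")[1]   # full parser would handle *, ( )
    def exp_prime(val):
        while look() in ("+", "-"):  # ε-case: fall out of the loop
            if look() == "+":
                match("+"); val += term()
            else:
                match("-"); val -= term()
        return val
    return exp_prime(term())         # exp -> term exp'

toks = [("number", 3), ("-", None), ("number", 4), ("-", None), ("number", 5)]
print(parse_exp(toks))    # -6, i.e. (3 - 4) - 5: left associative
```

Threading the accumulated value through exp′ is what recovers left associativity from a right-recursive grammar.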
89
Left factoring
Left factoring is needed when right-hand sides of
productions share a common prefix, e.g.
A → αβ | αγ
Typical practical examples are:
stmt-sequence → stmt ; stmt-sequence | stmt
stmt → s

and

if -stmt → if ( exp ) statement
         | if ( exp ) statement else statement
An LL(1) parser cannot distinguish between such pro-
ductions. The solution is to factor out the common
prefix as follows:
A → αA′, A′ → β | γ
For factoring to work properly α should be the longest
left prefix.
Louden gives a left-factoring algorithm and many ex-
amples on pp.164–166.
90
follow sets
• In this discussion we regard $ as a terminal.
• Recall that first(A) is the set of leading terminals
of the sentential forms derivable from A.
• Informally, follow(A) is the set of terminals that
may be derived from nonterminals appearing after
A on the right-hand side of productions, or it is
the set of those terminals that follow A in such
productions.
• Since $ is regarded as a terminal, if A is the start
symbol then $ is in follow(A).
• Formally: follow(A) is the set of terminals such
that if there is a production B → αAγ,
1. then first(γ) \ {ε} is contained in follow(A), and
2. if ε is in first(γ), then follow(A) contains follow(B).
• follow sets are only defined for nonterminals.
91
An algorithm for follow(A)—Algol style
for all nonterminals A do
    follow(A) := { };
follow(start-symbol) := {$};
while there are changes to any follow sets do
    for each production A → X1 X2 . . . Xn do
        for each Xi that is a nonterminal do
            add first(Xi+1 Xi+2 . . . Xn) \ {ε} to follow(Xi)
            /* Note: if i = n then Xi+1 Xi+2 . . . Xn = ε */
            if ε ∈ first(Xi+1 Xi+2 . . . Xn) then
                add follow(A) to follow(Xi)
92
An algorithm for follow(A)—C-style
for (all nonterminals A)
    follow(A) = { };
follow(start-symbol) = {$};
while (there are changes to any follow sets)
    for (each production A → X1 X2 . . . Xn)
        for (each Xi that is a nonterminal) {
            add first(Xi+1 Xi+2 . . . Xn) \ {ε} to follow(Xi);
            /* Note: if i = n then Xi+1 Xi+2 . . . Xn = ε */
            if (ε ∈ first(Xi+1 Xi+2 . . . Xn))
                add follow(A) to follow(Xi);
        }
93
Construct follow from the first set
• In the grammar for arithmetic expressions:
(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number
• first(addop) = { +, - }
• first(mulop) = { * }
• first(factor) = { (, number }
• first(term) = { (, number }
• first(exp) = { (, number }
94
Constructing follow from first
• In the grammar for arithmetic expressions:
(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number
• Ignore (3), (4), (7) and (9)—no RH nonterminals
• Set all follow(A) = { }; follow(exp) = {$}
• (1) affects follow of exp, addop and term
first(addop) is added to follow(exp), so
follow(exp) = { $, -,+} and
first(term) is added to follow(addop), so
follow(addop) = { (,number} and
follow(exp) is added to follow(term), so
follow(term) = { $,+, -}
95
Constructing follow from first
• In the grammar for arithmetic expressions:
(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number
• (2) causes follow(exp) to be added to follow(term),
which does not add anything new.
• (5) is similar to (1). first(mulop) is added to
follow(term), so
follow(term) = { $,+, -, *} and
first(factor) is added to follow(mulop), so
follow(mulop) = { (,number} and
follow(term) is added to follow(factor), so
follow(factor) = { $,+, -, *}
96
Constructing follow from first
• In the grammar for arithmetic expressions:
(1) exp → exp addop term
(2) exp → term
(3) addop → +
(4) addop → -
(5) term → term mulop factor
(6) term → factor
(7) mulop → *
(8) factor → ( exp )
(9) factor → number
• (6) adds follow(term) to follow(factor)—no effect.
• (8) adds first( ) ) = { ) } to follow(exp), so that
follow(exp) = { $, +, -, ) }
• During the second pass the ) in follow(exp) propagates through
follow(term) to follow(factor), so that
follow(term) = { $, +, -, *, ) } and follow(factor) = { $, +, -, *, ) }
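The whole first/follow computation can be replayed mechanically. A Python sketch for ε-free grammars (so the first of a suffix is just the first of its leading symbol), which suffices for this grammar:

```python
# Iterate first and follow to a fixed point for the arithmetic grammar.
GRAMMAR = {
    "exp":    [["exp", "addop", "term"], ["term"]],
    "addop":  [["+"], ["-"]],
    "term":   [["term", "mulop", "factor"], ["factor"]],
    "mulop":  [["*"]],
    "factor": [["(", "exp", ")"], ["number"]],
}

def first_and_follow(start="exp"):
    first = {a: set() for a in GRAMMAR}
    follow = {a: set() for a in GRAMMAR}
    follow[start] = {"$"}
    def first_of(x):                 # first of a single grammar symbol
        return first[x] if x in GRAMMAR else {x}
    changed = True
    while changed:
        changed = False
        for a, productions in GRAMMAR.items():
            for rhs in productions:
                # first(A) grows from the head of each right-hand side
                if not first_of(rhs[0]) <= first[a]:
                    first[a] |= first_of(rhs[0]); changed = True
                # follow(X_i) grows from whatever can appear after X_i
                for i, x in enumerate(rhs):
                    if x not in GRAMMAR:
                        continue
                    add = first_of(rhs[i + 1]) if i + 1 < len(rhs) else follow[a]
                    if not add <= follow[x]:
                        follow[x] |= add; changed = True
    return first, follow

first, follow = first_and_follow()
print(sorted(follow["exp"]))      # ['$', ')', '+', '-']
print(sorted(follow["factor"]))   # ['$', ')', '*', '+', '-']
```

The fixed-point loop plays the role of the "second pass" in the hand computation: it keeps sweeping until no set changes.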
97
Constructing LL(1) parse tables
The parse table M [A, a] contains productions added
according to the rules
1. If A → α is a production rule such that there is a derivation
α ∗⇒ aβ, where a is a token, then the rule A → α is added to M [A, a].
2. If A → α ∗⇒ ε is an ε-production and there is a derivation
S$ ∗⇒ αAaβ, where S is the start symbol and a is a token, or $, then
the production A → ε is added to M [A, a].

The token a in Rule 1 is in first(α) and the token a in Rule 2 is in
follow(A). This is repeatedly applied for each nonterminal A and each
production A → α:
1. For each token a in first(α), add A → α to the
entry M [A, a].
2. If ε ∈ first(α), for each element a ∈ follow(A),
add A → α to M [A, a].
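Applied to the parenthesis grammar S → ( S ) S | ε these two rules give a three-entry table. In the sketch below the first and follow sets are written down by hand rather than computed: first(( S ) S) = { ( }, first(ε) = { ε } and follow(S) = { ), $ }:

```python
# Rule 1: A -> α goes under every token in first(α).
# Rule 2: if ε ∈ first(α), A -> α also goes under every token in follow(A).
PRODUCTIONS = {"S": [["(", "S", ")", "S"], []]}   # [] is the ε right-hand side
FIRST_RHS = {("S", 0): {"("}, ("S", 1): {"ε"}}    # written down by hand
FOLLOW = {"S": {")", "$"}}

def build_table():
    table = {}
    for a, prods in PRODUCTIONS.items():
        for k, rhs in enumerate(prods):
            for tok in FIRST_RHS[(a, k)] - {"ε"}:     # Rule 1
                table[(a, tok)] = rhs
            if "ε" in FIRST_RHS[(a, k)]:              # Rule 2
                for tok in FOLLOW[a]:
                    table[(a, tok)] = rhs
    return table

M = build_table()
print(M[("S", "(")])   # ['(', 'S', ')', 'S']
print(M[("S", ")")])   # []
```

The resulting table is exactly the one the stack-driven LL(1) parser for this grammar consults at every step.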
98
Characterizing an LL(1) grammar
A grammar in BNF is LL(1) if the following conditions
are satisfied:
1. For every production A → α1 | α2 | . . . | αn,
first(αi) ∩ first(αj) is empty for all i, j ∈ [1..n], i ≠ j.
2. For every nonterminal A such that ε ∈ first(A),
first(A) ∩ follow(A) is empty.
99
Bottom-up parsing
• Overview.
• Finite automata of LR(0) items
and LR(0) parsing.
• SLR(1) parsing.
• General LR(1) and LALR(1) parsing.
• bison an LALR(1) parser generator.
• Generation of a parser using bison.
• Error recovery in bottom-up parsers.
101
Bottom-up parsing—an overview
• The most general bottom-up parser is the LR(1) parser—the L
indicates that the input is processed from left to right, the R
indicates that a rightmost derivation is applied, and the 1
indicates that a single token is used for lookahead.
• LR(0) parsers are also possible where there is no lookahead,
i.e. the “lookahead” token can be examined after it appears on the
parse stack.
• SLR(1) parsers improve on LR(0) parsing.
• An even more powerful method, but still not as
general as LR(1) parsers is the LALR(1) parser.
• Bottom-up parsers are generally more powerful than
their top-down counterparts—for example left re-
cursion can be handled.
• Bottom-up parsers are unsuitable for hand coding,
so parser generators like bison are used.
102
Bottom-up parsing—overview
• Parse stack contains tokens and nonterminals PLUS
state information.
• Parse stack starts empty and ends with start symbol
alone on the stack and an empty input string.
• Actions: shift, reduce and accept.
• A shift merely moves a token from the input to
the top of the stack.
• A reduce replaces the string α on top of the stack
with a nonterminal A, given A → α.
• Top-down parsers are generate-match parsers and
bottom-up parsers are shift-reduce parsers.
• If the grammar does not possess a unique start
symbol that only appears once in the grammar,
then bottom-up parsers are always augmented by
such a start symbol.
103
Bottom-up parse of ()
• Consider the grammar with P = {S → ( S ) S | ε}.
• Augment it by adding: S′ → S.
• A bottom-up parse for the parenthesis grammar of
() follows:
   Parsing stack    Input   Action
 1 $                ()$     shift
 2 $ (              )$      reduce S → ε
 3 $ ( S            )$      shift
 4 $ ( S )          $       reduce S → ε
 5 $ ( S ) S        $       reduce S → ( S ) S
 6 $ S              $       reduce S′ → S
 7 $ S′             $       accept
• The bottom-up parser looks deeper into its parse
stack and thus requires arbitrary stack lookahead.
• The derivation is: S′ ⇒ S ⇒ (S)S ⇒ (S) ⇒ ()
Clearly the rightmost nonterminal is replaced at
each derivation step.
104
A bottom-up parse of + grammar
• Consider the grammar with P ={E → E+n|n}.
• Augment it by adding: E′ → E.
• A bottom-up parse for the + grammar of n + n:
   Parsing stack    Input    Action
 1 $                n + n$   shift
 2 $ n              + n$     reduce E → n
 3 $ E              + n$     shift
 4 $ E +            n$       shift
 5 $ E + n          $        reduce E → E + n
 6 $ E              $        reduce E′ → E
 7 $ E′             $        accept
• The derivation is: E′ ⇒ E ⇒ E + n ⇒ n + n
We see that the rightmost nonterminal is replaced
at each derivation step.
105
Bottom-up parse—overview
   Parsing stack    Input    Action
 1 $                n + n$   shift
 2 $ n              + n$     reduce E → n
 3 $ E              + n$     shift
 4 $ E +            n$       shift
 5 $ E + n          $        reduce E → E + n
 6 $ E              $        reduce E′ → E
 7 $ E′             $        accept
• In derivation: E′ ⇒ E ⇒ E+n ⇒ n+n, each of
the intermediate strings is called a right sentential
form, and it is split between the parse stack and
the input.
• E+n occurs in step 3 of the parse as E‖+n, and
as E + ‖n in step 4, and finally as E + n‖.
• The string of symbols on top of the stack is called
a viable prefix of the right sentential form. E,
E+ and E + n are all viable prefixes of E + n.
• The viable prefixes of n + n are ε and n, but n+
and n + n are not.
106
Bottom-up parse—overview
• A shift-reduce parser will shift terminals to the
stack until it can perform a reduction to obtain
the next right sentential form.
• This occurs when the stack top matches the right-
hand side of a production.
• This string together with the position in the right
sentential form where it occurs and the production
used to reduce it, is known as the handle.
• Handles are unique in unambiguous grammars.
• The handle of n + n is thus E → n, and the
handle of E + n, to which the previous form is
reduced, is E → E + n.
• The main task of a shift-reduce parser is finding
the next handle.
107
Bottom-up parse—overview
   Parsing stack    Input   Action
 1 $                ()$     shift
 2 $ (              )$      reduce S → ε
 3 $ ( S            )$      shift
 4 $ ( S )          $       reduce S → ε
 5 $ ( S ) S        $       reduce S → ( S ) S
 6 $ S              $       reduce S′ → S
 7 $ S′             $       accept
• The main task of a shift-reduce parser is finding the next handle.
• Reductions only occur when the reduced string is a right sentential form.
• In step 3 above the reduction S → ε cannot be performed because the resulting string after the shift of ) onto the stack would be (S S) which is not a right sentential form. Thus S → ε is not a handle at this position of the sentential form (S.
• To reduce with S → (S)S the parser knows that (S)S appears on the right of a production and that it is already on the stack by using a DFA of “items”.
108
LR(0) items
• The grammar with P = {S′ → S, S → (S)S | ε}
has three productions and eight LR(0) items:
S′ → .S      S′ → S.
S → .(S)S    S → (.S)S
S → (S.)S    S → (S).S
S → (S)S.    S → .
• When P = {E′ → E, E → E + n| n} there are
three productions and eight LR(0) items:
E′ → .E      E′ → E.
E → .E + n   E → E. + n
E → E + .n   E → E + n.
E → .n       E → n.
109
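The counting rule implicit in these examples—a production with k symbols on its right-hand side yields k + 1 items—can be sketched in a few lines of code. This is an illustrative fragment, not part of the notes; the representation of an item as a (lhs, α, β) triple is my own.

```python
# Enumerate the LR(0) items of a grammar: a production A -> X1...Xk
# contributes k + 1 items, one for each position of the dot.

def lr0_items(productions):
    """productions: list of (lhs, rhs) pairs, rhs a tuple of symbols."""
    items = []
    for lhs, rhs in productions:
        for dot in range(len(rhs) + 1):
            # (A, alpha, beta) stands for the item A -> alpha . beta
            items.append((lhs, rhs[:dot], rhs[dot:]))
    return items

# The + grammar E' -> E, E -> E + n | n:
P = [("E'", ("E",)), ("E", ("E", "+", "n")), ("E", ("n",))]
print(len(lr0_items(P)))  # 2 + 4 + 2 = 8 items, as on the slide
```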
LR(0) parsing—LR(0) items
• An LR(0) item of a CFG is a production with a
distinguished position in its right-hand side.
• The distinguished position is usually denoted by
the metasymbol ‘.’, i.e. a period.
• e.g. if A → α and β and γ are any two strings
of symbols including ε such that α = βγ then
A → .βγ, A → β.γ and A → βγ. are all LR(0)
items.
• They are called LR(0) items because they contain
no explicit reference to lookahead.
• The item “records” the recognition of the right-
hand side of a particular production.
• Specifically A → β.γ constructed from A → βγ
denotes that the β part has already been seen and
it may be possible to derive the next input tokens
from γ.
110
LR(0) parsing—LR(0) items
• The item A → .α indicates that A could be re-
duced from α—it is called an initial item.
• The item A → α. indicates that α is on the top of
the stack and may be the handle if A → α is used
to reduce α to A—it is called a complete item.
• The LR(0) items are used as states of a finite au-
tomaton that maintains information about the parse
stack and the progress of a shift-reduce parse.
111
LR(0) parsing—finite automata of items
• LR(0) items denote the states of a FSA that main-
tains the progress of a shift-reduce parse.
• One approach is to first construct a nondetermin-
istic FSA of LR(0) items and then derive a DFA
from it. Another approach is to construct theDFA
of sets of LR(0) items directly.
• What transitions are represented in the NFA of
LR(0) items?
• Suppose that the symbol X ∈ (N ∪ T ). Let A →
α.Xη be an LR(0) item which represents a state
reached where α has been recognized and where
the focal point is directly before X.
• If X is a token, then there is a transition on the
token X to the next LR(0) state: A → αX.η.
A → α.Xη —X→ A → αX.η
112
LR(0) parsing—finite automata of items
• We are considering A → α.Xη where the focal point is directly before X.
• Suppose that X is a nonterminal; then it cannot be directly matched with a token on the input stream. The transition:
A → α.Xη —X→ A → αX.η
corresponds to pushing X onto the stack as a result of a reduction of some β to X by applying the rule X → β.
• Such a reduction must be preceded by the recognition of β. The state denoted by X → .β represents the start of the process of recognizing β.
• So when X is a nonterminal, ε-transitions must also be provided: leaving from A → α.Xη for every production X → β with X on the left and going to the LR(0) state X → .β.
A → α.Xη —ε→ X → .β
113
LR(0) parsing—finite automata of items
• The two transitions:
A → α.Xη —X→ A → αX.η
and
A → α.Xη —ε→ X → .β
are the only ones in the NFA of LR(0) items.
• The start state of the NFA must correspond to
the initial conditions of the parser: the parse stack
is empty and the start symbol S is about to be
parsed, i.e. any initial item S → .α can be used.
• Since we want the start state to be unique, the
simple device of augmenting the grammar with a
new, unique start symbol S′ for which S′ → S
suffices.
• The start state then is S′ → .S.
114
LR(0) parsing—finite automata of items
• What are the accepting states of the NFA?
• The NFA does not need accepting states.
• The NFA is not being used to do the recognition
of the language.
• The NFA is merely being applied to keep track of
the state of the parse.
• The parser itself determines when it accepts an
input stream by determining that the input stream
is empty and the start symbol is on the top of the
parse stack.
115
LR(0) parsing—finite automata of items
• The grammar with P = {S′ → S, S → (S)S | ε}
has three productions and eight LR(0) items:
S′ → .S      S′ → S.
S → .(S)S    S → (.S)S
S → (S.)S    S → (S).S
S → (S)S.    S → .
• The NFA of LR(0) items for the S grammar:
[NFA of LR(0) items. Symbol transitions: S′ → .S —S→ S′ → S.;
S → .(S)S —(→ S → (.S)S —S→ S → (S.)S —)→ S → (S).S —S→ S → (S)S.
Each item with the dot before S (S′ → .S, S → (.S)S, S → (S).S) has
ε-transitions to the initial items S → .(S)S and S → . ]
• The next step is to produce the DFA that corre-
sponds to the NFA.
116
LR(0) parsing—converting the NFA into a DFA
[The NFA of LR(0) items for the S grammar, repeated from the previous slide.]
• Form the ε-closure of each LR(0) item.
• The closure always contains the set itself.
• Add each item for which there are ε-transitions
from the original set.
• Then recursively add all sets which are ε-reachable
from the sets already aggregated.
• Do this for every LR(0) item in the NFA.
• Add the terminal transitions that leave each ag-
gregate.
117
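The ε-closure steps above can be sketched directly over sets of LR(0) items. This is a hedged illustration rather than the notes' own code; the (lhs, dot, rhs) item representation and the function names are mine.

```python
# epsilon-closure over LR(0) items: whenever the dot stands before a
# nonterminal X, add the initial item X -> .beta for every production
# X -> beta, and repeat until nothing new appears.

def closure(items, productions, nonterminals):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, dot, rhs) in list(result):
            if dot < len(rhs) and rhs[dot] in nonterminals:
                for plhs, prhs in productions:
                    if plhs == rhs[dot] and (plhs, 0, prhs) not in result:
                        result.add((plhs, 0, prhs))
                        changed = True
    return result

# S grammar: S' -> S, S -> (S)S | epsilon
P = [("S'", ("S",)), ("S", ("(", "S", ")", "S")), ("S", ())]
state0 = closure({("S'", 0, ("S",))}, P, {"S'", "S"})
print(len(state0))  # 3: the items S' -> .S, S -> .(S)S and S -> .
```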
LR(0) parsing—an NFA and its corresponding DFA
• The NFA for the S grammar:
[The NFA of LR(0) items for the S grammar, repeated.]
• The DFA derived from the NFA:
[DFA of sets of LR(0) items:
State 0: S′ → .S, S → .(S)S, S → .
State 1: S′ → S.
State 2: S → (.S)S, S → .(S)S, S → .
State 3: S → (S.)S
State 4: S → (S).S, S → .(S)S, S → .
State 5: S → (S)S.
Transitions: 0 —(→ 2, 0 —S→ 1; 2 —(→ 2, 2 —S→ 3; 3 —)→ 4;
4 —(→ 2, 4 —S→ 5.]
118
LR(0) parsing—finite automata of items
• When P = {E′ → E, E → E + n| n} there are
three productions and eight LR(0) items:
E′ → .E      E′ → E.
E → .E + n   E → E. + n
E → E + .n   E → E + n.
E → .n       E → n.
• The NFA of LR(0) items for the E grammar:
[NFA of LR(0) items. Symbol transitions: E′ → .E —E→ E′ → E.;
E → .E + n —E→ E → E. + n —+→ E → E + .n —n→ E → E + n.;
E → .n —n→ E → n. The items with the dot before E (E′ → .E and
E → .E + n) have ε-transitions to the initial items E → .E + n and E → .n.]
• The next step is to produce the DFA that corre-
sponds to the NFA.
119
LR(0) parsing: NFA and equivalent DFA
• The NFA for the E grammar:
[The NFA of LR(0) items for the E grammar, repeated from the previous slide.]
• The DFA derived from the above NFA:
[DFA of sets of LR(0) items:
State 0: E′ → .E, E → .E + n, E → .n
State 1: E′ → E., E → E. + n
State 2: E → n.
State 3: E → E + .n
State 4: E → E + n.
Transitions: 0 —E→ 1, 0 —n→ 2; 1 —+→ 3; 3 —n→ 4.]
• The items that are added by the ε-closure are known as closure items and those items that originate the state are called kernel items.
120
LR(0) parsing
• The LR(0) algorithm keeps track of the current state in the DFA of LR(0) items.
• The parse stack need hold only state numbers
since they represent all the necessary information.
• For the sake of simplifying the description of the algorithm the symbol will also be pushed onto the parse stack before the state number.
• The parse starts with:
   Parsing stack    Input
 1 $ 0              input string$
• Suppose the token n is shifted onto the stack and the next state is 2:
   Parsing stack    Input
 2 $ 0 n 2          rest of input string$
• The LR(0) parsing algorithm chooses its next action depending on the state on the top of the stack and the current input token.
121
The LR(0) parsing algorithm
Let s be the current state.
1. If state s contains the item A → α.Xβ where X
is a terminal, then the action is a shift.
• If the token is X then the next state is the one
containing A → αX.β.
• If the token is not X then there is an error.
2. If s contains a complete item such as A → γ.
then the action is to reduce by the rule A → γ.
• When the start symbol S is reduced by the rule
S′ → S and the input is empty, then accept;
if it is not empty then announce an error.
• In every other case the next state is computed
as follows:
(a) Pop γ off the stack.
(b) Set s = top, which contains an item B → α.Aβ.
(c) push(A) and push the state containing B → αA.β.
122
LR(0) parsing—shift-reduce and reduce-reduce conflicts
• A grammar is said to be an LR(0) grammar if the
parser rules are unambiguous.
• If a state contains the complete item A → α. then
it can contain no other items.
• If such a state were also to contain the shift item
A → α.Xβ, where X is a terminal, then an ambi-
guity arises as to whether action (1) or (2) must be
executed. This is called a shift-reduce conflict.
• If such a state were also to contain another com-
plete item B → β., then an ambiguity arises as to
which production to apply—A → α. or B → β.—
this is known as a reduce-reduce conflict.
• A grammar is therefore LR(0) if and only if each
state is either a shift state or a reduce state con-
taining a single complete item.
123
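The LR(0) condition above can be checked mechanically, state by state. A sketch follows (the item representation as (lhs, dot, rhs) triples is my own, not the notes'):

```python
# Classify an LR(0) state: a grammar is LR(0) only if every state is a
# pure shift state or a reduce state with a single complete item.

def classify(state, terminals):
    complete = [it for it in state if it[1] == len(it[2])]   # dot at end
    shifts = [it for it in state
              if it[1] < len(it[2]) and it[2][it[1]] in terminals]
    if len(complete) > 1:
        return "reduce-reduce conflict"
    if complete and shifts:
        return "shift-reduce conflict"
    return "reduce" if complete else "shift"

# State 0 of the S grammar mixes the shift item S -> .(S)S with the
# complete item S -> . , so the grammar is not LR(0):
state0 = {("S'", 0, ("S",)),
          ("S", 0, ("(", "S", ")", "S")),
          ("S", 0, ())}
print(classify(state0, {"(", ")"}))  # shift-reduce conflict
```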
SLR(1) parsing
• The SLR(1) parsing algorithm.
• Disambiguating rules for parsing conflicts.
• Limits of SLR(1) parsing power.
• SLR(k) grammars.
124
The SLR(1) parsing algorithm
• Simple LR(1), i.e. SLR(1) parsing, uses a DFA of
sets of LR(0) items.
• The power of LR(0) is significantly increased by
using the next token in the input stream to direct
its actions in two ways:
1. The input token is consulted before a shift is
made, to ensure that an appropriate DFA transition
exists, and
2. It uses the follow set of a nonterminal to
decide if a reduction should be performed.
• This is powerful enough to parse almost all com-
mon language constructs.
125
The SLR(1) parsing algorithm
Let s be the current state, i.e. the state on top of the
stack.
1. If s contains any item of the form A → α.Xβ,
where X is the next token in the input stream,
then shift X onto the stack and push the state
containing the item A → αX.β.
2. If s contains the complete item A → γ. and the
next token in the input stream is in follow(A),
then reduce by the rule A → γ.—more details
follow on next slide.
3. If the next input token is not accommodated by
(1) or (2), then an error is declared.
126
The SLR(1) parsing algorithm—2.
2. If s contains the complete item A → γ. and the
next token in the input stream is in follow(A),
then reduce by the rule A → γ.
The reduction by S′ → S, where S is the start
symbol, and the next token is $, implies acceptance,
otherwise the new state is computed as follows:
(a) Remove the string γ and all its corresponding
states from the parse stack.
(b) Back up the DFA to the state where the con-
struction of γ started.
(c) By construction, this state contains an item of
the form B → γ.Aβ. Push A onto the stack
and push the item containing B → γA.β.
127
SLR(1) grammar
A grammar is an SLR(1) grammar if the application of
the SLR(1) parsing rules does not result in an ambiguity.
A grammar is an SLR(1) grammar ⇐⇒ for every state s:
1. For any item A → α.Xβ ∈ s, where X is a token,
there is no complete item B → γ. in s with X ∈
follow(B).
A violation of this condition is a shift-reduce
conflict.
2. For any two complete items A → α. ∈ s and
B → β. ∈ s, follow(A) ∩ follow(B) = ∅.
A violation of this condition is a reduce-reduce
conflict.
128
Table-driven SLR(1) grammar
• The grammar with P = {E′ → E,E → E +
n|n} is not LR(0) but is SLR(1) and its DFA of
sets of items is:
0.
1.
3.2. 4.
E′ → .E E′ → E.E → .E + nE → .n
E → n.
E → E. + n
E → E + .n E → E + n.
E
+n
n
• follow(E′) = {$}, and follow(E) = {$,+}
• The entry at (1, $) is accept instead of r(E′ → E)
State    n    +             $             Go to E
  0      s2                                  1
  1           s3            accept
  2           r(E → n)      r(E → n)
  3      s4
  4           r(E → E + n)  r(E → E + n)
129
SLR(1) parse of n + n + n
State    n    +             $             Go to E
  0      s2                                  1
  1           s3            accept
  2           r(E → n)      r(E → n)
  3      s4
  4           r(E → E + n)  r(E → E + n)
   Parsing stack       Input       Action
 1 $ 0                 n + n + n$  shift 2
 2 $ 0 n 2             + n + n$    reduce E → n
 3 $ 0 E 1             + n + n$    shift 3
 4 $ 0 E 1 + 3         n + n$      shift 4
 5 $ 0 E 1 + 3 n 4     + n$        reduce E → E + n
 6 $ 0 E 1             + n$        shift 3
 7 $ 0 E 1 + 3         n$          shift 4
 8 $ 0 E 1 + 3 n 4     $           reduce E → E + n
 9 $ 0 E 1             $           accept
130
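The table and trace above can be exercised with a small table-driven driver. The entries below are transcribed from the slide's SLR(1) table for E′ → E, E → E + n | n; the encoding of actions as ("s", state) and ("r", lhs, length) tuples is my own sketch, not bison's representation.

```python
# Table-driven SLR(1) driver for E' -> E, E -> E + n | n.

ACTION = {
    (0, "n"): ("s", 2),
    (1, "+"): ("s", 3), (1, "$"): ("acc",),
    (2, "+"): ("r", "E", 1), (2, "$"): ("r", "E", 1),   # E -> n
    (3, "n"): ("s", 4),
    (4, "+"): ("r", "E", 3), (4, "$"): ("r", "E", 3),   # E -> E + n
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    stack = [0]                       # states only; they carry all the info
    toks = list(tokens) + ["$"]
    i = 0
    while True:
        act = ACTION.get((stack[-1], toks[i]))
        if act is None:               # empty table entry: error
            return False
        if act[0] == "acc":
            return True
        if act[0] == "s":             # shift and push the next state
            stack.append(act[1])
            i += 1
        else:                         # reduce A -> gamma: pop |gamma| states
            _, lhs, length = act
            del stack[-length:]
            stack.append(GOTO[(stack[-1], lhs)])

print(parse(["n", "+", "n", "+", "n"]))  # True
print(parse(["n", "n"]))                 # False
```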
SLR(1) parse of ()()
State    (     )              $              Go to S
  0      s2    r(S → ε)       r(S → ε)          1
  1                           accept
  2      s2    r(S → ε)       r(S → ε)          3
  3            s4
  4      s2    r(S → ε)       r(S → ε)          5
  5            r(S → (S)S)    r(S → (S)S)
    Parsing stack                      Input   Action
  1 $ 0                                ()()$   shift 2
  2 $ 0 ( 2                            )()$    reduce S → ε
  3 $ 0 ( 2 S 3                        )()$    shift 4
  4 $ 0 ( 2 S 3 ) 4                    ()$     shift 2
  5 $ 0 ( 2 S 3 ) 4 ( 2                )$      reduce S → ε
  6 $ 0 ( 2 S 3 ) 4 ( 2 S 3            )$      shift 4
  7 $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4        $       reduce S → ε
  8 $ 0 ( 2 S 3 ) 4 ( 2 S 3 ) 4 S 5    $       reduce S → (S)S
  9 $ 0 ( 2 S 3 ) 4 S 5                $       reduce S → (S)S
 10 $ 0 S 1                            $       accept
131
Disambiguating rules for parsing conflicts
• shift-reduce conflicts have a natural disambiguating
rule: prefer the shift over the reduce.
• reduce-reduce conflicts are more complex to resolve—
they usually require the grammar to be altered.
• Preferring the shift over the reduce in the dangling-
else ambiguity, leads to incorporating the most-
closely-nested-if rule.
• The grammar with the following productions is
ambiguous:
statement → if-statement | other
if-statement → if ( exp ) statement
             | if ( exp ) statement else statement
exp → 0 | 1
• We will consider the even simpler grammar:
S → I | other
I → if S | if S else S
132
Disambiguating a shift-reduce conflict
Consider the grammar:
S → I | other
I → if S | if S else S
Since follow(I) = follow(S) = {$, else}, there is
a parsing conflict—in state 5 the complete item I →
if S. indicates a reduction on inputting else or $, but
the item I → if S.else S indicates a shift when else
is read.
[DFA of sets of LR(0) items for the if grammar:
State 0: S′ → .S, S → .I, S → .other, I → .if S, I → .if S else S
State 1: S′ → S.
State 2: S → I.
State 3: S → other.
State 4: I → if.S, I → if.S else S, S → .I, S → .other,
         I → .if S, I → .if S else S
State 5: I → if S., I → if S.else S
State 6: I → if S else.S, S → .I, S → .other,
         I → .if S, I → .if S else S
State 7: I → if S else S.
Transitions: states 0, 4 and 6 each go to 4 on if, to 3 on other and
to 2 on I; 0 —S→ 1; 4 —S→ 5; 5 —else→ 6; 6 —S→ 7.]
133
SLR(1) table without conflicts
• The rules are numbered:
(1) S → I
(2) S → other
(3) I → if S
(4) I → if S else S
• The SLR(1) parse table:
State    if    else   other   $        Go to S   Go to I
  0      s4           s3                  1         2
  1                           accept
  2            r1             r1
  3            r2             r2
  4      s4           s3                  5         2
  5            s6             r3
  6      s4           s3                  7         2
  7            r4             r4
134
Limits of SLR(1) parsing power
• Consider the grammar which describes parameter-less procedures and assignment statements:
stmt → call-stmt | assign-stmt
call-stmt → identifier
assign-stmt → var := exp
var → var [ exp ] | identifier
exp → var | number
• Assignments and procedure calls both start with
an identifier.
• The parser can only decide at the end of the state-
ment or when the token ‘:=’ appears if a call or an
assignment is being processed.
135
Limits of SLR(1) parsing power
• Consider the simplified grammar:
S → id | V := E
V → id
E → V | n
• The start state of the DFA of sets of items contains:
S′ → .S
S → .id
S → .V := E
V → .id
• The state has a shift transition on id to the state:
S → id.
V → id.
• follow(S) = {$} and follow(V ) = {:=, $}. On
getting the input token $ the SLR(1) parser will
try to reduce by both the rules S → id and V →
id—this is a reduce-reduce conflict.
• This simple problem can be solved by using an
SLR(k) grammar.
136
SLR(k) grammars
• The SLR(1) algorithm can be extended to SLR(k)
parsing, with k ≥ 1 lookahead symbols.
• Use firstk and followk sets and the two rules:
1. If s contains A → α.Xβ where X is a token and
Xw ∈ firstk(Xβ) are the next k tokens in
the input stream, then the action is to shift
the current input token onto the stack, and to
push the state containing the item A → αX.β.
2. If s contains A → α. and w ∈ followk(A) are the
next tokens in the input string, then the action
is to reduce by the rule A → α.
• SLR(k) parsing is more powerful than SLR(1) pars-
ing when k > 1, but it is substantially slower, since
the cost of parsing grows exponentially in k.
• Typical non-SLR(1) constructs are handled using
an LALR(1) parser, by using standard disambiguat-
ing rules, or by rewriting the grammar.
137
General LR(1) and LALR(1) parsing
• LR(1), also called canonical LR(1), parsing over-
comes the problem with SLR(1) parsing, but at
the cost of increased time complexity.
• Lookahead LR(1) or LALR(1) preserves the efficiency
of SLR(1) parsing and retains the benefits of gen-
eral LR(1) parsing.
• We will discuss:
– Finite automata of LR(1) items.
– The LR(1) parsing algorithm.
– LALR(1) parsing.
138
Finite automata of LR(1) items (Louden p. 217–220)
• SLR(1) applies lookahead after constructing the DFA of LR(0) items—the construction ignores the advantages that may ensue from considering lookaheads.
• General LR(1) uses a new DFA that has lookaheads built in from the start.
• This DFA uses items that are an extension of LR(0) items.
• They are called LR(1) items because they include a single lookahead token in each item.
• LR(1) items are written:
[A → α.β, a]
where A → α.β is an LR(0) item, and a is the lookahead token.
• Next the transitions between LR(1) items will be defined.
139
Transitions between LR(1) items
• There are several similarities with DFAs of LR(0)
items.
• They include ε-transitions.
• The DFA states are also built from ε-closures.
• However, transitions between LR(1) items must keep
track of the lookahead token.
• Normal, i.e. non-ε-transitions, are quite similar to
those in DFAs of LR(0) items.
• The major difference lies in the definition of ε-
transitions.
140
Definition of LR(1)-transitions
• Given an LR(1) item, [A → α.Xγ, a], where X ∈
N∪T , there is a transition on X to the item [A →
αX.γ, a].
• Given an LR(1) item, [A → α.Bγ, a], where
B∈N , there are ε-transitions to items
[B → .β, b] for every production B → β and for
every token b ∈ first(γa).
• Only ε-transitions create new lookaheads.
141
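The ε-transition rule with lookaheads drawn from first(γa) can be sketched for the A-grammar used on the following slides. Since no nonterminal in that grammar derives ε, the first() computation reduces to looking at leading symbols; the item representation and helper names below are mine, not the notes'.

```python
# LR(1) closure: from [A -> alpha . B gamma, a] add [B -> .beta, b]
# for every production B -> beta and every b in first(gamma a).

P = [("A'", ("A",)), ("A", ("(", "A", ")")), ("A", ("a",))]
NONTERMS = {"A'", "A"}

def first_of(seq):
    # adequate here: the grammar has no epsilon-productions
    X = seq[0]
    if X not in NONTERMS:
        return {X}
    return {rhs[0] for lhs, rhs in P if lhs == X}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, dot, rhs, la) in list(result):
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for b in first_of(rhs[dot + 1:] + (la,)):
                    for plhs, prhs in P:
                        item = (plhs, 0, prhs, b)
                        if plhs == rhs[dot] and item not in result:
                            result.add(item)
                            changed = True
    return result

# Closure of [A -> (.A), $]: gamma a = ")$", so the new lookahead is ")".
state2 = closure({("A", 1, ("(", "A", ")"), "$")})
print(("A", 0, ("a",), ")") in state2)  # True: the item [A -> .a, )]
```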
DFA of sets of LR(0) items for A → (A)|a (Louden p. 208)
• The augmented grammar with P = {A′ → A, A → (A)|a} has the DFA of sets of LR(0) items:
[DFA of sets of LR(0) items:
State 0: A′ → .A, A → .(A), A → .a
State 1: A′ → A.
State 2: A → a.
State 3: A → (.A), A → .(A), A → .a
State 4: A → (A.)
State 5: A → (A).
Transitions: 0 —a→ 2, 0 —(→ 3, 0 —A→ 1; 3 —(→ 3, 3 —a→ 2,
3 —A→ 4; 4 —)→ 5.]
• The parsing actions for the input ((a)) follow:
   Parsing stack        Input    Action
 1 $ 0                  ((a))$   shift
 2 $ 0 ( 3              (a))$    shift
 3 $ 0 ( 3 ( 3          a))$     shift
 4 $ 0 ( 3 ( 3 a 2      ))$      reduce A → a
 5 $ 0 ( 3 ( 3 A 4      ))$      shift
 6 $ 0 ( 3 ( 3 A 4 ) 5  )$       reduce A → (A)
 7 $ 0 ( 3 A 4          )$       shift
 8 $ 0 ( 3 A 4 ) 5      $        reduce A → (A)
 9 $ 0 A 1              $        accept
142
DFA of sets of LR(1) items for A → (A)|a (Louden p. 218)
• Augment the grammar by adding A′ → A.
• State 0: first put [A′ → .A, $] into State 0.
To complete the closure, add ε-transitions to items
with an A on the left of productions with $ as the
lookahead:
[A → .(A), $], and [A → .a, $].
State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]
• State 1: There is a transition from State 0 on A
to the closure of the set that includes [A′ → A., $].
The action for this state will be to accept.
State 1: [A′ → A., $]
143
DFA of sets of LR(1) items for A → (A)|a
State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]
• State 2: There is a transition on ‘(’ leaving State 0
to the closure of the LR(1) item [A → (.A), $],
which forms the basis of State 2. There are
ε-transitions from this item to [A → .(A), )] and
to [A → .a, )], because the lookahead of the A in
parentheses is first()$) = {)}.
• Note that there is a new lookahead item.
• The complete State 2 is:
State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]
144
DFA of sets of LR(1) items for A → (A)|a
• State 3: This state emanates from State 0 with
a transition on ‘a’ from [A → .a, $] to [A → a., $]
State 3: [A → a., $]
• Note that the lookahead does not change.
• This completes the states that emanate from State 0.
State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]
• State 4: A transition on A leaves State 2 to
the state containing [A → (A.), $].
State 4: [A → (A.), $]
145
DFA of sets of LR(1) items for A → (A)|a
• The next state emanates from State 2:
State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]
• State 5: The transition on ‘(’ goes to the ε-closure of
[A → (.A), )], which once again adds all the items
with A on the left of a production,
namely [A → .(A), )], and [A → .a, )].
State 5: [A → (.A), )], [A → .(A), )], [A → .a, )]
• States 2 and 5 differ only in the lookaheads of their
first item.
146
DFA of sets of LR(1) items for A → (A)|a
• State 6: The last state emanating from State 2
is the transition on ‘a’ to the item [A → a., )].
• It differs from State 3 in the lookahead.
• State 7: There is a transition on ‘)’ from State 4
to the item [A → (A)., $].
• State 8: State 5 has a transition on ‘(’ to itself
and a transition on ‘A’ to the item [A → (A.), )].
• State 9: There is a transition on ‘)’ from State 8
to the item [A → (A)., )].
147
DFA of sets of LR(1) items for A → (A)|a
[DFA of sets of LR(1) items:
State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]
State 1: [A′ → A., $]
State 2: [A → (.A), $], [A → .(A), )], [A → .a, )]
State 3: [A → a., $]
State 4: [A → (A.), $]
State 5: [A → (.A), )], [A → .(A), )], [A → .a, )]
State 6: [A → a., )]
State 7: [A → (A)., $]
State 8: [A → (A.), )]
State 9: [A → (A)., )]
Transitions: 0 —A→ 1, 0 —(→ 2, 0 —a→ 3; 2 —A→ 4, 2 —(→ 5,
2 —a→ 6; 4 —)→ 7; 5 —(→ 5, 5 —A→ 8, 5 —a→ 6; 8 —)→ 9.]
148
The general LR(1) parsing algorithm (Louden p. 220–223)
Let s be the current state, i.e. the state on top of the
stack. The actions are defined as follows:
1. If s contains any LR(1) item of the form [A →
α.Xβ, a], where X is the next token in the input
stream, then shift X onto the stack and push the
state containing the LR(1) item [A → αX.β, a].
2. If s contains the complete LR(1) item [A → γ., a]
and the next token in the input stream is a, then
reduce by the rule A → γ.—more details follow
on next slide.
3. If the next input token is not accommodated by
(1) or (2), then an error is declared.
149
The general LR(1) parsing algorithm—2.
2. If s contains the complete LR(1) item [A → γ., a]
and the next token in the input stream is a, then
reduce by the rule A → γ.
The reduction by S′ → S, where S is the start
symbol, and the next token is $, implies acceptance,
otherwise the new state is computed as follows:
(a) Remove the string γ and all its corresponding
states from the parse stack.
(b) Back up the DFA to the state where the con-
struction of γ started.
(c) By construction, this state contains an LR(1)
item of the form [B → γ.Aβ, b]. Push A
onto the stack and push the item containing
[B → γA.β, b].
150
LR(1) grammar
A grammar is an LR(1) grammar if the application of
the LR(1) parsing rules does not result in an ambiguity.
A grammar is an LR(1) grammar ⇐⇒ for every state s:
1. For any item [A → α.Xβ, a] ∈ s, where X is a
token, there is no complete item in s of the form
[B → γ., X].
A violation of this condition is a shift-reduce
conflict.
2. There are no two complete LR(1) items of the form
[A → α., a] ∈ s and [B → β., a] ∈ s, otherwise
it would lead to a reduce-reduce conflict.
151
LR(1) parse table for A → (A)|a
Number the two productions as follows:
(1) A → (A)
(2) A → a
The LR(1) parse table:
State    (     a     )     $         Go to A
  0      s2    s3                       1
  1                        accept
  2      s5    s6                       4
  3                        r2
  4                  s7
  5      s5    s6                       8
  6                  r2
  7                        r1
  8                  s9
  9                  r1
• Note that the parse table is extracted from the DFA of
sets of LR(1) items.
• This grammar is LR(0) and thus also SLR(1).
152
General LR(1) parsing
• The grammar with the rules
S → id | V := E
V → id
E → V | n
proves not to be SLR(1).
• We construct its DFA of sets of LR(1) items.
• The start state is the ε-closure of the LR(1) item
[S′ → .S, $]. So it also contains the LR(1) items
[S → .id, $] and [S → .V :=E, $].
• The last item, in turn, gives rise to the LR(1) item
[V → .id, :=].
• The lookahead is ‘:=’ because a ‘V ’ must only be
recognized if it is actually followed by ‘:=’
State 0: [S′ → .S, $], [S → .id, $], [S → .V := E, $], [V → .id, :=]
153
General LR(1) parsing
• Consider state 0:
State 0: [S′ → .S, $], [S → .id, $], [S → .V := E, $], [V → .id, :=]
A transition from state 0 on ‘S’ goes to state 1:
State 1: [S′ → S., $]
• State 0 has a transition on ‘id’ to state 2:
State 2: [S → id., $], [V → id., :=]
• State 0 has a transition on ‘V ’ to state 3:
State 3: [S → V.:=E, $]
• No transitions leave states 1 and 2.
154
General LR(1) parsing
• The third state has a transition on ‘:=’ to the closure
of the item [S → V :=.E, $], which forms State 4.
Since E has no symbols following it, the lookaheads
will be ‘$’. The two items [E → .V, $] and
[E → .n, $] must be added. The first of these
leads to the item [V → .id, $].
• Each of these items in state 4 has the general
form [A → α.Xβ, $] and in turn leads to a transition
on X ∈ {E, V, n, id}, to a state with the single
item [A → αX.β, $] in it.
• State 2 gave rise to a parsing conflict in the SLR(1)
parser. The LR(1) items now clearly distinguish
between the two reductions by their lookaheads:
Select S → id on ‘$’ and V → id on ‘:=’.
155
General LR(1) parsing (Louden p. 223)
[DFA of sets of LR(1) items:
State 0: [S′ → .S, $], [S → .id, $], [S → .V := E, $], [V → .id, :=]
State 1: [S′ → S., $]
State 2: [S → id., $], [V → id., :=]
State 3: [S → V.:=E, $]
State 4: [S → V :=.E, $], [E → .V, $], [E → .n, $], [V → .id, $]
State 5: [S → V :=E., $]
State 6: [E → V., $]
State 7: [E → n., $]
State 8: [V → id., $]
Transitions: 0 —S→ 1, 0 —id→ 2, 0 —V→ 3; 3 —:=→ 4;
4 —E→ 5, 4 —V→ 6, 4 —n→ 7, 4 —id→ 8.]
156
LALR(1) parsing (Louden p. 224–226)
• In the DFA of sets of LR(1) items many states differ only in some of the lookaheads of their items.
• The DFA of sets of LR(0) items of the grammar with P = {A → (A)|a} has only 6 states while its DFA of sets of LR(1) items has 10 states.
• In the DFA of sets of LR(1) items states 2–5, 4–8, 7–9, and 3–6 differ only in some item lookaheads.
• e.g. the item [A → (.A), $] from state 2 differs from the item [A → (.A), )] from state 5 only in its lookahead.
• The LALR(1) algorithm caters for these almost-duplicate items by coalescing such pairs into items with sets of lookahead tokens, e.g. [A → (.A), $/)].
• The DFA of sets of LALR(1) items is identical to the corresponding DFA of sets of LR(0) items, except that the former includes sets of lookahead tokens.
157
LALR(1) parsing
• The LALR(1) parsing algorithm combines the benefit
of the smaller DFA of sets of LR(0) items with
some of the benefit of LR(1) parsing over SLR(1)
parsing.
• Definition: the core of a state of the DFA of sets of
LR(1) items is the set of LR(0) items consisting of
the first components of all LR(1) items of the state.
• First principle of LALR(1) parsing
The core of a state of the DFA of sets of LR(1)
items is a state of the DFA of sets of LR(0) items.
• Second principle of LALR(1) parsing
Given two states s1 and s2 of the DFA of sets
of LR(1) items that have the same core, suppose
there is a transition on the symbol X from state
s1 to state t1, then there is also a transition on
the symbol X from state s2 to state t2 and the
states t1 and t2 have the same core.
158
LALR(1) parsing
• The two principles of LALR(1) parsing allow us to
construct the DFA of sets of LALR(1) items which
is built up from the DFA of sets of LR(1) items
by identifying all states that have the same core
and forming the union of the lookahead symbols
for each LR(0) item.
• Thus each LALR(1) item in this DFA will have
an LR(0) item as its first component and a set of
lookahead tokens as its second component.
• Multiple lookaheads are separated by ‘/’.
159
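The coalescing step described above can be sketched as a merge keyed on the state core. The (lr0_item, lookahead) pairs and the string rendering of items below are my own illustration, not the notes' representation:

```python
# Merge LR(1) states that share a core into LALR(1) states by uniting
# the lookahead sets of matching LR(0) items.

def merge_by_core(states):
    merged = {}                       # core -> {lr0_item: lookahead set}
    for state in states:
        core = frozenset(item for item, la in state)
        bucket = merged.setdefault(core, {})
        for item, la in state:
            bucket.setdefault(item, set()).add(la)
    return list(merged.values())

# States 2 and 5 of the A-grammar share a core and differ only in the
# lookahead of the kernel item A -> (.A):
s2 = {("A -> (.A)", "$"), ("A -> .(A)", ")"), ("A -> .a", ")")}
s5 = {("A -> (.A)", ")"), ("A -> .(A)", ")"), ("A -> .a", ")")}
[m] = merge_by_core([s2, s5])         # a single LALR(1) state
print(sorted(m["A -> (.A)"]))         # ['$', ')'] i.e. A -> (.A), $/)
```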
LALR(1) parsing
• The DFA of sets of LALR(1) items.
[DFA of sets of LALR(1) items:
State 0: [A′ → .A, $], [A → .(A), $], [A → .a, $]
State 1: [A′ → A., $]
State 2: [A → (.A), $/)], [A → .(A), )], [A → .a, )]
State 3: [A → a., $/)]
State 4: [A → (A.), $/)]
State 7: [A → (A)., $/)]
Transitions: 0 —A→ 1, 0 —(→ 2, 0 —a→ 3; 2 —(→ 2, 2 —A→ 4,
2 —a→ 3; 4 —)→ 7.]
• As would be expected this DFA is identical to the
DFA of sets of LR(0) items for this grammar, except
for the lookaheads.
160
LALR(1) parsing algorithm
• The LALR(1) parsing algorithm is identical to the
general LR(1) parsing algorithm.
• Definition: if no parsing conflicts arise when
parsing a grammar with the LALR(1) parsing al-
gorithm it is known as an LALR(1) grammar.
• It is possible for the LALR(1) construction to create
parsing conflicts that do not exist in general LR(1)
parsing.
• There cannot be any shift-reduce conflicts but
reduce-reduce conflicts are possible.
• Every SLR(1) grammar is certainly LALR(1) and
LALR(1) parsers often do as well as general LR(1)
parsers in removing typical conflicts that occur in
SLR(1) parsing.
• The id grammar is not SLR(1) but is LALR(1).
161
LALR(1) parsing
• Combining LR(1) states to form the DFA of sets of
LALR(1) items solves the problem of large parsing
tables, but it still requires the entire DFA of sets
of LR(1) items to be computed.
• It is possible to compute theDFA of sets ofLALR(1)
items directly from the DFA of sets of LR(0) items
by propagating lookaheads which is a relatively
simple process.
• Consider the LALR(1) construction for the A-grammar.
• Begin constructing lookaheads by adding ‘$’ to the
lookahead of the item A′ → A in state 0. The ‘$’
is said to be spontaneously generated.
• Then by the rules of ε-closure the ‘$’ propagates
to the two closure items of ‘.A’. By following the
three transitions leaving state 0, the ‘$’ propagates
to states 1, 2, and 3.
162
LALR(1) parsing
• Continuing with state 2 the closure items get the
lookahead ‘)’ by spontaneous generation—because
in A → (.A), the core item of the state, ‘.A’ is
followed by ‘)’.
• The transition on a to state 3 causes the ‘)’ to be
propagated to the lookahead of the item in that
state.
• The transition of ‘(’ from state 2 to itself causes
the ‘)’ to propagate to the lookahead of the kernel
item—which now has ‘)’ and ‘$’ in its lookahead
set.
• Now this lookahead set ‘)/$’ propagates to states
4 and 7.
• We have now demonstrated how to build the DFA
of sets of LALR(1) items directly from the DFA of
sets of LR(0) items.
163
The hierarchy of parsers
[Venn diagram of grammar classes: LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1) ⊂ LR(k),
with SLR(k) and the LL classes (LL(0), LL(1), LL(k)) cutting across.]
• All LL(0) grammars are LR(0) but there exist LR(0)
grammars that are not LL(0).
• LR(0) grammars are SLR(1) and there are SLR(1)
grammars that are not LR(0) grammars.
• SLR(1) grammars areLALR(1) and there areLALR(1)
grammars that are not SLR(1) grammars.
• LALR(1) grammars are LR(1) and there are LR(1)
grammars that are not LALR(1).
164
The hierarchy of parsers—continued
[The same Venn diagram of grammar classes as on the previous slide.]
• LL(k) grammars cut across these grammars but are
a subset of LR(k) and obviously include all LL(1)
and LL(0) grammars.
• Similarly LR(k) are obviously a superset of LR(1)
and LR(0) grammars.
165
bison—an LALR(1) parser generator (Louden p. 226–250)
• bison basics.
• bison options.
• Parsing conflicts and disambiguating rules.
• Tracing the execution of a bison parser.
• Arbitrary value types in bison.
• Embedded actions in bison.
166
Generation of a parser using bison
The TINY parser using bison.
• See the code for tiny.y
• Use of YYPARSER
• YYSTYPE is used to define the values returned by
the bison procedures as follows:
#define YYSTYPE TreeNode *
where TreeNode is defined as:

typedef struct treeNode {
  struct treeNode * child[MAXCHILDREN];
  struct treeNode * sibling;
  int lineno;
  NodeKind nodekind;
  union { StmtKind stmt; ExpKind exp; } kind;
  union { TokenType op;
          int val;
          char * name; } attr;
  ExpType type; /* for type checking of exps */
} TreeNode;
167
Error recovery in bottom-up parsers
• The normal state of a compiler is dealing with
errors—most source files presented to a compiler
are erroneous.
• It is not acceptable for a compiler to give up on
the first parse error it finds.
• Compilers should be designed to cope with as many
errors as possible.
• Detecting errors in bottom-up parsing.
• Panic mode error recovery.
• Error recovery in bison.
• Error recovery in your compiler.
168
Detecting errors in bottom-up parsing
• An empty entry in the parse table indicates an error.
• Contrary to intuition such entries are very useful—
a negative effect is that they increase the size of
the table.
• Because of the way it is constructed, an LR(1) parser
detects errors better than an LALR(1) parser, which
in turn can detect errors earlier than an LR(0) parser.
• Using the a-grammar with the incorrect input ‘(a$’
an ‘LR(1)’ parser will shift ‘(’ and ‘a’ onto the stack
and move to state 6, where it immediately reports
that there is no entry under ‘$.’
• An LR(0) parser will first reduce by A → a before
discovering the missing ‘)’.
• Note that none of these parsers can ever shift a
terminal token in error.
169
Panic mode error recovery
It is possible to achieve good error recovery by removing
symbols from either the parse stack or the input stream
or both.
There are three possible actions:
1. Pop a state from the stack.
2. Successively pop tokens from the input until a to-
ken is seen for which we can restart the parse.
3. Push a new state onto the stack.
170
Panic mode error recovery
An effective way to choose an action when an error
occurs is to:
1. Pop states from the parse stack until a state with
nonempty ‘goto’ entries is found.
2. If there is a legal action on the current input token
from one of the ‘goto’ states, push that state onto
the stack and restart the parse. If there are several
such states, prefer a shift to a reduce. Among the
reduce actions, prefer one whose associated non-
terminal is least general.
3. If there is no legal action on the current input token
from one of the ‘goto’ states advance the input
until there is a legal action or the end is reached.
These rules have the effect of forcing the recognition
of the item being recognized—it is known as panic
mode error recovery.
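The token-skipping part of step 3 can be sketched in C. This is a simplified sketch, not Louden's code: tokens are single characters, and the synchronizing set ";)" is a hypothetical choice for illustration.

```c
#include <string.h>

/* Skip erroneous input until a token is seen from which the parse
   can restart (step 3 above). Tokens are single characters here,
   and ";)" is an assumed set of synchronizing tokens. */
int skip_to_sync(const char *input, int pos) {
    while (input[pos] != '\0' && strchr(";)", input[pos]) == NULL)
        pos++;             /* discard tokens with no legal action */
    return pos;            /* index of the restart token (or end) */
}
```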
171
Panic mode error recovery—example
Consider the parse below of (2+*)—it proceeds normally
until the * is seen. At that point panic mode would cause
the following actions to take place on the parsing stack.

Parsing stack                    Input   Action
. . .                            . . .   . . .
$ 0 ( 6 E 10 + 7                 ∗)$     error: push T, goto 11
$ 0 ( 6 E 10 + 7 T 11            ∗)$     shift 9
$ 0 ( 6 E 10 + 7 T 11 ∗ 9        )$      error: push F, goto 13
$ 0 ( 6 E 10 + 7 T 11 ∗ 9 F 13   )$      reduce T → T ∗ F
. . .                            . . .   . . .
• At the first error the parser is in state 7, which
has legal goto states 11 and 4. Since state 11 has
a shift on the next input token ‘*’, that goto is
preferred, and the token is shifted.
• The parser goes into state 9, with ‘)’ as input—
another error.
• In state 9 there is a single goto entry, to state 13,
and state 13 does have a legal action for ‘)’, so
the parse can now proceed normally.
172
Error recovery in bison.
• bison uses error productions.
• An error production contains the pseudo token
error.
• error marks the context in which erroneous to-
kens can be removed until a suitable synchronizing
token is seen.
• error productions allow the programmer to man-
ually mark those nonterminals whose goto entries
can be used in error recovery.
• bison also uses the function ‘yyerrok’ which pre-
vents particular tokens from being discarded when
it is doing error recovery.
173
How bison uses error.
• When an error is encountered, states are popped
from the parse stack until it reaches a state in
which ‘error’ is a legal lookahead.
• If ‘error’ is never a legal lookahead for a shift then
the parse stack will be emptied, aborting the parse.
• When a state with ‘error’ as legal lookahead is
found, the parser carries on as if it had seen ‘error’
followed by the lookahead that caused the error.
• The previous lookahead token can be discarded
with ‘yyclearin’.
• If the parser is in the Error State and then dis-
covers further errors, the input tokens causing the
errors will be discarded without any messages un-
til three tokens have been shifted legally onto the
parsing stack.
174
bison and the error state.
• While the parser is in the error recovery mode the
value of ‘YYRECOVERING’ is 1—normally it is 0.
• The parser can be removed from the error
state by using ‘yyerrok’.
175
Error recovery in bison—examples
Consider the example
%token NUMBER
%%
command
  : exp { printf("%d\n", $1); }  /* prints the result */
  ;
exp : exp '+' term   { $$ = $1 + $3; }
    | exp '-' term   { $$ = $1 - $3; }
    | term           { $$ = $1; }
    ;
term : term '*' factor { $$ = $1 * $3; }
     | factor          { $$ = $1; }
     ;
factor : NUMBER        { $$ = $1; }
       | '(' exp ')'   { $$ = $2; }
       ;
%%
176
Error recovery in bison—examples
Consider this replacement for command
command : exp   { printf("%d\n", $1); }
        | error { yyerror("incorrect expression"); }
        ;

Now suppose the input ‘2++3’ is given to the parser.
This will lead to the configuration:

Parsing stack        Input
$ 0 exp 2 + 7        +3$

The parser enters the error state and begins popping
states from the stack until state 0 is uncovered.
Then the production for command provides that error
is a legal lookahead, so it is shifted onto the stack and
immediately reduced to command, causing the error
message “incorrect expression” to be printed. The stack
now becomes:

Parsing stack        Input
$ 0 command 1        +3$
At this stage the only legal lookahead is ‘$’ correspond-
ing to the return of ‘EOF ’ by ‘yylex’ and the parser
will delete the remaining input tokens ‘+3’ before exit-
ing the error state.
177
Error recovery in bison—examples
• A better idea is to reenter the line after the error.
This is done with a synchronizing token, such as ‘\n’.

command : exp '\n'   { printf("%d\n", $1); }
        | error '\n' { yyerrok;
                       printf("reenter expression: "); }
          command
        ;
• When the error occurs the parser will skip all the
tokens up to the end-of-line symbol, when it will
execute ‘yyerrok’ and the printf statement and
will then try to get another command.
• The call to yyerrok is needed to cancel the error
state, otherwise bison will eat up input until it
finds three legal tokens.
178
Where to put error
Follow these goals when placing error:
• as close as possible to the start symbol—this ensures
that there is always a point to recover from;
• as close as possible to each terminal—improve
recovery further by inserting the action yyerrok;;
• without introducing new conflicts—this could be
difficult. Allow parsing to continue beyond the expression
but trash the rest of the statement.

Place error symbols:
• into each recursive construct or repetition;
• don't add yyerrok; in productions with error—it
may lead to cascading error messages and even
loops if the parser cannot discard input;
• non-empty lists require two error variants, one at
the start of a list and another for the end;
• possibly empty lists require an error symbol inside
the empty branch—otherwise add the symbol
where the empty list is being used.
179
Where to put error
The table below is a good guideline:

Construct           EBNF        bison input
optional sequence   x : {y}     x : /* empty */
                                  | x y {yyerrok;}
                                  | x error
                                  ;
sequence            x : y{y}    x : y
                                  | x y {yyerrok;}
                                  | error
                                  | x error
                                  ;
list                x : y{Ty}   x : y
                                  | x T y {yyerrok;}
                                  | error
                                  | x error
                                  | x error y {yyerrok;}
                                  | x T error
                                  ;
Note that we used a yyerrok; action after a produc-
tion with an error symbol.
180
Semantic analysis
Introduction
• Semantic analysis is sometimes referred to as context-
sensitive analysis because coping with even some of the
simplest semantics—such as using a variable only
if it has already been declared—is beyond
the capabilities of a CFG.
• Generally, use symbol table and bison actions to
perform or compute semantics.
• More formally, syntax-directed translation with at-
tribute grammars may be used.
• Use type-checking algorithms based on attribute
dependency and propagation.
181
Semantic analysis
• Semantic analysis involves computation beyond the
reach of CFGs and parsing algorithms, situated
in the realm of context-sensitive grammars.
• The semantic information is closely related to the
eventual meaning or semantics of the program be-
ing translated.
• Since it takes place prior to execution it may be
regarded as static.
• Semantic analysis in a statically typed language
such as C involves building a symbol table to keep
track of the meanings of identifiers and performing
type inference to propagate these meanings and
type checking on expressions and statements.
182
Semantic analysis
What sort of meaning is involved that extends beyondthe capabilities of a CFG?
1. Has x been declared only once?
2. Has x already been declared before its first use?
3. Has x been defined before its first use?
4. Is x a scalar, an array, a function, or a class?
5. Is x declared but never used?
6. To which declaration does x refer?
7. Are the types in an expression compatible?
8. Does the dimension match the declaration?
9. Is an array reference within its declared bounds?
10. Where is x stored? When is it allocated or created?
11. Does *p refer to the result of a new or of a malloc()?
12. Does the expression produce a constant value?
183
Semantic analysis
Semantic analysis can be divided into two categories:
the analysis of a program
1. to establish its correctness in order to guarantee
proper execution—this varies according to the typing
strength of the language in question. Languages
can be ordered more or less in terms of their typing
strength:

LISP ≺ Smalltalk ≺ Fortran ≺ Basic ≺
C ≺ Pascal ≺ Oberon ≺ Ada and Java
2. to enhance the efficiency of its execution—this
is usually relegated to “optimization.”
184
Static semantic analysis
• Involves both the description of the analyses to
perform, as well as the implementation of the analyses
using appropriate algorithms.
• Denotational semantics (Strachey) may be used.
• Attributes and attribute grammars (Donald Knuth,
1968) may be used to write semantic rules.
• bison's actions often boil down to semantic rules.
• Attribute grammars may be useful for languages
which obey the principle of syntax-directed semantics:
– the semantic content of a program is
closely related to its syntax.
• Modern programming languages tend to follow this
principle.
• Despite this, semantics are often not formally specified
by the language designers and the task of
figuring out the attribute grammar is left to the
compiler writer.
185
Attributes and attribute grammars
• An attribute is a property of a programming lan-
guage construct.
• The attributes of an object include:

[Figure: attributes of an object—name, type, value, location, size, scope, extent, lexical/dynamic, register/memory, transient/persistent, static/automatic.]
• data type of an object.
• value of an expression, or object code.
• location of a variable in memory or on disk.
• number of significant digits in a variable.
• more attributes are given in the figure.
186
Attributes and attribute grammars
• Abstract syntax represented by an abstract syntax
tree is a better basis for semantics—but this too is
usually left to the whims of the compiler writer.
• Attributes may be fixed prior to the compilation
process.
• Binding is the process of computing an attribute
and associating its computed value with the lan-
guage construct.
• The time that it occurs is called binding time.
• Attributes that can be prebound by the compiler
are static.
• Those attributes that are bound at run time are
dynamic.
187
Binding time
• In C or Pascal the data type of a variable can be
determined at compile time by the type checker.
In LISP this is usually done at run time.
• The values in expressions are usually dynamic. Some
constant expressions such as (1+2)*3 can be calculated
during compilation—constant folding.
• Variables are usually allocated during compilation.
Storage for variables can also be created at run
time, but typically these are used via pointers residing
in statically allocated relative addresses.
• The object code is static, because all of it is created
at compile time. Some languages cater for doing this
dynamically.
• In a language like BASIC the size and type of numbers
is determined at run time. But usually the
scanner needs to know the number of digits it must
accumulate—otherwise numbers can be stored as
strings and the accuracy sorted out by the run-time
routines.
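Constant folding, mentioned above, can be sketched over a tiny expression tree. The node type below is invented for this example (it is not the TINY TreeNode): op == 0 marks a constant leaf, and '+' and '*' are operator nodes.

```c
#include <stdlib.h>

/* A minimal constant-folding sketch; the node type is an assumption. */
typedef struct cnode { char op; int val; struct cnode *l, *r; } cnode;

cnode *lit(int v) {                       /* constant leaf */
    cnode *n = calloc(1, sizeof *n);
    n->val = v;
    return n;
}

cnode *bin(char op, cnode *l, cnode *r) { /* operator node */
    cnode *n = calloc(1, sizeof *n);
    n->op = op; n->l = l; n->r = r;
    return n;
}

/* Replace any operator subtree whose children are constants by a
   single constant node, e.g. (1+2)*3 is folded to 9 at compile time. */
cnode *fold(cnode *n) {
    if (n->op == 0) return n;
    n->l = fold(n->l);
    n->r = fold(n->r);
    if (n->l->op == 0 && n->r->op == 0)
        return lit(n->op == '+' ? n->l->val + n->r->val
                                : n->l->val * n->r->val);
    return n;
}
```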
188
Attribute grammars
• If X ∈ N ∪ T and a is an attribute of X, write
X.a for the value of a associated with X.
• Given a collection of attributes a1, a2, . . . , ak, the
principle of syntax-directed semantics implies that
for each grammar rule X0 → X1X2 . . . Xn, where
X0 ∈ N and Xi ∈ N ∪ T for i ∈ [1..n], the
values of the attributes Xi.aj of each symbol Xi
are related to the values of the attributes of the other
symbols in the rule.
• Each relationship is specified by an attribute equation

  Xi.aj = fij(X0.a1, . . . , X0.ak,
              X1.a1, . . . , X1.ak,
              . . . ,
              Xn.a1, . . . , Xn.ak)

• An attribute grammar for the attributes a1, a2,
. . . , ak is the collection of all such equations for
all the grammar rules of the language.
189
Attribute grammar for number grammar
Typically attribute equations are written with each
grammar rule.
The number grammar has the 12 productions:

number → number digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Its attribute grammar follows
Grammar rules             Semantic rules
number1 → number2 digit   number1.val = number2.val ∗ 10 + digit.val
number → digit            number.val = digit.val
digit → 0                 digit.val = 0
digit → 1                 digit.val = 1
digit → 2                 digit.val = 2
digit → 3                 digit.val = 3
digit → 4                 digit.val = 4
digit → 5                 digit.val = 5
digit → 6                 digit.val = 6
digit → 7                 digit.val = 7
digit → 8                 digit.val = 8
digit → 9                 digit.val = 9
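Since val is synthesized, it can be computed by a single left-to-right scan of the digit string. This small C sketch mirrors the equation number1.val = number2.val ∗ 10 + digit.val:

```c
/* Evaluate the val attribute of the number grammar: the base case
   number -> digit, then repeated number1 -> number2 digit steps. */
int number_val(const char *digits) {
    int val = digits[0] - '0';                   /* number -> digit */
    for (int i = 1; digits[i] != '\0'; i++)
        val = val * 10 + (digits[i] - '0');      /* val*10 + digit.val */
    return val;
}
```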
190
The parse tree for number grammar
The parse tree for the number grammar of the integer
345 follows
                    number (val = 34∗10+5 = 345)
                   /                          \
      number (val = 3∗10+4 = 34)         digit (val = 5)
        /               \                      |
number (val = 3)    digit (val = 4)            5
       |                  |
digit (val = 3)           4
       |
       3
191
Attribute grammar for exp grammar
The exp grammar has the 7 productions:
exp → exp + term | exp − term | term
term → term ∗ factor | factor
factor → ( exp ) | number
Its attribute grammar follows
Grammar rules            Semantic rules
exp1 → exp2 + term       exp1.val = exp2.val + term.val
exp1 → exp2 - term       exp1.val = exp2.val − term.val
exp → term               exp.val = term.val
term1 → term2 * factor   term1.val = term2.val ∗ factor.val
term → factor            term.val = factor.val
factor → ( exp )         factor.val = exp.val
factor → number          factor.val = number.val
• Note that the ‘+’ in the grammar rule ‘exp1 →
exp2+term’ represents the token in the source pro-
gram, and the ‘+’ in the semantic rule represents
the arithmetic operation to be performed at exe-
cution time.
• There is no equation with ‘number.val’ on the LHS,
since this value is calculated prior to the semantic
phase, e.g. by the scanner.
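These semantic rules can be realized directly as a recursive-descent evaluator, one C function per nonterminal, each returning the val attribute of its subtree. This is a sketch, not the book's code, and it assumes a well-formed input with no whitespace:

```c
#include <ctype.h>

static const char *s;                 /* cursor into the expression */
static int exp_val(void);

static int factor_val(void) {
    if (*s == '(') {                  /* factor -> ( exp ) */
        s++;
        int v = exp_val();
        s++;                          /* consume ')' */
        return v;
    }
    int v = 0;                        /* factor -> number */
    while (isdigit((unsigned char)*s))
        v = v * 10 + (*s++ - '0');
    return v;
}

static int term_val(void) {
    int v = factor_val();
    while (*s == '*') {               /* term1.val = term2.val * factor.val */
        s++;
        v *= factor_val();
    }
    return v;
}

static int exp_val(void) {
    int v = term_val();
    while (*s == '+' || *s == '-') {  /* exp1.val = exp2.val +/- term.val */
        char op = *s++;
        int t = term_val();
        v = (op == '+') ? v + t : v - t;
    }
    return v;
}

int eval(const char *src) { s = src; return exp_val(); }
```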
192
Parse tree for exp grammar
The parse tree of (34-3)*42 for the exp grammar
exp (val = 1302)
└── term (val = 31∗42 = 1302)
    ├── term (val = 31)
    │   └── factor (val = 31)
    │       ├── (
    │       ├── exp (val = 34−3 = 31)
    │       │   ├── exp (val = 34)
    │       │   │   └── term (val = 34)
    │       │   │       └── factor (val = 34)
    │       │   │           └── number (val = 34)
    │       │   ├── −
    │       │   └── term (val = 3)
    │       │       └── factor (val = 3)
    │       │           └── number (val = 3)
    │       └── )
    ├── ∗
    └── factor (val = 42)
        └── number (val = 42)
193
Attribute grammar for decl grammar
The decl grammar has the 5 productions:
decl → type var-list
type → int | float
var-list → id , var-list | id

Its attribute grammar follows

Grammar rules               Semantic rules
decl → type var-list        var-list.dtype = type.dtype
type → int                  type.dtype = integer
type → float                type.dtype = real
var-list1 → id , var-list2  id.dtype = var-list1.dtype
                            var-list2.dtype = var-list1.dtype
var-list → id               id.dtype = var-list.dtype
The parse tree for the string float x,y
decl
├── type (dtype = real)
│   └── float
└── var-list (dtype = real)
    ├── id (x) (dtype = real)
    ├── ,
    └── var-list (dtype = real)
        └── id (y) (dtype = real)
194
Attribute grammar for based-num grammar
The based-num grammar has the 15 productions:

based-num → num basechar
basechar → o | d
num → num digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Its attribute grammar follows

Grammar rules              Semantic rules
based-num → num basechar   based-num.val = num.val
                           num.base = basechar.base
basechar → o               basechar.base = 8
basechar → d               basechar.base = 10
num1 → num2 digit          num1.val = if digit.val = error
                                        or num2.val = error
                                      then error
                                      else num2.val ∗ num1.base
                                           + digit.val
                           num2.base = num1.base
                           digit.base = num1.base
num → digit                num.val = digit.val
                           digit.base = num.base
digit → 0                  digit.val = 0
digit → 1                  digit.val = 1
· · ·
digit → 7                  digit.val = 7
digit → 8                  digit.val = if digit.base = 8
                                       then error else 8
digit → 9                  digit.val = if digit.base = 8
                                       then error else 9
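The rules above combine a synthesized val with an inherited base, and propagate error upward. A C sketch of the same computation (the names and the BN_ERROR sentinel are assumptions of this example):

```c
#include <string.h>

#define BN_ERROR (-1)   /* stands in for the error value in the rules */

/* Evaluate a based-num string such as "345o": the base is inherited
   from the trailing basechar (o = octal, d = decimal), each digit is
   checked against it, and error propagates upward as in the rules. */
int based_num_val(const char *src) {
    size_t n = strlen(src);
    int base = (src[n - 1] == 'o') ? 8 : 10;   /* basechar.base */
    int val = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        int d = src[i] - '0';
        if (d >= base) return BN_ERROR;        /* digit.val = error */
        val = val * base + d;   /* num1.val = num2.val*base + digit.val */
    }
    return val;
}
```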
195
Parse tree for based-num grammar
The parse tree for 345o with its attribute values follows

based-num (val = 229)
├── num (base = 8, val = 28∗8+5 = 229)
│   ├── num (base = 8, val = 3∗8+4 = 28)
│   │   ├── num (base = 8, val = 3)
│   │   │   └── digit (base = 8, val = 3)
│   │   │       └── 3
│   │   └── digit (base = 8, val = 4)
│   │       └── 4
│   └── digit (base = 8, val = 5)
│       └── 5
└── basechar (base = 8)
    └── o
196
Simplifications and extensions to AGs
Some useful but obvious extensions to the AG meta-
language are
• The use of if ... then ... else statements,
and a case statement.
• Certain functions can also enhance the function-
ality, e.g. the function ‘numval’ in the attribute
equation digit.val = numval(D) which converts
‘D’—the token for a digit—into its numerical value.
The C function below does the trick
int numval(char D) {
return (int)D - (int)’0’;
}
197
AG for creating abstract syntax tree
An abstract syntax tree is created by the semantic rules.

Grammar rules            Semantic rules
exp1 → exp2 + term       exp1.tree = mkOpNode(+, exp2.tree, term.tree)
exp1 → exp2 - term       exp1.tree = mkOpNode(-, exp2.tree, term.tree)
exp → term               exp.tree = term.tree
term1 → term2 * factor   term1.tree = mkOpNode(*, term2.tree, factor.tree)
term → factor            term.tree = factor.tree
factor → ( exp )         factor.tree = exp.tree
factor → number          factor.tree = mkNumNode(number.lexval)
198
Algorithms for attribute computation
• Consider the attribute equation
Xi.aj = fij(X0.a1, . . . , X0.ak,
            X1.a1, . . . , X1.ak,
            . . . ,
            Xn.a1, . . . , Xn.ak).
• It may be viewed as an assignment of the value
of the RHS to the attribute Xi.aj where all the
attributes used on the RHS must be known.
• Some of the RHS attributes depend on having the
value of others available before they can be com-
puted.
• These dependencies are subject to some inherent
order preordained by the code being translated.
The dependencies are quite easy to determine by
building a dependency graph.
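One standard way to turn a dependency graph into an evaluation order is a topological sort. The sketch below uses Kahn's algorithm on a small adjacency matrix; it assumes the graph is acyclic (i.e. the attribute grammar is noncircular), and the node numbering is an assumption of the example.

```c
#define NATTR 4   /* number of attribute instances in this example */

/* Compute an evaluation order for an acyclic dependency graph:
   edge[i][j] = 1 means attribute i must be evaluated before j.
   Kahn's algorithm: repeatedly emit a node of in-degree zero. */
void eval_order(int edge[NATTR][NATTR], int order[NATTR]) {
    int indeg[NATTR] = {0};
    for (int i = 0; i < NATTR; i++)
        for (int j = 0; j < NATTR; j++)
            indeg[j] += edge[i][j];
    int k = 0;
    while (k < NATTR)
        for (int j = 0; j < NATTR; j++)
            if (indeg[j] == 0) {
                order[k++] = j;
                indeg[j] = -1;               /* mark as emitted */
                for (int t = 0; t < NATTR; t++)
                    if (edge[j][t]) indeg[t]--;
                break;
            }
}
```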
199
Dependency graphs and evaluation order
• Given an attribute grammar, each production rule
has an associated dependency graph.
• This graph has a node labelled by each attribute
Xi.aj of each symbol in the grammar rule, and for
each attribute equation
Xi.aj = fij(. . . , Xm.ak, . . .)
associated with the grammar rule there is an edge
from each node Xm.ak in the RHS to the node
Xi.aj:

  Xm.ak ──▶ Xi.aj
• e.g. for the production rule and its attribute equation

  number1 → number2 digit    number1.val = number2.val ∗ 10 + digit.val

the dependency graph is

  number2.val ──▶ number1.val ◀── digit.val
200
Dependency graphs
• The dependency graphs for grammar rules of the
form digit → D are simple. They consist of single
nodes without any edges:

  digit.val

where digit.val = numval(D).
• The grammar rule and its attribute equation
number → digit number.val = digit.val
has the dependency graph

  digit.val ──▶ number.val
• The dependency graph for the string 345 (each
vertical edge is a number2.val ──▶ number1.val
dependency):

  digit.val(3) ──▶ number.val(3)
                        │
  digit.val(4) ──▶ number.val(34)
                        │
  digit.val(5) ──▶ number.val(345)
201
Dependency graphs
• In the grammar for declarations, the rule

  var-list1 → id , var-list2

has two associated attribute equations:

  id.dtype = var-list1.dtype
  var-list2.dtype = var-list1.dtype

and the dependency graph

  var-list1.dtype ──▶ id.dtype
  var-list1.dtype ──▶ var-list2.dtype
• Similarly the grammar rule var-list → id has the
dependency graph

  var-list.dtype ──▶ id.dtype
• The two rules type → int and type → float
have trivial dependency graphs.
202
Dependency graph of decl grammar
• The rule
  decl → type var-list      var-list.dtype = type.dtype

has the dependency graph (DG)

  type.dtype ──▶ var-list.dtype
• Since decl is not involved in the DG, it is not clear
which grammar rule has been associated with it.
So the DG is drawn over its corresponding parse
tree segment as follows
        decl
       /    \
  type.dtype ──▶ var-list.dtype
• Now it is clear to which rule the dependency is
associated.
203
Dependency graph of decl grammar
• The dependency graph superimposed over the parse
tree for var-list1 → id, var-list2 now becomes

      var-list (dtype)
     /     |        \
  id (dtype)  ,   var-list (dtype)

with dtype edges from the parent var-list down to
id and to the child var-list.
• The dependency graph for float x,y
[Figure: the dependency graph for float x,y drawn over its parse tree, as on Slide 194—type.dtype flows to the top var-list.dtype, and dtype then flows down to id (x), to the nested var-list, and to id (y).]
204
Dependency graph of based-num grammar
• The dependency graph for the grammar rule
based-num → num basechar
follows below
[Figure: over the node based-num with children num and basechar; edges num.val ──▶ based-num.val and basechar.base ──▶ num.base.]

The graph shows the dependencies based-num.val =
num.val and num.base = basechar.base.
• The dependency for digit → 9 follows.

[Figure: over the node digit with child 9; edge digit.base ──▶ digit.val.]

The dependency is created by

  digit.val = if digit.base = 8 then error else 9

i.e. digit.val depends upon digit.base.
Dependency graph of based-num grammar
• The dependency graph for num → num digit
[Figure: over the node num1 with children num2 and digit; edges num1.base ──▶ num2.base, num1.base ──▶ digit.base, num2.val ──▶ num1.val and digit.val ──▶ num1.val.]

The graph shows the dependencies of the three attribute
equations

  num1.val = if digit.val = error or num2.val = error
             then error
             else num2.val ∗ num1.base + digit.val
  num2.base = num1.base
  digit.base = num1.base

• The dependency for num → digit is similar.

[Figure: edges num.base ──▶ digit.base and digit.val ──▶ num.val.]
206
Dependency graph for 345o
• The graph shows the dependencies for 345o
[Figure: the dependency graph for 345o drawn over its parse tree (Slide 196): basechar.base ──▶ num.base at the root, base edges propagate down to every num and digit node, and val edges propagate up from the digits to based-num.val.]
207
Dependency graph for 345o
• The graph shows the dependencies for 345o in order
of computation.

[Figure: the same dependency graph with its 14 attribute instances numbered in a valid evaluation order—the base attributes are computed first, top-down, and then the val attributes, bottom-up.]
208
Synthesized and inherited attributes
• Rule-based attribute evaluation is based on traver-
sal of the parse or syntax tree.
• There are various approaches.
• The simplest to handle are synthesized attributes.
• Definition: An attribute is synthesized if all its
dependencies point upwards—from child to parent—
in the parse tree.
• An attribute a is synthesized if, given the rule
A → X1X2 . . . Xn, the only associated attribute
equation with an a on the LHS is of the form

  A.a = f(X1.a1, . . . , X1.ak, . . . , Xn.a1, . . . , Xn.ak)
• An attribute grammar in which all the attributes
are synthesized is an S-attributed grammar.
• From the examples on Slides 200, 201 and 202 it
follows that the number grammar is S-attributed.
209
S-attributed grammar—example
• The decl grammar is not S-attributed, as is
obvious from the dependency graph for float x,y
given below.
• The dependency graph for float x,y:

[Figure: the dependency graph for float x,y over its parse tree, as on Slide 204—the dtype edges point downwards, from decl to the var-list and id nodes, so dtype is not synthesized.]
• The attributes of an S-attributed grammar can be calculated by a single bottom-up, LRN—or postorder—traversal of the parse or syntax tree.
• The following pseudo code may be used:
void posteval(treenode T){
for(each child C of T)
posteval(C);
compute all synthesized attributes of T;
}
210
C-code for posteval
typedef enum {Plus, Minus, Times} OpKind;
typedef enum {Opkind, Constkind} ExpKind;
typedef struct streenode {
  ExpKind kind;
  OpKind op;
  struct streenode *lchild, *rchild;
  int val;
} STreeNode;

void posteval(STreeNode *t) {
  if (t->kind == Opkind) {
    posteval(t->lchild);          // traverse left child
    posteval(t->rchild);          // traverse right child
    switch (t->op) {
      case Plus:
        t->val = t->lchild->val + t->rchild->val;
        break;
      case Minus:
        t->val = t->lchild->val - t->rchild->val;
        break;
      case Times:
        t->val = t->lchild->val * t->rchild->val;
        break;
    } // end switch
  } // end if
} // end posteval
211
Inherited attributes
• Not all attributes are synthesized.
• Definition: An attribute that is not synthesized is
inherited.
• There are three kinds of attribute inheritance, viz.
(a) Inheritance from parent to siblings,
(b) Inheritance from sibling to sibling,
(c) Sibling inheritance via sibling pointers.
[Figure: (a) a parent A passes attribute a down to its children B and C; (b) attribute a passes from sibling B to sibling C; (c) sibling-to-sibling inheritance implemented via sibling pointers between the children.]
• Inherited attributes are evaluated by a preorder—
or NLR—traversal, of the parse or syntax tree:
212
Inherited attributes
• Inherited attributes are calculated by an NLR or
preorder traversal of the parse or syntax tree.

void preeval(treenode T) {
  for (each child C of T) {
    compute all inherited attributes of C;
    preeval(C);
  }
}
213
Evaluating inherited attributes
• The decl grammar with semantic rules is as follows.

Grammar rules               Semantic rules
decl → type var-list        var-list.dtype = type.dtype
type → int                  type.dtype = integer
type → float                type.dtype = real
var-list1 → id , var-list2  id.dtype = var-list1.dtype
                            var-list2.dtype = var-list1.dtype
var-list → id               id.dtype = var-list.dtype

• The pseudo code to evaluate the dtype attributes:

void evaltype(treenode T) {
  switch (T->nodekind) {
    case decl:
      evaltype(T->type child);
      T->var-list child->dtype = T->type child->dtype;
      evaltype(T->var-list child);
      break;
    case type:
      if (T->child == int) T.dtype = integer;
      else T.dtype = real;
      break;
    case var-list:
      T->first child in list->dtype = T.dtype;
      if (T->third child in list != NIL) {
        T->third child in list->dtype = T.dtype;
        evaltype(T->third child in list);
      }
      break;
  } // end switch
} // end evaltype
214
NLR traversal for float x,y;
[Figure: the dependency graph for float x,y with its nodes numbered in NLR (preorder) evaluation order—decl first, then type.dtype, the outer var-list.dtype, id (x).dtype, the inner var-list.dtype, and id (y).dtype.]
C code for NLR:
typedef enum {decl, type, id} nodekind;
typedef enum {integer, real} typekind;
typedef struct treeNode {
  nodekind kind;
  struct treeNode *lchild, *rchild, *sibling;
  typekind dtype;   // for type and id nodes
  char *name;       // for id nodes
} treeNode;

void evaltype(treeNode *t) {
  switch (t->kind) {
    case decl:
      t->rchild->dtype = t->lchild->dtype;
      evaltype(t->rchild);
      break;
    case id:
      if (t->sibling != NULL) {
        t->sibling->dtype = t->dtype;
        evaltype(t->sibling);
      }
      break;
  } // end switch
} // end evaltype
215
Attribute grammars—more examples
See examples of calculating attributes for
• the ‘based-num’ grammar
• the ‘exp → exp / exp | num | num.num’ grammar
in Louden pp. 282–284.
216
Other means of handling attributes
• Attribute as parameters/returned values
It sometimes saves storage space to use parameters
and returned values to transfer attribute values—
rather than storing them in the nodes of the syntax
tree record structure. Louden gives a worked ex-
ample of this on pp. 285–287.
• Attribute as external data structures
It may be convenient to use data structures extra
to the symbol table such as another lookup table,
graphs, stacks, etc. Louden has examples on pp.
287–289.
• Computing attributes during parsing
This question is more interesting because, depending
on the grammar, the total effort during compilation
and the effort of constructing the compiler can
be reduced by having fewer passes through the syntax
tree and a shorter compiler. This is discussed
by Louden on pp. 288–295.
217
The symbol table
• The structure of the symbol table
• Declarations
• Scope rules and block structure
• Interaction of same-level declarations
• Attribute grammar using a symbol table
218
The symbol table
• After the syntax tree, the symbol table is the major
inherited attribute in a compiler.
• It is possible, but unnecessary, to delay building
the symbol table until after the parsing has com-
pleted.
• It is easier to build the symbol table as information
becomes available from the scanner and parser.
• The principal operations are:
– insert—for putting properties such as type and
scope into the symbol table,
– lookup—for retrieving attributes of a name in
the table, and
– delete—for removing items from the table.
• Using a hash table for the symbol table is prefer-
able because of its speed for these three operations.
219
The structure of the symbol table
• The hash table implements insertion, lookup and
deletion in O(1) time.
• The greatest disadvantage of using a hash table is
that it cannot produce a lexicographical listing of
its entries.
• Using slower O(log n) methods such as binary
search trees (BSTs), AVL trees or even
B-trees—which can easily display the symbols
alphabetically in O(n) time—is not warranted.
• Closed hash tables where the entries are placed
directly into the table tend to become slower as
they become over 85% full. The tables must then
be resized and rehashed.
• Open hash tables are easier to tune and behave
more consistently as more entries are added.
220
Open hash table
• The performance of an open hash table with m
buckets is easy to tune by increasing the size of m.
• A uniform random mapping of keys to indices i ∈
[0..m−1] is usually good enough.
• A simple hash code is

  h0 = 0
  hi+1 = (α · hi + ci) mod m

where α is a suitable number and ci is some integer
representing a character from the key.
• A variation is to apply the mod once only, after
the final iteration.
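The recurrence above can be written out in C. In this sketch α = 31 and taking the mod at every step are illustrative choices, not prescribed values:

```c
/* h0 = 0; h(i+1) = (alpha*h(i) + c(i)) mod m, one step per character. */
unsigned st_hash(const char *key, unsigned m) {
    const unsigned alpha = 31;   /* a suitable multiplier (assumed) */
    unsigned h = 0;
    for (; *key != '\0'; key++)
        h = (alpha * h + (unsigned char)*key) % m;
    return h;                    /* bucket index in [0..m-1] */
}
```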
221
Open hash table
• It is not unusual to mix predetermined keywords
with transient variable names in one and the same
table.
• Separating keywords and programmer identifiers
unnecessarily complicates the code.
• In the diagram ‘Hsize = m.’
[Figure: an open hash table H of stNode* buckets, indexed 0..Hsize−1; some buckets are empty (Λ) while others hold chains of stNodes containing the identifiers apricots, carrots, marrows, apples, pears and beans.]
222
// symboltable.h
#ifndef SYMBOLTABLE_H
#define SYMBOLTABLE_H
struct stNode;
typedef stNode* stNodeP;
class st {
public:
  bool isEmpty() const;
  // post: return value == nil or true;
  st(); // Constructor
  // post: st created && st.isEmpty() == true;
  //       hashtable H is set up with Hsize elements,
  //       each element H[i] == NULL
  ~st();
  int lookupId(char* id, int idClass, int idState);
  int printSymbolTable();
  int insertId(char* id, int& idClass, int& idState);
  int deleteId(char* id, int idClass, int idState);
  int hash(char* id);
private:
  enum {Hsize = 127};
  st* root;
  stNodeP H[Hsize];
  int noNodes;
};
#endif
223
Declarations
Four kinds of declarations occur frequently:
1. constants, such as
const int SIZE = 199;
2. types, such as
type Table = array [1..SIZE] of Entry;
and struct and union declarations such as
struct Entry {
char *name;
int count;
struct Entry *next;
}
and a typedef mechanism such as
typedef struct Entry *EntryPtr;
3. variables and
4. procedures/functions.
224
Declarations
Four kinds of declarations occur frequently:
1. constants,
2. types,
3. variables, such as

   int a, b[100];          which defines a and b and allocates
                           memory to them;
   static int a, b[100];   declares a variable local to the
                           procedure but one that is not placed on
                           the procedure's stack;
   extern int a, b[100];   tells the compiler that the linker will
                           find this variable allocated and
                           initialized in another module; and
   register int x;         which asks for the variable to be
                           allocated to a register instead of
                           memory.

4. and procedures/functions, which are defined by giving a
body of statements to execute. Prototypes may
also be declared.
225
Scope rules and block structure
• Explicit declaration prior to use helps the programmer
to reduce referencing errors.
• This simplifies the symbol table operations—it is
easy to detect variables that have not been declared,
and it tends to make programming more idiot proof.
• It also enables single-pass compilation.
• Languages where explicit declaration or declaration
before use is not required cannot easily be
compiled in a single pass.
• Block structure leads to ‘older’ variables being
shadowed by ‘more recently’ declared variables of
the same name.
• Although block structure is not a particularly useful
programming feature, it makes it possible to
save some run-time memory at the cost of more
elaborate execution procedures.
226
Scope rules and block structure
int i,j;
int f (int size) {
char i, temp;
... // body of f
{ double j; // block A
...
}
... // body of f
{ char *j; // block B
...
}
} // end of f
The nonlocal int i cannot be reached from the com-pound statement body of the function f and is thussaid to have a scope hole there.
The nonlocal int j can be reached from the compoundstatement body of the function f but not from withineither of the two blocks.
In Pascal and Ada functions can be nested, complicat-ing the run time environment.
227
Scope rules and block structure
The Pascal code below reflects a similar symbol tablestructure as the C example. But that is where thesimilarity ends. Very interesting scoping and accessproblems arise during run time, since f, g and h can becalled by one another in many different ways.
program Ex;
var i, j: integer;

function f(size: integer): integer;
var i, temp: char;

  procedure g;
  var j: real;
  begin
    ...
  end; { g }

  procedure h;
  var j: ^char;
  begin
    ...
  end; { h }

begin { body of f }
  ...
end; { f }

begin { main program }
  ...
end.
228
Scope rules and block structure
• Implementing nested scopes and shadowing, the stInsert operation must not overwrite the previous declarations, but must temporarily hide them from view, such that stLookup will find only shadowing variables.
• Similarly, delete must only remove the shadowing variable and leave the previously hidden variables.
• The shadowing is thus easily implemented.
• See also Slides 4–16 of procedures.ps by Mooly Sagiv and Reinhard Wilhelm, made available on the class web page.
• Some pictorial examples from Louden pp. 304–305 follow.
229
Symbol table contents
• After the declarations of the body of f:
[Hash-table diagram: the buckets of H chain stNodes for f (function), size (int parameter), temp (char), the two declarations of i (f's char i chained in front of the global int i) and the global j (int); drawing lost in extraction.]
• After the declaration of Block B in the body of f:
[Hash-table diagram: as above, but with a char * stNode for block B's j now chained in front of the global int j; drawing lost in extraction.]
• After leaving f and deleting its declarations:
[Hash-table diagram: only the global i (int), j (int) and f (function) remain; drawing lost in extraction.]
230
Using separate tables for each scope
• Using separate tables for each scope:

[Diagram: a separate hash table per scope, each with its own stNode chains — the outermost table holds i, j and f; the table for f's scope holds size, i and temp; and the table for block B holds its char * j; drawing lost in extraction.]
231
Variables in scope holes
• The global integer variable i in the Ex program
on Slide 228 may be accessible by using notation
that defines the scope, i.e. Ex.i
• In this manner the references to the various vari-
ables called j, could be referred to as Ex.j, Ex.f.g.j
or Ex.f.h.j.
• The nesting depth may also be used. Ex has a
nesting depth 0, and f has nesting depth 1, while
g and h both have a nesting depth of 2.
• The scope resolution operator of C++ may also be
used to define the scope of a class declaration:
class A {
  ...
  int f();     // f is a member function
  ...
};

int A::f() {   // this is the definition of f in A
  ...
}
232
Dynamic scope
• Common Lisp normally uses static—also called lexical—scope, but older Lisp implementations often used dynamic scope.
• Variables with dynamic scope are also possible in Common Lisp.
• The following C++ example illustrates the difference:

#include <iostream.h>
int i = 1;
void f(void) {
  cout << "i = " << i << endl;
}
int main(void) {
  int i = 2;
  f();
  return 0;
}

• In a normal C++ program the value printed for ‘i’ will be ‘1’, since C++ uses static scoping.
• If C++ used dynamic scoping, the value printed would be ‘2’.
233
Interaction of same-level declarations
• In a correct C++ compiler the following example
typedef int i;
int i;
should produce a compilation error.
• This kind of error is detected by using the symbol table—the symbol table should not permit a specific variable to be inserted once it is already there at a given level.
234
Interaction of same-level declarations
• In this example
#include <iostream.h>
int i = 1;
void f(void) {
int i = 2, j = i + 1;
cout << "i = " << j << endl;
}
int main(void) {
  f();
  return 0;
}
The value printed for ‘j’ is ‘3’, because the declarations are evaluated sequentially.
• Some languages permit collateral declarations, in which case the value of ‘j’ is derived from the ‘i’ in the outer block, because the inner ‘i’ cannot yet be regarded to exist.
• This kind of declaration is possible in Common Lisp, ML and Scheme.
235
Interaction of same-level declarations
• Yet another possibility is recursive declaration in
which declarations refer to one another.
int gcd(int m, int n) {
if (m==0) return n;
else return gcd(n % m, m);
}
• In this case the name ‘gcd’ must be added to the symbol table before the body of the function is processed, otherwise the function will not be known when the compiler encounters it at the recursive call.
236
Interaction of same-level declarations
• In the case of mutually recursive functions even that is not enough.

void f(void) {
   ... g(); ...
}
void g(void) {
   ... f(); ...
}

Some sort of forward declaration is needed, such as a prototype in C++:

void g(void); // prototype for g()
void f(void) {
   ... g(); ...
}
void g(void) {
   ... f(); ...
}

In Pascal the keyword ‘forward’ is used to predeclare the procedure heading.
237
Data types and type checking
• Type expressions and type constructors.
• Type names, type declarations and recursive types.
• Type equivalence.
• Type inference and type checking.
• Additional topics in type checking.
260
Run-time environments
• Memory organization during program execution.
• Fully static run-time environments.
• Stack-based run-time environments.
• Dynamic memory.
• Parameter passing mechanisms.
300
Code generation
• Intermediate code and data structures for code
generation.
• Basic code generation techniques.
• Code generation of data structure references.
• Code generation of control statements.
• Code generation of procedure and function calls.
• Code generation in commercial compilers.
• TM: a simple target machine.
• A survey of code optimization techniques.
320
Code generation
• Generate executable code for a target machine
• Executable code depends on:
1. source language,
2. runtime environment,
3. target machine and the
4. operating system.
• May produce assembler output or relocatable bi-
nary, necessitating an assembler and a linker.
• Code should be optimized.
• Use an intermediate representation (IR) such as a
parse tree, or produce code directly.
• Our compiler produces directly executable P-code.
321
Intermediate representation (IR)
We will discuss two popular forms of intermediate code:
• The principal IR is the abstract syntax tree (AST).
• The AST relies heavily upon the symbol table.
• The AST does not resemble target code closely enough.
• Forms of intermediate code that are closer to target
machines are:
– 3-Address code
– P-code
• Data structures for implementing 3-address code
– Usually represented as quads.
– Often avoided because an extra compilation
pass is needed to produce final target code.
– Discussion is merited because it assists in un-
derstanding code generation.
322
3-Address code
• The general form of an arithmetic operation is
x = y op z
The obvious semantics apply, namely: x must be an L-value, while y and z may be either L-values or R-values; R-values, such as constants, have no run-time addresses.
• The expression 2*a+(b-3) has the syntax tree

          +
         / \
        *   -
       / \ / \
      2  a b  3
• The expression 2*a+(b-3) may be translated into
t1 = 2 * a
t2 = b - 3
t3 = t1 + t2
323
A simple program with its 3-address code
read x; { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end
read x
t1 = x > 0
if_false t1 goto L1
fact = 1
label L2
t2 = fact * x
fact = t2
t3 = x - 1
x = t3
t4 = x == 0
if_false t4 goto L2
write fact
label L1
halt
324
Represented as triples
read x; { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end
(0)  (rd,x,_)
(1)  (gt,x,0)
(2)  (if_f,(1),(11))
(3)  (asn,1,fact)
(4)  (mul,fact,x)
(5)  (asn,(4),fact)
(6)  (sub,x,1)
(7)  (asn,(6),x)
(8)  (eq,x,0)
(9)  (if_f,(8),(4))
(10) (wri,fact,_)
(11) (halt,_,_)
325
Represented as quads
read x; { input an integer }
if 0 < x then { don't compute if x <= 0 }
  fact := 1;
  repeat
    fact := fact * x;
    x := x - 1
  until x = 0;
  write fact { output factorial of x }
end
(rd,x,_,_)
(gt,x,0,t1)
(if_f,t1,L1,_)
(asn,1,fact,_)
(lab,L2,_,_)
(mul,fact,x,t2)
(asn,t2,fact,_)
(sub,x,1,t3)
(asn,t3,x,_)
(eq,x,0,t4)
(if_f,t4,L2,_)
(wri,fact,_,_)
(lab,L1,_,_)
(halt,_,_,_)
326
3-address code with P-code equivalent
• The expression 2*a+(b-3) may be translated into
t1 = 2 * a
t2 = b - 3
t3 = t1 + t2
• This code may be translated into:
ldc a A(a) ; t1 = 2 * a
ldi i
ldc i 2
mul        ; leave value of t1 on stack
ldc a A(b) ; t2 = b - 3
ldi i
ldc i 3
sub        ; leave value of t2 on stack
add        ; t3 = t1 + t2, t3 on stack
327
Intermediate code as a synthesized attribute
(x=x+3)+4
lda X
lod X
ldc 3
adi
stn
ldc 4
adi
Grammar Rule              Semantic Rules
exp1 → id = exp2          exp1.pcode = "lda "||id.strval ++ exp2.pcode ++ "stn"
exp → aexp                exp.pcode = aexp.pcode
aexp1 → aexp2 + factor    aexp1.pcode = aexp2.pcode ++ factor.pcode ++ "adi"
aexp → factor             aexp.pcode = factor.pcode
factor → ( exp )          factor.pcode = exp.pcode
factor → num              factor.pcode = "ldc "||num.strval
factor → id               factor.pcode = "lod "||id.strval
328
Practical code generation: genCode
procedure genCode(T: treenode);
begin
  if T is not nil then
    generate code to prepare for the code of the left child of T;
    genCode(left child of T);
    generate code to prepare for the code of the right child of T;
    genCode(right child of T);
    generate code to implement the action of T;
end;

enum Optype {Plus, Assign};
enum NodeKind {OPKind, ConstKind, IdKind};
typedef struct streenode {
  NodeKind kind;
  Optype op;                         // used with OPKind
  struct streenode *lchild, *rchild;
  int val;                           // used with ConstKind
  char *strval;                      // used for identifiers and numbers
} STreeNode;
typedef STreeNode *SyntaxTree;
329
Practical code generation: genCode
void genCode(SyntaxTree t) {
  char codestr[CODESIZE];  // CODESIZE = max length of a P-code line
  if (t != NULL) {
    switch (t->kind) {
    case OPKind:
      switch (t->op) {
      case Plus:
        genCode(t->lchild);
        genCode(t->rchild);
        emitCode("adi");
        break;
      case Assign:
        sprintf(codestr, "%s %s", "lda", t->strval);
        emitCode(codestr);
        genCode(t->lchild);
        emitCode("stn");
        break;
      default:
        emitCode("Error");
        break;
      }
      break;
    case ConstKind:
      sprintf(codestr, "%s %s", "ldc", t->strval);
      emitCode(codestr);
      break;
    case IdKind:
      sprintf(codestr, "%s %s", "lod", t->strval);
      emitCode(codestr);
      break;
    default:
      emitCode("Error");
      break;
    }
  }
}
330
Practical code generation: Bison
%{
#define YYSTYPE char *
/* make Bison/Yacc use strings as values */
/* other inclusion code ... */
%}
%token NUM ID
%%
exp : ID { sprintf(codestr, "%s %s", "lda", $1);
           emitCode(codestr); }
      '=' exp { emitCode("stn"); }
    | aexp
    ;
aexp : aexp '+' factor { emitCode("adi"); }
     | factor
     ;
factor : '(' exp ')'
       | NUM { sprintf(codestr, "%s %s", "ldc", $1);
               emitCode(codestr); }
       | ID { sprintf(codestr, "%s %s", "lod", $1);
              emitCode(codestr); }
       ;
%%
/* utility functions ... */
331
expbig1.latex

[Parse-tree diagram: exp ⇒ exp op exp, where the left exp derives a parenthesized subtraction of two numbers and the right exp derives a number; the operators are - and *. Drawing lost in extraction.]

392
expbig2.latex
1
34 2
5
678
✟✟✟✟✟✟✟✟✟✟✟
❍❍❍❍❍❍❍❍❍❍❍
✟✟✟✟✟✟✟✟✟✟✟
❍❍❍❍❍❍❍❍❍❍❍
✱✱
✱✱
✱✱
❧❧❧❧
❧❧
exp exp
number
exp
exp op exp
number
exp
number
*
-
)
op
(
393
exp-ambigous1.latex
✏✏✏✏✏✏✏✏✏✏✏✏✏✏✏✏
✟✟✟✟✟✟✟✟✟✟✟
❍❍❍❍❍❍❍❍❍❍❍
❍❍❍❍❍❍❍❍❍❍❍
exp op exp
number
exp
number-
number
exp
exp
*
op
396