TRANSCRIPT
COMP3190: Principle of Programming Languages
Formal Language Syntax
- 2 -
Motivation
The problem of parsing structured text is very common. Consider the structure of email addresses, described using a grammar:
<emailAddress> := <person> @ <host>
<person> := <word>
<host> := <word> | <word>.<host>
Goal: describe and recognize email addresses in arbitrary text.
- 3 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 4 -
Deterministic Finite Automata (DFA)
Q: a finite set of states
Σ: a finite set of “letters” (the alphabet)
δ: Q × Σ → Q (the transition function)
q0: the start state (in Q)
F: the set of accept states (a subset of Q)
Acceptance: the input is consumed with the automaton in a final state.
- 5 -
Example of DFA
(State diagram: two states, q1 the start state and q2 the accept state.)

δ    0    1
q1   q1   q2
q2   q1   q2
Accepts all strings that end in 1
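The acceptance check is mechanical; a minimal Python sketch (not from the slides) that runs the transition table above:

```python
def run_dfa(s):
    """Simulate the two-state DFA above on a string of '0'/'1' symbols."""
    delta = {"q1": {"0": "q1", "1": "q2"},
             "q2": {"0": "q1", "1": "q2"}}
    state = "q1"                  # q1 is the start state
    for ch in s:
        state = delta[state][ch]  # one table lookup per input symbol
    return state == "q2"          # q2 is the only accept state
```

The run time is one table lookup per input symbol, which is part of why DFAs make good scanners.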
- 6 -
Another Example of a DFA
(State diagram: from the start state S, an “a” leads into states q1/q2 and a “b” leads into states r1/r2; the q states track strings that start with “a” (accepting when the string currently ends with “a”), and the r states symmetrically track strings that start with “b”.)
Accepts all strings that start and end with “a” OR start and end with “b”
- 7 -
Non-deterministic Finite Automata (NFA)
The transition function is different: δ: Q × Σε → P(Q),
where P(Q) is the powerset of Q (the set of all subsets of Q) and Σε is the union of Σ and the special symbol ε (denoting the empty string).
A string is accepted if there is at least one path that consumes the entire input and leads to an accept state.
- 8 -
Example of an NFA
(State diagram: states q1 through q4, with q1 the start state and q4 accepting.)

δ    0      1          ε
q1   {q1}   {q1, q2}   ∅
q2   {q3}   ∅          {q3}
q3   ∅      {q4}       ∅
q4   {q4}   {q4}       ∅
What strings does this NFA accept?
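One way to answer is to simulate the NFA. A Python sketch (the transition table transcribed from the slide; "" stands for ε); the simulation suggests it accepts the strings containing 11 or 101 as a substring:

```python
def eps_closure(states, delta):
    """All states reachable from `states` via ε-edges alone."""
    stack, result = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()):   # "" stands for ε
            if r not in result:
                result.add(r); stack.append(r)
    return result

def nfa_accepts(s):
    # Transition table from the slide; "" denotes ε.
    delta = {("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
             ("q2", "0"): {"q3"}, ("q2", ""): {"q3"},
             ("q3", "1"): {"q4"},
             ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"}}
    current = eps_closure({"q1"}, delta)
    for ch in s:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = eps_closure(moved, delta)
    return "q4" in current        # q4 is the accept state
```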
- 9 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 10 -
Regular Expressions
R is a regular expression if R is:
- “a” for some a in Σ,
- ε (the empty string),
- ∅ (the empty language),
- the union of two regular expressions,
- the concatenation of two regular expressions, or
- R1* (Kleene closure: zero or more repetitions of R1).
- 11 -
Regular Expression Notation
a: an ordinary letter
ε: the empty string
M | N: choosing from M or N
MN: concatenation of M and N
M*: zero or more times (Kleene star)
M+: one or more times
M?: zero or one occurrence
[a-zA-Z]: character set alternation (choice)
. : the period stands for any single character except newline
- 12 -
Examples of Regular Expressions
{0,1}* 0 — all strings that end in 0
(1|0) 0* — strings that start with 1 or 0, followed by zero or more 0s
{0,1}* — all strings
{0ⁿ1ⁿ | n ≥ 0} — not a regular expression!!!
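The first three languages can be checked mechanically with Python's re module (a sketch; the character-class encodings of the set notation are my own):

```python
import re

ends_in_0  = re.compile(r"[01]*0")   # {0,1}* 0
head_zeros = re.compile(r"[01]0*")   # (1|0) 0*
anything   = re.compile(r"[01]*")    # {0,1}*
```

`pattern.fullmatch(s)` asks whether the whole string s is in the language; the fourth language, {0ⁿ1ⁿ}, cannot be written this way at all.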
- 13 -
Converting a Regular Expression to an NFA
(Diagrams: Thompson's construction. Base case: an NFA for a single letter a. M|N: a new start state with ε-edges into M and into N. MN: ε-edges linking M's accept states to N's start state. M*: ε-edges that allow zero or more passes through M.)
- 14 -
Regular expression → NFA
Language: Strings of 0s and 1s in which the number of 0s is even
Regular expression: (1*01*0)*1*
- 15 -
Converting an NFA to a DFA
For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input (i.e., via ε-edges).
For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming the input symbol c.
Each set of NFA states corresponds to one DFA state (hence at most 2^n DFA states for an NFA with n states).
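These two operations are enough to implement the conversion; a generic Python sketch (the representation and helper names are my own):

```python
def nfa_to_dfa(nfa_delta, start, accepts, alphabet):
    """Subset construction: nfa_delta maps (state, symbol-or-'') to a set of states."""
    def closure(S):
        stack, out = list(S), set(S)
        while stack:
            q = stack.pop()
            for r in nfa_delta.get((q, ""), set()):   # "" stands for ε
                if r not in out:
                    out.add(r); stack.append(r)
        return frozenset(out)

    def dfa_edge(S, c):
        moved = set()
        for q in S:
            moved |= nfa_delta.get((q, c), set())
        return closure(moved)

    start_set = closure({start})
    states, worklist, table = {start_set}, [start_set], {}
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = dfa_edge(S, c)
            table[(S, c)] = T
            if T not in states:
                states.add(T); worklist.append(T)
    final = {S for S in states if S & set(accepts)}
    return start_set, table, final
```

Each DFA state is a frozenset of NFA states, which is exactly the "set of NFA states" idea above.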
- 16 -
NFA → DFA
Initial classes: {A, B, E}, {C, D}
No class requires partitioning!
Hence a two-state DFA is obtained.
- 17 -
Obtaining the minimal equivalent DFA
Initially two equivalence classes: final and nonfinal states.
Search for an equivalence class C and an input letter a such that, with a as input, the states in C make transitions to states in k > 1 different equivalence classes.
Partition C into k classes accordingly. Repeat until no class can be partitioned further.
- 18 -
Example (cont.)
- 19 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 20 -
Regular Grammar
Later definitions build on earlier ones; nothing is defined in terms of itself (no recursion).
Regular grammar for numeric literals in Pascal:
digit → 0 | 1 | 2 | ... | 8 | 9
unsigned_integer → digit digit*
unsigned_number → unsigned_integer ( ( . unsigned_integer ) | ε ) ( ( e ( + | - | ε ) unsigned_integer ) | ε )
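Because the grammar is regular, it collapses into a single regular expression; a Python sketch checking it (the pattern is my transcription of the productions above):

```python
import re

# digit digit* ( . digit digit* )? ( e (+|-)? digit digit* )?
unsigned_number = re.compile(r"\d\d*(\.\d\d*)?(e(\+|-)?\d\d*)?")

def is_unsigned_number(s):
    """Whole-string membership test for the Pascal numeric-literal grammar."""
    return unsigned_number.fullmatch(s) is not None
```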
- 21 -
Languages and Automata in Programming Languages
Regular languages
» Recognized (accepted) by finite automata
» Useful for tokenizing program text (lexical analysis)
Context-free languages
» Recognized (accepted) by pushdown automata
» Useful for parsing the syntax of a program
- 22 -
Important Theorems
A language is regular iff some regular expression describes it.
A language is regular iff some finite automaton recognizes it.
DFAs and NFAs are equally powerful.
- 23 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 24 -
Context-free Grammars
Context-free grammars are defined by substitution rules
Example derivations: “Big Jim ate green cheese”, “Green Jim ate green cheese”, “Jim ate cheese”, “Cheese ate Jim”.

S → P V P
P → N
P → A P
A → big | green
N → cheese | Jim
V → ate
- 25 -
Context-free Grammars
Context-free grammars are used to formally describe the syntax of programming languages.
Every syntactically correct program is derived using the context-free grammar of the language.
Parsing a program involves tracing such derivation, given the context-free grammar and the program.
- 26 -
Context-free Grammars
A context-free grammar consists of
V: a finite set of variables
Σ: a finite set of terminals
R: a finite set of rules of the form variable → {variable, terminal}*
S: the start variable
- 27 -
Pushdown Automata (PDA)
A pushdown automaton consists of
Q: a set of states
Σ: the input alphabet (of terminals)
Γ: the stack alphabet
δ: a set of transition rules, Q × Σε × Γε → P(Q × Γε)
  (currentState, inputSymbol, headOfStack) → (newState, symbolPushedOnStack)
q0: the start state
F: the set of accept states (a subset of Q)
Deterministic: at most one move is possible from any configuration.
- 28 -
How does a PDA accept?
By final state:
» consume all the input, while
» reaching a final state.
By empty stack:
» consume all the input, while
» having an empty stack;
» the set of final states is irrelevant.
- 29 -
Example of a PDA
(Diagram: q1 —[ε, ε→$]→ q2; q2 loops on [0, ε→0]; q2 —[1, 0→ε]→ q3; q3 loops on [1, 0→ε]; q3 —[ε, $→ε]→ q4, the accept state.)

Notation a, b→c: when the PDA reads “a” from the input, it replaces “b” at the top of the stack with “c”.
What does this PDA accept?
- 30 -
Important Theorems
A language is context-free iff a pushdown automaton recognizes it.
Non-deterministic PDAs are more powerful than deterministic ones.
- 31 -
Example of Context-free Language That Requires a Non-deterministic PDA
{w wR | w ∈ {0, 1}*}, where wR is w written backwards.
Idea:
Non-deterministically guess the middle of the input string
- 32 -
The Solution
(Diagram: q1 —[ε, ε→$]→ q2; q2 loops on [0, ε→0] and [1, ε→1]; q2 —[ε, ε→ε]→ q3, the non-deterministic guess at the middle; q3 loops on [1, 1→ε] and [0, 0→ε]; q3 —[ε, $→ε]→ q4, the accept state.)
- 33 -
Derivations and Parse Trees
Nested constructs require recursion, i.e. context-free grammars
CFG for arithmetic expressions
expression → identifier | number | - expression | ( expression ) | expression operator expression
operator → + | - | * | /
- 34 -
Parse Tree for Slope*x + Intercept
Is this the only parse tree for this expression and grammar?
- 35 -
A Better Expression Grammar
1. expression → term | expression add_op term
2. term → factor | term mult_op factor
3. factor → identifier | number | - factor | ( expression )
4. add_op → + | -
5. mult_op → * | /
A good grammar reflects the internal structure of programs.
This grammar is unambiguous and captures (HOW?):
- operator precedence (*, / bind tighter than +, -)
- associativity (operators group left to right)
- 36 -
And Better Parse Trees...
3 + 4 * 5
10 - 4 - 3
- 37 -
Syntax-directed Compilation
The parser calls the scanner to obtain tokens, assembles the tokens into a parse tree, and passes the tree to the later phases of compilation.
Scanner: a deterministic finite automaton. Parser: a pushdown automaton.
Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g., lex/yacc).
- 38 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 39 -
Scanning
Accept the longest possible token in each invocation of the scanner.
Implementation:
» capture the finite automaton with
  - case (switch) statements, or
  - a table and driver.
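A sketch of the longest-match rule with a tiny, made-up token set (the patterns are illustrative, not Pascal's actual lexical definition):

```python
import re

# Illustrative token patterns (not Pascal's real token set).
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT",  re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("OP",     re.compile(r":=|[+\-*/=<>]")),   # ':=' tried before single chars
]

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try every pattern at position i; keep the longest match (maximal munch).
        best = None
        for kind, pat in TOKEN_PATTERNS:
            m = pat.match(text, i)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (kind, m.group())
        if best is None:
            raise ValueError(f"scan error at {i}")
        tokens.append(best)
        i += len(best[1])
    return tokens
```

Longest-match is why `:=` is scanned as one token rather than `:` followed by `=`.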
- 40 -
Scanner for Pascal
- 41 -
Scanner for Pascal (Case Statements)
- 42 -
Scanner (Table & Driver)
- 43 -
Scanner Generators
Start with a regular expression.
Construct an NFA from it.
Use the subset construction to obtain an equivalent DFA.
Construct the minimal equivalent DFA.
- 44 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 45 -
Parsing approaches
Parsing in general has O(n³) cost. We need classes of grammars that can be parsed in linear time:
» top-down parsing, also called predictive parsing, recursive-descent parsing, or LL parsing (Left-to-right scan, Left-most derivation);
» bottom-up parsing, also called shift-reduce parsing or LR parsing (Left-to-right scan, Right-most derivation).
- 46 -
A Simple Grammar for a Comma-separated List of Identifiers
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
String to be parsed: A, B, C;
- 47 -
Top-down/bottom-up Parsing
- 48 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 49 -
Top-down Parsing
Predicts a derivation.
Matches each non-terminal against the token observed in the input.
- 50 -
LL(1) Grammar
A grammar for which a top-down deterministic parser can be produced with one token of look-ahead.
LL(1) grammar:
» for a given non-terminal, the lookahead symbol uniquely determines the production to apply;
» top-down parsing = predictive parsing;
» driven by a predictive parsing table mapping (non-terminal, terminal) pairs to productions.
- 51 -
From Last Time: Parsing with Table
Partly-derived string    Lookahead    Parsed | unparsed input
E S'                     (            | (1+2+(3+4))+5
( S ) S'                 1            ( | 1+2+(3+4))+5
( E S' ) S'              1            ( | 1+2+(3+4))+5
( 1 S' ) S'              +            (1 | +2+(3+4))+5
( 1 + E S' ) S'          2            (1+ | 2+(3+4))+5
( 1 + 2 S' ) S'          +            (1+2 | +(3+4))+5

Grammar:
S → E S'
S' → ε | + S
E → num | ( S )

Parsing table:
      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 52 -
How to Construct Parsing Tables?
Needed: an algorithm for automatically generating a predictive parse table from a grammar.

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)

??
- 53 -
Constructing Parse Tables
We can construct a predictive parser if:
» for every non-terminal, every lookahead symbol can be handled by at most 1 production.
FIRST(α), for an arbitrary string α of terminals and non-terminals, is:
» the set of symbols that might begin the fully expanded version of α.
FOLLOW(X), for a non-terminal X, is:
» the set of symbols that might follow the derivation of X in the input stream.
- 54 -
Parse Table Entries
Consider a production X → α:
» add X → α to the X row for each symbol in FIRST(α);
» if α can derive ε (α is nullable), add X → α for each symbol in FOLLOW(X).
The grammar is LL(1) if there are no conflicting entries.

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 55 -
Computing Nullable
X is nullable if it can derive the empty string:
» if it derives ε directly (X → ε), or
» if it has a production X → YZ... where all RHS symbols (Y, Z, ...) are nullable.
Algorithm: assume all non-terminals are non-nullable; apply the rules repeatedly until nothing changes.

S → E S'
S' → ε | + S
E → number | ( S )

Only S' is nullable.
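The fixpoint algorithm above is a few lines of code; a Python sketch (the grammar encoding is my own; [] encodes an ε-production):

```python
# Fixed-point computation of nullable for the sum grammar from the slide.
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],          # [] is the ε-production
    "E":  [["num"], ["(", "S", ")"]],
}

def compute_nullable(grammar):
    nullable = set()                 # assume nothing is nullable at first
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            if x in nullable:
                continue
            for rhs in rhss:
                # X is nullable if every RHS symbol is a nullable non-terminal.
                if all(sym in nullable for sym in rhs):
                    nullable.add(x); changed = True
                    break
    return nullable
```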
- 56 -
Computing FIRST Determining FIRST(X)
1. If X is a terminal, then add X to FIRST(X).
2. If X → ε, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2...Yk, then a is in FIRST(X) if a is in FIRST(Yi) and ε is in FIRST(Yj) for j = 1...i-1 (i.e., it is possible to have an empty prefix Y1...Yi-1).
4. If ε is in FIRST(Y1Y2...Yk), then ε is in FIRST(X).
- 57 -
FIRST Example
S → E S'
S' → ε | + S
E → number | ( S )

Apply rule 1: FIRST(num) = {num}, FIRST(+) = {+}, etc.
Apply rule 2: FIRST(S') = {ε}
Apply rule 3: FIRST(S) = FIRST(E) = {}
  FIRST(S') = FIRST('+') ∪ {ε} = {ε, +}
  FIRST(E) = FIRST(num) ∪ FIRST('(') = {num, (}
Rule 3 again: FIRST(S) = FIRST(E) = {num, (}
  FIRST(S') = {ε, +}
  FIRST(E) = {num, (}
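Rules 1-4 can be run to a fixpoint mechanically; a Python sketch for the same grammar (the encoding is my own; "" stands for ε):

```python
# Fixed-point computation of FIRST for the sum grammar (a sketch; "" is ε).
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E":  [["num"], ["(", "S", ")"]],
}
TERMINALS = {"num", "+", "(", ")"}

def compute_first(grammar):
    first = {t: {t} for t in TERMINALS}      # rule 1
    for x in grammar:
        first[x] = set()
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                if not rhs:                  # rule 2: X -> ε
                    if "" not in first[x]:
                        first[x].add(""); changed = True
                    continue
                for y in rhs:                # rule 3: scan Y1 Y2 ... Yk
                    added = (first[y] - {""}) - first[x]
                    if added:
                        first[x] |= added; changed = True
                    if "" not in first[y]:   # Yi is not nullable: stop
                        break
                else:                        # rule 4: all Yi were nullable
                    if "" not in first[x]:
                        first[x].add(""); changed = True
    return first
```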
- 58 -
Computing FOLLOW
Determining FOLLOW(X):
1. If S is the start symbol, then $ is in FOLLOW(S).
2. If A → αBβ, then add FIRST(β) − {ε} to FOLLOW(B).
3. If A → αB, or A → αBβ where ε is in FIRST(β), then add FOLLOW(A) to FOLLOW(B).
- 59 -
FOLLOW Example
S → E S'
S' → ε | + S
E → number | ( S )

FIRST(S) = {num, (}
FIRST(S') = {ε, +}
FIRST(E) = {num, (}

Apply rule 1: FOLLOW(S) = {$}
Apply rule 2: S → E S' gives FOLLOW(E) += FIRST(S') − {ε} = {+}
  E → ( S ) gives FOLLOW(S) += FIRST(')') = {$, )}
Apply rule 3: S → E S' gives FOLLOW(E) += FOLLOW(S) = {+, $, )} (because S' is nullable)
  FOLLOW(S') += FOLLOW(S) = {$, )}
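The FOLLOW rules are another fixpoint; a Python sketch reusing the FIRST sets just computed (the encoding is my own; "" stands for ε):

```python
# Fixed-point computation of FOLLOW for the sum grammar (a sketch; "" is ε).
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E":  [["num"], ["(", "S", ")"]],
}
FIRST = {"S": {"num", "("}, "S'": {"", "+"}, "E": {"num", "("},
         "num": {"num"}, "+": {"+"}, "(": {"("}, ")": {")"}}

def first_of_seq(seq):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        out |= FIRST[sym] - {""}
        if "" not in FIRST[sym]:
            return out
    out.add("")                                  # whole sequence was nullable
    return out

def compute_follow(grammar, start="S"):
    follow = {x: set() for x in grammar}
    follow[start].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                for i, b in enumerate(rhs):
                    if b not in grammar:         # only non-terminals get FOLLOW
                        continue
                    fb = first_of_seq(rhs[i + 1:])
                    added = (fb - {""}) - follow[b]          # rule 2
                    if added:
                        follow[b] |= added; changed = True
                    if "" in fb:                             # rule 3
                        added = follow[a] - follow[b]
                        if added:
                            follow[b] |= added; changed = True
    return follow
```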
- 60 -
Putting It All Together
FOLLOW(S) = {$, )}
FOLLOW(S') = {$, )}
FOLLOW(E) = {+, ), $}
FIRST(S) = {num, (}
FIRST(S') = {ε, +}
FIRST(E) = {num, (}

Consider a production X → α:
Add X → α to the X row for each symbol in FIRST(α).
If α can derive ε (α is nullable), add X → α for each symbol in FOLLOW(X).

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 61 -
Ambiguous Grammars
Constructing the predictive parse table for an ambiguous grammar results in conflicts in the table (i.e., 2 or more productions to apply in the same cell).

S → S + S | S * S | num

FIRST(S+S) = FIRST(S*S) = FIRST(num) = {num}
- 62 -
Class Problem
E → E + T | T
T → T * F | F
F → ( E ) | num

1. Compute the FIRST and FOLLOW sets for this grammar.
2. Compute the parse table entries.
- 63 -
Top-Down Parsing Up to This Point
Now we know:
» how to build the parsing table for an LL(1) grammar (i.e., FIRST/FOLLOW);
» how to construct a recursive-descent parser from the parsing table;
» call tree = parse tree.
Open question: can we generate the AST?
- 64 -
Creating the Abstract Syntax Tree
Some class definitions to assist with AST construction:

class Expr {}
class Add extends Expr {
  Expr left, right;
  Add(Expr L, Expr R) { left = L; right = R; }
}
class Num extends Expr {
  int value;
  Num(int v) { value = v; }
}

Class hierarchy: Expr is the root, with Num and Add as its subclasses.
- 65 -
Creating the AST
(Figure: parse tree for (1 + 2 + (3 + 4)) + 5 under the sum grammar, together with the corresponding AST Add(Add(1, Add(2, Add(3, 4))), 5).)

• We got the parse tree from the call tree.
• Just add code to each parsing routine to create the appropriate nodes.
• This works because the parse tree and the call tree have the same shape, and the AST is just a compressed form of the parse tree.
- 66 -
AST Creation: parse_E
Expr parse_E() {
  switch (token) {          // token is the lookahead token
    case num:               // E → number
      Expr result = new Num(token.value);
      token = input.read();
      return result;
    case '(':               // E → ( S )
      token = input.read();
      Expr result = parse_S();
      if (token != ')') ParseError();
      token = input.read();
      return result;
    default:
      ParseError();
  }
}

Grammar:
S → E S'
S' → ε | + S
E → number | ( S )
- 67 -
AST Creation: parse_S
Expr parse_S() {
  switch (token) {
    case num: case '(':     // S → E S'
      Expr left = parse_E();
      Expr right = parse_S'();
      if (right == NULL) return left;
      else return new Add(left, right);
    default:
      ParseError();
  }
}

Grammar:
S → E S'
S' → ε | + S
E → number | ( S )
- 68 -
Grammars
We have been using a grammar for the language “sums with parentheses”, e.g. (1+2+(3+4))+5.
We started with a simple, right-associative grammar:
» S → E + S | E
» E → num | ( S )
and transformed it into an LL(1) grammar by left factoring:
» S → E S'
» S' → ε | + S
» E → num | ( S )
What if we start with a left-associative grammar?
» S → S + E | E
» E → num | ( S )
- 69 -
Reminder: Left vs Right Associativity
(Figures: right recursion (S → E + S | E; E → num) yields the right-associative tree 1 + (2 + (3 + 4)); left recursion (S → S + E | E; E → num) yields the left-associative tree ((1 + 2) + 3) + 4.)

Right recursion: right-associative.
Left recursion: left-associative.

Consider a simpler string on a simpler grammar: “1 + 2 + 3 + 4”.
- 70 -
Left Recursion
derived string    lookahead    read/unread input
S                 1            1+2+3+4
S+E               1            1+2+3+4
S+E+E             1            1+2+3+4
S+E+E+E           1            1+2+3+4
E+E+E+E           1            1+2+3+4
1+E+E+E           2            1+2+3+4
1+2+E+E           3            1+2+3+4
1+2+3+E           4            1+2+3+4
1+2+3+4           $            1+2+3+4

Is this right? If not, what's the problem?

S → S + E | E
E → num

“1 + 2 + 3 + 4”
- 71 -
Left-Recursive Grammars
Left-recursive grammars don't work with top-down parsers: we don't know when to stop the recursion.
Left-recursive grammars are NOT LL(1)!
» S → S α
» S → β
In the parse table, both productions will appear at row S in all the columns corresponding to FIRST(β).
- 72 -
Eliminate Left Recursion
Replace
» X → X α1 | ... | X αm
» X → β1 | ... | βn
with
» X → β1 X' | ... | βn X'
» X' → α1 X' | ... | αm X' | ε
See the complete algorithm in the Dragon book.
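The replacement scheme above, restricted to immediate left recursion, is short to code; a Python sketch (the representation is my own; [] encodes ε):

```python
# Eliminate *immediate* left recursion for one non-terminal X:
#   X -> X a1 | ... | X am | b1 | ... | bn
# becomes
#   X  -> b1 X' | ... | bn X'
#   X' -> a1 X' | ... | am X' | ε
# (The Dragon-book algorithm also handles indirect left recursion.)
def eliminate_left_recursion(x, rhss):
    x_new = x + "'"
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == x]
    betas  = [rhs for rhs in rhss if not rhs or rhs[0] != x]
    if not alphas:
        return {x: rhss}                  # no left recursion: nothing to do
    return {
        x:     [beta + [x_new] for beta in betas],
        x_new: [alpha + [x_new] for alpha in alphas] + [[]],  # [] is ε
    }
```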
- 73 -
Class Problem
Transform the following grammar to eliminate left recursion:

E → E + T | T
T → T * F | F
F → ( E ) | num
- 74 -
Creating an LL(1) Grammar
Start with a left-recursive grammar:
  S → S + E
  S → E
» and apply the left-recursion elimination algorithm:
  S → E S'
  S' → + E S' | ε
Or start with a right-recursive grammar:
  S → E + S
  S → E
» and apply left factoring to eliminate common prefixes:
  S → E S'
  S' → + S | ε
- 75 -
Top-Down Parsing Summary
Language grammar
  → (left-recursion elimination, left factoring) →
LL(1) grammar
  → (FIRST, FOLLOW) →
predictive parsing table
  →
recursive-descent parser
  →
parser with AST generation
- 76 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 77 -
New Topic: Bottom-Up Parsing
A more powerful parsing technology. LR grammars are more expressive than LL:
» they construct a right-most derivation of the program;
» they allow left-recursive grammars (virtually all programming languages are left-recursive);
» it is easier to express syntax.
Shift-reduce parsers:
» parsers for LR grammars;
» automatic parser generators (yacc, bison).
- 78 -
Bottom-Up Parsing (2)
Right-most derivation, backward:
» start with the tokens;
» end with the start symbol;
» match a substring against the RHS of a production and replace it with the LHS.

S → S + E | E
E → num | ( S )

(1+2+(3+4))+5 reduces step by step:
(1+2+(3+4))+5, (E+2+(3+4))+5, (S+2+(3+4))+5, (S+E+(3+4))+5, (S+(3+4))+5, (S+(E+4))+5, (S+(S+4))+5, (S+(S+E))+5, (S+(S))+5, (S+E)+5, (S)+5, E+5, S+E, S
- 79 -
Shift-Reduce Parsing
Parsing actions: a sequence of shift and reduce operations.
Parser state: a stack of terminals and non-terminals (grows to the right).
Current derivation step = stack + unconsumed input.

Derivation step    Stack    Unconsumed input
(1+2+(3+4))+5               (1+2+(3+4))+5
(E+2+(3+4))+5      (E       +2+(3+4))+5
(S+2+(3+4))+5      (S       +2+(3+4))+5
(S+E+(3+4))+5      (S+E     +(3+4))+5
...
- 80 -
Shift-Reduce Actions
Parsing is a sequence of shifts and reduces.
Shift: move the look-ahead token to the stack.
Reduce: replace the symbols β at the top of the stack with the non-terminal X corresponding to the production X → β (i.e., pop β, push X).

Stack    Input           Action
(        1+2+(3+4))+5    shift 1
(1       +2+(3+4))+5

Stack    Input        Action
(S+E     +(3+4))+5    reduce S → S + E
(S       +(3+4))+5
- 81 -
Shift-Reduce Parsing
derivation        stack    input            action
(1+2+(3+4))+5              (1+2+(3+4))+5    shift
(1+2+(3+4))+5     (        1+2+(3+4))+5     shift
(1+2+(3+4))+5     (1       +2+(3+4))+5      reduce E → num
(E+2+(3+4))+5     (E       +2+(3+4))+5      reduce S → E
(S+2+(3+4))+5     (S       +2+(3+4))+5      shift
(S+2+(3+4))+5     (S+      2+(3+4))+5       shift
(S+2+(3+4))+5     (S+2     +(3+4))+5        reduce E → num
(S+E+(3+4))+5     (S+E     +(3+4))+5        reduce S → S+E
(S+(3+4))+5       (S       +(3+4))+5        shift
(S+(3+4))+5       (S+      (3+4))+5         shift
(S+(3+4))+5       (S+(     3+4))+5          shift
(S+(3+4))+5       (S+(3    +4))+5           reduce E → num
...

S → S + E | E
E → num | ( S )
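The shift-reduce loop itself is tiny; what is hard is choosing the action. A Python sketch in which a naive policy (reduce eagerly, trying the rules in a fixed order) happens to work for this particular grammar; in general that choice is exactly the action-selection problem discussed next:

```python
# Naive shift-reduce loop for the sum grammar (a sketch, not a general parser).
RULES = [                      # (LHS, RHS), tried in this order
    ("E", ["num"]),
    ("E", ["(", "S", ")"]),
    ("S", ["S", "+", "E"]),
    ("S", ["E"]),              # tried last, after S -> S + E
]

def parse(tokens):
    stack, rest, trace = [], list(tokens), []
    while True:
        for lhs, rhs in RULES:
            if stack[len(stack) - len(rhs):] == rhs:
                del stack[len(stack) - len(rhs):]      # pop the RHS ...
                stack.append(lhs)                      # ... push the LHS
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not rest:
                break
            stack.append(rest.pop(0))                  # shift
            trace.append("shift")
    return stack == ["S"], trace
```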
- 82 -
Potential Problems
How do we know which action to take: whether to shift or reduce, and which production to apply?
Issues:
» sometimes we can reduce but should not;
» sometimes we can reduce in different ways.
- 83 -
Action Selection Problem
Given a stack α and a look-ahead symbol b, should the parser:
» shift b onto the stack, making it αb, or
» reduce by X → β, assuming the stack has the form α = γβ, making it γX?
If the stack has the form αβ, whether to apply the reduction X → β (or shift) may depend on the stack prefix α, and α is different for different possible reductions, since the β's have different lengths.
- 84 -
LR Parsing Engine
Basic mechanism:
» use a set of parser states;
» use a stack with alternating symbols and states, e.g., 1 ( 6 S 10 + 5 (blue = state numbers);
» use a parsing table to determine which action to apply (shift/reduce) and the next state.
The parser actions can be precisely determined from the table.
- 85 -
LR Parsing Table
Algorithm: look at the entry for the current state S and input terminal C.
» If Table[S,C] = s(S'), then shift: push(C), push(S').
» If Table[S,C] = X → γ, then reduce: pop(2*|γ|), S' = top(), push(X), push(Table[S',X]).
The table has two parts: an action table indexed by terminals (next action and next state) and a goto table indexed by non-terminals (next state).
- 86 -
LR Parsing Table Example
State    (      )      id     ,      $         S     L
1        s3            s2                      g4
2        S→id   S→id   S→id   S→id   S→id
3        s3            s2                      g7    g5
4                                    accept
5               s6            s8
6        S→(L)  S→(L)  S→(L)  S→(L)  S→(L)
7        L→S    L→S    L→S    L→S    L→S
8        s3            s2                      g9
9        L→L,S  L→L,S  L→L,S  L→L,S  L→L,S

(Columns ( ) id , $ are input terminals; S and L are non-terminals.)
We want to derive this in an algorithmic fashion
- 87 -
Parsing Example ((a),b)
derivation    stack          input      action
((a),b)       1              ((a),b)    shift, goto 3
((a),b)       1(3            (a),b)     shift, goto 3
((a),b)       1(3(3          a),b)      shift, goto 2
((a),b)       1(3(3a2        ),b)       reduce S→id
((S),b)       1(3(3S7        ),b)       reduce L→S
((L),b)       1(3(3L5        ),b)       shift, goto 6
((L),b)       1(3(3L5)6      ,b)        reduce S→(L)
(S,b)         1(3S7          ,b)        reduce L→S
(L,b)         1(3L5          ,b)        shift, goto 8
(L,b)         1(3L5,8        b)         shift, goto 2
(L,b)         1(3L5,8b2      )          reduce S→id
(L,S)         1(3L5,8S9      )          reduce L→L,S
(L)           1(3L5          )          shift, goto 6
(L)           1(3L5)6                   reduce S→(L)
S             1S4            $          done

S → ( L ) | id
L → S | L , S
- 88 -
LR(k) Grammars
LR(k) = Left-to-right scanning, Right-most derivation, k lookahead characters.
Main cases:
» LR(0), LR(1)
» some variations: SLR and LALR(1)
Parsers for LR(0) grammars:
» determine the actions without any lookahead;
» will help us understand shift-reduce parsing.
- 89 -
Building LR(0) Parsing Tables
To build the parsing table:
» define the states of the parser;
» build a DFA to describe the transitions between states;
» use the DFA to build the parsing table.
Each LR(0) state is a set of LR(0) items:
» an LR(0) item is X → α . β, where X → αβ is a production in the grammar;
» the LR(0) items keep track of the progress on all of the possible upcoming productions;
» the item X → α . β abstracts the fact that the parser has already matched the string α at the top of the stack.
- 90 -
Example LR(0) State
An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production.
The sub-string before the “.” is already on the stack (the beginnings of possible strings to be reduced).
The sub-string after the “.” is what we might see next.
Example state (a set of two items):
E → num .
E → ( . S )
- 91 -
Class Problem
For the production
E → num | ( S )
two items are:
E → num .
E → ( . S )
Are there any others? If so, what are they? If not, why?
- 92 -
LR(0) Grammar
Nested lists:
» S → ( L ) | id
» L → S | L , S
Examples:
» (a,b,c)
» ((a,b), (c,d), (e,f))
» (a, (b,c,d), ((f,g)))

(Figure: parse tree for (a, (b,c), d).)
- 93 -
Start State and Closure
Start state:
» augment the grammar with the production S' → S $;
» the start state of the DFA has an empty stack: S' → . S $.
Closure of a parser state:
» start with Closure(S) = S;
» then for each item X → α . Y β in S, add an item Y → . γ for every production Y → γ to the closure of S.
- 94 -
Closure Example
S → ( L ) | id
L → S | L , S

DFA start state: S' → . S $
closure:
S' → . S $
S → . ( L )
S → . id

- The closure is the set of possible productions to be reduced next.
- Added items have the “.” located at the beginning: no symbols from these items are on the stack yet.
- 95 -
The Goto Operation
The goto operation describes the transitions between parser states, which are sets of items.
Algorithm: for a state I and a symbol Y:
» if the item [X → α . Y β] is in I, then
» Goto(I, Y) = Closure( { [X → α Y . β] } ).

For the state I = { S' → . S $, S → . ( L ), S → . id }:
Goto(I, '(') = Closure( { S → ( . L ) } )
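Closure and goto translate almost directly into code; a Python sketch for the list grammar (the item representation is my own):

```python
# An item is (lhs, rhs, dot): e.g. ("S", ("(", "L", ")"), 1) is  S -> ( . L )
GRAMMAR = {
    "S'": [("S", "$")],
    "S":  [("(", "L", ")"), ("id",)],
    "L":  [("S",), ("L", ",", "S")],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot before a non-terminal Y
                for prod in GRAMMAR[rhs[dot]]:           # add Y -> . gamma
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item); changed = True
    return items

def goto(items, y):
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == y}
    return closure(moved)
```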
- 96 -
Class Problem
1. If I = { [E' → . E] }, then Closure(I) = ??
2. If I = { [E' → E . ], [E → E . + T] }, then Goto(I, +) = ??

E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id
- 97 -
Applying Reduce Actions
Grammar:
S → ( L ) | id
L → S | L , S

Start state: { S' → . S $ ; S → . ( L ) ; S → . id }
On '(': { S → ( . L ) ; L → . S ; L → . L , S ; S → . ( L ) ; S → . id }
On id: { S → id . } — a reduction state
On L (from the '(' state): { S → ( L . ) ; L → L . , S }
On S (from the '(' state): { L → S . } — a reduction state

States where the dot has reached the end cause reductions.
To reduce X → β: pop the RHS β off the stack, replace it with the LHS X, then rerun the DFA (e.g., ( x ) reduces to ( S )).
- 98 -
Reductions
On reducing X → β with stack αβ:
» pop β off the stack, revealing the prefix α and its state;
» take a single step in the DFA from the top state;
» push X onto the stack with the new DFA state.

derivation    stack             input    action
((a),b)       1 ( 3 ( 3         a),b)    shift, goto 2
((a),b)       1 ( 3 ( 3 a 2     ),b)     reduce S → id
((S),b)       1 ( 3 ( 3 S 7     ),b)     reduce L → S
- 99 -
Full DFA
Grammar: S → ( L ) | id; L → S | L , S (augmented with S' → S $)

State 1: S' → . S $ ; S → . ( L ) ; S → . id
State 2 (on id): S → id .                              — reduce
State 3 (on '('): S → ( . L ) ; L → . S ; L → . L , S ; S → . ( L ) ; S → . id
State 4 (on S from 1): S' → S . $                      — final state on $
State 5 (on L from 3): S → ( L . ) ; L → L . , S
State 6 (on ')' from 5): S → ( L ) .                   — reduce
State 7 (on S from 3): L → S .                         — reduce
State 8 (on ',' from 5): L → L , . S ; S → . ( L ) ; S → . id
State 9 (on S from 8): L → L , S .                     — reduce
- 100 -
Building the Parsing Table
The states in the table = the states in the DFA.
For a transition S → S' on terminal C: Table[S,C] += Shift(S').
For a transition S → S' on non-terminal N: Table[S,N] += Goto(S').
If S is a reduction state for X → γ, then: Table[S,*] += Reduce(X → γ).
- 101 -
Computed LR Parsing Table
State    (      )      id     ,      $         S     L
1        s3            s2                      g4
2        S→id   S→id   S→id   S→id   S→id
3        s3            s2                      g7    g5
4                                    accept
5               s6            s8
6        S→(L)  S→(L)  S→(L)  S→(L)  S→(L)
7        L→S    L→S    L→S    L→S    L→S
8        s3            s2                      g9
9        L→L,S  L→L,S  L→L,S  L→L,S  L→L,S

(Columns ( ) id , $ are input terminals; S and L are non-terminals. sN = shift to state N, gN = goto state N, X→γ = reduce.)
- 102 -
LR(0) Summary
LR(0) parsing recipe:
» start with an LR(0) grammar;
» compute the LR(0) states and build the DFA:
  - use the closure operation to compute the states,
  - use the goto operation to compute the transitions;
» build the LR(0) parsing table from the DFA.
This can all be done automatically.
- 103 -
Class Problem
Generate the DFA for the following grammar:

S → E + S | E
E → num
- 104 -
LR(0) Limitations
An LR(0) machine only works if every state with a reduce action has a single reduce action:
» in such a state the parser always reduces, regardless of lookahead.
With a more complex grammar, the construction gives states with shift/reduce or reduce/reduce conflicts, and we need lookahead to choose.

{ L → L , S . }                        — OK
{ L → L , S . ; S → S . , L }          — shift/reduce conflict
{ L → S , L . ; L → S . }              — reduce/reduce conflict
- 105 -
A Non-LR(0) Grammar
Grammar for addition of numbers:
» S → S + E | E
» E → num
The left-associative version is LR(0). The right-associative version is not LR(0), as you saw with the previous class problem:
» S → E + S | E
» E → num
- 106 -
LR(0) Parsing Table
Grammar: S → E + S | E; E → num (augmented with S' → S $)

State 1: S' → . S $ ; S → . E + S ; S → . E ; E → . num
State 2 (on E): S → E . + S ; S → E .
State 3 (on '+' from 2): S → E + . S ; S → . E + S ; S → . E ; E → . num
State 4 (on num): E → num .
State 5 (on S from 3): S → E + S .
State 6 (on S from 1): S' → S . $
State 7 (on $ from 6): S' → S $ .

Parsing table fragment:
      num     +         $       E    S
1     s4                        g2   g6
2     S→E     s3/S→E    S→E

Shift or reduce in state 2?
- 107 -
Solve Conflict With Lookahead
Three popular techniques employ a lookahead of 1 symbol with bottom-up parsing:
» SLR (Simple LR)
» LALR (LookAhead LR)
» LR(1)
Each has a different means of utilizing the lookahead, resulting in different processing capabilities.
- 108 -
SLR Parsing
SLR parsing is an easy extension of LR(0):
» for each reduction X → γ, look at the next symbol C;
» apply the reduction only if C is in FOLLOW(X).
The SLR parsing table eliminates some conflicts:
» it is the same as the LR(0) table, except for the reduction rows;
» it adds the reduction X → γ only in the columns of the symbols in FOLLOW(X).

Example: FOLLOW(S) = {$}, so row 2 becomes:
      num     +     $       E    S
1     s4                    g2   g6
2             s3    S→E

S → E + S | E
E → num
- 109 -
SLR Parsing Table
Reductions do not fill entire rows as before; otherwise, the table is the same as for LR(0).

      num     +         $         E    S
1     s4                          g2   g6
2             s3        S→E
3     s4                          g2   g5
4             E→num     E→num
5                       S→E+S
6                       s7
7                       accept

S → E + S | E
E → num
- 110 -
Class Problem
Consider:
S → L = R
S → R
L → * R
L → id
R → L

Think of L as an l-value, R as an r-value, and * as a pointer dereference.
When you create the states in the SLR(1) DFA, 2 of the states are the following:

{ S → L . = R ; R → L . }
{ S → R . }

Do you have any shift/reduce conflicts? (Not as easy as it looks.)
- 111 -
LR(1) Parsing
LR(1) parsing gets as much as possible out of a 1-symbol lookahead parsing table.
An LR(1) grammar is one recognizable by a shift/reduce parser with 1 lookahead.
LR(1) parsing uses concepts similar to LR(0):
» parser states = sets of items;
» an LR(1) item = an LR(0) item + a lookahead symbol that can possibly follow the production:
  LR(0) item: S → . S + E
  LR(1) item: S → . S + E , +
The lookahead only has impact upon REDUCE operations: apply a reduction only when the lookahead equals the next input symbol.
- 112 -
LR(1) States
LR(1) state = a set of LR(1) items.
LR(1) item = (X → α . β , y):
» meaning: α is already matched at the top of the stack; we next expect to see a string derivable from βy.
Shorthand notation:
» (X → α . β , {x1, ..., xn})
» means: (X → α . β , x1), ..., (X → α . β , xn).
We need to extend the closure and goto operations.

S → S . + E , {+, $}
S → S + . E , num
- 113 -
LR(1) Closure
LR(1) closure operation:
» start with Closure(S) = S;
» for each item X → α . Y β , z in S, and for each production Y → γ, add the item Y → . γ , FIRST(βz) to the closure of S;
» repeat until nothing changes.
Similar to the LR(0) closure, but it also keeps track of the lookahead symbol.
- 114 -
LR(1) Start State
Initial state: start with (S' → . S , $), then apply the closure operation.
Example: sum grammar

S' → S $
S → E + S | E
E → num

(S' → . S , $)
closure ⇒
(S' → . S , $)
(S → . E + S , $)
(S → . E , $)
(E → . num , {+, $})
- 115 -
LR(1) Goto Operation
The LR(1) goto operation describes the transitions between LR(1) states.
Algorithm: for a state I and a symbol Y (as before):
» if the item [X → α . Y β , z] is in I, then
» Goto(I, Y) = Closure( { [X → α Y . β , z] } ).

S1 = { S → E . + S , $ ; S → E . , $ }
Goto(S1, '+') = Closure( { S → E + . S , $ } ) = S2

Grammar:
S' → S $
S → E + S | E
E → num
- 116 -
Class Problem
1. Compute: Closure(I = { S → E + . S , $ }).
2. Compute: Goto(I, num).
3. Compute: Goto(I, E).

S' → S $
S → E + S | E
E → num
- 117 -
LR(1) DFA Construction
Grammar: S' → S $; S → E + S | E; E → num

State 1: S' → . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 2 (on num): E → num . , {+,$}
State 3 (on S from 1): S' → S . , $
State 4 (on E from 1): S → E . + S , $ ; S → E . , $
State 5 (on '+' from 4): S → E + . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 6 (on S from 5): S → E + S . , $
(From state 5, E leads back to state 4 and num leads back to state 2.)
- 118 -
LR(1) Reductions
(The same DFA as above.) Reductions correspond to LR(1) items of the form (X → γ . , y): in the state containing (E → num . , {+,$}), reduce E → num when the lookahead is + or $; in the state containing (S → E . , $), reduce S → E only when the lookahead is $; in the state containing (S → E + S . , $), reduce S → E + S when the lookahead is $.
- 119 -
LR(1) Parsing Table Construction
The construction is the same as for LR(0), except for reductions.
For a transition S → S' on terminal x: Table[S,x] += Shift(S').
For a transition S → S' on non-terminal N: Table[S,N] += Goto(S').
If state I contains the item (X → γ . , y), then: Table[I,y] += Reduce(X → γ).
- 120 -
LR(1) Parsing Table Example
State 1: S' → . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 2 (on E): S → E . + S , $ ; S → E . , $
State 3 (on '+' from 2): S → E + . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}

Fragment of the parsing table:
      +     $      E
1                  g2
2     s3    S→E

Grammar:
S' → S $
S → E + S | E
E → num
- 121 -
Class Problem
Compute the LR(1) DFA for the following grammar:

E → E + T | T
T → T F | F
F → F * | a | b
- 122 -
LALR(1) Grammars
The problem with LR(1): too many states.
LALR(1) parsing (aka LookAhead LR):
» constructs the LR(1) DFA and then merges any 2 LR(1) states whose items are identical except for the lookaheads;
» results in smaller parser tables;
» theoretically less powerful than LR(1).
An LALR(1) grammar is a grammar whose LALR(1) parsing table has no conflicts.

{ S → id . , + ; S → E . , $ } + { S → id . , $ ; S → E . , + } = ??
- 123 -
LALR Parsers
LALR(1):
» generally has about the same number of states as SLR (much fewer than LR(1));
» but has the same lookahead capability as LR(1) (much better than SLR).
Example: for the Pascal programming language:
» in SLR, several hundred states;
» in LR(1), several thousand states.
- 124 -
Automate the Parsing Process
We can automate:
» the construction of LR parsing tables;
» the construction of shift-reduce parsers based on these parsing tables.
LALR(1) parser generators:
» yacc, bison;
» not much difference compared to LR(1) in practice;
» smaller parsing tables than LR(1);
» augment the LALR(1) grammar specification with declarations of precedence and associativity;
» output: an LALR(1) parser program.
- 125 -
Associativity
S → S + E | E          E → E + E
E → num                E → num

What happens if we run the ambiguous grammar (E → E + E; E → num) through the LALR construction?

{ E → E + E . , + ; E → E . + E , {+, $} }   — a shift/reduce conflict on +

For the input 1 + 2 + 3:
shift: 1 + (2 + 3)
reduce: (1 + 2) + 3
- 126 -
Associativity (2)
If an operator is left-associative:
» assign a slightly higher value to its precedence when it is on the parse stack than when it is in the input stream;
» since the stack precedence is higher, reduce will take priority (which is correct for left-associativity).
If an operator is right-associative:
» assign a slightly higher value when it is in the input stream;
» since the input-stream precedence is higher, shift will take priority (which is correct for right-associativity).
- 127 -
Precedence
E → E + E | T                      E → E + E | E x E | num | ( E )
T → T x T | num | ( E )

What happens if we run the right-hand grammar through the LALR construction? Shift/reduce conflicts result:

E → E . + E , ...    E → E + E . , x
E → E x E . , +      E → E . x E , ...

Precedence: attach precedence indicators to terminals. A shift/reduce conflict is resolved by:
1. if the precedence of the input token is greater than that of the last terminal on the parse stack, favor shift over reduce;
2. if the precedence of the input token is less than or equal to that of the last terminal on the parse stack, favor reduce over shift.
- 128 -
Abstract Syntax Tree (AST) - Review
Derivation = a sequence of applied productions:
» S ⇒ E+S ⇒ 1+S ⇒ 1+E ⇒ 1+2
Parse tree = a graph representation of a derivation:
» it doesn't capture the order of applying the productions.
The AST discards unnecessary information from the parse tree.

(Figure: parse tree for (1 + 2 + (3 + 4)) + 5 and its AST Add(Add(1, Add(2, Add(3, 4))), 5).)
- 129 -
Implicit AST Construction
LL/LR parsing techniques implicitly build the AST.
The parse tree is captured in the derivation:
» LL parsing: the AST is represented by the applied productions;
» LR parsing: the AST is represented by the applied reductions.
We want to explicitly construct the AST during the parsing phase.
- 130 -
AST Construction - LL
void parse_S() {
    switch (token) {
    case num:
    case '(':
        parse_E();
        parse_S'();
        return;
    default:
        ParseError();
    }
}

Expr parse_S() {
    switch (token) {
    case num:
    case '(':
        Expr left  = parse_E();
        Expr right = parse_S'();
        if (right == NULL)
            return left;
        else
            return new Add(left, right);
    default:
        ParseError();
    }
}
LL parsing: extend the procedures for the non-terminals

S  → E S'
S' → ε | + S
E  → num | (S)
- 131 -
AST Construction - LR
We again need to add code for explicit AST construction
AST construction mechanism
  » Store parts of the tree on the stack
  » For each non-terminal symbol X on the stack, also store the
    sub-tree rooted at X on the stack
  » Whenever the parser performs a reduce operation for a
    production X → γ, create an AST node for X
- 132 -
AST Construction for LR - Example
S → E + S | E
E → num | (S)

input string: "1 + 2 + 3"

[Figure: before the reduction S → E + S, the stack holds E with
sub-tree Num(1), then +, then S with sub-tree Add(Num(2), Num(3));
after the reduction it holds a single S with sub-tree
Add(Num(1), Add(Num(2), Num(3)))]
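The stack discipline can be sketched directly. This is not a generated LR parser: the reduce decisions that a real LR table would make are hand-coded here for this one grammar, just to show sub-trees riding on the stack and being combined at each reduce:

```python
# Shift-reduce sketch for  S -> E + S | E ,  E -> num | ( S ).
# Stack entries pair a grammar symbol with its AST sub-tree; every
# reduce pops the right-hand side and pushes the left-hand side with a
# freshly built node.  The lookahead checks stand in for the LR table.

def parse(tokens):
    toks = list(tokens) + ['$']
    stack = []                       # (grammar symbol, sub-tree) pairs
    i = 0
    while True:
        syms = [s for s, _ in stack]
        if syms[-1:] == ['num']:                           # E -> num
            stack[-1] = ('E', ('Num', stack[-1][1]))
        elif syms[-3:] == ['(', 'S', ')']:                 # E -> ( S )
            tree = stack[-2][1]
            del stack[-3:]
            stack.append(('E', tree))
        elif syms[-3:] == ['E', '+', 'S']:                 # S -> E + S
            node = ('Add', stack[-3][1], stack[-1][1])
            del stack[-3:]
            stack.append(('S', node))
        elif syms[-1:] == ['E'] and toks[i] in (')', '$'): # S -> E
            stack[-1] = ('S', stack[-1][1])
        elif syms == ['S'] and toks[i] == '$':             # accept
            return stack[0][1]
        else:                                              # shift
            tok = toks[i]
            i += 1
            if tok == '$':
                raise SyntaxError('unexpected end of input')
            stack.append(('num', tok) if isinstance(tok, int) else (tok, tok))
```

On "1 + 2 + 3" the reductions fire right-to-left, producing the same Add(Num(1), Add(Num(2), Num(3))) tree as in the figure.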
- 133 -
Problems
Unstructured code: mixing parsing code with AST construction code
Automatic parser generators
  » The generated parser needs to contain AST construction code
  » How to construct a customized AST data structure using an
    automatic parser generator?
May want to perform other actions concurrently with the parsing phase
  » E.g., semantic checks
  » This can reduce the number of compiler passes
- 134 -
Syntax-Directed Definition
Solution: syntax-directed definition
  » Extends each grammar production with an associated semantic
    action (code):  S → E + S {action}
  » The parser generator adds these actions into the generated
    parser
  » Each action is executed when the corresponding production is
    reduced
- 135 -
Semantic Actions
Actions = C code (for bison/yacc)
The actions access the parser stack
  » Parser generators extend the stack of symbols with entries for
    user-defined structures (e.g., parse trees)
The action code should be able to refer to the grammar symbols in the
productions
  » Need to refer to multiple occurrences of the same non-terminal
    symbol, and to distinguish RHS vs. LHS occurrences:  E → E + E
  » Use dollar variables in yacc/bison ($$, $1, $2, etc.)
      expr ::= expr PLUS expr  {$$ = $1 + $3;}
- 136 -
Building the AST
Use semantic actions to build the AST AST is built bottom-up along with parsing
expr ::= NUM             {$$ = new Num($1.val);}
expr ::= expr PLUS expr  {$$ = new Add($1, $3);}
expr ::= expr MULT expr  {$$ = new Mul($1, $3);}
expr ::= LPAR expr RPAR  {$$ = $2;}

Recall: user-defined type for objects on the stack (%union)
- 137 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
  » Top-down parsing
  » Bottom-up parsing
  » Comparison
- 138 -
LL/LR Grammar Summary
LL parsing tables
  » Non-terminals x terminals → productions
  » Computed using FIRST/FOLLOW
LR parsing tables
  » LR states x terminals → {shift/reduce}
  » LR states x non-terminals → goto
  » Computed using closure/goto operations on LR states
A grammar is:
  » LL(1) if its LL(1) parsing table has no conflicts
  » same for LR(0), SLR, LALR(1), LR(1)
- 139 -
Top-Down Parsing
S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E → (E+E+E)+E
  → (1+E+E)+E → (1+2+E)+E → ...

S → S + E | E
E → num | (S)

In a left-most derivation, the entire tree above a token (e.g., the 2)
has already been expanded when that token is encountered

[Figure: parse tree for "(1+2+(3+4))+5"]
- 140 -
Top-Down vs Bottom-Up
[Figure: partially-built parse tree over scanned/unscanned input,
for top-down vs bottom-up parsing]

Bottom-up: don't need to figure out as much of the parse tree for a
given amount of input → more time to decide which rules to apply
- 141 -
Terminology: LL vs LR

LL(k)
  » Left-to-right scan of input
  » Left-most derivation
  » k-symbol lookahead
  » [Top-down or predictive] parsing, or LL parser
  » Performs a pre-order traversal of the parse tree
LR(k)
  » Left-to-right scan of input
  » Right-most derivation
  » k-symbol lookahead
  » [Bottom-up or shift-reduce] parsing, or LR parser
  » Performs a post-order traversal of the parse tree
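The traversal-order contrast can be made concrete with two generators over a tree; the tuple encoding of the tree below is an assumption for this sketch:

```python
# LL parsing applies a production when its node is first reached
# (pre-order); LR fires a reduction only after the whole subtree has
# been parsed (post-order).  The same tree, visited both ways:

def preorder(t):
    if isinstance(t, tuple):
        yield t[0]                   # visit the node label first
        for child in t[1:]:
            yield from preorder(child)
    else:
        yield t                      # leaf

def postorder(t):
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from postorder(child)
        yield t[0]                   # visit the node label last
    else:
        yield t

# tree for "1 + 2 + 3" with right-nested +
ast = ('Add', ('Num', 1), ('Add', ('Num', 2), ('Num', 3)))
```

Pre-order yields the root Add before any leaf (the LL order); post-order yields it only after every operand (the LR order).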
- 142 -
Classification of Grammars
[Figure: Venn diagram of grammar classes (not to scale):
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1), with LL(1) overlapping them]

LR(k) ⊂ LR(k+1)             LL(k) ⊂ LL(k+1)
LL(k) ⊂ LR(k)
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1)
- 143 -
Bottom-Up Parsing
(1+2+(3+4))+5 → (E+2+(3+4))+5 → (S+2+(3+4))+5 → (S+E+(3+4))+5 → ...

S → S + E | E
E → num | (S)

Advantage of bottom-up parsing: can postpone the selection of
productions until more of the input is scanned

[Figure: parse tree for "(1+2+(3+4))+5"]