TRANSCRIPT
COMP3190: Principle of Programming Languages
Formal Language Syntax
- 2 -
Motivation
The problem of parsing structured text is very common. Consider the structure of email addresses, described using a grammar:
<emailAddress> := <person> @ <host>
<person> := <word>
<host> := <word> | <word>.<host>
Goal: describe and recognize email addresses in arbitrary text.
- 3 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 4 -
Deterministic Finite Automata (DFA)
Q: a finite set of states
Σ: a finite set of “letters” (the alphabet)
δ: Q × Σ → Q (the transition function)
q0: the start state (in Q)
F: the set of accept states (a subset of Q)
Acceptance: the input is consumed with the automaton in a final state.
- 5 -
Example of DFA
(State diagram: two states, q1 the start state and q2 the accept state.)

δ    0    1
q1   q1   q2
q2   q1   q2
Accepts all strings that end in 1
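The acceptance check is mechanical; a minimal Python sketch (not from the slides) that runs the transition table above:

```python
def run_dfa(s):
    """Simulate the two-state DFA above on a string of '0'/'1' symbols."""
    delta = {"q1": {"0": "q1", "1": "q2"},
             "q2": {"0": "q1", "1": "q2"}}
    state = "q1"                  # q1 is the start state
    for ch in s:
        state = delta[state][ch]  # one table lookup per input symbol
    return state == "q2"          # q2 is the only accept state
```

The run time is one table lookup per input symbol, which is part of why DFAs make good scanners.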
- 6 -
Another Example of a DFA
(State diagram: from the start state S, an “a” leads into states q1/q2 and a “b” leads into states r1/r2; the q states track strings that start with “a” (accepting when the string currently ends with “a”), and the r states symmetrically track strings that start with “b”.)
Accepts all strings that start and end with “a” OR start and end with “b”
- 7 -
Non-deterministic Finite Automata (NFA)
The transition function is different: δ: Q × Σε → P(Q),
where P(Q) is the powerset of Q (the set of all subsets of Q) and Σε is the union of Σ and the special symbol ε (denoting the empty string).
A string is accepted if there is at least one path that consumes the entire input and leads to an accept state.
- 8 -
Example of an NFA
(State diagram: states q1 through q4, with q1 the start state and q4 accepting.)

δ    0      1          ε
q1   {q1}   {q1, q2}   ∅
q2   {q3}   ∅          {q3}
q3   ∅      {q4}       ∅
q4   {q4}   {q4}       ∅
What strings does this NFA accept?
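One way to answer is to simulate the NFA. A Python sketch (the transition table transcribed from the slide; "" stands for ε); the simulation suggests it accepts the strings containing 11 or 101 as a substring:

```python
def eps_closure(states, delta):
    """All states reachable from `states` via ε-edges alone."""
    stack, result = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, ""), set()):   # "" stands for ε
            if r not in result:
                result.add(r); stack.append(r)
    return result

def nfa_accepts(s):
    # Transition table from the slide; "" denotes ε.
    delta = {("q1", "0"): {"q1"}, ("q1", "1"): {"q1", "q2"},
             ("q2", "0"): {"q3"}, ("q2", ""): {"q3"},
             ("q3", "1"): {"q4"},
             ("q4", "0"): {"q4"}, ("q4", "1"): {"q4"}}
    current = eps_closure({"q1"}, delta)
    for ch in s:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = eps_closure(moved, delta)
    return "q4" in current        # q4 is the accept state
```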
- 9 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 10 -
Regular Expressions
R is a regular expression if R is:
- “a” for some a in Σ,
- ε (the empty string),
- ∅ (the empty language),
- the union of two regular expressions,
- the concatenation of two regular expressions, or
- R1* (Kleene closure: zero or more repetitions of R1).
- 11 -
Regular Expression Notation
a: an ordinary letter
ε: the empty string
M | N: choosing from M or N
MN: concatenation of M and N
M*: zero or more times (Kleene star)
M+: one or more times
M?: zero or one occurrence
[a-zA-Z]: character set alternation (choice)
. : the period stands for any single character except newline
- 12 -
Examples of Regular Expressions
{0,1}* 0 — all strings that end in 0
(1|0) 0* — strings that start with 1 or 0, followed by zero or more 0s
{0,1}* — all strings
{0ⁿ1ⁿ | n ≥ 0} — not a regular expression!!!
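The first three languages can be checked mechanically with Python's re module (a sketch; the character-class encodings of the set notation are my own):

```python
import re

ends_in_0  = re.compile(r"[01]*0")   # {0,1}* 0
head_zeros = re.compile(r"[01]0*")   # (1|0) 0*
anything   = re.compile(r"[01]*")    # {0,1}*
```

`pattern.fullmatch(s)` asks whether the whole string s is in the language; the fourth language, {0ⁿ1ⁿ}, cannot be written this way at all.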
- 13 -
Converting a Regular Expression to an NFA
(Diagrams: Thompson's construction. Base case: an NFA for a single letter a. M|N: a new start state with ε-edges into M and into N. MN: ε-edges linking M's accept states to N's start state. M*: ε-edges that allow zero or more passes through M.)
- 14 -
Regular expression → NFA
Language: Strings of 0s and 1s in which the number of 0s is even
Regular expression: (1*01*0)*1*
- 15 -
Converting an NFA to a DFA
For a set of states S, closure(S) is the set of states that can be reached from S without consuming any input (i.e., via ε-edges).
For a set of states S, DFAedge(S, c) is the set of states that can be reached from S by consuming the input symbol c.
Each set of NFA states corresponds to one DFA state (hence at most 2^n DFA states for an NFA with n states).
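These two operations are enough to implement the conversion; a generic Python sketch (the representation and helper names are my own):

```python
def nfa_to_dfa(nfa_delta, start, accepts, alphabet):
    """Subset construction: nfa_delta maps (state, symbol-or-'') to a set of states."""
    def closure(S):
        stack, out = list(S), set(S)
        while stack:
            q = stack.pop()
            for r in nfa_delta.get((q, ""), set()):   # "" stands for ε
                if r not in out:
                    out.add(r); stack.append(r)
        return frozenset(out)

    def dfa_edge(S, c):
        moved = set()
        for q in S:
            moved |= nfa_delta.get((q, c), set())
        return closure(moved)

    start_set = closure({start})
    states, worklist, table = {start_set}, [start_set], {}
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = dfa_edge(S, c)
            table[(S, c)] = T
            if T not in states:
                states.add(T); worklist.append(T)
    final = {S for S in states if S & set(accepts)}
    return start_set, table, final
```

Each DFA state is a frozenset of NFA states, which is exactly the "set of NFA states" idea above.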
- 16 -
NFA → DFA
Initial classes: {A, B, E}, {C, D}
No class requires partitioning!
Hence a two-state DFA is obtained.
- 17 -
Obtaining the minimal equivalent DFA
Initially two equivalence classes: final and nonfinal states.
Search for an equivalence class C and an input letter a such that, with a as input, the states in C make transitions to states in k > 1 different equivalence classes.
Partition C into k classes accordingly. Repeat until no class can be partitioned further.
- 18 -
Example (cont.)
- 19 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 20 -
Regular Grammar
Later definitions build on earlier ones; nothing is defined in terms of itself (no recursion).
Regular grammar for numeric literals in Pascal:
digit → 0 | 1 | 2 | ... | 8 | 9
unsigned_integer → digit digit*
unsigned_number → unsigned_integer ( ( . unsigned_integer ) | ε ) ( ( e ( + | - | ε ) unsigned_integer ) | ε )
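Because the grammar is regular, it collapses into a single regular expression; a Python sketch checking it (the pattern is my transcription of the productions above):

```python
import re

# digit digit* ( . digit digit* )? ( e (+|-)? digit digit* )?
unsigned_number = re.compile(r"\d\d*(\.\d\d*)?(e(\+|-)?\d\d*)?")

def is_unsigned_number(s):
    """Whole-string membership test for the Pascal numeric-literal grammar."""
    return unsigned_number.fullmatch(s) is not None
```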
- 21 -
Languages and Automata in Programming Languages
Regular languages
» Recognized (accepted) by finite automata
» Useful for tokenizing program text (lexical analysis)
Context-free languages
» Recognized (accepted) by pushdown automata
» Useful for parsing the syntax of a program
- 22 -
Important Theorems
A language is regular iff some regular expression describes it.
A language is regular iff some finite automaton recognizes it.
DFAs and NFAs are equally powerful.
- 23 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 24 -
Context-free Grammars
Context-free grammars are defined by substitution rules
Example derivations: “Big Jim ate green cheese”, “Green Jim ate green cheese”, “Jim ate cheese”, “Cheese ate Jim”.

S → P V P
P → N
P → A P
A → big | green
N → cheese | Jim
V → ate
- 25 -
Context-free Grammars
Context-free grammars are used to formally describe the syntax of programming languages.
Every syntactically correct program is derived using the context-free grammar of the language.
Parsing a program involves tracing such derivation, given the context-free grammar and the program.
- 26 -
Context-free Grammars
A context-free grammar consists of
V: a finite set of variables
Σ: a finite set of terminals
R: a finite set of rules of the form variable → {variable, terminal}*
S: the start variable
- 27 -
Pushdown Automata (PDA)
A pushdown automaton consists of
Q: a set of states
Σ: the input alphabet (of terminals)
Γ: the stack alphabet
δ: a set of transition rules, Q × Σε × Γε → P(Q × Γε)
  (currentState, inputSymbol, headOfStack) → (newState, symbolPushedOnStack)
q0: the start state
F: the set of accept states (a subset of Q)
Deterministic: at most one move is possible from any configuration.
- 28 -
How does a PDA accept?
By final state:
» consume all the input, while
» reaching a final state.
By empty stack:
» consume all the input, while
» having an empty stack;
» the set of final states is irrelevant.
- 29 -
Example of a PDA
(Diagram: q1 —[ε, ε→$]→ q2; q2 loops on [0, ε→0]; q2 —[1, 0→ε]→ q3; q3 loops on [1, 0→ε]; q3 —[ε, $→ε]→ q4, the accept state.)

Notation a, b→c: when the PDA reads “a” from the input, it replaces “b” at the top of the stack with “c”.
What does this PDA accept?
- 30 -
Important Theorems
A language is context-free iff a pushdown automaton recognizes it.
Non-deterministic PDAs are more powerful than deterministic ones.
- 31 -
Example of Context-free Language That Requires a Non-deterministic PDA
{w wR | w ∈ {0, 1}*}, where wR is w written backwards.
Idea:
Non-deterministically guess the middle of the input string
- 32 -
The Solution
(Diagram: q1 —[ε, ε→$]→ q2; q2 loops on [0, ε→0] and [1, ε→1]; q2 —[ε, ε→ε]→ q3, the non-deterministic guess at the middle; q3 loops on [1, 1→ε] and [0, 0→ε]; q3 —[ε, $→ε]→ q4, the accept state.)
- 33 -
Derivations and Parse Trees
Nested constructs require recursion, i.e. context-free grammars
CFG for arithmetic expressions
expression → identifier | number | - expression | ( expression ) | expression operator expression
operator → + | - | * | /
- 34 -
Parse Tree for Slope*x + Intercept
Is this the only parse tree for this expression and grammar?
- 35 -
A Better Expression Grammar
1. expression → term | expression add_op term
2. term → factor | term mult_op factor
3. factor → identifier | number | - factor | ( expression )
4. add_op → + | -
5. mult_op → * | /
A good grammar reflects the internal structure of programs.
This grammar is unambiguous and captures (HOW?):
- operator precedence (*, / bind tighter than +, -)
- associativity (operators group left to right)
- 36 -
And Better Parse Trees...
3 + 4 * 5
10 - 4 - 3
- 37 -
Syntax-directed Compilation
The parser calls the scanner to obtain tokens, assembles the tokens into a parse tree, and passes the tree to the later phases of compilation.
Scanner: a deterministic finite automaton. Parser: a pushdown automaton.
Scanners and parsers can be generated automatically from regular expressions and CFGs (e.g., lex/yacc).
- 38 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
- 39 -
Scanning
Accept the longest possible token in each invocation of the scanner.
Implementation:
» capture the finite automaton with
  - case (switch) statements, or
  - a table and driver.
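A sketch of the longest-match rule with a tiny, made-up token set (the patterns are illustrative, not Pascal's actual lexical definition):

```python
import re

# Illustrative token patterns (not Pascal's real token set).
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT",  re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("OP",     re.compile(r":=|[+\-*/=<>]")),   # ':=' tried before single chars
]

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try every pattern at position i; keep the longest match (maximal munch).
        best = None
        for kind, pat in TOKEN_PATTERNS:
            m = pat.match(text, i)
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (kind, m.group())
        if best is None:
            raise ValueError(f"scan error at {i}")
        tokens.append(best)
        i += len(best[1])
    return tokens
```

Longest-match is why `:=` is scanned as one token rather than `:` followed by `=`.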
- 40 -
Scanner for Pascal
- 41 -
Scanner for Pascal (Case Statements)
- 42 -
Scanner (Table & Driver)
- 43 -
Scanner Generators
Start with a regular expression.
Construct an NFA from it.
Use the subset construction to obtain an equivalent DFA.
Construct the minimal equivalent DFA.
- 44 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 45 -
Parsing approaches
Parsing in general has O(n³) cost. We need classes of grammars that can be parsed in linear time:
» top-down parsing, also called predictive parsing, recursive-descent parsing, or LL parsing (Left-to-right scan, Left-most derivation);
» bottom-up parsing, also called shift-reduce parsing or LR parsing (Left-to-right scan, Right-most derivation).
- 46 -
A Simple Grammar for a Comma-separated List of Identifiers
id_list → id id_list_tail
id_list_tail → , id id_list_tail
id_list_tail → ;
String to be parsed: A, B, C;
- 47 -
Top-down/bottom-up Parsing
- 48 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 49 -
Top-down Parsing
Predicts a derivation.
Matches each non-terminal against the token observed in the input.
- 50 -
LL(1) Grammar
A grammar for which a top-down deterministic parser can be produced with one token of look-ahead.
LL(1) grammar:
» for a given non-terminal, the lookahead symbol uniquely determines the production to apply;
» top-down parsing = predictive parsing;
» driven by a predictive parsing table mapping (non-terminal, terminal) pairs to productions.
- 51 -
From Last Time: Parsing with Table
Partly-derived string    Lookahead    Parsed | unparsed input
E S'                     (            | (1+2+(3+4))+5
( S ) S'                 1            ( | 1+2+(3+4))+5
( E S' ) S'              1            ( | 1+2+(3+4))+5
( 1 S' ) S'              +            (1 | +2+(3+4))+5
( 1 + E S' ) S'          2            (1+ | 2+(3+4))+5
( 1 + 2 S' ) S'          +            (1+2 | +(3+4))+5

Grammar:
S → E S'
S' → ε | + S
E → num | ( S )

Parsing table:
      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 52 -
How to Construct Parsing Tables?
Needed: an algorithm for automatically generating a predictive parse table from a grammar.

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)

??
- 53 -
Constructing Parse Tables
We can construct a predictive parser if:
» for every non-terminal, every lookahead symbol can be handled by at most 1 production.
FIRST(α), for an arbitrary string α of terminals and non-terminals, is:
» the set of symbols that might begin the fully expanded version of α.
FOLLOW(X), for a non-terminal X, is:
» the set of symbols that might follow the derivation of X in the input stream.
- 54 -
Parse Table Entries
Consider a production X → α:
» add X → α to the X row for each symbol in FIRST(α);
» if α can derive ε (α is nullable), add X → α for each symbol in FOLLOW(X).
The grammar is LL(1) if there are no conflicting entries.

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 55 -
Computing Nullable
X is nullable if it can derive the empty string:
» if it derives ε directly (X → ε), or
» if it has a production X → YZ... where all RHS symbols (Y, Z, ...) are nullable.
Algorithm: assume all non-terminals are non-nullable; apply the rules repeatedly until nothing changes.

S → E S'
S' → ε | + S
E → number | ( S )

Only S' is nullable.
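The fixpoint algorithm above is a few lines of code; a Python sketch (the grammar encoding is my own; [] encodes an ε-production):

```python
# Fixed-point computation of nullable for the sum grammar from the slide.
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],          # [] is the ε-production
    "E":  [["num"], ["(", "S", ")"]],
}

def compute_nullable(grammar):
    nullable = set()                 # assume nothing is nullable at first
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            if x in nullable:
                continue
            for rhs in rhss:
                # X is nullable if every RHS symbol is a nullable non-terminal.
                if all(sym in nullable for sym in rhs):
                    nullable.add(x); changed = True
                    break
    return nullable
```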
- 56 -
Computing FIRST Determining FIRST(X)
1. If X is a terminal, then add X to FIRST(X).
2. If X → ε, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2...Yk, then a is in FIRST(X) if a is in FIRST(Yi) and ε is in FIRST(Yj) for j = 1...i-1 (i.e., it is possible to have an empty prefix Y1...Yi-1).
4. If ε is in FIRST(Y1Y2...Yk), then ε is in FIRST(X).
- 57 -
FIRST Example
S → E S'
S' → ε | + S
E → number | ( S )

Apply rule 1: FIRST(num) = {num}, FIRST(+) = {+}, etc.
Apply rule 2: FIRST(S') = {ε}
Apply rule 3: FIRST(S) = FIRST(E) = {}
  FIRST(S') = FIRST('+') ∪ {ε} = {ε, +}
  FIRST(E) = FIRST(num) ∪ FIRST('(') = {num, (}
Rule 3 again: FIRST(S) = FIRST(E) = {num, (}
  FIRST(S') = {ε, +}
  FIRST(E) = {num, (}
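Rules 1-4 can be run to a fixpoint mechanically; a Python sketch for the same grammar (the encoding is my own; "" stands for ε):

```python
# Fixed-point computation of FIRST for the sum grammar (a sketch; "" is ε).
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E":  [["num"], ["(", "S", ")"]],
}
TERMINALS = {"num", "+", "(", ")"}

def compute_first(grammar):
    first = {t: {t} for t in TERMINALS}      # rule 1
    for x in grammar:
        first[x] = set()
    changed = True
    while changed:
        changed = False
        for x, rhss in grammar.items():
            for rhs in rhss:
                if not rhs:                  # rule 2: X -> ε
                    if "" not in first[x]:
                        first[x].add(""); changed = True
                    continue
                for y in rhs:                # rule 3: scan Y1 Y2 ... Yk
                    added = (first[y] - {""}) - first[x]
                    if added:
                        first[x] |= added; changed = True
                    if "" not in first[y]:   # Yi is not nullable: stop
                        break
                else:                        # rule 4: all Yi were nullable
                    if "" not in first[x]:
                        first[x].add(""); changed = True
    return first
```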
- 58 -
Computing FOLLOW
Determining FOLLOW(X):
1. If S is the start symbol, then $ is in FOLLOW(S).
2. If A → αBβ, then add FIRST(β) − {ε} to FOLLOW(B).
3. If A → αB, or A → αBβ where ε is in FIRST(β), then add FOLLOW(A) to FOLLOW(B).
- 59 -
FOLLOW Example
S → E S'
S' → ε | + S
E → number | ( S )

FIRST(S) = {num, (}
FIRST(S') = {ε, +}
FIRST(E) = {num, (}

Apply rule 1: FOLLOW(S) = {$}
Apply rule 2: S → E S' gives FOLLOW(E) += FIRST(S') − {ε} = {+}
  E → ( S ) gives FOLLOW(S) += FIRST(')') = {$, )}
Apply rule 3: S → E S' gives FOLLOW(E) += FOLLOW(S) = {+, $, )} (because S' is nullable)
  FOLLOW(S') += FOLLOW(S) = {$, )}
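The FOLLOW rules are another fixpoint; a Python sketch reusing the FIRST sets just computed (the encoding is my own; "" stands for ε):

```python
# Fixed-point computation of FOLLOW for the sum grammar (a sketch; "" is ε).
GRAMMAR = {
    "S":  [["E", "S'"]],
    "S'": [[], ["+", "S"]],
    "E":  [["num"], ["(", "S", ")"]],
}
FIRST = {"S": {"num", "("}, "S'": {"", "+"}, "E": {"num", "("},
         "num": {"num"}, "+": {"+"}, "(": {"("}, ")": {")"}}

def first_of_seq(seq):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        out |= FIRST[sym] - {""}
        if "" not in FIRST[sym]:
            return out
    out.add("")                                  # whole sequence was nullable
    return out

def compute_follow(grammar, start="S"):
    follow = {x: set() for x in grammar}
    follow[start].add("$")                       # rule 1
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                for i, b in enumerate(rhs):
                    if b not in grammar:         # only non-terminals get FOLLOW
                        continue
                    fb = first_of_seq(rhs[i + 1:])
                    added = (fb - {""}) - follow[b]          # rule 2
                    if added:
                        follow[b] |= added; changed = True
                    if "" in fb:                             # rule 3
                        added = follow[a] - follow[b]
                        if added:
                            follow[b] |= added; changed = True
    return follow
```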
- 60 -
Putting It All Together
FOLLOW(S) = {$, )}
FOLLOW(S') = {$, )}
FOLLOW(E) = {+, ), $}
FIRST(S) = {num, (}
FIRST(S') = {ε, +}
FIRST(E) = {num, (}

Consider a production X → α:
Add X → α to the X row for each symbol in FIRST(α).
If α can derive ε (α is nullable), add X → α for each symbol in FOLLOW(X).

S → E S'
S' → ε | + S
E → number | ( S )

      num      +        (       )       $
S     S→ES'             S→ES'
S'             S'→+S            S'→ε    S'→ε
E     E→num             E→(S)
- 61 -
Ambiguous Grammars
Constructing the predictive parse table for an ambiguous grammar results in conflicts in the table (i.e., 2 or more productions to apply in the same cell).

S → S + S | S * S | num

FIRST(S+S) = FIRST(S*S) = FIRST(num) = {num}
- 62 -
Class Problem
E → E + T | T
T → T * F | F
F → ( E ) | num

1. Compute the FIRST and FOLLOW sets for this grammar.
2. Compute the parse table entries.
- 63 -
Top-Down Parsing Up to This Point
Now we know:
» how to build the parsing table for an LL(1) grammar (i.e., FIRST/FOLLOW);
» how to construct a recursive-descent parser from the parsing table;
» call tree = parse tree.
Open question: can we generate the AST?
- 64 -
Creating the Abstract Syntax Tree
Some class definitions to assist with AST construction:

class Expr {}
class Add extends Expr {
  Expr left, right;
  Add(Expr L, Expr R) { left = L; right = R; }
}
class Num extends Expr {
  int value;
  Num(int v) { value = v; }
}

Class hierarchy: Expr is the root, with Num and Add as its subclasses.
- 65 -
Creating the AST
(Figure: parse tree for (1 + 2 + (3 + 4)) + 5 under the sum grammar, together with the corresponding AST Add(Add(1, Add(2, Add(3, 4))), 5).)

• We got the parse tree from the call tree.
• Just add code to each parsing routine to create the appropriate nodes.
• This works because the parse tree and the call tree have the same shape, and the AST is just a compressed form of the parse tree.
- 66 -
AST Creation: parse_E
Expr parse_E() {
  switch (token) {          // token is the lookahead token
    case num:               // E → number
      Expr result = new Num(token.value);
      token = input.read();
      return result;
    case '(':               // E → ( S )
      token = input.read();
      Expr result = parse_S();
      if (token != ')') ParseError();
      token = input.read();
      return result;
    default:
      ParseError();
  }
}

Grammar:
S → E S'
S' → ε | + S
E → number | ( S )
- 67 -
AST Creation: parse_S
Expr parse_S() {
  switch (token) {
    case num: case '(':     // S → E S'
      Expr left = parse_E();
      Expr right = parse_S'();
      if (right == NULL) return left;
      else return new Add(left, right);
    default:
      ParseError();
  }
}

Grammar:
S → E S'
S' → ε | + S
E → number | ( S )
- 68 -
Grammars
We have been using a grammar for the language “sums with parentheses”, e.g. (1+2+(3+4))+5.
We started with a simple, right-associative grammar:
» S → E + S | E
» E → num | ( S )
and transformed it into an LL(1) grammar by left factoring:
» S → E S'
» S' → ε | + S
» E → num | ( S )
What if we start with a left-associative grammar?
» S → S + E | E
» E → num | ( S )
- 69 -
Reminder: Left vs Right Associativity
(Figures: right recursion (S → E + S | E; E → num) yields the right-associative tree 1 + (2 + (3 + 4)); left recursion (S → S + E | E; E → num) yields the left-associative tree ((1 + 2) + 3) + 4.)

Right recursion: right-associative.
Left recursion: left-associative.

Consider a simpler string on a simpler grammar: “1 + 2 + 3 + 4”.
- 70 -
Left Recursion
derived string    lookahead    read/unread input
S                 1            1+2+3+4
S+E               1            1+2+3+4
S+E+E             1            1+2+3+4
S+E+E+E           1            1+2+3+4
E+E+E+E           1            1+2+3+4
1+E+E+E           2            1+2+3+4
1+2+E+E           3            1+2+3+4
1+2+3+E           4            1+2+3+4
1+2+3+4           $            1+2+3+4

Is this right? If not, what's the problem?

S → S + E | E
E → num

“1 + 2 + 3 + 4”
- 71 -
Left-Recursive Grammars
Left-recursive grammars don't work with top-down parsers: we don't know when to stop the recursion.
Left-recursive grammars are NOT LL(1)!
» S → S α
» S → β
In the parse table, both productions will appear at row S in all the columns corresponding to FIRST(β).
- 72 -
Eliminate Left Recursion
Replace
» X → X α1 | ... | X αm
» X → β1 | ... | βn
with
» X → β1 X' | ... | βn X'
» X' → α1 X' | ... | αm X' | ε
See the complete algorithm in the Dragon book.
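The replacement scheme above, restricted to immediate left recursion, is short to code; a Python sketch (the representation is my own; [] encodes ε):

```python
# Eliminate *immediate* left recursion for one non-terminal X:
#   X -> X a1 | ... | X am | b1 | ... | bn
# becomes
#   X  -> b1 X' | ... | bn X'
#   X' -> a1 X' | ... | am X' | ε
# (The Dragon-book algorithm also handles indirect left recursion.)
def eliminate_left_recursion(x, rhss):
    x_new = x + "'"
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == x]
    betas  = [rhs for rhs in rhss if not rhs or rhs[0] != x]
    if not alphas:
        return {x: rhss}                  # no left recursion: nothing to do
    return {
        x:     [beta + [x_new] for beta in betas],
        x_new: [alpha + [x_new] for alpha in alphas] + [[]],  # [] is ε
    }
```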
- 73 -
Class Problem
Transform the following grammar to eliminate left recursion:

E → E + T | T
T → T * F | F
F → ( E ) | num
- 74 -
Creating an LL(1) Grammar
Start with a left-recursive grammar:
  S → S + E
  S → E
» and apply the left-recursion elimination algorithm:
  S → E S'
  S' → + E S' | ε
Or start with a right-recursive grammar:
  S → E + S
  S → E
» and apply left factoring to eliminate common prefixes:
  S → E S'
  S' → + S | ε
- 75 -
Top-Down Parsing Summary
Language grammar
  → (left-recursion elimination, left factoring) →
LL(1) grammar
  → (FIRST, FOLLOW) →
predictive parsing table
  →
recursive-descent parser
  →
parser with AST generation
- 76 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
» Top-down parsing
» Bottom-up parsing
» Comparison
- 77 -
New Topic: Bottom-Up Parsing
A more powerful parsing technology. LR grammars are more expressive than LL:
» they construct a right-most derivation of the program;
» they allow left-recursive grammars (virtually all programming languages are left-recursive);
» it is easier to express syntax.
Shift-reduce parsers:
» parsers for LR grammars;
» automatic parser generators (yacc, bison).
- 78 -
Bottom-Up Parsing (2)
Right-most derivation, backward:
» start with the tokens;
» end with the start symbol;
» match a substring against the RHS of a production and replace it with the LHS.

S → S + E | E
E → num | ( S )

(1+2+(3+4))+5 reduces step by step:
(1+2+(3+4))+5, (E+2+(3+4))+5, (S+2+(3+4))+5, (S+E+(3+4))+5, (S+(3+4))+5, (S+(E+4))+5, (S+(S+4))+5, (S+(S+E))+5, (S+(S))+5, (S+E)+5, (S)+5, E+5, S+E, S
- 79 -
Shift-Reduce Parsing
Parsing actions: a sequence of shift and reduce operations.
Parser state: a stack of terminals and non-terminals (grows to the right).
Current derivation step = stack + unconsumed input.

Derivation step    Stack    Unconsumed input
(1+2+(3+4))+5               (1+2+(3+4))+5
(E+2+(3+4))+5      (E       +2+(3+4))+5
(S+2+(3+4))+5      (S       +2+(3+4))+5
(S+E+(3+4))+5      (S+E     +(3+4))+5
...
- 80 -
Shift-Reduce Actions
Parsing is a sequence of shifts and reduces.
Shift: move the look-ahead token to the stack.
Reduce: replace the symbols β at the top of the stack with the non-terminal X corresponding to the production X → β (i.e., pop β, push X).

Stack    Input           Action
(        1+2+(3+4))+5    shift 1
(1       +2+(3+4))+5

Stack    Input        Action
(S+E     +(3+4))+5    reduce S → S + E
(S       +(3+4))+5
- 81 -
Shift-Reduce Parsing
derivation        stack    input            action
(1+2+(3+4))+5              (1+2+(3+4))+5    shift
(1+2+(3+4))+5     (        1+2+(3+4))+5     shift
(1+2+(3+4))+5     (1       +2+(3+4))+5      reduce E → num
(E+2+(3+4))+5     (E       +2+(3+4))+5      reduce S → E
(S+2+(3+4))+5     (S       +2+(3+4))+5      shift
(S+2+(3+4))+5     (S+      2+(3+4))+5       shift
(S+2+(3+4))+5     (S+2     +(3+4))+5        reduce E → num
(S+E+(3+4))+5     (S+E     +(3+4))+5        reduce S → S+E
(S+(3+4))+5       (S       +(3+4))+5        shift
(S+(3+4))+5       (S+      (3+4))+5         shift
(S+(3+4))+5       (S+(     3+4))+5          shift
(S+(3+4))+5       (S+(3    +4))+5           reduce E → num
...

S → S + E | E
E → num | ( S )
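The shift-reduce loop itself is tiny; what is hard is choosing the action. A Python sketch in which a naive policy (reduce eagerly, trying the rules in a fixed order) happens to work for this particular grammar; in general that choice is exactly the action-selection problem discussed next:

```python
# Naive shift-reduce loop for the sum grammar (a sketch, not a general parser).
RULES = [                      # (LHS, RHS), tried in this order
    ("E", ["num"]),
    ("E", ["(", "S", ")"]),
    ("S", ["S", "+", "E"]),
    ("S", ["E"]),              # tried last, after S -> S + E
]

def parse(tokens):
    stack, rest, trace = [], list(tokens), []
    while True:
        for lhs, rhs in RULES:
            if stack[len(stack) - len(rhs):] == rhs:
                del stack[len(stack) - len(rhs):]      # pop the RHS ...
                stack.append(lhs)                      # ... push the LHS
                trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
                break
        else:
            if not rest:
                break
            stack.append(rest.pop(0))                  # shift
            trace.append("shift")
    return stack == ["S"], trace
```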
- 82 -
Potential Problems
How do we know which action to take: whether to shift or reduce, and which production to apply?
Issues:
» sometimes we can reduce but should not;
» sometimes we can reduce in different ways.
- 83 -
Action Selection Problem
Given a stack α and a look-ahead symbol b, should the parser:
» shift b onto the stack, making it αb, or
» reduce by X → β, assuming the stack has the form α = γβ, making it γX?
If the stack has the form αβ, whether to apply the reduction X → β (or shift) may depend on the stack prefix α, and α is different for different possible reductions, since the β's have different lengths.
- 84 -
LR Parsing Engine
Basic mechanism:
» use a set of parser states;
» use a stack with alternating symbols and states, e.g., 1 ( 6 S 10 + 5 (blue = state numbers);
» use a parsing table to determine which action to apply (shift/reduce) and the next state.
The parser actions can be precisely determined from the table.
- 85 -
LR Parsing Table
Algorithm: look at the entry for the current state S and input terminal C.
» If Table[S,C] = s(S'), then shift: push(C), push(S').
» If Table[S,C] = X → γ, then reduce: pop(2*|γ|), S' = top(), push(X), push(Table[S',X]).
The table has two parts: an action table indexed by terminals (next action and next state) and a goto table indexed by non-terminals (next state).
- 86 -
LR Parsing Table Example
State    (      )      id     ,      $         S     L
1        s3            s2                      g4
2        S→id   S→id   S→id   S→id   S→id
3        s3            s2                      g7    g5
4                                    accept
5               s6            s8
6        S→(L)  S→(L)  S→(L)  S→(L)  S→(L)
7        L→S    L→S    L→S    L→S    L→S
8        s3            s2                      g9
9        L→L,S  L→L,S  L→L,S  L→L,S  L→L,S

(Columns ( ) id , $ are input terminals; S and L are non-terminals.)
We want to derive this in an algorithmic fashion
- 87 -
Parsing Example ((a),b)
derivation    stack          input      action
((a),b)       1              ((a),b)    shift, goto 3
((a),b)       1(3            (a),b)     shift, goto 3
((a),b)       1(3(3          a),b)      shift, goto 2
((a),b)       1(3(3a2        ),b)       reduce S→id
((S),b)       1(3(3S7        ),b)       reduce L→S
((L),b)       1(3(3L5        ),b)       shift, goto 6
((L),b)       1(3(3L5)6      ,b)        reduce S→(L)
(S,b)         1(3S7          ,b)        reduce L→S
(L,b)         1(3L5          ,b)        shift, goto 8
(L,b)         1(3L5,8        b)         shift, goto 2
(L,b)         1(3L5,8b2      )          reduce S→id
(L,S)         1(3L5,8S9      )          reduce L→L,S
(L)           1(3L5          )          shift, goto 6
(L)           1(3L5)6                   reduce S→(L)
S             1S4            $          done

S → ( L ) | id
L → S | L , S
- 88 -
LR(k) Grammars
LR(k) = Left-to-right scanning, Right-most derivation, k lookahead characters.
Main cases:
» LR(0), LR(1)
» some variations: SLR and LALR(1)
Parsers for LR(0) grammars:
» determine the actions without any lookahead;
» will help us understand shift-reduce parsing.
- 89 -
Building LR(0) Parsing Tables
To build the parsing table:
» define the states of the parser;
» build a DFA to describe the transitions between states;
» use the DFA to build the parsing table.
Each LR(0) state is a set of LR(0) items:
» an LR(0) item is X → α . β, where X → αβ is a production in the grammar;
» the LR(0) items keep track of the progress on all of the possible upcoming productions;
» the item X → α . β abstracts the fact that the parser has already matched the string α at the top of the stack.
- 90 -
Example LR(0) State
An LR(0) item is a production from the language with a separator “.” somewhere in the RHS of the production.
The sub-string before the “.” is already on the stack (the beginnings of possible strings to be reduced).
The sub-string after the “.” is what we might see next.
Example state (a set of two items):
E → num .
E → ( . S )
- 91 -
Class Problem
For the production
E → num | ( S )
two items are:
E → num .
E → ( . S )
Are there any others? If so, what are they? If not, why?
- 92 -
LR(0) Grammar
Nested lists:
» S → ( L ) | id
» L → S | L , S
Examples:
» (a,b,c)
» ((a,b), (c,d), (e,f))
» (a, (b,c,d), ((f,g)))

(Figure: parse tree for (a, (b,c), d).)
- 93 -
Start State and Closure
Start state:
» augment the grammar with the production S' → S $;
» the start state of the DFA has an empty stack: S' → . S $.
Closure of a parser state:
» start with Closure(S) = S;
» then for each item X → α . Y β in S, add an item Y → . γ for every production Y → γ to the closure of S.
- 94 -
Closure Example
S → ( L ) | id
L → S | L , S

DFA start state: S' → . S $
closure:
S' → . S $
S → . ( L )
S → . id

- The closure is the set of possible productions to be reduced next.
- Added items have the “.” located at the beginning: no symbols from these items are on the stack yet.
- 95 -
The Goto Operation
The goto operation describes the transitions between parser states, which are sets of items.
Algorithm: for a state I and a symbol Y:
» if the item [X → α . Y β] is in I, then
» Goto(I, Y) = Closure( { [X → α Y . β] } ).

For the state I = { S' → . S $, S → . ( L ), S → . id }:
Goto(I, '(') = Closure( { S → ( . L ) } )
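Closure and goto translate almost directly into code; a Python sketch for the list grammar (the item representation is my own):

```python
# An item is (lhs, rhs, dot): e.g. ("S", ("(", "L", ")"), 1) is  S -> ( . L )
GRAMMAR = {
    "S'": [("S", "$")],
    "S":  [("(", "L", ")"), ("id",)],
    "L":  [("S",), ("L", ",", "S")],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot before a non-terminal Y
                for prod in GRAMMAR[rhs[dot]]:           # add Y -> . gamma
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item); changed = True
    return items

def goto(items, y):
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == y}
    return closure(moved)
```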
- 96 -
Class Problem
1. If I = { [E' → . E] }, then Closure(I) = ??
2. If I = { [E' → E . ], [E → E . + T] }, then Goto(I, +) = ??

E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id
- 97 -
Applying Reduce Actions
Grammar:
S → ( L ) | id
L → S | L , S

Start state: { S' → . S $ ; S → . ( L ) ; S → . id }
On '(': { S → ( . L ) ; L → . S ; L → . L , S ; S → . ( L ) ; S → . id }
On id: { S → id . } — a reduction state
On L (from the '(' state): { S → ( L . ) ; L → L . , S }
On S (from the '(' state): { L → S . } — a reduction state

States where the dot has reached the end cause reductions.
To reduce X → β: pop the RHS β off the stack, replace it with the LHS X, then rerun the DFA (e.g., ( x ) reduces to ( S )).
- 98 -
Reductions
On reducing X → β with stack αβ:
» pop β off the stack, revealing the prefix α and its state;
» take a single step in the DFA from the top state;
» push X onto the stack with the new DFA state.

derivation    stack             input    action
((a),b)       1 ( 3 ( 3         a),b)    shift, goto 2
((a),b)       1 ( 3 ( 3 a 2     ),b)     reduce S → id
((S),b)       1 ( 3 ( 3 S 7     ),b)     reduce L → S
- 99 -
Full DFA
Grammar: S → ( L ) | id; L → S | L , S (augmented with S' → S $)

State 1: S' → . S $ ; S → . ( L ) ; S → . id
State 2 (on id): S → id .                              — reduce
State 3 (on '('): S → ( . L ) ; L → . S ; L → . L , S ; S → . ( L ) ; S → . id
State 4 (on S from 1): S' → S . $                      — final state on $
State 5 (on L from 3): S → ( L . ) ; L → L . , S
State 6 (on ')' from 5): S → ( L ) .                   — reduce
State 7 (on S from 3): L → S .                         — reduce
State 8 (on ',' from 5): L → L , . S ; S → . ( L ) ; S → . id
State 9 (on S from 8): L → L , S .                     — reduce
- 100 -
Building the Parsing Table
The states in the table = the states in the DFA.
For a transition S → S' on terminal C: Table[S,C] += Shift(S').
For a transition S → S' on non-terminal N: Table[S,N] += Goto(S').
If S is a reduction state for X → γ, then: Table[S,*] += Reduce(X → γ).
- 101 -
Computed LR Parsing Table
State    (      )      id     ,      $         S     L
1        s3            s2                      g4
2        S→id   S→id   S→id   S→id   S→id
3        s3            s2                      g7    g5
4                                    accept
5               s6            s8
6        S→(L)  S→(L)  S→(L)  S→(L)  S→(L)
7        L→S    L→S    L→S    L→S    L→S
8        s3            s2                      g9
9        L→L,S  L→L,S  L→L,S  L→L,S  L→L,S

(Columns ( ) id , $ are input terminals; S and L are non-terminals. sN = shift to state N, gN = goto state N, X→γ = reduce.)
- 102 -
LR(0) Summary
LR(0) parsing recipe:
» start with an LR(0) grammar;
» compute the LR(0) states and build the DFA:
  - use the closure operation to compute the states,
  - use the goto operation to compute the transitions;
» build the LR(0) parsing table from the DFA.
This can all be done automatically.
- 103 -
Class Problem
Generate the DFA for the following grammar:

S → E + S | E
E → num
- 104 -
LR(0) Limitations
An LR(0) machine only works if every state with a reduce action has a single reduce action:
» in such a state the parser always reduces, regardless of lookahead.
With a more complex grammar, the construction gives states with shift/reduce or reduce/reduce conflicts, and we need lookahead to choose.

{ L → L , S . }                        — OK
{ L → L , S . ; S → S . , L }          — shift/reduce conflict
{ L → S , L . ; L → S . }              — reduce/reduce conflict
- 105 -
A Non-LR(0) Grammar
Grammar for addition of numbers:
» S → S + E | E
» E → num
The left-associative version is LR(0). The right-associative version is not LR(0), as you saw with the previous class problem:
» S → E + S | E
» E → num
- 106 -
LR(0) Parsing Table
Grammar: S → E + S | E; E → num (augmented with S' → S $)

State 1: S' → . S $ ; S → . E + S ; S → . E ; E → . num
State 2 (on E): S → E . + S ; S → E .
State 3 (on '+' from 2): S → E + . S ; S → . E + S ; S → . E ; E → . num
State 4 (on num): E → num .
State 5 (on S from 3): S → E + S .
State 6 (on S from 1): S' → S . $
State 7 (on $ from 6): S' → S $ .

Parsing table fragment:
      num     +         $       E    S
1     s4                        g2   g6
2     S→E     s3/S→E    S→E

Shift or reduce in state 2?
- 107 -
Solve Conflict With Lookahead
Three popular techniques employ a lookahead of 1 symbol with bottom-up parsing:
» SLR (Simple LR)
» LALR (LookAhead LR)
» LR(1)
Each has a different means of utilizing the lookahead, resulting in different processing capabilities.
- 108 -
SLR Parsing
SLR parsing is an easy extension of LR(0):
» for each reduction X → γ, look at the next symbol C;
» apply the reduction only if C is in FOLLOW(X).
The SLR parsing table eliminates some conflicts:
» it is the same as the LR(0) table, except for the reduction rows;
» it adds the reduction X → γ only in the columns of the symbols in FOLLOW(X).

Example: FOLLOW(S) = {$}, so row 2 becomes:
      num     +     $       E    S
1     s4                    g2   g6
2             s3    S→E

S → E + S | E
E → num
- 109 -
SLR Parsing Table
Reductions do not fill entire rows as before; otherwise, the table is the same as for LR(0).

      num     +         $         E    S
1     s4                          g2   g6
2             s3        S→E
3     s4                          g2   g5
4             E→num     E→num
5                       S→E+S
6                       s7
7                       accept

S → E + S | E
E → num
- 110 -
Class Problem
Consider:
S → L = R
S → R
L → * R
L → id
R → L

Think of L as an l-value, R as an r-value, and * as a pointer dereference.
When you create the states in the SLR(1) DFA, 2 of the states are the following:

{ S → L . = R ; R → L . }
{ S → R . }

Do you have any shift/reduce conflicts? (Not as easy as it looks.)
- 111 -
LR(1) Parsing
LR(1) parsing gets as much as possible out of a 1-symbol lookahead parsing table.
An LR(1) grammar is one recognizable by a shift/reduce parser with 1 lookahead.
LR(1) parsing uses concepts similar to LR(0):
» parser states = sets of items;
» an LR(1) item = an LR(0) item + a lookahead symbol that can possibly follow the production:
  LR(0) item: S → . S + E
  LR(1) item: S → . S + E , +
The lookahead only has impact upon REDUCE operations: apply a reduction only when the lookahead equals the next input symbol.
- 112 -
LR(1) States
LR(1) state = a set of LR(1) items.
LR(1) item = (X → α . β , y):
» meaning: α is already matched at the top of the stack; we next expect to see a string derivable from βy.
Shorthand notation:
» (X → α . β , {x1, ..., xn})
» means: (X → α . β , x1), ..., (X → α . β , xn).
We need to extend the closure and goto operations.

S → S . + E , {+, $}
S → S + . E , num
- 113 -
LR(1) Closure
LR(1) closure operation:
» start with Closure(S) = S;
» for each item X → α . Y β , z in S, and for each production Y → γ, add the item Y → . γ , FIRST(βz) to the closure of S;
» repeat until nothing changes.
Similar to the LR(0) closure, but it also keeps track of the lookahead symbol.
- 114 -
LR(1) Start State
Initial state: start with (S' → . S , $), then apply the closure operation.
Example: sum grammar

S' → S $
S → E + S | E
E → num

(S' → . S , $)
closure ⇒
(S' → . S , $)
(S → . E + S , $)
(S → . E , $)
(E → . num , {+, $})
- 115 -
LR(1) Goto Operation
The LR(1) goto operation describes the transitions between LR(1) states.
Algorithm: for a state I and a symbol Y (as before):
» if the item [X → α . Y β , z] is in I, then
» Goto(I, Y) = Closure( { [X → α Y . β , z] } ).

S1 = { S → E . + S , $ ; S → E . , $ }
Goto(S1, '+') = Closure( { S → E + . S , $ } ) = S2

Grammar:
S' → S $
S → E + S | E
E → num
- 116 -
Class Problem
1. Compute: Closure(I = { S → E + . S , $ }).
2. Compute: Goto(I, num).
3. Compute: Goto(I, E).

S' → S $
S → E + S | E
E → num
- 117 -
LR(1) DFA Construction
Grammar: S' → S $; S → E + S | E; E → num

State 1: S' → . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 2 (on num): E → num . , {+,$}
State 3 (on S from 1): S' → S . , $
State 4 (on E from 1): S → E . + S , $ ; S → E . , $
State 5 (on '+' from 4): S → E + . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 6 (on S from 5): S → E + S . , $
(From state 5, E leads back to state 4 and num leads back to state 2.)
- 118 -
LR(1) Reductions
(The same DFA as above.) Reductions correspond to LR(1) items of the form (X → γ . , y): in the state containing (E → num . , {+,$}), reduce E → num when the lookahead is + or $; in the state containing (S → E . , $), reduce S → E only when the lookahead is $; in the state containing (S → E + S . , $), reduce S → E + S when the lookahead is $.
- 119 -
LR(1) Parsing Table Construction
The construction is the same as for LR(0), except for reductions.
For a transition S → S' on terminal x: Table[S,x] += Shift(S').
For a transition S → S' on non-terminal N: Table[S,N] += Goto(S').
If state I contains the item (X → γ . , y), then: Table[I,y] += Reduce(X → γ).
- 120 -
LR(1) Parsing Table Example
State 1: S' → . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}
State 2 (on E): S → E . + S , $ ; S → E . , $
State 3 (on '+' from 2): S → E + . S , $ ; S → . E + S , $ ; S → . E , $ ; E → . num , {+,$}

Fragment of the parsing table:
      +     $      E
1                  g2
2     s3    S→E

Grammar:
S' → S $
S → E + S | E
E → num
- 121 -
Class Problem
Compute the LR(1) DFA for the following grammar:

E → E + T | T
T → T F | F
F → F * | a | b
- 122 -
LALR(1) Grammars
The problem with LR(1): too many states.
LALR(1) parsing (aka LookAhead LR):
» constructs the LR(1) DFA and then merges any 2 LR(1) states whose items are identical except for the lookaheads;
» results in smaller parser tables;
» theoretically less powerful than LR(1).
An LALR(1) grammar is a grammar whose LALR(1) parsing table has no conflicts.

{ S → id . , + ; S → E . , $ } + { S → id . , $ ; S → E . , + } = ??
- 123 -
LALR Parsers
LALR(1):
» generally has about the same number of states as SLR (much fewer than LR(1));
» but has the same lookahead capability as LR(1) (much better than SLR).
Example: for the Pascal programming language:
» in SLR, several hundred states;
» in LR(1), several thousand states.
- 124 -
Automate the Parsing Process
We can automate:
» the construction of LR parsing tables;
» the construction of shift-reduce parsers based on these parsing tables.
LALR(1) parser generators:
» yacc, bison;
» not much difference compared to LR(1) in practice;
» smaller parsing tables than LR(1);
» augment the LALR(1) grammar specification with declarations of precedence and associativity;
» output: an LALR(1) parser program.
- 125 -
Associativity
S → S + E | E          E → E + E
E → num                E → num

What happens if we run the ambiguous grammar (E → E + E; E → num) through the LALR construction?

{ E → E + E . , + ; E → E . + E , {+, $} }   — a shift/reduce conflict on +

For the input 1 + 2 + 3:
shift: 1 + (2 + 3)
reduce: (1 + 2) + 3
- 126 -
Associativity (2)
If an operator is left-associative:
» assign a slightly higher value to its precedence when it is on the parse stack than when it is in the input stream;
» since the stack precedence is higher, reduce will take priority (which is correct for left-associativity).
If an operator is right-associative:
» assign a slightly higher value when it is in the input stream;
» since the input-stream precedence is higher, shift will take priority (which is correct for right-associativity).
- 127 -
Precedence
E → E + E | T                      E → E + E | E x E | num | ( E )
T → T x T | num | ( E )

What happens if we run the right-hand grammar through the LALR construction? Shift/reduce conflicts result:

E → E . + E , ...    E → E + E . , x
E → E x E . , +      E → E . x E , ...

Precedence: attach precedence indicators to terminals. A shift/reduce conflict is resolved by:
1. if the precedence of the input token is greater than that of the last terminal on the parse stack, favor shift over reduce;
2. if the precedence of the input token is less than or equal to that of the last terminal on the parse stack, favor reduce over shift.
- 128 -
Abstract Syntax Tree (AST) - Review
Derivation = a sequence of applied productions:
» S ⇒ E+S ⇒ 1+S ⇒ 1+E ⇒ 1+2
Parse tree = a graph representation of a derivation:
» it doesn't capture the order of applying the productions.
The AST discards unnecessary information from the parse tree.

(Figure: parse tree for (1 + 2 + (3 + 4)) + 5 and its AST Add(Add(1, Add(2, Add(3, 4))), 5).)
- 129 -
Implicit AST Construction
LL/LR parsing techniques implicitly build the AST.
The parse tree is captured in the derivation:
» LL parsing: the AST is represented by the applied productions;
» LR parsing: the AST is represented by the applied reductions.
We want to explicitly construct the AST during the parsing phase.
- 130 -
AST Construction - LL
void parse_S() {
    switch (token) {
    case num:
    case '(':
        parse_E();
        parse_S'();
        return;
    default:
        ParseError();
    }
}

Expr parse_S() {
    switch (token) {
    case num:
    case '(':
        Expr left  = parse_E();
        Expr right = parse_S'();
        if (right == NULL)
            return left;
        else
            return new Add(left, right);
    default:
        ParseError();
    }
}
LL parsing: extend the procedures for the non-terminals

S  → E S'
S' → ε | + S
E  → num | (S)
- 131 -
AST Construction - LR
We again need to add code for explicit AST construction
AST construction mechanism
  » Store parts of the tree on the stack
  » For each non-terminal symbol X on the stack, also store the
    sub-tree rooted at X on the stack
  » Whenever the parser performs a reduce operation for a
    production X → γ, create an AST node for X
- 132 -
AST Construction for LR - Example
S → E + S | E
E → num | (S)

input string: "1 + 2 + 3"

[Figure: before the reduction S → E + S, the stack holds E with
sub-tree Num(1), then +, then S with sub-tree Add(Num(2), Num(3));
after the reduction it holds a single S with sub-tree
Add(Num(1), Add(Num(2), Num(3)))]
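The stack discipline can be sketched directly. This is not a generated LR parser: the reduce decisions that a real LR table would make are hand-coded here for this one grammar, just to show sub-trees riding on the stack and being combined at each reduce:

```python
# Shift-reduce sketch for  S -> E + S | E ,  E -> num | ( S ).
# Stack entries pair a grammar symbol with its AST sub-tree; every
# reduce pops the right-hand side and pushes the left-hand side with a
# freshly built node.  The lookahead checks stand in for the LR table.

def parse(tokens):
    toks = list(tokens) + ['$']
    stack = []                       # (grammar symbol, sub-tree) pairs
    i = 0
    while True:
        syms = [s for s, _ in stack]
        if syms[-1:] == ['num']:                           # E -> num
            stack[-1] = ('E', ('Num', stack[-1][1]))
        elif syms[-3:] == ['(', 'S', ')']:                 # E -> ( S )
            tree = stack[-2][1]
            del stack[-3:]
            stack.append(('E', tree))
        elif syms[-3:] == ['E', '+', 'S']:                 # S -> E + S
            node = ('Add', stack[-3][1], stack[-1][1])
            del stack[-3:]
            stack.append(('S', node))
        elif syms[-1:] == ['E'] and toks[i] in (')', '$'): # S -> E
            stack[-1] = ('S', stack[-1][1])
        elif syms == ['S'] and toks[i] == '$':             # accept
            return stack[0][1]
        else:                                              # shift
            tok = toks[i]
            i += 1
            if tok == '$':
                raise SyntaxError('unexpected end of input')
            stack.append(('num', tok) if isinstance(tok, int) else (tok, tok))
```

On "1 + 2 + 3" the reductions fire right-to-left, producing the same Add(Num(1), Add(Num(2), Num(3))) tree as in the figure.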
- 133 -
Problems
Unstructured code: mixing parsing code with AST construction code
Automatic parser generators
  » The generated parser needs to contain AST construction code
  » How to construct a customized AST data structure using an
    automatic parser generator?
May want to perform other actions concurrently with the parsing phase
  » E.g., semantic checks
  » This can reduce the number of compiler passes
- 134 -
Syntax-Directed Definition
Solution: syntax-directed definition
  » Extends each grammar production with an associated semantic
    action (code):  S → E + S {action}
  » The parser generator adds these actions into the generated
    parser
  » Each action is executed when the corresponding production is
    reduced
- 135 -
Semantic Actions
Actions = C code (for bison/yacc)
The actions access the parser stack
  » Parser generators extend the stack of symbols with entries for
    user-defined structures (e.g., parse trees)
The action code should be able to refer to the grammar symbols in the
productions
  » Need to refer to multiple occurrences of the same non-terminal
    symbol, and to distinguish RHS vs. LHS occurrences:  E → E + E
  » Use dollar variables in yacc/bison ($$, $1, $2, etc.)
      expr ::= expr PLUS expr  {$$ = $1 + $3;}
- 136 -
Building the AST
Use semantic actions to build the AST AST is built bottom-up along with parsing
expr ::= NUM             {$$ = new Num($1.val);}
expr ::= expr PLUS expr  {$$ = new Add($1, $3);}
expr ::= expr MULT expr  {$$ = new Mul($1, $3);}
expr ::= LPAR expr RPAR  {$$ = $2;}

Recall: user-defined type for objects on the stack (%union)
- 137 -
Outline
DFA & NFA
Regular expressions
Regular languages
Context-free languages & PDA
Scanner
Parser
  » Top-down parsing
  » Bottom-up parsing
  » Comparison
- 138 -
LL/LR Grammar Summary
LL parsing tables
  » Non-terminals x terminals → productions
  » Computed using FIRST/FOLLOW
LR parsing tables
  » LR states x terminals → {shift/reduce}
  » LR states x non-terminals → goto
  » Computed using closure/goto operations on LR states
A grammar is:
  » LL(1) if its LL(1) parsing table has no conflicts
  » same for LR(0), SLR, LALR(1), LR(1)
- 139 -
Top-Down Parsing
S → S+E → E+E → (S)+E → (S+E)+E → (S+E+E)+E → (E+E+E)+E
  → (1+E+E)+E → (1+2+E)+E → ...

S → S + E | E
E → num | (S)

In a left-most derivation, the entire tree above a token (e.g., the 2)
has already been expanded when that token is encountered

[Figure: parse tree for "(1+2+(3+4))+5"]
- 140 -
Top-Down vs Bottom-Up
[Figure: partially-built parse tree over scanned/unscanned input,
for top-down vs bottom-up parsing]

Bottom-up: don't need to figure out as much of the parse tree for a
given amount of input → more time to decide which rules to apply
- 141 -
Terminology: LL vs LR

LL(k)
  » Left-to-right scan of input
  » Left-most derivation
  » k-symbol lookahead
  » [Top-down or predictive] parsing, or LL parser
  » Performs a pre-order traversal of the parse tree
LR(k)
  » Left-to-right scan of input
  » Right-most derivation
  » k-symbol lookahead
  » [Bottom-up or shift-reduce] parsing, or LR parser
  » Performs a post-order traversal of the parse tree
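The traversal-order contrast can be made concrete with two generators over a tree; the tuple encoding of the tree below is an assumption for this sketch:

```python
# LL parsing applies a production when its node is first reached
# (pre-order); LR fires a reduction only after the whole subtree has
# been parsed (post-order).  The same tree, visited both ways:

def preorder(t):
    if isinstance(t, tuple):
        yield t[0]                   # visit the node label first
        for child in t[1:]:
            yield from preorder(child)
    else:
        yield t                      # leaf

def postorder(t):
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from postorder(child)
        yield t[0]                   # visit the node label last
    else:
        yield t

# tree for "1 + 2 + 3" with right-nested +
ast = ('Add', ('Num', 1), ('Add', ('Num', 2), ('Num', 3)))
```

Pre-order yields the root Add before any leaf (the LL order); post-order yields it only after every operand (the LR order).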
- 142 -
Classification of Grammars
[Figure: Venn diagram of grammar classes (not to scale):
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1), with LL(1) overlapping them]

LR(k) ⊂ LR(k+1)             LL(k) ⊂ LL(k+1)
LL(k) ⊂ LR(k)
LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1)
- 143 -
Bottom-Up Parsing
(1+2+(3+4))+5 → (E+2+(3+4))+5 → (S+2+(3+4))+5 → (S+E+(3+4))+5 → ...

S → S + E | E
E → num | (S)

Advantage of bottom-up parsing: can postpone the selection of
productions until more of the input is scanned

[Figure: parse tree for "(1+2+(3+4))+5"]