fall 2016-2017 compiler principles lecture 2: ll parsingcomp171/wiki.files/02-parsing-1-ll.pdf ·...

Post on 20-Jun-2020


Fall 2016-2017 Compiler Principles
Lecture 2: LL parsing

Roman Manevich
Ben-Gurion University of the Negev

Books

• Compilers: Principles, Techniques, and Tools. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman

• Advanced Compiler Design and Implementation. Steven Muchnick

• Modern Compiler Design. D. Grune, H. Bal, C. Jacobs, K. Langendoen

• Modern Compiler Implementation in Java. Andrew W. Appel

Tentative syllabus

• Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR)

• Intermediate Representation: Operational Semantics, Lowering

• Optimizations: Dataflow Analysis, Loop Optimizations

• Code Generation: Register Allocation, Energy Optimization, Instruction Selection

• mid-term exam

Parsing background

• Context-free grammars

– Terminals

– Nonterminals

– Start nonterminal

– Productions (rules)

• Context-free languages

– Derivations (leftmost, rightmost)

– Derivation tree (also called parse tree)

• Ambiguous grammars

Agenda

• Understand role of syntax analysis

• Parsing strategies

• LL parsing

– Building a predictor table via FIRST/FOLLOW/NULLABLE sets

– Pushdown automata algorithm

• Handling conflicts

Role of syntax analysis

• Recover structure from stream of tokens
– Parse tree / abstract syntax tree

• Error reporting (recovery)

• Other possible tasks
– Syntax directed translation (one pass compilers)
– Create symbol table
– Create pretty-printed version of the program, e.g., Auto Formatting function in IDE

[Figure: compiler pipeline. High-level language (Scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, symbol table, etc. → Intermediate Representation (IR) → Code Generation → Executable code]

From tokens to abstract syntax trees

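[Figure: the program text 59 + (1257 * xPosition) enters the Lexical Analyzer, which either reports a lexical error or produces the token stream num + ( num * id ). The Parser either reports a syntax error or turns the valid token stream into an Abstract Syntax Tree: + at the root with children num and *, where * has children num and x.]

Lexical analysis: regular expressions, finite automata

Parsing: context-free grammars, push-down automata

Grammar:
E → id
E → num
E → E + E
E → E * E
E → ( E )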

Marking “end-of-file”

• Sometimes it will be useful to transform a grammar G with start non-terminal S into a grammar G’ with a new start non-terminal S’ and a new production rule S’ → S $
– $ is not part of the set of tokens
– It is a special End-Of-File (EOF) token

• To parse α with G’ we change it into α $

• Simplifies parsing grammars with null productions
– Also simplifies parsing LR grammars

Another convention

• We will assume that all productions have been consecutively numbered

(1) S → E $
(2) E → T
(3) E → E + T
(4) T → id
(5) T → ( E )

Parsing strategies

Broad kinds of parsers

• Parsers for arbitrary grammars
– Cocke-Younger-Kasami [‘65] method: O(n^3)
– Earley’s method (implemented by NLTK): O(n^3), but lower for restricted classes
– Not commonly used by compilers

• Parsers for restricted classes of grammars
– Top-Down
  • With/without backtracking
– Bottom-Up

Top-down parsing

• Constructs parse tree in a top-down manner

• Finds a leftmost derivation

• Predictive: for every non-terminal and k tokens of lookahead, predict the next production; this is LL(k)

• Challenge: beginning with the start symbol, try to guess the productions to apply to end up at the user's program

By Fidelio (Own work) [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Predictive parsing

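Exercise: show leftmost derivation

Grammar:
(1) E → LIT
(2)   | ( E OP E )
(3)   | not E
(4) LIT → true
(5)   | false
(6) OP → and
(7)   | or
(8)   | xor

Input: not ( not true or false )

Leftmost derivation:
E
⇒ not E
⇒ not ( E OP E )
⇒ not ( not E OP E )
⇒ not ( not LIT OP E )
⇒ not ( not true OP E )
⇒ not ( not true or E )
⇒ not ( not true or LIT )
⇒ not ( not true or false )

[Figure: parse tree for the input. The root E expands to not E; that E expands to ( E OP E ), whose first E is not E with a LIT leaf true, whose OP is or, and whose second E is a LIT leaf false.]

How did we decide which production of ‘E’ to take?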

Predictive parsing

• Given a grammar G, attempt to derive a word ω

• Idea
– Scan input from left to right
– Apply production to leftmost nonterminal
– Pick production rule based on next input token

• Problem: more than one production may be applicable for the next token

• Solution: restrict grammars to LL(1)
– Parser correctly predicts which production to apply
– If grammar is not in LL(1) the parser construction algorithm will detect it

LL(1) parsing via pushdown automata

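[Figure: the parsing program reads the input stream (a + b $), keeps a stack of symbols holding the current sentential form (X Y Z $), and consults a prediction table indexed by nonterminal and token whose entries are productions. Its output is a derivation tree or an error.]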

LL(1) parsing algorithm

• Set stack = S$

• while true
– Prediction: when top of stack is a nonterminal N
  1. Pop N
  2. Look up Table[N,t], where t is the next input token
  3. If Table[N,t] is not empty, push the right-hand side of Table[N,t] on the stack, else return syntax error
– Match: when top of stack is a terminal t
  • If t = next input token, pop t and increment the input index, else return syntax error
– End: when stack is empty
  • If input is empty return success, else return syntax error
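The loop above can be sketched in Python for the Boolean-expression grammar used in this lecture. This is only a minimal sketch under stated assumptions: the names `PRODUCTIONS`, `TABLE`, and `ll1_parse` are invented for illustration, and the table entries are the ones from the lecture's example prediction table for productions (1) to (8).

```python
# Hypothetical encoding of productions (1)-(8): number -> (left side, right side).
PRODUCTIONS = {
    1: ("E", ["LIT"]),
    2: ("E", ["(", "E", "OP", "E", ")"]),
    3: ("E", ["not", "E"]),
    4: ("LIT", ["true"]),
    5: ("LIT", ["false"]),
    6: ("OP", ["and"]),
    7: ("OP", ["or"]),
    8: ("OP", ["xor"]),
}

# TABLE[nonterminal][lookahead token] = production number to predict.
TABLE = {
    "E": {"(": 2, "not": 3, "true": 1, "false": 1},
    "LIT": {"true": 4, "false": 5},
    "OP": {"and": 6, "or": 7, "xor": 8},
}


def ll1_parse(tokens, start="E"):
    """Return the production numbers of the leftmost derivation, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]   # mark end-of-file
    stack = ["$", start]            # the top of the stack is the end of the list
    pos = 0
    used = []
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top in TABLE:            # Prediction: top of stack is a nonterminal
            number = TABLE[top].get(lookahead)
            if number is None:
                raise SyntaxError(f"no production for ({top}, {lookahead})")
            used.append(number)
            # Push the right-hand side so that its first symbol ends up on top.
            stack.extend(reversed(PRODUCTIONS[number][1]))
        elif top == lookahead:      # Match: top of stack is a terminal
            pos += 1
        else:
            raise SyntaxError(f"expected {top}, found {lookahead}")
    if pos != len(tokens):          # End: the stack is empty, so the input must be too
        raise SyntaxError("input remains after the stack emptied")
    return used
```

For example, `ll1_parse("not ( not true or false )".split())` returns `[3, 2, 3, 1, 4, 7, 1, 5]`, the same sequence of productions as in the leftmost-derivation exercise.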

Example prediction table

(1) E → LIT
(2) E → ( E OP E )
(3) E → not E
(4) LIT → true
(5) LIT → false
(6) OP → and
(7) OP → or
(8) OP → xor

Rows are nonterminals, columns are input tokens:

    | (  | )  | not | true | false | and | or | xor | $
E   | 2  |    | 3   | 1    | 1     |     |    |     |
LIT |    |    |     | 4    | 5     |     |    |     |
OP  |    |    |     |      |       | 6   | 7  | 8   |

Table entries determine which production to take. For example, '(' ∈ FIRST( '( E OP E )' ), so Table[E, '('] = 2.

Running parser example

Grammar: S → aSb | c
Input: aacbb$

Prediction table:

  | a       | b | c
S | S → aSb |   | S → c

Input suffix | Stack content | Move
aacbb$ | S$ | predict(S,a) = S → aSb
aacbb$ | aSb$ | match(a,a)
acbb$ | Sb$ | predict(S,a) = S → aSb
acbb$ | aSbb$ | match(a,a)
cbb$ | Sbb$ | predict(S,c) = S → c
cbb$ | cbb$ | match(c,c)
bb$ | bb$ | match(b,b)
b$ | b$ | match(b,b)
$ | $ | match($,$): success

Illegal input example

Grammar: S → aSb | c
Input: abcbb$

Prediction table:

  | a       | b | c
S | S → aSb |   | S → c

Input suffix | Stack content | Move
abcbb$ | S$ | predict(S,a) = S → aSb
abcbb$ | aSb$ | match(a,a)
bcbb$ | Sb$ | predict(S,b) = ERROR
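The two traces for S → aSb | c can be reproduced with a small tracing variant of the parser loop. This is a sketch under assumptions: `TABLE` and `trace` are illustrative names, every token is a single character, and the input string is expected to end with the `$` marker.

```python
# Prediction table for S -> aSb | c, keyed by (nonterminal, lookahead).
TABLE = {("S", "a"): ["a", "S", "b"], ("S", "c"): ["c"]}


def trace(text):
    """Return (rows, accepted); each row is (input suffix, stack content, move)."""
    if not text.endswith("$"):
        raise ValueError("input must end with the $ marker")
    stack = ["$", "S"]              # the top of the stack is the end of the list
    pos = 0
    rows = []
    while stack:
        suffix = text[pos:]
        shown = "".join(reversed(stack))   # print the stack with its top first
        top, look = stack[-1], text[pos]
        if top == "S":                     # Prediction step
            rhs = TABLE.get((top, look))
            if rhs is None:
                rows.append((suffix, shown, f"predict({top},{look}) = ERROR"))
                return rows, False
            rows.append((suffix, shown, f"predict({top},{look}) = S → {''.join(rhs)}"))
            stack.pop()
            stack.extend(reversed(rhs))
        elif top == look:                  # Match step
            rows.append((suffix, shown, f"match({top},{look})"))
            stack.pop()
            pos += 1
        else:
            rows.append((suffix, shown, f"match({top},{look}) = ERROR"))
            return rows, False
    return rows, True
```

Calling `trace("aacbb$")` yields nine rows ending in `match($,$)` with `accepted` equal to `True`, while `trace("abcbb$")` stops at the third row with `predict(S,b) = ERROR`.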

Building the prediction table

• Let G be a grammar

• Compute FIRST/NULLABLE/FOLLOW

• Check for conflicts

– No conflicts => G is an LL(1) grammar

– Conflicts exist => G is not an LL(1) grammar

• Attempt to transform G into an equivalent LL(1) grammar G’

FIRST sets

• Definition: For a nonterminal A, FIRST(A) is the set of terminals that can start a sentence derived from A
– Formally: FIRST(A) = {t | A ⇒* t ω}

• Definition: For a sentential form α, FIRST(α) is the set of terminals that can start a sentence derived from α
– Formally: FIRST(α) = {t | α ⇒* t ω}

FIRST sets example

E → LIT | ( E OP E ) | not E
LIT → true | false
OP → and | or | xor

• FIRST(E) = …?

• FIRST(LIT) = …?

• FIRST(OP) = …?

Answers:

• FIRST(E) = FIRST(LIT) ∪ FIRST(( E OP E )) ∪ FIRST(not E)

• FIRST(LIT) = { true, false }

• FIRST(OP) = { and, or, xor }

• A set of recursive equations

• How do we solve them?

Computing FIRST sets

Assume no null productions (A → ε)

1. Initially, for all nonterminals A, set FIRST(A) = { t | A → t ω for some ω }

2. Repeat the following until no changes occur:
   for each nonterminal A
     for each production A → α1 | … | αk
       FIRST(A) := FIRST(α1) ∪ … ∪ FIRST(αk)

• This is known as a fixed-point algorithm

• We will see such iterative methods later in the course and learn to reason about them

Exercise: compute FIRST

STMT → if EXPR then STMT
     | while EXPR do STMT
     | EXPR ;
EXPR → TERM -> id
     | zero? TERM
     | not EXPR
     | ++ id
     | -- id
TERM → id
     | constant

(In the first EXPR alternative, -> is a token of the language, not the production arrow.)

FIRST(STMT) = FIRST(if) ∪ FIRST(while) ∪ FIRST(EXPR)
FIRST(EXPR) = FIRST(TERM) ∪ FIRST(zero?) ∪ FIRST(not) ∪ FIRST(++) ∪ FIRST(--)
FIRST(TERM) = FIRST(id) ∪ FIRST(constant)

Simplifying the terminal cases gives:

FIRST(STMT) = {if, while} ∪ FIRST(EXPR)
FIRST(EXPR) = {zero?, not, ++, --} ∪ FIRST(TERM)
FIRST(TERM) = {id, constant}

Step | FIRST(TERM) | FIRST(EXPR) | FIRST(STMT)
1. Initialization | id, constant | zero?, not, ++, -- | if, while
2. Iterate 1 | id, constant | zero?, not, ++, -- | if, while, zero?, not, ++, --
2. Iterate 2 | id, constant | zero?, not, ++, --, id, constant | if, while, zero?, not, ++, --
2. Iterate 3 (fixed-point) | id, constant | zero?, not, ++, --, id, constant | if, while, zero?, not, ++, --, id, constant
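The fixed-point computation for the STMT/EXPR/TERM exercise can be sketched as follows. This is a minimal sketch assuming no null productions, so FIRST of a right-hand side is determined by its first symbol alone. The names `GRAMMAR` and `first_sets` are illustrative, and the sketch updates every nonterminal in each pass, so it may reach the fixed point in fewer passes than the step-by-step table above.

```python
# Grammar of the exercise: nonterminal -> list of right-hand sides.
# Any symbol that is not a key of the dictionary is treated as a terminal.
GRAMMAR = {
    "STMT": [["if", "EXPR", "then", "STMT"],
             ["while", "EXPR", "do", "STMT"],
             ["EXPR", ";"]],
    "EXPR": [["TERM", "->", "id"],
             ["zero?", "TERM"],
             ["not", "EXPR"],
             ["++", "id"],
             ["--", "id"]],
    "TERM": [["id"], ["constant"]],
}


def first_sets(grammar):
    """Compute FIRST for every nonterminal, assuming no epsilon productions."""
    # Step 1: terminals that directly start some right-hand side.
    first = {
        a: {alt[0] for alt in alts if alt[0] not in grammar}
        for a, alts in grammar.items()
    }
    # Step 2: propagate FIRST of leading nonterminals until nothing changes.
    changed = True
    while changed:
        changed = False
        for a, alts in grammar.items():
            for alt in alts:
                head = alt[0]
                if head in grammar and not first[head] <= first[a]:
                    first[a] |= first[head]
                    changed = True
    return first


FIRST = first_sets(GRAMMAR)
```

The resulting sets agree with the last row of the table: FIRST(TERM) has two tokens, FIRST(EXPR) six, and FIRST(STMT) eight.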

Reasoning about the algorithm

• Is the algorithm correct?

• Does it terminate? (complexity)

• Termination:

• Correctness:

LL(1) Parsing of grammars without epsilon productions

Using FIRST sets

• Assume G has no epsilon productions and for every non-terminal X and every pair of productions X → α and X → β we have that FIRST(α) ∩ FIRST(β) = {}

• No intersection between FIRST sets => can always pick a single rule

• In our Boolean expressions example
– FIRST( LIT ) = { true, false }
– FIRST( ( E OP E ) ) = { ‘(‘ }
– FIRST( not E ) = { not }

• If the FIRST sets intersect, may need longer lookahead
– LL(k) = class of grammars in which the production rule can be determined using a lookahead of k tokens
– LL(1) is an important and useful class

• What if there are epsilon productions?

Extending LL(1) parsing for epsilon productions

FIRST, FOLLOW, NULLABLE sets

• For each non-terminal X

• FIRST(X) = set of terminals that can start a sentence derived from X
– FIRST(X) = {t | X ⇒* t ω}

• NULLABLE(X) if X ⇒* ε

• FOLLOW(X) = set of terminals that can follow X in some derivation
– FOLLOW(X) = {t | S ⇒* α X t β}

Computing the NULLABLE set

• Lemma: NULLABLE(α1 … αk) = NULLABLE(α1) ∧ … ∧ NULLABLE(αk)

1. Initially NULLABLE(X) = false

2. For each non-terminal X, if there exists a production X → ε then NULLABLE(X) = true

3. Repeat
     for each production Y → α1 … αk
       if NULLABLE(α1 … αk) then NULLABLE(Y) = true
   until NULLABLE stabilizes

Exercise: compute NULLABLE

S → A a b
A → a | ε
B → A B | C
C → b | ε

NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(a) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)
NULLABLE(B) = (NULLABLE(A) ∧ NULLABLE(B)) ∨ NULLABLE(C)
NULLABLE(C) = NULLABLE(b) ∨ NULLABLE(ε)
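The NULLABLE iteration for this exercise grammar can be sketched as below. It is a minimal sketch, not a reference implementation: `GRAMMAR` and `nullable_set` are illustrative names, and an empty list is assumed to encode an epsilon production.

```python
# Exercise grammar: nonterminal -> list of right-hand sides; [] encodes epsilon.
GRAMMAR = {
    "S": [["A", "a", "b"]],
    "A": [["a"], []],
    "B": [["A", "B"], ["C"]],
    "C": [["b"], []],
}


def nullable_set(grammar):
    """Return {nonterminal: bool} computed by iterating until nothing changes."""
    nullable = {x: False for x in grammar}          # step 1
    changed = True
    while changed:                                  # steps 2 and 3 combined
        changed = False
        for y, alternatives in grammar.items():
            if nullable[y]:
                continue
            # A right-hand side is nullable when all of its symbols are.
            # Terminals are never nullable; all() over [] is True (epsilon).
            if any(all(nullable.get(sym, False) for sym in alt)
                   for alt in alternatives):
                nullable[y] = True
                changed = True
    return nullable


NULLABLE = nullable_set(GRAMMAR)
```

Under this encoding A and C are nullable because of their epsilon productions, B becomes nullable through B → C, and S is not nullable because its only production contains terminals.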

FIRST with epsilon productions

• How do we compute FIRST(α1 … αk) when epsilon productions are allowed?
– FIRST(α1 … αk) =
    if not NULLABLE(α1) then FIRST(α1)
    else FIRST(α1) ∪ FIRST(α2 … αk)

Exercise: compute FIRST

S → A c b
A → a | ε

NULLABLE(S) = NULLABLE(A) ∧ NULLABLE(c) ∧ NULLABLE(b)
NULLABLE(A) = NULLABLE(a) ∨ NULLABLE(ε)

FIRST(S) = FIRST(A) ∪ FIRST(cb)
FIRST(A) = FIRST(a) ∪ FIRST(ε)

FIRST(S) = FIRST(A) ∪ {c}
FIRST(A) = {a}
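The rule for FIRST(α1 … αk) with epsilon productions can be sketched for this exercise as follows. The function names (`compute_nullable`, `first_of_sequence`, `compute_first`) are assumptions made for illustration, and an empty list again encodes an epsilon production.

```python
GRAMMAR = {"S": [["A", "c", "b"]], "A": [["a"], []]}   # [] encodes epsilon


def compute_nullable(grammar):
    """Fixed-point computation of NULLABLE for every nonterminal."""
    nullable = {x: False for x in grammar}
    changed = True
    while changed:
        changed = False
        for x, alternatives in grammar.items():
            if not nullable[x] and any(
                all(nullable.get(sym, False) for sym in alt) for alt in alternatives
            ):
                nullable[x] = True
                changed = True
    return nullable


def first_of_sequence(seq, first, nullable, grammar):
    """FIRST(a1 ... ak): keep adding FIRST(ai) while the prefix before it is nullable."""
    result = set()
    for sym in seq:
        if sym not in grammar:      # a terminal starts the rest of the sequence
            result.add(sym)
            break
        result |= first[sym]
        if not nullable[sym]:
            break
    return result


def compute_first(grammar, nullable):
    """Fixed-point computation of FIRST for every nonterminal."""
    first = {x: set() for x in grammar}
    changed = True
    while changed:
        changed = False
        for x, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_sequence(alt, first, nullable, grammar)
                if not new <= first[x]:
                    first[x] |= new
                    changed = True
    return first


NULLABLE = compute_nullable(GRAMMAR)
FIRST = compute_first(GRAMMAR, NULLABLE)
```

Here FIRST(A) = {a} and, because A is nullable, FIRST(S) = FIRST(A) ∪ {c} = {a, c}.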

FOLLOW sets

• if X → α Y β then FOLLOW(Y) ⊇ FIRST(β)
  if NULLABLE(β) or β = ε then FOLLOW(Y) ⊇ FOLLOW(X)

• Allows predicting epsilon productions: X → ε when the lookahead token is in FOLLOW(X)

p. 189

S → A c b
A → a | ε

What should we predict for input “cb”?

What should we predict for input “acb”?
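The two FOLLOW rules can be sketched as one fixed-point loop that computes NULLABLE, FIRST, and FOLLOW together. This is a sketch under assumptions: the function name `analyze` is illustrative, an empty list encodes an epsilon production, and the grammar is augmented with S' → S $ following the end-of-file convention from earlier in the lecture.

```python
# Example grammar, augmented with S' -> S $; [] encodes epsilon.
GRAMMAR = {
    "S'": [["S", "$"]],
    "S": [["A", "c", "b"]],
    "A": [["a"], []],
}


def analyze(grammar):
    """Return (nullable, first, follow) dictionaries keyed by nonterminal."""
    nullable = {x: False for x in grammar}
    first = {x: set() for x in grammar}
    follow = {x: set() for x in grammar}

    def first_of(seq):
        """Return (FIRST(seq), whether seq is nullable) from the current estimates."""
        out = set()
        for sym in seq:
            if sym not in grammar:          # terminal
                out.add(sym)
                return out, False
            out |= first[sym]
            if not nullable[sym]:
                return out, False
        return out, True

    changed = True
    while changed:
        changed = False
        for x, alternatives in grammar.items():
            for alt in alternatives:
                tokens, is_nullable = first_of(alt)
                if is_nullable and not nullable[x]:
                    nullable[x] = True
                    changed = True
                if not tokens <= first[x]:
                    first[x] |= tokens
                    changed = True
                # For X -> alpha Y beta: FOLLOW(Y) includes FIRST(beta),
                # and FOLLOW(X) as well when beta is nullable or empty.
                for i, y in enumerate(alt):
                    if y not in grammar:
                        continue
                    beta_first, beta_nullable = first_of(alt[i + 1:])
                    new = beta_first | (follow[x] if beta_nullable else set())
                    if not new <= follow[y]:
                        follow[y] |= new
                        changed = True
    return nullable, first, follow


NULLABLE, FIRST, FOLLOW = analyze(GRAMMAR)
```

With these sets FOLLOW(A) = {c}. On input "cb" the lookahead c is in FOLLOW(A), so A → ε is predicted; on input "acb" the lookahead a is in FIRST(a), so A → a is predicted.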

LL(1) conflicts

Conflicts

• FIRST-FIRST conflict
– X → α and X → β are two productions of X, and
– FIRST(α) ∩ FIRST(β) ≠ {}

• FIRST-FOLLOW conflict
– NULLABLE(X), and
– FIRST(X) ∩ FOLLOW(X) ≠ {}

LL(1) grammars

• A grammar is in the class LL(1) when its LL(1) prediction table contains no conflicts

• A language is said to be LL(1) when it has an LL(1) grammar

LL(k) grammars

• Generalizes LL(1) for k lookahead tokens

• Need to generalize FIRST and FOLLOW for k lookahead tokens

Agenda

• LL(k) via pushdown automata

• Predicting productions via FIRST/FOLLOW/NULLABLE sets

• Handling conflicts

Handling conflicts

Problem 1: FIRST-FIRST conflict

term → ID | indexed_elem
indexed_elem → ID [ expr ]

• FIRST(term) = { ID }

• FIRST(indexed_elem) = { ID }

• How can we transform the grammar into an equivalent grammar that does not have this conflict?

Solution: left factoring

• Rewrite the grammar to be in LL(1)

term → ID | indexed_elem
indexed_elem → ID [ expr ]

becomes

term → ID after_ID
after_ID → [ expr ] | ε

Intuition: just like factoring in algebra: x*y + x*z into x*(y+z)

New grammar is more complex: it has an epsilon production

Exercise: apply left factoring

S → if E then S else S
  | if E then S
  | T

Solution:

S → if E then S S’ | T
S’ → else S | ε
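One round of left factoring, as applied in this exercise, can be sketched as follows. This is an illustrative sketch rather than a complete algorithm: `common_prefix` and `left_factor` are assumed names, only alternatives that start with the same symbol are grouped, fresh nonterminals are named by appending primes, and an empty list encodes epsilon.

```python
def common_prefix(sequences):
    """Longest prefix shared by all the given lists of symbols."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def left_factor(nonterminal, alternatives):
    """Factor out common prefixes once; return {nonterminal: alternatives}."""
    groups = {}                      # first symbol -> alternatives starting with it
    for alt in alternatives:
        key = alt[0] if alt else None
        groups.setdefault(key, []).append(alt)

    result = {nonterminal: []}
    fresh_count = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            result[nonterminal].extend(group)    # nothing to factor
            continue
        fresh_count += 1
        fresh = nonterminal + "'" * fresh_count  # hypothetical fresh name, e.g. S'
        prefix = common_prefix(group)
        result[nonterminal].append(prefix + [fresh])
        # The fresh nonterminal derives whatever followed the shared prefix.
        result[fresh] = [alt[len(prefix):] for alt in group]
    return result


FACTORED = left_factor("S", [
    ["if", "E", "then", "S", "else", "S"],
    ["if", "E", "then", "S"],
    ["T"],
])
```

For the exercise this produces S → if E then S S' | T and S' → else S | ε. Conflicts that are only visible after substituting a nonterminal, as in the term / indexed_elem example, are not handled by this sketch.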

Problem 2: FIRST-FOLLOW conflict

S → A a b
A → a | ε

• FIRST(S) = { a }, FOLLOW(S) = { }

• FIRST(A) = { a }, FOLLOW(A) = { a }

• How can we transform the grammar into an equivalent grammar that does not have this conflict?
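The conflict in this grammar can be made visible by filling the prediction table and looking for cells that hold more than one production. The sketch below takes the NULLABLE, FIRST, and FOLLOW values stated on this slide as given inputs; `predict_set`, `build_table`, and `conflicts` are illustrative names, and an empty list encodes epsilon.

```python
GRAMMAR = {"S": [["A", "a", "b"]], "A": [["a"], []]}   # [] encodes epsilon
NULLABLE = {"S": False, "A": True}
FIRST = {"S": {"a"}, "A": {"a"}}
FOLLOW = {"S": set(), "A": {"a"}}


def predict_set(lhs, rhs):
    """Lookahead tokens on which the production lhs -> rhs should be predicted."""
    tokens = set()
    for sym in rhs:
        if sym not in GRAMMAR:          # terminal: it starts the rest of rhs
            tokens.add(sym)
            return tokens
        tokens |= FIRST[sym]
        if not NULLABLE[sym]:
            return tokens
    return tokens | FOLLOW[lhs]         # rhs is nullable, so FOLLOW(lhs) applies


def build_table(grammar):
    """Map (nonterminal, token) to the list of candidate right-hand sides."""
    table = {}
    for lhs, alternatives in grammar.items():
        for rhs in alternatives:
            for token in predict_set(lhs, rhs):
                table.setdefault((lhs, token), []).append(rhs)
    return table


def conflicts(table):
    """Cells with more than one candidate production."""
    return {cell: rules for cell, rules in table.items() if len(rules) > 1}


TABLE = build_table(GRAMMAR)
CONFLICTS = conflicts(TABLE)
```

The cell (A, a) receives both A → a (from FIRST) and A → ε (from FOLLOW), which is the FIRST-FOLLOW conflict described above.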

Solution: substitution

S → A a b
A → a | ε

Substitute A in S:

S → a a b | a b

Left factoring:

S → a after_A
after_A → a b | b

Problem 3: FIRST-FIRST conflict

E → E - term | term

• Left recursion cannot be handled with a bounded lookahead

• How can we transform the grammar into an equivalent grammar that does not have this conflict?

Solution: left recursion removal

G1: N → Nα | β

G2: N → βN’
    N’ → αN’ | ε

• L(G1) = β, βα, βαα, βααα, …

• L(G2) = same

For our 3rd example:

E → E - term | term

becomes

E → term TE
TE → - term TE | ε

p. 130

Can be done algorithmically.
Problem 1: grammar becomes mangled beyond recognition
Problem 2: grammar may not be LL(1)
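The G1 to G2 rewrite for immediate left recursion can be sketched as below. This is a minimal sketch covering only productions of the form N → Nα | β for a single nonterminal; the function name `remove_left_recursion` and its `fresh` parameter are assumptions for illustration, and an empty list encodes epsilon.

```python
def remove_left_recursion(n, alternatives, fresh=None):
    """Rewrite N -> N a | b into N -> b N', N' -> a N' | epsilon."""
    fresh = fresh or n + "'"
    # The alphas: what follows N in each left-recursive alternative.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == n]
    # The betas: alternatives that do not start with N.
    betas = [alt for alt in alternatives if not alt or alt[0] != n]
    if not alphas:
        return {n: alternatives}        # no immediate left recursion
    return {
        n: [beta + [fresh] for beta in betas],
        fresh: [alpha + [fresh] for alpha in alphas] + [[]],
    }


REWRITTEN = remove_left_recursion(
    "E", [["E", "-", "term"], ["term"]], fresh="TE"
)
```

For E → E - term | term this gives E → term TE and TE → - term TE | ε. Indirect left recursion, where N reaches itself through other nonterminals, would need additional substitution steps that this sketch does not perform.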

Recap

• Given a grammar

• Compute for each non-terminal
– NULLABLE
– FIRST using NULLABLE
– FOLLOW using FIRST and NULLABLE

• Compute FIRST for each sentential form appearing on right-hand side of a production

• Check for conflicts
– If exist: attempt to remove conflicts by rewriting grammar

The bigger picture

• Compilers include different kinds of program analyses, each of which further constrains the set of legal programs
– Lexical constraints: program consists of legal tokens
– Syntax constraints: program included in a given context-free language
– Semantic constraints: program included in a given attribute grammar (type checking, legal inheritance graph, variables initialized before used)
– “Logical” constraints (Verifying Compiler grand challenge): memory safety (null dereference, array-out-of-bounds access, data races), functional correctness (program meets specification)

Next lecture: bottom-up parsing