Chapter 4 Lexical Analysis

Post on 21-Dec-2015

Page 1:

Chapter 4

Lexical Analysis

Page 2:

Lexical Analysis

• Why split it from parsing?
– Simplifies design
• Parsers with whitespace and comments are more awkward

– Efficiency
• Only use the most powerful technique that works
• And nothing more
– No parsing sledgehammers for lexical nuts

– Portability
• More modular code
• More code re-use

Page 3:

Source Code Characteristics

• Code
– Identifiers
• count, max, get_num

– Language keywords: reserved or predefined
• switch, if .. then .. else, printf, return, void

• Mathematical operators
– +, *, >>, …
– <=, =, !=, …

– Literals
• “Hello World”

• Comments
• Whitespace

Page 4:

Page 5:

Reserved words versus predefined identifiers

• Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier).

• Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t be).

• Examples of predefined identifiers in Java: anything in the java.lang package, such as String, Object, System, Integer.

Page 6:

Language of Lexical Analysis

Tokens: category

Patterns: regular expression

Lexemes: actual string matched
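The three terms can be seen in a small scanner sketch. This is a hypothetical token specification (the names NUMBER, ID, OP are illustrative, not from the text): each category is a token, each regular expression is its pattern, and the matched text is the lexeme.

```python
import re

# Hypothetical token specification: each token category (the token)
# is paired with a regular expression (the pattern); the text actually
# matched (the lexeme) is recovered during scanning.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),       # whitespace: discarded, not passed on
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Yield (token, lexeme) pairs for the input string."""
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("count + 42")))
# [('ID', 'count'), ('OP', '+'), ('NUMBER', '42')]
```

The attributes mentioned on the next slide (value, name) would be carried alongside the token as the lexeme or a symbol-table index.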

Page 7:

Tokens are not enough…

• Clearly, if we replaced every occurrence of a variable with just a token, we would lose other valuable information (value, name)

• Other data items are attributes of the tokens

• Stored in the symbol table

Page 8:

Token delimiters

• When does a token/lexeme end?

e.g. xtemp=ytemp

Page 9:

Ambiguity in identifying tokens

• A programming language definition will state how to resolve uncertain token assignment

• <> Is it 1 or 2 tokens?

• Reserved keywords (e.g. if) take precedence over identifiers (rules are same for both)

• Disambiguating rules state what to do

• ‘Principle of longest substring’: greedy match
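Both disambiguating rules can be sketched in code. This is an illustrative maximal-munch scanner (the pattern names and the helper `scan` are mine, not from the text): at each position every pattern is tried and the longest match wins; on a tie the earlier pattern wins, which is how a keyword listed before the identifier rule takes precedence.

```python
import re

# Keywords listed before ID so that an exact "if" wins the tie;
# LE/NE listed before LT so ordering never hides the longer operators.
PATTERNS = [
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
    ("LE", r"<="),
    ("NE", r"<>"),
    ("LT", r"<"),
    ("EQ", r"="),
]

def scan(text):
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Principle of longest substring: keep the longest match;
        # max() keeps the first maximal element, so listing order breaks ties.
        name, m = max(((name, re.match(pat, text[pos:]))
                       for name, pat in PATTERNS),
                      key=lambda nm: len(nm[1].group()) if nm[1] else -1)
        out.append((name, m.group()))
        pos += len(m.group())
    return out

print(scan("ifx <= y"))
# [('ID', 'ifx'), ('LE', '<='), ('ID', 'y')]
```

Note that `ifx` is one identifier (longest match beats the keyword), while a bare `if` scans as the keyword, and `<=` is one token rather than `<` followed by `=`.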

Page 10:

Regular Expressions

• To represent patterns of strings of characters
• REs
– Alphabet – set of legal symbols
– Meta-characters – characters with special meanings

• ε is the empty string

• 3 basic operations
– Choice – choice1 | choice2
• a|b matches either a or b
– Concatenation – firstthing secondthing
• (a|b)c matches the strings { ac, bc }
– Repetition (Kleene closure) – repeatme*
• a* matches { ε, a, aa, aaa, aaaa, … }

• Precedence: * is highest, | is lowest
– Thus a|bc* is a|(b(c*))

Page 11:

Regular Expressions…

• We can add in regular definitions
– digit = 0|1|2|…|9

• And then use them:
– digit digit*
• A sequence of 1 or more digits

• One or more repetitions:
– (a|b)(a|b)* = (a|b)+

• Any character in the alphabet: .
– .*b.* – strings containing at least one b

• Ranges: [a-z], [a-zA-Z], [0-9] (assume character-set ordering)

• Not: ~a or [^a]

Page 12:

Some exercises

• Describe the languages denoted by the following regular expressions
1. 0 ( 0 | 1 )* 0
2. ( ( ε | 0 )* )*
3. 0* 1 0* 1 0* 1 0*

• Write regular definitions for the following languages
1. All strings that contain the five vowels in order (but not necessarily adjacent); aabcaadggge is okay
2. All strings of letters in which the letters are in ascending lexicographic order
3. All strings of 0’s and 1’s that do not contain the substring 011

Page 13:

Limitations of REs

• REs can describe many language constructs but not all

• For example:

Alphabet = {a, b}; describe the set of strings consisting of a single a surrounded by an equal number of b’s

S = { a, bab, bbabb, bbbabbb, … }

• For example, nested tags in HTML

Page 14:

Lookahead

• <=, <>, <

• When we read a token delimiter to establish a token, we need to make sure that it is still available as part of the next token

– It is the start of the next token!

• This is lookahead

– Decide what to do based on the character we ‘haven’t read’

• Sometimes implemented by reading from a buffer and then pushing the input back into the buffer

• And then starting with recognizing the next token

Page 15:

Classic Fortran example

• DO 99 I=1,10 becomes DO99I=1,10

versus

DO99I=1.10

The first is a DO loop, the second an assignment. We need lots of lookahead to distinguish.

• When can the lexical analyzer assign a token?
– Push back into the input buffer
– or ‘backtracking’

Page 16:

Finite Automata

• A recognizer determines if an input string is a sentence in a language
• Uses a regular expression
• Turn the regular expression into a finite automaton
• Could be deterministic or non-deterministic

Page 17:

Transition diagram for identifiers

• RE: Identifier → letter (letter | digit)*

[Transition diagram: from the start, state 0 moves to state 1 on a letter; state 1 loops on letter and digit; any other character moves to state 2, the accepting state.]
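The transition diagram above can be coded directly as a small scanning function. This is a sketch (the function name and return convention are mine): state 0 is the start, a letter enters state 1, letters and digits loop in state 1, and any other character ends the identifier without being consumed.

```python
def scan_identifier(text, pos=0):
    """Return (lexeme, next_pos) for an identifier at pos, or (None, pos)."""
    state = 0
    start = pos
    while pos < len(text):
        ch = text[pos]
        if state == 0 and ch.isalpha():
            state = 1            # letter: enter the looping state 1
        elif state == 1 and (ch.isalpha() or ch.isdigit()):
            pass                 # letter | digit: stay in state 1
        else:
            break                # 'other': accept without consuming it
        pos += 1
    return (text[start:pos], pos) if state == 1 else (None, start)

print(scan_identifier("max2 = 0"))
# ('max2', 4)
```

The "other" edge is exactly the lookahead discussed earlier: the delimiter is examined but left in the input as the start of the next token.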

Page 18:

• An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε. In the case of multiple transitions from a state over the same character, when we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The ε transition doesn’t consume any input characters, so you may jump to another state for free.

• Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same expressive power.

Page 19:

From a Regular Expression to an NFA

Thompson’s Construction: (a | b)* abb

[NFA diagram, states 0–10: from the start state 0, ε-transitions build the (a | b) alternation and its * loop over states 0–7; then transitions a, b, b lead through states 7, 8, 9 to state 10, the accepting state.]

Page 20:

[NFA for (a|b)*abb: states 0–3; start state 0 loops on a and b, a moves 0→1, b moves 1→2, b moves 2→3; state 3 is accepting.]

Non-deterministic finite state automaton (NFA)

[Equivalent DFA: states 0, 01, 02, 03; a moves every state to 01; b moves 0→0, 01→02, 02→03, 03→0; state 03 is accepting.]

Equivalent deterministic finite state automaton (DFA)

Page 21:

Transition Table (NFA)

        Input Symbol
State |   a    |  b
  0   | {0, 1} | {0}
  1   |        | {2}
  2   |        | {3}

(Blank entries are the empty set.)
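The NFA table above can be simulated directly: keep the set of all states the automaton could be in, and accept if any reachable state is final. A sketch (the dict encoding is mine):

```python
# The NFA transition table for (a|b)*abb, as (state, symbol) -> set of states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ACCEPT = {3}

def nfa_accepts(s):
    states = {0}                       # all states we could currently be in
    for ch in s:
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
    return bool(states & ACCEPT)       # succeed if at least one choice succeeds

print(nfa_accepts("aabb"))   # True: ends in abb
print(nfa_accepts("abab"))   # False
```

Tracking a set of states is exactly the "more than one choice" behaviour described on the NFA slide, done breadth-first instead of by backtracking.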

Page 22:

NFA -> DFA (subset construction)

• Suppose that you assign a number to each NFA state.
• The DFA states generated by subset construction have sets of numbers, instead of just one number. For example, a DFA state may have been assigned the set {5, 6, 8}. This indicates that arriving at the state labeled {5, 6, 8} in the DFA is the same as arriving at state 5, state 6, or state 8 in the NFA when parsing the same input.

• Recall that a particular input sequence, when parsed by a DFA, leads to a unique state, while when parsed by an NFA it may lead to multiple states.

• First we need to handle transitions that lead to other states for free (without consuming any input). These are the ε transitions. We define the ε-closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more ε transitions.

Page 23:

NFA -> DFA (cont)

• The start state of the constructed DFA is labeled by the ε-closure of the NFA start state.

• For every DFA state labeled by some set {s1, …, sn} and for every character c in the language alphabet, you find all the states reachable from s1, s2, …, or sn using c arrows and you union together the ε-closures of these nodes.

• If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label.
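The three steps above can be sketched as a worklist algorithm. This is an illustrative encoding (the dict representation and `None` standing for ε are my assumptions): DFA states are frozensets of NFA state numbers.

```python
from collections import deque

def eps_closure(nfa, states):
    """All states reachable from `states` by zero or more ε moves (None = ε)."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in nfa.get((q, None), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return frozenset(closure)

def subset_construction(nfa, start, alphabet):
    """Return (dfa_transitions, dfa_start_state, all_dfa_states)."""
    dfa_start = eps_closure(nfa, {start})          # step 1: closure of NFA start
    dfa, seen, work = {}, {dfa_start}, deque([dfa_start])
    while work:
        S = work.popleft()
        for ch in alphabet:
            # step 2: union the closures of everything reachable on ch
            moved = set().union(*(nfa.get((q, ch), set()) for q in S))
            T = eps_closure(nfa, moved)
            dfa[(S, ch)] = T
            if T not in seen:                      # step 3: new label -> new node
                seen.add(T)
                work.append(T)
    return dfa, dfa_start, seen

# The (a|b)*abb NFA from the previous slides (it happens to have no ε moves):
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dfa, s0, states = subset_construction(nfa, 0, "ab")
print(sorted(map(sorted, states)))
# [[0], [0, 1], [0, 2], [0, 3]]
```

The four DFA states produced are exactly the labels 0, 01, 02, 03 used in the DFA transition table on a later slide.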

Page 24:

Transition Table (DFA)

        Input Symbol
State |  a  |  b
  0   | 01  |  0
  01  | 01  |  02
  02  | 01  |  03
  03  | 01  |  0

(State 03 is the accepting state.)
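Driving a recognizer from the DFA table above is a straight loop, since every (state, symbol) pair has exactly one successor. A sketch (state names as strings are my choice):

```python
# The DFA transition table above; "03" is the accepting state.
DFA = {
    ("0",  "a"): "01", ("0",  "b"): "0",
    ("01", "a"): "01", ("01", "b"): "02",
    ("02", "a"): "01", ("02", "b"): "03",
    ("03", "a"): "01", ("03", "b"): "0",
}

def dfa_accepts(s):
    state = "0"
    for ch in s:
        state = DFA[(state, ch)]   # deterministic: exactly one next state
    return state == "03"

print(dfa_accepts("babb"))   # True
```

Unlike the NFA simulation, there is no set of states to maintain: one table lookup per input character.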

Page 25:

Writing a lexical analyzer

• The DFA helps us to write the scanner.
• Figure 4.1 in your text gives a good example of what a scanner might look like.

Page 26:

LEX (FLEX)

• Tool for generating programs which recognize lexical patterns in text

• Takes regular expressions and turns them into a program

Page 27:

Lexical Errors

• Only a small percentage of errors can be recognized during Lexical Analysis

Consider if (good == “bad)

Page 28:

– Line ends inside literal string
– Illegal character in input file
– Missing semi-colon
– Missing operator
– Missing paren
– Unquoted string
– Unopened file handle

Examples from the Perl language

Page 29:

In general

• What does a lexical error mean?

• Strategies for dealing with:

– “Panic-mode”

• Delete chars from input until something matches

– Inserting characters

– Re-ordering characters

– Replacing characters

• An error like “illegal character” should be reported sensibly

Page 30:

Syntax Analysis

• also known as Parsing

• Grouping together tokens into larger structures

• Analogous to lexical analysis

• Input:

– Tokens (output of Lexical Analyzer)

• Output:

– Structured representation of original program

Page 31:

Parsing Fundamentals

• Source program:
– 3 + 4

• After Lexical Analysis: ???

Page 32:

A Context Free Grammar

• A grammar is a four-tuple (Σ, N, P, S) where

• Σ is the terminal alphabet
• N is the non-terminal alphabet
• P is the set of productions
• S is a designated start symbol in N

Page 33:

Parsing

• Expression: number plus number
– Similar to regular definitions:
• Concatenation
• Choice

Expression → number Operator number

Operator → + | - | * | /

• Repetition is done differently

Page 34:

BNF Grammar

Operator → + | - | * | /

Meta-symbols: → and |

Expression → number Operator number

The structure on the left is defined to consist of the choices on the right-hand side.

Different conventions for writing BNF grammars:

<expression> ::= number <operator> number

Page 35:

Derivations

• Derivation:
– Sequence of replacements of structure names by choices on the RHS of grammar rules
– Begin: start symbol
– End: string of token symbols
– Each step: one replacement is made

Exp → Exp Op Exp | number

Op → + | - | * | /

Page 36:

Example Derivation

Note the different arrows:

⇒ Derivation: applies grammar rules

→ Used to define grammar rules

Non-terminals: Exp, Op
Terminals: number, *

Terminals: because they terminate the derivation

Page 37:

Derivations (2)

• E → ( E ) | a

• What sentences does this grammar generate?

An example derivation:
• E ⇒ ( E ) ⇒ ( ( E ) ) ⇒ ( ( a ) )

• Note that this is what we couldn’t achieve with regular definitions

Page 38:

Recursive Grammars

• E → ( E ) | a
– is recursive

E → ( E ) is the general case

E → a is the terminating case

• We have no * operator in context-free grammars
– Repetition = recursion

• E → E α | β
– derives β, βα, βαα, βααα, …
– All strings beginning with β followed by zero or more repetitions of α, i.e. βα*

Page 39:

Recursive Grammars (2)

• a+ (regular expression)
– E → E a | a (1)
– or
– E → a E | a (2)

• 2 different grammars can derive the same language
– (1) is left recursive
– (2) is right recursive

• a*
– Implies we need the empty production
– E → E a | ε

Page 40:

Recursive Grammars (3)

• Require recursive data structures → trees

• Parse Trees

[Parse tree: root exp with children exp, op, exp; the leaves are number, *, number; the nodes are numbered 1–4 in the order they are constructed.]

Exp → Exp Op Exp | number

Op → + | - | * | /

Page 41:

Parse Trees & Derivations

• Leaves = terminals
• Interior nodes = non-terminals
• If we replace the non-terminals right to left:
– The parse tree sequence is right to left
– A rightmost derivation → reverse post-order traversal

• If we derive left to right:
– A leftmost derivation
– Pre-order traversal
– Parse trees encode information about the derivation process

Page 42:

• Formal Methods of Describing Syntax
• 1950s: Noam Chomsky (noted linguist) described generative devices which describe four classes of languages (in order of decreasing power)

• recursively enumerable: x → y, where x and y can be any string of non-terminals and terminals.

• context-sensitive: x → y, where x and y can be strings of terminals and non-terminals but y must be the same length or longer than x.
– Can recognize aⁿbⁿcⁿ

• context-free (yacc) – non-terminals appear singly on the left side of productions. Any non-terminal can be replaced by its right-hand side regardless of the context it appears in.
– Ex: If you were in the boxing ring and said “Hit me” it would imply a different action than if you were playing cards.
– Ex: If an IDENTSY between brackets is treated differently in terms of what it matches than an IDENTSY between parens, this is context sensitive.
– Can recognize aⁿbⁿ, palindromes

• regular (lex)
– Can recognize aⁿbᵐ

Chomsky was interested in the theoretic nature of natural languages.

Page 43:

Abstract Syntax Trees

[Parse tree for the token sequence 3 + 4: root exp with children exp, op, exp; the two exp nodes derive number (lexemes 3 and 4) and op derives +.]

Parse Tree

[Abstract syntax tree: + at the root, with children 3 and 4.]

Abstract Syntax Tree

This is all the information we actually need

Parse trees contain surplus information

Page 44:

An exercise

• Consider the grammar
S → ( L ) | a
L → L , S | S

(a) What are the terminals, non-terminals and start symbol?
(b) Find leftmost and rightmost derivations and parse trees for the following sentences:
i. (a, a)
ii. (a, (a, a))
iii. (a, ((a, a), (a, a)))

Page 45:

Parsing token sequence: id + id * id

E → E + E | E * E | ( E ) | - E | id

Page 46:

Ambiguity

• If a sentence has two distinct parse trees, the grammar is ambiguous

• Or alternatively: a grammar is ambiguous if there are two different rightmost derivations for the same string.

• In English, the phrase “small dogs and cats” is ambiguous as we aren’t sure if the cats are small or not.

• “I see flying planes” is also ambiguous

• A language is said to be ambiguous if no unambiguous grammar exists for it.

• “Dance is at the old main gym.” How is it parsed?

Page 47:

Ambiguous Grammars

• Problem – no clear structure is expressed
• A grammar that generates a string with 2 distinct parse trees is called an ambiguous grammar

– 2+3*4 = 2 + (3*4) = 14
– 2+3*4 = (2+3) * 4 = 20

• Our experience of math says interpretation 1 is correct, but the grammar does not express this:

– E → E + E | E * E | ( E ) | - E | id

Page 48:

Example of Ambiguity

• Grammar:

expr → expr + expr | expr * expr | ( expr ) | NUMBER

• Expression: 2 + 3 * 4

• Parse trees:

[Two parse trees over the leaves NUMBER (2), NUMBER (3), NUMBER (4): one has + at the root and groups 2 + (3 * 4); the other has * at the root and groups (2 + 3) * 4.]

Page 49:

Removing Ambiguity

Two methods:

1. Disambiguating Rules
– Positives: leaves grammar unchanged
– Negatives: grammar is not the sole source of syntactic knowledge

2. Rewrite the Grammar
– Using knowledge of the meaning that we want to use later in the translation into object code to guide grammar alteration

Page 50:

Precedence

E → E addop Term | Term
addop → + | -
Term → Term * Factor | Term / Factor | Factor
Factor → ( exp ) | number | id

• Operators of equal precedence are grouped together at the same ‘level’ of the grammar: the ‘precedence cascade’

• The lowest-level operators have highest precedence

• (The first shall be last and the last shall be first.)

Page 51:

Associativity

• 45-10-5 ?
30 or 40?
Subtraction is left associative, left to right (= 30)

• E → E addop E | Term
Does not tell us how to split up 45-10-5

• E → E addop Term | Term
Forces left associativity via left recursion

• Precedence & associativity remove the ambiguity of arithmetic expressions

– Which is what our math teachers took years telling us!

Page 52:

Ambiguous grammars

Statement -> If-statement | other
If-statement -> if (Exp) Statement | if (Exp) Statement else Statement
Exp -> 0 | 1

Parse: if (0) if (1) other1 else other2

Page 53:

Removing ambiguity

Statement -> Matched-stmt | Unmatched-stmt
Matched-stmt -> if (Exp) Matched-stmt else Matched-stmt | other
Unmatched-stmt -> if (Exp) Statement | if (Exp) Matched-stmt else Unmatched-stmt

Page 54:

Extended BNF Notation

• Notation for repetition and optional features.

• {…} expresses repetition:
expr → expr + term | term becomes expr → term { + term }

• […] expresses optional features:
if-stmt → if( expr ) stmt | if( expr ) stmt else stmt
becomes
if-stmt → if( expr ) stmt [ else stmt ]
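The EBNF repetition expr → term { + term } translates directly into a loop in a hand-written parser. A sketch over a pre-tokenized list, where each term is simplified to a bare number (the function name and token encoding are mine):

```python
def parse_expr(tokens):
    """Parse term { + term } and return the summed value."""
    pos = 0
    value = tokens[pos]          # the mandatory first term
    pos += 1
    while pos < len(tokens) and tokens[pos] == "+":
        pos += 1                 # consume '+'
        value += tokens[pos]     # ...and the following term
        pos += 1
    return value

print(parse_expr([3, "+", 4, "+", 5]))
# 12
```

The loop also makes the construct naturally left associative: each new term is folded into the running value from the left.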

Page 55:

Notes on use of EBNF

• Use {…} only for left-recursive rules:
expr → term + expr | term
should become expr → term [ + expr ]

• Do not start a rule with {…}: write
expr → term { + term }, not
expr → { term + } term

• Exception to the previous rule: simple token repetition, e.g. expr → { - } term …

• Square brackets can be used anywhere, however:
expr → expr + term | term | unaryop term
should be written as
expr → [ unaryop ] term { + term }

Page 56:

Syntax Diagrams

• An alternative to EBNF.
• Rarely seen any more: EBNF is much more compact.
• Example (if-statement, p. 101):

[Syntax diagram for if-statement: if, then ( expression ), then statement, with an optional branch for else statement.]

Page 57:

How is Parsing done?

1. Recursive descent (top down).
2. Bottom up – tries to match input with the right-hand side of a rule. Sometimes called shift-reduce parsers.

Page 58:

Predictive Parsing

• Top-down parsing
• LL(1) parsing
• Table-driven predictive parsing (no recursion) versus recursive descent parsing, where each non-terminal is associated with a procedure call
• No backtracking

E -> E + T | T
T -> T * F | F
F -> (E) | id

Page 59:

Two grammar problems

• Eliminating left recursion (without changing associativity)

A -> Aα | β

becomes

A -> βA′
A′ -> αA′ | ε

Example:
E -> E + T | T
T -> T * F | F
F -> (E) | id

The general case:

A -> Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
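The general case can be sketched as a mechanical rewrite. This is an illustrative helper (name and grammar encoding are mine; productions are lists of symbols, and the empty list stands for ε):

```python
def eliminate_left_recursion(nonterm, productions):
    """A -> A a1 | ... | A am | b1 | ... | bn  becomes
       A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon."""
    recursive = [p[1:] for p in productions if p and p[0] == nonterm]  # the a_i
    other     = [p for p in productions if not p or p[0] != nonterm]   # the b_j
    if not recursive:
        return {nonterm: productions}       # nothing to do
    new = nonterm + "'"
    return {
        nonterm: [p + [new] for p in other],
        new:     [p + [new] for p in recursive] + [[]],   # [] is epsilon
    }

print(eliminate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```

Applied to E -> E + T | T this yields exactly the E -> TE′, E′ -> +TE′ | ε pair used in the predictive-parsing slides that follow.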

Page 60:

Two grammar problems

• Eliminating left recursion involving derivations of two or more steps

S -> Aa | b
A -> Ac | Sd | ε

Substituting for S gives:

A -> Ac | Aad | bd | ε

Page 61:

• Removing Left Recursion

Before:
• A --> A x
• A --> y

After:
• A --> y B
• B --> x B
• B --> ε

Page 62:

Two grammar problems…

• Left factoring

Stmt -> if Exp then Stmt else Stmt | if Exp then Stmt

A -> αβ1 | αβ2

becomes

A -> αA′

A′ -> β1 | β2

Page 63:

Exercises

Eliminate left recursion from the following grammars.

a) S -> (L) | a
L -> L,S | S

b) Bexpr -> Bexpr or Bterm | Bterm
Bterm -> Bterm and Bfactor | Bfactor
Bfactor -> not Bfactor | (Bexpr) | true | false

Page 64:

COMP313A Programming Languages

Syntax Analysis (3)

Page 65:

• Table driven predictive parsing

• Getting the grammar right

• Constructing the table

Page 66:

Table Driven Predictive Parsing

[Diagram: a predictive parsing program reads the input (e.g. a + b $) from an input buffer, consults the parsing table, maintains a stack (X Y Z … $), and produces output. Example input: id + id * id]

Page 67:

Table Driven Predictive Parsing

                          Input Symbol
Non-terminal | id      | +         | *         | (       | )      | $
E            | E->TE′  |           |           | E->TE′  |        |
E′           |         | E′->+TE′  |           |         | E′->ε  | E′->ε
T            | T->FT′  |           |           | T->FT′  |        |
T′           |         | T′->ε     | T′->*FT′  |         | T′->ε  | T′->ε
F            | F->id   |           |           | F->(E)  |        |
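The table above drives a simple stack machine. A sketch of the parsing program (dict encoding and function name are mine; `""` plays no role here, empty production bodies stand for ε):

```python
# The predictive parsing table M[A, a] for the expression grammar.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    stack = ["$", "E"]             # start symbol on top of the end marker
    tokens = tokens + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            body = TABLE.get((top, tokens[pos]))
            if body is None:
                return False       # undefined entry: error
            stack.extend(reversed(body))   # push RHS, leftmost symbol on top
        elif top == tokens[pos]:
            pos += 1               # match terminal (including the final $)
        else:
            return False
    return pos == len(tokens)

print(parse(["id", "+", "id", "*", "id"]))
# True
```

No backtracking occurs: one token of lookahead selects exactly one table entry at every step.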

Page 68:

Table Driven Predictive Parsing

Parse id + id * id

Leftmost derivation and parse tree using the grammar:

E -> TE′
E′ -> +TE′ | ε
T -> FT′
T′ -> *FT′ | ε
F -> (E) | id

Page 69:

First and Follow Sets

• First and Follow sets tell when it is appropriate to put the right-hand side of some production on the stack (i.e., for which input symbols).

E -> TE′
E′ -> +TE′ | ε
T -> FT′
T′ -> *FT′ | ε
F -> (E) | id

id + id * id

Page 70:

First Sets

1. If X is a terminal, then FIRST(X) is {X}

2. If X -> ε is a production, then add ε to FIRST(X)

3. If X is a non-terminal and X -> Y1 Y2 … Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi−1). If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
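The three rules can be sketched as a fixed-point computation over the expression grammar. This is an illustrative encoding (names are mine; `""` stands for ε):

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                       # iterate the rules to a fixed point
        changed = False
        for a, prods in grammar.items():
            for body in prods:
                nullable = True          # rule 3: scan Y1 Y2 ... while nullable
                for sym in body:
                    f = first.get(sym, {sym})   # rule 1: FIRST(terminal) = itself
                    add = (f - {""}) - first[a]
                    if add:
                        first[a] |= add
                        changed = True
                    if "" not in f:
                        nullable = False
                        break
                if nullable and "" not in first[a]:
                    first[a].add("")     # rule 2 / rule 3: the body derives ε
                    changed = True
    return first

print(first_sets(GRAMMAR)["E"])
```

For this grammar the computation gives FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E′) = {+, ε}, FIRST(T′) = {*, ε}.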

Page 71:

FIRST sets

E -> TE′
E′ -> +TE′ | ε
T -> FT′
T′ -> *FT′ | ε
F -> (E) | id

Page 72:

Follow Sets

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker

2. If there is a production A -> αBβ, then everything in FIRST(β) except for ε is placed in FOLLOW(B)

3. If there is a production A -> αB, or a production A -> αBβ where FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B)

Page 73:

Follow Sets

E -> TE′
E′ -> +TE′ | ε
T -> FT′
T′ -> *FT′ | ε
F -> (E) | id

Page 74:

FIRST and FOLLOW sets

Construct FIRST and FOLLOW sets for the following grammar after left recursion has been eliminated:

a) S -> (L) | a
L -> L,S | S

Page 75:

Construction of the Predictive Parsing Table

• Algorithm from Aho et al.

1. For each production A -> α of the grammar, do steps 2 and 3

2. For each terminal a in FIRST(α), add A -> α to M[A, a]

3. If ε is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A, $].

4. Make each undefined entry of M be error
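The whole pipeline, FIRST, FOLLOW, then the table-building steps above, can be sketched end to end for the expression grammar. This is an illustrative encoding (names are mine; `""` stands for ε, `"$"` for the end marker):

```python
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
START = "E"

def first_of(seq, first):
    """FIRST of a symbol string: union along the nullable prefix."""
    out = set()
    for sym in seq:
        f = first.get(sym, {sym})        # a terminal's FIRST is itself
        out |= f - {""}
        if "" not in f:
            return out
    return out | {""}                    # every symbol (or none) was nullable

def compute_first(grammar):
    first = {a: set() for a in grammar}
    while True:
        snapshot = {a: set(s) for a, s in first.items()}
        for a, prods in grammar.items():
            for body in prods:
                first[a] |= first_of(body, first)
        if first == snapshot:
            return first

def compute_follow(grammar, first):
    follow = {a: set() for a in grammar}
    follow[START].add("$")               # rule 1
    while True:
        snapshot = {a: set(s) for a, s in follow.items()}
        for a, prods in grammar.items():
            for body in prods:
                for i, b in enumerate(body):
                    if b not in grammar:
                        continue
                    rest = first_of(body[i + 1:], first)
                    follow[b] |= rest - {""}      # rule 2
                    if "" in rest:                # rule 3 (β nullable or absent)
                        follow[b] |= follow[a]
        if follow == snapshot:
            return follow

def build_table(grammar, first, follow):
    table = {}
    for a, prods in grammar.items():
        for body in prods:
            fb = first_of(body, first)
            for t in fb - {""}:
                table[(a, t)] = body              # step 2
            if "" in fb:
                for t in follow[a]:               # step 3 ($ is in FOLLOW sets)
                    table[(a, t)] = body
    return table

first = compute_first(GRAMMAR)
follow = compute_follow(GRAMMAR, first)
table = build_table(GRAMMAR, first, follow)
print(table[("E'", "+")])
# ['+', 'T', "E'"]
```

Undefined entries are simply absent from the dict, which plays the role of step 4's error entries.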

Page 76:

Predictive Parsing Table

Construct the parsing table for the grammar:

a) S -> (L) | a
L -> L,S | S

Page 77:

COMP313A Programming Languages

Syntax Analysis (4)

Page 78:

• A problem for predictive parsers

• Predictive parsing LL(1) grammars

• Error recovery in predictive parsing

• Recursive Descent parsing

Lecture Outline

Page 79:

Producing code from parse on the fly

E -> TE′
E′ -> +TE′ | ε
T -> FT′
T′ -> *FT′ | ε
F -> (E) | id

Page 80:

Table Driven Predictive Parsing

                          Input Symbol
Non-terminal | id      | +         | *         | (       | )      | $
E            | E->TE′  |           |           | E->TE′  |        |
E′           |         | E′->+TE′  |           |         | E′->ε  | E′->ε
T            | T->FT′  |           |           | T->FT′  |        |
T′           |         | T′->ε     | T′->*FT′  |         | T′->ε  | T′->ε
F            | F->id   |           |           | F->(E)  |        |

Page 81:

LL(1) grammars

S -> if E then S S′ | a
S′ -> else S | ε
E -> b

FIRST(S) = {if, a}
FIRST(S′) = {else, ε}
FIRST(E) = {b}

FOLLOW(S) = {$, else}
FOLLOW(S′) = {$, else}
FOLLOW(E) = {then}

Construct the parsing table

Page 82:

LL(1) grammars

• An LL(1) grammar has no multiply defined entries in its parsing table

• Left-recursive and ambiguous grammars are not LL(1)

• A grammar G is LL(1) iff whenever A -> α | β are two distinct productions of G:
1. For no terminal a do both α and β derive strings beginning with a
2. At most one of α and β can derive the empty string
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A)

Page 83:

Error Recovery in Predictive Parsing

• Panic-mode recovery based on a set of synchronizing tokens

• Heuristics for synchronizing sets:

1. For non-terminal A, all symbols in FOLLOW(A) and FIRST(A)

2. Symbols that begin higher constructs

3. If A derives ε, then A -> ε can be used as the default

4. Pop a non-matching terminal from the top of the stack

Page 84:

Recursive Descent Parsers

• A function for each non-terminal

Example expression grammar:

Expr -> Term Expr′
Expr′ -> + Term Expr′ | ε
Term -> Factor Term′
Term′ -> * Factor Term′ | ε
Factor -> ( Expr ) | id

Function Expr:
If the next input symbol is a ( or id then
call function Term followed by function Expr′
Else Error
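The scheme above, one function per non-terminal, each selecting an alternative by looking at the next token, can be sketched in full for the expression grammar (class and function names are mine):

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.pos = 0

    def look(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t}, got {self.look()}")
        self.pos += 1

    def expr(self):              # Expr -> Term Expr'
        if self.look() in ("(", "id"):   # FIRST(Expr)
            self.term()
            self.expr_rest()
        else:
            raise SyntaxError("expected ( or id")

    def expr_rest(self):         # Expr' -> + Term Expr' | epsilon
        if self.look() == "+":
            self.match("+")
            self.term()
            self.expr_rest()     # otherwise epsilon: consume nothing

    def term(self):              # Term -> Factor Term'
        self.factor()
        self.term_rest()

    def term_rest(self):         # Term' -> * Factor Term' | epsilon
        if self.look() == "*":
            self.match("*")
            self.factor()
            self.term_rest()

    def factor(self):            # Factor -> ( Expr ) | id
        if self.look() == "(":
            self.match("(")
            self.expr()
            self.match(")")
        else:
            self.match("id")

def accepts(tokens):
    p = Parser(tokens)
    try:
        p.expr()
        return p.look() == "$"
    except SyntaxError:
        return False

print(accepts(["id", "+", "id", "*", "id"]))
# True
```

The ε alternatives of Expr′ and Term′ appear as the fall-through cases: when the lookahead is not + or *, the function simply returns, which is why no backtracking is ever needed.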