chapter 3 context-free grammars

Chapter 3

Context-Free Grammars

Dr. Frank Lee

3.1 CFG Definition

• The next phase of compilation after lexical analysis is syntax analysis.

• This phase is called parsing. It is to determine if the program is syntactically correct or not.

• The tool we use to describe the syntax of a programming language is context-free grammars (CFG).

3.1 CFG Definition

• A CFG includes 4 components:– A set of terminals T, which are the tokens of

the language– A set of non-terminals N– A set of rewriting rules R.

• The left-hand side of each rewriting rule is a single non-terminal.

• The right-hand side of each rewriting rule is a string of terminals and/or non-terminals

– A special non-terminal S Є N, which is the start symbol

3.1 CFG Definition• Just as regular expression generate strings of

characters, CFG generate strings of tokens• A string of tokens is generated by a CFG in the

following way:– The initial input string is the start symbol S– While there are non-terminals left in the string:

1. Pick any non-terminal in the input string A

2. Replace a single occurrence of A in the string with the right-hand side of any rule that has A as the left-hand side

3. Repeat 1 and 2 until all elements in the string are terminals

– See Fig. 3.1 (next slide or p38)

Fig. 3.1 A CFG for Some Simple Statements

Terminals = { id, num, if, then, else, print, =, {, }, ;, (, ) }

Non-Terminals = { S, E, B, L }

Rules = (1) S print(E);

(2) S while (B) do S

(3) S { L }

(4) E id

(5) E num

(6) B E > E

(7) L S

(8) L SL

Start Symbol = S

3.1 CFG Example 11. E E + E2. E E – E3. E E * E4. E E / E5. E num

• Example 1: 1. E E + E2. E * E + E3. num * E + E4. num * num + E5. num * num + num

3.1 CFG Example 21. S NP V NP2. NP the N3. N boy4. N ball5. N window6. V threw7. V broke

• Example 21. S NP V NP2. the N V NP3. the boy V NP4. the boy broke NP5. the boy broke the N6. the boy broke the window

3.2 Derivations

• A derivation is a description of how a string is generated from a grammar

• A leftmost derivation always picks the leftmost non-terminal to replace (see slides 9 and 10)

• A rightmost derivation always picks the rightmost non-terminal to replace (see slides 12 and 13)

• Some derivations are neither leftmost nor rightmost (see slide 15)

• For example: Use the CFG in Fig. 3.1 (p38) to generate print (id);S print (E); print (id);

3.2 Derivations3.2.1 Leftmost Derivations

• A string of terminals and non-terminals α that can be derived from the initial symbol of the grammar is called a sentential form

• Thus the strings “{ S L }” and “while(id>E) do S”, are all sentential forms, but print(E>id)” isn’t.

• A derivation is “leftmost” if, at each step in the derivation, the leftmost non-terminal is selected to replace

• All of the above examples are leftmost derivations• A sentential form that occurs in a leftmost

derivation is called a left-sentential form

3.2.1 Leftmost Derivations

• Example 1: We can use leftmost derivations to generate while(id > num) do print(id); from this CFG as follows:

S while(B) do S while(E>E) do S while(id>E) do S while(id>num) do S while(id>num) do print(E); while(id>num) do print(id);

3.2.1 Leftmost Derivations

• Example 2: We also can generate { print(id); print(num); } from the CFG as follows:

S { L } { S L } { print(E); L } { print(id); L } { print(id); S } { print(id); print(E); } { print(id); print(num); }

3.2.2 Rightmost Derivations

• In addition to leftmost derivations, there are also rightmost derivations, where we always choose the rightmost non-terminal to replace

Example 1: To generate while(num > num) do print(id);

S while(B) do S while(B) do print(E); while(B) do print(id); while(E>E) do print(id); while(E>num) do print(id); while(num>num) do print(id);

3.2.2 Rightmost Derivations

Example 2: Try to derivate { print(num); print(id); } from S

S { L } { S L } { S S } { S print(E); } { S print(id); } { print(E); print(id); } { print(num); print(id); }

3.2.3 Non-Lefmost, Non-Rightmost Derivations

• Some derivations are neither leftmost or rightmost, such as:S while(B) do S while(E>E) do S while(E>E) do print(E); while(E>id) do print(E); while(num>int) do print(E); while(num>id) do print(num);

3.2.3 Non-Lefmost, Non-Rightmost Derivations

• Some strings that are not derivable from this CFG, such as:1. print(id)

2. { print(id); print(id) }

3. while (id) do print(id);

4. print(id > id);– 1 & 2: no ; to terminate statements.– 3: the id in while (id) is not derivable from B.– 4: id > id is not derivable from E.

3.3 CFG Shorthand

• We can combine two rules of the form

S α and S β

to get the single rule

S α│β

• See Fig. 3.2 for an example (next slide or p40)

Fig. 3.2 Shorthand for the CFG in Fig. 3.1

Terminals = { id, num, if, then, else, print, =, {, }, ;, (, ) }

Non-Terminals = { S, E, B, L }

Rules = S print(E); | while (B) do S | { L }

E id | num

B E > E

L S | SL

Start Symbol = S

3.4 Parse Trees ($)

• A parse tree is a graphical representation of a derivation• We start with the initial symbol S of the grammar as the

root of the tree• The children of the root are the symbols that were used

to rewrite the initial symbol in the derivation• The internal nodes of the parse tree are non-terminals• The children of each internal node N are the symbols on

the right-hand side of a rule that has N as the left-hand side (e.g. B E > E where E > E is the right-hand side and B is the left-hand side of the rule)

• Terminals are leaves of the tree• See three parse tree examples on p40-41.

3.5 Ambiguous Grammars

• A grammar is ambiguous if there is at least one string derivable from the grammar that has more then one different parse tree

• Fig. 3.4 and Fig. 3.6 are ambiguous grammars because there are several strings derivable from these grammars that have multiple parse trees, such as the two trees in Fig. 3.5 (p42) and Fig. 3.7, respectively

• Ambiguous grammars are bad, because the parse trees don’t tell us the exact meaning of the string. For example, in Fig. 3.5.a, the string means id*(id+id), but in Fig. 3.5.b, the string means (id*id)+id. This is why we call it “ambiguous”.

• We need to change the grammar to fix this problem• We may rewrite the grammar as in Fig. 3.8 and Fig. 3.9.

They are unambiguous CFGs for expressions

3.5 Ambiguous Grammars

• We need to make sure that all additions appear higher in the tree than multiplications (Why?) How can we do this?

• Once we replace an E with E*E using single rule 4, we don’t want to rewrite any of the Es we’ve just created using rule 2, since that would place an addition (+) lower in the tree than a multiplication (*)

• Let’s create a new non-terminal T for multiplication and division• T will generate strings of id’s multiplied or divided together, with

no additions or subtractions• Then we can modify E to generate strings of T’s added together• This modified grammar is in Fig. 3.6 (p42)• However, this grammar is still ambiguous. It is impossible to

generate a parse tree from Fig. 3.6 that has * higher than + in the tree

3.5 Ambiguous Grammars• Consider the string id+id+id, which has two parse trees, as listed

in Fig. 3.7 (p43)id+id+id = (id+id)+id or = id+(id+id) are all ok id-id-id = (id-id)-id != id-(id-id) but this is wrong

• We would like addition and subtraction to have leftmost association as above

• In other words, we need to make sure that the right subtree of an addition or subtraction is not another addition or subtraction

• We modified the CFG in Fig. 3.6 as in Fig. 3.8 (p43, unambiguous CFG)

• In Fig. 3.9 (p44), we add parentheses to the grammar in Fig. 3.8 to express expressions like id*(id+id).

• Fig. 3.9 is unambiguous too. See the three example parse trees for this grammar in Fig. 3.10-3.12

3.6 Extended Backus Naur Form

• Another term for a CFG is a Backus Naur Form (BNF).• There is an extension to BNF notation, called

Extended Backus Naur Form, or EBNF• EBNF rules allow us to mix and match CFG notation

and regular expression notation in the right-hand side of CFG rules

• For example, consider the following CFG, which describes simpleJava statement blocks and stylized simpleJava print statements:

1. S { B }2. S print(id)3. B S ; C4. C S ; C5. C ε

3.6 Extended Backus Naur Form

• Rules 3, 4, and 5 in the above grammar describe a series of one or more statements S, terminated by semicolons

• We could express the same language using an EBNF as follows:

1. S { B }

2. S print”(“id”)”

3. B (S;)+

Note, in Rule 2, when we want a parenthesis to appear in EBNF, we need to surround it with quotation marks. But in Rule 3, the pair of parenthesis is for the + symbol, not belongs to the language.

3.6 Extended Bakus Naur Form

• Another Example: Consider the following CFG fragment, which describes Pascal for statements:1. S for id := E to E do S

2. S for id := E downto E do S

• The CFG rules for both types of Pascal for statement (to and downto) can be represented by a single EBNF rule, as below:

• S for id := E (to│downto) E do S• The “, +, and | notations in the above examples

are from regular expressions.

Chapter 3 Homework

• Due; 2/18/2013

• Pages: 46-47

• Do the following exercises:

1 (a,c,e), 2(a), 3(a,c)

chapter 3 context-free grammars

Documents

e e e e

e e num

e e ee e ee e

printid s

num e num

s printe

s whileidnum

s whileide