lecture five: context free grammar (cfg)

Lecture Five: Context-Free Grammar (CFG)

Amjad Ali

CFG, Lecture 5, slide

Definition of Context-Free Grammar

There are four important components in a grammatical description of a language:

1. There is a finite set of symbols that form the strings of the language being defined. This set is {0,1} in the palindrome example below. We call this alphabet the terminals, or terminal symbols.

2. There is a finite set of variables, sometimes also called nonterminals or syntactic categories. Each variable represents a language; i.e., a set of strings. In the palindrome example there is only one variable, P, which is used to represent the class of palindromes over the alphabet {0,1}.


3. One of the variables represents the language being defined; it is called the start symbol. Other variables represent auxiliary classes of strings that are used to help define the language of the start symbol. In our example, P, the only variable, is the start symbol.

4. There is a finite set of productions or rules that represent the recursive definition of a language. Each production consists of:

a) A variable that is being (partially) defined by the production. This variable is often called the head of the production.

b) The production symbol →.

c) A string of zero or more terminals and variables. This string, called the body of the production, represents one way to form strings in the language of the variable of the head. In so doing, we leave terminals unchanged and substitute for each variable of the body any string that is known to be in the language of that variable.


Alternate Definition of Context-Free Grammar

A context-free grammar (CFG) is a collection of three things:

1. An alphabet Σ of letters called terminals from which we are going to make strings that will be the words of a language.

2. A set of symbols called nonterminals, one of which is the symbol S, standing for “start here”.

3. A finite set of productions of the form

one nonterminal → finite string of terminals and/or nonterminals


Formal Definition of CFG

A context-free grammar is a 4-tuple (V, Σ, R, S), where

1. V is a finite set called the variables,

2. Σ is a finite set, disjoint from V, called the terminals,

3. R is a finite set of rules, with each rule being a variable and a string of variables and terminals, and

4. S ∈ V is the start variable.
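The 4-tuple can be written down directly as a data structure. The following is a minimal Python sketch, not part of the lecture: the class name CFG, its field names, and the choice to store each rule as a (head, body) pair are assumptions made for illustration. The checks mirror the conditions of the definition (V and Σ disjoint, the start variable in V, every rule built from known symbols).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """Hypothetical container for the 4-tuple (V, Σ, R, S)."""
    variables: frozenset   # V
    terminals: frozenset   # Σ
    rules: tuple           # R: pairs (head, body); body is a tuple of symbols
    start: str             # S

    def __post_init__(self):
        # Σ must be disjoint from V.
        if self.variables & self.terminals:
            raise ValueError("V and Σ must be disjoint")
        # The start variable must be one of the variables.
        if self.start not in self.variables:
            raise ValueError("the start variable must belong to V")
        symbols = self.variables | self.terminals
        for head, body in self.rules:
            # Each rule is a variable together with a string over V ∪ Σ.
            if head not in self.variables:
                raise ValueError(f"rule head {head!r} is not a variable")
            if any(sym not in symbols for sym in body):
                raise ValueError(f"rule body {body!r} uses an unknown symbol")


# The palindrome grammar over {0,1}; the empty tuple stands for the null string.
palindromes = CFG(
    variables=frozenset({"P"}),
    terminals=frozenset({"0", "1"}),
    rules=(("P", ()), ("P", ("0",)), ("P", ("1",)),
           ("P", ("0", "P", "0")), ("P", ("1", "P", "1"))),
    start="P",
)
```

Constructing a grammar whose start symbol is missing from V, or whose terminals overlap the variables, raises ValueError in this sketch.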


Palindrome Example

The rules that define the palindromes, expressed in the context-free grammar notation, are (^ denotes the null string):

1. P → ^

2. P → 0

3. P → 1

4. P → 0P0

5. P → 1P1
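Read recursively, the five rules give a membership test: rules 1 to 3 are the base cases and rules 4 and 5 strip a matching symbol from both ends. A small Python sketch of that reading follows; the function name in_P is an assumption for illustration only.

```python
def in_P(w: str) -> bool:
    """Return True if w can be derived from P using rules 1-5."""
    # Rules 1-3: P -> ^, P -> 0, P -> 1
    if w in ("", "0", "1"):
        return True
    # Rules 4-5: P -> 0P0, P -> 1P1 (same symbol at both ends, P in between)
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return in_P(w[1:-1])
    return False
```

For instance, in_P("0110") follows rule 4, then rule 5, then rule 1, while in_P("011") fails at the first step because the end symbols differ.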


Notation for CFG Derivations

Some conventions used while discussing CFGs:

1. Lower-case letters near the beginning of the alphabet, a, b, and so on, are terminal symbols. Digits and other characters such as + or parentheses can also be used as terminals.

2. Upper-case letters near the beginning of the alphabet, A, B, and so on, are variables.

3. Lower-case letters near the end of the alphabet, such as w or z, are strings of terminals. This convention reminds us that the terminals are analogous to the input symbols of an automaton.

4. Upper-case letters near the end of the alphabet, such as X or Y, are either terminals or variables.


5. Lower-case Greek letters, such as α and β, are strings consisting of terminals and/or variables.

There is no special notation for strings that consist of variables only, since this concept plays no important role. However, a string named α or another Greek letter might happen to have only variables.


Example: A more complex CFG that represents (a simplification of) expressions in a typical programming language. The operators are limited to + and *, representing addition and multiplication respectively. The arguments are identifiers, but instead of the full set of typical identifiers (letters followed by zero or more letters and digits), we allow only the letters a and b and the digits 0 and 1. Every identifier begins with a or b, which may be followed by any string in {a, b, 0, 1}*.


Two variables are used in this grammar:

1. E represents expressions. It is the start symbol and represents the language of expressions we are defining.

2. I represents identifiers.

The productions are:

1. E → I
2. E → E + E
3. E → E * E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1


Consider the string a*(a+b00) of the above CFG.

Its derivation is:

E => E * E            Production no. 3
  => I * E            Production no. 1
  => a * E            Production no. 5
  => a * (E)          Production no. 4
  => a * (E + E)      Production no. 2
  => a * (I + E)      Production no. 1
  => a * (a + E)      Production no. 5
  => a * (a + I)      Production no. 1
  => a * (a + I0)     Production no. 9
  => a * (a + I00)    Production no. 9
  => a * (a + b00)    Production no. 6


Leftmost and Rightmost Derivations

Leftmost derivation: In order to restrict the number of choices we have in deriving a string, it is often useful to require that at each step we replace the leftmost variable by one of its production bodies. Such a derivation is called a leftmost derivation.

Rightmost derivation: Similarly, we may require that at each step we replace the rightmost variable by one of its production bodies. Such a derivation is called a rightmost derivation.


Example: The inference that a*(a+b00) is in the language of variable E can be reflected in a derivation of that string, starting with the string E.

The leftmost derivation is (=>lm marks one leftmost step, =>*lm zero or more leftmost steps):

E =>lm E * E =>lm I * E =>lm a * E =>lm a * (E) =>lm a * (E + E)
  =>lm a * (I + E) =>lm a * (a + E) =>lm a * (a + I)
  =>lm a * (a + I0) =>lm a * (a + I00) =>lm a * (a + b00)

We can summarize the leftmost derivation as E =>*lm a*(a+b00), or express several steps of it by expressions such as E * E =>*lm a * (E).


The rightmost derivation is (=>rm marks one rightmost step, =>*rm zero or more rightmost steps):

E =>rm E * E =>rm E * (E) =>rm E * (E + E) =>rm E * (E + I) =>rm E * (E + I0)
  =>rm E * (E + I00) =>rm E * (E + b00) =>rm E * (I + b00)
  =>rm E * (a + b00) =>rm I * (a + b00) =>rm a * (a + b00)

So the rightmost derivation can be summarized as E =>*rm a*(a+b00).
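Both derivations of a*(a+b00) can be replayed mechanically from a list of production numbers. The Python sketch below is an illustration only: the names PRODUCTIONS and replay are assumptions, and the two number sequences were read off the derivation steps written out in this lecture. At each step it replaces the leftmost (or rightmost) variable and refuses a production whose head does not match that variable.

```python
# Productions 1-10 of the expression grammar, written as (head, body).
PRODUCTIONS = {
    1: ("E", "I"), 2: ("E", "E+E"), 3: ("E", "E*E"), 4: ("E", "(E)"),
    5: ("I", "a"), 6: ("I", "b"), 7: ("I", "Ia"), 8: ("I", "Ib"),
    9: ("I", "I0"), 10: ("I", "I1"),
}
VARIABLES = "EI"


def replay(numbers, leftmost=True):
    """Apply the numbered productions in order, starting from E.

    Each production is applied to the leftmost variable of the current
    sentential form, or to the rightmost one when leftmost is False.
    Returns the list of sentential forms, including the initial "E".
    """
    form, steps = "E", ["E"]
    for n in numbers:
        head, body = PRODUCTIONS[n]
        positions = [i for i, s in enumerate(form) if s in VARIABLES]
        i = positions[0] if leftmost else positions[-1]
        if form[i] != head:
            raise ValueError(f"production {n} does not apply to {form!r}")
        form = form[:i] + body + form[i + 1:]
        steps.append(form)
    return steps


lm_steps = replay([3, 1, 5, 4, 2, 1, 5, 1, 9, 9, 6])
rm_steps = replay([3, 4, 2, 1, 9, 9, 6, 1, 5, 1, 5], leftmost=False)
```

Both lists end in the same terminal string and use the same eleven productions, only in a different order.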

Inference, Derivations and Parse Trees

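For a variable A and a terminal string w, the following five statements are equivalent (=>* means zero or more derivation steps; lm and rm restrict them to leftmost and rightmost steps):

I. The recursive inference procedure determines that terminal string w is in the language of variable A.

II. A =>* w.

III. A =>*lm w.

IV. A =>*rm w.

V. There is a parse tree with root A and yield w.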


Some Examples:

Example#1: Let the only terminal be a and the only nonterminal be S, and let the productions be

S → aS
S → ^

The language generated is a*.

To derive a⁶ = aaaaaa in this CFG, the following derivation is used:

S => aS => aaS => aaaS => aaaaS => aaaaaS => aaaaaaS => aaaaaa^ = aaaaaa

Notice:
i. → means “can be replaced by”, as in S → aS.
ii. => means “can develop into”, as in aaS => aaaS.


Example#2: Let the terminals be a and b, the only nonterminal be S, and the productions be

S → aS
S → bS
S → a
S → b

The language generated by this CFG is the set of all possible strings of the letters a and b except for the null string, which we cannot generate.

To produce the string baab, the following derivation is used:

S => bS => baS => baaS => baab


Example#3: Let the terminals be a and b, the only nonterminal be S, and the productions be

S → aS
S → bS
S → a
S → b
S → ^

The word ab can be generated by the derivation

S => aS => abS => ab^ = ab

or by the derivation

S => aS => ab

The language of this CFG is (a+b)*, but the sequence of productions that is used to generate a specific word is not unique.

The third and fourth productions are redundant: whatever they produce can also be produced with S → aS or S → bS followed by S → ^.
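The non-uniqueness can be counted. Because every body of this grammar has at most one S, at its right end, a derivation is fixed by the sequence of productions chosen, so a short recursion suffices. The following Python sketch is illustrative only; the names count_derivations, ALL_BODIES and REDUCED_BODIES are assumptions.

```python
ALL_BODIES = ("aS", "bS", "a", "b", "")   # "" stands for the null string ^
REDUCED_BODIES = ("aS", "bS", "")         # third and fourth productions removed


def count_derivations(w, bodies):
    """Count the derivations of the word w from S using the given bodies."""
    total = 0
    for body in bodies:
        if body.endswith("S"):
            # S -> aS or S -> bS: the letter must match, S derives the rest.
            if w[:1] == body[0]:
                total += count_derivations(w[1:], bodies)
        elif w == body:
            # S -> a, S -> b or S -> ^ finishes the derivation.
            total += 1
    return total
```

With all five productions, count_derivations("ab", ALL_BODIES) finds the two derivations listed for ab; with the reduced set each word has exactly one.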


Example#4: Let the terminals be a and b, the nonterminals be S and X, and the productions be

S → XaaX
X → aX
X → bX
X → ^

The words generated from S have the form

anything aa anything

or (a+b)*aa(a+b)*, which is the language of all words with a double a in them somewhere.

For example, to generate baabaab, we can proceed as follows:

S => XaaX => bXaaX => baXaaX => baaXaaX => baabXaaX => baab^aaX = baabaaX => baabaabX => baabaab^ = baabaab
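The claim about the language can be checked for short words by brute force. The sketch below, in Python, is only a finite sanity check and not a proof; the names RULES, words_up_to and contains_aa are assumptions. It explores leftmost derivations, discards sentential forms that already hold more than max_len terminals, and compares the result with all strings over {a, b} that contain aa.

```python
from itertools import product

# Productions of Example#4; "" stands for the null string ^.
RULES = {"S": ["XaaX"], "X": ["aX", "bX", ""]}


def words_up_to(max_len):
    """All words of length <= max_len derivable from S (leftmost search)."""
    seen, frontier, words = {"S"}, ["S"], set()
    while frontier:
        form = frontier.pop()
        # Index of the leftmost variable, or None if the form is a word.
        i = next((k for k, s in enumerate(form) if s in RULES), None)
        if i is None:
            words.add(form)
            continue
        for body in RULES[form[i]]:
            new = form[:i] + body + form[i + 1:]
            # Terminals are never erased, so longer forms can be pruned.
            if sum(s not in RULES for s in new) <= max_len and new not in seen:
                seen.add(new)
                frontier.append(new)
    return words


def contains_aa(max_len):
    """All strings over {a, b} of length <= max_len that contain aa."""
    return {"".join(p) for n in range(max_len + 1)
            for p in product("ab", repeat=n) if "aa" in "".join(p)}
```

For max_len = 5 the two sets coincide, which agrees with the description (a+b)*aa(a+b)*.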


Example#5: Let the terminals be a and b, the nonterminals be S, X, and Y, and the productions be

S → XY
X → aX
X → bX
X → a
Y → Ya
Y → Yb
Y → a

Considering variable X, the X productions are:

X → aX
X → bX
X → a

From these productions it can be seen that:
o any string of terminals that comes from X must end in an a
o any word ending in an a can be derived from X

To derive the word babba from X, the procedure is:

X => bX => baX => babX => babbX => babba

Considering variable Y, the Y productions are:

Y → Ya
Y → Yb
Y → a

The words that can be derived from Y are exactly those that begin with an a.

To derive abbab, the procedure is:

Y => Yb => Yab => Ybab => Ybbab => abbab


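Since S → XY, every word derived from S is a word ending in a followed by a word beginning with a, so the words that can be derived from S have a double a in them.

To derive babaabb, the procedure is:

S => XY => bXY => baXY => babXY => babaY => babaYb => babaYbb => babaabb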


Example#6: Let the terminals be a and b, and the three nonterminals be S, BALANCED, and UNBALANCED.

The productions are:

S → SS
S → BALANCED S
S → S BALANCED
S → ^
S → UNBALANCED S UNBALANCED
BALANCED → aa
BALANCED → bb
UNBALANCED → ab
UNBALANCED → ba

The language generated is the set of all words with an even number of a’s and an even number of b’s, i.e. the language EVEN-EVEN.


Derivation of word aababbab:

S=>BALANCED S

=>aaS

=>aa UNBALANCED S UNBALANCED

=>aa ba S UNBALANCED

=>aa ba S ab

=>aa ba BALANCED S ab

=>aa ba bb S ab

=>aa ba bb ^ ab

= aababbab


Example#7: Let the terminals be a and b, and let there be only one nonterminal, S.

The productions are:

S → aSb
S → ^

The language generated by these productions is the nonregular language aⁿbⁿ.

Derivation of a⁶b⁶ using the above productions:

S => aSb => aaSbb
  => aaaSbbb => aaaaSbbbb
  => aaaaaSbbbbb => aaaaaaSbbbbbb
  => aaaaaabbbbbb
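The derivation has a fixed shape: apply S → aSb n times, then S → ^ once. A minimal Python sketch of that procedure follows; the function name derive_anbn is an assumption for illustration.

```python
def derive_anbn(n):
    """Return the sentential forms of the derivation of a^n b^n from S."""
    form, steps = "S", ["S"]
    for _ in range(n):
        form = form.replace("S", "aSb")   # apply S -> aSb
        steps.append(form)
    form = form.replace("S", "")          # apply S -> ^
    steps.append(form)
    return steps
```

Calling derive_anbn(6) reproduces the forms aSb, aaSbb, and so on, ending with aaaaaabbbbbb.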

Example#8: Let the terminals be a and b, and let there be only one nonterminal, S.

The productions are:

S → aSa
S → bSb
S → ^

The language generated by these productions is the nonregular language of even-length palindromes (a palindrome is a word that reads the same backwards as forwards).


Derivation of word abbaabba using the above productions:

S =>aSa

=>abSba

=>abbSbba

=>abbaSabba

=>abbaabba


Example#9:

ODD PALINDROME is the language of palindromes whose words contain an odd number of letters.

A grammar for ODD PALINDROME is:

S → aSa
S → bSb
S → a
S → b

This grammar can be extended to generate the entire language PALINDROME (palindromes of both even and odd length) by adding one production:

S → aSa
S → bSb
S → a
S → b
S → ^


Example#10:

A nonregular language that can be generated by a CFG is aⁿbaⁿ.

S → aSa
S → b


Example#11:

Let the terminals be a and b, the nonterminals be S, A, and B, and the productions be

S → aB
S → bA
A → a
A → aS
A → bAA
B → b
B → bS
B → aBB

The language that this CFG generates is the language EQUAL of all strings that have an equal number of a’s and b’s in them.

Some words of this language are abba, aaabbb, and ba.
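Every body in this grammar starts with a terminal and none is the null string, so a simple top-down test terminates: expand the leftmost variable, match the leading terminal, and give up as soon as the sentential form is longer than the word. The Python sketch below is illustrative; the names RULES and derives are assumptions. Note that S has no ^ production, so the null string is not generated.

```python
# Productions of Example#11, grouped by head.
RULES = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}


def derives(form: str, w: str) -> bool:
    """True if the sentential form can derive exactly the terminal string w."""
    # Every symbol yields at least one letter, so a longer form cannot fit.
    if len(form) > len(w):
        return False
    if form == "":
        return w == ""
    head, rest = form[0], form[1:]
    if head not in RULES:
        # Terminal: it must match the next letter of w.
        return w[:1] == head and derives(rest, w[1:])
    # Variable: try each production body in its place.
    return any(derives(body + rest, w) for body in RULES[head])
```

For example, derives("S", "abba") succeeds through S => aB => abS => abbA => abba, while derives("S", "aab") fails.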

Ambiguity

Definition: A CFG is called ambiguous if for at least one word in the language that it generates there are two possible derivations of the word that correspond to different syntax trees. If a CFG is not ambiguous, it is called unambiguous.

Ambiguous Grammars:

Consider the sentential form E + E * E. It has two derivations from E:

1. E=> E + E => E + E * E

2. E=> E * E => E + E * E


        E                       E
      / | \                   / | \
     E  +  E                 E  *  E
         / | \             / | \
        E  *  E           E  +  E

      fig. I                 fig. II

Two parse trees with the same yield
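The two parse trees can be written as nested data and their yields compared. In the Python sketch below a leaf is a string and an interior node is a pair (label, children); this encoding and the names tree_yield, fig_1 and fig_2 are assumptions made for illustration.

```python
def tree_yield(tree):
    """Concatenate the leaves of a parse tree from left to right."""
    if isinstance(tree, str):          # a leaf
        return tree
    label, children = tree             # an interior node
    return "".join(tree_yield(child) for child in children)


# fig. I: E => E + E, then the right E => E * E
fig_1 = ("E", ["E", "+", ("E", ["E", "*", "E"])])
# fig. II: E => E * E, then the left E => E + E
fig_2 = ("E", [("E", ["E", "+", "E"]), "*", "E"])
```

Both trees have the yield E+E*E although the trees themselves differ, which is exactly the situation the definition of ambiguity describes.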


Removing Ambiguity from Grammars

There are two causes of ambiguity in the previous ambiguous grammar:

I. The precedence of operators is not respected. While fig. I properly groups the * before the + operator, fig. II is also a valid parse tree and groups the + ahead of the *. We need to force only the structure of fig. I to be legal in an unambiguous grammar.

II. A sequence of identical operators can group either from the left or from the right. For example, if the *’s in figs. I and II were replaced by +’s, we would see two different parse trees for the string E + E + E. Since addition and multiplication are associative, it doesn’t matter whether we group from the left or the right, but to eliminate ambiguity, we must pick one. The conventional approach is to insist on grouping from the left, so the structure of fig. II is the only correct grouping of two + signs.
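Why the first cause matters and the second does not can be seen by putting numbers at the leaves. The Python sketch below is an illustration under assumptions: the values 1, 2 and 3 and the names OPS and evaluate are not from the lecture, and each tree is encoded as a (left, operator, right) triple.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def evaluate(tree):
    """Evaluate a grouping given as an int or a (left, op, right) triple."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    return OPS[op](evaluate(left), evaluate(right))


# 1 + 2 * 3 grouped as in fig. I and as in fig. II
fig_1 = (1, "+", (2, "*", 3))
fig_2 = ((1, "+", 2), "*", 3)

# 1 + 2 + 3 grouped from the left and from the right
left_grouped = ((1, "+", 2), "+", 3)
right_grouped = (1, "+", (2, "+", 3))
```

The two groupings of 1 + 2 * 3 evaluate to 7 and 9, so precedence must be fixed by the grammar, while both groupings of 1 + 2 + 3 evaluate to 6, so the choice of left grouping there is only a convention.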

