Lecture Five: Context Free Grammar (CFG)
Amjad Ali
CFG, Lecture 5, slide
Definition of Context-Free Grammar
There are four important components in a grammatical description of a
language:
1. There is a finite set of symbols that form the strings of the language
being defined. This set was {0,1} in the palindrome example we just
saw. We call this alphabet the terminals, or terminal symbols.
2. There is a finite set of variables, sometimes also called nonterminals
or syntactic categories. Each variable represents a language; i.e., a
set of strings. In our example above, there was only one variable, P,
which we used to represent the class of palindromes over the alphabet
{0,1}.
3. One of the variables represents the language being defined; it is
called the start symbol. Other variables represent auxiliary classes of
strings that are used to help define the language of the start symbol.
In our example, P, the only variable, is the start symbol.
4. There is a finite set of productions or rules that represent the recursive
definition of a language. Each production consists of:
a) A variable that is being (partially) defined by the production. This
variable is often called the head of the production.
b) The production symbol →.
c) A string of zero or more terminals and variables. This string, called
the body of the production, represents one way to form strings in the
language of the variable of the head. In so doing, we leave terminals
unchanged and substitute for each variable in the body any string that
is known to be in the language of that variable.
Alternate Definition of Context-Free Grammar
A context-free grammar (CFG) is a collection of three things:
1. An alphabet Σ of letters called terminals, from which we are going to
make strings that will be the words of a language.
2. A set of symbols called nonterminals, one of which is the symbol S,
standing for “start here”.
3. A finite set of productions of the form
one nonterminal → finite string of terminals and/or nonterminals
Formal Definition of CFG
A context-free grammar is a 4-tuple (V, Σ, R, S), where
1. V is a finite set called the variables,
2. Σ is a finite set, disjoint from V, called the terminals,
3. R is a finite set of rules, with each rule being a variable and a string
of variables and terminals, and
4. S ∈ V is the start variable.
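The 4-tuple can be written down directly as a data structure. The following is a minimal Python sketch, not part of the lecture: the class name CFG, its field names, and the choice to store each rule as a (head, body) pair are assumptions made for illustration. The example instance is the palindrome grammar over {0,1} used in this lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset   # V
    terminals: frozenset   # Sigma
    rules: tuple           # R: (head, body) pairs, body is a tuple of symbols
    start: str             # S

    def __post_init__(self):
        # Sigma must be disjoint from V, and S must be a member of V.
        if self.variables & self.terminals:
            raise ValueError("V and Sigma must be disjoint")
        if self.start not in self.variables:
            raise ValueError("the start variable must be in V")
        symbols = self.variables | self.terminals
        for head, body in self.rules:
            # Each rule is a variable together with a string over V and Sigma.
            if head not in self.variables:
                raise ValueError(f"rule head {head!r} is not a variable")
            if not all(sym in symbols for sym in body):
                raise ValueError(f"rule body {body!r} uses an unknown symbol")


# The palindrome grammar: one variable P, terminals 0 and 1.
palindromes = CFG(
    variables=frozenset({"P"}),
    terminals=frozenset({"0", "1"}),
    rules=(
        ("P", ()),                  # empty body, i.e. the empty string
        ("P", ("0",)),
        ("P", ("1",)),
        ("P", ("0", "P", "0")),
        ("P", ("1", "P", "1")),
    ),
    start="P",
)
```

Checking the conditions in the constructor means a tuple that violates the definition (for example, a symbol that is both a variable and a terminal) is rejected when it is built.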
Palindrome Example
The rules that define the palindromes over {0,1}, expressed in context-free grammar notation, are (^ denotes the empty string):
1. P → ^
2. P → 0
3. P → 1
4. P → 0P0
5. P → 1P1
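The five palindrome rules can be read as a recursive membership test: a string is in the language of P if it matches a body with no variable, or if stripping a matching outer pair of symbols leaves a string that is again in the language of P. The sketch below is one possible reading in Python; the function name in_P is hypothetical.

```python
def in_P(w: str) -> bool:
    """Return True if w can be derived from P using rules 1 to 5."""
    # Rules 1-3: P -> ^, P -> 0, P -> 1
    if w in ("", "0", "1"):
        return True
    # Rules 4-5: P -> 0P0, P -> 1P1
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return in_P(w[1:-1])
    return False


print(in_P("0110"))   # rule 4, then rule 5, then rule 1
print(in_P("01"))     # no rule applies
```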
Notation for CFG Derivations
Some conventions used while discussing CFGs:
1. Lower-case letters near the beginning of the alphabet, a, b, and so on,
are terminal symbols. Digits and other characters such as + or
parentheses can also be used as terminals.
2. Upper-case letters near the beginning of the alphabet, A, B, and so on,
are variables.
3. Lower-case letters near the end of the alphabet, such as w or z, are
strings of terminals. This convention reminds us that the terminals are
analogous to the input symbols of an automaton.
4. Upper-case letters near the end of the alphabet, such as X or Y, are either
terminals or variables.
5. Lower-case Greek letters, such as alpha and beta, are strings consisting of terminals and/or
variables.
There is no special notation for strings that consist of variables only, since this concept plays
no important role. However, a string named alpha or another Greek letter might happen to
have only variables.
Example: A more complex CFG that represents (a simplification of) expressions in a typical programming language. The operators are limited to + and *, representing addition and multiplication respectively. Arguments are identifiers, but instead of the full set of typical identifiers (a letter followed by zero or more letters and digits), we allow only the letters a and b and the digits 0 and 1. Every identifier begins with a or b, which may be followed by any string in {a, b, 0, 1}*.
Two variables are used in this grammar:
1. E represents expressions. It is the start symbol and represents the language of expressions we are defining.
2. I represents identifiers.
The productions are:
1. E → I
2. E → E + E
3. E → E * E
4. E → (E)
5. I → a
6. I → b
7. I → Ia
8. I → Ib
9. I → I0
10. I → I1
Suppose a string of the above CFG is a*(a+b00).
One derivation of it is:
E => E * E Production no. 3
=> I * E Production no. 1
=> a * E Production no. 5
=> a * (E) Production no. 4
=> a * (E + E) Production no. 2
=> a * (I + E) Production no. 1
=> a * (a + E) Production no. 5
=> a * (a + I) Production no. 1
=> a * (a + I0) Production no. 9
=> a * (a + I00) Production no. 9
=> a * (a + b00) Production no. 6
Leftmost and Rightmost Derivations
Leftmost derivation:
In order to restrict the number of choices we have in deriving a string, it is often useful to require that at each step we replace the leftmost variable by one of its production bodies. Such a derivation is called a leftmost derivation.
Rightmost derivation:
Similarly, we can require that at each step we replace the rightmost variable by one of its production bodies. Such a derivation is called a rightmost derivation.
Example:
The inference that a*(a+b00) is in the language of
variable E can be reflected in a derivation of that string, starting with the string E.
The leftmost derivation is (=>lm marks one leftmost step, =>*lm zero or more leftmost steps):
E =>lm E * E =>lm I * E =>lm a * E =>lm a * (E) =>lm a * (E + E)
=>lm a * (I + E) =>lm a * (a + E) =>lm a * (a + I) =>lm
a * (a + I0) =>lm a * (a + I00) =>lm a * (a + b00)
We can summarize the leftmost derivation as E =>*lm a*(a+b00), or express several of its steps by expressions such as E * E =>*lm a * (E).
The rightmost derivation is (=>rm marks one rightmost step, =>*rm zero or more rightmost steps):
E =>rm E * E =>rm E * (E) =>rm E * (E + E) =>rm E * (E + I) =>rm E * (E + I0)
=>rm E * (E + I00) =>rm E * (E + b00) =>rm E * (I + b00) =>rm
E * (a + b00) =>rm I * (a + b00) =>rm a * (a + b00)
So the rightmost derivation can be expressed as E =>*rm a*(a+b00).
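Leftmost and rightmost derivations can be replayed mechanically from a list of production numbers. The sketch below assumes the numbering 1 to 10 of the expression grammar in this lecture and writes sentential forms without spaces; the names PRODUCTIONS, VARIABLES and derive are hypothetical.

```python
# Productions of the expression grammar, keyed by their number in the lecture.
PRODUCTIONS = {
    1: ("E", "I"), 2: ("E", "E+E"), 3: ("E", "E*E"), 4: ("E", "(E)"),
    5: ("I", "a"), 6: ("I", "b"), 7: ("I", "Ia"), 8: ("I", "Ib"),
    9: ("I", "I0"), 10: ("I", "I1"),
}
VARIABLES = "EI"


def derive(start, steps, leftmost=True):
    """Apply the numbered productions in order, always to the leftmost
    (or rightmost) variable, and return every sentential form produced."""
    form = start
    forms = [form]
    for n in steps:
        head, body = PRODUCTIONS[n]
        positions = [i for i, c in enumerate(form) if c in VARIABLES]
        if not positions:
            raise ValueError("no variable left to replace")
        i = positions[0] if leftmost else positions[-1]
        if form[i] != head:
            raise ValueError(f"production {n} does not apply to {form!r}")
        form = form[:i] + body + form[i + 1:]
        forms.append(form)
    return forms


lm_forms = derive("E", [3, 1, 5, 4, 2, 1, 5, 1, 9, 9, 6])
rm_forms = derive("E", [3, 4, 2, 1, 9, 9, 6, 1, 5, 1, 5], leftmost=False)
print(" => ".join(lm_forms))
print(" => ".join(rm_forms))
```

Both runs use the same productions the same number of times and end in the same string; only the order in which the variables are replaced differs.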
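Inference, Derivations and Parse Trees
Given a grammar, a variable A, and a terminal string w, the following five statements are equivalent (=>* denotes a derivation of zero or more steps; lm and rm restrict it to leftmost and rightmost steps):
I. The recursive inference procedure determines that terminal string w is in the language of variable A.
II. A =>* w.
III. A =>*lm w.
IV. A =>*rm w.
V. There is a parse tree with root A and yield w.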
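Statement V can be made concrete with a small tree structure. The sketch below builds by hand a parse tree for a*(a+b00) in the expression grammar of this lecture and reads off its yield; the names Node, tree_yield, leaf and ident_a are hypothetical, and the tree is only one possible encoding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def tree_yield(node):
    """Concatenate the leaf labels from left to right."""
    if not node.children:
        return node.label
    return "".join(tree_yield(child) for child in node.children)


def leaf(symbol):
    return Node(symbol)


def ident_a():
    # Subtree for I -> a
    return Node("I", [leaf("a")])


# Subtree for I -> I0, I -> I0, I -> b, whose yield is b00.
ident_b00 = Node("I", [Node("I", [Node("I", [leaf("b")]), leaf("0")]), leaf("0")])

tree = Node("E", [                      # E -> E * E
    Node("E", [ident_a()]),             # E -> I
    leaf("*"),
    Node("E", [                         # E -> (E)
        leaf("("),
        Node("E", [                     # E -> E + E
            Node("E", [ident_a()]),
            leaf("+"),
            Node("E", [ident_b00]),
        ]),
        leaf(")"),
    ]),
])

print(tree.label, tree_yield(tree))
```

Each interior node and its children correspond to one production, so the same tree records both the leftmost and the rightmost derivation of the string.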
Some Examples:
Example#1:
Let the terminal be a and the nonterminal be S, and the productions be
S → aS
S → ^
The language generated by this CFG is a*.
To derive a^6 in this CFG, the following derivation is used:
S => aS => aaS => aaaS => aaaaS => aaaaaS => aaaaaaS => aaaaaa^ = aaaaaa
Notice:
i. → means “can be replaced by”, as in S → aS.
ii. => means “can develop into”, as in aaS => aaaS.
Example#2:
Let the terminals be a and b and the only nonterminal be S, and the
productions be
S → aS
S → bS
S → a
S → b
The language generated by this CFG is the set of all possible strings of letters a and b except for the null string, which we cannot generate.
To produce the string baab, the following derivation is used:
S => bS => baS => baaS => baab
Example#3:
Let the terminals be a and b, the only nonterminal be S, and the productions be
S → aS
S → bS
S → a
S → b
S → ^
The word ab can be generated by the derivation
S => aS => abS => ab^ = ab
or by the derivation
S => aS => ab
The language of this CFG is (a+b)*, but the sequence of productions that is used to generate a specific word is not unique.
The third and fourth productions are redundant, since S → a can be replaced by S → aS followed by S → ^, and likewise for S → b.
Example#4:
Let the terminals be a and b, the nonterminals be S and X, and the productions be
S → XaaX
X → aX
X → bX
X → ^
The words generated from S have the form
anything aa anything
or (a+b)*aa(a+b)*, which is the language of all words with a double a in them somewhere.
For example, to generate baabaab, we can proceed as follows:
S => XaaX => bXaaX => baXaaX => baaXaaX => baabXaaX => baab^aaX = baabaaX => baabaabX => baabaab^ = baabaab
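The claim that Example#4 generates exactly the words containing a double a can be spot-checked by brute force for short words. The sketch below enumerates leftmost derivations breadth-first and prunes forms with too many terminals. It is a check for small lengths, not a proof, and it terminates here because sentential forms of this grammar never contain more than two variables. The names RULES and language_up_to are hypothetical.

```python
from collections import deque
from itertools import product

# Productions of Example#4; the empty body "" stands for ^.
RULES = {"S": ["XaaX"], "X": ["aX", "bX", ""]}


def language_up_to(max_len, start="S"):
    """Return every terminal string of length <= max_len derivable from start."""
    seen, words = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, or None if the form is all terminals.
        i = next((k for k, c in enumerate(form) if c in RULES), None)
        if i is None:
            words.add(form)
            continue
        for body in RULES[form[i]]:
            new = form[:i] + body + form[i + 1:]
            n_terminals = sum(1 for c in new if c not in RULES)
            # Terminals are never erased, so longer forms cannot yield short words.
            if n_terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words


# Compare with the description "anything aa anything" for words of length <= 5.
all_words = ("".join(p) for n in range(6) for p in product("ab", repeat=n))
expected = {w for w in all_words if "aa" in w}
matches = language_up_to(5) == expected
print(matches)
```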
Example#5:
Let the terminals be a and b, the nonterminals be S, X and Y, and the productions be
S → XY
X → aX
X → bX
X → a
Y → Ya
Y → Yb
Y → a
Considering variable X, the X productions are:
X → aX
X → bX
X → a
From these productions, it can be seen that:
o any string of terminals that comes from X must end in an a
o any word ending in an a can be derived from X
To derive the word babba from X, the procedure will be:
X => bX => baX => babX => babbX => babba
Considering variable Y, the Y productions are:
Y → Ya
Y → Yb
Y → a
It can be seen that the words that can be derived from Y are:
o exactly those that begin with an a
To derive abbab, the procedure will be:
Y => Yb => Yab => Ybab => Ybbab => abbab
Since S → XY, every word derived from S is a word ending in a followed by a word beginning with a, so the words that can be derived from S have a double a in them.
To derive babaabb, the procedure will be:
S => XY => bXY => baXY => babXY => babaY => babaYb => babaYbb => babaabb
Example#6:
Let the terminals be a and b, and the three nonterminals be S, BALANCED, and UNBALANCED.
The productions are:
S → SS
S → BALANCED S
S → S BALANCED
S → ^
S → UNBALANCED S UNBALANCED
BALANCED → aa
BALANCED → bb
UNBALANCED → ab
UNBALANCED → ba
From these productions, it can be seen that:
o The language generated is the set of all words with an even number of a’s and an even number of b’s, i.e. the language EVEN-EVEN.
Derivation of word aababbab:
S=>BALANCED S
=>aaS
=>aa UNBALANCED S UNBALANCED
=>aa ba S UNBALANCED
=>aa ba S ab
=>aa ba BALANCED S ab
=>aa ba bb S ab
=>aa ba bb ^ ab
= aababbab
Example#7:
Let the terminals be a and b, and only one nonterminal S.
The productions are:
S → aSb
S → ^
The language generated by these productions is the nonregular language a^n b^n.
Derivation of a^6 b^6 using the above productions:
S => aSb => aaSbb
=> aaaSbbb => aaaaSbbbb
=> aaaaaSbbbbb => aaaaaaSbbbbbb
=> aaaaaabbbbbb
Example#8:
Let the terminals be a and b, and only one nonterminal S.
The productions are:
S → aSa
S → bSb
S → ^
The language generated by these productions is the nonregular language of even-length palindromes (a palindrome is a word that reads the same backwards as forwards). Palindromes of odd length need the additional productions given in Example#9.
Derivation of word abbaabba using the above productions:
S =>aSa
=>abSba
=>abbSbba
=>abbaSabba
=>abbaabba
Example#9:
The ODD PALINDROME language contains the palindromes with an odd number of letters.
A grammar for ODD PALINDROME is:
S → aSa
S → bSb
S → a
S → b
The above grammar can be extended to generate the entire language PALINDROME (which contains words of both even and odd length) by adding the production S → ^:
S → aSa
S → bSb
S → a
S → b
S → ^
Example#10:
A nonregular language that can be generated by a CFG is a^n b a^n.
S → aSa
S → b
Example#11:
Let the terminals be a and b, the nonterminals be S, A, and B, and the productions be
S → aB
S → bA
A → a
A → aS
A → bAA
B → b
B → bS
B → aBB
The language that this CFG generates is the language EQUAL of all strings that have an equal number of a’s and b’s in them.
Some words of this language are abba, aaabbb, and ba.
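Membership in the language of Example#11 can be tested top-down, because every production body starts with a terminal: expanding the leftmost variable must be followed by matching one letter of the word. The sketch below is a memoized recognizer under that assumption. The names RULES, derives and agrees_with_equal are hypothetical, and the comparison treats the empty word as not generated, since the grammar has no ^-production.

```python
from functools import lru_cache
from itertools import product

RULES = {
    "S": ["aB", "bA"],
    "A": ["a", "aS", "bAA"],
    "B": ["b", "bS", "aBB"],
}


@lru_cache(maxsize=None)
def derives(form: str, w: str) -> bool:
    """True if the sentential form can derive exactly the terminal string w."""
    if not form:
        return not w
    head, rest = form[0], form[1:]
    if head not in RULES:
        # Terminal: it must match the next letter of w.
        return bool(w) and w[0] == head and derives(rest, w[1:])
    # Variable: try each of its production bodies.
    return any(derives(body + rest, w) for body in RULES[head])


def agrees_with_equal(max_len: int) -> bool:
    """Compare the grammar with 'equal number of a's and b's' on short words."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            expected = n > 0 and w.count("a") == w.count("b")
            if derives("S", w) != expected:
                return False
    return True


print(derives("S", "abba"), derives("S", "aaabbb"), derives("S", "ba"))
print(agrees_with_equal(6))
```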
Ambiguity
Definition:
A CFG is called ambiguous if for at least one word in the language that it
generates there are two possible derivations of the word that correspond to different syntax trees. If a CFG is not ambiguous, it is called unambiguous.
Ambiguous Grammars:
Consider the sentential form E + E * E. It has two derivations from E:
1. E => E + E => E + E * E
2. E => E * E => E + E * E
        E                     E
      / | \                 / | \
     E  +  E               E  *  E
         / | \           / | \
        E  *  E         E  +  E
      fig. I               fig. II
Two parse trees with the same yield
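Ambiguity can be observed by counting parse trees. The sketch below assumes a simplified version of the expression grammar, E → E+E, E → E*E, E → a, where the terminal a stands in for an identifier; this simplification and the name count_parse_trees are assumptions for illustration. A count greater than 1 for some word shows that this grammar is ambiguous.

```python
from functools import lru_cache


def count_parse_trees(w: str) -> int:
    """Number of distinct parse trees for w under E -> E+E | E*E | a."""

    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        # Count the trees with root E whose yield is w[i:j].
        total = 1 if w[i:j] == "a" else 0            # E -> a
        for k in range(i + 1, j - 1):
            if w[k] in "+*":
                # w[k] is the operator at the root: E -> E+E or E -> E*E.
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(w))


print(count_parse_trees("a+a*a"))   # one tree per choice of root operator
print(count_parse_trees("a+a+a"))   # left grouping and right grouping
```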
Removing Ambiguity from Grammars
There are two causes of ambiguity in the previous ambiguous grammar:
I. The precedence of operators is not respected. While fig. I properly groups the * before the + operator, fig. II is also a valid parse tree and groups the + ahead of the *. We need to force only the structure of fig. I to be legal in an unambiguous grammar.
II. A sequence of identical operators can group either from the left or from the right. For example, if the *’s in fig. I and fig. II were replaced by +’s, we would see two different parse trees for the string E + E + E. Since addition and multiplication are associative, it doesn’t matter whether we group from the left or the right, but to eliminate ambiguity, we must pick one. The conventional approach is to insist on grouping from the left, so the structure of fig. II is the only correct grouping of two +-signs.