context-free grammars: normal formsmlg.postech.ac.kr/~seungjin/courses/automata/handouts/... ·...
TRANSCRIPT
Context-Free Grammars: Normal Forms
Seungjin Choi
Department of Computer Science and EngineeringPohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, [email protected]
1 / 41
Outline
I Simplification of CFG: Transform an arbitrary CFG into anequivalent form that satisfies certain restrictions on its form
I Substitution ruleI Removing useless productionsI Removing ε-productionsI Removing unit-productions
I Normal formsI Chomsky normal formI Greibach normal form
2 / 41
I Throughout the lecture, we only consider languages that do notcontain the empty string ε, since it plays a rather singular role inmany theorems and proofs.
I In doing so, we do not lose generality.
Let G be a CFG for L− {ε}.Then we add a new variable S0 to V and add productions to P:
S0 → S | ε.
Any non-trivial conclusion that we can make for L− {ε} will almostcertainly transfer to L. Given any CFG G there is a method of obtainingG such that L(G ) = L(G )− {ε}.
3 / 41
Substitution Rule
TheoremLet G = (V ,T ,S ,P) be a CFG. Suppose that P contains a production ofthe form
A→ x1Bx2.
Assume that A and B are different variables and that
B → y1 | y2 | · · · | yn
is the set of all productions in P which have B as the left side. LetG = (V ,T ,S , P) be the grammar in which P is constructed by deletingA→ x1Bx2 from P and adding to it
A→ x1y1x2 | x1y2x2 | · · · | x1ynx2.
Then L(G ) = L(G ).
4 / 41
Proof. What you need to show is that
I if w ∈ L(G ) then w ∈ L(G )
I if w ∈ L(G ) then w ∈ L(G )
5 / 41
Suppose that w ∈ L(G), i.e., S∗⇒G
w .
If this derivation does not involve the production A→ x1Bx2, then obviously
S∗⇒G
w .
If it does, then look at the derivation the first time A→ x1Bx2 is used. The Bso introduced eventually has to be replaced; we lose nothing by assuming thatthis is done immediately. Thus,
S∗⇒G
u1Au2 ⇒G
u1x1Bx2u2 ⇒G
u1x1yjx2u2.
But with grammar G we can get
S∗⇒G
u1Au2 ⇒G
u1x1yjx2u2.
Thus we can reach the same sentential form with G and G . If A→ x1Bx2 isused again later, we can repeat the argument. If follows then, by induction onthe number of times the production is applied, that
S∗⇒G
w .
Therefore, if w ∈ L(G) then w ∈ L(G).
By similar reasoning, we can show that if w ∈ L(G) then w ∈ L(G), whichcompletes the proof. �
6 / 41
Example: Consider G = ({A,B}, {a, b, c},A,P) with productions
A → a | aaA | abBc,B → abbA | b.
Substitutions for the variable B, lead to the grammar G with productions
A → a | aaA | ababbAc | abbc,B → abbA | b. (unnecessary)
The string aaabbc has the derivation
A⇒ aaA⇒ aaabBc ⇒ aaabbc in G
and
A⇒ aaA⇒ aaabbc in G .
7 / 41
Useful?
DefinitionLet G = (V ,T ,S ,P) be a CFG. A variable A ∈ V is said to be useful ifand only if there is at least one w ∈ L(G ) such that
S∗⇒
reachablexAy
∗⇒generating
w
with x , y ∈ (V ∪ T )∗.
In other words, a variable is useful if and only if it occurs in at least onederivation.
Two requirements to be ”useful”:
I Can derive a terminal string?: A symbol X ∈ V ∪ T is generating ifX
∗⇒ w for some w ∈ T ∗.
I Can be reached from the start variable?: A symbol X ∈ V ∪ T isreachable if there is a derivation S
∗⇒ αXβ for some α and β.
8 / 41
Removing Useless Productions
I A variable that is not useful is called useless.
I A production is useless if it involves any useless variable.
I Eliminate the symbols that are not generating first, and theneliminate from the remaining grammars those symbols that are notreachable. Note that the other way around does not work!
Example: Eliminate useless variables and productions from G whereV = {S ,A,B,C} and T = {a, b} with
S → aS |A |C ,A → a,
B → aa,
C → aCb.
9 / 41
First, identify the set of variables that can lead to a terminal string
A → a, (OK)
B → aa, (OK)
S ⇒ A⇒ a, (OK)
C (No!)
Remove C , then we have V1 = {S ,A,B} and T = {a} with
S → aS |A,A → a,
B → aa.
10 / 41
Next, eliminate the variables that cannot be reached from the startvariable. To this end we draw a dependency graph where its vertices arelabeled with variables and an edge between C and D is connected if andonly if there is a production form C → xDy .
S A B
The variable B is useless. Therefore, G = (V , T ,S , P) where
V = {S ,A} and T = {a} with
S → aS |A,A → a.
11 / 41
TheoremLet G = (V ,T ,S ,P) be a CFG. Then there exists an equivalent
grammar G = (V , T ,S , P) that does not contain any useless variables orproductions.
Proof. The algorithm G can be generated from G by an algorithmconsisting of two parts. In the first part we construct an intermediategrammar G1 = (V1,T1,S ,P1) such that V1 contain only variables A for
which A∗⇒ w ∈ T ∗ is possible.
1. Set V1 to φ.
2. Repeat the following step until no more variables are added to V1.For every A ∈ V for which P has a production of the form
A→ x1x2 · · · xn with all xi ∈ V1 ∪ T ,
add A to V1.
3. Take P1 as all the productions in P whose symbols are all in V1 ∪T .
12 / 41
Clearly this procedure terminates. It is equally clear that if A ∈ V1 thenA
∗⇒ w ∈ T ∗ is a possible derivation with G1. The remaining issue iswhether every A for which A
∗⇒ w is added to V1 before the procedureterminates.
Consider a parse tree
A
Ai
Aj
a b
c
Level k
Level k − 1
Level k − 2
13 / 41
At level k, there are only terminals, so every variable Ai at level k − 1 willbe added to V1 on the first pass through step 2. Any variable at levelk − 2 will then be added to V1 on the second pass through step 2. Thethird time through step 2, all variables at level k − 2 will be added, andso on. The algorithm cannot terminate while there are variable in thetree that are not yet in V1. Hence A will be eventually added to V1.
In the second part of the construction, we get the final answer G fromG1. We draw the variable dependency graph for G1 and from it find allvariables that cannot be reached from S . We can also eliminate anyterminal that does not occur in some useful productions. The result isthe grammar G = (V , T ,S , P).
Because of the construction, G does not contain any useless symbols orproductions. Also, for each w ∈ L(G ) we have a derivation
S∗⇒ xAy
∗⇒ w .
14 / 41
Since the construction of G retains A and all associated productions, wehave everything needed to make the derivation
S∗⇒G
xAy∗⇒G
w .
The grammar G is constructed from G by removal of productions,implying P ⊆ P, which gives
L(G ) ⊆ L(G ).
Putting the two results together, we see that G and G are equivalent. �
15 / 41
Removing ε-Productions
DefinitionAny production of a CFG that is of the form
A→ ε
is called ε-production.
DefinitionAny variable A for which the derivation
A∗⇒ ε
is possible, is called nullable.
16 / 41
Example: Consider the CFG
S → aS1b,
S1 → aS1b | ε.
This leads to L = {anbn | n ≥ 1}.The ε-production, S1 → ε, can be removed by the substituting ε for S1,leading to
S → aS1b | ab,S1 → aS1b | ab.
which also generate L = {anbn | n ≥ 1}.
17 / 41
TheoremLet G be any CFG with ε not in L(G ). Then there exists an equivalent
grammar G having no ε-productions.
Proof. Find the set VN of all nullable variables of G .
1. Find all productions A→ ε, and put A into VN .
2. Repeat the following step until no further variables are added to VN .
2.1 For all productions
B → A1A2 · · ·An,
where A1,A2, . . . ,An ∈ VN , put B into VN .
18 / 41
Once the set VN has been found, we are ready to construct P. We lookat all productions in P of the form
A→ x1x2 · · · xm, m ≥ 1,
where xi ∈ V ∪ T .
For each such production of P, we put into P that productions as well asall those generated by replacing nullable variables with ε in all possiblecombinations.
For example, if xi and xj are both nullable, there will be one production
in P with xi replaced with ε, one in which xj is replaced with ε, and onein which both xi and xj are replaced with ε. There is one exception: if all
xi are nullable, the production A→ ε is not put into P. If isstraightforward to see G is equivalent to G . �
19 / 41
Example: Consider the CFG G with
S → ABaC ,
A → BC ,
B → b | ε,C → D | ε,D → d .
Identify nullable variables. First we recognize VN = {B,C} since B → εand C → ε. In addition, since we have A→ BC , the set of nullablevariables is VN = {A,B,C}.
20 / 41
Then we construct G (with no ε-productions)
S → ABaC |BaC |AaC |ABa | aC |Aa |Ba | a,A → BC |B |C ,B → b,
C → D,
D → d .
21 / 41
Removing Unit-Productions
DefinitionAny production of a CFG of the form
A→ B,
where A,B ∈ V is called a unit-production.
We use the substitution rule to remove unit-productions.
22 / 41
TheoremLet G = (V ,T ,S ,P) be any CFG without ε-productions. Then there
exists a CFG G = (V , T ,S , P) that does not have any unit-productionsand that is equivalent to G .
Proof. Obviously, any unit production of the form A→ A can beremoved from the grammar without effect. We need only considerA→ B where A and B are different variables.
We first find, for each A all variables such that
A∗⇒ B.
We can do this by drawing a dependency graph with an edge (C ,D)
whenever the grammar has a unit production C → D. A∗⇒ B holds
whenever there is a walk between A and B.
23 / 41
The new grammar G is generated by first putting into P all non-unitproductions of P. Next, for all A and B satisfying A
∗⇒ B, we add to P
A→ y1 | y2 | · · · | yn,
where B → y1 | y2 | · · · | yn is the set of all rules in P with B on the left.
Note that since B → y1 | y2 | · · · | yn is taken from P, none of the yi canbe a single variable, so that non-unit productions are created by the laststep.
To show that the resulting grammar is equivalent to the original one wecan follow the same line of reasoning as in Theorem 1. �
24 / 41
Example: Remove all unit-productions from
S → Aa |B,B → A | bb,A → a | bc |B.
We first draw a dependency graph for the unit-productions.
S BA
25 / 41
It follows from the dependency graph that we have
S∗⇒ A,
S∗⇒ B,
B∗⇒ A,
A∗⇒ B.
The new rules are
S → a | bc | bb,A → bb,
B → a | bc.
26 / 41
The original non-unit productions are
S → Aa,
B → bb,
A → a | bc.
Therefore, the equivalent grammar (without unit-productions) areconstructed by adding the new rules to the original non-unit productions:
S → a | bc | bb |Aa,A → bb | a | bc,B → a | bc | bb.
27 / 41
TheoremLet L be CFG that does not contain ε. Then there exists a CFG thatgenerates L and that does not have any useless productions,ε-productions, or unit-productions.
28 / 41
Summary
To clean up a grammar we can
1. Eliminate ε-productions
2. Eliminate unit-productions
3. Eliminate useless productions
in this order.
29 / 41
Chomsky Normal Form
DefinitionA CFG is in Chomsky normal form (CNF) if all productions are of theform
A→ BC , or A→ a,
where A,B,C ∈ V and a ∈ T .
I The number of symbols on the right of a production are strictlylimited.
I The string on the right of a production consists of no more than twosymbols.
30 / 41
Example: The grammar with the following productions
S → AS | a,A → SA | b,
is in CNF.The grammar with the following productions
S → AS |AAS ,A → SA | aa,
is not in CNF.
31 / 41
TheoremAny CFG G with ε ∈ L(G ) has an equivalent grammar G = (V , T ,S , P)in CNF.
32 / 41
Example: Consider a CFG G with productions
S → ABa,
A → aab,
B → Ac .
Convert G to CNF.
33 / 41
Step 1: Introduce new variables, Ba,Bb,Bc , for terminals a, b, c ,respectively. Then we have
S → ABBa,
A → BaBaBb,
B → ABc ,
Ba → a,
Bb → b,
Bc → c .
34 / 41
Step 2: Introduce additional variables to take care of the first twoproductions:
S → AD1,
D1 → BBa,
A → BaD2,
D2 → BaBb,
B → ABc ,
Ba → a,
Bb → b,
Bc → c .
35 / 41
Greibach Normal Form
DefinitionA CFG is said to be in Greibach normal form (GNF) if all productionshave the form
A→ ax ,
where a ∈ T and x ∈ V ∗.
The form A→ ax is common to both Greibach normal form ands-grammars but GNF does not carry the restrictions that the pair (A, a)occur at most once.
TheoremFor every CFG G with ε ∈ L(G ), there exists an equivalent G in GNF.
36 / 41
From CFG to GNF
I Not trivial
I A simple idea is to eliminate the following:
1. Front recursions: A→ AB2. Front non-terminals3. Non-front terminals
37 / 41
Remove Front Recursion
Consider a grammar involving the front recursion
A→ AB |CD,
We convert this into the following equivalent grammar:
A → CDN |CD,N → BN |B.
38 / 41
Remove Front Non-Terminal
Consider a grammar involving the front non-terminal
A → BbC | aA,B → cDA | aE ,
We convert this into the following equivalent grammar:
A → cDAbC | aEbC | aA,B → cDA | aE .
39 / 41
Remove Non-Front Terminal
Consider a grammar involving the non-front terminal
A→ cDAbC ,
We convert this into the following equivalent grammar:
A → cDANC ,
N → b.
40 / 41
Example: Consider a CFG with
S → SA |BeA,A → AS | a,B → b.
Find an equivalent GNF?
Solution. We directly apply the technique explained in previous 3 slides.The GNF is given by
S → bEAC | bEA,C → aDC | aD | aC | a,A → aD | a,D → bEACD | bEAC | bEAD | bEA,E → e.
41 / 41