context-free grammars: normal formsmlg.postech.ac.kr/~seungjin/courses/automata/handouts/... ·...

Context-Free Grammars: Normal Forms

Seungjin Choi

Department of Computer Science and EngineeringPohang University of Science and Technology

77 Cheongam-ro, Nam-gu, Pohang 37673, [email protected]

1 / 41

Outline

I Simplification of CFG: Transform an arbitrary CFG into anequivalent form that satisfies certain restrictions on its form

I Substitution ruleI Removing useless productionsI Removing ε-productionsI Removing unit-productions

I Normal formsI Chomsky normal formI Greibach normal form

2 / 41

I Throughout the lecture, we only consider languages that do notcontain the empty string ε, since it plays a rather singular role inmany theorems and proofs.

I In doing so, we do not lose generality.

Let G be a CFG for L− {ε}.Then we add a new variable S0 to V and add productions to P:

S0 → S | ε.

Any non-trivial conclusion that we can make for L− {ε} will almostcertainly transfer to L. Given any CFG G there is a method of obtainingG such that L(G ) = L(G )− {ε}.

3 / 41

Substitution Rule

TheoremLet G = (V ,T ,S ,P) be a CFG. Suppose that P contains a production ofthe form

A→ x1Bx2.

Assume that A and B are different variables and that

B → y1 | y2 | · · · | yn

is the set of all productions in P which have B as the left side. LetG = (V ,T ,S , P) be the grammar in which P is constructed by deletingA→ x1Bx2 from P and adding to it

A→ x1y1x2 | x1y2x2 | · · · | x1ynx2.

Then L(G ) = L(G ).

4 / 41

Proof. What you need to show is that

I if w ∈ L(G ) then w ∈ L(G )

I if w ∈ L(G ) then w ∈ L(G )

5 / 41

Suppose that w ∈ L(G), i.e., S∗⇒G

w .

If this derivation does not involve the production A→ x1Bx2, then obviously

S∗⇒G

w .

If it does, then look at the derivation the first time A→ x1Bx2 is used. The Bso introduced eventually has to be replaced; we lose nothing by assuming thatthis is done immediately. Thus,

S∗⇒G

u1Au2 ⇒G

u1x1Bx2u2 ⇒G

u1x1yjx2u2.

But with grammar G we can get

S∗⇒G

u1Au2 ⇒G

u1x1yjx2u2.

Thus we can reach the same sentential form with G and G . If A→ x1Bx2 isused again later, we can repeat the argument. If follows then, by induction onthe number of times the production is applied, that

S∗⇒G

w .

Therefore, if w ∈ L(G) then w ∈ L(G).

By similar reasoning, we can show that if w ∈ L(G) then w ∈ L(G), whichcompletes the proof. �

6 / 41

Example: Consider G = ({A,B}, {a, b, c},A,P) with productions

A → a | aaA | abBc,B → abbA | b.

Substitutions for the variable B, lead to the grammar G with productions

A → a | aaA | ababbAc | abbc,B → abbA | b. (unnecessary)

The string aaabbc has the derivation

A⇒ aaA⇒ aaabBc ⇒ aaabbc in G

and

A⇒ aaA⇒ aaabbc in G .

7 / 41

Useful?

DefinitionLet G = (V ,T ,S ,P) be a CFG. A variable A ∈ V is said to be useful ifand only if there is at least one w ∈ L(G ) such that

S∗⇒

reachablexAy

∗⇒generating

w

with x , y ∈ (V ∪ T )∗.

In other words, a variable is useful if and only if it occurs in at least onederivation.

Two requirements to be ”useful”:

I Can derive a terminal string?: A symbol X ∈ V ∪ T is generating ifX

∗⇒ w for some w ∈ T ∗.

I Can be reached from the start variable?: A symbol X ∈ V ∪ T isreachable if there is a derivation S

∗⇒ αXβ for some α and β.

8 / 41

Removing Useless Productions

I A variable that is not useful is called useless.

I A production is useless if it involves any useless variable.

I Eliminate the symbols that are not generating first, and theneliminate from the remaining grammars those symbols that are notreachable. Note that the other way around does not work!

Example: Eliminate useless variables and productions from G whereV = {S ,A,B,C} and T = {a, b} with

S → aS |A |C ,A → a,

B → aa,

C → aCb.

9 / 41

First, identify the set of variables that can lead to a terminal string

A → a, (OK)

B → aa, (OK)

S ⇒ A⇒ a, (OK)

C (No!)

Remove C , then we have V1 = {S ,A,B} and T = {a} with

S → aS |A,A → a,

B → aa.

10 / 41

Next, eliminate the variables that cannot be reached from the startvariable. To this end we draw a dependency graph where its vertices arelabeled with variables and an edge between C and D is connected if andonly if there is a production form C → xDy .

S A B

The variable B is useless. Therefore, G = (V , T ,S , P) where

V = {S ,A} and T = {a} with

S → aS |A,A → a.

11 / 41

TheoremLet G = (V ,T ,S ,P) be a CFG. Then there exists an equivalent

grammar G = (V , T ,S , P) that does not contain any useless variables orproductions.

Proof. The algorithm G can be generated from G by an algorithmconsisting of two parts. In the first part we construct an intermediategrammar G1 = (V1,T1,S ,P1) such that V1 contain only variables A for

which A∗⇒ w ∈ T ∗ is possible.

1. Set V1 to φ.

2. Repeat the following step until no more variables are added to V1.For every A ∈ V for which P has a production of the form

A→ x1x2 · · · xn with all xi ∈ V1 ∪ T ,

add A to V1.

3. Take P1 as all the productions in P whose symbols are all in V1 ∪T .

12 / 41

Clearly this procedure terminates. It is equally clear that if A ∈ V1 thenA

∗⇒ w ∈ T ∗ is a possible derivation with G1. The remaining issue iswhether every A for which A

∗⇒ w is added to V1 before the procedureterminates.

Consider a parse tree

A

Ai

Aj

a b

c

Level k

Level k − 1

Level k − 2

13 / 41

At level k, there are only terminals, so every variable Ai at level k − 1 willbe added to V1 on the first pass through step 2. Any variable at levelk − 2 will then be added to V1 on the second pass through step 2. Thethird time through step 2, all variables at level k − 2 will be added, andso on. The algorithm cannot terminate while there are variable in thetree that are not yet in V1. Hence A will be eventually added to V1.

In the second part of the construction, we get the final answer G fromG1. We draw the variable dependency graph for G1 and from it find allvariables that cannot be reached from S . We can also eliminate anyterminal that does not occur in some useful productions. The result isthe grammar G = (V , T ,S , P).

Because of the construction, G does not contain any useless symbols orproductions. Also, for each w ∈ L(G ) we have a derivation

S∗⇒ xAy

∗⇒ w .

14 / 41

Since the construction of G retains A and all associated productions, wehave everything needed to make the derivation

S∗⇒G

xAy∗⇒G

w .

The grammar G is constructed from G by removal of productions,implying P ⊆ P, which gives

L(G ) ⊆ L(G ).

Putting the two results together, we see that G and G are equivalent. �

15 / 41

Removing ε-Productions

DefinitionAny production of a CFG that is of the form

A→ ε

is called ε-production.

DefinitionAny variable A for which the derivation

A∗⇒ ε

is possible, is called nullable.

16 / 41

Example: Consider the CFG

S → aS1b,

S1 → aS1b | ε.

This leads to L = {anbn | n ≥ 1}.The ε-production, S1 → ε, can be removed by the substituting ε for S1,leading to

S → aS1b | ab,S1 → aS1b | ab.

which also generate L = {anbn | n ≥ 1}.

17 / 41

TheoremLet G be any CFG with ε not in L(G ). Then there exists an equivalent

grammar G having no ε-productions.

Proof. Find the set VN of all nullable variables of G .

1. Find all productions A→ ε, and put A into VN .

2. Repeat the following step until no further variables are added to VN .

2.1 For all productions

B → A1A2 · · ·An,

where A1,A2, . . . ,An ∈ VN , put B into VN .

18 / 41

Once the set VN has been found, we are ready to construct P. We lookat all productions in P of the form

A→ x1x2 · · · xm, m ≥ 1,

where xi ∈ V ∪ T .

For each such production of P, we put into P that productions as well asall those generated by replacing nullable variables with ε in all possiblecombinations.

For example, if xi and xj are both nullable, there will be one production

in P with xi replaced with ε, one in which xj is replaced with ε, and onein which both xi and xj are replaced with ε. There is one exception: if all

xi are nullable, the production A→ ε is not put into P. If isstraightforward to see G is equivalent to G . �

19 / 41

Example: Consider the CFG G with

S → ABaC ,

A → BC ,

B → b | ε,C → D | ε,D → d .

Identify nullable variables. First we recognize VN = {B,C} since B → εand C → ε. In addition, since we have A→ BC , the set of nullablevariables is VN = {A,B,C}.

20 / 41

Then we construct G (with no ε-productions)

S → ABaC |BaC |AaC |ABa | aC |Aa |Ba | a,A → BC |B |C ,B → b,

C → D,

D → d .

21 / 41

Removing Unit-Productions

DefinitionAny production of a CFG of the form

A→ B,

where A,B ∈ V is called a unit-production.

We use the substitution rule to remove unit-productions.

22 / 41

TheoremLet G = (V ,T ,S ,P) be any CFG without ε-productions. Then there

exists a CFG G = (V , T ,S , P) that does not have any unit-productionsand that is equivalent to G .

Proof. Obviously, any unit production of the form A→ A can beremoved from the grammar without effect. We need only considerA→ B where A and B are different variables.

We first find, for each A all variables such that

A∗⇒ B.

We can do this by drawing a dependency graph with an edge (C ,D)

whenever the grammar has a unit production C → D. A∗⇒ B holds

whenever there is a walk between A and B.

23 / 41

The new grammar G is generated by first putting into P all non-unitproductions of P. Next, for all A and B satisfying A

∗⇒ B, we add to P

A→ y1 | y2 | · · · | yn,

where B → y1 | y2 | · · · | yn is the set of all rules in P with B on the left.

Note that since B → y1 | y2 | · · · | yn is taken from P, none of the yi canbe a single variable, so that non-unit productions are created by the laststep.

To show that the resulting grammar is equivalent to the original one wecan follow the same line of reasoning as in Theorem 1. �

24 / 41

Example: Remove all unit-productions from

S → Aa |B,B → A | bb,A → a | bc |B.

We first draw a dependency graph for the unit-productions.

S BA

25 / 41

It follows from the dependency graph that we have

S∗⇒ A,

S∗⇒ B,

B∗⇒ A,

A∗⇒ B.

The new rules are

S → a | bc | bb,A → bb,

B → a | bc.

26 / 41

The original non-unit productions are

S → Aa,

B → bb,

A → a | bc.

Therefore, the equivalent grammar (without unit-productions) areconstructed by adding the new rules to the original non-unit productions:

S → a | bc | bb |Aa,A → bb | a | bc,B → a | bc | bb.

27 / 41

TheoremLet L be CFG that does not contain ε. Then there exists a CFG thatgenerates L and that does not have any useless productions,ε-productions, or unit-productions.

28 / 41

Summary

To clean up a grammar we can

1. Eliminate ε-productions

2. Eliminate unit-productions

3. Eliminate useless productions

in this order.

29 / 41

Chomsky Normal Form

DefinitionA CFG is in Chomsky normal form (CNF) if all productions are of theform

A→ BC , or A→ a,

where A,B,C ∈ V and a ∈ T .

I The number of symbols on the right of a production are strictlylimited.

I The string on the right of a production consists of no more than twosymbols.

30 / 41

Example: The grammar with the following productions

S → AS | a,A → SA | b,

is in CNF.The grammar with the following productions

S → AS |AAS ,A → SA | aa,

is not in CNF.

31 / 41

TheoremAny CFG G with ε ∈ L(G ) has an equivalent grammar G = (V , T ,S , P)in CNF.

32 / 41

Example: Consider a CFG G with productions

S → ABa,

A → aab,

B → Ac .

Convert G to CNF.

33 / 41

Step 1: Introduce new variables, Ba,Bb,Bc , for terminals a, b, c ,respectively. Then we have

S → ABBa,

A → BaBaBb,

B → ABc ,

Ba → a,

Bb → b,

Bc → c .

34 / 41

Step 2: Introduce additional variables to take care of the first twoproductions:

S → AD1,

D1 → BBa,

A → BaD2,

D2 → BaBb,

B → ABc ,

Ba → a,

Bb → b,

Bc → c .

35 / 41

Greibach Normal Form

DefinitionA CFG is said to be in Greibach normal form (GNF) if all productionshave the form

A→ ax ,

where a ∈ T and x ∈ V ∗.

The form A→ ax is common to both Greibach normal form ands-grammars but GNF does not carry the restrictions that the pair (A, a)occur at most once.

TheoremFor every CFG G with ε ∈ L(G ), there exists an equivalent G in GNF.

36 / 41

From CFG to GNF

I Not trivial

I A simple idea is to eliminate the following:

1. Front recursions: A→ AB2. Front non-terminals3. Non-front terminals

37 / 41

Remove Front Recursion

Consider a grammar involving the front recursion

A→ AB |CD,

We convert this into the following equivalent grammar:

A → CDN |CD,N → BN |B.

38 / 41

Remove Non-Front Terminal

Consider a grammar involving the non-front terminal

A→ cDAbC ,


A → cDANC ,

N → b.

40 / 41

context-free grammars: normal formsmlg.postech.ac.kr/~seungjin/courses/automata/handouts/... ·...

Documents