chapter 12 context-free grammars - asethome.org  · web viewcontext-free grammars and languages...

29
Context-Free Grammars and Languages (Based on Hopcroft, Motwani and Ullman (2007) & Cohen (1997)) Introduction Consider an example sentence: A small cat eats the fish English grammar has rules for constructing sentences; e.g., 1. A sentence can be a subject followed by a predicate (Subject: A small cat) (Predicate: eats the fish) 2. A subject can be a noun-phrase (Noun-Phrase: A small cat) 3. A noun-phrase can be an article, followed by an adjective followed by a noun (Article: A) (Adjective: small) (Noun: cat) 4. A predicate can be a verb-phrase (Verb-Phrase: eats the fish) 5. A verb-phrase can be a verb followed by a noun-phrase (Verb: eats) (Noun-Phrase: the fish) 7. A noun-phrase can be an article, followed by a noun (Article: the) (Noun: fish) 8. A verb can be: buries, touches, grabs, eats 9. An adjective can be: big, small 10. An article can be: the, a, an Definition: The things that cannot be replaced by anything are called terminals. Definition: The things that must be replaced by other things are called nonterminals or variables In the above example, “small” and “eats” are terminals, and noun-phrase and verb-phrase are nonterminals. Definition: The sequence of applications of the rules that produces the finished string of terminals from the starting symbol is called a derivation. For a derivation of “a small cat eats the fish” using a Context-Free Grammar (CFG), visit this 1

Upload: ngohanh

Post on 06-Sep-2018

280 views

Category:

Documents


0 download

TRANSCRIPT

Context-Free Grammars and Languages(Based on Hopcroft, Motwani and Ullman (2007) & Cohen (1997))IntroductionConsider an example sentence: A small cat eats the fish

English grammar has rules for constructing sentences; e.g.,1. A sentence can be a subject followed by a predicate

(Subject: A small cat) (Predicate: eats the fish)2. A subject can be a noun-phrase (Noun-Phrase: A small cat)3. A noun-phrase can be an article, followed by an adjective followed by a noun (Article: A) (Adjective: small) (Noun: cat)4. A predicate can be a verb-phrase (Verb-Phrase: eats the fish) 5. A verb-phrase can be a verb followed by a noun-phrase (Verb: eats) (Noun-Phrase: the fish)7. A noun-phrase can be an article, followed by a noun (Article: the) (Noun: fish)8. A verb can be: buries, touches, grabs, eats9. An adjective can be: big, small10. An article can be: the, a, an

Definition: The things that cannot be replaced by anything are called terminals.Definition: The things that must be replaced by other things are called nonterminals or variables

In the above example, “small” and “eats” are terminals, and noun-phrase and verb-phrase are nonterminals.

Definition: The sequence of applications of the rules that produces the finished string of terminals from the starting symbol is called a derivation. For a derivation of “a small cat eats the fish” using a Context-Free Grammar (CFG), visit this page: http://www.asethome.org/mathfoundations/English_Parse_Tree.pdf

Context-Free Grammar (CFG)Example 1: terminals: ∑ = { a } Nonterminals: N = { S } productions:S aSS ^

The above CFG defines a* = { ^, a, aa, aaa, aaaa, aaaaa, . . . }It can generate or derive aaaa as follows:S => aS => aaS => aaaS => aaaaS => aaaa

1

The same derivation can also be written as: S => aS => aaS => aaaS => aaaaS => aaaa

Example 2: Terminal: ∑ = { a } Nonterminal: N = { S }Productions: S SS S a S ^This CFG can be written in more compact notation: S SS | a | ^

which is also called the Backus Normal Form or Backus-Naur Form (BNF).The above CFG defines the same CFL: a* = { ^, a, aa, aaa, aaaa, . . . }This CFG can generate the string aa as follows:S => SS => SSS => SSa => SSSa => SaSa => aSa => aa

In this grammar, a string in the CFL has infinitely many derivations. In other examples, a string is usually generated in limited number of ways.

Definition: A context-free grammar (CFG) is composed of:1. ∑ : A finite non-empty set of symbols called terminals from which strings of the

language are made. 2. N: A finite set of symbols called nonterminals. ∑ and N are disjoint. 3. S: A special start symbol in N.4. P: A finite set of productions of the form: Ni a string from (∑ + N)* where Ni is in N.

Thus, each production is of the form: one nonterminal -> finite string of terminals and/or nonterminals, where the “string of terminals and/or nonterminals” can consist of only terminals or only nonterminals, or any mixture of terminals and nonterminals or even the empty string. We require that at least one production has the nonterminal S as its left side. On the right side of the arrow any string from (∑ + N)* is allowed. Some books represent terminals by T; in that case, the productions of the form Ni (T + N)* are allowed in CFG.

Convention: Terminals will typically be lowercase letters. Nonterminals will typically be uppercase letters.

Example 3: A CFG for ab* = { a, ab, abb, abbb, abbbb, . . . . }1. Terminals: ∑ = {a, b}, 2. Nonterminal: N = {S, B} 3. Productions: P = {

S aB B bB | ^ }

DERIVATION of abbb using the CFG of example 3:

S => aB => abB => abbB => abbbB => abbb

2

Most of the time the set of productions are explicitly given for a CFG and the terminals and the non-terminals are understood from the context as shown in examples 4-6 :

Example 4: A CFG for aba* = { ab, aba, abaa, abaaa, . . . . }

S abA A aA | ^

DERIVATION of abaaa using the CFG of example 4: S => abA => abaA => abaaA => abaaaA => abaaa

Example 5: A CFG for ab*a = { aa, aba, abba, abbba, . . . . }

S aBa B aB | ^

DERIVATION of abbbba using the CFG of example 5: S => aBa => abBa => abbBa => abbbBa => abbbbBa => abbbba

Example 6: A CFG for { ancbn :n > 0} = { acb, aacbb, aaacbbb, . . . . }

S aSb | acb DERIVATION of aaacbbb using the CFG of example 6: S => aSb => aaSbb => aaacbbb

Definition: The language generated (defined, derived, produced) by a CFG, G, is the set of all strings of terminals that can be produced from the start symbol S using the productions as substitutions. A language generated by a CFG, G, is called a context-free language (CFL) and is denoted by L(G).

Example: Terminals: ∑ = {a}, Nonterminal: N = {S} Productions: P = S aS S ^Let L1 the language generated by this CFG, and let L2 be the language generated by regular expression a*.

Claim: L1 = L2.Proof:

We first show that L2 L1.Consider a^n L2 for n >= 1. We can generate an by using first production n times, and then second production.Can generate ^ L2 by using second production only.Hence L2 L1.

3

We now show that L1 L2.Since a is the only terminal, CFG can only produce strings having only a’s.Thus, L1 L2.Note that Two types of arrows:

-> used in statement of productions=> used in derivation of string

in the above derivation of a4, there were many unfinished stages that consisted of both terminals and nonterminals. These are called working strings.^ is neither a nonterminal (since it cannot be replaced with something else) nor a terminal (since it disappears from the string).

Example: terminals: ∑ = { a, b }Nonterminals: { S }Productions: S -> aS S -> bS S -> a S -> b

More compact notation:S -> aS | bS | a | bThis CFG can derive the string abbab as follows:S => aS => abS => abbS => abbaS => abbab

Let L1 be the CFL, and let L2 be the language generated by theregular expression (a + b)+.

Claim: L1 = L2.

Proof:First we show that L2 L1.Consider any string w L2.Read letters of w from left to right.

For each letter read in, if it is not the last, thenuse the production S -> aS if the letter is a oruse the production S -> bS if the letter is b

For the last letter of the string,use the production S -> a if the letter is a oruse the production S -> b if the letter is b

In each stage of the derivation, the working string has the form(string of terminals)S. Hence, we have shown how to generate w using the CFG, whichmeans that w L1. Hence, L2 L1.Now we show that L1 L2.To show this, we need to show that if w L1, then w L2.This is equivalent to showing that if w L2, then w L1.

Note that the only string w L2 is w = A.But note that A cannot be generated by the CFG, so A L1.

4

Hence, we have proven that L1 L2.

Trees can be used to illustrate how a string is derived from a CFG.

Definition: A tree is an acyclic graph. Trees based on a grammar (or syntax definition) are called syntax trees, parse trees, generation trees, production trees, or derivation trees.

Example: The following CFG defines L6 = { {n c*}n } Terminals: { {, }, c } Nonterminals: { S, X } Productions: S {S} | {X} X cX | ^Every string generated by this CFG has balanced open and closed braces like programming languages such as Java. Consider the following derivation:

S => {S} => {{S}} => {{{X}}} => {{{cX}}} => {{{ccX}}} => {{{cc}}}

The above derivation corresponds to the following derivation tree or parse tree:

Example: CFG for simplified arithmetic expressions.

5

Terminals : +, _, 0, 1, 2, . . . , 9Nonterminals : SProductions :

S S + S | S - S | 0 | 1 | 2 | · · · | 9

• Consider the expression 2 - 3 + 4.• Ambiguous how to evaluate this:• Does this mean (2 - 3) + 4 = 10 or 2 - (3 + 4) = 14 ?• Can eliminate ambiguity by examining the two possible derivation trees

S S/ | \ / | \/ | \ / | \/ | \ / | \/ | \ / | \S + S S * S/ | \ | | / | \/ | \ | | / | \2 * 3 4 2 3 + 4Eliminate the S’s as follows:+ */ \ / \/ \ / \/ \ / \/ \ / \* 4 2 +/ \ / \/ \ / \2 3 3 4

Note that we can construct a new notation for mathematical expressions: start at top of tree walk around tree keeping left hand touching tree first time hit each terminal, print it out.

This gives us a string which is in operator prefix notation or Polish notation.In above examples,

first tree yields+ _ 2 3 4

second tree yields_ 2 + 3 4

To evaluate the string:1. scan string from left to right.2. the first time we read a substring of the form “operator-operand-operand” (o-o-o),

replace the three symbols with the one result of the indicated arithmetic calculation.

3. go back to step 1

Example: (from above)

6

first tree yields: string first o-o-o substring

+ _ 2 3 4 _ 2 3 + 6 4 + 6 4 10

second tree yields: string first o-o-o substring

_ 2 + 3 4 + 3 4 _ 2 7 _ 2 7 14

Definition: A CFG is ambiguous if for at least one string in its CFL there are two possible derivations of the string that correspond to two different syntax trees.

Example: PALINDROME

Terminals : a, bNonterminals : SProductions :

S aSa | bSb | a | b | ^

Can generate the string babbab as follows:S => bSb => baSab => babSbab => babbab

which has derivation tree:

S/|\

b S b/|\

a S a/|\

b S b|^

Can show that this CFG is unambiguous.

Definition: For a given CFG, the total language tree is the tree with root S, whose children are all the productions of S, whose second descendents are all the working

7

strings that can be constructed by applying one production to the leftmost nonterminal in each of the children, and so on.

Example:

Terminals : a, bNonterminals : S, XProductions :

S -> aX | Xa | aXbXaX -> ba | ab

This CFG has total language tree as follows:S/ | \/ | \/ | \/ | \/ | \aX Xa aXbXa/ | / | / \/ | / | / \aba aab baa aba ababXa aabbXa/ \ / \ababbaa abababa aabbbaa aabbabaThe CFL is finite.

Other References:

http://web.njit.edu/~marvin/cis341/chap12.pdfhttp://www.cs.appstate.edu/~dap/classes/2490/chap12.html

8

Chapter 13 Grammatical Format

Regular Grammars

We previously saw thatCFG’s can generate all regular languages.CFG’s can generate some non-regular languages (that are Context-Free).

All regular languages can be generated by CFG’s, because all regular languages are CF. some non-regular languages cannot be generated by CFG’s, because they are non-CF.

One can turn an FA into a CFG as follows:

Example: L = all strings ending in a.

FA:

Definition: The path development of a string processed on a machine:Start in starting state S.For each state visited, print out the input letters used thus far and the current state.

9

b a

b

a

b

aA+

S-

B

The string ababba has following path development on the FA:

SaAabBabaAababBababbBababbaAababba

Now we define the following productions:S -> aA | bBA -> aA | bB | _B –> aA | bB

Note that:The CFG has a production

X -> cY

if and only if in the FA, there is an arc from state X to state Y labeled with c.

The CFG has a productionX -> A

if and only if state X in the FA is a final state.

Derivation of ababba using the CFG:S => aA => abB => abaA => ababB => ababbB => ababbaA => ababba

There is a one-to-one correspondence between path developments on the FA and derivations in the CFG; i.e., we can use the pigeonhole principle. The derivation of the string ababba using the CFG is exactly the same as the path development given above.

Theorem 21 All regular languages are CFL’s.

10

Example:FA:

productions:

S => aS | bAA => aC | bB | AB => aB | bCC => aA | bB | A

Consider a CFG G = (E, , R, S), whereE is the set of terminals is the set of nonterminals, and S is the starting nonterminal R ×(E+ )* is the set of productions, where a production (N, U) R with N and U (E + )* is written as N -> U

Definition: For a given CFG G = (E,,R, S), W is a semiword if W E * ;i.e., W is a string of terminals (maybe none) cancatenated with exactly one nonterminal (on the right).

Example: aabaN is a semiword if N is a nonterminal and a and b are terminals.

Definition: A Regular grammar, RG, is composed of 1. ∑: A finite non-empty set of symbols called terminals2. N: A finite set of symbols, called Non-terminals. ∑ and N are disjoint3. S: A special start symbol in N.4. P: A set of productions in one of the following two forms (not both):

11

ab

b

a

a

b

a

A+

S-

B

C+b

4a. Right RG: Ni ∑*Nj | ∑*

4b. Left RG: Ni Nj ∑* | ∑*

A regular grammar can be either Right Regular or Left Regular not both. Each of N i

and Nj is a single non-terminal and they could be the same. That is, N i = Nj is allowed.

Theorem 22 If a CFG is a regular grammar, then the language generated by this CFG is regular.

Proof.We will prove theorem by showing that there is a TG that accepts the language enerated by the CFG.

Suppose CFG is as follows:

N1 -> w1M1N2 -> w2M2... Nn -> wnMnNn+1 -> wn+1Nn+2 -> wn+2...Nn+m ! wn+m

where Ni and Mi are nonterminals (not necessarily distinct) and wi E* are strings of terminals. Thus, wiMi is a semiword. At least one of the Ni = S. Assume that N1 = S.Create a state of the TG for each nonterminal Ni and for each nonterminal Mj .

Also create a state +. Make the state for nonterminal S the initial state of the transition graph. Draw an arc labeled with wi from state Ni to state Mi if and only if there is a

production Ni -> wiMi. Draw an arc labeled with wi from state Ni to state + if and only if there is a

production Ni -> wi. Thus, we have created a TG. By considering the path developments of words accepted by the TG, we can

show that there is a one-to-one correspondence between words accepted by TG and words in CFL.

Thus, these are the same language. Kleene’s Theorem implies that the language has a regular expression. Thus, language is regular.

Remarks: all regular languages can be generated by some regular grammars (Theorem 21) all regular grammars generate some regular language. a regular language may have many CFG’s that generate it, where some of the

CFG’s may not be regular grammars.

12

Example: CFG

productions:S -> aB | bA | abA | baBA -> abaA | bbB -> baA | ab

Corresponding TG (note that CFG does not generate A):

Definition: A production (N, U) R is a A-production if U = A, i.e., theproduction is N -> A. If CFG does not contain a A -production, then A CFL.However, CFG may have A -production and A CFL.

Example: productions:S -> aXX -> A

Chomsky Normal Form

A Productions and Nullable Nonterminals

Recall we previously defined A -production:N-> A where N is some nonterminal.

13

a, b

^

bb

ab

b

ba

baa

abaA

+

B

S-

Note thatIf some CFL contains the word A, then the CFG must have a A -production.However, if a CFG has a A -production, then the CFL does not necessarily contain A; e.g.,

S -> aXX -> A

which defines the CFL {a}.

Definition: For a given CFG with as its set of nonterminals E and A as its set of terminals, a working string W 2 (E + )* is any string of nonterminals and/or terminals that can be generated from the CFG starting from any nonterminal.

Definition: For a given CFG having a nonterminal X and W a possible working string, we use the notation

X => W

if there is some derivation in the CFG starting from X that can result in the working string W.

Example: CFG:S a | Xb | aY aX Y | AY X | a

Since we have the following derivationS => aY a => aXa => aa

we can write S *=> aY a and S *=> aXa and S *=> aa.

Definition: In a given CFG, a nonterminal X is nullable if

1. There is a production X -> A, or2. X *=> A; i.e., there is a derivation that starts at X and leads to A3. X => …. => A

Chomsky Normal Form

Definition: A CFG G = (E, ,R, S) is in Chomsky Normal Form (CNF) if (N, U) R implies U () + E; i.e., each of its productions has one of the two forms:

1. Nonterminal -> string of exactly two Nonterminals2. Nonterminal -> one terminal

Theorem 26 For any CFL L, the non- A words of L can be generated by aCFG in CNF.

14

Leftmost Nonterminals and Derivations

Definition: The leftmost nonterminal (LMN) in a working string is the first nonterminal that we encounter when we scan the string from left to right. Example: In the string bbabXbaY SbXbY , the LMN is X.

Definition: If a word w is generated by a CFG by a certain derivation and at each step in the derivation, a rule of production is applied to the leftmost nonterminal in the working string, then this derivation is called a leftmost derivation (LMD).

Example: CFG:

S baXaS | abX Xab | aa

The following is a Left-Most Derivation (LMD):

S => baXaS => baXabaS => baXababaS => baaaababaS => baaaababaab

Example: CFG:

S XYX Y b | Xa | aa | Y YY XbbX | ab

The word abbaaabbabab has the following derivation tree:S/ \/ \/ \/ \/ \X _ Y _/ \ / / \ \Y b / | | \

15

/ \ X b b Xa b / \ / \X a / \/ \ Y Ya a / \ / \a b a b

Note that if we walk around the tree starting down the left branch of the root with our left hand always touching the tree, then the order in which we first visit each nonterminal corresponds to the order in which the nonterminals are replaced in LMD.This is true for any derivation in any CFG

Theorem 27 Any word that can be generated by a given CFG by some derivationalso has a LMD.

Other References:http://web.njit.edu/~marvin/cis341/chap13.pdf

16

Chapter 14 Pushdown Automata

Introduction

Previously, we saw connection betweenRegular languages and Finite automata

We saw that certain languages generated by CFG’s could not be accepted by FA’s.

Pushdown Automata

Now we will introduce new kind of machine: pushdown automaton (PDA).Will see connection between

context-free languages and pushdown automata

Pushdown automata and FA’s share some features, but a PDA can have one extra key feature: STACK. (infinitely long INPUT TAPE on which input is written.)

INPUT TAPE is divided into cells, and each cell holds one input letter or a blank Δ.

Once blank Δ is encountered on INPUT TAPE, all of the following cells also contain Δ.

Read TAPE one cell at a time, from left to right. Cannot go back. START, ACCEPT, and REJECT states. Once enter either ACCEPT or REJECT state, cannot ever leave. READ state to read input letter from INPUT TAPE. Also, have an infinitely tall PUSHDOWN STACK, which has Last-In-First-Out

(LIFO) discipline. Always start with STACK empty.

17

A b b …………

STACK can hold letters of STACK alphabet (which can be same as input alphabet) and blanks .

PUSH and POP states alter contents of STACK.PUSH adds something to the top of the STACK.POP takes off the thing on the top of the STACK.

Example: Convert FA to PDA

Figure 2: An FA which is converted to a PDA in Figure 3

18

a

b

ab

b

a

A

B

B

..

-

+

Figure 3: A PDA

19

b

b a

b a

Start

Reject Reject

AcceptRead Rea

d

Read

A pushdown automaton for { anbn : where n >= 0 } is given below in Figure 4. Please note that X is used a stack symbol; that is, when an a is read from input X is pushed into the stack and when a b is read from the input X is poped out of the stack. On the other hand, a could also have been used as a stack symbol.

Figure 4. A Pushdown Automaton

20

Determinism and Nondeterminism

Definition: A PDA is deterministic if each input string can only be processed by the machine in one way.Definition: A PDA is nondeterministic if there is some string that can be processed by the machine in more than one way.

A nondeterministic PDA may have more than one edge with the same label leading out of a certain READ

state or POP state. may have more than one arc leaving the START state. Both deterministic and nondeterministic PDAs may have no edge with a certain label leading out of a certain READ state or

POP state. if we are in a READ or POP state and encounter a letter for which there is no out-

edge from this state, the PDA crashes.

Remarks:

For FA’s, nondeterminism does not increase power of machines.For PDA’s, nondeterminism does increase power of machines.

Example: Language PALINDROMEX, which consists of all words of the formsXreverse(s) where s is any string generated by (a + b)_.

PALINDROMEX = {X, aXa, bXb, aaXaa, abXba, baXab, bbXbb, . . .} Each word in PALINDROMEX has odd length and X in middle. When processing word on PDA, first read letters from TAPE and PUSH letters

onto STACK until read in X. Then POP letters off STACK, and check if they are the same as rest of input

string on TAPE.

PDA:• Input alphabet E = {a, b,X}• Stack alphabet L = {a, b}

21

Formal Definition of PDA

Definition: A pushdown automaton (PDA) is a collection of eight things: An alphabet ∑ of input letters. An input TAPE (infinite in one direction), which initially contains the input string to

be processed followed by an infinite number of blanks An alphabet E of STACK characters. A pushdown STACK (infinite in one direction), which initially contains all blanks

. One START state that has only out-edges, no in-edges. Can have more than one

arc leaving the START state. There are no labels on arcs leaving the START state.

Halt states of two kinds:1. zero or more ACCEPT states2. zero or more REJECT states

Each of the Halt states have in-edges but no out-edges. Finitely many nonbranching PUSH states that introduce characters from E onto

the top of the STACK. Finitely many branching states of two kinds:

1. READ states, which read the next unused letter from TAPE and may have out-edges labeled with letters from ∑ or a blank . (There is no restriction on duplication of labels and no requirement that there be a label for each letter of ∑, or .)

22

X

b

a

b

a

a

b

Start

Read 1

Read 2

POP1

POP2

POP3

Accept

Push b

Push a

2. POP states, which read the top character of STACK and may have out-edges labeled with letters of E and the blank character , with no restrictions.

Remarks:

The definition for PDA allows for nondeterminism. If we want to consider a PDA that does not have nondeterminism, then we will call it a deterministic PDA. Deterministic PDA (DPDA) accept only a subset of CF languages. That is, PDA and DPDA are not equivalent.

Other References:http://www.cs.odu.edu/~toida/nerzic/390teched/cfl/cfg.html

http://www.asethome.org/dey/Relating_Acm_p168-dey.pdf

http://www.cs.duke.edu/csed/jflapodsa/martin/Readings/Summaries/DeyEtAl.html

23