Parsing context-free grammars

Context-free grammars specify structure, not process.

There are many different ways to parse input in accordance with a given context-free grammar.

We will review
– a top-down parsing algorithm
– a bottom-up parsing algorithm

We will present the Earley algorithm

Bottom-up parsing

Yngve (1955) presented a bottom-up algorithm

Example (figure 10.4): Book that flight.

Book is ambiguous – there are two possible POS tags for the word “Book”.

[Noun Book] [Det that] [Noun flight]

[Verb Book] [Det that] [Noun flight]

Look up words in lexicon

[NOM [Noun Book]] [Det that] [NOM [Noun flight]]

[Verb Book] [Det that] [NOM [Noun flight]]


Build structure from bottom up

Now we have three possible structures:

[NOM [Noun Book]] [NP [Det that] [NOM [Noun flight]]]

[VP [Verb Book]] [Det that] [NOM [Noun flight]]

[Verb Book] [NP [Det that] [NOM [Noun flight]]]


The Noun interpretation of Book leads to a dead end, so only two parse trees survive:

[VP [Verb Book]] [NP [Det that] [NOM [Noun flight]]]

[VP [Verb Book] [NP [Det that] [NOM [Noun flight]]]]



There is no way to combine a VP followed by an NP to form an S, so only one parse tree survives:

[S [VP [Verb Book] [NP [Det that] [NOM [Noun flight]]]]]
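The bottom-up steps above can be sketched in Python as an exhaustive search over reductions. This is only an illustration: the grammar fragment and lexicon are inferred from the example trees, and the names RULES, LEXICON and bottom_up_recognize are made up for this sketch. It explores the alternatives in parallel (breadth-first) and only answers whether an S can be built.

```python
from collections import deque

# Grammar fragment inferred from the example trees (an assumption).
RULES = [
    ("S", ("VP",)),
    ("VP", ("Verb", "NP")),
    ("VP", ("Verb",)),
    ("NP", ("Det", "NOM")),
    ("NOM", ("Noun",)),
]
LEXICON = {"book": ["Noun", "Verb"], "that": ["Det"], "flight": ["Noun"]}


def bottom_up_recognize(words, start="S"):
    """Return True if some sequence of reductions turns the words into [start]."""
    # Look up words in the lexicon: one starting sequence per combination of POS tags.
    frontier = deque([()])
    for word in words:
        frontier = deque(seq + (tag,) for seq in frontier for tag in LEXICON[word.lower()])
    seen = set(frontier)
    while frontier:
        seq = frontier.popleft()
        if seq == (start,):
            return True
        # Try every rule at every position: replace a right-hand side by its left-hand side.
        for lhs, rhs in RULES:
            n = len(rhs)
            for i in range(len(seq) - n + 1):
                if seq[i:i + n] == rhs:
                    reduced = seq[:i] + (lhs,) + seq[i + n:]
                    if reduced not in seen:
                        seen.add(reduced)
                        frontier.append(reduced)
    return False
```

Calling bottom_up_recognize("Book that flight".split()) follows the Verb reading through to S, while the Noun reading runs out of applicable reductions.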


When parsing top-down, we start with the grammar’s start symbol and apply productions to try to match the input:

[S]

Book that flight

Build structure from top down

Here we show only the successful choices:

[S [VP]]

[S [VP [Verb Book] [NP]]]

[S [VP [Verb Book] [NP [Det that] [NOM]]]]

[S [VP [Verb Book] [NP [Det that] [NOM [Noun flight]]]]]
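A top-down parser of this kind can be sketched as a recursive-descent procedure with backtracking. The grammar fragment is inferred from the example trees, and the names expand, expand_sequence and top_down_parse are hypothetical; Python generators are used here to get the backtracking for free.

```python
# Grammar fragment inferred from the example trees (an assumption).
RULES = {
    "S": [["VP"]],
    "VP": [["Verb", "NP"], ["Verb"]],
    "NP": [["Det", "NOM"]],
    "NOM": [["Noun"]],
}
LEXICON = {"book": {"Noun", "Verb"}, "that": {"Det"}, "flight": {"Noun"}}


def expand(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can cover words[pos:next_pos]."""
    if symbol not in RULES:  # POS category: match it against the next input word
        if pos < len(words) and symbol in LEXICON.get(words[pos].lower(), ()):
            yield (symbol, words[pos]), pos + 1
        return
    for rhs in RULES[symbol]:  # try each production in turn; a failed one yields nothing
        yield from expand_sequence(rhs, words, pos, (symbol,))


def expand_sequence(symbols, words, pos, tree):
    """Expand a right-hand side left to right, threading the input position."""
    if not symbols:
        yield tree, pos
        return
    for child, next_pos in expand(symbols[0], words, pos):
        yield from expand_sequence(symbols[1:], words, next_pos, tree + (child,))


def top_down_parse(words, start="S"):
    """Return all parse trees for `start` that cover the whole input."""
    return [tree for tree, end in expand(start, words, 0) if end == len(words)]
```

The production VP → Verb is also tried for "Book that flight", but the resulting S covers only the first word and is discarded. A left-recursive rule would send this sketch into unbounded recursion.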


Top-down versus bottom-up approaches

Top-down advantages
– Doesn’t explore trees which cannot be S
– Subtrees fit under S

Top-down disadvantages
– Many fruitless trees are explored: trees explored may have no hope of matching input

Bottom-up advantages
– All trees explored are consistent with input

Bottom-up disadvantages
– Builds structure even if S cannot be formed
– Builds neighboring structures which can never combine

Approaches to dealing with ambiguity

– parallel exploration
– depth-first strategy with backtracking

Improving top-down parsing

Make top-down parser pay attention to input with bottom-up filtering (left-corner parsing)

“The parser should not consider any grammar rule if the current input cannot serve as the first word along the left edge of some derivation from this rule.” [pg. 369]

Left corners are pre-compiled.
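One way to pre-compile left corners is to take the first symbol of every right-hand side and close the relation transitively. The sketch below is an illustration only: the grammar is the fragment from the example plus the rule S → NP VP, which is added here purely to give the filter something to reject, and the names left_corner_table and viable are hypothetical.

```python
# Example fragment plus S -> NP VP (the extra rule is an assumption for illustration).
RULES = {
    "S": [("NP", "VP"), ("VP",)],
    "VP": [("Verb", "NP"), ("Verb",)],
    "NP": [("Det", "NOM")],
    "NOM": [("Noun",)],
}
LEXICON = {"book": {"Noun", "Verb"}, "that": {"Det"}, "flight": {"Noun"}}


def left_corner_table(rules):
    """Map each non-terminal to every category that can start one of its derivations."""
    table = {lhs: {rhs[0] for rhs in expansions} for lhs, expansions in rules.items()}
    changed = True
    while changed:  # transitive closure: the left corner of a left corner is a left corner
        changed = False
        for corners in table.values():
            for corner in list(corners):
                extra = table.get(corner, set()) - corners
                if extra:
                    corners |= extra
                    changed = True
    return table


def viable(rhs, word, table, lexicon):
    """Bottom-up filter: can `word` be the first word of some derivation from `rhs`?"""
    tags = lexicon.get(word.lower(), set())
    first = rhs[0]
    return first in tags or bool(table.get(first, set()) & tags)


TABLE = left_corner_table(RULES)
```

With this table, viable(("NP", "VP"), "Book", TABLE, LEXICON) is False, so a top-down parser looking at "Book" would skip S → NP VP and only try S → VP.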

Problems with top-down parsers

– left-recursion: X ⇒* X α, which causes an infinite loop in the derivation!
– ambiguity: not efficiently handled
– recomputation: subtrees can be built multiple times (built, then thrown away during backtracking)

Earley’s algorithm

Earley’s algorithm employs the dynamic programming technique to address the weaknesses of general top-down parsing.

Dynamic programming involves storing of results so they don’t ever need to be recomputed.

Dynamic programming reduces the exponential time requirement to a polynomial one: O(N^3), where N is the length of the input in words.

Data structure

Earley’s algorithm uses a data structure called a chart to store information about the progress of the parse.

A chart contains an entry for each position in the input. A position occurs before the first word, between words, and after the last word.

word1 word2 … wordN

A position is represented by a number; positions in the input are numbered from 0 (at the left) to N (at the right).
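As a small illustration of the numbering, a chart for an N-word input can be held in a list with N + 1 entries, and a pair of positions picks out a stretch of the input. The names make_chart and covered_words are hypothetical.

```python
def make_chart(words):
    """Create one (initially empty) chart entry per position 0..N."""
    return [[] for _ in range(len(words) + 1)]


def covered_words(words, x, y):
    """Words lying between position x and position y."""
    return words[x:y]


words = "Book that flight".split()
chart = make_chart(words)  # chart[0] .. chart[3]
```

Here covered_words(words, 1, 3) gives the words between positions 1 and 3, that is, "that flight".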

Chart details

A chart entry consists of a sequence of states.

A state represents
– a subtree corresponding to a single grammar rule
– information about how much of a rule has been processed
– information about the span of the subtree w.r.t. the input

A state is represented by an annotated grammar rule
– a dot (•) is used to show how much of the rule has been processed
– a pair of positions, [x,y], indicates the span of the subtree w.r.t. the input; x is the position of the left edge of the subtree, and y is the position of the dot.
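A state as described above could be represented along the following lines. This is a minimal sketch; the class name State and its field and method names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    lhs: str      # left-hand side of the grammar rule
    rhs: tuple    # right-hand side symbols
    dot: int      # how many right-hand side symbols have been processed
    start: int    # x: position of the left edge of the subtree
    end: int      # y: position of the dot in the input

    def next_symbol(self):
        """Symbol immediately to the right of the dot, or None if the dot is at the end."""
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None

    def is_complete(self):
        return self.dot == len(self.rhs)

    def __str__(self):
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"{self.lhs} → {' '.join(symbols)} [{self.start},{self.end}]"
```

For example, State("VP", ("Verb", "NP"), 1, 0, 1) stands for a VP that starts at position 0, has seen its Verb, and is still waiting for an NP starting at position 1.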

Three operators on a chart

Predictor
– applies when the NonTerminal to the right of the • in a state is not a POS category (i.e. is not a pre-terminal)
– adds states to current chart entry

Scanner
– applies when the NonTerminal to the right of the • in a state is a POS category (i.e. is a pre-terminal)
– adds states to next chart entry

Completer
– applies when there is no NonTerminal (and hence no Terminal) to the right of the • in a state (i.e. the • is at the end)
– adds states to current chart entry

Predictor

Suppose the rule to which the Predictor applies is:

X → α • NT β [x,y]

The Predictor adds, to the current chart entry, a new state for each possible expansion of NT. For each expansion EX of NT, the state added is

NT → • EX [y,y]

Scanner

Suppose the rule to which the Scanner applies is:

X → α • POS β [x,y]

If the next input word (the word between positions y and y+1) can have the category POS, the Scanner adds a new state to the next chart entry. The new state added is

X → α POS • β [x,y+1]

Completer

Suppose the rule to which the Completer applies is:

X → δ • [x,y]

The Completer adds, to the current chart entry, a new state for each possible reduction using the (now completed) state.

For each state (from any earlier chart entry) of the form

Y → α • X β [w,x]

a new state of the following form is added

Y → α X • β [w,y]
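Putting the three operators together, a recognizer might look like the sketch below. It is an illustration under several assumptions: the grammar fragment is inferred from the "Book that flight" example, the dummy start symbol is written "γ", and all function names are hypothetical. It only reports whether the input is an S; it does not build trees.

```python
from collections import namedtuple

State = namedtuple("State", "lhs rhs dot start end")

# Grammar fragment inferred from the example (an assumption).
RULES = {"S": [("VP",)], "VP": [("Verb", "NP"), ("Verb",)],
         "NP": [("Det", "NOM")], "NOM": [("Noun",)]}
LEXICON = {"book": {"Noun", "Verb"}, "that": {"Det"}, "flight": {"Noun"}}
POS = {"Verb", "Det", "Noun"}


def next_symbol(state):
    return state.rhs[state.dot] if state.dot < len(state.rhs) else None


def add(chart, position, state):
    if state not in chart[position]:  # never store the same state twice
        chart[position].append(state)


def predictor(state, chart):
    y = state.end
    for expansion in RULES[next_symbol(state)]:
        add(chart, y, State(next_symbol(state), expansion, 0, y, y))


def scanner(state, chart, words):
    y = state.end
    if y < len(words) and next_symbol(state) in LEXICON.get(words[y].lower(), ()):
        add(chart, y + 1, state._replace(dot=state.dot + 1, end=y + 1))


def completer(state, chart):
    for waiting in chart[state.start]:  # states whose dot sits at the completed state's left edge
        if next_symbol(waiting) == state.lhs:
            add(chart, state.end, waiting._replace(dot=waiting.dot + 1, end=state.end))


def earley_recognize(words, start="S"):
    chart = [[] for _ in range(len(words) + 1)]
    add(chart, 0, State("γ", (start,), 0, 0, 0))  # dummy start state (name assumed)
    for position in range(len(words) + 1):
        i = 0
        while i < len(chart[position]):  # the entry may grow while it is processed
            state = chart[position][i]
            symbol = next_symbol(state)
            if symbol is None:
                completer(state, chart)
            elif symbol in POS:
                scanner(state, chart, words)
            else:
                predictor(state, chart)
            i += 1
    return State("γ", (start,), 1, 0, len(words)) in chart[-1]
```

Because add refuses duplicates, each state is processed once per chart entry, which is where the saving over a backtracking parser comes from.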

Completer (modification)

In order to recover parse tree information from the chart once parsing is complete, we need to modify the completer slightly.

Each state in the chart must be given a unique identifier (N for state N)

Each time the completer adds a state, it also adds the unique identifier of the state completed to the list of previous states for that new state (which is a copy of an already existing state, waiting for the category which the current state just completed).
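The modification might be sketched as follows. Everything here is an assumption made for illustration: the field names ident and previous, the global counter used to hand out identifiers, and the omission of duplicate checking.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()  # source of unique state identifiers (an assumption of this sketch)


@dataclass
class State:
    lhs: str
    rhs: tuple
    dot: int
    start: int
    end: int
    previous: list = field(default_factory=list)  # identifiers of the completed states used
    ident: int = field(default_factory=lambda: next(_next_id))


def completer(completed, chart):
    """Advance every state waiting for completed.lhs, recording which state completed it."""
    added = []
    for waiting in chart[completed.start]:
        if waiting.dot < len(waiting.rhs) and waiting.rhs[waiting.dot] == completed.lhs:
            new = State(waiting.lhs, waiting.rhs, waiting.dot + 1, waiting.start,
                        completed.end, waiting.previous + [completed.ident])
            chart[completed.end].append(new)
            added.append(new)
    return added


# Tiny demonstration: VP → Verb • NP [0,1] is waiting; NP → Det NOM • [1,3] completes it.
chart = [[], [], [], []]
waiting_vp = State("VP", ("Verb", "NP"), 1, 0, 1)
chart[1].append(waiting_vp)
finished_np = State("NP", ("Det", "NOM"), 2, 1, 3)
chart[3].append(finished_np)
(finished_vp,) = completer(finished_np, chart)
```

After the call, finished_vp.previous holds the identifier of finished_np, so the parse tree can be read off by following these identifiers down from the final S state.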

Initial state of chart

chart[0] chart[1] chart[2] chart[3]

0: γ → • S

Example (from text)

(work through on board)
