
Computer Science Department

David Caley, Thomas Folz-Donahue

Rob Hall, Matt Marzilli

Accurate Parsing

('they worry that air the shows , drink too much , whistle johnny b. goode and watch the other ropes , whistle johnny b. goode and watch closely and suffer through the sale', 2.1730387621600077e-11)


Accurate Parsing: Our Goal

Given a grammar:
• For a sentence S, return the parse tree with the maximum probability conditioned upon S.

arg max_{t in T} P(t | S), where T is the set of possible parse trees of sentence S


Talking Points

Using the Penn-Treebank
• Reading in n-ary trees
• Finding head-tags within n-ary productions
• Converting to binary trees
• Inducing a CFG grammar

Probabilistic CYK
• Handling unary rules
• Dealing with unknowns
• Dealing with run times
  • Beam search, limiting depth of unary rules, further optimizations

Example Parses and Trees

Lexicalization Attempts


Using the Penn-Treebank: Our Training Data

Contains tagged data and n-ary trees drawn from a Wall Street Journal corpus.

Contains some information the parser does not need.

Questionable tagging
• (JJ the) ??

Example…


Using the Penn-Treebank: Handling N-ary trees

( (S (NP-SBJ-1 (NNS Consumers) ) (VP (MD may) (VP (VB want) (S (NP-SBJ (-NONE- *-1) ) (VP (TO to) (VP (VB move) (NP (PRP$ their) (NNS telephones) ) (ADVP-DIR (NP (DT a) (RB little) ) (RBR closer) (PP (TO to) (NP (DT the) (NN TV) (NN set) )))))))) ))

Functional tags such as NP-SBJ-1 are ignored
• We simply call this an NP

-NONE- tags are used for traces; these are ignored as well.
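A minimal sketch of this cleanup, assuming trees are read into nested lists [label, child, ...] with words as plain strings; the helper names strip_label and clean_tree are illustrative, not the project's actual code.

import re

def strip_label(label):
    """Drop functional tags and coindexing: NP-SBJ-1 -> NP."""
    head = re.split(r'[-=]', label)[0]
    return head or label                 # labels like -NONE- keep their form

def clean_tree(tree):
    """Remove -NONE- (trace) subtrees and strip functional tags everywhere else."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    if label == '-NONE-':
        return None                      # a trace: drop the whole subtree
    kept = [c for c in (clean_tree(ch) for ch in children) if c is not None]
    if not kept:
        return None                      # every child was a trace
    return [strip_label(label), *kept]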


Using the Penn-Treebank: Head-Tag Finding Algorithm

For a context-free rule X -> Y1 … Yn, a head-finding function determines which child is the “head” of the rule.

In general the head could be any of Y1 … Yn.

The head is the most important child tag.

Head-tag algorithm as outlined in Collins' thesis
• Determines the head-tags that will be used later for the binary tree conversion


Using the Penn-Treebank: Head-Tag Finding Algorithm

If nothing is found during the list traversal, the head-tag defaults to the left-most or right-most element.


Using the Penn-Treebank: Head-Rule Finding Algorithm

Rules for NPs are a bit different (a sketch of this rule follows the list)
• If the last word is tagged POS, return (last word)
• Else search from right to left for the first child which is in the set {NN, NNP, NNPS, NNS, NX, POS, JJR}
• Else search from left to right for the first child which is an NP
• Else search from right to left for the first child which is in the set {$, ADJP, PRN}
• Else do the same with the set {CD}
• Else do the same with the set {JJ, JJS, RB, QP}
• Else return the last word
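A minimal sketch of the NP head rule listed above (following Collins' thesis); the list-of-tags representation and the name np_head_index are illustrative assumptions rather than the project's code.

# Priority list mirrors the bullets above: (search direction, tag set).
NP_HEAD_PRIORITY = [
    ('right', {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}),
    ('left',  {'NP'}),
    ('right', {'$', 'ADJP', 'PRN'}),
    ('right', {'CD'}),
    ('right', {'JJ', 'JJS', 'RB', 'QP'}),
]

def np_head_index(children):
    """children: the child tags of an NP production; returns the head child's index."""
    if children and children[-1] == 'POS':          # possessive: last word is the head
        return len(children) - 1
    for direction, tags in NP_HEAD_PRIORITY:
        indices = range(len(children))
        if direction == 'right':
            indices = reversed(indices)
        for i in indices:
            if children[i] in tags:
                return i
    return len(children) - 1                        # fall back to the last word

# e.g. np_head_index(['DT', 'JJ', 'NN']) -> 2 (the NN)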


Using the Penn-Treebank: Binary Tree Conversion

Now we put the head-tags to use
• Necessary for using the CFG grammar with probabilistic CYK

A general n-ary rule (H is the head child, Li are the children to its left, Ri the children to its right):
R -> Li Li-1 … L1 L0 H R0 R1 … Ri-1 Ri

On the right side of the head-tag we recursively split off the last element to make a new binary rule (left-recursive):
Li Li-1 … L1 L0 H R0 R1 … Ri-1   |   Ri

On the left side we do the same by removing the first element (right-recursive):
Li   |   Li-1 … L1 L0 H
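A minimal sketch of this head-outward binarization, assuming nested-list trees and a head index already computed by the head-finding step; the synthetic intermediate label (parent label plus '|') is an illustrative convention, not necessarily the one the project used.

def binarize(label, children, head):
    """children: list of subtrees of an n-ary node; head: index of the head child.
    Returns an equivalent binary tree, peeling one child off at a time."""
    if len(children) <= 2:
        return [label, *children]
    if head < len(children) - 1:
        # Material remains to the right of the head: split off the last element.
        inner = binarize(label + '|', children[:-1], head)
        return [label, inner, children[-1]]
    # Otherwise split off the first element on the left side.
    inner = binarize(label + '|', children[1:], head - 1)
    return [label, children[0], inner]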


Using the Penn-Treebank: Grammar Induction Procedure

After we have binary trees we can easily begin to identify rules and record their frequencies
• Identify every production and save the counts in a Python dictionary

Frequencies are cached in a local file for later use and read back in on subsequent executions.

No immediate smoothing is done on the probabilities; the grammar is later trimmed to help with performance.
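A minimal sketch of the counting step under the same nested-list tree assumption; the dictionary names and the unsmoothed relative-frequency estimate (which matches the "no smoothing" point above) are illustrative.

from collections import defaultdict

rule_count = defaultdict(int)    # (lhs, rhs tuple) -> frequency
lhs_count = defaultdict(int)     # lhs -> total frequency

def count_productions(tree):
    """Record every production in a binarized tree; words are plain strings."""
    if isinstance(tree, str):
        return
    lhs = tree[0]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in tree[1:])
    rule_count[(lhs, rhs)] += 1
    lhs_count[lhs] += 1
    for child in tree[1:]:
        count_productions(child)

def rule_prob(lhs, rhs):
    """Unsmoothed relative-frequency estimate P(lhs -> rhs | lhs)."""
    return rule_count[(lhs, rhs)] / lhs_count[lhs]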


Probabilistic CYK: The Parsing Step

We use a probabilistic CYK implementation to parse with our CFG grammar and to assign probabilities to the final parse trees.
• Useful to provide multiple parses and disambiguate sentences

New concerns
• Unary rules and their lengths
• Runtime (a result of the very large grammar)


Probabilistic CYK: Handling Unary Rules within the Grammar

Unary rules of the form X -> Y or X -> a are ubiquitous in our grammar.

• The closure of a constituent is needed to determine all the unary productions that can lead to that constituent.
• Def: Closure(X) = {X} ∪ ⋃ { Closure(Y) | Y -> X }, i.e. all nonterminals from which X can be reached by a chain of unary rules.
• We compute this iteratively, maintaining a closed list and limiting the depth to prevent possible infinite loops (a sketch follows this list).
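A minimal sketch of a depth-limited unary closure computed iteratively, as described above; the unary_rules representation (child symbol -> list of (parent, probability) pairs) and the default depth limit are illustrative assumptions.

def unary_closure(symbol, unary_rules, max_depth=5):
    """Return {ancestor: best probability} for every nonterminal that can
    reach `symbol` through a chain of unary rules of length <= max_depth."""
    best = {symbol: 1.0}                 # doubles as the closed list
    frontier = [(symbol, 1.0, 0)]
    while frontier:
        child, prob, depth = frontier.pop()
        if depth >= max_depth:
            continue
        for parent, p in unary_rules.get(child, []):
            new_prob = prob * p
            if new_prob > best.get(parent, 0.0):
                best[parent] = new_prob
                frontier.append((parent, new_prob, depth + 1))
    return best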


Probabilistic CYK: Dealing with Run times

Beam search
• Limit the number of nodes saved in each cell of the CYK dynamic programming table.
• Using beam width k, all generations are kept sorted and the k best are saved for the next iteration.
• Experiences with k = 100, 200, 1000?

list size <= k
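A minimal sketch of filling one CYK cell with beam pruning, so each cell keeps at most k entries as stated above; the binary_rules representation ((left tag, right tag) -> list of (parent, rule probability)) and the function name are illustrative assumptions.

import heapq

def combine_cells(left_cell, right_cell, binary_rules, k):
    """left_cell / right_cell: dict tag -> best probability for two adjacent spans.
    Returns the dict for the combined span, pruned to its k best entries."""
    candidates = {}
    for ltag, lprob in left_cell.items():
        for rtag, rprob in right_cell.items():
            for parent, rule_prob in binary_rules.get((ltag, rtag), []):
                prob = rule_prob * lprob * rprob
                if prob > candidates.get(parent, 0.0):
                    candidates[parent] = prob
    # Beam: keep only the k most probable constituents for this cell.
    return dict(heapq.nlargest(k, candidates.items(), key=lambda kv: kv[1]))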


Probabilistic CYK: Dealing with Run Times

Another optimization was to remove all production rules with frequency < fc
• Used fc = 1, 2, …

We also limited the depth when calculating the unary rules (closure) of a constituent present in our CYK table
• Extensive unary rules were found to greatly slow down our parser
• Long chains of unary productions also have extremely low probabilities, so they are commonly pruned by beam search anyway
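A minimal sketch of the frequency cutoff, assuming the rule_count dictionary built during grammar induction; the helper name trim_grammar is illustrative.

def trim_grammar(rule_count, fc):
    """Drop every production seen fewer than fc times before estimating probabilities."""
    return {rule: n for rule, n in rule_count.items() if n >= fc}

# e.g. trimmed = trim_grammar(rule_count, fc=2)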


Probabilistic CYK: Random Sentences and Example Trees

Some random sentences generated from our grammar, with their associated probabilities (a sampling sketch follows the examples).

('buy jam , cocoa and other war-rationed goodies',0.0046296296296296294)

('cartoonist garry trudeau refused to impose sanctions , including petroleum equipment , which go into semiannual payments , including watches , including three , which the federal government , the same company formed by mrs. yeargin school district would be confidential', 2.9911073159300768e-33)

('33 men selling individual copies selling securities at the central plaza hotel die', 7.4942533128815141e-08)


Probabilistic CYK: Random Sentences and Example Trees

('young people believe criticism is led by south korea', 1.3798001044090654e-11)

('the purchasing managers believe the art is the often amusing , often supercilious , even vicious chronicle of bank of the issue yen-support intervention', 7.1905882731776209e-1)
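Sentences like these can be drawn by sampling top-down from the induced PCFG. A minimal sketch, assuming a rules dictionary mapping each nonterminal to (right-hand side, probability) pairs; the function and the depth cutoff are illustrative, not the project's actual generator.

import random

def sample_sentence(symbol, rules, depth=0, max_depth=30):
    """Sample one derivation from the grammar; returns (words, probability).
    Anything without an entry in rules is treated as a terminal word."""
    if symbol not in rules or depth > max_depth:
        return [symbol], 1.0
    options = rules[symbol]                                   # list of (rhs tuple, prob)
    rhs, prob = random.choices(options, weights=[p for _, p in options], k=1)[0]
    words, total = [], prob
    for child in rhs:
        w, p = sample_sentence(child, rules, depth + 1, max_depth)
        words.extend(w)
        total *= p
    return words, total

# e.g. words, prob = sample_sentence('S', rules)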


Example parse tree for the first random sentence, with head words shown at each node:

S(buy)
+--VP(buy)
+--VB(buy)
| +--buy
+--NP(jam)
+--NP(jam)-NP(goodies)
| +--NP(jam)-CC(and)
| | +--NP(jam)-NP(cocoa)
| | | +--NP(jam)
| | | | +--NN(jam)
| | | | +--jam
| | | +--,(,)
| | | +--,
| | +--NP(cocoa)
| | +--NN(cocoa)
| | +--cocoa
| +--CC(and)
| +--and
+--NP(goodies)
+--JJ(other)
| +--other
+--NP(goodies):JJ(other)-
+--JJ(war-rationed)
| +--war-rationed
+--NNS(goodies)
+--goodies

The same parse shown with the binarized grammar's intermediate categories:

S:-(VP)
+--VP
+--VP:-(VB)-NP
+--VP:-(VB)
| +--VB
| +--buy
+--NP
+--NP:-(NP)-NP
+--NP:-(NP)-CC
| +--NP:-(NP)-NP
| | +--NP:-(NP)-,
| | | +--NP:-(NP)
| | | | +--NP
| | | | +--NP:-(NN)
| | | | +--NN
| | | | +--jam
| | | +--,
| | | +--,
| | +--NP
| | +--NP:-(NN)
| | +--NN
| | +--cocoa
| +--CC
| +--and
+--NP
+--NP:-(NNS)JJ-NNS
+--JJ
| +--other
+--NP:-(NNS)JJ-NNS
+--JJ
| +--war-rationed
+--NP:-(NNS)
+--NNS
+--goodies


Accurate Parsing Conclusion

• Massive lexicalized grammar
• Working probabilistic parser

Future work
• Handle sparsity
• Smooth probabilities