
Variational Inference for Adaptor Grammars

Shay Cohen, School of Computer Science, Carnegie Mellon University

David Blei, Computer Science Department, Princeton University

Noah Smith, School of Computer Science, Carnegie Mellon University


Outline

The lifecycle of unsupervised learning:


Outline

We give a new representation for an existing model (adaptor grammars)

This representation leads to a new variational inference algorithm for adaptor grammars

We do a sanity check on word segmentation, comparing to state-of-the-art results

Our inference algorithm makes unsupervised dependency parsing with adaptor grammars possible


Problem 1 - PP Attachment

I saw the boy with the telescope

Two parses (drawn as trees on the slide):

(S (NP (N I)) (VP (V saw) (NP the boy) (PP with the telescope)))      [PP attached to the VP]
(S (NP (N I)) (VP (V saw) (NP (NP the boy) (PP with the telescope)))) [PP attached to the NP]


Problem 2 - Word Segmentation

Input: Matthewslikeswordfighting
Candidate segmentations: "Matthews like sword fighting" vs. "Matthews likes word fighting"


What is missing?

Context could resolve this ambiguity

But we want unsupervised learning...

Where do we get the context?

. . . . . .


Problem 1 - PP Attachment

(S (NP The boy with the telescope) (V entered) (NP the park))

I saw the boy with the telescope

(the same two parses as before: PP attached to the VP vs. PP attached to the NP)


Problem 2 - Word Segmentation

Word fighting is the new hobby of computational linguists. Mr. Matthews is a computational linguist.

Input: Matthewslikeswordfighting
Candidate segmentations: "Matthews like sword fighting" vs. "Matthews likes word fighting"


Dreaming Up Patterns

Context helps. Where do we get it? Adaptor grammars (Johnson et al., 2006)

Define a distribution over trees

New samples depend on the history: “rich get richer” dynamics

Dream up “patterns” as we go along


Adaptor Grammars

Use the Pitman-Yor process with a PCFG as the base distribution

To make it fully Bayesian, we also place a Dirichlet prior over the PCFG rules

Originally represented using the Chinese restaurant process(CRP)

CRP is convenient for sampling – not for variational inference
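To make the “rich get richer” dynamics concrete, here is a minimal sketch (not the authors' code) of the Pitman-Yor Chinese restaurant process, with an arbitrary base distribution standing in for the PCFG; the discount and concentration values are illustrative.

```python
import random

def pitman_yor_crp(base_draw, n, discount=0.5, concentration=1.0):
    """Sample n values from a Pitman-Yor CRP whose tables are labeled by
    draws from base_draw (the base distribution, e.g. a PCFG over subtrees)."""
    counts, labels, samples = [], [], []
    for _ in range(n):
        # existing table k is chosen with weight (count_k - discount);
        # a new table is opened with weight (concentration + discount * #tables)
        weights = [c - discount for c in counts]
        weights.append(concentration + discount * len(counts))
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):            # new table: draw a fresh "pattern"
            labels.append(base_draw())
            counts.append(1)
        else:                           # reuse a cached pattern: rich get richer
            counts[k] += 1
        samples.append(labels[k])
    return samples

# toy base distribution over "patterns"
print(pitman_yor_crp(lambda: random.choice(["the boy", "with the telescope"]), 10))
```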


Variational Inference in a Nutshell

“Posterior inference” requires that we find parse trees z1, ..., zn given raw sentences x1, ..., xn

Mean-field approximation: take all hidden variables z1, ..., zn and the parameters θ, and find an approximate posterior of the form

q(z1, ..., zn, θ) = q(θ) ∏_{i=1}^{n} q(zi)

(makes inference tractable)

Makes independence assumptions in the posterior

That’s all! Almost. We need a manageable representation for z1, ..., zn and θ
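For reference (standard mean-field background, not stated on the slide), q is chosen to maximize the evidence lower bound; this is the objective behind “optimization” in the comparison on the next slide:

```latex
\log p(x_{1:n}) \;\ge\;
\mathbb{E}_{q}\big[\log p(x_{1:n}, z_{1:n}, \theta)\big]
\;-\;
\mathbb{E}_{q}\Big[\log q(\theta)\prod_{i=1}^{n} q(z_i)\Big]
```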


Sampling vs. Variational Inference

                  MCMC sampling    variational inference
convergence       guaranteed       local maximum
speed             slow             fast
algorithm         randomized       objective optimization
parallelization   non-trivial      easy


Stick Breaking Representation

Sticks are sampled from the GEM distribution. Every number shown on this slide belongs to θ.
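A minimal sketch (illustrative, not the paper's notation) of a truncated GEM / stick-breaking draw for the Pitman-Yor process; the parameter names and truncation level are assumptions.

```python
import numpy as np

def stick_breaking(discount, concentration, num_sticks, seed=0):
    """Truncated stick-breaking (GEM) weights for a Pitman-Yor process.
    Returns pi_1, ..., pi_K; the mass left after K sticks is what the
    truncation discards."""
    rng = np.random.default_rng(seed)
    remaining, weights = 1.0, []
    for k in range(1, num_sticks + 1):
        v = rng.beta(1.0 - discount, concentration + k * discount)
        weights.append(remaining * v)   # pi_k = v_k * prod_{j<k} (1 - v_j)
        remaining *= 1.0 - v
    return np.array(weights)

print(stick_breaking(discount=0.5, concentration=1.0, num_sticks=10))
```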

(Slides 14–18: figure-only continuation of the stick-breaking illustration.)


Truncated Stick Approximation


Sanity Check - Word Segmentation

The task is to segment a sequence of phonemes into words. Example: yuwanttulUk&tDIs → yu want tu lUk &t DIs

Models language acquisition in children (using the corpus from Brent and Cartwright, 1996)

The corpus includes 9,790 utterances

Adaptor grammars have been applied to this corpus before, with three different grammars

Baseline: Sampling method from Johnson and Goldwater, 2009


Word Segmentation - Grammars

Unigram Grammar

Example tree (as drawn on the slide): (Sentence (Word yu) (Word want) ...)

Sentence → Word+

Word → Char+

“Word” is adapted (hence, if something was a Word constituent previously, it is more likely to appear again)

There are additional grammars: a collocation grammar and a syllable grammar (which take more information about the language into account)

Words are segmented according to “Word” constituents (see the sketch after this list)

None of the grammars is recursive

Used in Johnson and Goldwater (2009)
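As a small illustration of the segmentation read-off: once a sentence is parsed, the segmentation is just the yield of each Word constituent. A hedged sketch (the bracketed-tree format and function name are assumptions, not the authors' code):

```python
import re

def segment_from_tree(tree_str):
    """Collect the yield of every Word constituent in a bracketed parse,
    e.g. '(Sentence (Word y u) (Word w a n t))' -> ['yu', 'want'].
    Assumes a flat list of characters under each Word."""
    words = re.findall(r"\(Word([^()]*)\)", tree_str)
    return ["".join(w.split()) for w in words]

print(segment_from_tree("(Sentence (Word y u) (Word w a n t) (Word t u))"))
```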


Word Segmentation - Results

grammar     our paper   J&G 2009
GUnigram    0.84        0.81
GColloc     0.86        0.86
GSyllable   0.83        0.89

J&G 2009 refers to Johnson and Goldwater (2009); their best result is shown

Scores reported are F1 measure


Variants

Model: Pitman-Yor Process vs. Dirichlet Process (did not have much effect)

Inference: Fixed Stick vs. Dynamic Stick Expansion (fixed stick is better)

Decoding: Minimum Bayes Risk vs. Viterbi (MBR does better)

See paper for details!


Running Time

Running time (clock time) of the sampler and of variational inference is approximately the same (note that the implementations are different)

However, variational inference can be parallelized

Clock time is reduced by a factor of 2.8 when parallelizing across 20 weaker CPUs (see the sketch below)
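The parallelization claim rests on the per-sentence updates of q(zi) being independent given the current q(θ). A hedged sketch of fanning that step out over a process pool; all names here are hypothetical, and the real per-sentence update would be an inside-outside pass returning expected rule counts:

```python
from multiprocessing import Pool

def update_qz(sentence):
    """Hypothetical stand-in for the per-sentence mean-field update of q(z_i)."""
    return {"length": len(sentence)}   # placeholder statistics

if __name__ == "__main__":
    corpus = [["yu", "want", "tu"], ["lUk", "&t", "DIs"]]
    with Pool() as pool:
        stats = pool.map(update_qz, corpus)   # the E-step fans out per sentence
    print(stats)
```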


Syntax and Power Law

(Plot: log frequency vs. log rank of constituents, for English, Chinese, Portuguese, and Turkish.)

Motivating adaptor grammars for unsupervised parsing: a plot of the log rank of constituents vs. their log frequency.


Recursive Grammars


Recursive Grammars - Solution

Our finite approximation of the stick zeros out all “bad” events in the variational distribution

Equivalent to inference when assuming the model is:

p′(x, z) = p(x, z) · I[(x, z) ∉ bad] / ∑_{(x, z) ∉ bad} p(x, z)

where p is the original adaptor grammar model, which gives non-zero probability to bad events, and I is a 0/1 indicator


Unsupervised Parsing Setting

Experiments on the English Penn Treebank

Stripped off punctuation and kept only part-of-speech tags

Used adaptor grammars with the Dependency Model with Valence (DMV; Klein and Manning, 2004)

DMV has a PCFG representation (Smith, 2006)

We “adapt” the nonterminals that head noun constituents


Unsupervised Parsing - Results

model                    Viterbi   MBR
non-Bayesian             45.8      46.1
Dirichlet prior          45.9      46.1
with adaptor grammars    48.3      50.2

Results are attachment accuracy: the fraction of parents correctly identified

A gain over the vanilla Dirichlet, which is the prior used with adaptor grammars on the PCFG rules

Other priors (instead of the Dirichlet prior) give performance of ∼60; using them with adaptor grammars is future work


Summary

We described a variational inference algorithm for adaptor grammars

We showed it can lead to improved performance for various grammars

We showed it can be faster than sampling when parallelization is used

We applied adaptor grammars to unsupervised dependency parsing


Thanks! Questions?


Sampling vs. Variational Inference

(Plot: −log bound / likelihood vs. time step, comparing sampling and variational inference.)