Recursive Autoencoders for Paraphrase Detection (Socher et al.)



DESCRIPTION

Literature presentation for Dartmouth deep learning seminar.

TRANSCRIPT

Page 1: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection1

Richard Socher, Eric Huang, Jeffrey Pennington, Andrew Ng, Christopher Manning

Feynman Liang

May 16, 2013

1Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. Advances in Neural Information Processing Systems (NIPS 2011).


Page 2: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Motivation

Consider the following phrases:

The judge also refused to postpone the trial date of Sept. 29.

Obus also denied a defense motion to postpone the September trial date.


Page 3: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Problem

Given: A pair of sentences S1 = (w1, . . . , wm) and S2 = (w1, . . . , wn), w ∈ V

Task: Classify whether S1 and S2 are paraphrases or not


Page 4: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Overview

Background

Neural Language Models

Recursive Autoencoders

Contributions

Unfolding RAEs

Dynamic Pooling of Similarity Matrix

Experiments


Page 5: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Prior Work

Similarity Metrics

n-gram Overlap / Longest Common Subsequence

Ordered Tree Edit Distance

WordNet hypernyms

Language Models

n-gram HMMs: P(wt | w1, . . . , wt−1) ≈ P(wt | wt−n+1, . . . , wt−1)

Log-Linear Models

P(y | w1:t; θ) ≈ exp(θᵀ f(w1:t, y)) / ∑y′∈Y exp(θᵀ f(w1:t, y′))

Neural Language Models2

2R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.


Page 6: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Neural Language Models

Vocabulary V

Embedding Matrix L ∈ Rn×|V|

L : V → Rn

Each column of L “embeds” w ∈ V in an n-dimensional feature space

Captures semantic and syntactic information about a word

A sentence S = (w1, . . . , wm), wi ∈ V is represented as an ordered list (x1, . . . , xm), xi ∈ Rn
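A minimal sketch of this representation, assuming a hypothetical toy vocabulary and a randomly initialized embedding matrix (not the embeddings used in the paper):

```python
import numpy as np

n = 100                                        # embedding dimension
vocab = {"the": 0, "judge": 1, "refused": 2}   # hypothetical toy vocabulary V
L = np.random.randn(n, len(vocab))             # embedding matrix L in R^{n x |V|}

def embed(sentence):
    """Map a sentence (w1, ..., wm) to its ordered list of word vectors (x1, ..., xm)."""
    return [L[:, vocab[w]] for w in sentence]

xs = embed(["the", "judge", "refused"])        # m vectors, each in R^n
```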


Page 7: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Neural Language Models


Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.

Page 8: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Assume we are given a binary parse tree T :

A binary parse tree is a list of triplets of parents with children: (p → (c1, c2))

c1, c2 are either a terminal word vector xi ∈ Rn or a non-terminal parent yi ∈ Rn

Figure: Parse tree for ((y1 → (x2, x3)), (y2 → (x1, y1))), ∀x, y ∈ Rn


Page 9: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Non-terminal parent p computed as

p = f (We [c1; c2] + b)

f is an activation function (e.g. sigmoid, tanh)

We ∈ Rn×2n is the encoding matrix to be learned

[c1; c2] ∈ R2n is the concatenation of the children

b is a bias term

Figure: y2 = f(We [x1; f(We [x2; x3] + b)] + b)
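A minimal sketch of the encoding step with a tanh nonlinearity (We and b are placeholder parameters, not trained values):

```python
import numpy as np

n = 100
We = 0.01 * np.random.randn(n, 2 * n)    # encoding matrix We (placeholder values)
b = np.zeros(n)                          # bias term

def encode(c1, c2):
    """Parent p = f(We [c1; c2] + b) for two children in R^n, with f = tanh."""
    return np.tanh(We @ np.concatenate([c1, c2]) + b)

# encode the example tree ((y1 -> (x2, x3)), (y2 -> (x1, y1)))
x1, x2, x3 = np.random.randn(3, n)
y1 = encode(x2, x3)
y2 = encode(x1, y1)
```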


Page 10: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Wd inverts We s.t. [c′1; c′2] = f(Wd p + bd) is the decoding of p

Erec(p) = ‖[c1; c2] − [c′1; c′2]‖₂²

To train:

Minimize Erec(T) = ∑p∈T Erec(p) = Erec(y1) + Erec(y2)

Add a length normalization layer p = p/‖p‖₂ to avoid the degenerate solution
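A sketch of the decoding and reconstruction error, continuing the encoding sketch above (Wd and bd are again placeholder parameters):

```python
Wd = 0.01 * np.random.randn(2 * n, n)    # decoding matrix Wd
bd = np.zeros(2 * n)

def decode(p):
    """Reconstruct [c1'; c2'] = f(Wd p + bd) and split into the two children."""
    c = np.tanh(Wd @ p + bd)
    return c[:n], c[n:]

def e_rec(c1, c2, p):
    """Erec(p) = ||[c1; c2] - [c1'; c2']||^2."""
    c1p, c2p = decode(p)
    return np.sum((np.concatenate([c1, c2]) - np.concatenate([c1p, c2p])) ** 2)

def normalize(p):
    """Length normalization p / ||p|| to avoid the degenerate solution."""
    return p / np.linalg.norm(p)

print(e_rec(x2, x3, y1) + e_rec(x1, y1, y2))   # Erec(T) = Erec(y1) + Erec(y2)
```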


Page 11: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Unfolding RAEs

Measure reconstruction error down to the terminal xi's:

For a node y that spans words i to j:

Erec(y(i,j)) = ‖[xi; . . . ; xj] − [x′i; . . . ; x′j]‖₂²

Hidden layer norms no longer shrink

Children with larger subtrees get more weight
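A sketch of the unfolding reconstruction, continuing the sketches above: the root encoding is decoded all the way back to the leaves. Representing the tree as nested tuples is an assumption made here for illustration.

```python
def unfold(p, shape):
    """Decode p recursively following `shape` (a leaf marker or a (left, right) tuple).

    Returns the reconstructed leaf vectors x_i', ..., x_j' in order.
    """
    if not isinstance(shape, tuple):         # leaf reached: p is the reconstructed word
        return [p]
    c1p, c2p = decode(p)
    return unfold(c1p, shape[0]) + unfold(c2p, shape[1])

def e_rec_unfolding(p, shape, leaves):
    """Erec(y_(i,j)) = ||[x_i; ...; x_j] - [x_i'; ...; x_j']||^2."""
    recon = np.concatenate(unfold(p, shape))
    return np.sum((np.concatenate(leaves) - recon) ** 2)

# error at the root y2 of the example tree (x1 (x2 x3))
print(e_rec_unfolding(y2, ("x1", ("x2", "x3")), [x1, x2, x3]))
```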


Page 12: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Deep RAEs

h = f(We⁽¹⁾ [c1; c2] + be⁽¹⁾)

p = f(We⁽²⁾ h + be⁽²⁾)
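A sketch of the two-layer (deep) encoder under the same placeholder-parameter assumptions, using 200 hidden units as in the experimental setup later in the talk:

```python
h_dim = 200
We1, be1 = 0.01 * np.random.randn(h_dim, 2 * n), np.zeros(h_dim)
We2, be2 = 0.01 * np.random.randn(n, h_dim), np.zeros(n)

def encode_deep(c1, c2):
    """h = f(We(1) [c1; c2] + be(1)), then p = f(We(2) h + be(2))."""
    h = np.tanh(We1 @ np.concatenate([c1, c2]) + be1)
    return np.tanh(We2 @ h + be2)
```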


Andrew Ng. Autoencoders (CS294A Lecture notes).

Page 13: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Training RAEs

Data: A set of parse trees

Objective: Minimize

J = (1/|T|) ∑n∈T Erec(n; We) + (λ/2)‖We‖²

Gradient descent (backpropagation, L-BFGS)

Non-convex, smooth convergence ⟹ local optima
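A sketch of the regularized objective over a set of trees, reusing the functions above. The paper optimizes this with L-BFGS on the full gradient, which is not reproduced here, and the triple-based tree representation is an assumption.

```python
lam = 1e-5   # regularization weight (lambda_RAE from the experiments)

def objective(trees):
    """J = (1/|T|) sum_n Erec(n; We) + (lambda/2) ||We||^2.

    `trees` is a list of (root_vector, tree_shape, leaf_vectors) triples.
    """
    rec = np.mean([e_rec_unfolding(p, shape, leaves) for p, shape, leaves in trees])
    return rec + 0.5 * lam * np.sum(We ** 2)

print(objective([(y2, ("x1", ("x2", "x3")), [x1, x2, x3])]))
```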


Page 14: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Sentence Similarity Matrix

For two sentences S1, S2 of lengths n and m, concatenate the terminal xi's (in sentence order) with the non-terminal yi's (depth-first, right-to-left)

Compute the similarity matrix S ∈ R(2n−1)×(2m−1), where Si,j is the ℓ2 distance between the ith element of S1's feature list and the jth element of S2's feature list
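A sketch of assembling the similarity matrix from the two sentences' node vectors (the depth-first node ordering is assumed to be handled elsewhere; here the feature lists are simply given):

```python
def similarity_matrix(feats1, feats2):
    """S[i, j] = ||a_i - c_j||, the Euclidean distance between the i-th vector
    of S1's feature list and the j-th vector of S2's feature list."""
    S = np.zeros((len(feats1), len(feats2)))
    for i, a in enumerate(feats1):
        for j, c in enumerate(feats2):
            S[i, j] = np.linalg.norm(a - c)
    return S

# (2n - 1) = 5 node vectors for the 3-word example sentence above
S = similarity_matrix([x1, x2, x3, y1, y2], [x1, x2, x3, y1, y2])
```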


Page 15: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Dynamic Pooling

Sentence lengths may vary ⟹ the dimensionality of S may vary

Want to map S ∈ R(2n−1)×(2m−1) to Spooled ∈ Rnp×np with np constant

Dynamically partition rows and columns of S into np equal parts

Min. pool over each part

Normalize µ = 0, σ = 1 and pass on to classifier (e.g. softmax)
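A minimal sketch of dynamic min-pooling to a fixed np × np grid. The equal partitioning below is one reasonable choice; the paper's handling of matrices smaller than np and of uneven splits is omitted.

```python
def dynamic_min_pool(S, n_p=15):
    """Map a variable-size S to an n_p x n_p Spooled by min-pooling over a grid."""
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.zeros((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    # normalize to mean 0, standard deviation 1 before the classifier
    return (pooled - pooled.mean()) / pooled.std()

Spooled = dynamic_min_pool(S, n_p=5)   # small n_p so the toy 5x5 example above works
```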


Page 16: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Qualitative Evaluation of Unsupervised Feature Learning

Dataset

150,000 sentences from the NYT and AP sections of the Gigaword corpus for RAE training

Setup

R100 unsupervised feature vectors provided by Turian et al.3 for initial word embeddings

Stanford parser4 to extract parse trees

Hidden layer h set to 200 units in both standard and unfolding RAE (0 in the NN qualitative evaluation)

3J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394, 2010.

4D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.

Page 17: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Nearest Neighbor

Figure: Comparison of nearest neighbors under the ℓ2 norm


Page 18: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Decoding

Figure: Phrase reconstruction via recursive decoding


Page 19: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Task

Dataset

Microsoft Research paraphrase corpus (MSRP)5

5,801 sentence pairs, 3,900 labeled as paraphrases

5B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING, 2004.


Page 20: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Task

Setup

4,076 training pairs (67.5% positive), 1,725 test pairs (66.5% positive)

For all (S1, S2) in training data, (S2,S1) also added

Negative examples selected for high lexical overlap

Add features ∈ {0, 1} to Spooled related to the sets of numbers in S1 and S2 (see the sketch at the end of this slide):

Numbers in S1 = numbers in S2

(Numbers in S1 ∪ numbers in S2) ≠ ∅

Numbers in one sentence ⊂ numbers in the other

Softmax classifier on top of Spooled

Hyperparameter selection via 10-fold cross-validation:

np = 15

λRAE = 10⁻⁵

λsoftmax = 0.05

Two annotators (83% agreement), third to resolve conflict
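A sketch of the three binary number features referenced above. The regex-based number extraction is an assumption; the exact extraction rule is not specified in the slides.

```python
import re

def number_features(s1, s2):
    """Three features in {0, 1} over the sets of numbers appearing in S1 and S2."""
    n1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    n2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return [
        float(n1 == n2),              # numbers in S1 = numbers in S2
        float(len(n1 | n2) > 0),      # (numbers in S1 ∪ numbers in S2) ≠ ∅
        float(n1 < n2 or n2 < n1),    # numbers in one sentence ⊂ numbers in the other
    ]

print(number_features("the trial date of Sept. 29",
                      "postpone the September 29 trial date"))
```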


Page 21: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Example Results


Page 22: Recursive Autoencoders for Paraphrase Detection (Socher et al)

State of the Art


“Paraphrase Identification (State of the Art).” ACLWiki. Web. 14 May 2013.

Page 23: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Comparison of Unsupervised Feature Learning Methods

Setup

Dynamic pooling layer

Hyperparameters optimized over C.V. set

Results

Recursive averaging: 75.9%

Standard RAE: 75.5%

Unfolding RAE without hidden layers: 76.8%

Unfolding RAE with hidden layers: 76.6%


Page 24: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Evaluating Contribution of the Dynamic Pooling Layer

Setup

Unfolding RAE used to compute S

Hyperparameters optimized over C.V. set

Results

S histogram: 73.0%

Only added number features: 73.2%

Only Spooled: 72.6%

Top URAE node: 74.2%

Spooled + number features: 76.8%


Page 25: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Critique

Pros:

Novel unfolding reconstruction error metric, dynamic pooling layer

State of the art (2011) performance

Cons:

Vague training details / time to convergence

Unconvincing improvement over baselines (recursive averaging, top RAE node)

Training requires labeled parse trees (unsupervised performance depends on parser accuracy)

Representing phrases in the same feature space as words


Page 26: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Critique

Suggestions:

Add additional features to Spooled

Overlap pooling regions

Let We vary depending on the labels of the children in the parse tree

Capture the operational meaning of a word to a sentence (MV-RNN6)

p = f(We [c1; c2] + b)  →  p = f(We [Ba + b0; Ab + a0] + p0)
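A minimal sketch of the suggested matrix-vector composition, continuing the earlier sketches: each word carries a vector and a matrix, and each word's matrix modifies the other word's vector. The extra bias terms follow the slide's notation and are placeholders.

```python
A = 0.01 * np.random.randn(n, n)   # matrix associated with word vector a
B = 0.01 * np.random.randn(n, n)   # matrix associated with word vector b
a0, b0, p0 = np.zeros(n), np.zeros(n), np.zeros(n)

def compose_mv(a, b):
    """p = f(We [B a + b0; A b + a0] + p0)."""
    return np.tanh(We @ np.concatenate([B @ a + b0, A @ b + a0]) + p0)

p = compose_mv(x1, x2)
```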

6Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012.
