Recursive Autoencoders for Paraphrase Detection (Socher et al.)



DESCRIPTION

Literature presentation for Dartmouth deep learning seminar.

TRANSCRIPT

Page 1: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection1

Richard Socher, Eric Huang, Jeffrey Pennington, Andrew Ng, Christopher Manning

Feynman Liang

May 16, 2013

1Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. Advances in Neural Information Processing Systems (NIPS 2011).


Page 2: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Motivation

Consider the following phrases:

The judge also refused to postpone the trial date of Sept. 29.

Obus also denied a defense motion to postpone the September trial date.


Page 3: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Problem

Given: A pair of sentences S1 = (w1, . . . , wm) and S2 = (w1, . . . , wn), w ∈ V

Task: Classify whether S1 and S2 are paraphrases or not


Page 4: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Overview

Background

Neural Language Models

Recursive Autoencoders

Contributions

Unfolding RAEs

Dynamic Pooling of Similarity Matrix

Experiments


Page 5: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Prior Work

Similarity Metrics

n-gram Overlap / Longest Common Subsequence

Ordered Tree Edit Distance

WordNet hypernyms

Language Models

n-gram HMMs: P(wt | w1, . . . , wt−1) ≈ P(wt | wt−n+1, . . . , wt−1)

Log-Linear Models

P(y | w1:t; θ) ≈ exp(θᵀ f(w1:t, y)) / ∑y′∈Y exp(θᵀ f(w1:t, y′))

Neural Language Models2

2R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, 2008.


Page 6: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Neural Language Models

Vocabulary V

Embedding Matrix L ∈ Rn×|V|

L : V → Rn

Each column of L “embeds” w ∈ V in an n-dimensional feature space

Captures semantic and syntactic information about a word

A sentence S = (w1, . . . , wm), wi ∈ V is represented as an ordered list (x1, . . . , xm), xi ∈ Rn
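A minimal sketch of this representation, assuming a hypothetical toy vocabulary and a randomly initialized embedding matrix (not the embeddings used in the paper):

```python
import numpy as np

n = 100                                        # embedding dimension
vocab = {"the": 0, "judge": 1, "refused": 2}   # hypothetical toy vocabulary V
L = np.random.randn(n, len(vocab))             # embedding matrix L in R^{n x |V|}

def embed(sentence):
    """Map a sentence (w1, ..., wm) to its ordered list of word vectors (x1, ..., xm)."""
    return [L[:, vocab[w]] for w in sentence]

xs = embed(["the", "judge", "refused"])        # m vectors, each in R^n
```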


Page 7: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Neural Language Models


Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.

Page 8: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Assume we are given a binary parse tree T :

A binary parse tree is a list of triplets of parents with children: (p → (c1, c2))

c1, c2 are either a terminal word vector xi ∈ Rn or a non-terminal parent yi ∈ Rn

Figure: Parse tree for ((y1 → (x2, x3)), (y2 → (x1, y1))), ∀x, y ∈ Rn


Page 9: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Non-terminal parent p computed as

p = f (We [c1; c2] + b)

f is an activation function (e.g. sigmoid, tanh)

We ∈ Rn×2n is the encoding matrix to be learned

[c1; c2] ∈ R2n is the concatenation of the children

b is a bias term

Figure: y2 = f(We [x1; f(We [x2; x3] + b)] + b)
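A minimal sketch of the encoding step with a tanh nonlinearity (We and b are placeholder parameters, not trained values):

```python
import numpy as np

n = 100
We = 0.01 * np.random.randn(n, 2 * n)    # encoding matrix We (placeholder values)
b = np.zeros(n)                          # bias term

def encode(c1, c2):
    """Parent p = f(We [c1; c2] + b) for two children in R^n, with f = tanh."""
    return np.tanh(We @ np.concatenate([c1, c2]) + b)

# encode the example tree ((y1 -> (x2, x3)), (y2 -> (x1, y1)))
x1, x2, x3 = np.random.randn(3, n)
y1 = encode(x2, x3)
y2 = encode(x1, y1)
```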


Page 10: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Autoencoders (RAEs)

Wd inverts We s.t. [c′1; c′2] = f(Wd p + bd) is the decoding of p

Erec(p) = ‖[c1; c2] − [c′1; c′2]‖₂²

To train:

Minimize Erec(T) = ∑p∈T Erec(p) = Erec(y1) + Erec(y2)

Add a length normalization layer p = p/‖p‖₂ to avoid the degenerate solution
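A sketch of the decoding and reconstruction error, continuing the encoding sketch above (Wd and bd are again placeholder parameters):

```python
Wd = 0.01 * np.random.randn(2 * n, n)    # decoding matrix Wd
bd = np.zeros(2 * n)

def decode(p):
    """Reconstruct [c1'; c2'] = f(Wd p + bd) and split into the two children."""
    c = np.tanh(Wd @ p + bd)
    return c[:n], c[n:]

def e_rec(c1, c2, p):
    """Erec(p) = ||[c1; c2] - [c1'; c2']||^2."""
    c1p, c2p = decode(p)
    return np.sum((np.concatenate([c1, c2]) - np.concatenate([c1p, c2p])) ** 2)

def normalize(p):
    """Length normalization p / ||p|| to avoid the degenerate solution."""
    return p / np.linalg.norm(p)

print(e_rec(x2, x3, y1) + e_rec(x1, y1, y2))   # Erec(T) = Erec(y1) + Erec(y2)
```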


Page 11: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Unfolding RAEs

Measure reconstruction error down to the terminal xi's:

For a node y that spans words i to j:

Erec(y(i,j)) = ‖[xi; . . . ; xj] − [x′i; . . . ; x′j]‖₂²

Hidden layer norms no longer shrink

Children with larger subtrees get more weight
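A sketch of the unfolding reconstruction, continuing the sketches above: the root encoding is decoded all the way back to the leaves. Representing the tree as nested tuples is an assumption made here for illustration.

```python
def unfold(p, shape):
    """Decode p recursively following `shape` (a leaf marker or a (left, right) tuple).

    Returns the reconstructed leaf vectors x_i', ..., x_j' in order.
    """
    if not isinstance(shape, tuple):         # leaf reached: p is the reconstructed word
        return [p]
    c1p, c2p = decode(p)
    return unfold(c1p, shape[0]) + unfold(c2p, shape[1])

def e_rec_unfolding(p, shape, leaves):
    """Erec(y_(i,j)) = ||[x_i; ...; x_j] - [x_i'; ...; x_j']||^2."""
    recon = np.concatenate(unfold(p, shape))
    return np.sum((np.concatenate(leaves) - recon) ** 2)

# error at the root y2 of the example tree (x1 (x2 x3))
print(e_rec_unfolding(y2, ("x1", ("x2", "x3")), [x1, x2, x3]))
```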


Page 12: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Deep RAEs

h = f(We⁽¹⁾ [c1; c2] + be⁽¹⁾)

p = f(We⁽²⁾ h + be⁽²⁾)
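A sketch of the two-layer (deep) encoder under the same placeholder-parameter assumptions, using 200 hidden units as in the experimental setup later in the talk:

```python
h_dim = 200
We1, be1 = 0.01 * np.random.randn(h_dim, 2 * n), np.zeros(h_dim)
We2, be2 = 0.01 * np.random.randn(n, h_dim), np.zeros(n)

def encode_deep(c1, c2):
    """h = f(We(1) [c1; c2] + be(1)), then p = f(We(2) h + be(2))."""
    h = np.tanh(We1 @ np.concatenate([c1, c2]) + be1)
    return np.tanh(We2 @ h + be2)
```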


Andrew Ng. Autoencoders (CS294A Lecture notes).

Page 13: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Training RAEs

Data: A set of parse trees

Objective: Minimize

J = (1/|T|) ∑n∈T Erec(n; We) + (λ/2)‖We‖²

Gradient descent (backpropagation, L-BFGS)

Non-convex, smooth convergence ⟹ local optima
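A sketch of the regularized objective over a set of trees, reusing the functions above. The paper optimizes this with L-BFGS on the full gradient, which is not reproduced here, and the triple-based tree representation is an assumption.

```python
lam = 1e-5   # regularization weight (lambda_RAE from the experiments)

def objective(trees):
    """J = (1/|T|) sum_n Erec(n; We) + (lambda/2) ||We||^2.

    `trees` is a list of (root_vector, tree_shape, leaf_vectors) triples.
    """
    rec = np.mean([e_rec_unfolding(p, shape, leaves) for p, shape, leaves in trees])
    return rec + 0.5 * lam * np.sum(We ** 2)

print(objective([(y2, ("x1", ("x2", "x3")), [x1, x2, x3])]))
```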


Page 14: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Sentence Similarity Matrix

For two sentences S1, S2 of lengths n and m, concatenate the terminal xi's (in sentence order) with the non-terminal yi's (depth-first, right-to-left)

Compute the similarity matrix S ∈ R(2n−1)×(2m−1), where Si,j is the ℓ2 distance between the ith element of S1's feature list and the jth element of S2's feature list
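A sketch of assembling the similarity matrix from the two sentences' node vectors (the depth-first node ordering is assumed to be handled elsewhere; here the feature lists are simply given):

```python
def similarity_matrix(feats1, feats2):
    """S[i, j] = ||a_i - c_j||, the Euclidean distance between the i-th vector
    of S1's feature list and the j-th vector of S2's feature list."""
    S = np.zeros((len(feats1), len(feats2)))
    for i, a in enumerate(feats1):
        for j, c in enumerate(feats2):
            S[i, j] = np.linalg.norm(a - c)
    return S

# (2n - 1) = 5 node vectors for the 3-word example sentence above
S = similarity_matrix([x1, x2, x3, y1, y2], [x1, x2, x3, y1, y2])
```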


Page 15: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Dynamic Pooling

Sentence lengths may vary ⟹ the dimensionality of S may vary

Want to map S ∈ R(2n−1)×(2m−1) to Spooled ∈ Rnp×np with np constant

Dynamically partition rows and columns of S into np equal parts

Min. pool over each part

Normalize µ = 0, σ = 1 and pass on to classifier (e.g. softmax)
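A minimal sketch of dynamic min-pooling to a fixed np × np grid. The equal partitioning below is one reasonable choice; the paper's handling of matrices smaller than np and of uneven splits is omitted.

```python
def dynamic_min_pool(S, n_p=15):
    """Map a variable-size S to an n_p x n_p Spooled by min-pooling over a grid."""
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.zeros((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    # normalize to mean 0, standard deviation 1 before the classifier
    return (pooled - pooled.mean()) / pooled.std()

Spooled = dynamic_min_pool(S, n_p=5)   # small n_p so the toy 5x5 example above works
```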


Page 16: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Qualitative Evaluation of Unsupervised Feature Learning

Dataset

150,000 sentences from the NYT and AP sections of the Gigaword corpus for RAE training

Setup

R100 unsupervised feature vectors provided by Turian et al.3 for initial word embeddings

Stanford parser4 to extract parse trees

Hidden layer h set to 200 units in both standard and unfolding RAE (0 in the NN qualitative evaluation)

3J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of ACL, pages 384–394, 2010.

4D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.

Page 17: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Nearest Neighbor

Figure: Comparison of nearest neighbors under the ℓ2 norm


Page 18: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Recursive Decoding

Figure: Phrase reconstruction via recursive decoding


Page 19: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Task

Dataset

Microsoft Research paraphrase corpus (MSRP)5

5,801 sentence pairs, 3,900 labeled as paraphrases

5B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In COLING, 2004.


Page 20: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Paraphrase Detection Task

Setup

4,076 training pairs (67.5% positive), 1,725 test pairs (66.5% positive)

For all (S1, S2) in training data, (S2,S1) also added

Negative examples selected for high lexical overlap

Add features ∈ {0, 1} to Spooled related to the sets of numbers in S1 and S2 (see the sketch at the end of this slide):

Numbers in S1 = numbers in S2

(Numbers in S1 ∪ numbers in S2) ≠ ∅

Numbers in one sentence ⊂ numbers in the other

Softmax classifier on top of Spooled

Hyperparameter selection via 10-fold cross-validation:

np = 15

λRAE = 10⁻⁵

λsoftmax = 0.05

Two annotators (83% agreement), third to resolve conflict
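A sketch of the three binary number features referenced above. The regex-based number extraction is an assumption; the exact extraction rule is not specified in the slides.

```python
import re

def number_features(s1, s2):
    """Three features in {0, 1} over the sets of numbers appearing in S1 and S2."""
    n1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    n2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return [
        float(n1 == n2),              # numbers in S1 = numbers in S2
        float(len(n1 | n2) > 0),      # (numbers in S1 ∪ numbers in S2) ≠ ∅
        float(n1 < n2 or n2 < n1),    # numbers in one sentence ⊂ numbers in the other
    ]

print(number_features("the trial date of Sept. 29",
                      "postpone the September 29 trial date"))
```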


Page 21: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Example Results


Page 22: Recursive Autoencoders for Paraphrase Detection (Socher et al)

State of the Art


“Paraphrase Identification (State of the Art).” ACLWiki. Web. 14 May 2013.

Page 23: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Comparison of Unsupervised Feature Learning Methods

Setup

Dynamic pooling layer

Hyperparameters optimized over C.V. set

Results

Recursive averaging: 75.9%

Standard RAE: 75.5%

Unfolding RAE without hidden layers: 76.8%

Unfolding RAE with hidden layers: 76.6%


Page 24: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Evaluating Contribution of the Dynamic Pooling Layer

Setup

Unfolding RAE used to compute S

Hyperparameters optimized over C.V. set

Results

S histogram: 73.0%

Only added number features: 73.2%

Only Spooled: 72.6%

Top URAE node: 74.2%

Spooled + number features: 76.8%


Page 25: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Critique

Pros:

Novel unfolding reconstruction error metric, dynamic pooling layer

State of the art (2011) performance

Cons:

Vague training details / time to convergence

Unconvincing improvement over baselines (recursive averaging, top RAE node)

Training requires labeled parse trees (unsupervised performance depends on parser accuracy)

Representing phrases in the same feature space as words


Page 26: Recursive Autoencoders for Paraphrase Detection (Socher et al)

Critique

Suggestions:

Add additional features to Spooled

Overlap pooling regions

Let We vary depending on the labels of the children in the parse tree

Capture the operational meaning of a word to a sentence (MV-RNN6)

p = f(We [c1; c2] + b)  →  p = f(We [Ba + b0; Ab + a0] + p0)
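A minimal sketch of the suggested matrix-vector composition, continuing the earlier sketches: each word carries a vector and a matrix, and each word's matrix modifies the other word's vector. The extra bias terms follow the slide's notation and are placeholders.

```python
A = 0.01 * np.random.randn(n, n)   # matrix associated with word vector a
B = 0.01 * np.random.randn(n, n)   # matrix associated with word vector b
a0, b0, p0 = np.zeros(n), np.zeros(n), np.zeros(n)

def compose_mv(a, b):
    """p = f(We [B a + b0; A b + a0] + p0)."""
    return np.tanh(We @ np.concatenate([B @ a + b0, A @ b + a0]) + p0)

p = compose_mv(x1, x2)
```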

6Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012.
