
Page 1: Jointly Learning Word and Phrase Embeddings Using Neural Networks and Implicit Tensor Factorization

Jointly Learning Word and Phrase Embeddings Using Neural Networks and Implicit Tensor Factorization

Kazuma Hashimoto

Tsuruoka Laboratory, University of Tokyo

19/06/2015 Talk@UCL Machine Reading Lab.

Page 2: Self Introduction

• Name

– Kazuma Hashimoto (橋本 和真 in Japanese)

– http://www.logos.t.u-tokyo.ac.jp/~hassy/

• Affiliation

– Tsuruoka Laboratory, University of Tokyo

• April 2015 – present Ph.D. student

• April 2013 – March 2015 Master’s student

– National Centre for Text Mining (NaCTeM)

• Research Interests

– Word/phrase/document embeddings and their applications


Page 3: Today’s Agenda

1. Background

– Word and Phrase Embeddings

2. Jointly Learning Word and Phrase Embeddings

– General Idea

3. Our Methods Focusing on Transitive Verb Phrases

– Word Prediction (EMNLP 2014)

– Implicit Tensor Factorization (CVSC 2015)

4. Experiments and Results

5. Summary


Page 4: Today’s Agenda (the agenda slide is repeated as a section divider; content identical to Page 3)

Page 5: Assigning Vectors to Words

• Word: String → Index → Vector

• Why vectors?

– Word similarities can be measured using distance metrics on the vectors (e.g., cosine similarity)

[Figure: embedding words in a vector space; similar words (mouse/rat/animal, disease/disorder, cause/trigger) appear as nearby points]

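As a concrete illustration, here is a minimal sketch of measuring word similarity with cosine similarity; the vectors below are toy values, not learned embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional embeddings (illustrative values only).
emb = {
    "mouse": np.array([0.9, 0.1, 0.0]),
    "rat":   np.array([0.8, 0.2, 0.1]),
    "cause": np.array([0.0, 0.9, 0.4]),
}

print(cosine(emb["mouse"], emb["rat"]))    # high: similar animals
print(cosine(emb["mouse"], emb["cause"]))  # low: unrelated words
```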

Page 6: Approaches to Word Representations

• Two approaches using large corpora (see Baroni+ (2014) for a systematic comparison):

– Count-based approach

  • e.g., reducing the dimension of a word co-occurrence matrix using SVD

– Prediction-based approach

  • e.g., predicting words from their contexts using neural networks

• We focus on the prediction-based approach

– Why?

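A minimal sketch of the count-based route, assuming a tiny toy corpus and a ±2-word window; a real setting would use a large corpus and a truncated SVD solver.

```python
import numpy as np

# Toy corpus; a real setting would use a large corpus such as Wikipedia.
corpus = [["the", "mouse", "eats", "cheese"],
          ["the", "rat", "eats", "cheese"],
          ["drinking", "causes", "disease"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +-2 word window.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k left singular vectors as embeddings.
U, S, _ = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]
print(embeddings[idx["mouse"]], embeddings[idx["rat"]])
```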

Page 7: Learning Word Embeddings

• Prediction-based approaches usually

– parameterize the word embeddings

– learn them based on co-occurrence statistics

• Embeddings of words appearing in similar contexts get close to each other

[Figure: the SkipGram model (Mikolov+, 2013) in word2vec; given the text "… the prevalence of drunken driving and accidents caused by drinking …", a target word predicts its surrounding words using its word embedding]

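A minimal sketch of the prediction-based idea behind SkipGram with negative sampling: a target word's embedding is trained to score observed context words above randomly sampled ones. The dimensions, learning rate, and uniform negative sampling here are simplifying assumptions, not word2vec's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50                     # vocabulary size, dimension (toy values)
W_in = rng.normal(0, 0.1, (V, d))   # target-word embeddings
W_out = rng.normal(0, 0.1, (V, d))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(target, context, lr=0.05, k=5):
    """One SkipGram-with-negative-sampling step: pull the observed
    (target, context) pair together, push k sampled pairs apart."""
    samples = [(context, 1.0)] + [(int(rng.integers(V)), 0.0) for _ in range(k)]
    g_in = np.zeros(d)
    for c, label in samples:
        grad = sigmoid(W_in[target] @ W_out[c]) - label  # logistic-loss gradient
        g_in += grad * W_out[c]
        W_out[c] -= lr * grad * W_in[target]
    W_in[target] -= lr * g_in

sgns_update(target=3, context=17)
```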

Page 8: Task-Oriented Word Embeddings

• Learning word embeddings for relation classification

– To appear at CoNLL 2015 (just advertising)


Page 9: Beyond Word Embeddings

• Treating phrases and sentences as well as words

– gaining much attention recently!

[Figure: embedding phrases in a vector space; "make payment" and "pay money" appear as nearby points]


Page 10: Approaches to Phrase Embeddings

• Element-wise addition/multiplication (Lapata+, 2010)

  – $v(\text{sentence}) = \sum_i v(w_i)$

• Recursive autoencoders (Socher+, 2011; Hermann+, 2013)

  – Using parse trees

  – $v(\text{parent}) = f(v(\text{left child}), v(\text{right child}))$

• Tensor/matrix-based methods

  – $v(\text{adj noun}) = M(\text{adj})\,v(\text{noun})$ (Baroni+, 2010)

  – $M(\text{verb}) = \sum_{i,j} v(\text{subj}_i)\,v(\text{obj}_j)^{\mathsf{T}}$ (Grefenstette+, 2011)

    • $M(\text{subj, verb, obj}) = \{v(\text{subj})\,v(\text{obj})^{\mathsf{T}}\} \odot M(\text{verb})$

    • $v(\text{subj, verb, obj}) = \{M(\text{verb})\,v(\text{obj})\} \odot v(\text{subj})$ (Kartsaklis+, 2012)

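Minimal sketches of the composition operations above with toy values, following the slide's shapes: nouns as d-dimensional vectors, adjectives and verbs as d × d matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # toy dimensionality
v_pay, v_money, v_subj = (rng.normal(size=d) for _ in range(3))
M_adj = rng.normal(size=(d, d))      # stand-in for a learned adjective matrix
M_verb = rng.normal(size=(d, d))     # stand-in for a learned verb matrix

# Element-wise addition / multiplication (Lapata+, 2010).
v_add = v_pay + v_money
v_mul = v_pay * v_money

# Matrix-based adjective-noun composition (Baroni+, 2010).
v_adj_noun = M_adj @ v_money

# Copy-subject SVO composition (Kartsaklis+, 2012).
v_svo = (M_verb @ v_money) * v_subj
```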

Page 11: Which Word Embeddings are the Best?

• Co-occurrence matrix + SVD

• C&W (Collobert+, 2011)

• RNNLM (Mikolov+, 2013)

• SkipGram/CBOW (Mikolov+, 2013)

• vLBL/ivLBL (Mnih+, 2013)

• Dependency-based SkipGram (Levy+, 2014)

• GloVe (Pennington+, 2014)


Which word embeddings should we use with which composition methods?

→ Joint learning

Page 12: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 13: Co-Occurrence Statistics of Phrases

• Word co-occurrence statistics → word embeddings

• How about phrase embeddings?

– Phrase co-occurrence statistics!

The importer made payment in his own domestic currency
The businessman pays his monthly fee in yen

similar contexts → similar meanings?

Page 14: How to Identify Phrase-Word Relations?

• Using Predicate-Argument Structures (PAS)

– Enju parser (Miyao+, 2008)

• Analyzes relations between phrases and words

[Figure: PAS analysis of "The importer made payment in his own domestic currency"; the verb and the preposition act as predicates, and the NPs/VP they relate to are their arguments]

Page 15: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 16: Why Transitive Verb Phrases?

• Meanings of transitive verbs are affected by their arguments

  – e.g., run, make, etc.

→ A good target for testing composition models

[Figure: the meaning of "make" shifts with its object: make payment ≈ pay, make money ≈ earn, make use (of) ≈ use]

Page 17: Possible Application: Semantic Search

• Embedding subject-verb-object tuples in a vector space

– Semantic similarities between SVOs can be used!


Page 18: Training Data from Large Corpora

• Focusing on the role of prepositional adjuncts

  – Prepositional adjuncts complement the meanings of verb phrases → they should be useful

[Figure: large corpora (English Wikipedia, BNC, etc.) are parsed and the analyses are simplified into predicate-argument tuples]

How to model the relationships between predicates and arguments?

Page 19: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 20: Word Prediction Model (like word2vec)

• Predicting words in predicate-argument tuples (model: PAS-CLBLM)

[Figure: for the tuple "[importer make payment] in ___", the argument embedding $\mathbf{v}_{arg1}$ (for the SVO phrase) and the predicate embedding $\mathbf{v}_{pred}$ (for "in") are combined into a feature vector $\mathbf{p}$ used for word prediction, scoring the observed word "currency" against a sampled word "furniture"]

$\mathbf{p} = \tanh(\mathbf{h}^{prep}_{arg1} \odot \mathbf{v}_{arg1} + \mathbf{h}^{prep}_{pred} \odot \mathbf{v}_{pred})$

Cost function: $\max(0,\ 1 - s(\text{currency}) + s(\text{furniture}))$
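A minimal numpy sketch of the composition and ranking cost above. Treating the score s(w) as a dot product between p and an output word embedding is an assumption for illustration; the EMNLP 2014 paper defines the exact scorer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
v_arg1 = rng.normal(size=d)   # embedding of the phrase [importer make payment]
v_pred = rng.normal(size=d)   # embedding of the preposition "in"
h_arg1 = rng.normal(size=d)   # preposition-specific weight vector for arg1
h_pred = rng.normal(size=d)   # preposition-specific weight vector for pred
out = {"currency": rng.normal(size=d), "furniture": rng.normal(size=d)}

# Feature vector from the slide: p = tanh(h_arg1 ⊙ v_arg1 + h_pred ⊙ v_pred)
p = np.tanh(h_arg1 * v_arg1 + h_pred * v_pred)

def s(word):
    """Assumed scorer: dot product of p with an output word embedding."""
    return p @ out[word]

# Hinge cost from the slide: rank the observed word above a sampled one.
cost = max(0.0, 1.0 - s("currency") + s("furniture"))
```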

Page 21: How to Compute SVO Embeddings?

• Two methods:

– (a) assigning a vector to each SVO tuple

– (b) composing SVO embeddings

[Figure: (a) the whole tuple [importer make payment] is assigned its own parameterized vector; (b) its embedding is composed from the subj, verb, and obj vectors]

Page 22: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 23: Weakness of PAS-CLBLM

• Only element-wise vector operations

  – Pros: fast training

  – Cons: poor interaction between predicates and arguments

• Interactions between predicates and arguments are important for transitive verbs

[Figure (as on Page 16): make payment ≈ pay, make money ≈ earn, make use (of) ≈ use]

Page 24: Focusing on Tensor-Based Approaches

• Tensor/matrix-based approaches (noun: vector)

  – Adjective: matrix (Baroni+, 2010)

  – Transitive verb: matrix (Grefenstette+, 2011; Van de Cruys+, 2013)

[Figure: a $d \times d \times d$ tensor of PMI values over (subject, verb, object) triples, e.g. $PMI(\text{importer}, \text{make}, \text{payment}) = 0.31$, is approximated (≅) using given pre-trained subject and object vectors and verb matrices]

Page 25: Implicit Tensor Factorization (1)

• Parameterizing

– Predicate matrices and

– Argument embeddings

[Figure: a $d \times d \times d$ tensor over (predicate, argument 1, argument 2) triples is factorized into predicate matrices and argument embeddings, which are parameterized and learned]

Page 26: Implicit Tensor Factorization (2)

• Calculating plausibility scores

– Using predicate matrices & argument embeddings

[Figure: the plausibility score $T(i, j, k)$ of predicate $i$ with arguments $j$ and $k$ is computed from the predicate matrix and the two argument embeddings]
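A minimal sketch of one plausible reading of this factorization, assuming the tensor slice for predicate i is a d × d matrix and T(i, j, k) is the bilinear form of the two argument embeddings; the CVSC 2015 paper defines the exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pred, n_arg = 50, 10, 100
P = rng.normal(0, 0.1, (n_pred, d, d))   # one d x d matrix per predicate
A1 = rng.normal(0, 0.1, (n_arg, d))      # argument-1 embeddings
A2 = rng.normal(0, 0.1, (n_arg, d))      # argument-2 embeddings

def plausibility(i, j, k):
    """T(i, j, k): bilinear score of predicate i with arguments j and k."""
    return A1[j] @ P[i] @ A2[k]

print(plausibility(0, 3, 7))
```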

Page 27: Implicit Tensor Factorization (3)

• Learning model parameters

  – Using a plausibility judgment task

    • Observed tuple: (i, j, k)

    • Collapsed tuples: (i’, j, k), (i, j’, k), (i, j, k’)

      – Negative sampling (Mikolov+, 2013)

[Cost function: discriminates observed tuples from collapsed ones via negative sampling]
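A hedged sketch of this training signal, assuming a logistic loss that labels observed tuples as positive and collapsed tuples as negative, in the spirit of negative sampling (Mikolov+, 2013); the exact cost function on the slide is not preserved in this transcript.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tuple_cost(observed, collapsed, score):
    """Negative-sampling style cost: raise the observed tuple's
    plausibility score, lower each collapsed tuple's score."""
    i, j, k = observed
    cost = -np.log(sigmoid(score(i, j, k)))
    for ci, cj, ck in collapsed:
        cost += -np.log(sigmoid(-score(ci, cj, ck)))
    return cost

# Usage with the plausibility() sketch from the previous page:
# tuple_cost((0, 3, 7), [(1, 3, 7), (0, 4, 7), (0, 3, 8)], plausibility)
```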

Page 28: Example

• Discriminating between observed and collapsed tuples:

(i, j, k)   = (in, importer make payment, currency)
(i’, j, k)  = (on, importer make payment, currency)
(i, j’, k)  = (in, child eat pizza, currency)
(i, j, k’)  = (in, importer make payment, furniture)

Page 29: How to Compute SVO Embeddings?

• Two methods:

– (a) assigning a vector to each SVO tuple

– (b) composing SVO embeddings

[Figure (cf. Page 21): (a) each tuple gets a parameterized vector; (b) the embedding is composed from argument vectors and parameterized verb matrices using the copy-subject function (Kartsaklis+, 2012)]

Page 30: Why the Copy-Subject Function?

• The copy-subject function is presented in Kartsaklis+ (2012)

  – It uses the verb matrices of Grefenstette+ (2011)

    • Our verb matrices are related to those of Grefenstette+ (2011)

• The function can compute

  – verb-object phrase embeddings

  – subject-verb-object phrase embeddings

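A minimal sketch of the copy-subject composition with toy values: the verb matrix transforms the object to give a verb-object embedding, and the subject then modulates it element-wise to give a subject-verb-object embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
M_verb = rng.normal(0, 0.1, (d, d))   # learned verb matrix (toy values)
v_subj = rng.normal(size=d)           # subject embedding
v_obj = rng.normal(size=d)            # object embedding

# Verb-object phrase embedding: the verb matrix transforms the object.
v_vo = M_verb @ v_obj

# Copy-subject SVO embedding (Kartsaklis+, 2012): the subject
# modulates the verb-object vector element-wise.
v_svo = (M_verb @ v_obj) * v_subj
```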

Page 31: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 32: Experimental Settings

• Training corpus: English Wikipedia

– SVO data: 23.6 million instances

– SVO-preposition-noun data: 17.3 million instances

• Parameter Initialization: random values

• Optimization: mini-batch AdaGrad (Duchi+, 2011)

• Embedding dimensionality

– PAS-CLBLM: 200

– Tensor method: 50

• The number of model parameters in PAS-CLBLM is slightly larger than that of the tensor method

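A minimal sketch of a single AdaGrad step (Duchi+, 2011); the learning rate is a toy value, and the actual training uses mini-batches as stated above.

```python
import numpy as np

def adagrad_step(param, grad, cache, lr=0.05, eps=1e-8):
    """One AdaGrad update: each dimension's effective learning rate
    shrinks as its squared gradients accumulate in `cache`."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy usage: update a randomly initialized embedding.
rng = np.random.default_rng(0)
v, g, c = rng.normal(size=50), rng.normal(size=50), np.zeros(50)
v, c = adagrad_step(v, g, c)
```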

Page 33: Examples of Learned SVO Embeddings

• Case 1: assigning a vector to each SVO tuple

[Example table of learned SVO tuple embeddings]

Adjuncts seem to be helpful in learning the meanings of verb phrases

This approach omits the information about individual words!


Page 34: Examples of Learned SVO Embeddings

• Case 2: composing SVO embeddings

[Example tables comparing the two methods]

– Tensor (CVSC 2015): more flexible!

– PAS-CLBLM (EMNLP 2014): strongly enhances the head word

Page 35: Multiple Meanings in Verb Matrices

• In the latest approach, the learned verb matrices capture multiple meanings


Page 36: Verb Sense Disambiguation Task

• Measuring semantic similarities of verb pairs taking the same subjects and objects (Grefenstette+, 2011)

  – Evaluation: Spearman’s rank correlation between similarity scores and human ratings

verb pair (with shared subj & obj)                human rating
student write name / student spell name           7
child show sign / child express sign              6
system meet criterion / system visit criterion    1
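A minimal sketch of this evaluation, assuming composed embeddings are compared with cosine similarity and correlated with the human ratings via scipy; the embeddings and ratings below are toy stand-ins.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
# Toy stand-ins for the composed SVO embeddings of each verb pair
# (same subject and object) plus the corresponding human ratings.
svo_pairs = [(rng.normal(size=50), rng.normal(size=50)) for _ in range(3)]
human_ratings = [7, 6, 1]

model_sims = [cosine(a, b) for a, b in svo_pairs]
rho, _ = spearmanr(model_sims, human_ratings)
print(rho)
```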

Page 37: Results

• State-of-the-art results on the disambiguation task

– Prepositional adjuncts improve the results

• How about other kinds of adjuncts?

Method                                  Spearman’s ρ
Tensor (only verb data)                 0.480
Tensor (verb and preposition data)      0.614
PAS-CLBLM (this experiment)             0.374
Milajevs+, 2014                         0.456
Hashimoto+, 2014                        0.422

Future work: improving real-world applications using the method


Page 38: Today’s Agenda (the agenda slide is repeated as a section divider)

Page 39: Summary

• Word and phrase embeddings are jointly learned using large corpora parsed by syntactic parsers

  – The tensor-based method is suitable for verb sense disambiguation

  – Adjuncts are useful in learning the meanings of verb phrases

• Future directions:

– improving the embedding methods

– applying them to real-world NLP applications

• What kind of information should be captured?
