Distant Supervision with Imitation Learning Isabelle Augenstein i.augenstein@sheffield.ac.uk Department of Computer Science, University of Sheffield, UK Joint work with Andreas Vlachos, Diana Maynard (EMNLP 2015) 30 November 2015 Heriot-Watt University Computer Science Seminar


Page 1: Distant Supervision with Imitation Learning

Distant Supervision with Imitation Learning

Isabelle Augenstein

i.augenstein@sheffield.ac.uk

Department of Computer Science, University of Sheffield, UK

Joint work with Andreas Vlachos, Diana Maynard (EMNLP 2015)

30 November 2015

Heriot-Watt University Computer Science Seminar

Page 2: Distant Supervision with Imitation Learning

Talk Overview

•  Relation Extraction from the Web with Distant Supervision
   •  Extracting relations from Web pages
   •  Relations are used for populating knowledge bases
   •  Distant supervision makes it possible to generate relation extraction training data automatically from a knowledge base
   Ø  No manual effort necessary

Page 3: Distant Supervision with Imitation Learning

Talk Overview

•  Imitation Learning for Distant Supervision
   •  Relation extraction relies on recognising and classifying named entities, but sentences only have relation annotations
   •  Suitable manually labelled NERC training data can be difficult to obtain
   •  Imitation learning decomposes a task (RE) into a sequence of actions (e.g. NEC, RE) and is able to deal with latent variables
   •  Imitation learning is a structured prediction method, also called learning-to-search or inverse reinforcement learning
   Ø  Only labels for the last action (RE) are needed, no additional manual effort

Page 4: Distant Supervision with Imitation Learning

Overall Problem

•  Large knowledge bases are useful for search, question answering etc.

[Figure: structured information from Google Knowledge Graph]

Page 5: Distant Supervision with Imitation Learning

Overall Problem

•  Large knowledge bases are useful for search, question answering etc., but are far from complete

[Figure: structured information from Google Knowledge Graph; band members and genre are missing]

Page 6: Distant Supervision with Imitation Learning

Overall Problem

•  Large knowledge bases are useful for search, question answering etc., but are far from complete

•  Approach: automatic knowledge base population (KBP) methods using Web information extraction (IE)
   1)  Extracting entities and the relations between them from text on Web pages
   2)  Combining information from several sources to populate KBs

Page 7: Distant Supervision with Imitation Learning

Relation Extraction Overview

Relation extraction for knowledge base completion
•  Given the subject and the name of a relation, find the object of the relation in a corpus
•  E.g. “Where was Bill Gates born?”
•  Answer: birthplace(Bill Gates, Seattle_Washington)

[Figure: the sentence “Bill Gates was born in Seattle, Washington”, annotated with the entity type LOC and the relation birthplace]

Page 8: Distant Supervision with Imitation Learning

Existing Approaches

•  Why distant supervision for relation extraction (RE)?

•  RE methods requiring manual effort
   •  Rule-based approaches: manually created patterns, e.g. “X is a professor at Y”
   •  Supervised learning: statistical models, manually annotated training data
   Ø  Biased towards a domain, e.g. Biology, newswire, Wikipedia

•  RE methods requiring no manual effort
   •  Bootstrapping: semi-supervised, learning patterns iteratively starting with prior knowledge, e.g. a list of names
      Ø  “Semantic drift”, e.g. “X is a professor at Y” -> “X lives in Y”
   •  Open Information Extraction: unsupervised learning, discovering patterns, clustering
      Ø  Difficult to map to a schema

Page 9: Distant Supervision with Imitation Learning

Distant Supervision

“If two entities participate in a relation, any sentence that contains those two entities might express that relation.” (Mintz et al., 2009)

Text:
•  Amy Jade Winehouse was a singer and songwriter known for her eclectic mix of musical genres including R&B, soul and jazz.
•  Blur helped to popularise the Britpop genre.
•  Beckham rose to fame with the all-female pop group Spice Girls.

Knowledge base:

Name | Genre | …
Amy Winehouse (different lexicalisations: Amy Jade Winehouse, Wino, …) | R&B, soul, jazz, … |
Blur, … | Britpop, … |
Spice Girls, … | pop, … |

Page 10: Distant Supervision with Imitation Learning

Distant Supervision

Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations

Page 11: Distant Supervision with Imitation Learning

Distant Supervision

•  Creating positive & negative training examples
   KB: album(The Beatles, Abbey Road)
   Positive: The Beatles released their album Abbey Road in 1969.
   Negative: The Beatles played in Edinburgh.

•  Feature Extraction
   depLemmaPath=released_OJB, possPath=VBD_PRP_album, …

•  Classifier Training
   possPath=_release+VBN=0.354677 depLemmaPath=_release=1.81213, …

•  Prediction of New Relations
   Michael Jackson’s third album is Music & Me
   album(Michael Jackson, Music & Me)
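The first pipeline step (creating positive and negative training examples) can be illustrated with a minimal sketch. It assumes simple substring matching against a toy knowledge base; the names `KB` and `label_sentences` are hypothetical and not taken from the talk, and a real system would match entity candidates rather than raw substrings.

```python
# Toy knowledge base: (subject, relation) -> known objects (illustrative data only).
KB = {("The Beatles", "album"): {"Abbey Road"}}

def label_sentences(subject, relation, sentences, kb=KB):
    """Label each sentence that mentions the subject as positive or negative.

    A sentence is positive if it also contains a known object of the relation.
    """
    objects = kb.get((subject, relation), set())
    labelled = []
    for sentence in sentences:
        if subject not in sentence:
            continue  # only sentences mentioning the subject are candidates
        matched = sorted(obj for obj in objects if obj in sentence)
        labelled.append((sentence, bool(matched), matched))
    return labelled

examples = label_sentences("The Beatles", "album", [
    "The Beatles released their album Abbey Road in 1969.",
    "The Beatles played in Edinburgh.",
])
for sentence, is_positive, objs in examples:
    print("Positive" if is_positive else "Negative", "|", sentence, "|", objs)
```

The labelled sentences would then be passed on to feature extraction and classifier training.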

Page 12: Distant Supervision with Imitation Learning

Distant Supervision

Creating positive & negative training examples -> Feature Extraction -> Classifier Training -> Prediction of New Relations

Distant Supervision: supervised learning + automatically generated training data

Page 13: Distant Supervision with Imitation Learning

Distant Supervision

•  Requires no manual effort
•  Automatically labels text with relations from a knowledge base
•  Trains a statistical model (not patterns)
•  Extracts relations with respect to a knowledge base

Ø  Combines the benefits of supervised approaches (learning a statistical model) and bootstrapping RE approaches (only a list of extractions as input)

Page 14: Distant Supervision with Imitation Learning

Evaluation: Corpus

•  Web crawl corpus, created using entity-specific search queries, e.g. “‘The Beatles’ Musical Artist album”

Class | Property / Relation
Book | author, characters
Musical Artist | album, record label, track
Film | director, producer, actor, character
Politician | birthplace, educational institution, spouse
Business | employees, founders
Educational Institution | mascot, city
River | origin, mouth

Page 15: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Distant supervision does not require manual annotation, but depends on NERC for candidate identification

[Figure: the sentence “Bill Gates was born in Seattle, Washington”, annotated with the entity type LOC and the relation birthplace]

Page 16: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Existing work uses Stanford NER (Finkel et al. 2005) or FIGER (Ling and Weld 2012)

Stanford NER | FIGER
Location | 14 Location (City, Country, County, Province, Railway, …)
Person | 15 Person (Actor, Architect, Artist, Musician, Terrorist, …)
Organisation | 13 Org (Airline, Company, Educational_Institution, …)
Misc | 13 Product (Car, Train, Camera, Software, Weapon, …); 9 Building (Airport, Hospital, Restaurant, Theater, …); 5 Art (Film, Play, Written_Work, Music, Newspaper); 7 Event (Election, Military_Conflict, Terrorist_Attack, …); 30 Misc (Time, Educational_Degree, Drug, Algorithm, …)

Page 17: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Problem 1: missing NE types even with fine-grained schemas

[Figure: the sentence “Michael Jackson’s third album is Music & Me”, annotated with the relation album and the NE labels Musician, ?, Misc]

Page 18: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Problem 1: missing NE types even with fine-grained schemas

•  Problem 2: domain difference between training and testing data (e.g. newswire, Wikipedia vs. Web)

[Figure: the sentence “Michael Jackson’s third album is Music & Me”, annotated with the relation album and the NE labels ?, Misc]

Page 19: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Task decomposition
   •  NER: Named Entity Boundary Recognition
   •  NEC: Assigning Types to NEs
   •  RE: Relation Extraction

•  Solution 1:
   •  NER: recognise NEs with heuristics (e.g. POS-based, HTML)
   •  NEC: apply a trained model (e.g. Stanford, FIGER), add labels of objects to RE features
   •  RE: train model with distantly annotated data as usual

•  NER heuristics:
   •  Noun phrases, capitalised phrases
   •  Phrases from HTML markup: <a href>, <li>, <h1>, <h2>, <h3>, <strong>, <b>, <em>, <i>
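As an illustration of the NER heuristics listed on this slide, the sketch below collects candidate phrases from the HTML tags named above and from capitalised word sequences. It is only an approximation under stated assumptions: the class `MarkupCandidates`, the function `capitalised_phrases` and the regular expression are hypothetical, and the noun-phrase and POS-based heuristics are not covered.

```python
import re
from html.parser import HTMLParser

# Tags whose text content is treated as an entity candidate.
MARKUP_TAGS = {"a", "li", "h1", "h2", "h3", "strong", "b", "em", "i"}

class MarkupCandidates(HTMLParser):
    """Collect text that appears inside any of the MARKUP_TAGS."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open candidate tags
        self.candidates = []   # extracted phrases, in document order

    def handle_starttag(self, tag, attrs):
        if tag in MARKUP_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.stack and text:
            self.candidates.append(text)

def capitalised_phrases(text):
    """Return maximal runs of capitalised words, optionally joined by '&'."""
    return re.findall(r"[A-Z]\w*(?:\s+(?:&\s+)?[A-Z]\w*)*", text)

parser = MarkupCandidates()
parser.feed("<h1>Arctic Monkeys</h1><p>Albums:</p><ul><li><b>AM</b></li></ul>")
markup_phrases = parser.candidates
text_phrases = capitalised_phrases("Michael Jackson's third album is Music & Me")
```

Candidates from both sources would then be passed to the NEC and RE stages.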

Page 20: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Solution 1:
   •  NER: recognise NEs with heuristics (e.g. POS-based, HTML)
   •  NEC: add object candidate labels (e.g. with Stanford, FIGER)
   •  RE: train model with distantly annotated data as usual

•  RE features: ne=O, depLemmaPath=poss_album_subj, possPath=POS_JJ_album_VBZ, …

[Figure: the sentence “Michael Jackson’s third album is Music & Me”, annotated with the relation album and the NE label O]

Page 21: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Experiments with 16 relations (e.g. album, character, record label, author, origin)

[Figure: recall of NER with the off-the-shelf Stanford model compared to heuristics]

Page 22: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Solution 2:
   •  NER: with heuristics
   •  NEC & RE: train one-stage model

•  NEC features: obj=Music & Me, w[-1-2]=album is, …
•  RE features: depLemmaPath=poss_album_subj, possPath=POS_JJ_album_VBZ, …

[Figure: the sentence “Michael Jackson’s third album is Music & Me”, annotated with the relation album]

Page 23: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Solution 2:
   •  NER: with heuristics
   •  NEC & RE: train one-stage model

•  Problem 3: NEC features are useful for RE, but
   •  RE features are sparse (e.g. path between subject and object)
   •  NEC features can overpower RE features

[Figure: the sentence “Michael Jackson’s third album is Music & Me”, annotated with the relation album]

Page 24: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Problem 3: NEC features are useful for RE, but:
   •  RE features are sparse (e.g. path between subject and object)
   •  NEC features can overpower RE features

Example: One of director Steven Spielberg’s greatest heroes was Alfred Hitchcock, the mastermind behind Psycho.

Candidates for director relation with subject Psycho: Steven Spielberg, Alfred Hitchcock

Ø  Model would incorrectly predict Steven Spielberg, because the context is stronger (w[-1]=director)

Page 25: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Ideal Solution:
   •  NER: with heuristics
   •  NEC: trained classifier
   •  RE: trained classifier

Ø  That would be great, but how can we do this without NEC training data?

Page 26: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Imitation learning with DAGGER (Ross et al. 2011)
   •  Also called learning-to-search, inverse reinforcement learning
   •  Structured prediction method
   •  Able to deal with latent variables, only labels for the last stage (RE) needed
   •  Decomposes tasks into a sequence of actions made at different stages
   •  Dependencies between tasks are learnt by appropriate generation of training examples
   •  Classifiers are trained iteratively

•  Relationship between reinforcement learning and imitation learning
   •  In reinforcement learning, the policy is being learnt and the actions are given
   •  In imitation learning, the policy is given and the actions are learnt (hence inverse)

Page 27: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Learning from a demonstrator
   •  Possible actions are given
   •  Correctness of actions (i.e. costs) is assessed by taking actions, predicting the remaining ones and evaluating the result
   •  Dependencies between actions are learnt by observation

•  Origins of imitation learning
   •  Robotics
   •  Game playing (e.g. Ortega et al. 2013)
   •  Mario’s possible actions (simplified): move left, move right, duck, run, jump, fire

Page 28: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Imitation Learning for NLP
   •  Actions: NEC, followed by RE if NEC is positive
   •  Demonstrator (expert policy) tries to replicate labelled RE data
   •  Base classifier: cost-sensitive classification learning with PA (passive-aggressive classifier)
   •  NEC labels are needed but not specified by labelled RE data
   •  Solution: look-ahead!

Page 29: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Iteration 1, NEC Stage

Example: Michael Jackson’s third album is Music & Me (NEC label: ?)

Stage | True | False | Features
NEC Stage | ? | ? | obj=Music & Me, …
RE Stage | | | depLemma=poss_album_subj, …

Page 30: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Iteration 1, RE Stage

Example: Michael Jackson’s third album is Music & Me (RE label: True, NEC label: ?)

Stage | True | False | Features
NEC Stage | ? | ? | obj=Music & Me, …
RE Stage | 0 | 1 | depLemma=poss_album_subj, …

Page 31: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Iteration 1, RE Stage

Example: Michael Jackson’s third album is Music & Me (RE label: True, NEC label: True)

Stage | True | False | Features
NEC Stage | 0 | 1 | obj=Music & Me, …
RE Stage | 0 | 1 | depLemma=poss_album_subj, …

Page 32: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Iteration 1
   •  NEC and RE Stage: predict labels according to labelled data (expert policy) with look-ahead
   •  Extract features
   •  Assess costs
   •  CSC (cost-sensitive classification) example: features, costs -> will be remembered for next iterations!
   •  Train a classifier for each stage based on the CSC examples (learned policy)
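The look-ahead cost assessment for the NEC stage can be sketched as follows. This is a simplified illustration, not the implementation from the paper: it assumes that a negative NEC decision means no relation is extracted, that the remaining RE action is completed by the expert policy, and that the loss is 0/1 against the distantly supervised RE label. All function names are hypothetical.

```python
def expert_re(gold_relation):
    """Expert policy for the RE stage: reproduce the distantly supervised label."""
    return gold_relation

def rollout_loss(nec_action, gold_relation):
    """Take a NEC action, complete the sequence with the expert, score the result."""
    # Assumption: if NEC rejects the candidate, the RE stage is skipped
    # and no relation is predicted.
    predicted_relation = expert_re(gold_relation) if nec_action else False
    return 0 if predicted_relation == gold_relation else 1

def nec_costs(gold_relation):
    """Cost of each possible NEC action, obtained by look-ahead."""
    return {action: rollout_loss(action, gold_relation) for action in (True, False)}

# "Michael Jackson's third album is Music & Me" is a positive example for album.
costs_positive = nec_costs(gold_relation=True)
costs_negative = nec_costs(gold_relation=False)
```

For the positive example this gives cost 0 for NEC=True and cost 1 for NEC=False, matching the costs shown for the NEC stage in iteration 1. Under these assumptions both NEC actions cost 0 for a negative example, so such examples carry no signal for the NEC classifier.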

Page 33: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Iteration 1
   •  NEC and RE Stage: predict labels according to labelled data (expert policy) with look-ahead
   •  Extract features
   •  Assess costs
   •  CSC (cost-sensitive classification) example: features, costs -> will be remembered for next iterations!
   •  Train a classifier for each stage based on the CSC examples (learned policy)

•  Iteration >= 2
   •  Predict labels according to the expert policy or the learned policy
   •  The policy is chosen stochastically: the expert policy is used with probability $p = (1-\beta)^{i-1}$, the learned policy otherwise (i: iteration number, β: learning rate)
   •  With each iteration it is more likely that the learned policy is chosen
   •  The bigger the learning rate, the faster the learner moves away from the labelled data
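A small sketch of the stochastic policy mixing. It reads the formula as $(1-\beta)^{i-1}$ and assumes this is the probability of following the expert policy; that reading is an interpretation, chosen because it gives probability 1 in iteration 1, where only the expert policy is used. The function names are hypothetical.

```python
import random

def expert_probability(iteration, beta):
    """Probability of following the expert policy in a given iteration (1-based)."""
    return (1.0 - beta) ** (iteration - 1)

def choose_policy(iteration, beta, rng=random):
    """Sample which policy predicts the next action."""
    if rng.random() < expert_probability(iteration, beta):
        return "expert"
    return "learned"

# Expert probability over the first iterations for two learning rates.
schedule_slow = [round(expert_probability(i, 0.3), 3) for i in range(1, 5)]
schedule_fast = [round(expert_probability(i, 0.7), 3) for i in range(1, 5)]
first_iteration_policy = choose_policy(1, 0.3)
```

With β = 0.3 the expert probability decays as 1.0, 0.7, 0.49, 0.343, while β = 0.7 drops to 0.3 already in the second iteration, so a larger learning rate moves the learner away from the labelled data faster.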

Page 34: Distant Supervision with Imitation Learning

NERC for Distant Supervision

•  Reminder: Problem 3: NEC features are useful for RE, but:
   •  RE features are sparse (e.g. path between subject and object)
   •  NEC features can overpower RE features

Example: One of director Steven Spielberg’s greatest heroes was Alfred Hitchcock, the mastermind behind Psycho.

Candidates for director relation with subject Psycho: Steven Spielberg, Alfred Hitchcock

Ø  Model would incorrectly predict Steven Spielberg, because the context is stronger (w[-1]=director)

Page 35: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Multi-stage modelling compensates for mistakes

Example: Steven Spielberg’s greatest heroes (…) Psycho (NEC label: True, RE label: False)

Stage | Confidence | Prediction | Features
NEC Stage | 0.629 | True | obj=Steven Spielberg, …
RE Stage | -0.571 | False | depLemma=_POSS_heroes_ …

Page 36: Distant Supervision with Imitation Learning

Imitation Learning for Distant Supervision

•  Multi-stage modelling compensates for mistakes

Example: Alfred Hitchcock, the mastermind behind Psycho (NEC label: True, RE label: True)

Stage | Confidence | Prediction | Features
NEC Stage | 0.629 | True | obj=Alfred Hitchcock, …
RE Stage | 0.571 | True | depLemma=_APPOS_mastermind …
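The two-stage decision shown on these two slides can be written as a short sketch. It assumes that a positive confidence means a positive prediction (threshold 0) and that RE is only consulted when NEC is positive; the function name and the threshold are illustrative assumptions, while the scores are the ones shown in the tables.

```python
def predict_relation(nec_score, re_score, threshold=0.0):
    """Two-stage prediction: RE is only applied if NEC accepts the candidate."""
    if nec_score <= threshold:
        return False  # candidate rejected at the NEC stage
    return re_score > threshold

# (NEC confidence, RE confidence) per candidate for director(Psycho, ?).
candidates = {
    "Steven Spielberg": (0.629, -0.571),
    "Alfred Hitchcock": (0.629, 0.571),
}
extracted = [
    name for name, (nec, rel) in candidates.items() if predict_relation(nec, rel)
]
print(extracted)
```

Both candidates pass the NEC stage, but only Alfred Hitchcock passes the RE stage, so the separate RE classifier corrects the candidate that the context features alone would have favoured.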

Page 37: Distant Supervision with Imitation Learning

Evaluation: Corpus

•  Web crawl corpus, created using entity-specific search queries, e.g. “‘The Beatles’ Musical Artist album”

Class | Property / Relation
Book | author, characters
Musical Artist | album, record label, track
Film | director, producer, actor, character
Politician | birthplace, educational institution, spouse
Business | employees, founders
Educational Institution | mascot, city
River | origin, mouth

Page 38: Distant Supervision with Imitation Learning

Evaluation: NEC Features

•  Improving NEC for RE with Web features

[Figure: example Web page with its markup labelled (header, link, bold, list)]

Arctic Monkeys
Arctic Monkeys are a rock band from Sheffield, famous for albums such as AM.
Albums:
- Whatever People Say I Am, That's What I'm Not
- AM

Page 39: Distant Supervision with Imitation Learning

Evaluation: Features

•  NEC:
   •  Word features: object occurrence, POS, digit and capitalisation pattern etc.
   •  Context features: 2 words to the left and right: BOW, sequence, bag of POS, POS sequence, as 1-grams and 2-grams
   •  Web features
   Ø  Best F1 and P-avg achieved with all of those

•  RE:
   •  Context features (as for NEC)
   •  POS and words between subject and object, as sequence and BOW
   •  Dependency path with/without lemmas
   Ø  Best F1 and P-avg with sparse dependency features and 2-gram context features
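A minimal sketch of the word-based context features (two words to the left and right of the object candidate, as bag of words and as sequence). POS, capitalisation-pattern and Web features are left out, and apart from `obj=` and `w[-1-2]=`, which follow the feature examples shown earlier in the talk, the feature names are invented for illustration.

```python
def context_features(tokens, start, end, window=2):
    """Context features for the candidate spanning tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    features = {"obj=" + " ".join(tokens[start:end])}
    # Bag-of-words features: position within the window is ignored.
    features.update("bow_left=" + word for word in left)
    features.update("bow_right=" + word for word in right)
    # Sequence features: the window as an ordered 2-gram.
    if left:
        features.add("w[-1-2]=" + " ".join(left))
    if right:
        features.add("w[+1+2]=" + " ".join(right))
    return features

tokens = ["Michael", "Jackson", "'s", "third", "album", "is", "Music", "&", "Me"]
features = context_features(tokens, start=6, end=9)
```

For the candidate “Music & Me” this yields `obj=Music & Me` and `w[-1-2]=album is`, plus the bag-of-words features for “album” and “is”; there are no right-context features because the candidate ends the sentence.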

Page 40: Distant Supervision with Imitation Learning

Evaluation Setting

•  Models:
   •  All models: NER with candidate identification heuristics (POS, Web-based)
   •  Rel only: one-stage, only relation features
   •  Stanf: one-stage with Stanford NEC labels added to RE features
   •  FIGER: one-stage with FIGER labels added to RE features
   •  OS: one-stage with NEC features added to RE features
   •  IL: two-stage with imitation learning

Page 41: Distant Supervision with Imitation Learning

Overall Results

Page 42: Distant Supervision with Imitation Learning

Conclusions EMNLP Experiments

•  The imitation learning approach outperforms baselines with supervised NEC (Stanford NER and FIGER) by 10 points in average precision

•  For NEC: Web features such as appearance in lists or links to other Web pages improve average precision by 7 points

•  For RE: sparse, high-precision features (such as parse features) outperform high-recall, low-precision features (such as BOW features)

Page 43: Distant Supervision with Imitation Learning

Distant Supervision Challenges

•  Automatically generating training data
   •  Can lead to noisy training examples

Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

Name | Album | Track
The Beatles, … | Let It Be, … | Let It Be, …

Page 44: Distant Supervision with Imitation Learning

Distant Supervision Challenges

•  Automatically generating training data
   •  Can lead to noisy training examples

•  Use ‘Let It Be’ mentions as positive training examples for album or for track?

•  Problem: if both mentions of ‘Let It Be’ are used to extract features for both album and track, wrong weights are learnt

Let It Be is the twelfth album by The Beatles which contains their hit single Let It Be.

Name | Album | Track
The Beatles, … | Let It Be, … | Let It Be, …
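The ambiguity can be detected mechanically before training examples are generated. The sketch below finds object strings that are listed under more than one relation for the same subject; what to do with them (for example discarding them as seeds) is a separate design decision that this sketch does not settle. The function name and the toy data are illustrative assumptions.

```python
def ambiguous_objects(relations):
    """Map each object string to its relations, keeping only ambiguous ones.

    `relations` maps a relation name to the set of objects known for one subject.
    """
    seen = {}
    for relation, objects in relations.items():
        for obj in objects:
            seen.setdefault(obj, set()).add(relation)
    return {obj: rels for obj, rels in seen.items() if len(rels) > 1}

# Toy knowledge base entry for The Beatles.
beatles = {
    "album": {"Let It Be", "Abbey Road"},
    "track": {"Let It Be"},
}
ambiguous = ambiguous_objects(beatles)
```

Here ‘Let It Be’ is flagged because it is both an album and a track, while ‘Abbey Road’ is not.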

Page 45: Distant Supervision with Imitation Learning

Distant Supervision Challenges

•  Automatically generating training data
   •  Can lead to noisy training examples

•  Evaluation
   •  If training data is generated automatically, how / on what data can approaches be evaluated?

•  Co-Reference Resolution
   •  Does training / testing data have to contain the names of subject and object directly?

•  Named Entity Recognition and Classification
   •  Supervised off-the-shelf NERC approaches are not perfect (see rest of talk)

Page 46: Distant Supervision with Imitation Learning

Conclusions / Future Work

•  Distant supervision makes it possible to populate knowledge bases automatically without manual effort

•  Distant supervision can be applied to any domain

•  Ongoing challenges:
   •  Reducing errors made by automatic labelling
   •  Distant supervision with co-reference resolution
   •  NERC for distant supervision

Page 47: Distant Supervision with Imitation Learning

References

•  Isabelle Augenstein, Andreas Vlachos, Diana Maynard (2015). Extracting Relations between Non-Standard Entities using Distant Supervision and Imitation Learning. EMNLP 2015.

•  Isabelle Augenstein, Diana Maynard, Fabio Ciravegna (2015). Distantly Supervised Web Relation Extraction for Knowledge Base Population. Semantic Web Journal.

•  Isabelle Augenstein, Diana Maynard, Fabio Ciravegna (2014). Relation Extraction from the Web using Distant Supervision. EKAW 2014, nominated for best paper award.

•  Isabelle Augenstein (2014). Joint Information Extraction from the Web using Linked Data. ISWC 2014.

•  Isabelle Augenstein (2014). Seed Selection for Distantly Supervised Web-Based Relation Extraction. SWAIE Workshop at COLING 2014.

Page 48: Distant Supervision with Imitation Learning

References

Distant Supervision:
•  Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL-IJCNLP.

NERC:
•  Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL.
•  Xiao Ling and Daniel S. Weld. 2012. Fine-Grained Entity Recognition. AAAI.

Imitation Learning:
•  Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. JMLR.
•  Juan Ortega, Noor Shaker, Julian Togelius, and Georgios N. Yannakakis. 2013. Imitating human playing styles in Super Mario Bros. Entertainment Computing, Elsevier.

Page 49: Distant Supervision with Imitation Learning


Thank you for your attention!

Questions?