
Slide 1: Practical Structured Learning for Natural Language Processing

Hal Daumé III
Information Sciences Institute, University of Southern California
[email protected]

Committee: Daniel Marcu (Adviser), Kevin Knight, Eduard Hovy, Stefan Schaal, Gareth James, Andrew McCallum

Slide 2: What Is Natural Language Processing?

Processing of language to solve a problem:

Task                     | Input                | Output
Machine Translation      | Arabic sentence      | English sentence
Summarization            | Long documents       | Short documents
Information Extraction   | Document             | Database
Question Answering       | Corpus + query       | Answer
Parsing                  | Sentence             | Syntactic tree
Understanding            | Sentence             | Logical sentence

All have innate structure to varying degrees.

Corpus-based NLP: collect example input/output pairs and learn.

Slide 3: What Is Machine Learning?

Automatically uncovering structure in data:

Task                      | Input                       | Output
Classification            | Vector                      | Bit (or bits)
Regression                | Vector                      | Real number
Dimensionality reduction  | Vector                      | Shorter vector
Clustering                | Vectors                     | Partition
Sequence labeling         | Vectors                     | Labels
Reinforcement learning    | State space + observations  | Policy

Slide 4: NLP versus Machine Learning

[Figure: the NLP tasks and input/output pairs from Slide 2 set against the machine-learning tasks and input/output pairs from Slide 3. Sequence labeling and reinforcement-learning-inspired methods (state space → policy) are the closest matches; the mapping is annotated "Encoding: Ad Hoc / Search", "Provably Good", and "Natural Decomposition".]

Slide 5: Entity Detection and Tracking

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

As the National Portrait Gallery planned to reveal that only one of half a dozen claimed portraits of William Shakespeare can now be considered genuine, Prof Hildegard Hammerschmidt-Hummel said she could prove that there were at least four surviving portraits of the playwright.

Why? Summarization, Machine Translation, etc.

§1.3 (Example Problem: EDT)

Slide 6: Why is EDT Hard?

➢ Long-range dependencies
➢ Syntax, Semantics and Discourse
➢ World Knowledge
➢ Highly constrained outputs

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

§1.3 (Example Problem: EDT)

Slide 7: How is EDT Typically Attacked?

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

Sequence Labeling (mention detection):
    Shakespeare  scholarship  ,  ...  a  German  academic  claimed  ...
        PER           *       *       *   GPE      PER        *

Binary Classification + Clustering (coreference): {Shakespeare, playwright, he}, {German, academic}

§5.2 (EDT: Prior Work)

Slide 8: Learning in NLP Applications

Model

Features

Learning

Search

Slide 9: Result: Searn (= “Search + Learn”)

Computationally efficient

Strong theoretical guarantees

State­of­the­art performance

Broadly applicable

Easy to implement

§3 (Search-based Structured Prediction)

Slide 10: Talk Outline

➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work

Slide 11: Structured Prediction

Formulated as a maximization problem:

    ŷ = arg max_{y ∈ Y(x)} f(x, y; θ)

The slide annotates the formula with its four ingredients: Search, Model, Features, Learning.

Search (and learning) are tractable only for very simple Y(x) and f.
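To make the arg max formulation concrete, here is a brute-force sketch (the tiny tag set and the toy feature function are mine, not from the talk) that computes the arg max by enumerating every candidate output, which is exactly what makes search intractable for realistic Y(x):

```python
from itertools import product

TAGS = ["N", "V", "R"]  # tiny illustrative tag set

def f(x, y):
    # toy linear "model": score each (word, tag) pair plus tag bigrams
    emit = {("I", "N"): 2.0, ("sing", "V"): 2.0, ("merrily", "R"): 2.0}
    trans = {("N", "V"): 1.0, ("V", "R"): 1.0}
    s = sum(emit.get((w, t), 0.0) for w, t in zip(x, y))
    s += sum(trans.get(b, 0.0) for b in zip(y, y[1:]))
    return s

def argmax_y(x):
    # exhaustive search over Y(x): |TAGS| ** len(x) candidates
    return max(product(TAGS, repeat=len(x)), key=lambda y: f(x, y))

print(argmax_y(["I", "sing", "merrily"]))  # -> ('N', 'V', 'R')
```

Even for this three-word sentence the search space has 3³ = 27 candidates; with a realistic tag set and sentence length the enumeration is hopeless, which motivates viewing prediction as incremental search.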

§2.2 (Machine Learning: Structured Prediction)

Slide 12: Reduction for Structured Prediction

➢ Idea: view structured prediction in light of search

[Figure: a left-to-right search lattice over part-of-speech tags (N, P, D, V, R) for the sentence “I sing merrily”; each column offers one tag choice per word.]

Each step here looks like it could be represented as a weighted multi-class problem. Can we formalize this idea?

§3.3 (Search-based Structured Prediction)

Slide 13: Error-Limiting Reductions

➢ A formal mapping between learning problems

[Diagram: F maps Learning Problem “A” (what we want to solve) to Learning Problem “B” (what we know how to solve); F⁻¹ maps back.]

➢ F maps examples from A to examples from B
➢ F⁻¹ maps a classifier h for B to a classifier that solves A
➢ Error limiting: good performance on B implies good performance on A

[Beygelzimer et al., ICML 2005]

§2.3 (Machine Learning: Reductions)

Slide 14: Our Reduction Goal

[Diagram: Structured Prediction → (via “Searn”) → Cost-sensitive Classification → (via “Weighted All Pairs” [Beygelzimer et al., ICML 2005]) → Importance-Weighted Classification → (via “Costing” [Zadrozny, Langford & Abe, ICDM 2003]) → Binary Classification.]

➢ Error bounds get composed
➢ New techniques for binary classification lead to new techniques for structured prediction
➢ Any generalization theory for binary classification immediately applies to structured prediction

§2.3 (Machine Learning: Reductions)

Slide 15: Two Easy Reductions

Costing [Zadrozny, Langford & Abe, ICDM 2003]: reduces importance-weighted classification (examples in X × {±1} × ℝ≥0) to binary classification (examples in X × {±1}) by rejection sampling; bound: L_IW ≤ L_BC · E[w].

Weighted-All-Pairs [Beygelzimer, Dani, Hayes, Langford and Zadrozny, ICML 2005]: reduces cost-sensitive K-class classification (examples in X × ℝ≥0 × ⋯ × ℝ≥0) to importance-weighted classification (examples in X × {±1} × ℝ≥0) via all pairwise comparisons 1v2, 1v3, ..., 1vK, 2v3, ..., 2vK, ...; bound: L_CS ≤ 2 L_IW.
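A minimal sketch of the Costing reduction: keep each importance-weighted example with probability proportional to its weight, train a plain binary learner on each resample, and vote. The function names and the stand-in majority-vote "base learner" are mine, for illustration only:

```python
import random

def costing(iw_examples, train_binary, n_resamples=10, rng=random.Random(0)):
    """Reduce importance-weighted classification (x, y, w) to binary
    classification (x, y) by rejection sampling: keep each example with
    probability w / w_max, train one classifier per resample, majority-vote."""
    w_max = max(w for _, _, w in iw_examples)
    classifiers = []
    for _ in range(n_resamples):
        sample = [(x, y) for (x, y, w) in iw_examples if rng.random() < w / w_max]
        classifiers.append(train_binary(sample))
    def h(x):
        votes = sum(c(x) for c in classifiers)
        return +1 if votes >= 0 else -1
    return h

# Stand-in base learner: predicts the majority label of its sample.
def train_binary(sample):
    s = sum(y for _, y in sample)
    return (lambda x, lab=(+1 if s >= 0 else -1): lab)

# The heavily weighted +1 example dominates after resampling:
data = [(0, +1, 5.0), (1, -1, 0.1), (2, -1, 0.1)]
h = costing(data, train_binary)
print(h(0))  # -> 1
```

The point of the slide's bound L_IW ≤ L_BC · E[w] is visible here: low-weight examples rarely survive the resampling, so binary errors on them cost little.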

§2.3 (Machine Learning: Reductions)

Slide 16: A First Attempt

[Figure: a left-to-right search lattice over tags (A, N, D, V, R) for “I sing merrily”.]

At each correct state: train a classifier to choose the next state.

Thm (Kääriäinen): There is a 1st-order binary Markov problem such that a classification error rate of ε implies a Hamming loss of:

    T/2 − (1 − (1 − 2ε)^(T+1)) / (4ε) + 1/2  ≈  ε T² / 2
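The quadratic blow-up in Hamming loss can be checked numerically. The sketch below is my own simplified construction, not Kääriäinen's exact one: it assumes the worst case where a single mistake corrupts every later label, so position t is wrong with probability 1 − (1 − ε)^t, and the expected Hamming loss approaches ε·T²/2 for small ε:

```python
def expected_hamming_loss(eps, T):
    # position t (1-indexed) is wrong if any of the first t decisions erred
    return sum(1 - (1 - eps) ** t for t in range(1, T + 1))

eps, T = 0.001, 100
exact = expected_hamming_loss(eps, T)
approx = eps * T ** 2 / 2
print(exact, approx)  # close for small eps
assert abs(exact - approx) / approx < 0.05
```

The moral matches the slide: training a classifier only on correct states lets small per-step errors compound quadratically in the sequence length.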

§3.3 (Search-based Structured Prediction)

Slide 17: Reducing Structured Prediction

[Figure: the same tag lattice (A, N, D, V, R) over “I sing merrily”.]

Key Assumption: an Optimal Policy for the training data. (Weak!)

Given: input, true output and state; Return: best successor state.
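For sequence labeling under Hamming loss the assumption really is weak: from any state, the best successor is simply the true label at the next position, regardless of earlier mistakes. A sketch (function and variable names are mine):

```python
def optimal_policy(x, y_true, state):
    """Given the input x, the true output y_true, and the current state
    (the prefix of labels chosen so far), return the best next label
    under Hamming loss: the gold label at the next position."""
    t = len(state)      # how many labels have been emitted so far
    return y_true[t]    # best successor even after past mistakes

# Even from an incorrect prefix ["N", "V"] (gold starts "D", "N"),
# the policy returns the loss-minimizing continuation:
gold = ["D", "N", "V"]
print(optimal_policy(["the", "dog", "ran"], gold, ["N", "V"]))  # -> V
```

For losses that do not decompose per position (e.g. the summarization loss later in the talk), no such closed form exists, which is why that experiment approximates the optimal policy with beam search.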

§3.4.2 (Searn: Optimal Policy)

Slide 18: Idea

[Diagram: start from the Optimal Policy and move gradually toward a Learned Policy driven by features.]

§3.4 (Searn: Training)

Slide 19: Iterative Algorithm: Searn

Set current policy = optimal policy
Repeat:
  ➢ Decode using current policy:
        ... after  a  German  academic  claimed ...
              *    *   GPE      PER        *
  ➢ Generate classification problems: at each state, score candidate actions (e.g., PER, ORG, *) by the loss of their completions (e.g., L = 0.3, L = 0, L = 0.2, L = 0.25)
  ➢ Learn new multiclass classifier
  ➢ Interpolate: cur = β·cur + (1 − β)·new
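The loop can be sketched end-to-end on a toy labeling task. Everything below is my illustrative scaffolding, not the implementation from the talk: the "classifier" is a token-to-label lookup table, the action costs are plain per-step Hamming losses (full rollout completions are elided), and the policy mixture is tracked as a single probability:

```python
import random
from collections import defaultdict

LABELS = ["PER", "GPE", "*"]

def optimal_policy(x, y, t, prefix):
    return y[t]  # gold next label minimizes Hamming loss

def searn(examples, iters=3, beta=0.5, rng=random.Random(0)):
    learned = {}      # token -> label lookup table, our toy "classifier"
    p_learned = 0.0   # probability of consulting the learned policy

    def policy(x, y, t, prefix):
        if rng.random() < p_learned and x[t] in learned:
            return learned[x[t]]
        return optimal_policy(x, y, t, prefix)

    for _ in range(iters):
        # decode with the current policy; collect one cost-sensitive
        # multiclass example per visited state
        costs = defaultdict(lambda: defaultdict(float))
        for x, y in examples:
            prefix = []
            for t in range(len(x)):
                for a in LABELS:
                    costs[x[t]][a] += 0.0 if a == y[t] else 1.0
                prefix.append(policy(x, y, t, prefix))
        # "learn" the new classifier: min-cost label per token
        learned = {tok: min(c, key=c.get) for tok, c in costs.items()}
        # interpolate: the mixture drifts away from the optimal policy
        p_learned = 1 - (1 - p_learned) * (1 - beta)

    return learned

data = [(["German", "academic", "claimed"], ["GPE", "PER", "*"])]
clf = searn(data)
print(clf["German"], clf["academic"])  # -> GPE PER
```

The essential structure survives the simplification: states are generated by the current policy (not only by the gold path, avoiding the compounding-error problem of Slide 16), and each iteration interpolates the new classifier into the mixture.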

§3.4.3 (Searn: Algorithm)

Slide 20: Theoretical Analysis

Theorem: For conservative β, after 2T³ ln T iterations, the loss of the learned policy is bounded as follows:

    L(h) ≤ L(h₀) + 2 T ln T · ℓ_avg + (1 + ln T) · c_max / T

where L(h₀) is the loss of the optimal policy, ℓ_avg is the average multiclass classification loss, and c_max is the worst-case per-step loss.

§3.5 (Searn: Theoretical Analysis)

Slide 21: Why Does Searn Work?

A classifier tells us how to search

Impossible to train on full space

Want to train only where we'll wind up

Searn tells us where we'll wind up

§3.7 (Searn: Discussion and Conclusions)

Slide 22: Talk Outline

➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work

Slide 23: Proof of Concept: Sequence Labeling

[Charts: per-task accuracy comparisons.
  Spanish Named Entity Recognition (y-axis 94-98): SVM, LaSO, CRF, Searn, Searn+data.
  Handwriting Recognition (y-axis 65-95): LR, SVM, M3N, Searn, Searn+data.
  Chunking+Tagging (y-axis 93-97): Incremental Perceptron, LaSO, CRF, Searn.]

§4.4 (Sequence Labeling: Comparison)

Slide 24: Talk Outline

➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work

Slide 25: Entity Detection and Tracking

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

[Figure: the search branches repeatedly on “... he had died of cancer”, one hypothesis per candidate antecedent for “he”.]

§5.6.1 (EDT: Joint Detection and Coreference: Search)

Slide 26: EDT Features

➢ Lexical
➢ Syntactic
➢ Semantic
➢ Gazetteer
➢ Knowledge-based

Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright - and suggested, to boot, that he had died of cancer.

§5.[45].3 (EDT: Feature Functions)

Slide 27: EDT Results (ACE 2004 data)

[Chart (y-axis 72-80): scores for four prior systems (Sys1-Sys4) and Searn; the best prior system is the previous state of the art, which used extra proprietary data.]

§5.6.3 (EDT: Experimental Results)

Slide 28: Feature Contributions

[Chart (y-axis 68-82): score after removing each feature group in turn: -Lex, -Syn, -Sem, -Gaz, -KB.]

§5.6.3 (EDT: Experimental Results)

Slide 29: Talk Outline

➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work

Slide 30: New Task: Summarization

“Give me information about the Falkland islands war.”

Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.

“That's too much information to read!”

“The Falkland islands war, in 1982, was fought between Britain and Argentina.”

“That's perfect!”

Standard approach is sentence extraction, but that is often deemed too “coarse” to produce good, very short summaries. We wish to also drop words and phrases => document compression.

Slide 31: Structure of Search

[Figure: sentences S1 ... Sn laid out sequentially; nodes are marked as frontier nodes or summary nodes. Example dependency parse of “The man ate a big sandwich”: ate → {man → {The}, sandwich → {a, big}}.]

➢ Lay sentences out sequentially
➢ Generate a dependency parse of each sentence
➢ Mark each root as a frontier node
➢ Repeat:
    ➢ Choose a frontier node to add to the summary
    ➢ Add all its children to the frontier
➢ Finish when we have enough words

To make search more tractable, we run an initial round of sentence extraction @ 5x length.

Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.

Optimal Policy not analytically available; approximated with beam search.
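The vine-growth loop can be sketched directly. The toy dependency trees and the greedy word-length scorer below are my stand-ins; in the talk the choice of frontier node is learned with Searn, with the optimal policy approximated by beam search:

```python
# Each sentence is a dependency tree: node -> list of child nodes.
sentences = [
    {"root": "ate", "children": {"ate": ["man", "sandwich"],
                                 "man": ["The"], "sandwich": ["a", "big"]}},
    {"root": "fought", "children": {"fought": ["Britain", "Argentina"],
                                    "Britain": [], "Argentina": []}},
]

def grow_summary(sents, max_words, score):
    # mark each sentence root as a frontier node
    frontier = [(s, s["root"]) for s in sents]
    summary = []
    while frontier and len(summary) < max_words:
        # choose the best frontier node to add to the summary
        sent, node = max(frontier, key=lambda fn: score(fn[1]))
        frontier.remove((sent, node))
        summary.append(node)
        # add the chosen node's children to the frontier
        frontier.extend((sent, c) for c in sent["children"].get(node, []))
    return summary

# hypothetical scorer: prefer longer (more contentful) words
print(grow_summary(sentences, 4, score=len))
# -> ['fought', 'Argentina', 'Britain', 'ate']
```

Because words only become available once their dependency parent is in the summary, the output stays a set of connected subtrees ("vines") rather than arbitrary word salad.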

§6.1 (Summarization: Vine-Growth Model)

Slide 32: Example Output (40 word limit)

Sentence Extraction + Compression:
Argentina and Britain announced an agreement, nearly eight years after they fought a 74-day war a populated archipelago off Argentina's coast. Argentina gets out the red carpet, official royal visitor since the end of the Falklands war in 1982.

Vine Growth:
Argentina and Britain announced to restore full ties, eight years after they fought a 74-day war over the Falkland islands. Britain invited Argentina's minister Cavallo to London in 1992 in the first official visit since the Falklands war in 1982.

6  Diplomatic ties restored        3  Falkland war was in 1982
5  Major cabinet member visits     3  Cavallo visited UK
5  Exchanges were in 1992          2  War was 74-days long
3  War between Britain and Argentina

Totals: +24, +13.

§6.7 (Summarization: Error Analysis)

Slide 33: Results

[Chart (human evaluation scores, y-axis 2-4): Approx. Optimal Vine Growth, Vine Growth (Searn), Optimal Extract @100, Extract @100, Baseline.]

[Daumé III and Marcu, DUC 2005]

§6.6 (Summarization: Experimental Results)

Slide 34: Talk Outline

➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work

Slide 35: What Can Searn Do?

Solve structured prediction problems...

...efficiently.

...with theoretical guarantees.

...with weak assumptions on structure.

...and outperform the competition.

...with little extra code required.

§7 (Conclusions and Future Directions)

Slide 36: What Can Searn Not Do? (Yet)

Solve structured prediction problems...

...with only weak feedback.

...with hidden variables.

...with enormous branching factors.

...without pre-defined search spaces.

...with “combinatorial” outputs.

...with only an optimal path.

§7 (Conclusions and Future Directions)

Slide 37: Thanks! Questions?

Slide 38: BACKUP SLIDES

Slide 39

[Backup slide: a comparison of structured prediction methods by the assumptions they require.
  Conditional Random Fields [Lafferty, McCallum, Pereira, ICML 2001]
  Max-margin Markov Networks [Taskar, Guestrin, Koller, NIPS 2003]
  Searn [Daumé III, Langford, Marcu, submitted to NIPS 2006]
  Incremental Perceptron [Collins & Roark, EMNLP 2004]
  LaSO [Daumé III, Marcu, ICML 2005]
The assumption classes (Optimal Policies, Optimal Path, Approximately Optimal Policy, Optimal Policy, Weak Feedback, Loss-augmented Search, Normalized Search, Optimally Completable) are related by inclusion (⊇/⊃), with the weakest assumptions covering most NLP problems.]