accurate max-margin training for structured output spaces

27
Accurate Max-Margin Training for Structured Output Spaces Sunita Sarawagi Rahul Gupta IIT Bombay http://www.cse.iitb.ac.in/ ~{sunita,grahul} 1

Upload: denise-hayes

Post on 02-Jan-2016

29 views

Category:

Documents


0 download

DESCRIPTION

Accurate Max-Margin Training for Structured Output Spaces. Sunita Sarawagi Rahul Gupta IIT Bombay http://www.cse.iitb.ac.in/~{sunita,grahul}. TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A A A A A A A A A A. Structured learning. Model. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Accurate Max-Margin Training for Structured Output Spaces

1

Accurate Max-Margin Training for Structured Output Spaces

Sunita Sarawagi Rahul Gupta

IIT Bombay

http://www.cse.iitb.ac.in/~{sunita,grahul}

Page 2: Accurate Max-Margin Training for Structured Output Spaces

2

Structured learning

Score of a prediction y for input x: s(x,y) = w. f(x,y)

Predict: y* = argmaxy s(x,y) Exploit decomposability of feature functions

f(x,y) = d f (x,yd,d)

x Model Structured y1. Vector: y1 ,y2,..,yn

2. Segmentation3. Tree4. Alignment5. ..

Feature function vector f(x,y) = f1(x,y), f2(x,y),…,fK(x,y),

w=w1,..,wK

Page 3: Accurate Max-Margin Training for Structured Output Spaces

Training structured models Given

N input output pairs (x1 y1), (x2 y2), …, (xN yN)

Error of output : Ei(y) Also decomposes over smaller parts: Ei(y) = c Ei,c(yc) Example: Hamming(yi, y)= c [[yi,c != yc]]

Find w Small training error Generalizes to unseen instances Efficient for structured models

Page 4: Accurate Max-Margin Training for Structured Output Spaces

4

Related work: max-margin trainingof structured models

Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.

Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep), 1453–1484.

Several others…

Page 5: Accurate Max-Margin Training for Structured Output Spaces

5

Max-margin formulations Margin scaling

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ Ei(y) + w:f(xi;y) ¡ »i 8y 6= yi;8i

»i ¸ 0 8i

Page 6: Accurate Max-Margin Training for Structured Output Spaces

Max-margin formulations Margin scaling

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ Ei(y) + w:f(xi;y) ¡ »i 8y 6= yi;8i

»i ¸ 0 8i

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ 1+ w:f(xi;y) ¡ »iE i (y) 8y 6= yi;8i

»i ¸ 0 8i

Exponential number of constraints Use cutting plane

Slack scaling

Page 7: Accurate Max-Margin Training for Structured Output Spaces

7

Margin Vs Slack scaling Margin

Easy inference of most violated constraint for decomposable f and E

Too much importance to y far from margin Slack

Difficult inference of violated constraint

Zero loss of everything outside margin of 1

yM = argmaxy (w:f(xi;y) + Ei(y))

yS = argmaxy (w:f(xi;y) ¡ »iE i (y) )

Page 8: Accurate Max-Margin Training for Structured Output Spaces

8

Accuracy comparison

Address Cora CoNLL0310

12

14

16

18

20

22

24

26

28

30 Sequence labelingMarginSlack

Sp

an

F1

Err

or

Address Cora15

16

17

18

19

20

21

22

23 Segmentation

Sp

an

F1

Err

or

Slack scaling up to 25% better than Margin scaling.

Page 9: Accurate Max-Margin Training for Structured Output Spaces

9

Approximating Slack inference Slack inference: maxy s(y)-»/E(y)

Decomposability of E(y) cannot be exploited.

-»/E(y) is concave in E(y) Variational method to rewrite as linear function

¡ »E (y) = min¸ ¸ 0 ¸E (y) ¡ 2

p(»̧ )

Page 10: Accurate Max-Margin Training for Structured Output Spaces

10

Approximating Slack inference Now approximate the inference problem as:

maxy

µs(y) ¡

»E (y)

¶= max

ymin¸ ¸ 0

s(y) + ¸E (y) ¡ 2p

»̧

· min¸ ¸ 0

maxy

s(y) + ¸E (y) ¡ 2p

»̧

Same tractable MAP as in Margin Scaling

Page 11: Accurate Max-Margin Training for Structured Output Spaces

11

Approximating slack inference Now approximate the inference problem as:

maxy

µs(y) ¡

»E (y)

¶= max

ymin¸ ¸ 0

s(y) + ¸E (y) ¡ 2p

»̧

· min¸ ¸ 0

maxy

s(y) + ¸E (y) ¡ 2p

»̧

Same tractable MAP as in margin scaling

Convex in ¸minimize using line search, Bounded interval [¸l, ¸u] exists since only want violating y.

Page 12: Accurate Max-Margin Training for Structured Output Spaces

12

Slack Vs ApproxSlack

Address Cora CoNLL0310

12

14

16

18

20

22

24

26

28

30 Sequence labeling Margin

Slack

ApproxSlack

Sp

an

F1

Err

or

Address Cora15

16

17

18

19

20

21

22

23 Segmentation

Sp

an

F1

Err

or

ApproxSlack gives the accuracy gains of Slack scaling while requiring same the MAP inference same as Margin scaling.

Page 13: Accurate Max-Margin Training for Structured Output Spaces

13

Limitation of ApproxSlack Cannot ensure that a violating y will be found

even if it exists No ¸ can ensure that.

Proof: s(y1)=-1/2 E(y1) = 1 s(y2) = -13/18 E(y2) = 2 s(y3) = -5/6 E(y3) = 3 s(correct) = 0 » = 19/36 y2 has highest s(y)-»/E(y) and is violating. No ¸ can score y2 higher than both y1 and y2

-0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 00

1

2

3

4

Correcty1

y2

y3

s(y)

E(y

)

Page 14: Accurate Max-Margin Training for Structured Output Spaces

Max-margin formulations Margin scaling

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ Ei(y) + w:f(xi;y) ¡ »i 8y 6= yi;8i

»i ¸ 0 8i

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ 1+ w:f(xi;y) ¡ »iE i (y) 8y 6= yi;8i

»i ¸ 0 8i

Slack scaling

Page 15: Accurate Max-Margin Training for Structured Output Spaces

The pitfalls of a single shared slack variables Inadequate coverage for decomposable losses

s=0 s=-3 Correct : y0 = [0 0 0]

Separable: y1 = [0 1 0]

Non-separable: y2 = [0 0 1]

Margin/Slack loss = 1. Since y2 non-separable from y0, »=1, Terminate.Premature since different features may be involved.

Page 16: Accurate Max-Margin Training for Structured Output Spaces

16

A new loss function: PosLearn Ensure margin at each loss position

Compare with slack scaling.

minw;»

12jjwjj2 + C

P Ni=1

Pc »i ;c

s:t w:f(xi ;yi ) ¸ 1+ w:f(xi ;y) ¡ »i ;c

E i ;c (yc ) 8y : yc 6= yi ;c

»i ;c ¸ 0 i : 1:: :N;8c

minw;»12jjwjj2 + C

P Ni=1 »i

s:t: w:f(xi;yi) ¸ 1+ w:f(xi;y) ¡ »iE i (y) 8y 6= yi;8i

»i ¸ 0 8i

Page 17: Accurate Max-Margin Training for Structured Output Spaces

The pitfalls of a single shared slack variables Inadequate coverage for decomposable losses

s=0 s=-3 Correct : y0 = [0 0 0]

Separable: y1 = [0 1 0]

Non-separable: y2 = [0 0 1]

Margin/Slack loss = 1. Since y2 non-separable from y0, »=1, Terminate.Premature since different features may be involved.

PosLearn loss = 2Will continue to optimize for y1 even after slack » 3 becomes 1

Page 18: Accurate Max-Margin Training for Structured Output Spaces

18

Comparing loss functions

Address Cora CoNLL0310

12

14

16

18

20

22

24

26

28

30 Sequence labeling Margin

Slack

ApproxSlack

PosLearn

Sp

an

F1

Err

or

Address Cora15

16

17

18

19

20

21

22

23 Segmentation

Sp

an

F1

Err

or

PosLearn: same or better than Slack and ApproxSlack

Page 19: Accurate Max-Margin Training for Structured Output Spaces

19

Inference for PosLearn QP Cutting plane inference

For each position c, find best y that is wrong at c

Solve simultaneously for all positions c Markov models: Max-Marginals Segmentation models: forward-backward passes Parse trees

maxy:yc 6=y i ;c

µsi (y) ¡

»i ;c

E i ;c(yc)

¶= max

yc 6=y i ;c

µmaxy» yc

si (y) ¡»i ;c

E i ;c(yc)

MAP with restriction, easy!

Small enumerable set

Page 20: Accurate Max-Margin Training for Structured Output Spaces

20

Running time

0 25 50 75 100200

600

1000

1400

1800ApproxSlack

Margin

PosLearn

Slack

Training Percent

Tra

inin

g T

ime

(s

ec

)

Margin scaling might take time with less data since good constraints may not be found earlyPosLearn adds more constraints but needs fewer iterations.

Page 21: Accurate Max-Margin Training for Structured Output Spaces

21

Summary1. Margin scaling popular due to computational

reasons, but slack scaling more accurate A variational approximation for slack inference

2. Single slack variable inadequate for structured models where errors are additive A new loss function that ensures margin at each

possible error position of y

Future work: theoretical analysis of generalizability of loss functions for structured models

Page 22: Accurate Max-Margin Training for Structured Output Spaces

22

Questions?

Page 23: Accurate Max-Margin Training for Structured Output Spaces

23

Slack scaling: which constraint?

yS better coordinate ascent direction than yT for the dual faster convergence

yS = argmaxy (w:f(xi ;y) ¡ »iE i (y) )

yT = argmaxyEi(y)(1¡ w:±f(xi ;y))Versus

0 200 400 600 800 1000 12000

0.5

1

1.5

2

2.5

3

3.5

4

Original Slack

Primal Slack

Number of constraints

Tra

inin

g O

bje

cti

ve

Primal

Original

Page 24: Accurate Max-Margin Training for Structured Output Spaces

24

Is slack scaling the best there is? Y = Vector y_1,y_2,\ldots y_n E(y) = sum_i E(y_i) Two cases:

Y_i s independent Margin scaling is exactly what we want! Slack scaling puts too little margin

Y_i s dependent Margin scaling asks for more margin than needed Slack scaling better but

Page 25: Accurate Max-Margin Training for Structured Output Spaces

25

Max-margin loss surrogatesTrue error E i (argmaxyw:f(xi ;y))

maxy [E i (y) ¡ w:±f(xi ;y)]+

maxy E i (y)[1¡ w:±f(xi ;y)]+

Let w:±f(xi ;y) = w:f(xi ;yi) ¡ w:f(xi ;y)

1. Margin Loss

2. Slack Loss

-2 -1 0 1 2 3 40

2

4

6

8

10

12

Slack

Margin

Ideal

Ei(y)=4

w:±f(xi ;y)

Page 26: Accurate Max-Margin Training for Structured Output Spaces

26

Approximating slack inference -»/E(y) is concave in E(y) Variational method to rewrite as linear function

¡ »E (y) = min¸ ¸ 0 ¸E (y) ¡ 2

p(»̧ )

-»/E(y)

E(y)

Page 27: Accurate Max-Margin Training for Structured Output Spaces

27

When is margin scaling bad

Margin scaling adequate when• No useful edge features: each position independent of each other

PosLearn useful when • Edge features are important truly structured models

MarginPosLearnAddress Edge 29 21.6

No Edge 38 36.4

Cora Edge 25.1 16.6No Edge 56 58.3

ConLL Edge 15.3 14.9No Edge 19 20.1

F1 Span error