Accurate Max-Margin Training for Structured Output Spaces
Sunita Sarawagi Rahul Gupta
IIT Bombay
http://www.cse.iitb.ac.in/~{sunita,grahul}
Structured learning
A model maps an input x to a structured output y, e.g. 1. a vector $y_1, y_2, \ldots, y_n$; 2. a segmentation; 3. a tree; 4. an alignment; 5. ...
Feature function vector $f(x,y) = (f_1(x,y), f_2(x,y), \ldots, f_K(x,y))$, weights $w = (w_1, \ldots, w_K)$.
Score of a prediction y for input x: $s(x,y) = w \cdot f(x,y)$
Predict: $y^* = \arg\max_y s(x,y)$
Exploit decomposability of the feature functions: $f(x,y) = \sum_d f(x, y_d, d)$
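To make the scoring and decomposition concrete, here is a minimal Python sketch (not from the slides) of $s(x,y) = w \cdot f(x,y)$ with per-position features and Viterbi argmax prediction for a chain-structured y; the feature function `feat(x, y_prev, y_d, d)` is a hypothetical stand-in.

```python
import numpy as np

def score(w, x, y, feat):
    # s(x, y) = w . f(x, y), with f decomposed over positions d
    return sum(np.dot(w, feat(x, y[d - 1] if d > 0 else None, y[d], d))
               for d in range(len(y)))

def predict(w, x, n_labels, length, feat):
    # y* = argmax_y s(x, y) via dynamic programming (Viterbi)
    dp = np.array([np.dot(w, feat(x, None, v, 0)) for v in range(n_labels)])
    back = np.zeros((length, n_labels), dtype=int)
    for d in range(1, length):
        scores = np.array([[dp[vp] + np.dot(w, feat(x, vp, v, d))
                            for v in range(n_labels)] for vp in range(n_labels)])
        back[d] = scores.argmax(axis=0)   # best previous label for each current label
        dp = scores.max(axis=0)
    y = [int(dp.argmax())]
    for d in range(length - 1, 0, -1):    # backtrace
        y.append(int(back[d, y[-1]]))
    return y[::-1]
```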
Training structured models
Given: N input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
Error of an output: $E_i(y)$, which also decomposes over smaller parts: $E_i(y) = \sum_c E_{i,c}(y_c)$. Example: $\mathrm{Hamming}(y_i, y) = \sum_c [[y_{i,c} \neq y_c]]$ (see the sketch below).
Find w such that: training error is small, it generalizes to unseen instances, and training is efficient for structured models.
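A tiny check (example values are ours, not from the slides) that the Hamming error decomposes into per-position terms:

```python
# Hamming(y_i, y) = sum_c [[y_{i,c} != y_c]]: each position contributes 0 or 1.
def hamming(y_true, y_pred):
    return sum(int(t != p) for t, p in zip(y_true, y_pred))

print(hamming([0, 0, 0], [0, 1, 1]))  # 2: positions 2 and 3 each contribute 1
```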
Related work: max-margin training of structured models
Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep), 1453–1484.
Several others…
Max-margin formulations
Margin scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq E_i(y) + w \cdot f(x_i, y) - \xi_i \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Both have an exponential number of constraints: use a cutting-plane method (see the sketch below).
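The cutting-plane loop referred to above can be sketched as follows; this is a schematic under assumptions, with `most_violated` (separation oracle) and `solve_qp` (QP over the active constraint set) as hypothetical routines, not the authors' implementation.

```python
def cutting_plane_train(examples, most_violated, solve_qp, C, eps=1e-3, max_iter=100):
    active = []                           # working set of (i, y) constraints
    w, slacks = solve_qp(active, C)       # solve the QP over the current active set
    for _ in range(max_iter):
        added = 0
        for i in range(len(examples)):
            y_bad, violation = most_violated(w, slacks, i)
            if violation > eps:           # violated beyond tolerance: add a cut
                active.append((i, y_bad))
                added += 1
        if added == 0:                    # no constraint violated by more than eps
            break
        w, slacks = solve_qp(active, C)   # re-solve with the enlarged constraint set
    return w
```

The only piece that differs between margin and slack scaling is the `most_violated` oracle, which is the subject of the next slides.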
Margin vs. Slack scaling
Margin scaling: easy inference of the most violated constraint for decomposable f and E, $y^M = \arg\max_y \left( w \cdot f(x_i, y) + E_i(y) \right)$, but gives too much importance to y far from the margin.
Slack scaling: zero loss for everything outside a margin of 1, but inference of a violated constraint, $y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$, is difficult. A sketch of the easy margin-scaling inference follows.
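As an illustration of the easy case, here is a loss-augmented Viterbi sketch for $y^M$ on a chain with Hamming loss; `node_pot[d, v]` and `edge_pot[vp, v]` are assumed to be precomputed from w and f (hypothetical names, not the paper's code).

```python
import numpy as np

def loss_augmented_map(node_pot, edge_pot, y_true):
    # y^M = argmax_y (w.f(x_i, y) + E_i(y)) for Hamming E: add +1 to every
    # wrong label's node potential, then run ordinary Viterbi.
    length, n_labels = node_pot.shape
    aug = node_pot.copy()
    for d in range(length):
        aug[d] += 1.0                    # Hamming loss for a wrong label ...
        aug[d, y_true[d]] -= 1.0         # ... and none for the correct one
    dp = aug[0].copy()
    back = np.zeros((length, n_labels), dtype=int)
    for d in range(1, length):
        trans = dp[:, None] + edge_pot   # trans[vp, v]
        back[d] = trans.argmax(axis=0)
        dp = trans.max(axis=0) + aug[d]
    y = [int(dp.argmax())]
    for d in range(length - 1, 0, -1):   # backtrace
        y.append(int(back[d, y[-1]]))
    return y[::-1]
```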
Accuracy comparison
[Figure: Span F1 error (%) of Margin vs. Slack scaling. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
Slack scaling is up to 25% better than Margin scaling.
Approximating Slack inference
Slack inference: $\max_y \ s(y) - \frac{\xi}{E(y)}$
The decomposability of E(y) cannot be exploited.
However, $-\frac{\xi}{E(y)}$ is concave in E(y), so a variational method can rewrite it as a linear function:
$-\frac{\xi}{E(y)} = \min_{\lambda \geq 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$
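A quick check of this identity (standard calculus, not shown on the slide): setting the derivative with respect to $\lambda$ to zero recovers $-\xi/E(y)$.

```latex
g(\lambda) = \lambda E(y) - 2\sqrt{\xi\lambda}, \qquad
g'(\lambda) = E(y) - \sqrt{\xi/\lambda} = 0
\;\Rightarrow\; \lambda^{*} = \frac{\xi}{E(y)^{2}}, \qquad
g(\lambda^{*}) = \frac{\xi}{E(y)} - \frac{2\xi}{E(y)} = -\frac{\xi}{E(y)}.
```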
Approximating Slack inference (contd.)
Approximate the inference problem as:
$\max_y \left( s(y) - \frac{\xi}{E(y)} \right) = \max_y \min_{\lambda \geq 0} \left[ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda} \right] \leq \min_{\lambda \geq 0} \max_y \left[ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda} \right]$
The inner maximization is the same tractable MAP as in margin scaling.
The bound is convex in $\lambda$: minimize it using line search. A bounded interval $[\lambda_l, \lambda_u]$ exists since we only want a violating y (see the sketch below).
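A minimal sketch of that line search, assuming a `loss_augmented_map(lam)` oracle that returns the MAP output and its augmented score $\max_y [s(y) + \lambda E(y)]$; the function names and the supplied interval are assumptions, not the authors' code.

```python
import math

def approx_slack_constraint(loss_augmented_map, xi, lam_lo, lam_hi, tol=1e-4):
    # Minimize h(lam) = max_y [s(y) + lam*E(y)] - 2*sqrt(xi*lam) over [lam_lo, lam_hi].
    # h is convex in lam (a max of linear functions plus the convex -2*sqrt term),
    # so a simple ternary search suffices.
    def h(lam):
        y, aug_score = loss_augmented_map(lam)
        return aug_score - 2.0 * math.sqrt(xi * lam), y

    lo, hi = lam_lo, lam_hi
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if h(m1)[0] < h(m2)[0]:
            hi = m2
        else:
            lo = m1
    best_lam = 0.5 * (lo + hi)
    _, y = h(best_lam)
    return y, best_lam               # candidate violating y and the lambda used
```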
Slack vs. ApproxSlack
[Figure: Span F1 error (%) of Margin, Slack, and ApproxSlack. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
ApproxSlack gives the accuracy gains of Slack scaling while requiring the same MAP inference as Margin scaling.
Limitation of ApproxSlack
It cannot ensure that a violating y will be found even if one exists; no value of $\lambda$ can guarantee that.
Counterexample: $s(y_1) = -1/2$, $E(y_1) = 1$; $s(y_2) = -13/18$, $E(y_2) = 2$; $s(y_3) = -5/6$, $E(y_3) = 3$; $s(\text{correct}) = 0$; $\xi = 19/36$. Here $y_2$ has the highest $s(y) - \xi/E(y)$ and is the only violating output, yet no $\lambda$ can score $y_2$ higher than both $y_1$ and $y_3$.
[Figure: the three outputs and the correct y plotted by score s(y) vs. error E(y), illustrating the counterexample above.]
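The slide's numbers can be checked directly; this snippet (values copied from the example above) verifies that $y_2$ is the violating output under slack scaling yet is never the maximizer of $s(y) + \lambda E(y)$ for any sampled $\lambda$.

```python
s = {'y1': -1/2, 'y2': -13/18, 'y3': -5/6}
E = {'y1': 1, 'y2': 2, 'y3': 3}
xi = 19/36

# Slack objective s(y) - xi/E(y): y2 is the maximizer ...
slack_obj = {y: s[y] - xi / E[y] for y in s}
assert max(slack_obj, key=slack_obj.get) == 'y2'
# ... and the only violating output (violation iff s(y) - xi/E(y) > s(correct) - 1 = -1)
assert slack_obj['y2'] > -1 > max(slack_obj['y1'], slack_obj['y3'])

# Under s(y) + lam*E(y): beating y1 needs lam > 2/9, beating y3 needs lam < 1/9,
# so no lam >= 0 ever makes y2 the argmax.
for lam in [k / 1000 for k in range(1001)]:
    aug = {y: s[y] + lam * E[y] for y in s}
    assert max(aug, key=aug.get) != 'y2'
print("no lambda recovers y2")
```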
Max-margin formulations (recap)
Margin scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq E_i(y) + w \cdot f(x_i, y) - \xi_i \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
The pitfalls of a single shared slack variable
A single slack gives inadequate coverage for decomposable losses.
Example: correct output $y_0 = [0\ 0\ 0]$; a separable alternative $y_1 = [0\ 1\ 0]$; a non-separable alternative $y_2 = [0\ 0\ 1]$.
Margin/Slack loss = 1: since $y_2$ is non-separable from $y_0$, $\xi = 1$ and training terminates. This is premature, since different features may be involved in the two mistakes.
A new loss function: PosLearn
Ensure a margin at each loss position:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \sum_c \xi_{i,c}$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \quad \forall y : y_c \neq y_{i,c}$; $\quad \xi_{i,c} \geq 0 \quad \forall i = 1,\ldots,N, \ \forall c$
Compare with slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
The pitfalls of a single shared slack variable (revisited)
Recall the example: correct $y_0 = [0\ 0\ 0]$; separable $y_1 = [0\ 1\ 0]$; non-separable $y_2 = [0\ 0\ 1]$.
Margin/Slack loss = 1: since $y_2$ is non-separable from $y_0$, $\xi = 1$ and training terminates prematurely, even though different features may be involved.
PosLearn loss = 2: training continues to optimize for $y_1$ (its error is at position 2) even after the slack at position 3, $\xi_3$, becomes 1.
Comparing loss functions
[Figure: Span F1 error (%) of Margin, Slack, ApproxSlack, and PosLearn. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
PosLearn is the same as or better than Slack and ApproxSlack.
Inference for the PosLearn QP
Cutting-plane inference: for each position c, find the best y that is wrong at c:
$\max_{y : y_c \neq y_{i,c}} \left( s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right) = \max_{y_c \neq y_{i,c}} \left( \max_{y \sim y_c} s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right)$
The inner maximization is MAP with a restriction (easy); the outer maximization is over a small enumerable set.
This can be solved simultaneously for all positions c: max-marginals for Markov models, forward-backward passes for segmentation models, and analogously for parse trees (a sketch for chains follows).
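A sketch of the chain-model case (assumed potential arrays, not the paper's code): one forward and one backward Viterbi-style pass give max-marginals $\mathrm{mm}[c, v] = \max_{y: y_c = v} s_i(y)$, from which the most violated PosLearn candidate at every position is read off by a max over the few wrong labels at that position.

```python
import numpy as np

def max_marginals(node_pot, edge_pot):
    length, n_labels = node_pot.shape
    fwd = np.zeros((length, n_labels))
    bwd = np.zeros((length, n_labels))
    fwd[0] = node_pot[0]
    for d in range(1, length):                    # forward Viterbi pass
        fwd[d] = node_pot[d] + (fwd[d - 1][:, None] + edge_pot).max(axis=0)
    for d in range(length - 2, -1, -1):           # backward Viterbi pass
        bwd[d] = (edge_pot + (node_pot[d + 1] + bwd[d + 1])[None, :]).max(axis=1)
    return fwd + bwd                              # mm[c, v] = max_{y: y_c = v} s(y)

def poslearn_candidates(node_pot, edge_pot, y_true):
    mm = max_marginals(node_pot, edge_pot)
    out = []
    for c, v_true in enumerate(y_true):
        wrong = [v for v in range(mm.shape[1]) if v != v_true]
        best = max(wrong, key=lambda v: mm[c, v])  # best y that is wrong at c
        out.append((c, best, mm[c, best]))
    # the constraint at c is violated if mm[c, best] - xi_{i,c}/E_{i,c}(best) > s_i(y_i) - 1
    return out
```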
Running time
[Figure: training time (sec) vs. percentage of training data used, for ApproxSlack, Margin, PosLearn, and Slack.]
Margin scaling may take longer with less data since good constraints may not be found early; PosLearn adds more constraints but needs fewer iterations.
Summary
1. Margin scaling is popular for computational reasons, but slack scaling is more accurate; we gave a variational approximation for slack inference.
2. A single slack variable is inadequate for structured models where errors are additive; we proposed a new loss function (PosLearn) that ensures a margin at each possible error position of y.
Future work: theoretical analysis of the generalizability of loss functions for structured models.
Questions?
Slack scaling: which constraint?
$y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$ versus $y^T = \arg\max_y \ E_i(y)\left(1 - w \cdot \delta f(x_i, y)\right)$
$y^S$ is a better coordinate-ascent direction than $y^T$ for the dual, giving faster convergence.
[Figure: training objective vs. number of constraints for the Original Slack and Primal Slack constraint choices.]
Is slack scaling the best there is?
Let $y = (y_1, y_2, \ldots, y_n)$ and $E(y) = \sum_i E(y_i)$. Two cases:
If the $y_i$ are independent: margin scaling is exactly what we want; slack scaling asks for too little margin.
If the $y_i$ are dependent: margin scaling asks for more margin than needed; slack scaling is better, but…
Max-margin loss surrogates
True error: $E_i(\arg\max_y w \cdot f(x_i, y))$. Let $w \cdot \delta f(x_i, y) = w \cdot f(x_i, y_i) - w \cdot f(x_i, y)$.
1. Margin loss: $\max_y \left[ E_i(y) - w \cdot \delta f(x_i, y) \right]_+$
2. Slack loss: $\max_y \ E_i(y) \left[ 1 - w \cdot \delta f(x_i, y) \right]_+$
[Figure: the two surrogates and the ideal loss as functions of $w \cdot \delta f(x_i, y)$, shown for $E_i(y) = 4$.]
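As a small numeric illustration (values computed here, not taken from the slide), the two surrogates for a single output y at $E_i(y) = 4$, the setting of the figure:

```python
def margin_surrogate(E, m):      # [E_i(y) - w.df(x_i, y)]_+
    return max(0.0, E - m)

def slack_surrogate(E, m):       # E_i(y) [1 - w.df(x_i, y)]_+
    return max(0.0, E * (1.0 - m))

for m in [-1.0, 0.0, 1.0, 2.0, 4.0]:
    print(m, margin_surrogate(4, m), slack_surrogate(4, m))
# The margin surrogate vanishes only once w.df >= E_i(y) = 4; the slack surrogate
# vanishes once w.df >= 1, but grows E_i(y) times faster inside the margin.
```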
Approximating slack inference (illustration)
$-\frac{\xi}{E(y)}$ is concave in E(y); the variational method rewrites it as a linear function: $-\frac{\xi}{E(y)} = \min_{\lambda \geq 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$.
[Figure: $-\xi/E(y)$ plotted as a function of E(y).]
When is margin scaling bad?
Margin scaling is adequate when there are no useful edge features, i.e. each position is independent of the others.
PosLearn is useful when edge features are important, i.e. for truly structured models.
Span F1 error (%):
Dataset  Features  Margin  PosLearn
Address  Edge      29      21.6
Address  No edge   38      36.4
Cora     Edge      25.1    16.6
Cora     No edge   56      58.3
CoNLL    Edge      15.3    14.9
CoNLL    No edge   19      20.1