Accurate Max-Margin Training for Structured Output Spaces
Sunita Sarawagi Rahul Gupta
IIT Bombay
http://www.cse.iitb.ac.in/~{sunita,grahul}
Structured learning
A model maps an input x to a structured output y, e.g. 1. a vector $y_1, y_2, \ldots, y_n$; 2. a segmentation; 3. a tree; 4. an alignment; 5. ...
Feature function vector $f(x,y) = (f_1(x,y), f_2(x,y), \ldots, f_K(x,y))$, weights $w = (w_1, \ldots, w_K)$.
Score of a prediction y for input x: $s(x,y) = w \cdot f(x,y)$
Predict: $y^* = \arg\max_y s(x,y)$
Exploit decomposability of the feature functions: $f(x,y) = \sum_d f(x, y_d, d)$
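To make the scoring and decomposition concrete, here is a minimal Python sketch (not from the slides) of $s(x,y) = w \cdot f(x,y)$ with per-position features and Viterbi argmax prediction for a chain-structured y; the feature function `feat(x, y_prev, y_d, d)` is a hypothetical stand-in.

```python
import numpy as np

def score(w, x, y, feat):
    # s(x, y) = w . f(x, y), with f decomposed over positions d
    return sum(np.dot(w, feat(x, y[d - 1] if d > 0 else None, y[d], d))
               for d in range(len(y)))

def predict(w, x, n_labels, length, feat):
    # y* = argmax_y s(x, y) via dynamic programming (Viterbi)
    dp = np.array([np.dot(w, feat(x, None, v, 0)) for v in range(n_labels)])
    back = np.zeros((length, n_labels), dtype=int)
    for d in range(1, length):
        scores = np.array([[dp[vp] + np.dot(w, feat(x, vp, v, d))
                            for v in range(n_labels)] for vp in range(n_labels)])
        back[d] = scores.argmax(axis=0)   # best previous label for each current label
        dp = scores.max(axis=0)
    y = [int(dp.argmax())]
    for d in range(length - 1, 0, -1):    # backtrace
        y.append(int(back[d, y[-1]]))
    return y[::-1]
```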
Training structured models
Given: N input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
Error of an output: $E_i(y)$, which also decomposes over smaller parts: $E_i(y) = \sum_c E_{i,c}(y_c)$. Example: $\mathrm{Hamming}(y_i, y) = \sum_c [[y_{i,c} \neq y_c]]$ (see the sketch below).
Find w such that: training error is small, it generalizes to unseen instances, and training is efficient for structured models.
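A tiny check (example values are ours, not from the slides) that the Hamming error decomposes into per-position terms:

```python
# Hamming(y_i, y) = sum_c [[y_{i,c} != y_c]]: each position contributes 0 or 1.
def hamming(y_true, y_pred):
    return sum(int(t != p) for t, p in zip(y_true, y_pred))

print(hamming([0, 0, 0], [0, 1, 1]))  # 2: positions 2 and 3 each contribute 1
```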
Related work: max-margin training of structured models
Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep), 1453–1484.
Several others…
Max-margin formulations
Margin scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq E_i(y) + w \cdot f(x_i, y) - \xi_i \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Both have an exponential number of constraints: use a cutting-plane method (see the sketch below).
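The cutting-plane loop referred to above can be sketched as follows; this is a schematic under assumptions, with `most_violated` (separation oracle) and `solve_qp` (QP over the active constraint set) as hypothetical routines, not the authors' implementation.

```python
def cutting_plane_train(examples, most_violated, solve_qp, C, eps=1e-3, max_iter=100):
    active = []                           # working set of (i, y) constraints
    w, slacks = solve_qp(active, C)       # solve the QP over the current active set
    for _ in range(max_iter):
        added = 0
        for i in range(len(examples)):
            y_bad, violation = most_violated(w, slacks, i)
            if violation > eps:           # violated beyond tolerance: add a cut
                active.append((i, y_bad))
                added += 1
        if added == 0:                    # no constraint violated by more than eps
            break
        w, slacks = solve_qp(active, C)   # re-solve with the enlarged constraint set
    return w
```

The only piece that differs between margin and slack scaling is the `most_violated` oracle, which is the subject of the next slides.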
Margin vs. Slack scaling
Margin scaling: easy inference of the most violated constraint for decomposable f and E, $y^M = \arg\max_y \left( w \cdot f(x_i, y) + E_i(y) \right)$, but gives too much importance to y far from the margin.
Slack scaling: zero loss for everything outside a margin of 1, but inference of a violated constraint, $y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$, is difficult. A sketch of the easy margin-scaling inference follows.
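As an illustration of the easy case, here is a loss-augmented Viterbi sketch for $y^M$ on a chain with Hamming loss; `node_pot[d, v]` and `edge_pot[vp, v]` are assumed to be precomputed from w and f (hypothetical names, not the paper's code).

```python
import numpy as np

def loss_augmented_map(node_pot, edge_pot, y_true):
    # y^M = argmax_y (w.f(x_i, y) + E_i(y)) for Hamming E: add +1 to every
    # wrong label's node potential, then run ordinary Viterbi.
    length, n_labels = node_pot.shape
    aug = node_pot.copy()
    for d in range(length):
        aug[d] += 1.0                    # Hamming loss for a wrong label ...
        aug[d, y_true[d]] -= 1.0         # ... and none for the correct one
    dp = aug[0].copy()
    back = np.zeros((length, n_labels), dtype=int)
    for d in range(1, length):
        trans = dp[:, None] + edge_pot   # trans[vp, v]
        back[d] = trans.argmax(axis=0)
        dp = trans.max(axis=0) + aug[d]
    y = [int(dp.argmax())]
    for d in range(length - 1, 0, -1):   # backtrace
        y.append(int(back[d, y[-1]]))
    return y[::-1]
```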
Accuracy comparison
[Figure: Span F1 error (%) of Margin vs. Slack scaling. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
Slack scaling is up to 25% better than Margin scaling.
Approximating Slack inference
Slack inference: $\max_y \ s(y) - \frac{\xi}{E(y)}$
The decomposability of E(y) cannot be exploited.
However, $-\frac{\xi}{E(y)}$ is concave in E(y), so a variational method can rewrite it as a linear function:
$-\frac{\xi}{E(y)} = \min_{\lambda \geq 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$
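A quick check of this identity (standard calculus, not shown on the slide): setting the derivative with respect to $\lambda$ to zero recovers $-\xi/E(y)$.

```latex
g(\lambda) = \lambda E(y) - 2\sqrt{\xi\lambda}, \qquad
g'(\lambda) = E(y) - \sqrt{\xi/\lambda} = 0
\;\Rightarrow\; \lambda^{*} = \frac{\xi}{E(y)^{2}}, \qquad
g(\lambda^{*}) = \frac{\xi}{E(y)} - \frac{2\xi}{E(y)} = -\frac{\xi}{E(y)}.
```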
Approximating Slack inference (contd.)
Approximate the inference problem as:
$\max_y \left( s(y) - \frac{\xi}{E(y)} \right) = \max_y \min_{\lambda \geq 0} \left[ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda} \right] \leq \min_{\lambda \geq 0} \max_y \left[ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda} \right]$
The inner maximization is the same tractable MAP as in margin scaling.
The bound is convex in $\lambda$: minimize it using line search. A bounded interval $[\lambda_l, \lambda_u]$ exists since we only want a violating y (see the sketch below).
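A minimal sketch of that line search, assuming a `loss_augmented_map(lam)` oracle that returns the MAP output and its augmented score $\max_y [s(y) + \lambda E(y)]$; the function names and the supplied interval are assumptions, not the authors' code.

```python
import math

def approx_slack_constraint(loss_augmented_map, xi, lam_lo, lam_hi, tol=1e-4):
    # Minimize h(lam) = max_y [s(y) + lam*E(y)] - 2*sqrt(xi*lam) over [lam_lo, lam_hi].
    # h is convex in lam (a max of linear functions plus the convex -2*sqrt term),
    # so a simple ternary search suffices.
    def h(lam):
        y, aug_score = loss_augmented_map(lam)
        return aug_score - 2.0 * math.sqrt(xi * lam), y

    lo, hi = lam_lo, lam_hi
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if h(m1)[0] < h(m2)[0]:
            hi = m2
        else:
            lo = m1
    best_lam = 0.5 * (lo + hi)
    _, y = h(best_lam)
    return y, best_lam               # candidate violating y and the lambda used
```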
Slack vs. ApproxSlack
[Figure: Span F1 error (%) of Margin, Slack, and ApproxSlack. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
ApproxSlack gives the accuracy gains of Slack scaling while requiring the same MAP inference as Margin scaling.
Limitation of ApproxSlack
It cannot ensure that a violating y will be found even if one exists; no value of $\lambda$ can guarantee that.
Counterexample: $s(y_1) = -1/2$, $E(y_1) = 1$; $s(y_2) = -13/18$, $E(y_2) = 2$; $s(y_3) = -5/6$, $E(y_3) = 3$; $s(\text{correct}) = 0$; $\xi = 19/36$. Here $y_2$ has the highest $s(y) - \xi/E(y)$ and is the only violating output, yet no $\lambda$ can score $y_2$ higher than both $y_1$ and $y_3$.
[Figure: the three outputs and the correct y plotted by score s(y) vs. error E(y), illustrating the counterexample above.]
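The slide's numbers can be checked directly; this snippet (values copied from the example above) verifies that $y_2$ is the violating output under slack scaling yet is never the maximizer of $s(y) + \lambda E(y)$ for any sampled $\lambda$.

```python
s = {'y1': -1/2, 'y2': -13/18, 'y3': -5/6}
E = {'y1': 1, 'y2': 2, 'y3': 3}
xi = 19/36

# Slack objective s(y) - xi/E(y): y2 is the maximizer ...
slack_obj = {y: s[y] - xi / E[y] for y in s}
assert max(slack_obj, key=slack_obj.get) == 'y2'
# ... and the only violating output (violation iff s(y) - xi/E(y) > s(correct) - 1 = -1)
assert slack_obj['y2'] > -1 > max(slack_obj['y1'], slack_obj['y3'])

# Under s(y) + lam*E(y): beating y1 needs lam > 2/9, beating y3 needs lam < 1/9,
# so no lam >= 0 ever makes y2 the argmax.
for lam in [k / 1000 for k in range(1001)]:
    aug = {y: s[y] + lam * E[y] for y in s}
    assert max(aug, key=aug.get) != 'y2'
print("no lambda recovers y2")
```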
Max-margin formulations (recap)
Margin scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq E_i(y) + w \cdot f(x_i, y) - \xi_i \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
Slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
The pitfalls of a single shared slack variable
A single slack gives inadequate coverage for decomposable losses.
Example: correct output $y_0 = [0\ 0\ 0]$; a separable alternative $y_1 = [0\ 1\ 0]$; a non-separable alternative $y_2 = [0\ 0\ 1]$.
Margin/Slack loss = 1: since $y_2$ is non-separable from $y_0$, $\xi = 1$ and training terminates. This is premature, since different features may be involved in the two mistakes.
A new loss function: PosLearn
Ensure a margin at each loss position:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \sum_c \xi_{i,c}$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \quad \forall y : y_c \neq y_{i,c}$; $\quad \xi_{i,c} \geq 0 \quad \forall i = 1,\ldots,N, \ \forall c$
Compare with slack scaling:
$\min_{w,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i$
s.t. $w \cdot f(x_i, y_i) \geq 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \neq y_i, \forall i$; $\quad \xi_i \geq 0 \ \forall i$
The pitfalls of a single shared slack variable (revisited)
Recall the example: correct $y_0 = [0\ 0\ 0]$; separable $y_1 = [0\ 1\ 0]$; non-separable $y_2 = [0\ 0\ 1]$.
Margin/Slack loss = 1: since $y_2$ is non-separable from $y_0$, $\xi = 1$ and training terminates prematurely, even though different features may be involved.
PosLearn loss = 2: training continues to optimize for $y_1$ (its error is at position 2) even after the slack at position 3, $\xi_3$, becomes 1.
Comparing loss functions
[Figure: Span F1 error (%) of Margin, Slack, ApproxSlack, and PosLearn. Left panel: sequence labeling on Address, Cora, CoNLL03. Right panel: segmentation on Address, Cora.]
PosLearn is the same as or better than Slack and ApproxSlack.
Inference for the PosLearn QP
Cutting-plane inference: for each position c, find the best y that is wrong at c:
$\max_{y : y_c \neq y_{i,c}} \left( s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right) = \max_{y_c \neq y_{i,c}} \left( \max_{y \sim y_c} s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right)$
The inner maximization is MAP with a restriction (easy); the outer maximization is over a small enumerable set.
This can be solved simultaneously for all positions c: max-marginals for Markov models, forward-backward passes for segmentation models, and analogously for parse trees (a sketch for chains follows).
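A sketch of the chain-model case (assumed potential arrays, not the paper's code): one forward and one backward Viterbi-style pass give max-marginals $\mathrm{mm}[c, v] = \max_{y: y_c = v} s_i(y)$, from which the most violated PosLearn candidate at every position is read off by a max over the few wrong labels at that position.

```python
import numpy as np

def max_marginals(node_pot, edge_pot):
    length, n_labels = node_pot.shape
    fwd = np.zeros((length, n_labels))
    bwd = np.zeros((length, n_labels))
    fwd[0] = node_pot[0]
    for d in range(1, length):                    # forward Viterbi pass
        fwd[d] = node_pot[d] + (fwd[d - 1][:, None] + edge_pot).max(axis=0)
    for d in range(length - 2, -1, -1):           # backward Viterbi pass
        bwd[d] = (edge_pot + (node_pot[d + 1] + bwd[d + 1])[None, :]).max(axis=1)
    return fwd + bwd                              # mm[c, v] = max_{y: y_c = v} s(y)

def poslearn_candidates(node_pot, edge_pot, y_true):
    mm = max_marginals(node_pot, edge_pot)
    out = []
    for c, v_true in enumerate(y_true):
        wrong = [v for v in range(mm.shape[1]) if v != v_true]
        best = max(wrong, key=lambda v: mm[c, v])  # best y that is wrong at c
        out.append((c, best, mm[c, best]))
    # the constraint at c is violated if mm[c, best] - xi_{i,c}/E_{i,c}(best) > s_i(y_i) - 1
    return out
```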
Running time
[Figure: training time (sec) vs. percentage of training data used, for ApproxSlack, Margin, PosLearn, and Slack.]
Margin scaling may take longer with less data since good constraints may not be found early; PosLearn adds more constraints but needs fewer iterations.
Summary
1. Margin scaling is popular for computational reasons, but slack scaling is more accurate; we gave a variational approximation for slack inference.
2. A single slack variable is inadequate for structured models where errors are additive; we proposed a new loss function (PosLearn) that ensures a margin at each possible error position of y.
Future work: theoretical analysis of the generalizability of loss functions for structured models.
Questions?
Slack scaling: which constraint?
$y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$ versus $y^T = \arg\max_y \ E_i(y)\left(1 - w \cdot \delta f(x_i, y)\right)$
$y^S$ is a better coordinate-ascent direction than $y^T$ for the dual, giving faster convergence.
[Figure: training objective vs. number of constraints for the Original Slack and Primal Slack constraint choices.]
Is slack scaling the best there is?
Let $y = (y_1, y_2, \ldots, y_n)$ and $E(y) = \sum_i E(y_i)$. Two cases:
If the $y_i$ are independent: margin scaling is exactly what we want; slack scaling asks for too little margin.
If the $y_i$ are dependent: margin scaling asks for more margin than needed; slack scaling is better, but…
Max-margin loss surrogates
True error: $E_i(\arg\max_y w \cdot f(x_i, y))$. Let $w \cdot \delta f(x_i, y) = w \cdot f(x_i, y_i) - w \cdot f(x_i, y)$.
1. Margin loss: $\max_y \left[ E_i(y) - w \cdot \delta f(x_i, y) \right]_+$
2. Slack loss: $\max_y \ E_i(y) \left[ 1 - w \cdot \delta f(x_i, y) \right]_+$
[Figure: the two surrogates and the ideal loss as functions of $w \cdot \delta f(x_i, y)$, shown for $E_i(y) = 4$.]
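As a small numeric illustration (values computed here, not taken from the slide), the two surrogates for a single output y at $E_i(y) = 4$, the setting of the figure:

```python
def margin_surrogate(E, m):      # [E_i(y) - w.df(x_i, y)]_+
    return max(0.0, E - m)

def slack_surrogate(E, m):       # E_i(y) [1 - w.df(x_i, y)]_+
    return max(0.0, E * (1.0 - m))

for m in [-1.0, 0.0, 1.0, 2.0, 4.0]:
    print(m, margin_surrogate(4, m), slack_surrogate(4, m))
# The margin surrogate vanishes only once w.df >= E_i(y) = 4; the slack surrogate
# vanishes once w.df >= 1, but grows E_i(y) times faster inside the margin.
```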
Approximating slack inference (illustration)
$-\frac{\xi}{E(y)}$ is concave in E(y); the variational method rewrites it as a linear function: $-\frac{\xi}{E(y)} = \min_{\lambda \geq 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$.
[Figure: $-\xi/E(y)$ plotted as a function of E(y).]
When is margin scaling bad?
Margin scaling is adequate when there are no useful edge features, i.e. each position is independent of the others.
PosLearn is useful when edge features are important, i.e. for truly structured models.
Span F1 error (%):
Dataset  Features  Margin  PosLearn
Address  Edge      29      21.6
Address  No edge   38      36.4
Cora     Edge      25.1    16.6
Cora     No edge   56      58.3
CoNLL    Edge      15.3    14.9
CoNLL    No edge   19      20.1