Accurate Max-Margin Training for Structured Output Spaces
Sunita Sarawagi Rahul Gupta
IIT Bombay
http://www.cse.iitb.ac.in/~{sunita,grahul}
Structured learning
Score of a prediction y for input x: $s(x,y) = w \cdot f(x,y)$
Predict: $y^* = \arg\max_y s(x,y)$
Exploit decomposability of feature functions: $f(x,y) = \sum_d f(x, y_d, d)$
Feature function vector: $f(x,y) = (f_1(x,y), f_2(x,y), \ldots, f_K(x,y))$, with weights $w = (w_1, \ldots, w_K)$
x → Model → structured y:
1. Vector: $y_1, y_2, \ldots, y_n$
2. Segmentation
3. Tree
4. Alignment
5. ...
Training structured models
Given N input-output pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
Error of output: $E_i(y)$
Also decomposes over smaller parts: $E_i(y) = \sum_c E_{i,c}(y_c)$
Example: $\mathrm{Hamming}(y_i, y) = \sum_c [[y_{i,c} \ne y_c]]$
Find w with:
Small training error
Generalizes to unseen instances
Efficient for structured models
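The decomposable Hamming error above can be sketched in a few lines. This is a minimal illustration, assuming outputs are equal-length label sequences; the function names `hamming_parts` and `hamming_error` are hypothetical and not from the slides.

```python
from typing import List, Sequence


def hamming_parts(y_true: Sequence[int], y_pred: Sequence[int]) -> List[int]:
    """Per-position errors E_{i,c}(y_c) = [[y_{i,c} != y_c]]."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    return [int(t != p) for t, p in zip(y_true, y_pred)]


def hamming_error(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Total error E_i(y), the sum of the per-position errors."""
    return sum(hamming_parts(y_true, y_pred))
```

Because the total is a plain sum over positions, it can be folded into per-position scores during inference.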
Related work: max-margin training of structured models
Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6(Sep), 1453–1484.
Several others…
Max-margin formulations
Margin scaling
$$\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad w \cdot f(x_i, y_i) \ge E_i(y) + w \cdot f(x_i, y) - \xi_i \quad \forall y \ne y_i, \ \forall i$$
$$\xi_i \ge 0 \quad \forall i$$
Slack scaling
$$\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad w \cdot f(x_i, y_i) \ge 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \ne y_i, \ \forall i$$
$$\xi_i \ge 0 \quad \forall i$$
Exponential number of constraints: use cutting plane
Margin vs. Slack scaling
Margin
Easy inference of most violated constraint for decomposable f and E:
$$y^M = \arg\max_y \left( w \cdot f(x_i, y) + E_i(y) \right)$$
Too much importance to y far from margin
Slack
Difficult inference of violated constraint:
$$y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$$
Zero loss of everything outside margin of 1
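To see why margin-scaling inference is easy when f and E decompose, here is a minimal sketch for a linear-chain model with Hamming error. The chain parameterization (a `node` score table per position and a shared `edge` transition table) and the function names are assumptions for illustration, not the authors' implementation. The error is simply added to the node scores before a standard Viterbi pass.

```python
from typing import List, Sequence, Tuple


def viterbi(node: Sequence[Sequence[float]],
            edge: Sequence[Sequence[float]]) -> Tuple[float, List[int]]:
    """Best-scoring label path for a chain with node and edge scores."""
    n, k = len(node), len(node[0])
    best = list(node[0])          # best[b]: best prefix score ending in label b
    back: List[List[int]] = []    # back-pointers, one list per position >= 1
    for c in range(1, n):
        new_best, ptr = [], []
        for b in range(k):
            a_star = max(range(k), key=lambda a: best[a] + edge[a][b])
            new_best.append(best[a_star] + edge[a_star][b] + node[c][b])
            ptr.append(a_star)
        best = new_best
        back.append(ptr)
    last = max(range(k), key=lambda b: best[b])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path


def most_violated_margin(node, edge, y_true):
    """argmax_y (s(y) + Hamming(y_true, y)) via loss-augmented node scores."""
    augmented = [
        [score + (0.0 if label == y_true[c] else 1.0)
         for label, score in enumerate(row)]
        for c, row in enumerate(node)
    ]
    return viterbi(augmented, edge)
```

The slack-scaling objective has no such additive form, because the term $\xi_i / E_i(y)$ depends on the total error of y rather than on each position separately.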
Accuracy comparison
[Figure: span F1 error of Margin and Slack. Panel "Sequence labeling": Address, Cora, CoNLL03. Panel "Segmentation": Address, Cora.]
Slack scaling up to 25% better than Margin scaling.
Approximating Slack inference
Slack inference: $\max_y \ s(y) - \xi/E(y)$
Decomposability of E(y) cannot be exploited.
$-\xi/E(y)$ is concave in E(y)
Variational method to rewrite as linear function:
$$-\frac{\xi}{E(y)} = \min_{\lambda \ge 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$$
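The variational identity can be checked by minimizing over $\lambda$ directly. A short derivation, assuming $\xi > 0$ and $E(y) > 0$ (the function name $g$ is introduced here only for the check):

```latex
\begin{aligned}
g(\lambda) &= \lambda E(y) - 2\sqrt{\xi\lambda}, \qquad \lambda \ge 0 \\
g'(\lambda) &= E(y) - \sqrt{\xi/\lambda} = 0
  \;\Rightarrow\; \lambda^{*} = \frac{\xi}{E(y)^{2}} \\
g(\lambda^{*}) &= \frac{\xi}{E(y)} - \frac{2\xi}{E(y)} = -\frac{\xi}{E(y)}
\end{aligned}
```

Since $-2\sqrt{\xi\lambda}$ is convex in $\lambda$, $g$ is convex and the stationary point is the global minimum.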
Approximating slack inference
Now approximate the inference problem as:
$$\max_y \left( s(y) - \frac{\xi}{E(y)} \right) = \max_y \min_{\lambda \ge 0} \ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda}$$
$$\le \min_{\lambda \ge 0} \max_y \ s(y) + \lambda E(y) - 2\sqrt{\xi\lambda}$$
The inner $\max_y$ is the same tractable MAP as in margin scaling.
Convex in $\lambda$: minimize using line search.
A bounded interval $[\lambda_l, \lambda_u]$ exists since we only want a violating y.
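The line search over $\lambda$ can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the candidate outputs are given as an explicit list of `(s, E)` pairs so that a brute-force `max` can stand in for the loss-augmented MAP call, the bounds `lam_lo` and `lam_hi` are supplied by the caller, and golden-section search is one possible choice of line search for a convex one-dimensional function.

```python
import math
from typing import List, Tuple


def approx_slack_inference(candidates: List[Tuple[float, float]], xi: float,
                           lam_lo: float, lam_hi: float,
                           tol: float = 1e-6) -> Tuple[float, int, float]:
    """Minimize max_y s(y) + lam*E(y) - 2*sqrt(xi*lam) over lam in [lam_lo, lam_hi].

    Returns (lam, index of the maximizing candidate, bound value).
    """
    def loss_augmented_map(lam: float) -> int:
        # Stand-in for the MAP call used in margin scaling (error weighted by lam).
        return max(range(len(candidates)),
                   key=lambda j: candidates[j][0] + lam * candidates[j][1])

    def bound(lam: float) -> float:
        s, e = candidates[loss_augmented_map(lam)]
        return s + lam * e - 2.0 * math.sqrt(xi * lam)

    phi = (math.sqrt(5.0) - 1.0) / 2.0   # golden-ratio step
    a, b = lam_lo, lam_hi
    while b - a > tol:
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if bound(c) < bound(d):
            b = d        # minimum lies in [a, d]
        else:
            a = c        # minimum lies in [c, b]
    lam = 0.5 * (a + b)
    return lam, loss_augmented_map(lam), bound(lam)
```

With a single candidate having $s = 0$, $E = 2$ and $\xi = 1$, the search should settle near $\lambda = \xi / E^2 = 0.25$, where the bound equals $-\xi/E = -0.5$.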
Slack Vs ApproxSlack
[Figure: span F1 error of Margin, Slack, and ApproxSlack. Panel "Sequence labeling": Address, Cora, CoNLL03. Panel "Segmentation": Address, Cora.]
ApproxSlack gives the accuracy gains of Slack scaling while requiring only the same MAP inference as Margin scaling.
Limitation of ApproxSlack
Cannot ensure that a violating y will be found even if it exists: no $\lambda$ can ensure that.
Proof: $s(y_1) = -1/2$, $E(y_1) = 1$; $s(y_2) = -13/18$, $E(y_2) = 2$; $s(y_3) = -5/6$, $E(y_3) = 3$; $s(\text{correct}) = 0$; $\xi = 19/36$.
$y_2$ has the highest $s(y) - \xi/E(y)$ and is violating. No $\lambda$ can score $y_2$ higher than both $y_1$ and $y_3$.
[Figure: E(y) plotted against s(y) for Correct, y1, y2, and y3.]
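The counterexample can be verified with exact rational arithmetic. The sketch below uses only the numbers given in the proof; the variable names are illustrative.

```python
from fractions import Fraction as F

# Scores and errors from the proof.
s = {"y1": F(-1, 2), "y2": F(-13, 18), "y3": F(-5, 6)}
E = {"y1": 1, "y2": 2, "y3": 3}
s_correct, xi = F(0), F(19, 36)

# Exact slack-scaling objective s(y) - xi / E(y) for each output.
slack_obj = {y: s[y] - xi / E[y] for y in s}
best = max(slack_obj, key=slack_obj.get)

# y violates its constraint when s(correct) < 1 + s(y) - xi / E(y).
violating = [y for y in s if s_correct < 1 + slack_obj[y]]

# Under s(y) + lam * E(y), y2 beats y1 only if lam > lam_lower,
# and y2 beats y3 only if lam < lam_upper.
lam_lower = (s["y1"] - s["y2"]) / (E["y2"] - E["y1"])
lam_upper = (s["y2"] - s["y3"]) / (E["y3"] - E["y2"])
no_lambda_works = lam_lower > lam_upper
```

Here `lam_lower` is 2/9 and `lam_upper` is 1/9, so the interval of $\lambda$ values for which $y_2$ wins is empty, even though $y_2$ is the only violating output.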
The pitfalls of a single shared slack variable
Inadequate coverage for decomposable losses
[Figure labels: s = 0, s = -3]
Correct: $y_0$ = [0 0 0]
Separable: $y_1$ = [0 1 0]
Non-separable: $y_2$ = [0 0 1]
Margin/Slack loss = 1. Since $y_2$ is non-separable from $y_0$, $\xi = 1$: terminate.
Premature, since different features may be involved.
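A new loss function: PosLearn
Ensure margin at each loss position
$$\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \sum_{c} \xi_{i,c}$$
$$\text{s.t.} \quad w \cdot f(x_i, y_i) \ge 1 + w \cdot f(x_i, y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \quad \forall y : y_c \ne y_{i,c}$$
$$\xi_{i,c} \ge 0 \quad i = 1, \ldots, N, \ \forall c$$
Compare with slack scaling:
$$\min_{w,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$
$$\text{s.t.} \quad w \cdot f(x_i, y_i) \ge 1 + w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \quad \forall y \ne y_i, \ \forall i$$
$$\xi_i \ge 0 \quad \forall i$$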
The pitfalls of a single shared slack variable
Inadequate coverage for decomposable losses
Correct: $y_0$ = [0 0 0]; separable: $y_1$ = [0 1 0]; non-separable: $y_2$ = [0 0 1]
Margin/Slack loss = 1. Since $y_2$ is non-separable from $y_0$, $\xi = 1$: terminate. Premature, since different features may be involved.
PosLearn loss = 2. It will continue to optimize for $y_1$ even after slack $\xi_3$ becomes 1.
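The difference between one shared slack and per-position slacks can be illustrated on the three outputs above. This is a sketch under assumptions: the error is Hamming (so $E_{i,c} = 1$ at a wrong position), only $y_1$ and $y_2$ are considered as competing outputs, and the score gaps $w \cdot f(x_i, y_i) - w \cdot f(x_i, y)$ passed in are hypothetical inputs rather than values from the slides.

```python
from typing import Sequence


def hamming(y_true: Sequence[int], y: Sequence[int]) -> int:
    return sum(int(t != p) for t, p in zip(y_true, y))


def shared_slack_loss(y_true, candidates, gaps) -> float:
    """Slack-scaling loss: max_y E(y) * [1 - gap(y)]_+ with a single slack."""
    return max(hamming(y_true, y) * max(1.0 - g, 0.0)
               for y, g in zip(candidates, gaps))


def poslearn_loss(y_true, candidates, gaps) -> float:
    """Sum over positions c of the slack needed by outputs that are wrong at c."""
    total = 0.0
    for c in range(len(y_true)):
        needed = [max(1.0 - g, 0.0)
                  for y, g in zip(candidates, gaps) if y[c] != y_true[c]]
        total += max(needed, default=0.0)
    return total


y0, y1, y2 = [0, 0, 0], [0, 1, 0], [0, 0, 1]
# Assumed score gaps: both competitors tie with the correct output.
shared = shared_slack_loss(y0, [y1, y2], [0.0, 0.0])
per_position = poslearn_loss(y0, [y1, y2], [0.0, 0.0])
```

With both gaps set to zero, the shared-slack loss is 1 and the per-position loss is 2, the same values the slide quotes. The per-position loss still registers $y_1$ after the slack for the third position is fixed by $y_2$.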
Comparing loss functions
[Figure: span F1 error of Margin, Slack, ApproxSlack, and PosLearn. Panel "Sequence labeling": Address, Cora, CoNLL03. Panel "Segmentation": Address, Cora.]
PosLearn: same or better than Slack and ApproxSlack
Inference for PosLearn QP
Cutting plane inference
For each position c, find the best y that is wrong at c:
$$\max_{y : y_c \ne y_{i,c}} \left( s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right) = \max_{y_c \ne y_{i,c}} \left( \max_{y \sim y_c} s_i(y) - \frac{\xi_{i,c}}{E_{i,c}(y_c)} \right)$$
Inner max over $y \sim y_c$: MAP with restriction, easy!
Outer max over $y_c$: small enumerable set
Solve simultaneously for all positions c:
Markov models: max-marginals
Segmentation models: forward-backward passes
Parse trees
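For a Markov chain, the restricted MAP values for every position and label can be obtained together as max-marginals. The sketch below is an illustration under assumptions, not the authors' code: it uses a linear chain with hypothetical `node` and `edge` score tables, Hamming error (so $E_{i,c}(y_c) = 1$ for any wrong label), and function names chosen for this example.

```python
from typing import List, Sequence, Tuple


def max_marginals(node: Sequence[Sequence[float]],
                  edge: Sequence[Sequence[float]]) -> List[List[float]]:
    """mm[c][b] = best total score over all paths y with y_c = b."""
    n, k = len(node), len(node[0])
    fwd = [list(node[0])]                     # best prefix ending at (c, b)
    for c in range(1, n):
        fwd.append([node[c][b] + max(fwd[c - 1][a] + edge[a][b] for a in range(k))
                    for b in range(k)])
    bwd = [[0.0] * k for _ in range(n)]       # best suffix after (c, a)
    for c in range(n - 2, -1, -1):
        bwd[c] = [max(edge[a][b] + node[c + 1][b] + bwd[c + 1][b] for b in range(k))
                  for a in range(k)]
    return [[fwd[c][b] + bwd[c][b] for b in range(k)] for c in range(n)]


def poslearn_candidates(node, edge, y_true: Sequence[int],
                        xi: Sequence[float]) -> List[Tuple[int, int, float]]:
    """Per position c: (c, best wrong label, s(y) - xi_c / E_c) for that label."""
    mm = max_marginals(node, edge)
    out = []
    for c, row in enumerate(mm):
        wrong = [b for b in range(len(row)) if b != y_true[c]]
        # Hamming error: E_{i,c}(y_c) = 1 for every wrong label.
        b_star = max(wrong, key=lambda b: row[b] - xi[c] / 1.0)
        out.append((c, b_star, row[b_star] - xi[c]))
    return out
```

One forward pass and one backward pass cover all positions, which is why the per-position constraints do not require a separate MAP call for each c.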
Running time
[Figure: training time (sec) versus training percent (0 to 100) for ApproxSlack, Margin, PosLearn, and Slack.]
Margin scaling may take longer with less data, since good constraints may not be found early.
PosLearn adds more constraints but needs fewer iterations.
Summary
1. Margin scaling is popular for computational reasons, but slack scaling is more accurate.
A variational approximation for slack inference.
2. A single slack variable is inadequate for structured models where errors are additive.
A new loss function that ensures margin at each possible error position of y.
Future work: theoretical analysis of generalizability of loss functions for structured models.
Questions?
Slack scaling: which constraint?
$$y^S = \arg\max_y \left( w \cdot f(x_i, y) - \frac{\xi_i}{E_i(y)} \right)$$
versus
$$y^T = \arg\max_y \ E_i(y) \left( 1 - w \cdot \delta f(x_i, y) \right)$$
$y^S$ is a better coordinate ascent direction than $y^T$ for the dual: faster convergence.
[Figure: training objective versus number of constraints (0 to 1200) for the curves "Original Slack" and "Primal Slack".]
Is slack scaling the best there is?
Y = vector $y_1, y_2, \ldots, y_n$; $E(y) = \sum_i E(y_i)$
Two cases:
1. The $Y_i$ are independent: margin scaling is exactly what we want! Slack scaling puts too little margin.
2. The $Y_i$ are dependent: margin scaling asks for more margin than needed. Slack scaling better but
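Max-margin loss surrogates
True error: $E_i(\arg\max_y w \cdot f(x_i, y))$
Let $w \cdot \delta f(x_i, y) = w \cdot f(x_i, y_i) - w \cdot f(x_i, y)$
1. Margin loss: $\max_y \ [E_i(y) - w \cdot \delta f(x_i, y)]_+$
2. Slack loss: $\max_y \ E_i(y) \, [1 - w \cdot \delta f(x_i, y)]_+$
[Figure: Slack, Margin, and Ideal loss as a function of $w \cdot \delta f(x_i, y)$ (from -2 to 4) for $E_i(y) = 4$.]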
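The two surrogates are easy to compare numerically. A minimal sketch, assuming the competing outputs are given as parallel lists of errors $E_i(y)$ and score gaps $w \cdot \delta f(x_i, y)$; the function names are illustrative.

```python
from typing import Sequence


def margin_loss(errors: Sequence[float], gaps: Sequence[float]) -> float:
    """max_y [E_i(y) - gap(y)]_+ over the listed outputs."""
    return max(max(e - g, 0.0) for e, g in zip(errors, gaps))


def slack_loss(errors: Sequence[float], gaps: Sequence[float]) -> float:
    """max_y E_i(y) * [1 - gap(y)]_+ over the listed outputs."""
    return max(e * max(1.0 - g, 0.0) for e, g in zip(errors, gaps))
```

For a single output with $E_i(y) = 4$ and a score gap of 2, the slack loss is already zero while the margin loss is still 2, which illustrates the earlier point that margin scaling gives importance to outputs that are already outside a margin of 1.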
Approximating slack inference
$-\xi/E(y)$ is concave in E(y)
Variational method to rewrite as linear function:
$$-\frac{\xi}{E(y)} = \min_{\lambda \ge 0} \ \lambda E(y) - 2\sqrt{\xi\lambda}$$
[Figure: $-\xi/E(y)$ plotted against E(y).]
When is margin scaling bad?
Margin scaling is adequate when there are no useful edge features: each position is independent of the others.
PosLearn is useful when edge features are important: truly structured models.
F1 span error:
Dataset   Features   Margin   PosLearn
Address   Edge       29       21.6
Address   No Edge    38       36.4
Cora      Edge       25.1     16.6
Cora      No Edge    56       58.3
CoNLL     Edge       15.3     14.9
CoNLL     No Edge    19       20.1
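The size of the effect is easier to read as a relative change. The sketch below computes the relative reduction in span F1 error of PosLearn over Margin from the numbers in the table, assuming the first value in each pair is Margin and the second is PosLearn; the helper name `relative_reduction` is illustrative.

```python
# (dataset, features) -> (Margin error, PosLearn error), copied from the table.
results = {
    ("Address", "Edge"): (29.0, 21.6),
    ("Address", "No Edge"): (38.0, 36.4),
    ("Cora", "Edge"): (25.1, 16.6),
    ("Cora", "No Edge"): (56.0, 58.3),
    ("CoNLL", "Edge"): (15.3, 14.9),
    ("CoNLL", "No Edge"): (19.0, 20.1),
}


def relative_reduction(margin_err: float, poslearn_err: float) -> float:
    """Percent reduction in error of PosLearn relative to Margin."""
    return 100.0 * (margin_err - poslearn_err) / margin_err


reductions = {key: relative_reduction(*errs) for key, errs in results.items()}
```

With edge features the reduction is positive on all three datasets (about 25.5% on Address). Without edge features it is small on Address and negative on Cora and CoNLL.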