Transcript of vene.ro/talks/19-edinburgh.pdf

[Results preview: sentiment classification (SST), LTR vs. Flat, accuracy 80–85%; natural language inference (SNLI), LTR / Flat / CoreNLP / Latent, accuracy 80.6–81.4%]
Learning with Sparse Latent Structure

Vlad Niculae
Instituto de Telecomunicações

Work with: André Martins, Claire Cardie, Mathieu Blondel

github.com/vene/sparsemap · @vnfrombucharest
Structured Prediction

Part-of-speech tagging — candidate taggings of "dog on wheels": VERB PREP NOUN · NOUN PREP NOUN · NOUN DET NOUN · · · ·

Dependency parsing — candidate trees over "⋆ dog on wheels" · · ·

Word alignment — candidate alignments between "dog on wheels" and "hond op wielen" · · ·
Latent Structure Models

input → latent structure (· · ·) → output ∈ {positive, neutral, negative}
*record scratch*
*freeze frame*

How to select an item from a set?
How to select an item from a set?

[Diagram: input x → scores θ over choices c1, c2, · · ·, cN → selection p → output y]

θ = f1(x; w)    y = f2(p, x; w)

∂y/∂w = ?   or, essentially, ∂p/∂θ = ?
Argmax

[Diagram: scores θ over choices c1, c2, · · ·, cN → one-hot selection p]

∂p/∂θ = 0 (almost everywhere)

[Plot: p1 as a function of θ1 is a step function — 0 below θ2, jumping to 1 at θ1 = θ2; axis ticks at θ2 − 1, θ2, θ2 + 1]
Argmax vs. Softmax

pj = exp(θj)/Z

∂p/∂θ = diag(p) − pp⊤

[Plot: for softmax, p1 varies smoothly (sigmoid-shaped) with θ1, in contrast to the argmax step]
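The softmax Jacobian formula above can be checked numerically; a small NumPy sketch (the θ values here are illustrative, not from the talk):

```python
import numpy as np

def softmax(theta):
    # Shift by the max for numerical stability; Z is the normalizer.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def softmax_jacobian(theta):
    # The formula from the slide: dp/dtheta = diag(p) - p p^T
    p = softmax(theta)
    return np.diag(p) - np.outer(p, p)

theta = np.array([0.2, 1.4])
J = softmax_jacobian(theta)

# Finite-difference check, one column (one theta_j) at a time.
eps = 1e-6
J_fd = np.column_stack([
    (softmax(theta + eps * e) - softmax(theta - eps * e)) / (2 * eps)
    for e in np.eye(len(theta))
])
print(np.allclose(J, J_fd, atol=1e-6))  # True
```

Unlike the argmax case, this Jacobian is nonzero everywhere, which is what makes softmax usable as a differentiable selection layer.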
Variational Form of Argmax

△ := {p ∈ R^N : p ≥ 0, 1⊤p = 1}

[Figure: the simplex △ for N = 2 is the segment with vertices p = [1, 0] and p = [0, 1] and midpoint p = [1/2, 1/2]; for N = 3 it is the triangle with vertices p = [1, 0, 0], p = [0, 1, 0], p = [0, 0, 1] and center p = [1/3, 1/3, 1/3]]
Variational Form of Argmax

max_j θ_j = max_{p ∈ △} p⊤θ    — Fundamental Thm. of Linear Programming (Dantzig et al., 1955)

[Figure: for N = 2, θ = [.2, 1.4] gives p⋆ = [0, 1]; for N = 3, θ = [.7, .1, 1.5] gives p⋆ = [0, 0, 1] — the maximum is attained at a vertex of the simplex]
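The variational identity is easy to check numerically with the slide's N = 3 example (a quick sketch, sampling the simplex with a Dirichlet distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.1, 1.5])   # the N = 3 example from the slide

# p^T theta over random points of the simplex never exceeds max_j theta_j ...
P = rng.dirichlet(np.ones(3), size=10_000)
print((P @ theta).max() <= theta.max())          # True

# ... and the maximum is attained at the vertex e_{argmax theta}.
p_star = np.eye(3)[theta.argmax()]
print(p_star)                                    # [0. 0. 1.]
print(np.isclose(p_star @ theta, theta.max()))   # True
```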
Smoothed Max Operators

πΩ(θ) = argmax_{p ∈ △} p⊤θ − Ω(p)

argmax:     Ω(p) = 0
softmax:    Ω(p) = ∑_j p_j log p_j
sparsemax:  Ω(p) = 1/2 ∥p∥²₂  (Martins and Astudillo, 2016)
fusedmax:   Ω(p) = 1/2 ∥p∥²₂ + ∑_j |p_j − p_{j−1}|  (Niculae and Blondel, 2017)
csparsemax: Ω(p) = 1/2 ∥p∥²₂ + ι(a ≤ p ≤ b)  (Niculae and Blondel, 2017)

[Plot: p1 as a function of θ1 over [−1, 1] — a step for argmax, smooth for softmax, piecewise linear for sparsemax; example outputs [0, 0, 1], [.3, .2, .5], [.3, 0, .7]]
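As a sanity check that softmax solves the entropy-regularized problem πΩ with Ω(p) = ∑_j p_j log p_j, one can compare it against random points of the simplex (a sketch with illustrative θ, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.7, 0.1, 1.5])

def objective(p):
    # p^T theta - Omega(p), with the negative-entropy regularizer
    return p @ theta - np.sum(p * np.log(p))

p_soft = np.exp(theta) / np.exp(theta).sum()

# No random simplex point should do better than softmax(theta).
P = rng.dirichlet(np.ones(3), size=10_000)
best = max(objective(p) for p in P)
print(best <= objective(p_soft) + 1e-12)  # True
```

Swapping in a different Ω changes the operator: Ω = 0 recovers the (vertex-valued) argmax, while the quadratic regularizer gives sparsemax, discussed below.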
[Attention-map examples, one slide each:]

softmax

sparsemax

fusedmax ?!
Sparsemax

sparsemax(θ) = argmax_{p ∈ △} p⊤θ − 1/2 ∥p∥²₂ = argmin_{p ∈ △} ∥p − θ∥²₂

Computation:
p⋆ = [θ − τ1]₊;  θ_i > θ_j ⇒ p_i ≥ p_j;  O(d) via partial sort
(Held et al., 1974; Brucker, 1984; Condat, 2016)

Backward pass:
J_sparsemax = diag(s) − (1/|S|) s s⊤, where S = {j : p⋆_j > 0}, s_j = ⟦j ∈ S⟧
(Martins and Astudillo, 2016); argmin differentiation (Gould et al., 2016; Amos and Kolter, 2017)
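The computation on the slide can be sketched in NumPy (O(d log d) with a full sort, for simplicity; the cited O(d) algorithms use partial sorting or median-finding):

```python
import numpy as np

def sparsemax(theta):
    """argmin_{p in simplex} ||p - theta||^2, i.e. p* = [theta - tau*1]_+."""
    z = np.sort(theta)[::-1]          # scores in decreasing order
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    rho = int((k * z > cssv).sum())   # size of the support S
    tau = cssv[rho - 1] / rho         # the threshold tau from the slide
    return np.maximum(theta - tau, 0.0)

def sparsemax_jvp(theta, v):
    # Backward pass: J v = s * (v - mean of v over the support S)
    s = (sparsemax(theta) > 0).astype(float)
    return s * (v - (s @ v) / s.sum())

theta = np.array([0.7, 0.1, 1.5])
p = sparsemax(theta)
print(p)         # sparse: [0.1, 0, 0.9] up to float error
print(p.sum())   # sums to 1 (up to float error)
```

Note how, unlike softmax, the output puts exactly zero mass on the low-scoring choice, while remaining differentiable almost everywhere.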
Structured Prediction, finally
Structured Prediction is essentially a (very high-dimensional) argmax

[Diagram: input x → scores θ over structures c1, c2, · · ·, cN → selection p → output y]

There are exponentially many structures (θ cannot fit in memory!)
Factorization Into Parts

θ = A⊤η

Dependency parsing ("⋆ dog on wheels"): rows of A index arcs (⋆→dog, on→dog, wheels→dog, ⋆→on, dog→on, wheels→on, ⋆→wheels, dog→wheels, on→wheels), columns index candidate trees; A_{jy} = 1 iff arc j appears in tree y. η holds one score per arc, e.g. η = [.1, .2, −.1, .3, .8, .1, −.3, .2, −.1].

Word alignment ("dog on wheels" ↔ "hond op wielen"): rows of A index aligned pairs (dog—hond, dog—op, dog—wielen, on—hond, on—op, on—wielen, wheels—hond, wheels—op, wheels—wielen), columns index candidate alignments; η holds one score per pair.
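The factorization can be illustrated with a toy instance (a hypothetical A and η, much smaller than the 9-row matrices on the slide):

```python
import numpy as np

# Columns of A are indicator vectors a_y of the parts used by each structure.
A = np.array([[1, 0],
              [0, 1],
              [1, 1]], dtype=float)   # 3 parts, 2 candidate structures
eta = np.array([0.1, 0.8, -0.3])     # one score per part

theta = A.T @ eta                     # one score per structure: theta_y = a_y @ eta
print(theta)                          # structure scores: [0.1 - 0.3, 0.8 - 0.3]
```

The point of the factorization is that η has only as many entries as there are parts (quadratically many arcs, say), even though θ has one entry per structure (exponentially many).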
argmax argmaxp∈
p⊤θ
so max argmaxp∈
p⊤θ +H(p)
sparsemax argmaxp∈
p⊤θ − 1/2∥p∥2
MAP argmaxμ∈M
μ⊤η
marginals argmaxμ∈M
μ⊤η + eH(μ)SparseMAP argmax
μ∈Mμ⊤η − 1/2∥μ∥2
M
e.g. dependency parsing→ max. spanning treematching→ the Hungarian algorithm
e.g. sequence labelling→ forward-backward(Rabiner, 1989)
As a en on: (Kim et al., 2017)
e.g. dependency parsing→ the Matrix-Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Sa a, 2007)
As a en on: (Liu and Lapata, 2018)
e.g. matchings→ #P-complete!(Taskar, 2004; Valiant, 1979)
M := convay : y ∈ Y
=Ap : p ∈
=EY∼p aY : p ∈
![Page 61: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/61.jpg)
argmax argmaxp∈
p⊤θ
so max argmaxp∈
p⊤θ +H(p)
sparsemax argmaxp∈
p⊤θ − 1/2∥p∥2
MAP argmaxμ∈M
μ⊤η
marginals argmaxμ∈M
μ⊤η + eH(μ)SparseMAP argmax
μ∈Mμ⊤η − 1/2∥μ∥2
M
e.g. dependency parsing→ max. spanning treematching→ the Hungarian algorithm
e.g. sequence labelling→ forward-backward(Rabiner, 1989)
As a en on: (Kim et al., 2017)
e.g. dependency parsing→ the Matrix-Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Sa a, 2007)
As a en on: (Liu and Lapata, 2018)
e.g. matchings→ #P-complete!(Taskar, 2004; Valiant, 1979)
M := convay : y ∈ Y
=Ap : p ∈
=EY∼p aY : p ∈
![Page 62: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/62.jpg)
argmax argmaxp∈
p⊤θ
so max argmaxp∈
p⊤θ +H(p)
sparsemax argmaxp∈
p⊤θ − 1/2∥p∥2
MAP argmaxμ∈M
μ⊤η
marginals argmaxμ∈M
μ⊤η + eH(μ)SparseMAP argmax
μ∈Mμ⊤η − 1/2∥μ∥2
M
e.g. dependency parsing→ max. spanning treematching→ the Hungarian algorithm
e.g. sequence labelling→ forward-backward(Rabiner, 1989)
As a en on: (Kim et al., 2017)
e.g. dependency parsing→ the Matrix-Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Sa a, 2007)
As a en on: (Liu and Lapata, 2018)
e.g. matchings→ #P-complete!(Taskar, 2004; Valiant, 1979)
M := convay : y ∈ Y
=Ap : p ∈
=EY∼p aY : p ∈
![Page 63: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/63.jpg)
argmax argmaxp∈
p⊤θ
so max argmaxp∈
p⊤θ +H(p)
sparsemax argmaxp∈
p⊤θ − 1/2∥p∥2
MAP argmaxμ∈M
μ⊤η
marginals argmaxμ∈M
μ⊤η + eH(μ)SparseMAP argmax
μ∈Mμ⊤η − 1/2∥μ∥2
M
e.g. dependency parsing→ max. spanning treematching→ the Hungarian algorithm
e.g. sequence labelling→ forward-backward(Rabiner, 1989)
As a en on: (Kim et al., 2017)
e.g. dependency parsing→ the Matrix-Tree theorem(Koo et al., 2007; D. A. Smith and N. A. Smith, 2007; McDonald and Sa a, 2007)
As a en on: (Liu and Lapata, 2018)
e.g. matchings→ #P-complete!(Taskar, 2004; Valiant, 1979)
M := convay : y ∈ Y
=Ap : p ∈
=EY∼p aY : p ∈
(Niculae, Martins, Blondel, and Cardie, 2018)
SparseMAP Solution

μ⋆ = argmax_{μ ∈ M} μ⊤η − ½∥μ∥²
   = .6 · (tree) + .4 · (tree)   [figure: a convex combination of two parse trees]
   = Ap⋆ with very sparse p⋆ ∈ △
Algorithms for SparseMAP

μ⋆ = argmax_{μ ∈ M} μ⊤η − ½∥μ∥²

a quadratic objective with linear constraints (alas, exponentially many!)

Conditional Gradient (Frank and Wolfe, 1956; Lacoste-Julien and Jaggi, 2015)
• select a new corner of M:  a_{y⋆} = argmax_{μ ∈ M} μ⊤(η − μ^{(t−1)}) = argmax_{μ ∈ M} μ⊤η̃
• update the (sparse) coefficients of p
• Update rules: vanilla, away-step, pairwise
• Quadratic objective: Active Set (Nocedal and Wright, 1999, Ch. 16.4 & 16.5; Wolfe, 1976; Vinyes and Obozinski, 2017)

Active Set achieves finite & linear convergence!
Completely modular: just add MAP

Backward pass
∂μ/∂η is sparse; computing (∂μ/∂η)⊤ dy takes O(dim(μ) · nnz(p⋆))
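The "just add MAP" modularity can be sketched with the vanilla conditional-gradient rule on a toy problem where the structures are enumerable as columns of A, so the MAP oracle is a brute-force argmax. This is a sketch of the vanilla update only, not the active-set solver used in the talk; note that with A = I the corners are the unit vectors and SparseMAP reduces to sparsemax.

```python
import numpy as np

def sparsemap_cg(A, eta, n_iter=2000):
    """Vanilla conditional gradient for argmax_{mu in M} mu.eta - 1/2 ||mu||^2,
    where M = conv{columns of A}. The only structure-specific piece is the
    MAP call, here a brute-force argmax over columns."""
    p = np.zeros(A.shape[1])
    p[-1] = 1.0                              # start at an arbitrary corner
    mu = A @ p
    for t in range(n_iter):
        eta_tilde = eta - mu                 # gradient of the objective
        y_star = np.argmax(A.T @ eta_tilde)  # MAP oracle with scores eta_tilde
        gamma = 2.0 / (t + 2.0)              # standard step size
        mu = (1 - gamma) * mu + gamma * A[:, y_star]
        p *= 1 - gamma                       # keep sparse coefficients in sync
        p[y_star] += gamma
    return mu, p

# With A = I, the solution is the sparsemax projection of eta:
mu, p = sparsemap_cg(np.eye(3), np.array([0.6, 0.55, 0.0]))
```

Here mu converges to the projection (0.525, 0.475, 0), and p stays supported on only the two corners ever returned by the MAP oracle, illustrating the sparse p⋆ from the previous slide.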
Structured Attention & Graphical Models
Structured Attention for Alignments

NLI premise: A gentleman overlooking a neighborhood situation.
hypothesis: A police officer watches a situation closely.

[Figure: input (P, H) → attention between premise words (A, gentleman, overlooking, …, situation) and hypothesis words (A, police, officer, …, closely) → output: entails / contradicts / neutral]

(Model: ESIM (Chen et al., 2017))
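The soft-alignment step in ESIM-style models can be sketched as independent row- and column-wise softmaxes over a matrix of word-pair scores; this is the step the talk proposes to replace with a global matching. Names and dimensions below are illustrative, not the talk's code.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
P = rng.normal(size=(6, 8))   # premise word vectors (6 words, dim 8)
H = rng.normal(size=(7, 8))   # hypothesis word vectors (7 words)

E = P @ H.T                   # e_ij = <p_i, h_j>: alignment scores

# each premise word attends independently over the hypothesis, and vice versa;
# a global matching would instead constrain the alignment jointly
P_attended = softmax(E, axis=1) @ H
H_attended = softmax(E.T, axis=1) @ P
```

Because each row is normalized independently, nothing prevents several premise words from aligning to the same hypothesis word, which is the kind of behaviour a matching constraint rules out.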
(Proposed model: global matching)
[Figure: 3-class accuracy for softmax vs. matching vs. sequence attention. SNLI: bars between 85.5% and 87%. MultiNLI: bars between 75% and 76.5%.]
[Figure: alignment weights between the premise "a gentleman overlooking a neighborhood situation ." and the hypothesis "a police officer watches a situation closely .", shown as matrices over the two sentences.]
Dynamically inferring the computation graph
Dependency TreeLSTM

The bears eat the pretty ones

(Tai et al., 2015); closely related to GCNs, e.g. (Kipf and Welling, 2017; Marcheggiani and Titov, 2017)
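The recursion the slide animates can be sketched as the Child-Sum TreeLSTM cell of Tai et al. (2015): each head word combines its own vector with the summed states of its dependents, with one forget gate per child. A minimal numpy sketch; the hidden size, initialization, and two-node tree are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one (W, U, b) triple per gate: input, forget, output, update
params = {g: (rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d)), np.zeros(d))
          for g in "ifou"}

def child_sum_cell(x, child_h, child_c):
    """Child-Sum TreeLSTM cell: combine a node's word vector x with its
    children's states; each child gets its own forget gate, so individual
    subtrees can be kept or dropped."""
    h_tilde = sum(child_h, np.zeros(d))          # summed children hiddens
    gate = lambda g, h: params[g][0] @ x + params[g][1] @ h + params[g][2]
    i = sigmoid(gate("i", h_tilde))
    o = sigmoid(gate("o", h_tilde))
    u = np.tanh(gate("u", h_tilde))
    c = i * u + sum(sigmoid(gate("f", hk)) * ck  # per-child forget gates
                    for hk, ck in zip(child_h, child_c))
    return o * np.tanh(c), c

# bottom-up over a tiny fragment: a modifier composed into its head
h_mod, c_mod = child_sum_cell(rng.normal(size=d), [], [])
h_head, c_head = child_sum_cell(rng.normal(size=d), [h_mod], [c_mod])
```

Running the cell bottom-up over a dependency tree gives the sentence representation at the root; changing the tree changes which states get summed, which is why the tree acts as a computation graph.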
Latent Dependency TreeLSTM

input x → latent tree h ∈ H over "The bears eat the pretty ones" → output y

p(y | x) = ∑_{h ∈ H} p(y | h, x) p(h | x)

(Niculae, Martins, and Cardie, 2018)
Structured Latent Variable Models

p(y | x) = ∑_{h ∈ H} p_φ(y | h, x) p_π(h | x)

p_φ: e.g., a TreeLSTM defined by h
p_π: a parsing model, using some score_π(h; x)

How to define p_π?
idea 1: p_π(h | x) = 1 if h = h⋆ else 0   (argmax)
idea 2: p_π(h | x) ∝ exp(score_π(h; x))   (softmax)
idea 3: SparseMAP

But the sum over all possible trees in p(y | x) — and hence in ∂p(y | x)/∂π — is exponentially large!
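The three choices of p_π can be compared directly on a toy model where H is small enough to enumerate (a sketch with made-up numbers; in this enumerated special case, with an identity feature map, SparseMAP coincides with sparsemax over the structure scores):

```python
import numpy as np

def argmax_w(s):
    """idea 1: all mass on the single highest-scoring structure."""
    p = np.zeros_like(s)
    p[np.argmax(s)] = 1.0
    return p

def softmax_w(s):
    """idea 2: dense weights over every structure in H."""
    e = np.exp(s - s.max())
    return e / e.sum()

def sparsemax_w(s):
    """idea 3 (enumerated toy case): nonzero weight on only a few structures."""
    z = np.sort(s)[::-1]
    k = np.arange(1, len(s) + 1)
    cssv = np.cumsum(z) - 1.0
    support = z * k > cssv
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(s - tau, 0.0)

scores = np.array([2.0, 1.9, -1.0])  # score_pi(h; x) for each h in H
py_h = np.array([[0.7, 0.3],         # p_phi(y | h, x): rows = h, cols = y
                 [0.6, 0.4],
                 [0.1, 0.9]])

p_y_argmax = py_h.T @ argmax_w(scores)     # ignores all but one tree
p_y_softmax = py_h.T @ softmax_w(scores)   # the sum touches every h in H
p_y_sparse = py_h.T @ sparsemax_w(scores)  # the sum touches only the support
```

With sparse weights, the exponentially large sum over H collapses to the (here, two) structures with nonzero weight, which is what makes the SparseMAP parameterization tractable.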
![Page 106: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/106.jpg)
Structured Latent Variable Models
p(y | x) = ∑h∈H
pφ(y | h, x) pπ(h | x)
How to define pπ? ∑h∈H
∂p(y | x)∂π
idea 1 pπ(h | x) = 1 if h = h⋆ else 0 argmaxidea 2 pπ(h | x) ∝ exp
scoreπ(h; x)
so maxidea 3 SparseMAP
e.g., a TreeLSTM defined by hsum overall possible trees
parsing model,using some scoreπ(h; x)
Exponen ally large sum!
![Page 107: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/107.jpg)
Structured Latent Variable Models
p(y | x) = ∑h∈H
pφ(y | h, x) pπ(h | x)
How to define pπ? ∑h∈H
∂p(y | x)∂π
idea 1 pπ(h | x) = 1 if h = h⋆ else 0 argmaxidea 2 pπ(h | x) ∝ exp
scoreπ(h; x)
so maxidea 3 SparseMAP
e.g., a TreeLSTM defined by h
sum overall possible trees
parsing model,using some scoreπ(h; x)
Exponen ally large sum!
![Page 108: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/108.jpg)
Structured Latent Variable Models
p(y | x) = ∑h∈H
pφ(y | h, x) pπ(h | x)
How to define pπ? ∑h∈H
∂p(y | x)∂π
idea 1 pπ(h | x) = 1 if h = h⋆ else 0 argmaxidea 2 pπ(h | x) ∝ exp
scoreπ(h; x)
so maxidea 3 SparseMAP
e.g., a TreeLSTM defined by h
sum overall possible trees
parsing model,using some scoreπ(h; x)
Exponen ally large sum!
![Page 109: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/109.jpg)
Structured Latent Variable Models
p(y | x) = ∑h∈H
pφ(y | h, x) pπ(h | x)
How to define pπ? ∑h∈H
∂p(y | x)∂π
idea 1 pπ(h | x) = 1 if h = h⋆ else 0 argmaxidea 2 pπ(h | x) ∝ exp
scoreπ(h; x)
so maxidea 3 SparseMAP
e.g., a TreeLSTM defined by hsum overall possible trees
parsing model,using some scoreπ(h; x)
Exponen ally large sum!
![Page 120: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/120.jpg)
SparseMAP

[Figure: the SparseMAP posterior over parse trees of a three-word sentence is sparse:]

p_π( · | x) = .7 · (tree 1) + .3 · (tree 2) + 0 · (tree 3) + …

p(y | x) = .7 p_φ(y | tree 1) + .3 p_φ(y | tree 2)
![Page 123: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/123.jpg)
Sentiment classification (SST)
[Bar chart: accuracy (binary), 80%–85%, for LTR, Flat, CoreNLP, Latent]

Natural Language Inference (SNLI)
[Bar chart: accuracy (3-class), 80.6%–82%, for LTR, Flat, CoreNLP, Latent]

Reverse dictionary lookup
[Bar charts: accuracy@10, 30%–38%, for LTR, Flat, Latent, on definitions and on concepts]

Baselines, illustrated on “⋆ The bears eat the pretty ones”:
Left-to-right (LTR): regular LSTM
Flat: bag-of-words–like
CoreNLP: off-line parser

Sentence pair classification (P, H):
p(y | P, H) = ∑_{h_P ∈ H(P)} ∑_{h_H ∈ H(H)} p_φ(y | h_P, h_H) p_π(h_P | P) p_π(h_H | H)

Reverse dictionary lookup: given a word’s description, predict the word embedding (Hill et al., 2016); instead of p(y | x), we model E_{p_π} g(x) = ∑_{h∈H} g(x; h) p_π(h | x).
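The factorized double sum for sentence pairs can be sketched the same way: with sparse p_π for premise and hypothesis, only the product of the two supports is enumerated. All names and toy values below are hypothetical.

```python
# Hypothetical sparse latent distributions for premise P and hypothesis H;
# structures outside these dicts have exactly zero probability.
p_pi_P = {"parse_p1": 0.6, "parse_p2": 0.4}
p_pi_H = {"parse_h1": 1.0}  # fully peaked

def p_phi(y, hP, hH):
    # toy stand-in for a classifier over the two TreeLSTM encodings
    return 0.5  # uniform over 2 classes, for illustration

def p_y(y):
    # the double sum runs only over the product of the two sparse supports:
    # 2 x 1 = 2 terms, instead of |H(P)| x |H(H)| terms
    return sum(wp * wh * p_phi(y, hP, hH)
               for hP, wp in p_pi_P.items()
               for hH, wh in p_pi_H.items())

print(p_y(0))  # (0.6 + 0.4) * 1.0 * 0.5 = 0.5
```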
![Page 133: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/133.jpg)
Syntax vs. Composition Order

Latent trees for “⋆ lovely and poignant .”:
p = 22.6%  ⋆ lovely and poignant .
CoreNLP parse, p = 21.4%  ⋆ lovely and poignant .
· · ·

Latent trees for “⋆ a deep and meaningful film .”:
p = 15.33%  ⋆ a deep and meaningful film .
p = 15.27%  ⋆ a deep and meaningful film .
· · ·
CoreNLP parse, p = 0%  ⋆ a deep and meaningful film .

[Each example is drawn as a dependency tree with edge weights of 1.0; the trees the model prefers need not match the CoreNLP syntactic parse.]
![Page 136: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/136.jpg)
Conclusions

Differentiable & sparse structured inference
Generic, extensible algorithms
Interpretable structured attention
Dynamically-inferred computation graphs

[email protected] · github.com/vene/sparsemap · https://vene.ro · @vnfrombucharest

[Thumbnails of earlier examples: SNLI attention alignment between “a police officer watches a situation closely .” and “a gentleman overlooking a neighborhood situation .”, and latent parses of “⋆ lovely and poignant .” with p = 22.6% and p = 21.4%.]
![Page 137: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/137.jpg)
Extra slides
![Page 138: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/138.jpg)
Acknowledgements
This work was supported by the European Research Council (ERC StG DeepSPIN 758969) and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2013.
Some icons by Dave Gandy and Freepik via flaticon.com.
![Page 139: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/139.jpg)
Danskin’s Theorem
(Danskin, 1966; Prop. B.25 in Bertsekas, 1999)

Let φ : ℝ^d × Z → ℝ, with Z ⊂ ℝ^d compact. Then

∂ max_{z∈Z} φ(x, z) = conv{∇_x φ(x, z⋆) | z⋆ ∈ argmax_{z∈Z} φ(x, z)}.

Example: maximum of a vector. With φ(p, θ) = p⊤θ,

∂ max_{j∈[d]} θ_j = ∂ max_{p∈△} p⊤θ = ∂ max_{p∈△} φ(p, θ) = conv{∇_θ φ(p⋆, θ)} = conv{p⋆}.

[Plots for θ = [t, 0], t ∈ [−1, +1]: the value max_j θ_j, and the subgradient set {g_1 | g ∈ ∂ max_j θ_j}, which jumps from 0 to 1 at the tie t = 0.]
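The vector-maximum example can be checked numerically. A small sketch, assuming the convention of returning the centroid of the tied indicators at a tie (any convex combination is an equally valid subgradient):

```python
import numpy as np

def max_subgradient(theta):
    # Danskin: the subdifferential of max_j theta_j is the convex hull of
    # the one-hot indicators of the maximizing coordinates. We return one
    # point of that set: the centroid of the tied indicators.
    winners = np.isclose(theta, theta.max())
    return winners / winners.sum()

# unique maximizer (t = 0.5 in the plot): the gradient is one-hot
print(max_subgradient(np.array([0.5, 0.0])))  # [1. 0.]

# tie (t = 0 in the plot): any convex combination works; here the centroid
print(max_subgradient(np.array([0.0, 0.0])))  # [0.5 0.5]
```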
![Page 142: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/142.jpg)
Fusedmax

fusedmax(θ) = argmax_{p∈△} p⊤θ − 1/2‖p‖₂² − ∑_{2≤j≤d} |p_j − p_{j−1}|
            = argmin_{p∈△} ‖p − θ‖₂² + ∑_{2≤j≤d} |p_j − p_{j−1}|

prox_fused(θ) = argmin_{p∈ℝ^d} ‖p − θ‖₂² + ∑_{2≤j≤d} |p_j − p_{j−1}|

Proposition: fusedmax(θ) = sparsemax(prox_fused(θ))   (Niculae and Blondel, 2017)

The penalty ∑ |p_j − p_{j−1}| is the “Fused Lasso”, a.k.a. 1-d Total Variation (Tibshirani et al., 2005).
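Of the two building blocks in the proposition, sparsemax — Euclidean projection onto the probability simplex — has a simple closed form (Martins and Astudillo, 2016). A minimal NumPy sketch; prox_fused additionally requires a fused-lasso / 1-d TV solver, not shown here:

```python
import numpy as np

def sparsemax(theta):
    """Euclidean projection of theta onto the probability simplex
    (sparsemax; Martins and Astudillo, 2016)."""
    z = np.sort(theta)[::-1]                  # scores sorted descending
    cssv = np.cumsum(z) - 1.0                 # cumulative sums, shifted
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0                # coordinates kept nonzero
    tau = cssv[support][-1] / k[support][-1]  # soft threshold
    return np.maximum(theta - tau, 0.0)

p = sparsemax(np.array([1.0, 0.9, -1.0]))
print(p)  # [0.55 0.45 0.  ] -- sparse, sums to 1
```

Unlike softmax, coordinates far below the maximum receive exactly zero probability.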
![Page 144: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/144.jpg)
Example: Source Sentence with Three Words

Attention distributions over the three source words at three decoding steps, with the cumulative attention per word (“fertilities”) bounded by 1:

softmax:     (0.52, 0.35, 0.13)   (0.36, 0.44, 0.2)   (0.18, 0.27, 0.55)
sparsemax:   (0.7, 0.3, 0)        (0.4, 0.6, 0)       (0, 0.15, 0.85)
csoftmax:    (0.52, 0.35, 0.13)   (0.36, 0.44, 0.2)   (0.12, 0.21, 0.67)
csparsemax:  (0.7, 0.3, 0)        (0.3, 0.7, 0)       (0, 0, 1)
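One way to sketch the constrained projection behind csparsemax: project onto {p : 0 ≤ p ≤ u, ∑p = 1} by bisecting on the threshold τ. This is a simple illustrative alternative, not the algorithm of Malaviya et al. (2018):

```python
import numpy as np

def csparsemax(theta, u, iters=100):
    """Sketch of constrained sparsemax: Euclidean projection onto
    {p : 0 <= p <= u, sum(p) = 1}, found by bisection on the threshold tau."""
    assert u.sum() >= 1.0, "fertility bounds must admit a feasible point"

    def mass(tau):
        # total probability assigned at threshold tau; nonincreasing in tau
        return np.clip(theta - tau, 0.0, u).sum()

    lo, hi = theta.min() - 1.0, theta.max()  # mass(lo) >= 1 >= mass(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return np.clip(theta - 0.5 * (lo + hi), 0.0, u)

# capping the first word's attention at 0.6 redistributes mass to the rest
p = csparsemax(np.array([2.0, 0.0, 0.0]), np.array([0.6, 1.0, 1.0]))
print(p)  # close to [0.6, 0.2, 0.2]
```

Without the cap, plain sparsemax would put all mass on the first word; the box constraint forces the redistribution seen in the csparsemax rows above.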
![Page 145: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/145.jpg)
e.g., fertility constraints for NMT

[Figure: English–German attention alignment matrices for “this is moore’s law last hundred years .” and “now i am going to choose the government .”, under the two constrained transformations.]

constrained softmax: (Martins and Kreutzer, 2017)    constrained sparsemax: (Malaviya et al., 2018)
![Page 147: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/147.jpg)
Structured Output Prediction

SparseMAP loss:      L_A(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2‖μ‖²) − η⊤μ̄ + 1/2‖μ̄‖²

cost-SparseMAP loss: L_A^ρ(η, μ̄) = max_{μ∈M} (η⊤μ − 1/2‖μ‖² + ρ(μ, μ̄)) − η⊤μ̄ + 1/2‖μ̄‖²

Instance of a structured Fenchel-Young loss, like CRF, SVM, etc. (Blondel, Martins, and Niculae, 2019)
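Specialized to the unstructured case M = △ (an assumption made here for illustration; a structured M needs a SparseMAP solver), the loss above has a closed form via sparsemax, and Danskin's theorem gives its gradient:

```python
import numpy as np

def sparsemax(theta):
    # projection onto the probability simplex (Martins and Astudillo, 2016)
    z = np.sort(theta)[::-1]
    cssv = np.cumsum(z) - 1.0
    k = np.arange(1, len(theta) + 1)
    support = z - cssv / k > 0
    tau = cssv[support][-1] / k[support][-1]
    return np.maximum(theta - tau, 0.0)

def sparsemap_loss(eta, mu_true):
    """Fenchel-Young form of the SparseMAP loss, specialized to M = simplex."""
    mu_hat = sparsemax(eta)  # maximizer of eta.mu - 1/2||mu||^2 over the simplex
    value = (eta @ mu_hat - 0.5 * mu_hat @ mu_hat
             - eta @ mu_true + 0.5 * mu_true @ mu_true)
    grad_eta = mu_hat - mu_true  # by Danskin's theorem
    return value, grad_eta

eta = np.array([1.0, 0.9, -1.0])
mu_true = np.array([0.0, 1.0, 0.0])
loss, g = sparsemap_loss(eta, mu_true)
print(loss, g)
```

The loss is nonnegative and vanishes exactly when the sparsemax prediction matches the target, the defining property of Fenchel-Young losses.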
![Page 154: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/154.jpg)
References I
Amos, Brandon and J. Zico Kolter (2017). “OptNet: Differentiable optimization as a layer in neural networks”. In: Proc. of ICML.
Bertsekas, Dimitri P (1999). Nonlinear Programming. Athena Scientific Belmont.
Blondel, Mathieu, André FT Martins, and Vlad Niculae (2019). “Learning with Fenchel-Young Losses”. In: preprint arXiv:1901.02324.
Brucker, Peter (1984). “An O(n) algorithm for quadratic knapsack problems”. In: Operations Research Letters 3.3, pp. 163–166.
Chen, Qian et al. (2017). “Enhanced LSTM for natural language inference”. In: Proc. of ACL.
Condat, Laurent (2016). “Fast projection onto the simplex and the ℓ1 ball”. In: Mathematical Programming 158.1-2, pp. 575–585.
Danskin, John M (1966). “The theory of max-min, with applications”. In: SIAM Journal on Applied Mathematics 14.4, pp. 641–664.
Dantzig, George B, Alex Orden, and Philip Wolfe (1955). “The generalized simplex method for minimizing a linear form under linear inequality restraints”. In: Pacific Journal of Mathematics 5.2, pp. 183–195.
![Page 155: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/155.jpg)
References II
Frank, Marguerite and Philip Wolfe (1956). “An algorithm for quadratic programming”. In: Nav. Res. Log. 3.1-2, pp. 95–110.
Gould, Stephen et al. (2016). “On differentiating parameterized argmin and argmax problems with application to bi-level optimization”. In: preprint arXiv:1607.05447.
Held, Michael, Philip Wolfe, and Harlan P Crowder (1974). “Validation of subgradient optimization”. In: Mathematical Programming 6.1, pp. 62–88.
Hill, Felix et al. (2016). “Learning to understand phrases by embedding the dictionary”. In: TACL 4.1, pp. 17–30.
Kim, Yoon et al. (2017). “Structured attention networks”. In: Proc. of ICLR.
Kipf, Thomas N. and Max Welling (2017). “Semi-supervised classification with graph convolutional networks”. In: Proc. of ICLR.
Koo, Terry et al. (2007). “Structured prediction models via the matrix-tree theorem”. In: Proc. of EMNLP.
![Page 156: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/156.jpg)
References III
Lacoste-Julien, Simon and Martin Jaggi (2015). “On the global linear convergence of Frank-Wolfe optimization variants”. In: Proc. of NeurIPS.
Liu, Yang and Mirella Lapata (2018). “Learning structured text representations”. In: TACL 6, pp. 63–75.
Malaviya, Chaitanya, Pedro Ferreira, and André F. T. Martins (2018). “Sparse and constrained attention for neural machine translation”. In: Proc. of ACL.
Marcheggiani, Diego and Ivan Titov (2017). “Encoding sentences with graph convolutional networks for semantic role labeling”. In: Proc. of EMNLP.
Martins, André FT and Ramón Fernandez Astudillo (2016). “From softmax to sparsemax: A sparse model of attention and multi-label classification”. In: Proc. of ICML.
Martins, André FT and Julia Kreutzer (2017). “Learning What’s Easy: Fully Differentiable Neural Easy-First Taggers”. In: Proc. of EMNLP, pp. 349–362.
McDonald, Ryan T and Giorgio Satta (2007). “On the complexity of non-projective data-driven dependency parsing”. In: Proc. of IWPT.
![Page 157: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/157.jpg)
References IV
Niculae, Vlad and Mathieu Blondel (2017). “A regularized framework for sparse and structured neural attention”. In: Proc. of NeurIPS.
Niculae, Vlad, André FT Martins, Mathieu Blondel, et al. (2018). “SparseMAP: Differentiable sparse structured inference”. In: Proc. of ICML.
Niculae, Vlad, André FT Martins, and Claire Cardie (2018). “Towards dynamic computation graphs via sparse latent structure”. In: Proc. of EMNLP.
Nocedal, Jorge and Stephen Wright (1999). Numerical Optimization. Springer New York.
Rabiner, Lawrence R. (1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. In: P. IEEE 77.2, pp. 257–286.
Smith, David A and Noah A Smith (2007). “Probabilistic models of nonprojective dependency trees”. In: Proc. of EMNLP.
Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree-structured Long Short-Term Memory networks”. In: Proc. of ACL-IJCNLP.
![Page 158: vene.rovene.ro/talks/19-edinburgh.pdf · Senmentclassificaon (SST) LTR Flat 80% 81% 82% 83% 84% 85% NaturalLanguageInference(SNLI) LTR Flat CoreNLP Latent 80.6% 80.8% 81% 81.2% 81.4%](https://reader033.fdocuments.us/reader033/viewer/2022041810/5e575716dd387923683e8173/html5/thumbnails/158.jpg)
References V
Taskar, Ben (2004). “Learning structured prediction models: A large margin approach”. PhD thesis. Stanford University.
Tibshirani, Robert et al. (2005). “Sparsity and smoothness via the fused lasso”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.1, pp. 91–108.
Valiant, Leslie G (1979). “The complexity of computing the permanent”. In: Theor. Comput. Sci. 8.2, pp. 189–201.
Vinyes, Marina and Guillaume Obozinski (2017). “Fast column generation for atomic norm regularization”. In: Proc. of AISTATS.
Wolfe, Philip (1976). “Finding the nearest point in a polytope”. In: Mathematical Programming 11.1, pp. 128–149.