
Part II: Structured Prediction

Alex Schwing & Raquel Urtasun

University of Toronto

December 12, 2015


Roadmap Towards Learning Deep Structured Models

1 Part I: Deep Models

2 Part II: Structured Models

3 Part III: Deep Structured Models


Part II: Structured Prediction

How to predict in structured spaces?
How to learn in structured spaces?

How to predict in structured spaces?

Binary Classification

Prediction answers a Yes-No question:

Spam filtering (Is this e-mail spam?)
Quality control (Can I sell this product?)
Medical testing (Does this lab analysis look normal?)
Driver assistance systems (Is there an obstacle ahead?)

Example e-mail — Subject: ICCV Tutorial; Text: Hey Raquel, ...

x → y ∈ {Yes, No}

Binary Classification

How to find y ∈ {Yes, No} given input data x?

[Figure: raw inputs are mapped by a feature transform into a feature space with axes x1 and x2, where the two classes become separable.]

Linear Discriminant

[Figure: feature space with axes x1 and x2, split by the hyperplane w₁ᵀx = 0.]

y ∈ {−1, 1}:  y* = sign(w₁ᵀx)

Equivalent, for y ∈ {1, 2}:

y* = argmax_y [w₁, −w₁]ᵀ [x δ(y = 1), x δ(y = 2)]

(the stacked weight vector [w₁, −w₁] is scored against the stacked joint feature vector [x δ(y = 1), x δ(y = 2)])

Non-linear Discriminant

[Figure: feature space with axes x1 and x2 and a non-linear decision boundary.]

y ∈ {1, 2}:  y* = argmax_y wᵀφ(x, y)

More generally:

y* = argmax_y F(y, x, w)

Multiclass Classification

Prediction answers a categorical question:

Object classification (Which object is in the image?)
Document retrieval (What topic is this document about?)
Medical testing (Which disease fits the symptoms?)

x → y ∈ {1, . . . , K}

1 vs all

How to deal with more than two classes?

Use K − 1 classifiers, each solving a two-class problem of separating points in class C_k from points not in that class.
Known as the 1-vs-all (or 1-vs-the-rest) classifier.

Issue: more than one good answer is possible.

1 vs 1 classifier

How to deal with more than two classes?

Introduce K(K − 1)/2 two-way classifiers, one for each possible pair of classes.
Each point is classified by majority vote amongst the discriminant functions.
Known as the 1-vs-1 classifier.

Issue: two-way preferences need not be transitive.

Score Maximization

y ∈ {1, 2, . . . , K}. Prediction/inference:

y* = argmax_y [w₁, . . . , w_K]ᵀ [x δ(y = 1), . . . , x δ(y = K)]

More generally:

y* = argmax_y wᵀφ(x, y)

Even more generally:

y* = argmax_y F(y, x, w)

Try all possible classes and return the highest scoring one.
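A minimal Python sketch (not from the slides) of this rule, assuming a linear score wᵀφ(x, y) with the block-stacked joint feature map shown above:

```python
import numpy as np

def make_phi(num_classes, dim):
    # Joint feature map: phi(x, y) places x in the block belonging to class y.
    def phi(x, y):
        out = np.zeros(num_classes * dim)
        out[y * dim:(y + 1) * dim] = x
        return out
    return phi

def predict(w, phi, x, num_classes):
    # Score maximization: try every class, return the highest scoring one.
    scores = [w @ phi(x, y) for y in range(num_classes)]
    return int(np.argmax(scores))

phi = make_phi(num_classes=3, dim=2)
w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])  # stacked [w_1, w_2, w_3]
print(predict(w, phi, np.array([0.5, 2.0]), num_classes=3))  # 1
```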

Structured Prediction

[Figure: handwritten characters recognized one at a time as "Q V I Z", although the word is "Q U I Z".]

Relationship not explicitly taken into account.

Structured Prediction

Example: disparity map estimation. Why not predict every variable separately:

[Figure: input image, independent per-pixel prediction (noisy), and structured prediction (smooth).]

Structured Prediction

Prediction estimates a complex object:

Image segmentation (estimate a labeling)
Sentence parsing (estimate a parse tree)
Protein folding (estimate a protein structure)
Stereo vision (estimate a disparity map)

[Figure: parse tree for the sentence "I saw the man with the telescope".]

x → y = (y₁, . . . , y_D)

Structured Prediction

"Standard" prediction: the output y ∈ Y is a scalar,

Y = {1, . . . , K} or Y = ℝ.

"Structured" prediction: the output y is a structured object.

Structured Prediction

Formally: y = (y₁, . . . , y_D) with y_i ∈ {1, . . . , K}.

Inference/prediction:

y* = argmax_y F(y₁, . . . , y_D, x, w)

How many possibilities do we have to store and explore? K^D.

That's a problem. What can we do?

Structured Prediction

Separate prediction:

max_y F(y₁, . . . , y_D, x, w) = ∑_{i=1}^D max_{y_i} f_i(y_i, x, w)

Why not predict every variable y_i of y = (y₁, . . . , y_D) separately?

Relationship not explicitly taken into account.
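A tiny sketch of separate prediction, assuming the per-variable scores f_i(y_i) are given as a table:

```python
import numpy as np

# Unary score table f_i(y_i): one row per variable y_i, one column per label.
unary = np.array([[0.2, 1.5, 0.1],
                  [0.9, 0.3, 0.4]])

# Separate prediction: each variable picks its own best label,
# ignoring any relationship between the variables.
y_star = unary.argmax(axis=1)    # [1, 0]
score = unary.max(axis=1).sum()  # 2.4
```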

Structured Prediction

Discriminant function decomposes:

F(y₁, . . . , y_D, x, w) = ∑_r f_r(y_r, x, w)

Restriction: every r ⊆ {1, . . . , D}.

Discrete domain: each restriction is an array, e.g.

f_{1,2}(y_{1,2}) = f_{1,2}(y₁, y₂) = [f_{1,2}(1, 1), f_{1,2}(1, 2), . . .]

Example table of pairwise character scores (remaining rows elided on the slide):

        Q     U     I     Z     V
   Q    0    0.8   0.2   0.1   0.1
   U    ·     ·     ·     ·     ·
   ...

Structured Prediction

Example (the four characters of "Q U I Z" as a chain):

F(y₁, . . . , y₄, x, w) = f₁(y₁, x, w) + f₂(y₂, x, w) + f₃(y₃, x, w) + f₄(y₄, x, w)
                        + f_{1,2}(y₁, y₂, x, w) + f_{2,3}(y₂, y₃, x, w) + f_{3,4}(y₃, y₄, x, w)

How many function values need to be stored if y_i ∈ {1, . . . , 26} for all i?

26^4 vs. 3 · 26^2 (+ 4 · 26)
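A sketch of this chain-structured score with randomly filled tables (an illustration only), including the storage comparison from the slide:

```python
import numpy as np

K, D = 26, 4  # 26 labels per character, 4 characters

rng = np.random.default_rng(0)
unary = rng.normal(size=(D, K))            # f_i(y_i): 4 * 26 values
pairwise = rng.normal(size=(D - 1, K, K))  # f_{i,i+1}: 3 * 26^2 values

def F(y):
    # Chain-structured score: unary terms plus consecutive pairwise terms.
    return (sum(unary[i, y[i]] for i in range(D))
            + sum(pairwise[i, y[i], y[i + 1]] for i in range(D - 1)))

# Storage: full table vs. decomposition.
print(K**D, "vs", (D - 1) * K**2 + D * K)  # 456976 vs 2132
```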

Structured Prediction

Visualization of the decomposition:

F(y₁, . . . , y₄, x, w) = f₁(y₁, x, w) + f₂(y₂, x, w) + f₃(y₃, x, w) + f₄(y₄, x, w)
                        + f_{1,2}(y₁, y₂, x, w) + f_{2,3}(y₂, y₃, x, w) + f_{3,4}(y₃, y₄, x, w)

[Figure: factor nodes y_{1,2}, y_{2,3}, y_{3,4} above variable nodes y₁, y₂, y₃, y₄.]

Edges denote the subset relationship.

Structured Prediction

Special cases.

Predicting every variable separately:

F(y₁, . . . , y_D, x, w) = ∑_{i=1}^D f_i(y_i, x, w)

[Figure: disconnected variable nodes y₁, y₂, . . .]

A Markov random field with only unary terms.

Multi-variate prediction:

F(y₁, . . . , y_D, x, w) = f_{1,...,D}(y_{1,...,D}, x, w)

[Figure: a single factor node y_{1,...,D} covering all variables.]

Structured Prediction

Example: stereo vision.

Markov/conditional random field:

[Figure: a 2×4 grid of variable nodes y₁, . . . , y₈ with pairwise factors y_{1,2}, y_{2,3}, y_{3,4}, y_{5,6}, y_{6,7}, y_{7,8}, y_{1,5}, y_{2,6}, y_{3,7}, y_{4,8} between neighbouring nodes.]

F(y, x, w) = ∑_{i=1}^D f_i(y_i, x, w) + ∑_{i,j} f_{i,j}(y_i, y_j, x, w)

Unary term: image evidence. Pairwise term: smoothness prior.

Structured Prediction

Example: semantic segmentation.

Markov/conditional random field:

[Figure: a grid of variable nodes y₁, . . . , y₈ connected to their neighbours.]

F(y, x, w) = ∑_{i=1}^D f_i(y_i, x, w) + ∑_{i,j} f_{i,j}(y_i, y_j, x, w)

Unary term: image evidence. Pairwise term: smoothness prior.

Structured Prediction

Inference:

y* = argmax_y F(y, x, w)

Probability of a configuration y:

p(y | x, w) = (1 / Z(x, w)) exp F(y, x, w)

Normalization constant / partition function:

Z(x, w) = ∑_y exp F(y, x, w)

Inference as probability maximization:

y* = argmax_y p(y | x, w)
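A brute-force sketch of these quantities for a tiny chain (enumeration is only feasible for very small K and D; the score tables are illustrative assumptions):

```python
import itertools
import numpy as np

K, D = 3, 4
rng = np.random.default_rng(1)
unary = rng.normal(size=(D, K))
pairwise = rng.normal(size=(D - 1, K, K))

def F(y):
    return (sum(unary[i, y[i]] for i in range(D))
            + sum(pairwise[i, y[i], y[i + 1]] for i in range(D - 1)))

# Partition function Z(x, w) by enumerating all K^D configurations.
configs = list(itertools.product(range(K), repeat=D))
Z = sum(np.exp(F(y)) for y in configs)

# p(y | x, w); since Z does not depend on y, maximizing p(y | x, w)
# is the same as maximizing F(y, x, w).
p = {y: np.exp(F(y)) / Z for y in configs}
```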

Structured Prediction

Inference:

y* = argmax_y ∑_r f_r(y_r, x, w)

Some inference algorithms:

Exhaustive search
Dynamic programming
Integer linear program
Linear programming relaxation
Message passing
Graph-cut

Structured Prediction - Exhaustive Search

y* = argmax_y ∑_r f_r(y_r, x, w)

Algorithm:
try all possible configurations y ∈ Y
keep the highest scoring element

Advantage: very simple to implement.
Disadvantage: very slow for reasonably sized problems.
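A minimal sketch of exhaustive search over region scores (the region tables below are illustrative assumptions):

```python
import itertools

def exhaustive_search(regions, K, D):
    # Brute-force maximization of sum_r f_r(y_r).
    # regions: list of (indices, table) pairs; table maps the
    # sub-configuration y_r (a tuple of labels) to its score f_r(y_r).
    best, best_score = None, float("-inf")
    for y in itertools.product(range(K), repeat=D):  # all K^D candidates
        s = sum(table[tuple(y[i] for i in idx)] for idx, table in regions)
        if s > best_score:
            best, best_score = y, s
    return best, best_score

# Tiny chain: two binary variables and one pairwise region.
regions = [((0,), {(0,): 0.1, (1,): 0.7}),
           ((1,), {(0,): 0.4, (1,): 0.2}),
           ((0, 1), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0})]
print(exhaustive_search(regions, K=2, D=2))  # ((1, 1), 1.9)
```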

Structured Prediction - Dynamic Programming

[Figure: factor graph with variables y₁, y₂, the factor y_{1,2}, and the message µ_{1,2→2}(y₂).]

max_y F(y, x, w) = max_{y₁,y₂} f₂(y₂) + f₁(y₁) + f_{1,2}(y₁, y₂)
                 = max_{y₂} f₂(y₂) + max_{y₁} {f₁(y₁) + f_{1,2}(y₁, y₂)}
                 = max_{y₂} f₂(y₂) + µ_{1,2→2}(y₂)

where µ_{1,2→2}(y₂) = max_{y₁} {f₁(y₁) + f_{1,2}(y₁, y₂)}.
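Applying this reorganization along a whole chain gives the classic max-product (Viterbi) recursion; a sketch, assuming unary and pairwise score tables as before:

```python
import numpy as np

def chain_map(unary, pairwise):
    # Max-product dynamic programming on a chain.
    # unary: (D, K) table of f_i(y_i); pairwise: (D-1, K, K) table of
    # f_{i,i+1}(y_i, y_{i+1}). Returns the argmax configuration and score.
    D, K = unary.shape
    mu = unary[0].copy()                        # running message
    back = np.zeros((D - 1, K), dtype=int)      # backpointers
    for i in range(1, D):
        scores = mu[:, None] + pairwise[i - 1]  # (K_prev, K_cur)
        back[i - 1] = scores.argmax(axis=0)
        mu = scores.max(axis=0) + unary[i]
    y = [int(mu.argmax())]
    for i in range(D - 2, -1, -1):              # backtrack the argmax
        y.append(int(back[i, y[-1]]))
    return y[::-1], float(mu.max())
```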

Structured Prediction - Dynamic Programming

We can reorganize terms whenever the graph is a tree.

What to do for general loopy graphs?

Dynamic programming extensions (message passing)
Graph-cut algorithms

Structured Prediction - Integer Linear Program

Example: y* = argmax_y ∑_r f_r(y_r)

[Figure: factor graph with variables y₁, y₂ and factor y_{1,2}.]

Integer linear program (ILP) equivalence, with one indicator variable b_r(y_r) per region and sub-configuration:

max_{b₁,b₂,b₁₂} [b₁(1), b₁(2), b₂(1), b₂(2), b₁₂(1,1), b₁₂(2,1), b₁₂(1,2), b₁₂(2,2)]ᵀ
                [f₁(1), f₁(2), f₂(1), f₂(2), f₁₂(1,1), f₁₂(2,1), f₁₂(1,2), f₁₂(2,2)]

s.t.  b_r(y_r) ∈ {0, 1},  ∑_{y_r} b_r(y_r) = 1,  ∑_{y_p \ y_r} b_p(y_p) = b_r(y_r)

Structured Prediction - Integer Linear Program

y* = argmax_y ∑_r f_r(y_r)

Integer linear program:

max_{b_r} ∑_{r, y_r} b_r(y_r) f_r(y_r)

s.t.  b_r(y_r) ∈ {0, 1}  (integrality)
      b_r(y_r) ≥ 0,  ∑_{y_r} b_r(y_r) = 1  (local probability b_r)
      ∑_{y_p \ y_r} b_p(y_p) = b_r(y_r)  (marginalization)

Advantage: very good solvers available.
Disadvantage: very slow for larger problems.

Structured Prediction - Linear Programming Relaxation

y* = argmax_y ∑_r f_r(y_r)

LP relaxation: drop the integrality constraint b_r(y_r) ∈ {0, 1}, keeping

max_{b_r} ∑_{r, y_r} b_r(y_r) f_r(y_r)  s.t.  b ∈ C,

where C collects the local probability constraints (b_r(y_r) ≥ 0, ∑_{y_r} b_r(y_r) = 1) and the marginalization constraints.

Advantage: very good solvers available.
Disadvantage: slow for larger problems.
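A sketch of this relaxation for the two-variable example using scipy's general-purpose LP solver (a solver choice of ours, not of the tutorial); on a tree the relaxation is tight, so it recovers the integral optimum of 1.9 found by exhaustive search above:

```python
import numpy as np
from scipy.optimize import linprog

# Variable order: [b1(1), b1(2), b2(1), b2(2),
#                  b12(1,1), b12(2,1), b12(1,2), b12(2,2)].
f = np.array([0.1, 0.7, 0.4, 0.2, 1.0, 0.0, 0.0, 1.0])

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # b1 sums to one
    [0, 0, 1, 1, 0, 0, 0, 0],    # b2 sums to one
    [-1, 0, 0, 0, 1, 0, 1, 0],   # sum_{y2} b12(1, y2) = b1(1)
    [0, -1, 0, 0, 0, 1, 0, 1],   # sum_{y2} b12(2, y2) = b1(2)
    [0, 0, -1, 0, 1, 1, 0, 0],   # sum_{y1} b12(y1, 1) = b2(1)
    [0, 0, 0, -1, 0, 0, 1, 1],   # sum_{y1} b12(y1, 2) = b2(2)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

# linprog minimizes, so negate the scores; bounds replace b in {0, 1}.
res = linprog(-f, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print(res.x, -res.fun)  # integral solution with value 1.9
```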

Structured Prediction - Message Passing

Graph structure is defined via the marginalization constraints.

[Figure: factor graph with variables y₁, y₂ and factor y_{1,2}.]

Message passing solvers:

Advantage: efficient due to analytically computable sub-problems.
Problem: special care is required to find the global optimum.

Structured Prediction - Graph-cut Solvers

For y_i ∈ {1, 2}:

Convert the scoring function F into an auxiliary graph.
Compute a weighted cut whose cost corresponds to the labeling score.

What are the nodes and what are the weights on the edges?

Structured Prediction - Graph-cut Solvers

What are the nodes?

[Figure: auxiliary graph with a source s, a terminal t, and the variables y₁, . . . , y₆ as nodes in between.]

Two special nodes called 'source' and 'terminal'.
The variables y_i as nodes.

Structured Prediction - Graph-cut Solvers

What weights do we assign to edges?

Recall that local scoring functions are arrays: [f₁(y₁ = 1), f₁(y₁ = 2)].

Graph-cut solvers compute a min-cut:

[Figure: node y₁ connected to s and t; the two edges carry the negated unary scores −f₁(y₁ = 2) and −f₁(y₁ = 1), so that minimizing the cut maximizes the score.]

Structured Prediction - Graph-cut Solvers

What weights do we assign to edges? Recall that local scoring functions are arrays. A pairwise table decomposes into a constant, two unary parts, and one pairwise part:

[f₁₂(1,1)  f₁₂(1,2)]
[f₁₂(2,1)  f₁₂(2,2)]
  =  f(1,1) − f(2,1) + f(2,2)
   + [0                 0              ]
     [f(2,1) − f(1,1)   f(2,1) − f(1,1)]
   + [f(2,1) − f(2,2)   0]
     [f(2,1) − f(2,2)   0]
   + [0   f(1,2) + f(2,1) − f(1,1) − f(2,2)]
     [0   0                                ]

Graph-cut solvers compute a min-cut:

[Figure: auxiliary graph over s, y₁, y₂, t with edge weights f(1,1) − f(2,1), f(1,1) + f(2,2) − f(1,2) − f(2,1), and f(2,2) − f(2,1).]

Structured Prediction - Graph-cut Solvers

[Figure: the same auxiliary graph with edge weights f(1,1) − f(2,1), f(1,1) + f(2,2) − f(1,2) − f(2,1), and f(2,2) − f(2,1).]

Requirement for optimality: the pairwise edge weights are non-negative,

f(1,1) + f(2,2) − f(1,2) − f(2,1) ≥ 0  (sub-modularity)

For higher-order functions? More complicated graph constructions.
For more than two labels? Move-making algorithms.
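A one-line check of the sub-modularity condition for a 2-label pairwise table (the example tables are our own):

```python
def submodular(f):
    # Sub-modularity for a 2-label pairwise score table:
    # f[a][b] holds f(a+1, b+1) in the slide's 1-indexed notation.
    return f[0][0] + f[1][1] - f[0][1] - f[1][0] >= 0

attractive = [[1.0, 0.0], [0.0, 1.0]]  # rewards agreement: sub-modular
repulsive = [[0.0, 1.0], [1.0, 0.0]]   # rewards disagreement: not sub-modular
print(submodular(attractive), submodular(repulsive))  # True False
```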

Structured Prediction

Inference:

y* = argmax_y ∑_r f_r(y_r, x, w)

Efficiency and accuracy of inference algorithms are problem dependent:

Exhaustive search
Dynamic programming
Integer linear program
Linear programming relaxation
Message passing
Graph-cut

How to learn in structured spaces?


Binary SVM

[Figure: labeled points of two classes in a feature space with axes x1 and x2.]

Given: a dataset D = {(x, y)}, N = |D|, containing independently and identically drawn pairs.

Binary SVM

Intuitively, maximize the margin 2/‖w‖:

[Figure: separating hyperplane wᵀx = 0 with margin boundaries wᵀx = −1 and wᵀx = 1, a distance 2/‖w‖ apart.]

min_w (1/2)‖w‖₂²  s.t.  y wᵀx ≥ 1  ∀(x, y) ∈ D

Issue: what if the data are not linearly separable?

Binary SVM

Introduce slack variables ξ:

min_{w, ξ_x ≥ 0} (1/2)‖w‖₂² + (C/N) ∑_{(x,y)∈D} ξ_x  s.t.  y wᵀx ≥ 1 − ξ_x  ∀(x, y) ∈ D

[Figure: the margin picture as before, with a violating point a distance ξ_x past its margin boundary.]

Binary SVM

min_{w, ξ_x ≥ 0} (1/2)‖w‖₂² + (C/N) ∑_{(x,y)∈D} ξ_x  s.t.  y wᵀx ≥ 1 − ξ_x  ∀(x, y) ∈ D

Equivalent:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)∈D} max{0, 1 − y wᵀx}

Generally:

min_w R(w) + (C/N) ∑_{(x,y)∈D} ℓ(x, y, w)

Hinge loss ℓ: [Figure: plot of max{0, 1 − t} against t.]
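A sub-gradient descent sketch of this hinge-loss objective (the toy data and step sizes are our own choices):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    # Sub-gradient descent on (1/2)||w||^2 + (C/N) sum max{0, 1 - y w^T x}.
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                  # points with non-zero hinge loss
        grad = w - (C / N) * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Toy separable data; labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = svm_subgradient(X, y)
print(np.sign(X @ w))  # should match y
```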

Multiclass SVM

Recap of inference: y ∈ {1, 2, . . . , K},

y* = argmax_y [w₁, . . . , w_K]ᵀ [x δ(y = 1), . . . , x δ(y = K)]

More generally:

y* = argmax_y wᵀφ(x, y)

Even more generally:

y* = argmax_y F(y, x, w)

Multiclass SVM

What do we want? The ground truth y should score higher than any other ŷ:

wᵀφ(x, y) ≥ wᵀφ(x, ŷ)  ∀(x, y) ∈ D, ŷ ∈ {1, . . . , K}

Let's employ margin and slack:

min_{w, ξ_x ≥ 0} (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} ξ_x
s.t.  wᵀ(φ(x, y) − φ(x, ŷ)) ≥ 1 − ξ_x  ∀(x, y), ŷ

Multiclass SVM

min_{w, ξ_x ≥ 0} (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} ξ_x  s.t.  wᵀ(φ(x, y) − φ(x, ŷ)) ≥ 1 − ξ_x  ∀(x, y), ŷ

Properties:
Quadratic cost function
Linear constraints
Convex program

Algorithms:
Primal algorithms (sub-gradient, cutting plane)
Dual algorithms

How does this fit into regularized surrogate loss minimization?

min_w R(w) + (C/N) ∑_{(x,y)} ℓ(x, y, w)

Multiclass SVM

min_{w, ξ_x ≥ 0} (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} ξ_x  s.t.  wᵀ(φ(x, y) − φ(x, ŷ)) ≥ 1 − ξ_x  ∀(x, y), ŷ

Let's get rid of the slack variables:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max{0, max_ŷ (1 − wᵀ(φ(x, y) − φ(x, ŷ)))}

The inner max is ≥ 0 (choose ŷ = y), so the outer max{0, ·} can be dropped:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (1 + wᵀφ(x, ŷ)) − wᵀφ(x, y)

with the first term being loss-augmented inference.

Multiclass SVM

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (1 + wᵀφ(x, ŷ)) − wᵀφ(x, y)   (first term: loss-augmented inference)

This fits into the general formulation:

min_w R(w) + (C/N) ∑_{(x,y)} ℓ(x, y, w)

Multiclass SVM

How to optimize this?

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (1 + wᵀφ(x, ŷ)) − wᵀφ(x, y)

E.g., with gradient descent. Iterate:

1 Loss-augmented inference: ŷ* = argmax_ŷ (1 + wᵀφ(x, ŷ))
2 Perform a gradient step: w ← w − α∇_w L

How complicated is this? Solve one multi-class inference task per sample per iteration.
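A per-example sub-gradient sketch of these two steps, assuming a linear score W[k] @ x (so wᵀφ(x, k) = W[k] @ x) and applying the regularizer gradient at every step for simplicity:

```python
import numpy as np

def multiclass_svm_step(W, x, y, C, N, lr):
    # One iteration on a single example (x, y).
    # W: (K, d) weight matrix; row k scores class k.
    scores = 1.0 + W @ x                 # 1: loss-augmented inference
    y_hat = int(np.argmax(scores))
    grad = W.copy()                      # gradient of (1/2)||W||^2
    grad[y_hat] += (C / N) * x           # from max_yhat (1 + w^T phi(x, yhat))
    grad[y] -= (C / N) * x               # from -w^T phi(x, y)
    return W - lr * grad                 # 2: gradient step
```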

Structured SVM

From the multiclass SVM:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (1 + wᵀφ(x, ŷ)) − wᵀφ(x, y)

via general losses:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (∆(y, ŷ) + wᵀφ(x, ŷ)) − wᵀφ(x, y)

to general functions and structures:

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)} max_ŷ (∆(y, ŷ) + F(ŷ, x, w)) − F(y, x, w)

with the max term being loss-augmented inference.

Structured SVM

How to optimize this?

min_w (1/2)‖w‖₂² + (C/N) ∑_{(x,y)∈D} max_ŷ (∆(y, ŷ) + F(ŷ, x, w)) − F(y, x, w)

E.g., with gradient descent. Iterate:

1 Loss-augmented inference: ŷ* = argmax_ŷ (∆(y, ŷ) + F(ŷ, x, w))
2 Perform a gradient step: w ← w − α∇_w L

How complicated is this? Solve one structured prediction task per sample per iteration.
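When ∆ decomposes like F does (e.g., a Hamming loss over the variables), loss-augmented inference reduces to ordinary inference with modified unaries; a sketch reusing the chain_map routine from the dynamic-programming section:

```python
import numpy as np

def loss_augmented_unaries(unary, y_true):
    # Fold Delta(y, yhat) = sum_i [y_i != yhat_i] (Hamming loss) into the
    # unary table: +1 for every label except the true one at each position,
    # so that chain_map(aug, pairwise) solves argmax_yhat Delta + F.
    # y_true: integer array of true labels, one per variable.
    aug = unary + 1.0
    aug[np.arange(unary.shape[0]), y_true] -= 1.0
    return aug
```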

Structured Learning

Soft-max extension:

ε ln∑

i

expxi

ε

ε→0−−→ maxi

xi

0 5 100

1

2

3

4

i

xi

0 0.5 13.2

3.4

3.6

3.8

4

ε

ε ln∑

i exp xiε

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 54 / 60

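A quick numerical check of this limit, written in the max-shifted form that is also how one evaluates the soft-max stably in practice (the values in `x` are made up):

```python
import numpy as np

def soft_max(x, eps):
    """eps * ln sum_i exp(x_i/eps), computed stably as max(x) + eps * ln sum exp((x - max)/eps)."""
    m = x.max()
    return m + eps * np.log(np.exp((x - m) / eps).sum())

x = np.array([1.0, 2.5, 4.0, 0.5])
for eps in [1.0, 0.5, 0.1, 0.01]:
    print(f"eps = {eps:4.2f}: soft-max = {soft_max(x, eps):.4f}")
print(f"hard max     : {x.max():.4f}")
```

Because $(x_i - \max_i x_i)/\varepsilon \le 0$, the exponentials never overflow, and the correction term visibly shrinks toward zero as $\varepsilon \to 0$.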

Structured Learning

Soft-max extension:

$$\varepsilon \ln \sum_i \exp\frac{x_i}{\varepsilon} \;\xrightarrow{\;\varepsilon \to 0\;}\; \max_i x_i$$

Let's generalize

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \Big[ \max_{\hat y} \big( \Delta(\hat y, y) + F(\hat y, x, w) \big) - F(y, x, w) \Big]$$

even further:

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \Big[ \varepsilon \ln \sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon} - F(y, x, w) \Big]$$

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 55 / 60

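In the toy multiclass setting sketched earlier, the $\varepsilon$-smoothed loss term is a few lines; a hypothetical helper, reusing the stable max-shift:

```python
import numpy as np

def smoothed_structured_loss(W, x, y, eps):
    """eps * ln sum_yhat exp((Delta(yhat,y) + F(yhat,x,W)) / eps) - F(y, x, W)."""
    s = W @ x + (np.arange(W.shape[0]) != y)   # Delta + F, per class (0/1 loss)
    m = s.max()
    softmax_term = m + eps * np.log(np.exp((s - m) / eps).sum())
    return softmax_term - W[y] @ x

# As eps -> 0 this approaches the hinge term from the previous slide.
W, x, y = np.eye(3), np.array([1.0, 0.2, -0.5]), 0
for eps in [1.0, 0.1, 0.01]:
    print(eps, smoothed_structured_loss(W, x, y, eps))
```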

Structured Learning

A pretty general formulation:

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \Big[ \varepsilon \ln \sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon} - F(y, x, w) \Big]$$

Generalizes:

any form of SVM
conditional random fields
maximum likelihood
logistic regression
convolutional neural networks

How does this correspond to a binary SVM? Run the preceding slides in reverse.
How does this correspond to logistic regression? Up next...

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 56 / 60

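Two useful special cases, spelled out (this reading is standard but not written on the slide in this form). Taking $\varepsilon \to 0$ turns the soft-max back into a max and recovers the structured hinge loss,

$$\varepsilon \ln \sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon} \;\xrightarrow{\;\varepsilon \to 0\;}\; \max_{\hat y}\big(\Delta(\hat y, y) + F(\hat y, x, w)\big),$$

i.e., the structured SVM. Taking $\varepsilon = 1$ and $\Delta \equiv 0$ gives

$$\ln \sum_{\hat y} \exp F(\hat y, x, w) - F(y, x, w) = -\ln p(y \mid x, w),$$

the negative log-likelihood of a conditional random field.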

Structured Learning

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \Big[ \varepsilon \ln \sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon} - F(y, x, w) \Big]$$

Equivalent formulation (note $\Delta(y, y) = 0$ for a sensible task loss, so the numerator is just $\exp(F(y,x,w)/\varepsilon)$):

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \varepsilon \ln \frac{\exp\frac{\Delta(y, y) + F(y, x, w)}{\varepsilon}}{\sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon}}$$

Annealed loss-augmented probability:

$$p^\Delta_\varepsilon(\hat y \mid x, w) = \frac{\exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon}}{\sum_{\hat y'} \exp\frac{\Delta(\hat y', y) + F(\hat y', x, w)}{\varepsilon}}$$

Equivalent maximum-likelihood formulation:

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \ln \prod_{(x,y)\in\mathcal{D}} p^\Delta_\varepsilon(y \mid x, w)$$

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 57 / 60

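The annealed distribution is just a temperature-scaled softmax over loss-augmented scores; a sketch in the same toy multiclass setting as before (helper names are mine):

```python
import numpy as np

def p_delta_eps(W, x, y, eps):
    """Annealed loss-augmented probability over all classes yhat, for 0/1 loss Delta."""
    s = (W @ x + (np.arange(W.shape[0]) != y)) / eps   # (Delta + F) / eps
    s -= s.max()                                       # stabilize before exponentiating
    p = np.exp(s)
    return p / p.sum()

W, x, y, eps = np.eye(3), np.array([1.0, 0.2, -0.5]), 0, 1.0
p = p_delta_eps(W, x, y, eps)
# -eps * ln p[y] reproduces the smoothed loss term from the previous slides.
print(p, -eps * np.log(p[y]))
```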

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \varepsilon \ln \frac{\exp\frac{\Delta(y, y) + F(y, x, w)}{\varepsilon}}{\sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon}}$$

Simplifications:

Binary classification: $y \in \{-1, 1\}$
Simple feature function: $F(y, x, w) = y\, w^\top x$
Epsilon: $\varepsilon = 1$ (and no task loss, $\Delta \equiv 0$)

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \ln \prod_{(x,y)} \frac{\exp(y\, w^\top x)}{\exp(w^\top x) + \exp(-w^\top x)}$$

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \ln \prod_{(x,y)} \frac{1}{1 + \exp(-2y\, w^\top x)}$$

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \sum_{(x,y)} \ln \sigma\big(2y\, w^\top x\big), \qquad \text{with } \sigma(z) = \frac{1}{1 + \exp(-z)}$$

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 58 / 60

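The middle step above is one line of algebra: since $y \in \{-1, 1\}$, the denominator $\exp(w^\top x) + \exp(-w^\top x)$ equals $\exp(y\,w^\top x) + \exp(-y\,w^\top x)$, and dividing numerator and denominator by $\exp(y\,w^\top x)$ gives

$$\frac{\exp(y\,w^\top x)}{\exp(y\,w^\top x) + \exp(-y\,w^\top x)} = \frac{1}{1 + \exp(-2y\,w^\top x)} = \sigma\big(2y\,w^\top x\big).$$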

We started with the hinge loss:

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)} \max\{0,\, 1 - y\, w^\top x\}$$

Derived the general structured prediction framework:

$$\min_w \; \frac{1}{2}\|w\|_2^2 + \frac{C}{N} \sum_{(x,y)\in\mathcal{D}} \Big[ \varepsilon \ln \sum_{\hat y} \exp\frac{\Delta(\hat y, y) + F(\hat y, x, w)}{\varepsilon} - F(y, x, w) \Big]$$

And showed how to simplify it to logistic regression:

$$\min_w \; \frac{1}{2}\|w\|_2^2 - \frac{C}{N} \sum_{(x,y)} \ln \sigma\big(2y\, w^\top x\big)$$

All are instances of the general pattern:

$$\min_w \; R(w) + \frac{C}{N} \sum_{(x,y)} \bar\ell(x, y, w)$$

[Figure: hinge loss $\max\{0, 1-t\}$ and log loss $-\ln\sigma(2t)$ plotted as functions of the margin $t$.]

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 59 / 60

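A few sample margins make the figure's comparison concrete (values computed, not taken from the slide):

```python
import numpy as np

def hinge(t):
    return np.maximum(0.0, 1.0 - t)

def log_loss(t):
    # -ln sigma(2t) = ln(1 + exp(-2t)), via log1p for accuracy at large positive t
    return np.log1p(np.exp(-2.0 * t))

for t in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"t = {t:4.1f}: hinge = {hinge(t):.3f}, log loss = {log_loss(t):.3f}")
```

The log loss is a smooth surrogate that stays strictly positive, while the hinge is exactly zero once the margin exceeds 1.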

Structured Learning

How does this framework simplify to neural networks?

Multiclass setting: $y \in \{1, \dots, K\}$
Composite feature function: $F(y, x, w) = f_1\big(y, w_1, f_2(w_2, f_3(\dots))\big)$ (a sketch follows below)
Epsilon: $\varepsilon = 1$
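A minimal sketch of such a composite $F$, assuming a two-layer network with a class-indexed linear readout (the layer shapes and the tanh nonlinearity are my choices, not the slide's):

```python
import numpy as np

def F(y, x, w1, w2):
    """F(y, x, w) = f1(y, w1, f2(w2, x)): a tiny two-layer net whose last
    layer reads out the score of class y."""
    h = np.tanh(w2 @ x)      # f2: hidden representation of the input
    return w1[y] @ h         # f1: linear, class-indexed readout

# With eps = 1 and Delta = 0, the general objective reduces to the usual
# softmax cross-entropy over the K class scores F(1..K, x, w).
rng = np.random.default_rng(0)
w2 = rng.normal(size=(4, 5))     # hidden layer: 5 inputs -> 4 units
w1 = rng.normal(size=(3, 4))     # readout: 4 units -> 3 classes
x = rng.normal(size=5)
scores = np.array([F(y, x, w1, w2) for y in range(3)])
m = scores.max()
print(scores - (m + np.log(np.exp(scores - m).sum())))   # class log-probabilities
```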

Stay tuned for part III

A. G. Schwing (UofT) Deep Structured Models II December 12, 2015 60 / 60