Learning from Big Data Lecture 5
M. Pawan Kumar
http://www.robots.ox.ac.uk/~oval/
Slides available online http://mpawankumar.info
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
• Results
Image Classification
Is this an urban or rural area?
Input: x, Output: y ∈ {-1,+1}
Is this scan healthy or unhealthy?
Input: x, Output: y ∈ {-1,+1}
Image Classification
Probabilistic Graphical Model
[Figure: observed input x connected to unobserved output y, which takes label -1 or +1]
Feature Vector
[Figure: pre-trained CNN; the input x passes through conv1–conv5, fc6 and fc7 to produce the feature Φ(x)]
Joint Feature Vector
Input: x, Output: y ∈ {-1,+1}
Ψ(x,y) stacks Φ(x) into the block selected by the label:
Ψ(x,-1) = [Φ(x); 0]
Ψ(x,+1) = [0; Φ(x)]
Score Function
Input: x, Output: y ∈ {-1,+1}
f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)
Prediction
y* = argmax_y f(Ψ(x,y))
Maximize the score over all possible outputs
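To make the binary construction concrete, here is a minimal sketch in Python (not from the slides), assuming Φ(x) is a precomputed CNN feature vector such as the fc7 activations:

```python
import numpy as np

def joint_feature(phi, y):
    """Psi(x, y): stack Phi(x) into the block selected by y in {-1, +1}."""
    zeros = np.zeros_like(phi)
    return np.concatenate([phi, zeros] if y == -1 else [zeros, phi])

def predict(w, phi):
    """y* = argmax_y w^T Psi(x, y), enumerating both labels."""
    return max((-1, +1), key=lambda y: w @ joint_feature(phi, y))

phi = np.random.randn(4096)        # hypothetical fc7 feature
w = np.random.randn(2 * phi.size)  # one weight block per label
print(predict(w, phi))
```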
Outline
• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning
• Structured Output SVM
• Optimization
• Results
Image Classification
Which city is this?
Input: x, Output: y ∈ {1,2,…,C}
What type of tumor does this scan contain?
Input: x, Output: y ∈ {1,2,…,C}
Image Classification
Graphical Model
[Figure: observed input x connected to unobserved output y, which takes a label in {1, 2, 3, …, C}]
Feature Vector
[Figure: pre-trained CNN; the input x passes through conv1–conv5, fc6 and fc7 to produce the feature Φ(x)]
Joint Feature Vector
Input: x, Output: y ∈ {1,2,…,C}
Ψ(x,y) places Φ(x) in the y-th of C blocks:
Ψ(x,1) = [Φ(x); 0; …; 0]
Ψ(x,2) = [0; Φ(x); …; 0]
⋮
Ψ(x,C) = [0; …; 0; Φ(x)]
Object Detection
Where is the object in the image?
Input: x, Output: y ∈ {Pixels}
Where is the rupture in the scan?
Input: x, Output: y ∈ {Pixels}
Object Detection
Graphical Model
[Figure: observed input x connected to unobserved output y, which takes a label in {1, 2, 3, …, C}]
Joint Feature Vector
[Figure: pre-trained CNN; the input x and candidate output y pass through conv1–conv5, fc6 and fc7 to produce Ψ(x,y)]
Score Function
Input: x, Output: y ∈ {1,2,…,C}
f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)
Prediction
y* = argmax_y f(Ψ(x,y))
Maximize the score over all possible outputs
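The same construction for C labels, as a sketch (labels 0-indexed for convenience; the slides use {1,…,C}):

```python
import numpy as np

def joint_feature(phi, y, C):
    """Psi(x, y): place Phi(x) in the y-th of C blocks, zeros elsewhere."""
    psi = np.zeros(C * phi.size)
    psi[y * phi.size:(y + 1) * phi.size] = phi
    return psi

def predict(w, phi, C):
    """y* = argmax_y w^T Psi(x, y), by brute force over all C labels."""
    return int(np.argmax([w @ joint_feature(phi, y, C) for y in range(C)]))
```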
Outline
• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning
• Structured Output SVM
• Optimization
• Results
Segmentation
What is the semantic class of each pixel?
Input: x, Output: y ∈ {1,2,…,C}^m
[Figure: image with per-pixel labels car, road, grass, tree, sky]
What is the muscle group of each pixel?
Input: x, Output: y ∈ {1,2,…,C}^m
Segmentation
Graphical Model
[Figure: 3×3 grid of unobserved outputs y1,…,y9, each connected to its observed input x1,…,x9]
Feature Vector
[Figure: pre-trained CNN; the pixel input x1 passes through conv1–conv5, fc6 and fc7 to produce the feature Φ(x1)]
Joint Feature Vector
Input: x1, Output: y1 ∈ {1,2,…,C}
Ψu(x1,1) = [Φ(x1); 0; …; 0]
Ψu(x1,2) = [0; Φ(x1); …; 0]
⋮
Ψu(x1,C) = [0; …; 0; Φ(x1)]
Feature Vector
[Figure: pre-trained CNN; the pixel input x2 passes through conv1–conv5, fc6 and fc7 to produce the feature Φ(x2)]
Joint Feature Vector
Input: x2, Output: y2 ∈ {1,2,…,C}
Ψu(x2,1) = [Φ(x2); 0; …; 0]
Ψu(x2,2) = [0; Φ(x2); …; 0]
⋮
Ψu(x2,C) = [0; …; 0; Φ(x2)]
Overall Joint Feature Vector
Input: x, Output: y ∈ {1,2,…,C}^m
Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]
Score Function
Input: x, Output: y ∈ {1,2,…,C}^m
f: Ψu(x,y) → (-∞,+∞), f(Ψu(x,y)) = wᵀΨu(x,y)
Prediction
y* = argmax_y f(Ψu(x,y)) = argmax_y wᵀΨu(x,y) = argmax_y Σ_a (wa)ᵀΨu(xa,ya)
Maximize for each a ∈ {1,2,…,m} independently
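Because the score decomposes over pixels, prediction reduces to an independent argmax per pixel; a minimal sketch with hypothetical unary scores:

```python
import numpy as np

# unary[a, c] = (w_c)^T Psi_u(x_a, c): score of label c at pixel a.
unary = np.random.randn(9, 5)      # hypothetical 9 pixels, 5 classes

y_star = unary.argmax(axis=1)      # independent argmax for each pixel a
print(y_star)
```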
Segmentation
Graphical Model
[Figure: 3×3 grid of unobserved outputs y1,…,y9, each connected to its observed input x1,…,x9]
Unary Joint Feature Vector
Input: x, Output: y ∈ {1,2,…,C}^m
Ψu(x,y) = [Ψu(x1,y1); Ψu(x2,y2); …; Ψu(xm,ym)]
Pairwise Joint Feature Vector
[Figure: grid graphical model with edges between neighbouring outputs]
For each edge, e.g.
Ψp(x12,y12) = δ(y1 = y2)
Ψp(x23,y23) = δ(y2 = y3)
Input: x, Output: y ∈ {1,2,…,C}^m
Ψp(x,y) = [Ψp(x12,y12); Ψp(x23,y23); …]
Overall Joint Feature Vector
Input: x, Output: y ∈ {1,2,…,C}^m
Ψ(x,y) = [Ψu(x,y); Ψp(x,y)]
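A sketch of the overall score with unary and Potts-style pairwise terms; the edge list and the single pairwise weight w_p are hypothetical simplifications:

```python
import numpy as np

def score(unary, edges, w_p, y):
    """w^T Psi(x, y) = sum of unary scores + w_p * delta(y_a = y_b) per edge."""
    s = sum(unary[a, y[a]] for a in range(len(y)))
    s += sum(w_p * float(y[a] == y[b]) for a, b in edges)
    return s

unary = np.random.randn(4, 2)               # 4 pixels, 2 classes
edges = [(0, 1), (1, 2), (2, 3)]            # a chain over the pixels
print(score(unary, edges, w_p=0.5, y=[0, 0, 1, 1]))
```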
Score Function
Input: x, Output: y ∈ {1,2,…,C}^m
f: Ψ(x,y) → (-∞,+∞), f(Ψ(x,y)) = wᵀΨ(x,y)
Prediction
y* = argmax_y f(Ψ(x,y)) = argmax_y wᵀΨ(x,y)
= argmax_y Σ_a (wa)ᵀΨu(xa,ya) + Σ_{a,b} (wab)ᵀΨp(xab,yab)
Week 5 “Optimization” lectures
Summary
Extract features: Ψ(x,yi) for input x and outputs {y1,y2,…}
Compute scores: f(Ψ(x,yi))
Prediction: y(f) = argmax_{yi} f(Ψ(x,yi))
How do I fix “f”?
Outline
• Structured Output Prediction
  – Binary Output
  – Multi-label Output
  – Structured Output
  – Learning
• Structured Output SVM
• Optimization
• Results
Learning Objective
Data distribution P(x,y)
f* = argmin_f E_{P(x,y)} [Error(y(f), y)]
Error: a measure of prediction quality, comparing the prediction y(f) to the ground truth y
Expectation over the data distribution, which is unknown
Learning Objective
Training data {(xi,yi), i = 1,2,…,n} (finite samples)
f* = argmin_f Σi Error(yi(f), yi)
Expectation over the empirical distribution
Learning Objective
Training data {(xi,yi), i = 1,2,…,n} (finite samples)
f* = argmin_f Σi Error(yi(f), yi) + λ R(f)
R: regularizer; λ: relative weight (hyperparameter)
Learning Objective
Error can be the negative log-likelihood (probabilistic model)
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
• Results
Taskar et al. NIPS 2003; Tsochantaridis et al. ICML 2004
Score Function and Prediction
Input: x, Output: y
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = wᵀΨ(x,y)
Prediction: max_y wᵀΨ(x,y)
Predicted output: y(w) = argmax_y wᵀΨ(x,y)
Error Function
Δ(y, y(w)): loss or risk of the prediction given the ground truth (user specified)
Classification loss? Δ(y, y(w)) = δ(y ≠ y(w))
e.g. ground truth “New York”: predicting “New York” costs 0, predicting “Paris” costs 1
Error Function
Δ(y, y(w)): loss or risk of the prediction given the ground truth (user specified)
Detection loss? Based on the overlap score = Area of intersection / Area of union
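A sketch of the overlap score for axis-aligned boxes; the (x1, y1, x2, y2) box format is an assumption, not from the slides:

```python
def overlap(box_a, box_b):
    """Area of intersection / area of union for two axis-aligned boxes."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

print(overlap((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```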
Error Function
Δ(y, y(w)): loss or risk of the prediction given the ground truth (user specified)
Segmentation loss? Fraction of incorrect pixels
Micro-average or macro-average
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
Δ(yi, yi(w)): loss function for the i-th sample
Minimize the regularized sum of losses over the training data
Highly non-convex in w; regularization plays no role (overfitting may occur)
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
Δ(yi, yi(w)) = Δ(yi, yi(w)) + wᵀΨ(xi, yi(w)) - wᵀΨ(xi, yi(w))
≤ wᵀΨ(xi, yi(w)) + Δ(yi, yi(w)) - wᵀΨ(xi, yi)   [since yi(w) maximizes wᵀΨ(xi, ·)]
≤ max_y { wᵀΨ(xi, y) + Δ(yi, y) } - wᵀΨ(xi, yi)
Convex in w, and sensitive to regularization of w
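A numeric sanity check of the bound on toy score and loss tables (the numbers are made up):

```python
import numpy as np

scores = np.array([1.0, 2.5, 0.3])   # w^T Psi(xi, y) for y = 0, 1, 2; ground truth yi = 0
delta = np.array([0.0, 1.0, 1.0])    # Delta(yi, y)

y_pred = int(scores.argmax())                    # yi(w)
hinge = (scores + delta).max() - scores[0]       # the convex upper bound
assert delta[y_pred] <= hinge                    # 1.0 <= 2.5
print(delta[y_pred], hinge)
```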
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
wTΨ(xi,y) + Δ(yi,y) - wTΨ(xi,yi) ≤ ξi for all y
minw ||w||2 + C Σiξi
Learning Objective
Quadratic program with large number of constraints
Many polynomial time algorithms
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
  – Stochastic subgradient descent
  – Conditional gradient aka Frank-Wolfe
• Results
Shalev-Shwartz et al. Mathematical Programming 2011
Gradient
Convex function g(z)
Gradient s at a point z0: g(z) - g(z0) ≥ sᵀ(z - z0)
Example: g(z) = z². Gradient? 2z0
Gradient Descent
min_z g(z), e.g. g(z) = z²
Start at some point z0, then move along the negative gradient direction:
z_{t+1} ← z_t - λ_t g′(z_t)
Estimate the step size via line search
Gradient
Convex function g(z)
The gradient may not exist, e.g. g(z) = |z| at z0 = 0: s?
Subgradient
Convex function g(z)
Subgradient s at a point z0: g(z) - g(z0) ≥ sᵀ(z - z0)
May not be unique, e.g. g(z) = |z| at z0 = 0
Subgradient Descent
min_z g(z), e.g. g(z) = |z|
Start at some point z0, then move along the negative subgradient direction:
z_{t+1} ← z_t - λ_t g′(z_t)
Estimate the step size via line search? Doesn’t always work
Subgradient Descent
min_z g(z), g(z) = max{z2 + 2z1, z2 - 2z1}
[Figure: level sets g(z) = 3, 4, 5 in the (z1, z2) plane. At z = (0, 5), s = (2, 1) is a valid subgradient; a step of -λs reaches (-2λ, 5 - λ), where g = 5 + 3λ, so line search along -s cannot decrease g.]
Subgradient Descent
min_z g(z), e.g. g(z) = |z|
Start at some point z0, then move along the negative subgradient direction:
z_{t+1} ← z_t - λ_t g′(z_t)
Convergence: step sizes must satisfy lim_{T→∞} Σ_{t=1}^{T} λ_t = ∞ and lim_{t→∞} λ_t = 0
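A sketch of subgradient descent on g(z) = |z| with λ_t = 1/(t+1), which satisfies both conditions:

```python
def subgradient(z):
    """A valid subgradient of |z| (any s in [-1, 1] works at z = 0)."""
    return 1.0 if z > 0 else (-1.0 if z < 0 else 0.0)

z = 5.0
for t in range(1000):
    z -= subgradient(z) / (t + 1)   # lambda_t = 1/(t+1): sums to infinity, tends to 0
print(z)                            # oscillates ever closer to the minimum at 0
```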
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
min_w ||w||² + C Σi ξi
s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y
Constrained problem?
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
min_w ||w||² + C Σi max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }
Subgradient?
Subgradient
g(z) - g(z0) ≥ sᵀ(z - z0)
For C Σi max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }, a subgradient is
C Σi ( Ψ(xi,ŷ) - Ψ(xi,yi) ), where
ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }
= argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }   (the dropped term is constant in y)
Proof? Computing ŷ is an inference problem
Inference
Classification inference
ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }
Output: y ∈ {1,2,…,C}
Brute-force search
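A brute-force sketch of this loss-augmented inference for the multi-class case with the 0-1 loss (labels 0-indexed for convenience):

```python
import numpy as np

def loss_augmented_inference(w, phi, y_true, C):
    """y_hat = argmax_y { w^T Psi(xi, y) + Delta(yi, y) } by enumeration."""
    d = phi.size
    vals = [w[y * d:(y + 1) * d] @ phi + float(y != y_true) for y in range(C)]
    return int(np.argmax(vals))
```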
Inference
Detection inference
ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }
Output: y ∈ {1,2,…,C}
Brute-force search
Inference
Segmentation inference
ŷ = argmax_y { wᵀΨ(xi,y) + Δ(yi,y) }
= argmax_y Σ_a (wa)ᵀΨu(xi_a, y_a) + Σ_{a,b} (wab)ᵀΨp(xi_ab, y_ab) + Σ_a Δ(yi_a, y_a)
Week 5 “Optimization” lectures
Subgradient Descent
Start at some parameter w0
For t = 0 to T // Number of iterations
  s = 2 w_t
  For i = 1 to n // Number of samples
    ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
    s = s + C (Ψ(xi,ŷ) - Ψ(xi,yi))
  End
  w_{t+1} = w_t - λ_t s, λ_t = 1/(t+1)
End
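A runnable Python sketch of this pseudocode for the multi-class case with 0-1 loss; joint_feature, the toy data, and the hyperparameters are all stand-ins:

```python
import numpy as np

def joint_feature(phi, y, C):
    psi = np.zeros(C * phi.size)
    psi[y * phi.size:(y + 1) * phi.size] = phi   # Phi(x) in block y
    return psi

def train_ssvm(phis, ys, C, C_reg=1.0, T=100):
    w = np.zeros(C * phis[0].size)
    for t in range(T):
        s = 2 * w                                # subgradient of ||w||^2
        for phi, y_i in zip(phis, ys):
            # loss-augmented inference by brute force over the labels
            y_hat = int(np.argmax([w @ joint_feature(phi, y, C) + float(y != y_i)
                                   for y in range(C)]))
            s += C_reg * (joint_feature(phi, y_hat, C) - joint_feature(phi, y_i, C))
        w -= s / (t + 1)                         # lambda_t = 1/(t+1)
    return w

phis = [np.random.randn(10) for _ in range(20)]  # toy features
ys = [np.random.randint(3) for _ in range(20)]   # toy labels in {0, 1, 2}
w = train_ssvm(phis, ys, C=3)
```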
Learning Objective
Training data {(xi,yi), i = 1,2,…,n}
min_w ||w||² + C Σi max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }
Stochastic Approximation
Choose a sample ‘i’ with probability 1/n:
min_w ||w||² + C n max_y { wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) }
Expected value? The original objective function
Stochastic Subgradient Descent
Start at some parameter w0
For t = 0 to T // Number of iterations
  s = 2 w_t
  Choose a sample ‘i’ with probability 1/n
  ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }
  s = s + C n (Ψ(xi,ŷ) - Ψ(xi,yi))
  w_{t+1} = w_t - λ_t s, λ_t = 1/(t+1)
End
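The stochastic variant changes only the inner loop: one sample per iteration, with the sample term scaled by n so that its expectation matches the full objective (reusing joint_feature from the sketch above):

```python
import numpy as np
rng = np.random.default_rng(0)

def sgd_step(w, phis, ys, C, C_reg, t):
    n = len(phis)
    i = int(rng.integers(n))                     # sample 'i' with probability 1/n
    phi, y_i = phis[i], ys[i]
    y_hat = int(np.argmax([w @ joint_feature(phi, y, C) + float(y != y_i)
                           for y in range(C)]))
    s = 2 * w + C_reg * n * (joint_feature(phi, y_hat, C) - joint_feature(phi, y_i, C))
    return w - s / (t + 1)                       # lambda_t = 1/(t+1)
```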
Convergence Rate
Compute an ε-optimal solution
C: SSVM hyperparameter
d: Number of non-zeros in the feature vector
O(dC/ε) iterations
Each iteration requires solving an inference problem
Side Note: Structured Output CNN
[Figure: conv1–conv5, fc6, fc7 followed by an SSVM layer]
Back-propagate the subgradients
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
  – Stochastic subgradient descent
  – Conditional gradient aka Frank-Wolfe
• Results
Lacoste-Julien et al. ICML 2013
Conditional Gradient
[Figure sequence illustrating Frank-Wolfe iterations; slides courtesy Martin Jaggi]
SSVM Primal
min_w ||w||² + C Σi ξi
s.t. wᵀΨ(xi,y) + Δ(yi,y) - wᵀΨ(xi,yi) ≤ ξi for all y
Derive dual on board
SSVM Dual
max_α -||Mα||²/4 + bᵀα
s.t. Σ_y αi(y) = C for all i
αi(y) ≥ 0 for all i, y
where w = Mα/2 and b = [Δ(yi,y)]
Linear Program
max_α -(Mα)ᵀw_t + bᵀα
s.t. Σ_y αi(y) = C for all i
αi(y) ≥ 0 for all i, y
Standard Frank-Wolfe: solve this over all possible α
Block Coordinate Frank-Wolfe: solve this over αi for a single sample ‘i’
Linear Program
max_α -(Mα)ᵀw_t + bᵀα, subject to the same constraints
Vertices? αi(y) = C if y = ŷ, 0 otherwise
Solution
Which vertex maximizes the linear function?
ŷ = argmax_y { w_tᵀΨ(xi,y) + Δ(yi,y) }   (inference)
si(y) = C if y = ŷ, 0 otherwise
Update
α_{t+1} = (1 - μ) α_t + μ s
Standard Frank-Wolfe: s contains the solution for all the samples
Block Coordinate Frank-Wolfe: s contains the solution for sample ‘i’; s_j = α_t,j for all other samples
Step-Size
α_{t+1} = (1 - μ) α_t + μ s
Maximizing a quadratic function in one variable μ
Analytical computation of optimal step-size
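A sketch of one Frank-Wolfe step on this dual, under the sign convention above (M, b, the current α, and the vertex s from inference are given); the closed-form μ comes from setting the derivative of the quadratic in μ to zero:

```python
import numpy as np

def frank_wolfe_step(M, b, alpha, s):
    """alpha_{t+1} = (1 - mu) alpha_t + mu s with the analytic step size."""
    d = s - alpha
    Md = M @ d
    denom = Md @ Md
    if denom == 0.0:
        return alpha                       # moving towards s changes nothing
    w_t = M @ alpha / 2.0                  # w = M alpha / 2
    mu = (2.0 * (b @ d) - 2.0 * (w_t @ Md)) / denom
    return alpha + np.clip(mu, 0.0, 1.0) * d
```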
Comparison
[Plot: convergence comparison of the optimizers on the OCR dataset]
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function
Optical Character Recognition
Identify each letter in a handwritten word
Taskar, Guestrin and Koller, NIPS 2003
[Figure: chain model over letters X1, X2, X3, X4; labels L = {a, b, …, z}]
[Bar charts: error of Logistic Regression vs. Multi-Class SVM, and of Maximum Likelihood vs. Structured Output SVM]
Image Segmentation
Szummer, Kohli and Hoiem, ECCV 2006
[Figure: 3×3 grid model over X1,…,X9; labels L = {0, 1}]
[Bar chart, scale 0–25: error for Unary, Max Likelihood, and SSVM]
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function
Scene Dataset
Finley and Joachims, ICML 2008
[Bar chart, scale 9.6–11.4: loss for Greedy, LBP, Combine, Exact, and LP inference]
Reuters Dataset
Finley and Joachims, ICML 2008
[Bar chart, scale 0–18: loss for Greedy, LBP, Combine, Exact, and LP inference]
Yeast Dataset
Finley and Joachims, ICML 2008
[Bar chart, scale 0–50: loss for Greedy, LBP, Combine, Exact, and LP inference]
Mediamill Dataset
Finley and Joachims, ICML 2008
[Bar chart, scale 0–40: loss for Greedy, LBP, Combine, Exact, and LP inference]
Outline
• Structured Output Prediction
• Structured Output SVM
• Optimization
• Results
  – Exact Inference
  – Approximate Inference
  – Choice of Loss Function
“Jumping” Classification
Standard Pipeline
Collect dataset D = {(xi,yi), i = 1,…,n}
Learn your favourite classifier
The classifier assigns a score to each test sample
Threshold the score for classification
“Jumping” Ranking
[Figure: six images ranked 1–6; Average Precision = 1]
Ranking vs. Classification
[Figure: alternative rankings compared; Average Precision = 1, 0.92, 0.81; Accuracy = 1, 0.67]
Standard Pipeline
Collect dataset D = {(xi,yi), i = 1,…,n}
Learn your favourite classifier
The classifier assigns a score to each test sample
Sort the score for ranking
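A sketch of computing average precision from a ranked list of binary labels; the two hypothetical rankings (three positives, three negatives) illustrate how AP penalizes misordered positives:

```python
import numpy as np

def average_precision(ranked_labels):
    """Mean, over the positives, of precision at each positive's rank."""
    labels = np.asarray(ranked_labels, dtype=float)
    precision_at_k = np.cumsum(labels) / np.arange(1, labels.size + 1)
    return float(precision_at_k[labels == 1].mean())

print(average_precision([1, 1, 1, 0, 0, 0]))  # 1.0
print(average_precision([1, 0, 1, 1, 0, 0]))  # (1 + 2/3 + 3/4) / 3 ≈ 0.81
```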
Computes subgradients of the AP loss
Yue, Finley, Radlinski and Joachims, SIGIR 2007
[Bar chart: training time for 0-1 loss vs. AP loss; AP is 5x slower]
[Bar chart: average precision for 0-1 loss vs. AP loss; 4% improvement for free]
Efficient Optimization of Average Precision
Pritish Mohapatra, C. V. Jawahar, M. Pawan Kumar
[Bar chart: training time for 0-1 loss, AP loss (5x slower), and the efficient AP method (slightly faster)]
Each iteration for AP optimization is slightly slower
It takes fewer iterations to converge in practice
Questions?