2D Image Processing Introduction to object recognition &...
Transcript of 2D Image Processing Introduction to object recognition &...
INFORMATIK
Prof. Didier Stricker
Kaiserlautern University
http://ags.cs.uni-kl.de/
DFKI – Deutsches Forschungszentrum für Künstliche Intelligenz
http://av.dfki.de
2D Image Processing Introduction to object recognition & SVM
1
INFORMATIK
• Basic recognition tasks
• A statistical learning approach
• Traditional recognition pipeline • Bags of features • Classifiers
• Support Vector Machine
Overview
2
Image classification • outdoor/indoor • city/forest/factory/etc.
3
Image tagging
• street • people • building • mountain • …
4
Object detection • find pedestrians
5
INFORMATIK
• walking • shopping • rolling a cart • sitting • talking • …
Activity recognition
6
INFORMATIK
mountain
building tree
banner
market people
street lamp
sky
building
Image parsing
7
INFORMATIK
This is a busy street in an Asian city. Mountains and a large palace or fortress loom in the background. In the foreground, we see colorful souvenir stalls and people walking around and shopping. One person in the lower left is pushing an empty cart, and a couple of people in the middle are sitting, possibly posing for a photograph.
Image description
8
INFORMATIK
Image classification
9
INFORMATIK
§ Apply a prediction function to a feature representation of the image to get the desired output:
f( ) = “apple” f( ) = “tomato” f( ) = “cow”
The statistical learning framework
10
INFORMATIK
y = f(x)
§ Training: given a training set of labeled examples
{(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
§ Testing: apply f to a never before seen test example x and output the predicted value y = f(x)
The statistical learning framework
output prediction function
Image feature
11
INFORMATIK
Predic'on
Steps Training Labels
Training Images
Training
Training
Image Features
Image Features
Testing
Test Image
Learned model
Learned model
Slide credit: D. Hoiem 12
INFORMATIK
Tradi&onal recogni&on pipeline
Hand-designed feature
extraction
Trainable classifier
Image Pixels
• Features are not learned • Trainable classifier is often generic (e.g. SVM)
Object Class
13
INFORMATIK
Bags of features
14
INFORMATIK
1. Extract local features 2. Learn “visual vocabulary” 3. Quantize local features using visual vocabulary 4. Represent images by frequencies of “visual words”
Traditional features: Bags-of-features
15
INFORMATIK
• Sample patches and extract descriptors
1. Local feature extraction
16
INFORMATIK
2. Learning the visual vocabulary
…
Slide credit: Josef Sivic
Extracted descriptors from the training set
17
INFORMATIK
2. Learning the visual vocabulary
Clustering
…
Slide credit: Josef Sivic 18
INFORMATIK
2. Learning the visual vocabulary
Clustering
…
Slide credit: Josef Sivic
Visual vocabulary
19
INFORMATIK
∑ ∑ −=k
ki
kiMXDcluster
clusterinpoint
2)(),( mx
Review: K-means clustering
• Want to minimize sum of squared Euclidean distances between features xi and their nearest cluster centers mk
Algorithm:
• Randomly ini&alize K cluster centers
• Iterate un&l convergence: § Assign each feature to the nearest center § Recompute each cluster center as the mean of all features assigned to it
20
INFORMATIK
Example visual vocabulary
… Source: B. Leibe
Appearance codebook 21
INFORMATIK
1. Extract local features 2. Learn “visual vocabulary” 3. Quantize local features using visual vocabulary 4. Represent images by frequencies of “visual words”
Bags-of-features: main steps
22
INFORMATIK
• Orderless document representation: frequencies of words from a dictionary Salton & McGill (1983)
Bags of features: Motivation
23
INFORMATIK
• Orderless document representation: frequencies of words from a dictionary Salton & McGill (1983)
Bags of features: Motivation
24
INFORMATIK
• Orderless document representation: frequencies of words from a dictionary Salton & McGill (1983)
Bags of features: Motivation
25
INFORMATIK
US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/
• Orderless document representation: frequencies of words from a dictionary Salton & McGill (1983)
Bags of features: Motivation
26
INFORMATIK
Spatial pyramids
level 0
Lazebnik, Schmid & Ponce (CVPR 2006) 27
INFORMATIK
Spatial pyramids
level 0 level 1
Lazebnik, Schmid & Ponce (CVPR 2006) 28
INFORMATIK
Spatial pyramids
level 0 level 1 level 2
Lazebnik, Schmid & Ponce (CVPR 2006) 29
INFORMATIK
• Scene classifica'on results
Spatial pyramids
30
INFORMATIK
• Caltech101 classification results
Spatial pyramids
31
INFORMATIK
Tradi&onal recogni&on pipeline
Hand-designed feature
extraction
Trainable classifier
Image Pixels
Object Class
32
INFORMATIK
f(x) = label of the training example nearest to x
§ All we need is a distance function for our inputs § No training required!
Classifiers: Nearest neighbor
Test example
Training examples
from class 1
Training examples
from class 2
33
INFORMATIK
• For a new point, find the k closest points from training data
• Vote for class label with labels of the k points
K-nearest neighbor classifier
k = 5
34
INFORMATIK
§ Which classifier is more robust to outliers?
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 35
INFORMATIK
K-nearest neighbor classifier
Credit: Andrej Karpathy, http://cs231n.github.io/classification/ 36
INFORMATIK
Linear classifiers
§ Find a linear func+on to separate the classes:
f(x) = sgn(w ⋅⋅ x + b) 37
INFORMATIK
Visualizing linear classifiers
Source: Andrej Karpathy, http://cs231n.github.io/linear-classify/ 38
INFORMATIK
• NN pros: § Simple to implement § Decision boundaries not necessarily linear § Works for any number of classes § Nonparametric method
• NN cons: § Need good distance function § Slow at test time
• Linear pros: § Low-dimensional parametric representation § Very fast at test time
• Linear cons: § Works for two classes § How to train the linear function? § What if data is not linearly separable?
Nearest neighbor vs. linear classifiers
39
INFORMATIK
40
§ Linear Classifiers can lead to many equally valid choices for the decision boundary
Maximum Margin
Are these really “equally valid”?
INFORMATIK
41
§ How can we pick which is best? § Maximize the size of the margin.
Max Margin
Are these really “equally valid”?
Small Margin
Large Margin
INFORMATIK
42
§ Support Vectors are those input points (vectors) closest to the decision boundary
1. They are vectors 2. They “support” the decision hyperplane
Support Vectors
INFORMATIK
43
§ Define this as a decision problem
§ The decision hyperplane:
§ No fancy math, just the equation of a hyperplane.
Support Vectors
~w
T~x + b = 0
INFORMATIK
44
§ Define this as a decision problem § The decision hyperplane:
§ Decision Function:
Support Vectors
~w
T~x + b = 0
D(~xi) = sign(~w
T~xi + b)
INFORMATIK
45
§ Define this as a decision problem § The decision hyperplane:
§ Margin hyperplanes:
Support Vectors
~w
T~x + b = 0
~w
T~x + b = γ
~w
T~x + b = γ
INFORMATIK
46
§ The decision hyperplane:
§ Scale invariance
Support Vectors
~w
T~x + b = 0
c~w
T~x + cb = 0
INFORMATIK
Support Vectors
§ The decision hyperplane:
§ Scale invariance
~w
T~x + b = 0
47
c~w
T~x + cb = 0
~w
T~x + b = γ
~w
T~x + b = γ
INFORMATIK
Support Vectors
§ The decision hyperplane:
§ Scale invariance
~w
T~x + b = 0
48
This scaling does not change the decision hyperplane, or the support vector hyperplanes. But we will eliminate a variable from the optimization
~
w
⇤T~x + b
⇤ = 1~
w
⇤T~x + b
⇤ = 1
c~w
T~x + cb = 0
INFORMATIK
49
§ We will represent the size of the margin in terms of w.
§ This will allow us to simultaneously § Identify a decision boundary § Maximize the margin
What are we optimizing?
INFORMATIK
50
1. There must at least one point that lies on each support hyperplanes
How do we represent the size of the margin in terms of w?
x1
x2Proof outline: If not, we could define a larger margin support hyperplane that does touch the nearest point(s).
INFORMATIK
51
1. There must at least one point that lies on each support hyperplanes
2. Thus:
How do we represent the size of the margin in terms of w?
x1
x2w
Tx1 + b = 1
w
Tx2 + b = 1
3. And:
w
T (x1 x2) = 2
INFORMATIK
How do we represent the size of the margin in terms of w?
1. There must at least one point that lies on each support hyperplanes
2. Thus:
52
x1
x2w
Tx1 + b = 1
w
Tx2 + b = 1
3. And:
w
T (x1 x2) = 2
w
Tx + b = 0
INFORMATIK
§ The vector w is perpendicular to the decision hyperplane § If the dot product of two vectors equals zero, the two vectors are perpendicular.
How do we represent the size of the margin in terms of w?
53
x1
x2
w
Tx + b = 0
w
w
T (x1 x2) = 2
INFORMATIK
54
§ The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.
How do we represent the size of the margin in terms of w?
x1
x2
w
Tx + b = 0
w
w
T (x1 x2) = 2
INFORMATIK
55
Recall: Vector Projection
~u
~v
✓
cos(✓) =adjacent
hypotenuse
~v · ~uk~uk ~u
~v · ~u = k~vkk~uk cos(✓)
hypotenuse = k~vk adjacent = kgoalk
cos(✓) =k ~
goalkk~vk
k~vkk~ukk~vkk~uk cos(✓) =
k ~
goalkk~vk
~v · ~uk~vkk~uk =
k ~
goalkk~vk
~v · ~uk~uk = k ~
goalk
INFORMATIK
56
§ The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.
How do we represent the size of the margin in terms of w?
x1
x2
w
Tx + b = 0
w
w
T (x1 x2) = 2
~v · ~uk~uk ~u
~w
T (x1 − x2)k~wk ~w
2k~wkSize of the Margin:
Projection:
INFORMATIK
57
§ Goal: maximize the margin
Maximizing the margin
max2
k~wk
min k~wk
where ti(~w
Txi + b) 1
Linear Separability of the data by the decision boundary
~w
Txi + b ≥ 1 if ti = 1
~w
Txi + b 1 if ti = −1
INFORMATIK
58
§ If constraint optimization then Lagrange Multipliers § Optimize the “Primal”
Max Margin Loss Function
L(~w, b) =12
~w · ~wN1X
i=0
↵i[ti((~w · ~xi) + b) 1]
min k~wk where ti(~w
Txi + b) 1
Lagrange Multiplier
INFORMATIK
59
§ Optimize the “Primal”
Max Margin Loss Function
L(~w, b) =12
~w · ~wN1X
i=0
↵i[ti((~w · ~xi) + b) 1]
@L(~w, b)@b
= 0
N1X
i=0
↵iti = 0
Partial wrt b
INFORMATIK
60
§ Optimize the “Primal”
Max Margin Loss Function
L(~w, b) =12
~w · ~wN1X
i=0
↵i[ti((~w · ~xi) + b) 1]
@L(~w, b)@ ~w
= 0
~wN1X
i=0
↵iti ~xi = 0
~w =N1X
i=0
↵iti ~xi
Partial wrt w
INFORMATIK
61
§ Optimize the “Primal”
Max Margin Loss Function
L(~w, b) =12
~w · ~wN1X
i=0
↵i[ti((~w · ~xi) + b) 1]
@L(~w, b)@ ~w
= 0
~wN1X
i=0
↵iti ~xi = 0
~w =N1X
i=0
↵iti ~xi
Partial wrt w
Now have to find αi. Substitute back to the Loss function
INFORMATIK
62
§ Construct the “dual”
Max Margin Loss Function
L(~w, b) =12
~w · ~wN1X
i=0
↵i[ti((~w · ~xi) + b) 1]
~w =N1X
i=0
↵iti ~xi
W (↵) =N1X
i=0
↵i12
N1X
i,j=0
↵i↵jtitj(~xi · ~xj)
where ↵i 0N1X
i=0
↵iti = 0
N1X
i=0
↵iti = 0With:
Then:
INFORMATIK
63
§ Solve this quadratic program to identify the lagrange multipliers and thus the weights
Dual formulation of the error
W (↵) =N1X
i=0
↵i12
N1X
i,j=0
↵i↵jtitj(~xi · ~xj)
where ↵i 0
There exist (rather) fast approaches to quadratic program solving in both C, C++, Python, Java and R
INFORMATIK
64
Quadratic Programming
subject to (one or more) A~x k
B~x = l
minimize f(~x) =12~x
TQ~x + c
T~x
• If Q is positive semi definite, then f(x) is convex.
• If f(x) is convex, then there is a single maximum.
W (↵) =N1X
i=0
↵i12
N1X
i,j=0
↵i↵jtitj(~xi · ~xj)
where ↵i 0
INFORMATIK
Support Vector Expansion
§ When αi is non-‐zero then xi is a support vector
§ When αi is zero xi is not a support vector 65
~w =N1X
i=0
↵iti ~xi
D(~x) = sign(~w
T~x + b)
= sign
0
@"
N1X
i=0
↵iti ~xi
#T
~x + b
1
A
= sign
"N1X
i=0
↵iti(~xiT~x)
#+ b
!
New decision Function Independent of the
Dimension of x!
INFORMATIK
66
§ In constraint optimization: At the optimal solution
§ Constraint * Lagrange Multiplier = 0
Karush-Kuhn-Tucker Conditions
↵i(1 ti(~w
T~xi + b)) = 0
if ↵i 6= 0 ! ti(~w
T~xi + b) = 1
Only points on the decision boundary contribute to the solution!
INFORMATIK
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable
Nonlinear SVMs
Φ: x → φ(x)
Image source 67
INFORMATIK
• Linearly separable dataset in 1D:
• Non-separable dataset in 1D:
• We can map the data to a higher-dimensional space:
0 x
0 x
0 x
x2
Nonlinear SVMs
Slide credit: Andrew Moore 68
INFORMATIK
• General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable.
• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
K(x , y) = φ(x) · φ(y) § (to be valid, the kernel function must satisfy
Mercer’s condition)
The kernel trick
69
INFORMATIK
• Linear SVM decision function:
The kernel trick
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
bybi iii +⋅=+⋅ ∑ xxxw α
Support vector
learned weight
70
INFORMATIK
bKybyi
iiii
iii +=+⋅ ∑∑ ),()()( xxxx αϕϕα
The kernel trick • Linear SVM decision func&on:
• Kernel SVM decision func&on:
• This gives a nonlinear decision boundary in the original feature space
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 1998
bybi iii +⋅=+⋅ ∑ xxxw α
71
INFORMATIK
Polynomial kernel: dcK )(),( yxyx ⋅+=
72
INFORMATIK
• Also known as the radial basis function (RBF) kernel:
Gaussian kernel
||x – y||
K(x, y)
⎟⎠
⎞⎜⎝
⎛ −−=2
2
1exp),( yxyxσ
K
73
INFORMATIK
Gaussian kernel
SV’s
74
INFORMATIK
• Histogram intersection:
• Square root (Bhattacharyya kernel):
Kernels for histograms
K(h1,h2 ) = min(h1(i),h2 (i))i=1
N
∑
K(h1,h2 ) = h1(i)h2 (i)i=1
N
∑
75
INFORMATIK
• Pros § Kernel-based framework is very powerful, flexible § Training is convex optimization, globally optimal solution
can be found § Amenable to theoretical analysis § SVMs work very well in practice, even with very small
training sample sizes
• Cons § No “direct” multi-class SVM, must combine two-class
SVMs (e.g., with one-vs-others) § Computation, memory (esp. for nonlinear SVMs)
SVMs: Pros and cons
76
INFORMATIK
• Generalization refers to the ability to correctly classify never before seen examples
• Can be controlled by turning “knobs” that affect the complexity of the model
Generalization
Training set (labels known) Test set (labels unknown) 77
INFORMATIK
• Training error: how does the model perform on the data on which it was trained?
• Test error: how does it perform on never before seen data?
Diagnosing generalization ability
Source: D. Hoiem
Training error
Test error
Underfitting Overfitting
Model complexity High Low
Err
or
78
INFORMATIK
• Underfitting: training and test error are both high § Model does an equally poor job on the training and the test set § Either the training procedure is ineffective or the model is too
“simple” to represent the data • Overfitting: Training error is low but test error is high
• Model fits irrelevant characteristics (noise) in the training data § Model is too complex or amount of training data is insufficient
Underfitting and overfitting
Underfitting
Overfitting
Good generalization
Figure source 79
INFORMATIK
Effect of training set size
Many training examples
Few training examples
Model complexity High Low
Test
Err
or
Source: D. Hoiem 80
INFORMATIK
Effect of training set size
Many training examples
Few training examples
Model complexity High Low
Test
Err
or
Source: D. Hoiem 81
INFORMATIK
• Split the data into training, validation, and test subsets • Use training set to optimize model parameters • Use validation test to choose the best model • Use test set only to evaluate performance
Validation
Model complexity
Training set loss
Test set loss
Err
or
Validation set loss
Stopping point
82
INFORMATIK
Histogram of Oriented Gradients (HoG)
[Dalal and Triggs, CVPR 2005]
10x10 cells
20x20 cells
HoGify
83
INFORMATIK
Histogram of Oriented Gradients (HoG)
84
INFORMATIK
§ Many examples on Internet:
§ hVps://www.youtube.com/watch?v=b_DSTuxdGDw
Histogram of Oriented Gradients (HoG)
[Dalal and Triggs, CVPR 2005] 85
INFORMATIK
86
INFORMATIK
§ Demo software: http://cs.stanford.edu/people/karpathy/svmjs/demo
§ A useful website: www.kernel-machines.org § Software:
§ LIBSVM: www.csie.ntu.edu.tw/~cjlin/libsvm/ !!§ SVMLight: svmlight.joachims.org/
Resources
87
INFORMATIK
88
Thank you!