Real-Time Computer Vision
Microsoft Computer Vision School
Vincent Lepetit - CVLab - EPFL (Lausanne, Switzerland)
1
demo
2
applications
...
3
• How the demo works (including Randomized Trees);
• More recent work.
4
Background
• 3D world to 2D images (projection matrix, internal parameters, external parameters, homography, ...);
• Robust estimation (non-linear least-squares, RANSAC, robust estimators, ...);
• Feature point matching (affine region detectors, SIFT, ...).
5
From the 3D World to a 2D Image
[Figure: a 3D point M in the world coordinate system and its image point m.]
What is the relation between the 3D coordinates of a point M and its corresponding point m in the image captured by the camera?
6
Perspective Projection
[Figure: the camera center C, a 3D point M in the world coordinate system, and its image point m.]
The image formation is modeled as a perspective projection, which is realistic for standard cameras:
the rays passing through a 3D point M and its corresponding image point m all intersect at a single point C, the camera center.
7
Expressing M in the Camera Coordinate System
[Figure: the camera coordinate system (X, Y, Z) centered at C, a 3D point M with camera coordinates Mcam, and its image point m.]
Step 1: Express the coordinates of M in the camera coordinate system as Mcam.
This transformation corresponds to a Euclidean displacement (a rotation plus a translation):
$$M_{\mathrm{cam}} = R\,M + T$$
where R is a 3×3 rotation matrix and T is a 3-vector.
8
Homogeneous Coordinates
Let's replace M by the 4-vector of homogeneous coordinates M̃: just add a 1 as the fourth coordinate.
Now, the Euclidean displacement can be expressed as a linear transformation instead of an affine one:
[Figure: the camera coordinate system (X, Y, Z) centered at C, the point Mcam and its image point m.]
$$M = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \;\rightarrow\; \tilde{M} = \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
$$M_{\mathrm{cam}} = R\,M + T \;\rightarrow\; \begin{pmatrix} X_{\mathrm{cam}} \\ Y_{\mathrm{cam}} \\ Z_{\mathrm{cam}} \end{pmatrix} = R \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T \;\rightarrow\; \begin{pmatrix} X_{\mathrm{cam}} \\ Y_{\mathrm{cam}} \\ Z_{\mathrm{cam}} \end{pmatrix} = \left( R \,|\, T \right) \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \;\rightarrow\; M_{\mathrm{cam}} = \left( R \,|\, T \right) \tilde{M}$$
(R | T) is a 3×4 matrix.
9
Projection
Computation of the coordinates of m in the image plane, from Mcam (expressed in the camera coordinate system): simply use Thales' theorem (similar triangles):
[Figure: side view of the camera, with camera center C, focal length f, the point Mcam at depth Z, and its projection mX on the image plane.]
$$\frac{m_X}{f} = \frac{X}{Z} \;\rightarrow\; m_X = f\,\frac{X}{Z}$$
10
From Projection to Image
Coordinates of m in pixels?
[Figure: the image plane with the image coordinate system (u, v), its origin offset (u0, v0), and the pixel dimensions (1/ku, 1/kv).]
$$m_X = f\,\frac{X}{Z}, \qquad m_Y = f\,\frac{Y}{Z}$$
$$m_u = u_0 + k_u\,m_X, \qquad m_v = v_0 + k_v\,m_Y$$
11
Putting the perspective projection and the transformation into pixel coordinates together, in matrix form:
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$
where $(u, v, w)^\top$ defines m in homogeneous coordinates, so that:
$$m_u = \frac{u}{w} = u_0 + k_u f\,\frac{X}{Z}, \qquad m_v = \frac{v}{w} = v_0 + k_v f\,\frac{Y}{Z}$$
12
The Full Transformation
The two transformations are chained to form the full transformation from a 3D point in the world coordinate system to its projection in the image:
The product of the internal calibration matrix and the external calibration matrix is a 3x4 matrix called the "projection matrix".
The projection matrix is defined up to a scale factor.
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
The 3×4 matrix P is the projection matrix.
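As a concrete illustration, a minimal C++ sketch of this full transformation (the matrix and point values are assumed to be supplied by the caller): multiply by R and T, multiply by the internal calibration matrix, then divide by the third homogeneous coordinate.

// Minimal sketch: project a 3D point M (world coordinates) into pixel
// coordinates with internal parameters K = [ku*f 0 u0; 0 kv*f v0; 0 0 1]
// and external parameters R (3x3 rotation) and T (3-vector).
void project(const double K[3][3], const double R[3][3], const double T[3],
             const double M[3], double &mu, double &mv)
{
    // Mcam = R * M + T
    double Mcam[3];
    for (int i = 0; i < 3; ++i)
        Mcam[i] = R[i][0]*M[0] + R[i][1]*M[1] + R[i][2]*M[2] + T[i];

    // (u, v, w)^T = K * Mcam : homogeneous image coordinates
    double u = K[0][0]*Mcam[0] + K[0][1]*Mcam[1] + K[0][2]*Mcam[2];
    double v = K[1][0]*Mcam[0] + K[1][1]*Mcam[1] + K[1][2]*Mcam[2];
    double w = K[2][0]*Mcam[0] + K[2][1]*Mcam[1] + K[2][2]*Mcam[2];

    // Pixel coordinates: divide by the third homogeneous coordinate
    mu = u / w;
    mv = v / w;
}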
13
The Full Transformation
R, T, and the products kuf and kvf can be extracted from the projection matrix.
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
where P is the projection matrix.
14
Homography
For a 3D point lying on the plane Z = 0:
$$\tilde{m} = P\tilde{M} = \begin{pmatrix} P_1 & P_2 & P_3 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_1 & P_2 & P_3 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} P_1 & P_2 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} = H_{3\times 3}\, m$$
where the $P_i$ are the columns of the projection matrix and $H_{3\times 3}$ is a homography.
15
Computing a Projection Matrix or a Homography from Point Correspondences
by solving a linear system
For each correspondence between $m = [u, v, 1]^\top$ and $m' = [u', v', 1]^\top$ with $m' = Hm$, we obtain two linear equations in the entries of H:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
16
Computing a Projection Matrix or a Homography from Point Correspondences with a Non-Linear Optimization
• Non-linear least-squares minimization: minimize a physically meaningful error (the reprojection error, in pixels);
• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).
$$\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) \qquad\qquad \min_{R,T} \sum_i \mathrm{dist}^2\!\left(H_{R,T}\,m_i,\; m'_i\right)$$
17
A Look at the Reprojection Error
[Plot: the reprojection error for a 1D camera under 2D translation; 100 "3D points" taken at random in [400; 1000] × [-500; +500]; the true camera position is at (0, 0).]
18
Gaussian Noise on the Projections
White cross: true camera position; black cross: global minimum of the objective function.
In that case, the global minimum of the objective function is close to the true camera pose.
19
What if there are Outliers?
[Figure: 3D points M1..M4 and their projections m1..m4 through the camera center C; one of the measurements is incorrect (an outlier).]
20
Gaussian Noise on the Projections + 20% outliers
White cross: true camera position; black cross: global minimum of the objective function.
The global minimum is now far from the true camera pose.
21
What Happened ?
The errors on the 2D point locations m'_i are assumed to be independent and Gaussian (Normal), with identical covariance matrices σI;
This assumption is violated when m'_i is an outlier.
Bayesian interpretation:
$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) = \arg\max_{R,T} \prod_i \mathcal{N}\!\left(m'_i;\; P_{R,T}\,M_i,\; \sigma I\right)$$
22
Robust Estimation
Idea: Replace the Normal distribution by a more suitable distribution or, equivalently, replace the least-squares estimator by a "robust estimator" or "M-estimator":
$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) \;\rightarrow\; \arg\min_{R,T} \sum_i \rho\!\left(\mathrm{dist}\!\left(P_{R,T}\,M_i,\; m'_i\right)\right)$$
23
Example of an M-estimator: The Tukey Estimator
[Plot: the Tukey ρ(x) function compared to the least-squares cost x².]
The Tukey estimator assumes the measurements follow a distribution that is a mixture of:
• a Normal distribution, for the inliers,
• a uniform distribution, for the outliers.
$$\rho(x) = \begin{cases} \dfrac{c^2}{6}\left(1 - \left(1 - \left(\dfrac{x}{c}\right)^2\right)^3\right) & \text{if } |x| \le c \\[2mm] \dfrac{c^2}{6} & \text{if } |x| > c \end{cases}$$
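A minimal C++ sketch of the Tukey ρ function as defined above (the threshold c is a tuning parameter to be chosen by the user):

#include <cmath>

// Tukey robust estimator rho(x); c is the inlier/outlier threshold.
double tukey_rho(double x, double c)
{
    if (std::fabs(x) <= c) {
        double r = 1.0 - (x / c) * (x / c);
        return (c * c / 6.0) * (1.0 - r * r * r);
    }
    return c * c / 6.0;   // constant cost for outliers
}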
24
[Figure: a Normal distribution (inliers) plus a uniform distribution (outliers) gives the mixture; taking -log(.) of the Normal distribution gives the least-squares estimator, taking -log(.) of the mixture gives the Tukey estimator.]
25
Gaussian Noise on the Projections + 20% Outliers + Tukey Estimator
White cross: true camera position; black cross: global minimum of the objective function.
The global minimum is very close to the true camera pose. BUT:
- there are local minima;
- the objective function is flat where all the correspondences are considered outliers.
26
Gaussian Noise on the Projections + 50% outliers + Tukey estimator
Even more local minima. Numerical optimization can get trapped in a local minimum.
27
RANSAC
28
How to Optimize ?
Idea: sampling the space of solutions (the camera pose space here):
29
How to Optimize ?
Idea: sampling the space of solutions:
+ Numerical Optimization from the best sampled pose.
Problem: Exhaustive regular sampling is too expensive in 6 dimensions.
Can we do a smarter sampling?
30
RANSAC
RANSAC: RANdom SAmple Consensus
Line fitting: the "throwing out the worst residual" heuristic can fail (example from the original paper [Fischler81]):
outlier
final least-squares solution
Ideal line
31
RANSAC
As before, we could do a regular sampling, but it would not be optimal:
Ideal line
32
Idea:
Generate hypotheses from subsets of the measurements. If a subset contains no gross errors, the estimated parameters (the hypothesis) are close to the true ones.
Take several subsets at random, and retain the best one.
Ideal line
33
The quality of a hypothesis is evaluated by the number of measurements that lie "close enough" to the predicted line.
We need to choose a threshold T to decide if a measurement is "close enough".
RANSAC returns the best hypothesis, i.e. the hypothesis with the largest number of inliers:
$$\sum_i \begin{cases} 1 & \text{if } \mathrm{dist}\!\left(m_i, \mathrm{line}(p)\right) \le T \\ 0 & \text{if } \mathrm{dist}\!\left(m_i, \mathrm{line}(p)\right) > T \end{cases}$$
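A minimal RANSAC sketch for the line-fitting example above (the helper names lineFromTwoPoints and distToLine are hypothetical; the threshold T and the number of iterations are tuning parameters):

#include <cstdlib>
#include <vector>
#include <cmath>

struct Point { double x, y; };
struct Line  { double a, b, c; };   // a*x + b*y + c = 0, with (a, b) normalized

// Hypothetical helper: the line through two points.
Line lineFromTwoPoints(const Point &p, const Point &q)
{
    Line l;
    l.a = q.y - p.y;
    l.b = p.x - q.x;
    double n = std::sqrt(l.a * l.a + l.b * l.b);
    l.a /= n; l.b /= n;
    l.c = -(l.a * p.x + l.b * p.y);
    return l;
}

double distToLine(const Point &p, const Line &l)
{
    return std::fabs(l.a * p.x + l.b * p.y + l.c);
}

// RANSAC: generate hypotheses from random minimal subsets (2 points for a
// line), keep the hypothesis with the largest number of inliers.
Line ransacLine(const std::vector<Point> &pts, double T, int iterations)
{
    Line best = {};
    int bestInliers = -1;
    for (int it = 0; it < iterations; ++it) {
        int i = std::rand() % pts.size();
        int j = std::rand() % pts.size();
        if (i == j) continue;
        Line hyp = lineFromTwoPoints(pts[i], pts[j]);
        int inliers = 0;
        for (const Point &p : pts)
            if (distToLine(p, hyp) <= T) ++inliers;   // "close enough" test
        if (inliers > bestInliers) { bestInliers = inliers; best = hyp; }
    }
    return best;   // should then be refined using all its inliers
}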
34
RANSAC for Homographies
To apply RANSAC to homography estimation, we need a way to compute a homography from a subset of measurements:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Since RANSAC only provides a solution estimated from a limited amount of data, it must be followed by a robust minimization to refine the solution.
35
How to Get the Correspondences ?
• Extract feature points / keypoints / regions (Harris corner detector, extrema of the Laplacian, affine region detectors, ...);
• Standard approach: match them based on Euclidean distances between descriptors such as SIFT, SURF, ...
36
Affine Region Detectors
Hessian-Affine detector MSER detector
37
Affine Normalization
Warp by M1^(1/2); warp by M2^(1/2).
We still have to correct for the orientation!
38
Select Canonical Orientation
• Create a histogram of local gradient directions (over [0, 2π]) computed over the image patch;
• Each gradient contributes by its norm, weighted by its distance to the patch center;
• Assign the canonical orientation at the peak of the smoothed histogram.
39
Select Canonical Orientation
40
Description Vector
?
...
41
SIFT Description Vector
Made of local histograms of gradients.
In practice: 8 orientations × 4 × 4 histograms = a 128-dimensional vector, normalized to be robust to light changes.
...
42
Matching Regions
43
Matching: Approximate Nearest Neighbour
Best-Bin-First: approximate nearest-neighbour search in a k-d tree.
44
Keypoint Matching
The standard approach is a particular case of classification: pre-processing (to make the actual classification easier), followed by nearest-neighbor classification (search in the database).
Idea: let's try another classification method!
45
One Class per Keypoint
One class per keypoint: the set of the keypoint's possible appearances under various perspectives, lighting conditions, noise...
[Figure: example patches for class 1 and class 2.]
46
Training phase: the classifier is built from example patches of each class (class 1, class 2, ...).
Run-Time: the classifier assigns a class to each incoming patch.
47
Which Classifier?
We want a classifier that:
• can handle many classes;
• is very fast;
• has reasonable recognition performance (a very high recognition rate is not a necessary requirement).
48
Which Classifier?
• Randomized Trees [Amit & Geman, 1997];
• Random Forests [Breiman, 2001].
49
An (Ideal) Single Tree
binary test
binary test
binary test
class #
50
How to Build the Tree ?
binary test ?
training set
51
binary test ?
training set
The test is found by minimizing the entropy of the class distribution after the test, which splits the training set S into Sleft and Sright:
$$\arg\min_{\mathrm{test}} \; \frac{|S_{\mathrm{left}}|}{|S|}\,\mathrm{Entropy}(S_{\mathrm{left}}) + \frac{|S_{\mathrm{right}}|}{|S|}\,\mathrm{Entropy}(S_{\mathrm{right}})$$
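A minimal C++ sketch of this criterion, assuming the per-class sample counts of the two subsets created by a candidate test are available:

#include <cmath>
#include <vector>

// Shannon entropy of a set, given the number of training samples per class.
double entropy(const std::vector<int> &classCounts)
{
    int total = 0;
    for (int c : classCounts) total += c;
    if (total == 0) return 0.0;
    double H = 0.0;
    for (int c : classCounts) {
        if (c == 0) continue;
        double p = double(c) / total;
        H -= p * std::log2(p);
    }
    return H;
}

// Score of a candidate test: |Sleft|/|S| * H(Sleft) + |Sright|/|S| * H(Sright).
// The selected test is the one that minimizes this score.
double testScore(const std::vector<int> &leftCounts,
                 const std::vector<int> &rightCounts)
{
    int nl = 0, nr = 0;
    for (int c : leftCounts)  nl += c;
    for (int c : rightCounts) nr += c;
    double n = nl + nr;
    if (n == 0.0) return 0.0;
    return (nl / n) * entropy(leftCounts) + (nr / n) * entropy(rightCounts);
}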
52
binary test
training set
S
Problem: we quickly run out of training samples for the deeper tests.
53
Idea: Use Several Sub-Optimal Trees
Each tree is trained with a random subset of the training set.
54
Idea: Use Several Sub-Optimal Trees
The leaves contain the probabilities over the classes, computed from the training set.
55
Classification with Several Sub-Optimal Trees
The test sample is dropped into each tree, and the probabilities stored in the leaves it reaches are averaged; for three trees:
$$\hat{P}(c \mid \text{sample}) = \frac{1}{3}\left(P_1(c) + P_2(c) + P_3(c)\right)$$
56
Visual Interpretation
Each tree partitions the space in a different way and computes the probability of each class for each cell of the partition:
57
Visual Interpretation
Combining the trees gives a finer partition with a better estimate of the class probabilities:
58
For Patches
Possible tests: compare the intensities of two pixels around the keypoint after Gaussian smoothing:
• Very efficient to compute;
• Invariant to any light change modeled by a monotonically increasing function.
$$f_i(m) = \begin{cases} 1 & \text{if } I(m + \mathrm{d}m_{i,1}) \le I(m + \mathrm{d}m_{i,2}) \\ 0 & \text{otherwise} \end{cases}$$
where I is the image after Gaussian smoothing.
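A minimal C++ sketch of one such test on a grayscale patch (the smoothing is assumed to have been applied beforehand; row-major indexing and the pixel offsets are placeholders):

// One binary test: compare the intensities of two pixels at fixed offsets
// (du, dv) around the keypoint m = (u, v), in the smoothed image I.
int binaryTest(const unsigned char *I, int width,
               int u, int v,
               int du1, int dv1, int du2, int dv2)
{
    unsigned char a = I[(v + dv1) * width + (u + du1)];
    unsigned char b = I[(v + dv2) * width + (u + du2)];
    return (a <= b) ? 1 : 0;
}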
59
Results
60
Randomized Trees (and Random Ferns) applied to image patches are becoming a powerful tool for Computer Vision.
61
[Shotton et al, CVPR’11]
Used to infer body parts in the Kinect body tracking system.
The tests rely on the depth map.
62
Tests in [Shotton et al, CVPR'11]
Classes are the body parts. The goal is to label each pixel with the label of the part it belongs to.
The tests compare the depth of two pixels around the considered pixel.
The displacements are normalized by the depth of the considered pixel for invariance:
$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{depth}\!\left(m + \dfrac{\mathrm{d}m_1}{\mathrm{depth}(m)}\right) \le \mathrm{depth}\!\left(m + \dfrac{\mathrm{d}m_2}{\mathrm{depth}(m)}\right) \\ 0 & \text{otherwise} \end{cases}$$
63
3D Pose Estimation
Mean-Shift is used to find the joint locations from the body parts.
64
Training
“Training 3 trees to depth 20 from 1 million images takes about 1 day on a 1000 core cluster” [Shotton et al, CVPR’11]
Most of the training data is synthetic:
65
A Subtree
Average of the patches that reach this node.
66
[Gall and Lempitsky, CVPR'09; Barinova et al, CVPR'10]
Hough Forests for Object Detection:
• Random Forests are used to make each patch vote for the object centroid;
• The tests compare the output of filters and histograms-of-gradients between 2 pixels;
• The leaves contain the displacements toward the object center.
[Figure: each patch votes for the object centroid; the votes from all patches are accumulated, and the final detection is obtained from the accumulated votes.]
67
Tests used in [Gall and Lempitsky, CVPR'09]
Channels: the 3 color channels, the absolute values of the first and second derivatives of the image, and 9 channels from HoG (Histograms-of-Gradients).
$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{channel}_i(m + \mathrm{d}m_1) < \mathrm{channel}_i(m + \mathrm{d}m_2) + \tau \\ 0 & \text{otherwise} \end{cases}$$
68
[Bosch et al, ICCV'07]
Image Classification using Random Forests and Ferns [Bosch et al, ICCV'07]. Uses a sliding window to detect objects. Much faster than SVMs, with similar recognition performance.
69
[Bosch et al, ICCV’07]
Tests:
$$f_i(m) = \begin{cases} 1 & \text{if } \mathbf{n}^\top \mathbf{x}_m + b \le 0 \\ 0 & \text{otherwise} \end{cases}$$
where n and b are a random vector and scalar, and x_m is a vector computed from a Pyramidal Histogram-of-Gradients.
70
[Kalal et al, CVPR'10]
TLD (aka Predator), for Track, Learn, Detect:
• Random Ferns used to speed up detection;
• Trained online: the distributions in the leaves are updated online, using the incoming images.
71
[Kalal et al, CVPR'10]
• Tests: 2-bit binary patterns;
• Trained online: the distributions in the leaves are updated online, using the incoming images.
72
Random Ferns: A Simplified Tree-Like Classifier
73
For Keypoint Recognition, We Can Use Random Tests!
[Plot: recognition rate as a function of the number of trees, comparing, for 200 keypoints, tests selected by minimizing entropy and tests with random locations.]
74
We can use random tests
• For a small number of classes:
  – we can try several tests, and
  – retain the best one according to some criterion.
75
We can use random tests
• For a small number of classes:
  – we can try several tests, and
  – retain the best one according to some criterion.
• When the number of classes is large:
  – any test does a decent job:
76
Why it is Interesting
• Building the trees takes no time (we still have to estimate the posterior probabilities);
• Allows incremental learning;
• Simplifies the classifier structure.
77
The Tree Structure is not Needed
78
The Tree Structure is not Needed
f1
f2
f3
79
The Tree Structure is not Needed
[Figure: the results of the pixel comparisons f1, f2, f3 (0 or 1) are used directly to index the class label distributions.]
The distributions can be expressed simply, as:
80
We are looking for:
$$\arg\max_i P(C = c_i \mid \mathrm{patch})$$
If the patch can be represented by a set of image features {f_j}:
$$P(C = c_i \mid \mathrm{patch}) = P(C = c_i \mid f_1, f_2, \ldots, f_n, f_{n+1}, \ldots, f_N)$$
which (assuming a uniform prior over the classes) is proportional to $P(f_1, f_2, \ldots, f_N \mid C = c_i)$, but a complete representation of this joint distribution is infeasible.
A naive Bayesian approach ignores the correlations between the features:
$$P(f_1, \ldots, f_N \mid C = c_i) \approx \prod_j P(f_j \mid C = c_i)$$
Compromise: group the features into small groups (the ferns) and only assume independence between the groups:
$$P(f_1, \ldots, f_N \mid C = c_i) \approx \prod_k P(f_{(k-1)n+1}, \ldots, f_{kn} \mid C = c_i)$$
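A minimal C++ sketch of this combination over the ferns, done in log space to avoid numerical underflow (the data layout P[fern][leaf][class], holding the stored conditional probabilities, is a hypothetical choice):

#include <cmath>
#include <vector>

// Semi-naive Bayes combination: the joint probability is approximated by the
// product over the ferns of P(fern observation | class). Working in log space
// avoids underflow when there are many ferns.
int classify(const std::vector<std::vector<std::vector<double>>> &P, // [fern][leaf][class]
             const std::vector<int> &leafIndex,                      // observed leaf per fern
             int numClasses)
{
    int best = -1;
    double bestScore = -1e300;
    for (int c = 0; c < numClasses; ++c) {
        double score = 0.0;                        // log P(f_1..f_N | C = c)
        for (size_t k = 0; k < P.size(); ++k)
            score += std::log(P[k][leafIndex[k]][c]);
        if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;   // argmax over the classes
}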
81
Training
82
Training
83
Training
[Animation over slides 84 to 86: each training patch is dropped through the ferns; the binary test results (e.g. 1, 1, 0 giving index 6, or 1, 0, 1 giving index 5) form the index of a leaf, and the counter of that leaf for the patch's class is incremented.]
84-86
Training
87
Training
88
Training Results
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
89
Training Results
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
90
Recognition
91
Normalization
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
92
Subtlety with Normalization
$$p_{\mathrm{leaf},\mathrm{class}} = \frac{\mathrm{Number\ of\ samples}(\mathrm{leaf}, \mathrm{class})}{\mathrm{Number\ of\ samples}(\mathrm{class})}$$
is too selective: Number of samples(leaf, class) can be 0 simply because the training set is finite. Instead we use:
$$p_{\mathrm{leaf},\mathrm{class}} = \frac{\mathrm{Number\ of\ samples}(\mathrm{leaf}, \mathrm{class}) + N_{\mathrm{regularization}}}{\mathrm{Number\ of\ samples}(\mathrm{class}) + \mathrm{Number\ of\ leaves} \times N_{\mathrm{regularization}}}$$
This can be done by simply initializing the counters to N_regularization instead of 0.
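A minimal C++ sketch of that initialization (the container layout and names are placeholders):

#include <vector>

// Initialize every (leaf, class) counter to Nregularization instead of 0;
// after training, dividing by the per-class totals directly yields the
// regularized probabilities above.
void initCounters(std::vector<std::vector<double>> &count,  // [leaf][class]
                  int numLeaves, int numClasses, double Nreg)
{
    count.assign(numLeaves, std::vector<double>(numClasses, Nreg));
}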
93
Influence of Nregularization
[Plot: recognition rate as a function of N_regularization (log scale), for the regularized probability p_leaf,class defined above; the 50% level is marked.]
94
Implementation of Feature Point Recognition with Ferns
// H: number of classes, M: number of ferns, S: number of tests per fern.
// K: pointer to the (smoothed) patch data, D: test pixel offsets (2 per test),
// PF: stored leaf distributions, P: output class scores.
for (int i = 0; i < H; i++) P[i] = 0.;              // reset the class scores
for (int k = 0; k < M; k++) {                       // for each fern
    int index = 0, *d = D + k * 2 * S;
    for (int j = 0; j < S; j++) {                   // build the leaf index bit by bit
        index <<= 1;
        if (*(K + d[0]) < *(K + d[1]))              // one pixel intensity comparison
            index++;
        d += 2;
    }
    p = PF + k * shift2 + index * shift1;           // distribution stored in that leaf
    for (int i = 0; i < H; i++) P[i] += p[i];       // accumulate over the ferns
}
• Very simple to implement;
• No need for orientation, perspective, or light correction.
95
Ferns versus SIFT
[Plot: number of inliers for Ferns versus number of inliers for SIFT; each point corresponds to an image from a 1000-frame sequence.]
Ferns are much faster, and sometimes more accurate, but SIFT does not need training.
96
Randomized Trees vs Ferns
Ferns are more discriminant but more sensitive to outliers.
[Plot: recognition rate as a function of the number of structures, for different combination strategies, average (RT) / product (Ferns): Ferns with product, RT (with random tests) with product, Ferns with average, RT (with random tests) with average.]
97
Randomized Trees vs Ferns
Influence of the number of classes:
[Plot: recognition rate as a function of the number of classes, for Ferns with product and Ferns with average.]
98
Memory and Computation Time
• Recognition time grows linearly with the number of Trees/Ferns and the number of classes.
• Recognition time grows linearly with the depth of the Trees/Ferns (the number of tests per structure, i.e. the logarithm of the number of leaves).
• Memory grows linearly with the number of Trees/Ferns and the number of classes.
• Memory grows exponentially with the depth of Trees/Ferns.
• Increasing the depth may result in overfitting.
• Increasing the number of Trees/Ferns (usually) improves recognition.
99
Influence of the Number of Ferns
[Plot: recognition rate as a function of the number of structures, for Ferns with product, RT (with random tests) with product, Ferns with average, and RT (with random tests) with average.]
Increasing the number of Ferns/Trees improves the recognition rate, but increases the computation time and memory.
100
Number of Ferns / Number of Leaves / Memory / Computation Time
[Plots: recognition rate and computation time as functions of the fern size and of the number of ferns.]
101
Conclusions on Randomized Trees and Ferns
• Simple to implement, Ferns even simpler;
• Both very fast, but dumb: need a lot of training examples to learn.
• Use a lot of memory to store the posterior distributions in the leaves.
102
We now have correspondences between a reference image of the object and the input image:
Some correspondences are correct, some are not.
We can estimate the homography between the 2 images by applying RANSAC on subsets of 4 correspondences.
103
Computing a Homography from Point Correspondences by Solving a Linear System
With $m = [u, v, 1]^\top$ and $\tilde{m}' = H\,m = [k u', k v', k]^\top$ (defined up to the scale factor k), each correspondence gives two linear equations:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
104
Computing a Homography from Point Correspondences by Solving a Linear System
Expanding $\tilde{m}' = H\,m$ and dividing by the third coordinate:
$$u' = \frac{H_{11}u + H_{12}v + H_{13}}{H_{31}u + H_{32}v + H_{33}}, \qquad v' = \frac{H_{21}u + H_{22}v + H_{23}}{H_{31}u + H_{32}v + H_{33}}$$
105
Computing a Homography from Point Correspondences by Solving a Linear System
Multiplying out the denominators turns each correspondence into two linear equations:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Using four correspondences:
$$B\,X = 0_8 \quad \text{with} \quad X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$$
106
How to Solve this Linear System ?
• X is a null vector of B;
• In practice: take the eigenvector of B^T B corresponding to the smallest eigenvalue (equivalently, the right singular vector of B associated with the smallest singular value).
$$B\,X = 0_8 \quad \text{with} \quad X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$$
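A minimal sketch of this solution using Eigen (assuming B has already been filled with the two rows per correspondence shown above); the right singular vector associated with the smallest singular value plays the role of the null eigenvector:

#include <Eigen/Dense>

// Solve B X = 0 (B is 8x9 for four correspondences, or 2Nx9 in general) in the
// least-squares sense: X is the right singular vector of B associated with the
// smallest singular value. Reshape X into the 3x3 homography H.
Eigen::Matrix3d homographyFromSystem(const Eigen::MatrixXd &B)
{
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(B, Eigen::ComputeFullV);
    Eigen::VectorXd X = svd.matrixV().col(8);   // last column: smallest singular value
    Eigen::Matrix3d H;
    H << X(0), X(1), X(2),
         X(3), X(4), X(5),
         X(6), X(7), X(8);
    return H;
}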
107
Computing a Homography from Point Correspondences with a Non-Linear Optimization
• Non-linear least-squares minimization: minimize a physically meaningful error (the reprojection error, in pixels);
• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).
$$\min_{R,T} \sum_i \mathrm{dist}^2\!\left(H_{R,T}\,m_i,\; m'_i\right)$$
108
Numerical Optimization
Start from an initial guess p0. p0 can be taken randomly, but should be as close as possible to the global minimum:
- the pose computed at time t-1;
- the pose predicted from the pose computed at time t-1 and a motion model;
- ...
109
Numerical Optimization
General methods:
• Gradient descent / Steepest descent;
• Conjugate gradient;
• ...
Non-linear least-squares optimization:
• Gauss-Newton;
• Levenberg-Marquardt;
• ...
110
Numerical Optimization
We want to find p that minimizes:
$$E(p) = \sum_i \mathrm{dist}^2\!\left(H_{R(p),T(p)}\,m_i,\; m'_i\right) = \left\| f(p) - b \right\|^2$$
where
• p is a vector of parameters that define the camera pose (translation vector + parameters of the rotation matrix);
• b is a vector made of the measurements (here the m'_i): $b = \left(u(m'_1),\; v(m'_1),\; \ldots\right)^\top$;
• f is the function that relates the camera pose to these measurements: $f(p) = \left(u(H_{R(p),T(p)}\,m_1),\; v(H_{R(p),T(p)}\,m_1),\; \ldots\right)^\top$.
111
Gradient Descent / Steepest Descent
$$p_{i+1} = p_i - \lambda \nabla E(p_i)$$
$$E(p_i) = \left\| f(p_i) - b \right\|^2 = \left( f(p_i) - b \right)^\top \left( f(p_i) - b \right) \;\rightarrow\; \nabla E(p_i) = 2\,J^\top \left( f(p_i) - b \right)$$
with J the Jacobian matrix of f, computed at p_i.
Weaknesses:
- How to choose λ?
- Needs a lot of iterations in long and narrow valleys.
112
The Gauss-Newton and the Levenberg-Marquardt algorithms
But first, the linear least-squares case: we want to minimize $E(p) = \| f(p) - b \|^2$. If the function f is linear, i.e. f(p) = Ap, then p can be estimated as:
$$p = A^{+} b$$
where $A^{+}$ is the pseudo-inverse of A: $A^{+} = (A^\top A)^{-1} A^\top$.
113
Non-Linear Least-Squares: The Gauss-Newton Algorithm
Iteration steps: $p_{i+1} = p_i + \Delta_i$, where $\Delta_i$ is chosen to minimize the residual $\| f(p_{i+1}) - b \|^2$. It is computed by approximating f to the first order:
$$\begin{aligned} \Delta_i &= \arg\min_{\Delta} \left\| f(p_i + \Delta) - b \right\|^2 \\ &= \arg\min_{\Delta} \left\| f(p_i) + J\Delta - b \right\|^2 \quad \text{(first-order approximation: } f(p_i + \Delta) \approx f(p_i) + J\Delta\text{)} \\ &= \arg\min_{\Delta} \left\| \varepsilon_i + J\Delta \right\|^2 \quad \text{where } \varepsilon_i = f(p_i) - b \text{ is the residual at iteration } i \end{aligned}$$
$\Delta_i$ is the solution of the system $J\Delta = -\varepsilon_i$ in the least-squares sense: $\Delta_i = -J^{+}\varepsilon_i$, where $J^{+}$ is the pseudo-inverse of J.
114
Non-Linear Least-Squares: The Levenberg-Marquardt Algorithm
In the Gauss-Newton algorithm:
$$\Delta_i = -\left( J^\top J \right)^{-1} J^\top \varepsilon_i$$
In the Levenberg-Marquardt algorithm:
$$\Delta_i = -\left( J^\top J + \lambda I \right)^{-1} J^\top \varepsilon_i$$
Levenberg-Marquardt algorithm:
0. Initialize λ with a small value: λ = 0.001.
1. Compute ∆_i and E(p_i + ∆_i).
2. If E(p_i + ∆_i) > E(p_i): λ ← 10 λ and go back to 1 [happens when the linear approximation of f is too coarse].
3. If E(p_i + ∆_i) < E(p_i): λ ← λ / 10, p_{i+1} ← p_i + ∆_i and go back to 1.
Once converged, set λ ← 0 and continue until convergence.
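A minimal sketch of this loop with Eigen (f, its Jacobian, and the measurement vector b are assumed to be provided by the caller; the stopping criterion is deliberately simplistic):

#include <Eigen/Dense>
#include <functional>

// One possible Levenberg-Marquardt loop for E(p) = ||f(p) - b||^2.
// f(p) and its Jacobian J(p) are supplied as callables; lambda follows the
// update rule of the slides (x10 on failure, /10 on success).
Eigen::VectorXd levenbergMarquardt(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)> &jac,
    const Eigen::VectorXd &b, Eigen::VectorXd p, int maxIter = 100)
{
    double lambda = 0.001;
    double E = (f(p) - b).squaredNorm();
    for (int it = 0; it < maxIter; ++it) {
        Eigen::VectorXd eps = f(p) - b;                  // residual
        Eigen::MatrixXd J   = jac(p);
        Eigen::MatrixXd A   = J.transpose() * J
                            + lambda * Eigen::MatrixXd::Identity(p.size(), p.size());
        Eigen::VectorXd delta = -A.ldlt().solve(J.transpose() * eps);
        double Enew = (f(p + delta) - b).squaredNorm();
        if (Enew < E) { p += delta; E = Enew; lambda /= 10.0; }  // accept the step
        else          { lambda *= 10.0; }                        // reject, damp more
    }
    return p;
}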
115
Non-Linear Least-Squares: the Levenberg-Marquardt Algorithm
• When λ is small, LM behaves similarly to the Gauss-Newton algorithm.
• When λ becomes large, LM behaves similarly to a steepest descent, which guarantees convergence.
$$\Delta_i = -\left( J^\top J + \lambda I \right)^{-1} J^\top \varepsilon_i$$
116
Another Way to Refine the Pose: Template Matching
117
Global region tracking by minimizing a dissimilarity measure between the template and the warped input image (the sum of squared differences used below):
• Useful for objects difficult to model using local features;
• Accurate.
Template T
Input Image Ip
118
Lucas-Kanade Algorithm
$$\min_p \sum_j \left( W(I, p)[m_j] - T[m_j] \right)^2$$
Gauss-Newton step:
$$\Delta_i = J_p^{+} \cdot \varepsilon_{p,I}$$
where $J_p^{+}$ is the pseudo-inverse of the Jacobian of W(I, p) evaluated at p and the $m_j$, and
$$\varepsilon_{p,I} = \left( \ldots,\; T[m_j] - W(I, p)[m_j],\; \ldots \right)^\top$$
119
Lucas-Kanade Algorithm
Computing J and J⁺ is computationally expensive.
120
Inverse Compositional Algorithm [Baker et al., IJCV'03]
$$p_i = p_{i-1} + \mathrm{d}p_i, \qquad \mathrm{d}p_i = J_{p=0}^{+}\; \varepsilon_{p=0,\,I}$$
$J_{p=0}$ is a constant matrix and can therefore be precomputed!
121
ESM (Efficient Second-order Method)
(1) I = T + J_{p=0} dp + dp^T H_{p=0} dp   [second-order Taylor expansion]
(2) J_{p=dp} = J_{p=0} + 2 dp^T H_{p=0}   [derivative of (1) with respect to p]
(3) dp^T H_{p=0} = ½ (J_{p=dp} - J_{p=0})   [from Equation (2)]
(4) I = T + (J_{p=0} + ½ (J_{p=dp} - J_{p=0})) dp   [by injecting (3) into (1)]
(5) dp = [½ (J_{p=0} + J_{p=dp})]⁺ (I - T)   [from Equation (4)]
Like Gauss-Newton, but with J_{p=0} replaced by ½ (J_{p=0} + J_{p=dp}). This requires computing J_{p=dp} and a pseudo-inverse at each iteration, but needs far fewer iterations.
122
BRIEF [ECCV'10]
A very fast feature point descriptor.
123
Remark
• Moving legacy code to new CPUs does not result in a speed-up anymore;
• Should consider the features of new platforms: parallelism (multi-cores, GPU), locality, ...
124
[Figure: the BRIEF descriptor is a string of bits (1, 1, 0, ..., 0, 1), each bit obtained by comparing the intensities of a pair of pixels in the patch after Gaussian smoothing.]
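A minimal C++ sketch of computing such a bit string (the image is assumed to be already smoothed; the pair locations are placeholders, drawn once and reused for every keypoint):

#include <cstdint>
#include <vector>

struct PixelPair { int du1, dv1, du2, dv2; };

// Compute a 64-bit BRIEF-like descriptor at keypoint (u, v) in the smoothed
// image I (row-major, width pixels per row). Each bit is one intensity comparison.
uint64_t briefDescriptor(const unsigned char *I, int width,
                         int u, int v, const std::vector<PixelPair> &pairs)
{
    uint64_t desc = 0;
    for (size_t i = 0; i < pairs.size() && i < 64; ++i) {
        const PixelPair &p = pairs[i];
        unsigned char a = I[(v + p.dv1) * width + (u + p.du1)];
        unsigned char b = I[(v + p.dv2) * width + (u + p.du2)];
        if (a < b) desc |= (uint64_t(1) << i);   // set bit i
    }
    return desc;
}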
125
[Same figure as above.]
Alternatively, using integral images:
126
Integral Images
$$\mathrm{IntegralImage}(u, v) = \sum_{i=1..u} \; \sum_{j=1..v} \mathrm{Image}(i, j)$$
127
How to Use Integral Images
The sum of the image intensities over any rectangular region is obtained in constant time from the four corners of the integral image (one addition and two subtractions).
[Viola & Jones, IJCV’01]
Features computed in constant time
129
Computing Integral Images
IntegralImage[u][v] = IntegralImage[u][v-1] + LineBuffer[u] + Image[u][v]
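A minimal C++ sketch of one possible completion of this recurrence, with a running row sum playing the role of the line buffer (0-based indices, simplistic border handling), followed by the constant-time box sum from the previous slide:

#include <vector>

// Build the integral image: II(v, u) = sum of Image over the rectangle
// [0..u] x [0..v].
void buildIntegralImage(const std::vector<std::vector<int>> &img,
                        std::vector<std::vector<long long>> &II)
{
    int h = img.size(), w = img[0].size();
    II.assign(h, std::vector<long long>(w, 0));
    for (int v = 0; v < h; ++v) {
        long long rowSum = 0;                       // line buffer for row v
        for (int u = 0; u < w; ++u) {
            rowSum += img[v][u];
            II[v][u] = rowSum + (v > 0 ? II[v - 1][u] : 0);
        }
    }
}

// Sum of the image over the rectangle [u0..u1] x [v0..v1], in constant time,
// using the four corners of the integral image.
long long boxSum(const std::vector<std::vector<long long>> &II,
                 int u0, int v0, int u1, int v1)
{
    long long A = II[v1][u1];
    long long B = (u0 > 0) ? II[v1][u0 - 1] : 0;
    long long C = (v0 > 0) ? II[v0 - 1][u1] : 0;
    long long D = (u0 > 0 && v0 > 0) ? II[v0 - 1][u0 - 1] : 0;
    return A - B - C + D;
}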
130
Evaluation
131
Evaluation
132
Computation Speed
For BRIEF, most of the time is spent in Gaussian smoothing.
133
Matching Speed
distance(BRIEF descriptor 1, BRIEF descriptor 2)
= Hamming distance(BRIEF descriptor 1, BRIEF descriptor 2)
= number of bits set to 1(BRIEF descriptor 1 xor BRIEF descriptor 2)
= popcount(BRIEF descriptor 1 xor BRIEF descriptor 2)
10- to 15-fold speed increase on Intel's Bloomfield (SSE 4.2) and AMD's Phenom (SSE 4a)
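A minimal C++ sketch of this matching distance for a 256-bit descriptor stored as four 64-bit words (__builtin_popcountll is the GCC/Clang intrinsic, which typically compiles to the POPCNT instruction on SSE 4.2 / SSE4a hardware):

#include <cstdint>

// Hamming distance between two BRIEF descriptors = popcount of their XOR.
int hammingDistance(const uint64_t a[4], const uint64_t b[4])
{
    int d = 0;
    for (int i = 0; i < 4; ++i)
        d += __builtin_popcountll(a[i] ^ b[i]);
    return d;
}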
134
Matching Speed
135
Picking the Locations
• uniform distribution;
• Gaussian distribution;
• Gaussian distribution for location and length;
• uniform distribution on polar coordinates;
• census transform locations.
136
Picking the Locations
• uniform distribution;
• Gaussian distribution;
• Gaussian distribution for location and length;
• uniform distribution on polar coordinates;
• census transform locations.
137
Rotation and Scale Invariance
138
Rotation and Scale Invariance
Duplicate the descriptors: 18 rotations × 3 scales.
...
...
...
139
Code released under the GPL on the CVLab website.
140
DOT [CVPR'10]
A dense descriptor for object detection.
Joint work with Stefan Hinterstoisser (TU Munich)
141
Template matching with an efficient representation of the images and the templates.
Object detection with a sliding window and template matching.
142
143
144
Initial Similarity Measure
145
Making the Similarity Measure Robust to Small Motions
146
Downsampling
147
Ignoring the Dependencies between the Regions...
148
Lists of Dominant Orientations
149
Fast Computation with Bitwise Operations
0000110000010000
150
Code available under the LGPL license at http://campar.in.tum.de/personal/hinterst/index/
151
New Method, LINE [PAMI, under revision]
152
Initial Similarity Measure
Previous measure:
$$E_{\mathrm{Steger}}(I, O, c) = \sum_r \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, c + r) \right) \right|$$
153
Making the Similarity Measure Robust to Small Motions
$$E_{\mathrm{Steger}}(I, O, c) = \sum_r \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, c + r) \right) \right|$$
$$E(I, O, c) = \sum_r \left( \max_{t \in \mathrm{region}(c + r)} \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, t) \right) \right| \right)$$
Avoiding Recomputing the max Operator
1. Spread the gradients.
155
2. Precompute response maps.
Because
• we consider only a discrete set of gradient directions, and
• we do not consider the gradient norms,
we can precompute a response for each region in the image and each gradient direction in the template.
156
Optimized Version
1. The sets of orientations in the image regions are encoded with a binary representation (e.g. 11010):
157
Optimized Version
2. The binary representation is used as an index into lookup tables containing the precomputed responses for each gradient direction in the template:
158
Avoiding Cache Misses
The response maps are re-arranged into linear memories:
159
Using the Linear Memories
The similarity measure can be computed for all the image locations by summing linear memories, shifted by an offset that depends on the template.
160
Advantage of Linearizing the Memory
[Plot: speed-up factor obtained by linearizing the memory.]
161
DOT [CVPR'10] / LINE
162
LINE-MOD [Hinterstoisser et al, ICCV’11]
Extension to the Kinect: the templates combine the image and the depth map.
163
thanks!
164