Real-Time Computer Vision
Microsoft Computer Vision School
Vincent Lepetit - CVLab - EPFL (Lausanne, Switzerland)
1
demo
2
applications
...
3
• How the demo works (including Randomized Trees);
• More recent work.
4
Background
• 3D world to 2D images (projection matrix, internal parameters, external parameters, homography, ...);
• Robust estimation (non-linear least-squares, RANSAC, robust estimators, ...);
• Feature point matching (affine region detectors, SIFT, ...).
5
From the 3D World to a 2D Image
[Figure: a 3D point M in the world coordinate system and its image point m.]
What is the relation between the 3D coordinates of a point M and its corresponding point m in the image captured by the camera?
6
Perspective Projection
[Figure: the camera center C, a 3D point M in the world coordinate system, and its image point m.]
The image formation is modeled as a perspective projection, which is realistic for standard cameras:
the rays passing through a 3D point M and its corresponding image point m all intersect at a single point C, the camera center.
7
Expressing M in the Camera Coordinate System
[Figure: the camera coordinate system (X, Y, Z) centered at C, a 3D point M with camera coordinates Mcam, and its image point m.]
Step 1: Express the coordinates of M in the camera coordinate system as Mcam.
This transformation corresponds to a Euclidean displacement (a rotation plus a translation):
$$M_{\mathrm{cam}} = R\,M + T$$
where R is a 3×3 rotation matrix and T is a 3-vector.
8
Homogeneous Coordinates
Let's replace M by the 4-vector of homogeneous coordinates M̃: just add a 1 as the fourth coordinate.
Now, the Euclidean displacement can be expressed as a linear transformation instead of an affine one:
[Figure: the camera coordinate system (X, Y, Z) centered at C, the point Mcam and its image point m.]
$$M = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \;\rightarrow\; \tilde{M} = \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
$$M_{\mathrm{cam}} = R\,M + T \;\rightarrow\; \begin{pmatrix} X_{\mathrm{cam}} \\ Y_{\mathrm{cam}} \\ Z_{\mathrm{cam}} \end{pmatrix} = R \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T \;\rightarrow\; \begin{pmatrix} X_{\mathrm{cam}} \\ Y_{\mathrm{cam}} \\ Z_{\mathrm{cam}} \end{pmatrix} = \left( R \,|\, T \right) \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \;\rightarrow\; M_{\mathrm{cam}} = \left( R \,|\, T \right) \tilde{M}$$
(R | T) is a 3×4 matrix.
9
Projection
Computation of the coordinates of m in the image plane, from Mcam (expressed in the camera coordinate system): simply use Thales' theorem (similar triangles):
[Figure: side view of the camera, with camera center C, focal length f, the point Mcam at depth Z, and its projection mX on the image plane.]
$$\frac{m_X}{f} = \frac{X}{Z} \;\rightarrow\; m_X = f\,\frac{X}{Z}$$
10
From Projection to Image
Coordinates of m in pixels?
[Figure: the image plane with the image coordinate system (u, v), its origin offset (u0, v0), and the pixel dimensions (1/ku, 1/kv).]
$$m_X = f\,\frac{X}{Z}, \qquad m_Y = f\,\frac{Y}{Z}$$
$$m_u = u_0 + k_u\,m_X, \qquad m_v = v_0 + k_v\,m_Y$$
11
Putting the perspective projection and the transformation into pixel coordinates together, in matrix form:
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$
where $(u, v, w)^\top$ defines m in homogeneous coordinates, so that:
$$m_u = \frac{u}{w} = u_0 + k_u f\,\frac{X}{Z}, \qquad m_v = \frac{v}{w} = v_0 + k_v f\,\frac{Y}{Z}$$
12
The Full Transformation
The two transformations are chained to form the full transformation from a 3D point in the world coordinate system to its projection in the image:
The product of the internal calibration matrix and the external calibration matrix is a 3x4 matrix called the "projection matrix".
The projection matrix is defined up to a scale factor.
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
The 3×4 matrix P is the projection matrix.
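As a concrete illustration, a minimal C++ sketch of this full transformation (the matrix and point values are assumed to be supplied by the caller): multiply by R and T, multiply by the internal calibration matrix, then divide by the third homogeneous coordinate.

// Minimal sketch: project a 3D point M (world coordinates) into pixel
// coordinates with internal parameters K = [ku*f 0 u0; 0 kv*f v0; 0 0 1]
// and external parameters R (3x3 rotation) and T (3-vector).
void project(const double K[3][3], const double R[3][3], const double T[3],
             const double M[3], double &mu, double &mv)
{
    // Mcam = R * M + T
    double Mcam[3];
    for (int i = 0; i < 3; ++i)
        Mcam[i] = R[i][0]*M[0] + R[i][1]*M[1] + R[i][2]*M[2] + T[i];

    // (u, v, w)^T = K * Mcam : homogeneous image coordinates
    double u = K[0][0]*Mcam[0] + K[0][1]*Mcam[1] + K[0][2]*Mcam[2];
    double v = K[1][0]*Mcam[0] + K[1][1]*Mcam[1] + K[1][2]*Mcam[2];
    double w = K[2][0]*Mcam[0] + K[2][1]*Mcam[1] + K[2][2]*Mcam[2];

    // Pixel coordinates: divide by the third homogeneous coordinate
    mu = u / w;
    mv = v / w;
}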
13
The Full Transformation
R, T, and the products kuf and kvf can be extracted from the projection matrix.
$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$
where P is the projection matrix.
14
Homography
For a 3D point lying on the plane Z = 0:
$$\tilde{m} = P\tilde{M} = \begin{pmatrix} P_1 & P_2 & P_3 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_1 & P_2 & P_3 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} P_1 & P_2 & P_4 \end{pmatrix} \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} = H_{3\times 3}\, m$$
where the $P_i$ are the columns of the projection matrix and $H_{3\times 3}$ is a homography.
15
Computing a Projection Matrix or a Homography from Point Correspondences
by solving a linear system
For each correspondence between $m = [u, v, 1]^\top$ and $m' = [u', v', 1]^\top$ with $m' = Hm$, we obtain two linear equations in the entries of H:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
16
Computing a Projection Matrix or a Homography from Point Correspondences with a Non-Linear Optimization
• Non-linear least-squares minimization: minimize a physically meaningful error (the reprojection error, in pixels);
• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).
$$\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) \qquad\qquad \min_{R,T} \sum_i \mathrm{dist}^2\!\left(H_{R,T}\,m_i,\; m'_i\right)$$
17
A Look at the Reprojection Error
[Plot: the reprojection error for a 1D camera under 2D translation; 100 "3D points" taken at random in [400; 1000] × [-500; +500]; the true camera position is at (0, 0).]
18
Gaussian Noise on the Projections
White cross: true camera position; black cross: global minimum of the objective function.
In that case, the global minimum of the objective function is close to the true camera pose.
19
What if there are Outliers?
[Figure: 3D points M1..M4 and their projections m1..m4 through the camera center C; one of the measurements is incorrect (an outlier).]
20
Gaussian Noise on the Projections + 20% outliers
White cross: true camera position; black cross: global minimum of the objective function.
The global minimum is now far from the true camera pose.
21
What Happened ?
The errors on the 2D point locations m'_i are assumed to be independent and Gaussian (Normal), with identical covariance matrices σI;
This assumption is violated when m'_i is an outlier.
Bayesian interpretation:
$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) = \arg\max_{R,T} \prod_i \mathcal{N}\!\left(m'_i;\; P_{R,T}\,M_i,\; \sigma I\right)$$
22
Robust Estimation
Idea: Replace the Normal distribution by a more suitable distribution or, equivalently, replace the least-squares estimator by a "robust estimator" or "M-estimator":
$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\!\left(P_{R,T}\,M_i,\; m'_i\right) \;\rightarrow\; \arg\min_{R,T} \sum_i \rho\!\left(\mathrm{dist}\!\left(P_{R,T}\,M_i,\; m'_i\right)\right)$$
23
Example of an M-estimator: The Tukey Estimator
[Plot: the Tukey ρ(x) function compared to the least-squares cost x².]
The Tukey estimator assumes the measurements follow a distribution that is a mixture of:
• a Normal distribution, for the inliers,
• a uniform distribution, for the outliers.
$$\rho(x) = \begin{cases} \dfrac{c^2}{6}\left(1 - \left(1 - \left(\dfrac{x}{c}\right)^2\right)^3\right) & \text{if } |x| \le c \\[2mm] \dfrac{c^2}{6} & \text{if } |x| > c \end{cases}$$
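A minimal C++ sketch of the Tukey ρ function as defined above (the threshold c is a tuning parameter to be chosen by the user):

#include <cmath>

// Tukey robust estimator rho(x); c is the inlier/outlier threshold.
double tukey_rho(double x, double c)
{
    if (std::fabs(x) <= c) {
        double r = 1.0 - (x / c) * (x / c);
        return (c * c / 6.0) * (1.0 - r * r * r);
    }
    return c * c / 6.0;   // constant cost for outliers
}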
24
[Figure: a Normal distribution (inliers) plus a uniform distribution (outliers) gives the mixture; taking -log(.) of the Normal distribution gives the least-squares estimator, taking -log(.) of the mixture gives the Tukey estimator.]
25
Gaussian Noise on the Projections + 20% Outliers + Tukey Estimator
White cross: true camera position; black cross: global minimum of the objective function.
The global minimum is very close to the true camera pose. BUT:
- there are local minima;
- the objective function is flat where all the correspondences are considered outliers.
26
Gaussian Noise on the Projections + 50% outliers + Tukey estimator
Even more local minima. Numerical optimization can get trapped in a local minimum.
27
RANSAC
28
How to Optimize ?
Idea: sampling the space of solutions (the camera pose space here):
29
How to Optimize ?
Idea: sampling the space of solutions:
+ Numerical Optimization from the best sampled pose.
Problem: Exhaustive regular sampling is too expensive in 6 dimensions.
Can we do a smarter sampling?
30
RANSAC
RANSAC: RANdom SAmple Consensus
Line fitting: the "throwing out the worst residual" heuristic can fail (example from the original paper [Fischler81]):
outlier
final least-squares solution
Ideal line
31
RANSAC
As before, we could do a regular sampling, but it would not be optimal:
Ideal line
32
Idea:
Generate hypotheses from subsets of the measurements. If a subset contains no gross errors, the estimated parameters (the hypothesis) are close to the true ones.
Take several subsets at random, and retain the best one.
Ideal line
33
The quality of a hypothesis is evaluated by the number of measurements that lie "close enough" to the predicted line.
We need to choose a threshold T to decide if a measurement is "close enough".
RANSAC returns the best hypothesis, i.e. the hypothesis with the largest number of inliers:
$$\sum_i \begin{cases} 1 & \text{if } \mathrm{dist}\!\left(m_i, \mathrm{line}(p)\right) \le T \\ 0 & \text{if } \mathrm{dist}\!\left(m_i, \mathrm{line}(p)\right) > T \end{cases}$$
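A minimal RANSAC sketch for the line-fitting example above (the helper names lineFromTwoPoints and distToLine are hypothetical; the threshold T and the number of iterations are tuning parameters):

#include <cstdlib>
#include <vector>
#include <cmath>

struct Point { double x, y; };
struct Line  { double a, b, c; };   // a*x + b*y + c = 0, with (a, b) normalized

// Hypothetical helper: the line through two points.
Line lineFromTwoPoints(const Point &p, const Point &q)
{
    Line l;
    l.a = q.y - p.y;
    l.b = p.x - q.x;
    double n = std::sqrt(l.a * l.a + l.b * l.b);
    l.a /= n; l.b /= n;
    l.c = -(l.a * p.x + l.b * p.y);
    return l;
}

double distToLine(const Point &p, const Line &l)
{
    return std::fabs(l.a * p.x + l.b * p.y + l.c);
}

// RANSAC: generate hypotheses from random minimal subsets (2 points for a
// line), keep the hypothesis with the largest number of inliers.
Line ransacLine(const std::vector<Point> &pts, double T, int iterations)
{
    Line best = {};
    int bestInliers = -1;
    for (int it = 0; it < iterations; ++it) {
        int i = std::rand() % pts.size();
        int j = std::rand() % pts.size();
        if (i == j) continue;
        Line hyp = lineFromTwoPoints(pts[i], pts[j]);
        int inliers = 0;
        for (const Point &p : pts)
            if (distToLine(p, hyp) <= T) ++inliers;   // "close enough" test
        if (inliers > bestInliers) { bestInliers = inliers; best = hyp; }
    }
    return best;   // should then be refined using all its inliers
}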
34
RANSAC for Homographies
To apply RANSAC to homography estimation, we need a way to compute a homography from a subset of measurements:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Since RANSAC only provides a solution estimated from a limited amount of data, it must be followed by a robust minimization to refine the solution.
35
How to Get the Correspondences ?
• Extract feature points / keypoints / regions (Harris corner detector, extrema of the Laplacian, affine region detectors, ...);
• Standard approach: match them based on Euclidean distances between descriptors such as SIFT, SURF, ...
36
Affine Region Detectors
Hessian-Affine detector MSER detector
37
Affine Normalization
Warp by M1^(1/2); warp by M2^(1/2).
We still have to correct for the orientation!
38
Select Canonical Orientation
• Create a histogram of local gradient directions (over [0, 2π]) computed over the image patch;
• Each gradient contributes by its norm, weighted by its distance to the patch center;
• Assign the canonical orientation at the peak of the smoothed histogram.
39
Select Canonical Orientation
40
Description Vector
?
...
41
SIFT Description Vector
Made of local histograms of gradients.
In practice: 8 orientations × 4 × 4 histograms = a 128-dimensional vector, normalized to be robust to light changes.
...
42
Matching Regions
43
Matching: Approximate Nearest Neighbour
Best-Bin-First: approximate nearest-neighbour search in a k-d tree.
44
Keypoint Matching
The standard approach is a particular case of classification: pre-processing (to make the actual classification easier), followed by nearest-neighbor classification (search in the database).
Idea: let's try another classification method!
45
One Class per Keypoint
One class per keypoint: the set of the keypoint's possible appearances under various perspectives, lighting conditions, noise...
[Figure: example patches for class 1 and class 2.]
46
Training phase: the classifier is built from example patches of each class (class 1, class 2, ...).
Run-Time: the classifier assigns a class to each incoming patch.
47
Which Classifier?
We want a classifier that:
• can handle many classes;
• is very fast;
• has reasonable recognition performance (a very high recognition rate is not a necessary requirement).
48
Which Classifier?
• Randomized Trees [Amit & Geman, 1997];
• Random Forests [Breiman, 2001].
49
An (Ideal) Single Tree
binary test
binary test
binary test
class #
50
How to Build the Tree ?
binary test ?
training set
51
binary test ?
training set
The test is found by minimizing the entropy of the class distribution after the test, which splits the training set S into Sleft and Sright:
$$\arg\min_{\mathrm{test}} \; \frac{|S_{\mathrm{left}}|}{|S|}\,\mathrm{Entropy}(S_{\mathrm{left}}) + \frac{|S_{\mathrm{right}}|}{|S|}\,\mathrm{Entropy}(S_{\mathrm{right}})$$
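A minimal C++ sketch of this criterion, assuming the per-class sample counts of the two subsets created by a candidate test are available:

#include <cmath>
#include <vector>

// Shannon entropy of a set, given the number of training samples per class.
double entropy(const std::vector<int> &classCounts)
{
    int total = 0;
    for (int c : classCounts) total += c;
    if (total == 0) return 0.0;
    double H = 0.0;
    for (int c : classCounts) {
        if (c == 0) continue;
        double p = double(c) / total;
        H -= p * std::log2(p);
    }
    return H;
}

// Score of a candidate test: |Sleft|/|S| * H(Sleft) + |Sright|/|S| * H(Sright).
// The selected test is the one that minimizes this score.
double testScore(const std::vector<int> &leftCounts,
                 const std::vector<int> &rightCounts)
{
    int nl = 0, nr = 0;
    for (int c : leftCounts)  nl += c;
    for (int c : rightCounts) nr += c;
    double n = nl + nr;
    if (n == 0.0) return 0.0;
    return (nl / n) * entropy(leftCounts) + (nr / n) * entropy(rightCounts);
}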
52
binary test
training set
S
Problem: we quickly run out of training samples for the deeper tests.
53
Idea: Use Several Sub-Optimal Trees
Each tree is trained with a random subset of the training set.
54
Idea: Use Several Sub-Optimal Trees
The leaves contain the probabilities over the classes, computed from the training set.
55
Classification with Several Sub-Optimal Trees
The test sample is dropped into each tree, and the probabilities stored in the leaves it reaches are averaged; for three trees:
$$\hat{P}(c \mid \text{sample}) = \frac{1}{3}\left(P_1(c) + P_2(c) + P_3(c)\right)$$
56
Visual Interpretation
Each tree partitions the space in a different way and computes the probability of each class for each cell of the partition:
57
Visual Interpretation
Combining the trees gives a finer partition with a better estimate of the class probabilities:
58
For Patches
Possible tests: compare the intensities of two pixels around the keypoint after Gaussian smoothing:
• Very efficient to compute;
• Invariant to any light change modeled by a monotonically increasing function.
$$f_i(m) = \begin{cases} 1 & \text{if } I(m + \mathrm{d}m_{i,1}) \le I(m + \mathrm{d}m_{i,2}) \\ 0 & \text{otherwise} \end{cases}$$
where I is the image after Gaussian smoothing.
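A minimal C++ sketch of one such test on a grayscale patch (the smoothing is assumed to have been applied beforehand; row-major indexing and the pixel offsets are placeholders):

// One binary test: compare the intensities of two pixels at fixed offsets
// (du, dv) around the keypoint m = (u, v), in the smoothed image I.
int binaryTest(const unsigned char *I, int width,
               int u, int v,
               int du1, int dv1, int du2, int dv2)
{
    unsigned char a = I[(v + dv1) * width + (u + du1)];
    unsigned char b = I[(v + dv2) * width + (u + du2)];
    return (a <= b) ? 1 : 0;
}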
59
Results
60
Randomized Trees (and Random Ferns) applied to image patches are becoming a powerful tool for Computer Vision.
61
[Shotton et al, CVPR’11]
Used to infer body parts in the Kinect body tracking system.
The tests rely on the depth map.
62
Tests in [Shotton et al, CVPR'11]
Classes are the body parts. The goal is to label each pixel with the label of the part it belongs to.
The tests compare the depth of two pixels around the considered pixel.
The displacements are normalized by the depth of the considered pixel for invariance:
$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{depth}\!\left(m + \dfrac{\mathrm{d}m_1}{\mathrm{depth}(m)}\right) \le \mathrm{depth}\!\left(m + \dfrac{\mathrm{d}m_2}{\mathrm{depth}(m)}\right) \\ 0 & \text{otherwise} \end{cases}$$
63
3D Pose Estimation
Mean-Shift is used to find the joint locations from the body parts.
64
Training
“Training 3 trees to depth 20 from 1 million images takes about 1 day on a 1000 core cluster” [Shotton et al, CVPR’11]
Most of the training data is synthetic:
65
A Subtree
Average of the patches that reach this node.
66
[Gall and Lempitsky, CVPR'09; Barinova et al, CVPR'10]
Hough Forests for Object Detection:
• Random Forests are used to make each patch vote for the object centroid;
• The tests compare the output of filters and histograms-of-gradients between 2 pixels;
• The leaves contain the displacements toward the object center.
[Figure: each patch votes for the object centroid; the votes from all patches are accumulated, and the final detection is obtained from the accumulated votes.]
67
Tests used in [Gall and Lempitsky, CVPR'09]
Channels: the 3 color channels, the absolute values of the first and second derivatives of the image, and 9 channels from HoG (Histograms-of-Gradients).
$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{channel}_i(m + \mathrm{d}m_1) < \mathrm{channel}_i(m + \mathrm{d}m_2) + \tau \\ 0 & \text{otherwise} \end{cases}$$
68
[Bosch et al, ICCV'07]
Image Classification using Random Forests and Ferns [Bosch et al, ICCV'07]. Uses a sliding window to detect objects. Much faster than SVMs, with similar recognition performance.
69
[Bosch et al, ICCV’07]
Tests:
$$f_i(m) = \begin{cases} 1 & \text{if } \mathbf{n}^\top \mathbf{x}_m + b \le 0 \\ 0 & \text{otherwise} \end{cases}$$
where n and b are a random vector and scalar, and x_m is a vector computed from a Pyramidal Histogram-of-Gradients.
70
[Kalal et al, CVPR'10]
TLD (aka Predator), for Track, Learn, Detect:
• Random Ferns used to speed up detection;
• Trained online: the distributions in the leaves are updated online, using the incoming images.
71
[Kalal et al, CVPR'10]
• Tests: 2-bit binary patterns;
• Trained online: the distributions in the leaves are updated online, using the incoming images.
72
Random Ferns: A Simplified Tree-Like Classifier
73
For Keypoint Recognition, We Can Use Random Tests!
[Plot: recognition rate as a function of the number of trees, comparing, for 200 keypoints, tests selected by minimizing entropy and tests with random locations.]
74
We can use random tests
• For a small number of classes:
  – we can try several tests, and
  – retain the best one according to some criterion.
75
We can use random tests
• For a small number of classes:
  – we can try several tests, and
  – retain the best one according to some criterion.
• When the number of classes is large:
  – any test does a decent job:
76
Why it is Interesting
• Building the trees takes no time (we still have to estimate the posterior probabilities);
• Allows incremental learning;
• Simplifies the classifier structure.
77
The Tree Structure is not Needed
78
The Tree Structure is not Needed
f1
f2
f3
79
The Tree Structure is not Needed
[Figure: the results of the pixel comparisons f1, f2, f3 (0 or 1) are used directly to index the class label distributions.]
The distributions can be expressed simply, as:
80
We are looking for:
$$\arg\max_i P(C = c_i \mid \mathrm{patch})$$
If the patch can be represented by a set of image features {f_j}:
$$P(C = c_i \mid \mathrm{patch}) = P(C = c_i \mid f_1, f_2, \ldots, f_n, f_{n+1}, \ldots, f_N)$$
which (assuming a uniform prior over the classes) is proportional to $P(f_1, f_2, \ldots, f_N \mid C = c_i)$, but a complete representation of this joint distribution is infeasible.
A naive Bayesian approach ignores the correlations between the features:
$$P(f_1, \ldots, f_N \mid C = c_i) \approx \prod_j P(f_j \mid C = c_i)$$
Compromise: group the features into small groups (the ferns) and only assume independence between the groups:
$$P(f_1, \ldots, f_N \mid C = c_i) \approx \prod_k P(f_{(k-1)n+1}, \ldots, f_{kn} \mid C = c_i)$$
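A minimal C++ sketch of this combination over the ferns, done in log space to avoid numerical underflow (the data layout P[fern][leaf][class], holding the stored conditional probabilities, is a hypothetical choice):

#include <cmath>
#include <vector>

// Semi-naive Bayes combination: the joint probability is approximated by the
// product over the ferns of P(fern observation | class). Working in log space
// avoids underflow when there are many ferns.
int classify(const std::vector<std::vector<std::vector<double>>> &P, // [fern][leaf][class]
             const std::vector<int> &leafIndex,                      // observed leaf per fern
             int numClasses)
{
    int best = -1;
    double bestScore = -1e300;
    for (int c = 0; c < numClasses; ++c) {
        double score = 0.0;                        // log P(f_1..f_N | C = c)
        for (size_t k = 0; k < P.size(); ++k)
            score += std::log(P[k][leafIndex[k]][c]);
        if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;   // argmax over the classes
}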
81
Training
82
Training
83
Training
[Animation over slides 84 to 86: each training patch is dropped through the ferns; the binary test results (e.g. 1, 1, 0 giving index 6, or 1, 0, 1 giving index 5) form the index of a leaf, and the counter of that leaf for the patch's class is incremented.]
84-86
Training
87
Training
88
Training Results
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
89
Training Results
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
90
Recognition
91
Normalization
Normalize:
$$\sum_{(f_1, \ldots, f_n) \in \{0,1\}^n} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$
92
Subtlety with Normalization
$$p_{\mathrm{leaf},\mathrm{class}} = \frac{\mathrm{Number\ of\ samples}(\mathrm{leaf}, \mathrm{class})}{\mathrm{Number\ of\ samples}(\mathrm{class})}$$
is too selective: Number of samples(leaf, class) can be 0 simply because the training set is finite. Instead we use:
$$p_{\mathrm{leaf},\mathrm{class}} = \frac{\mathrm{Number\ of\ samples}(\mathrm{leaf}, \mathrm{class}) + N_{\mathrm{regularization}}}{\mathrm{Number\ of\ samples}(\mathrm{class}) + \mathrm{Number\ of\ leaves} \times N_{\mathrm{regularization}}}$$
This can be done by simply initializing the counters to N_regularization instead of 0.
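A minimal C++ sketch of that initialization (the container layout and names are placeholders):

#include <vector>

// Initialize every (leaf, class) counter to Nregularization instead of 0;
// after training, dividing by the per-class totals directly yields the
// regularized probabilities above.
void initCounters(std::vector<std::vector<double>> &count,  // [leaf][class]
                  int numLeaves, int numClasses, double Nreg)
{
    count.assign(numLeaves, std::vector<double>(numClasses, Nreg));
}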
93
Influence of Nregularization
[Plot: recognition rate as a function of N_regularization (log scale), for the regularized probability p_leaf,class defined above; the 50% level is marked.]
94
Implementation of Feature Point Recognition with Ferns
// H: number of classes, M: number of ferns, S: number of tests per fern.
// K: pointer to the (smoothed) patch data, D: test pixel offsets (2 per test),
// PF: stored leaf distributions, P: output class scores.
for (int i = 0; i < H; i++) P[i] = 0.;              // reset the class scores
for (int k = 0; k < M; k++) {                       // for each fern
    int index = 0, *d = D + k * 2 * S;
    for (int j = 0; j < S; j++) {                   // build the leaf index bit by bit
        index <<= 1;
        if (*(K + d[0]) < *(K + d[1]))              // one pixel intensity comparison
            index++;
        d += 2;
    }
    p = PF + k * shift2 + index * shift1;           // distribution stored in that leaf
    for (int i = 0; i < H; i++) P[i] += p[i];       // accumulate over the ferns
}
• Very simple to implement;
• No need for orientation, perspective, or light correction.
95
Ferns versus SIFT
[Plot: number of inliers for Ferns versus number of inliers for SIFT; each point corresponds to an image from a 1000-frame sequence.]
Ferns are much faster, and sometimes more accurate, but SIFT does not need training.
96
Randomized Trees vs Ferns
Ferns are more discriminant but more sensitive to outliers.
[Plot: recognition rate as a function of the number of structures, for different combination strategies, average (RT) / product (Ferns): Ferns with product, RT (with random tests) with product, Ferns with average, RT (with random tests) with average.]
97
Randomized Trees vs Ferns
Influence of the number of classes:
[Plot: recognition rate as a function of the number of classes, for Ferns with product and Ferns with average.]
98
Memory and Computation Time
• Recognition time grows linearly with the number of Trees/Ferns and the number of classes.
• Recognition time grows linearly with the depth of the Trees/Ferns (the number of tests per structure, i.e. the logarithm of the number of leaves).
• Memory grows linearly with the number of Trees/Ferns and the number of classes.
• Memory grows exponentially with the depth of Trees/Ferns.
• Increasing the depth may result in overfitting.
• Increasing the number of Trees/Ferns (usually) improves recognition.
99
Influence of the Number of Ferns
[Plot: recognition rate as a function of the number of structures, for Ferns with product, RT (with random tests) with product, Ferns with average, and RT (with random tests) with average.]
Increasing the number of Ferns/Trees improves the recognition rate, but increases the computation time and memory.
100
Number of Ferns / Number of Leaves / Memory / Computation Time
[Plots: recognition rate and computation time as functions of the fern size and of the number of ferns.]
101
Conclusions on Randomized Trees and Ferns
• Simple to implement, Ferns even simpler;
• Both very fast, but dumb: need a lot of training examples to learn.
• Use a lot of memory to store the posterior distributions in the leaves.
102
We now have correspondences between a reference image of the object and the input image:
Some correspondences are correct, some are not.
We can estimate the homography between the 2 images by applying RANSAC on subsets of 4 correspondences.
103
Computing a Homography from Point Correspondences by Solving a Linear System
With $m = [u, v, 1]^\top$ and $\tilde{m}' = H\,m = [k u', k v', k]^\top$ (defined up to the scale factor k), each correspondence gives two linear equations:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
104
Computing a Homography from Point Correspondences by Solving a Linear System
Expanding $\tilde{m}' = H\,m$ and dividing by the third coordinate:
$$u' = \frac{H_{11}u + H_{12}v + H_{13}}{H_{31}u + H_{32}v + H_{33}}, \qquad v' = \frac{H_{21}u + H_{22}v + H_{23}}{H_{31}u + H_{32}v + H_{33}}$$
105
Computing a Homography from Point Correspondences by Solving a Linear System
Multiplying out the denominators turns each correspondence into two linear equations:
$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Using four correspondences:
$$B\,X = 0_8 \quad \text{with} \quad X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$$
106
How to Solve this Linear System ?
• X is a null vector of B;
• In practice: take the eigenvector of B^T B corresponding to the smallest eigenvalue (equivalently, the right singular vector of B associated with the smallest singular value).
$$B\,X = 0_8 \quad \text{with} \quad X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$$
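A minimal sketch of this solution using Eigen (assuming B has already been filled with the two rows per correspondence shown above); the right singular vector associated with the smallest singular value plays the role of the null eigenvector:

#include <Eigen/Dense>

// Solve B X = 0 (B is 8x9 for four correspondences, or 2Nx9 in general) in the
// least-squares sense: X is the right singular vector of B associated with the
// smallest singular value. Reshape X into the 3x3 homography H.
Eigen::Matrix3d homographyFromSystem(const Eigen::MatrixXd &B)
{
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(B, Eigen::ComputeFullV);
    Eigen::VectorXd X = svd.matrixV().col(8);   // last column: smallest singular value
    Eigen::Matrix3d H;
    H << X(0), X(1), X(2),
         X(3), X(4), X(5),
         X(6), X(7), X(8);
    return H;
}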
107
Computing a Homography from Point Correspondences with a Non-Linear Optimization
• Non-linear least-squares minimization: minimize a physically meaningful error (the reprojection error, in pixels);
• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).
$$\min_{R,T} \sum_i \mathrm{dist}^2\!\left(H_{R,T}\,m_i,\; m'_i\right)$$
108
Numerical Optimization
Start from an initial guess p0. p0 can be taken randomly, but should be as close as possible to the global minimum:
- the pose computed at time t-1;
- the pose predicted from the pose computed at time t-1 and a motion model;
- ...
109
Numerical Optimization
General methods:
• Gradient descent / Steepest descent;
• Conjugate gradient;
• ...
Non-linear least-squares optimization:
• Gauss-Newton;
• Levenberg-Marquardt;
• ...
110
Numerical Optimization
We want to find p that minimizes:
$$E(p) = \sum_i \mathrm{dist}^2\!\left(H_{R(p),T(p)}\,m_i,\; m'_i\right) = \left\| f(p) - b \right\|^2$$
where
• p is a vector of parameters that define the camera pose (translation vector + parameters of the rotation matrix);
• b is a vector made of the measurements (here the m'_i): $b = \left(u(m'_1),\; v(m'_1),\; \ldots\right)^\top$;
• f is the function that relates the camera pose to these measurements: $f(p) = \left(u(H_{R(p),T(p)}\,m_1),\; v(H_{R(p),T(p)}\,m_1),\; \ldots\right)^\top$.
111
Gradient Descent / Steepest Descent
$$p_{i+1} = p_i - \lambda \nabla E(p_i)$$
$$E(p_i) = \left\| f(p_i) - b \right\|^2 = \left( f(p_i) - b \right)^\top \left( f(p_i) - b \right) \;\rightarrow\; \nabla E(p_i) = 2\,J^\top \left( f(p_i) - b \right)$$
with J the Jacobian matrix of f, computed at p_i.
Weaknesses:
- How to choose λ?
- Needs a lot of iterations in long and narrow valleys.
112
The Gauss-Newton and the Levenberg-Marquardt algorithms
But first, the linear least-squares case: we want to minimize $E(p) = \| f(p) - b \|^2$. If the function f is linear, i.e. f(p) = Ap, then p can be estimated as:
$$p = A^{+} b$$
where $A^{+}$ is the pseudo-inverse of A: $A^{+} = (A^\top A)^{-1} A^\top$.
113
Non-Linear Least-Squares: The Gauss-Newton Algorithm
Iteration steps: $p_{i+1} = p_i + \Delta_i$, where $\Delta_i$ is chosen to minimize the residual $\| f(p_{i+1}) - b \|^2$. It is computed by approximating f to the first order:
$$\begin{aligned} \Delta_i &= \arg\min_{\Delta} \left\| f(p_i + \Delta) - b \right\|^2 \\ &= \arg\min_{\Delta} \left\| f(p_i) + J\Delta - b \right\|^2 \quad \text{(first-order approximation: } f(p_i + \Delta) \approx f(p_i) + J\Delta\text{)} \\ &= \arg\min_{\Delta} \left\| \varepsilon_i + J\Delta \right\|^2 \quad \text{where } \varepsilon_i = f(p_i) - b \text{ is the residual at iteration } i \end{aligned}$$
$\Delta_i$ is the solution of the system $J\Delta = -\varepsilon_i$ in the least-squares sense: $\Delta_i = -J^{+}\varepsilon_i$, where $J^{+}$ is the pseudo-inverse of J.
114
Non-Linear Least-Squares: The Levenberg-Marquardt Algorithm
In the Gauss-Newton algorithm:
$$\Delta_i = -\left( J^\top J \right)^{-1} J^\top \varepsilon_i$$
In the Levenberg-Marquardt algorithm:
$$\Delta_i = -\left( J^\top J + \lambda I \right)^{-1} J^\top \varepsilon_i$$
Levenberg-Marquardt algorithm:
0. Initialize λ with a small value: λ = 0.001.
1. Compute ∆_i and E(p_i + ∆_i).
2. If E(p_i + ∆_i) > E(p_i): λ ← 10 λ and go back to 1 [happens when the linear approximation of f is too coarse].
3. If E(p_i + ∆_i) < E(p_i): λ ← λ / 10, p_{i+1} ← p_i + ∆_i and go back to 1.
Once converged, set λ ← 0 and continue until convergence.
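A minimal sketch of this loop with Eigen (f, its Jacobian, and the measurement vector b are assumed to be provided by the caller; the stopping criterion is deliberately simplistic):

#include <Eigen/Dense>
#include <functional>

// One possible Levenberg-Marquardt loop for E(p) = ||f(p) - b||^2.
// f(p) and its Jacobian J(p) are supplied as callables; lambda follows the
// update rule of the slides (x10 on failure, /10 on success).
Eigen::VectorXd levenbergMarquardt(
    const std::function<Eigen::VectorXd(const Eigen::VectorXd&)> &f,
    const std::function<Eigen::MatrixXd(const Eigen::VectorXd&)> &jac,
    const Eigen::VectorXd &b, Eigen::VectorXd p, int maxIter = 100)
{
    double lambda = 0.001;
    double E = (f(p) - b).squaredNorm();
    for (int it = 0; it < maxIter; ++it) {
        Eigen::VectorXd eps = f(p) - b;                  // residual
        Eigen::MatrixXd J   = jac(p);
        Eigen::MatrixXd A   = J.transpose() * J
                            + lambda * Eigen::MatrixXd::Identity(p.size(), p.size());
        Eigen::VectorXd delta = -A.ldlt().solve(J.transpose() * eps);
        double Enew = (f(p + delta) - b).squaredNorm();
        if (Enew < E) { p += delta; E = Enew; lambda /= 10.0; }  // accept the step
        else          { lambda *= 10.0; }                        // reject, damp more
    }
    return p;
}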
115
Non-Linear Least-Squares: the Levenberg-Marquardt Algorithm
• When λ is small, LM behaves similarly to the Gauss-Newton algorithm.
• When λ becomes large, LM behaves similarly to a steepest descent, which guarantees convergence.
$$\Delta_i = -\left( J^\top J + \lambda I \right)^{-1} J^\top \varepsilon_i$$
116
Another Way to Refine the Pose: Template Matching
117
Global region tracking by minimizing a dissimilarity measure between the template and the warped input image (the sum of squared differences used below):
• Useful for objects difficult to model using local features;
• Accurate.
Template T
Input Image Ip
118
Lucas-Kanade Algorithm
$$\min_p \sum_j \left( W(I, p)[m_j] - T[m_j] \right)^2$$
Gauss-Newton step:
$$\Delta_i = J_p^{+} \cdot \varepsilon_{p,I}$$
where $J_p^{+}$ is the pseudo-inverse of the Jacobian of W(I, p) evaluated at p and the $m_j$, and
$$\varepsilon_{p,I} = \left( \ldots,\; T[m_j] - W(I, p)[m_j],\; \ldots \right)^\top$$
119
Lucas-Kanade Algorithm
Computing J and J⁺ is computationally expensive.
120
Inverse Compositional Algorithm [Baker et al., IJCV'03]
$$p_i = p_{i-1} + \mathrm{d}p_i, \qquad \mathrm{d}p_i = J_{p=0}^{+}\; \varepsilon_{p=0,\,I}$$
$J_{p=0}$ is a constant matrix and can therefore be precomputed!
121
ESM (Efficient Second-order Method)
(1) I = T + J_{p=0} dp + dp^T H_{p=0} dp   [second-order Taylor expansion]
(2) J_{p=dp} = J_{p=0} + 2 dp^T H_{p=0}   [derivative of (1) with respect to p]
(3) dp^T H_{p=0} = ½ (J_{p=dp} - J_{p=0})   [from Equation (2)]
(4) I = T + (J_{p=0} + ½ (J_{p=dp} - J_{p=0})) dp   [by injecting (3) into (1)]
(5) dp = [½ (J_{p=0} + J_{p=dp})]⁺ (I - T)   [from Equation (4)]
Like Gauss-Newton, but with J_{p=0} replaced by ½ (J_{p=0} + J_{p=dp}). This requires computing J_{p=dp} and a pseudo-inverse at each iteration, but needs far fewer iterations.
122
BRIEF [ECCV'10]
A very fast feature point descriptor.
123
Remark
• Moving legacy code to new CPUs does not result in a speed-up anymore;
• Should consider the features of new platforms: parallelism (multi-cores, GPU), locality, ...
124
[Figure: the BRIEF descriptor is a string of bits (1, 1, 0, ..., 0, 1), each bit obtained by comparing the intensities of a pair of pixels in the patch after Gaussian smoothing.]
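A minimal C++ sketch of computing such a bit string (the image is assumed to be already smoothed; the pair locations are placeholders, drawn once and reused for every keypoint):

#include <cstdint>
#include <vector>

struct PixelPair { int du1, dv1, du2, dv2; };

// Compute a 64-bit BRIEF-like descriptor at keypoint (u, v) in the smoothed
// image I (row-major, width pixels per row). Each bit is one intensity comparison.
uint64_t briefDescriptor(const unsigned char *I, int width,
                         int u, int v, const std::vector<PixelPair> &pairs)
{
    uint64_t desc = 0;
    for (size_t i = 0; i < pairs.size() && i < 64; ++i) {
        const PixelPair &p = pairs[i];
        unsigned char a = I[(v + p.dv1) * width + (u + p.du1)];
        unsigned char b = I[(v + p.dv2) * width + (u + p.du2)];
        if (a < b) desc |= (uint64_t(1) << i);   // set bit i
    }
    return desc;
}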
125
[Same figure as above.]
Alternatively, using integral images:
126
Integral Images
$$\mathrm{IntegralImage}(u, v) = \sum_{i=1..u} \; \sum_{j=1..v} \mathrm{Image}(i, j)$$
127
How to Use Integral Images
The sum of the image intensities over any rectangular region is obtained in constant time from the four corners of the integral image (one addition and two subtractions).
[Viola & Jones, IJCV’01]
Features computed in constant time
129
Computing Integral Images
IntegralImage[u][v] = IntegralImage[u][v-1] + LineBuffer[u] + Image[u][v]
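A minimal C++ sketch of one possible completion of this recurrence, with a running row sum playing the role of the line buffer (0-based indices, simplistic border handling), followed by the constant-time box sum from the previous slide:

#include <vector>

// Build the integral image: II(v, u) = sum of Image over the rectangle
// [0..u] x [0..v].
void buildIntegralImage(const std::vector<std::vector<int>> &img,
                        std::vector<std::vector<long long>> &II)
{
    int h = img.size(), w = img[0].size();
    II.assign(h, std::vector<long long>(w, 0));
    for (int v = 0; v < h; ++v) {
        long long rowSum = 0;                       // line buffer for row v
        for (int u = 0; u < w; ++u) {
            rowSum += img[v][u];
            II[v][u] = rowSum + (v > 0 ? II[v - 1][u] : 0);
        }
    }
}

// Sum of the image over the rectangle [u0..u1] x [v0..v1], in constant time,
// using the four corners of the integral image.
long long boxSum(const std::vector<std::vector<long long>> &II,
                 int u0, int v0, int u1, int v1)
{
    long long A = II[v1][u1];
    long long B = (u0 > 0) ? II[v1][u0 - 1] : 0;
    long long C = (v0 > 0) ? II[v0 - 1][u1] : 0;
    long long D = (u0 > 0 && v0 > 0) ? II[v0 - 1][u0 - 1] : 0;
    return A - B - C + D;
}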
130
Evaluation
131
Evaluation
132
Computation Speed
For BRIEF, most of the time is spent in Gaussian smoothing.
133
Matching Speed
distance(BRIEF descriptor 1, BRIEF descriptor 2)
= Hamming distance(BRIEF descriptor 1, BRIEF descriptor 2)
= number of bits set to 1(BRIEF descriptor 1 xor BRIEF descriptor 2)
= popcount(BRIEF descriptor 1 xor BRIEF descriptor 2)
10- to 15-fold speed increase on Intel's Bloomfield (SSE 4.2) and AMD's Phenom (SSE 4a)
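A minimal C++ sketch of this matching distance for a 256-bit descriptor stored as four 64-bit words (__builtin_popcountll is the GCC/Clang intrinsic, which typically compiles to the POPCNT instruction on SSE 4.2 / SSE4a hardware):

#include <cstdint>

// Hamming distance between two BRIEF descriptors = popcount of their XOR.
int hammingDistance(const uint64_t a[4], const uint64_t b[4])
{
    int d = 0;
    for (int i = 0; i < 4; ++i)
        d += __builtin_popcountll(a[i] ^ b[i]);
    return d;
}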
134
Matching Speed
135
Picking the Locations
• uniform distribution;
• Gaussian distribution;
• Gaussian distribution for location and length;
• uniform distribution on polar coordinates;
• census transform locations.
136
Picking the Locations
• uniform distribution;
• Gaussian distribution;
• Gaussian distribution for location and length;
• uniform distribution on polar coordinates;
• census transform locations.
137
Rotation and Scale Invariance
138
Rotation and Scale Invariance
Duplicate the descriptors: 18 rotations × 3 scales.
...
...
...
139
Code released under the GPL on the CVLab website.
140
DOT [CVPR'10]
A dense descriptor for object detection.
Joint work with Stefan Hinterstoisser (TU Munich)
141
Template matching with an efficient representation of the images and the templates.
Object detection with a sliding window and template matching.
142
143
144
Initial Similarity Measure
145
Making the Similarity Measure Robust to Small Motions
146
Downsampling
147
Ignoring the Dependencies between the Regions...
148
Lists of Dominant Orientations
149
Fast Computation with Bitwise Operations
0000110000010000
150
Code available under the LGPL license at http://campar.in.tum.de/personal/hinterst/index/
151
New Method, LINE [PAMI, under revision]
152
Initial Similarity Measure
Previous measure:
$$E_{\mathrm{Steger}}(I, O, c) = \sum_r \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, c + r) \right) \right|$$
153
Making the Similarity Measure Robust to Small Motions
$$E_{\mathrm{Steger}}(I, O, c) = \sum_r \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, c + r) \right) \right|$$
$$E(I, O, c) = \sum_r \left( \max_{t \in \mathrm{region}(c + r)} \left| \cos\!\left( \mathrm{orientation}(O, r) - \mathrm{orientation}(I, t) \right) \right| \right)$$
Avoiding Recomputing the max Operator
1. Spread the gradients.
155
2. Precompute response maps.
Because
• we consider only a discrete set of gradient directions, and
• we do not consider the gradient norms,
we can precompute a response for each region in the image and each gradient direction in the template.
156
Optimized Version
1. The sets of orientations in the image regions are encoded with a binary representation (e.g. 11010):
157
Optimized Version
2. The binary representation is used as an index into lookup tables containing the precomputed responses for each gradient direction in the template:
158
Avoiding Cache Misses
The response maps are re-arranged into linear memories:
159
Using the Linear Memories
The similarity measure can be computed for all the image locations by summing linear memories, shifted by an offset that depends on the template.
160
Advantage of Linearizing the Memory
[Plot: speed-up factor obtained by linearizing the memory.]
161
DOT [CVPR'10] / LINE
162
LINE-MOD [Hinterstoisser et al, ICCV’11]
Extension to the Kinect: the templates combine the image and the depth map.
163
thanks!
164