MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Description

MIT 6.870 Object Recognition and Scene Understanding (Fall 2008)
http://people.csail.mit.edu/torralba/courses/6.870/6.870.recognition.htm

This class will review and discuss current approaches to object recognition and scene understanding in computer vision. The course will cover bag-of-words models, part-based models, classifier-based models, multiclass object recognition and transfer learning, concurrent recognition and segmentation, context models for object recognition, grammars for scene understanding, and large datasets for semi-supervised and unsupervised discovery of object and scene categories. We will be reading a mixture of papers from computer vision and influential works from cognitive psychology on object and scene recognition.

Transcript of MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Page 1: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

6.870

Object Recognition and Scene Understanding

student presentation, MIT

Page 2

6.870

Template matching and histograms

Nicolas Pinto

Page 3

Introduction

Page 4

Hosts

Page 5

Hosts: a guy...

(who has big arms)

Page 6

Hosts: Antonio T...

(who knows a lot about vision)

a guy...

(who has big arms)

Page 7

Hosts: Antonio T...

(who knows a lot about vision)

a frog...

(who has big eyes)

a guy...

(who has big arms)

Page 8

Hosts: Antonio T...

(who knows a lot about vision)

a frog...

(who has big eyes) and thus should know a lot about vision...

a guy...

(who has big arms)

Page 9

3 papers

yey!!

Page 10

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
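The first stage Lowe describes, finding maxima or minima of a difference-of-Gaussian function in scale space, can be sketched as follows. This is an illustrative NumPy/SciPy reconstruction, not Lowe's implementation; the pyramid parameters (`sigma`, `k`, the number of levels) and the brute-force 26-neighbor scan are assumptions chosen for clarity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_extrema(image, sigma=1.6, k=2 ** 0.5, n_levels=5):
    """Find scale-space extrema of a difference-of-Gaussian pyramid.

    Returns (row, col, level) triples where a pixel is a strict
    maximum or minimum among its 26 neighbors in scale space.
    """
    # Stack of progressively blurred images, then adjacent differences.
    blurred = [gaussian_filter(image.astype(float), sigma * k ** i)
               for i in range(n_levels)]
    dog = np.stack([blurred[i + 1] - blurred[i]
                    for i in range(n_levels - 1)])
    keys = []
    for s in range(1, dog.shape[0] - 1):
        for r in range(1, dog.shape[1] - 1):
            for c in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                v = dog[s, r, c]
                # Strict extremum: all 26 neighbors strictly smaller/larger.
                if v == cube.max() and (cube < v).sum() == 26:
                    keys.append((r, c, s))
                elif v == cube.min() and (cube > v).sum() == 26:
                    keys.append((r, c, s))
    return keys
```

In the paper each surviving key is then described relative to its scale-space coordinate frame; this sketch stops at detection.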


Lowe (1999)
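The final verification step in Lowe's pipeline, a low-residual least-squares solution for the unknown model parameters once at least 3 keys agree, can be illustrated with a plain affine fit. This is a hypothetical sketch, not code from the paper: each 2D correspondence contributes two linear equations in the six affine parameters.

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares affine fit mapping model points to image points.

    Each correspondence gives two rows of the linear system; three or
    more non-degenerate matches determine the six parameters.
    Returns (A, t, residual) with x_img ~ A @ x_model + t.
    """
    model_pts = np.asarray(model_pts, float)
    image_pts = np.asarray(image_pts, float)
    n = len(model_pts)
    M = np.zeros((2 * n, 6))
    b = image_pts.reshape(-1)          # [x0', y0', x1', y1', ...]
    M[0::2, 0:2] = model_pts           # x' = a11*x + a12*y + tx
    M[0::2, 4] = 1.0
    M[1::2, 2:4] = model_pts           # y' = a21*x + a22*y + ty
    M[1::2, 5] = 1.0
    p, _, _, _ = np.linalg.lstsq(M, b, rcond=None)
    A = p[:4].reshape(2, 2)
    t = p[4:]
    residual = np.linalg.norm(M @ p - b)
    return A, t, residual
```

A low residual from such a fit is what the paper treats as strong evidence for the object's presence.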

3 papers

yey!!

Page 11

Lowe (1999)

Histograms of Oriented Gradients for Human Detection

Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions.

Dalal and Triggs (2005)
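The feature chain the paper outlines (gradient computation, orientation binning into cells, block contrast normalization) can be roughly sketched as below. The cell size, bin count, and L2 normalization follow commonly cited defaults and are assumptions for illustration, not the paper's exact pipeline (which also uses overlapping blocks and trilinear interpolation).

```python
import numpy as np

def hog_cells(image, cell=8, n_bins=9):
    """Unsigned-orientation gradient histograms over non-overlapping cells.

    Each pixel votes its gradient magnitude into one of n_bins
    orientation bins covering [0, 180) degrees.
    """
    img = image.astype(float)
    gy, gx = np.gradient(img)                      # row- and column-wise gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # fold sign: unsigned orientation
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, n_bins))
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    for i in range(ch * cell):
        for j in range(cw * cell):
            hist[i // cell, j // cell, bins[i, j]] += mag[i, j]
    return hist

def normalize_block(block, eps=1e-6):
    """L2 normalization of a concatenated block of cell histograms."""
    v = block.reshape(-1)
    return v / np.sqrt((v ** 2).sum() + eps ** 2)
```

Concatenating the normalized block vectors over the detection window yields the descriptor that is fed to the linear SVM.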

3 papers

yey!!

Page 12

Lowe (1999)

Dalal and Triggs (2005)

3 papers

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb
University of Chicago
[email protected]

David McAllester
Toyota Technological Institute at Chicago
[email protected]

Deva Ramanan
UC Irvine
[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a

Felzenszwalb et al. (2008)
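The deformable-part idea in the excerpt above (a coarse root template plus higher resolution part templates with a spatial model for each part) can be illustrated by a toy scoring function: each part contributes its best filter response minus a quadratic deformation cost relative to an anchor position. This is a hypothetical simplification; the actual system evaluates HOG filter pyramids and uses distance transforms for efficiency.

```python
import numpy as np

def best_part_placement(response, anchor, weights):
    """Best placement of one part: filter response minus a quadratic
    deformation cost relative to the anchor position.

    response: 2D array of part-filter scores at each location
    anchor:   (row, col) ideal part location
    weights:  (wr, wc) quadratic deformation penalties
    Returns (best_score, (row, col)).
    """
    rows, cols = np.indices(response.shape)
    cost = (weights[0] * (rows - anchor[0]) ** 2
            + weights[1] * (cols - anchor[1]) ** 2)
    scores = response - cost
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return scores[idx], idx

def detection_score(root_score, part_responses, anchors, weights):
    """Root score plus the best (response - deformation) for each part."""
    total = root_score
    for resp, a, w in zip(part_responses, anchors, weights):
        s, _ = best_part_placement(resp, a, w)
        total += s
    return total
```

The max over part placements is what makes the placements latent: training only sees the final score, which motivates the latent SVM formulation the abstract describes.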

yey!!

Page 13

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

Proc. of the International Conference on

Computer Vision, Corfu (Sept. 1999)

An object recognition system has been developed that uses a

new class of local image features. The features are invariant

to image scaling, translation, and rotation, and partially in-

variant to illumination changes and affine or 3D projection.

These features share similar properties with neurons in in-

ferior temporal cortex that are used for object recognition

in primate vision. Features are efficiently detected through

a staged filtering approach that identifies stable points in

scale space. Image keys are created that allow for local ge-

ometric deformations by representing blurred image gradi-

ents in multiple orientation planes and at multiple scales.

The keys are used as input to a nearest-neighbor indexing

method that identifies candidate object matches. Final veri-

fication of each match is achieved by finding a low-residual

least-squares solution for the unknown model parameters.

Experimental results show that robust object recognition

can be achieved in cluttered partially-occluded images with

a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.


Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work
There is an extensive literature on object detection, but

Dalal and Triggs (2005)

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb
University of Chicago
[email protected]

David McAllester
Toyota Technological Institute at Chicago
[email protected]

Deva Ramanan
UC Irvine
[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution

Felzenszwalb et al. (2008)

Page 14: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Scale-Invariant Feature Transform (SIFT)

adapted from Kucuktunc

Page 15: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Scale-Invariant Feature Transform (SIFT)

adapted from Brown, ICCV 2003

Page 16: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT local features are

invariant...

adapted from David Lee

Page 17: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

like me they are robust...


Page 18: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

like me they are robust...


... to changes in illumination, noise, viewpoint, occlusion, etc.

Page 19: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

I am sure you want to know

how to build them


Page 20: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

I am sure you want to know

how to build them

1. find interest points or “keypoints”

Page 21: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

Page 22: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

3. compute their descriptor

Page 23: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

I am sure you want to know

how to build them

1. find interest points or “keypoints”

2. find their dominant orientation

3. compute their descriptor

4. match them on other images

Page 24: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

1. find interest points or “keypoints”

Page 25: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)


keypoints are taken as maxima/minima
of a DoG pyramid

in this setting, extrema are invariant to scale...

Page 26: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

a DoG (Difference of Gaussians) pyramid is simple to compute...

even he can do it!

before after

adapted from Pallus and Fleishman
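The DoG pyramid really is simple: blur the image with Gaussians of increasing sigma and subtract adjacent levels. A minimal NumPy sketch (the `gaussian_blur` and `dog_pyramid` helpers and the sigma schedule are illustrative, not Lowe's exact implementation, which also subsamples into octaves):

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian blur via two 1D convolutions (simple, not fast).
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def dog_pyramid(img, sigma0=1.6, k=2 ** 0.5, levels=4):
    # Blur with geometrically increasing sigmas, then subtract
    # adjacent levels: DoG_i = G(sigma_{i+1}) - G(sigma_i).
    blurred = [gaussian_blur(img, sigma0 * k ** i) for i in range(levels)]
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```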

Page 27: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

then we just have to find neighborhood extrema in this 3D DoG space

Page 28: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

then we just have to find neighborhood extrema in this 3D DoG space

if a pixel is an extremum in its neighboring region,
it becomes a candidate keypoint
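The neighborhood test amounts to comparing a pixel against its 26 neighbors in a 3x3x3 cube spanning the same, finer, and coarser DoG levels. A hypothetical sketch (`is_extremum` is an illustrative name):

```python
import numpy as np

def is_extremum(dog_below, dog_same, dog_above, r, c):
    # Gather the 3x3x3 neighborhood across three adjacent DoG levels.
    patch = np.stack([d[r-1:r+2, c-1:c+2] for d in (dog_below, dog_same, dog_above)])
    center = dog_same[r, c]
    others = np.delete(patch.ravel(), 13)  # index 13 is the center voxel
    # Candidate keypoint iff strictly above or below all 26 neighbors.
    return bool(center > others.max() or center < others.min())
```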

Page 29: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

too many keypoints?

adapted from wikipedia

Page 30: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

too many keypoints?

1. remove low contrast

adapted from wikipedia

Page 31: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

too many keypoints?

1. remove low contrast

adapted from wikipedia

Page 32: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

too many keypoints?

1. remove low contrast

2. remove edges

adapted from wikipedia

Page 33: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

too many keypoints?

1. remove low contrast

2. remove edges

adapted from wikipedia
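Both rejection steps can be sketched together. The contrast threshold (0.03, assuming pixel values in [0, 1]) and edge ratio (r = 10) are the values commonly quoted for SIFT; the 2x2 Hessian of the DoG image is approximated with finite differences, and an edge-like point shows up as trace²/det exceeding (r+1)²/r:

```python
import numpy as np

def keep_keypoint(dog, r, c, contrast_thresh=0.03, edge_ratio=10.0):
    # 1. remove low contrast: weak |DoG| responses are unstable.
    if abs(dog[r, c]) < contrast_thresh:
        return False
    # 2. remove edges: one large and one small Hessian eigenvalue
    # means an edge, detected via the trace^2 / det ratio.
    dxx = dog[r, c+1] - 2 * dog[r, c] + dog[r, c-1]
    dyy = dog[r+1, c] - 2 * dog[r, c] + dog[r-1, c]
    dxy = (dog[r+1, c+1] - dog[r+1, c-1] - dog[r-1, c+1] + dog[r-1, c-1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:           # eigenvalues of opposite sign: reject
        return False
    return tr * tr / det < (edge_ratio + 1) ** 2 / edge_ratio
```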

Page 34: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)


2. find their dominant orientation

Page 35: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

each selected keypoint is assigned to one or more“dominant” orientations...

Page 36: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

each selected keypoint is assigned to one or more“dominant” orientations...

... this step is important to achieve rotation invariance

Page 37: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How?

Page 38: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How?
using the DoG pyramid to achieve scale invariance:

Page 39: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

Page 40: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

b. build an orientation histogram

Page 41: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How?
using the DoG pyramid to achieve scale invariance:

a. compute image gradient magnitude and orientation

b. build an orientation histogram

c. keypoint’s orientation(s) = peak(s)

Page 42: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

a. compute image gradient magnitude and orientation

Page 43: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

a. compute image gradient magnitude and orientation

Page 44: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

b. build an orientation histogram

adapted from Ofir Pele

Page 45: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

c. keypoint’s orientation(s) = peak(s)

*

* the peak ;-)
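Steps a–c can be sketched in a few lines of NumPy (`dominant_orientation` is a hypothetical helper; real SIFT also Gaussian-weights the votes and interpolates the peak position):

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    # a. gradient magnitude and orientation via central differences
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ori = np.degrees(np.arctan2(dy, dx)) % 360.0
    # b. magnitude-weighted orientation histogram (36 bins of 10 degrees)
    hist, _ = np.histogram(ori, bins=n_bins, range=(0, 360), weights=mag)
    # c. the keypoint's orientation is the histogram peak
    return hist.argmax() * (360.0 / n_bins)
```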

Page 46: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)


3. compute their descriptor

Page 47: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT descriptor = a set of orientation histograms

4x4 array x 8 bins = 128 dimensions (normalized)

16x16 neighborhood of pixel gradients
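A minimal sketch of the descriptor assembly: the 16x16 gradient neighborhood is split into a 4x4 grid of cells, each voting into an 8-bin orientation histogram, giving 4·4·8 = 128 normalized dimensions. This omits the Gaussian weighting, rotation to the dominant orientation, trilinear interpolation, and 0.2 clipping of the full SIFT descriptor:

```python
import numpy as np

def sift_descriptor(patch):
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))
    mag = np.hypot(dx, dy)
    ori = np.degrees(np.arctan2(dy, dx)) % 360.0
    desc = []
    for i in range(0, 16, 4):              # 4x4 grid of 4x4 cells
        for j in range(0, 16, 4):
            h, _ = np.histogram(ori[i:i+4, j:j+4], bins=8,
                                range=(0, 360), weights=mag[i:i+4, j:j+4])
            desc.append(h)                 # 8 orientation bins per cell
    desc = np.concatenate(desc)            # 4 * 4 * 8 = 128 dimensions
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc
```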

Page 48: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)


4. match them on other images

Page 49: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How to match?

Page 50: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How to match?

nearest neighbor

Page 51: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How to match?

nearest neighbor
Hough transform voting

Page 52: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How to match?

nearest neighbor
Hough transform voting
least-squares fit

Page 53: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

How to match?

nearest neighbor
Hough transform voting
least-squares fit
etc.
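The nearest-neighbor step can be sketched as brute-force matching with Lowe's distance-ratio test (keep a match only when the best distance is clearly smaller than the second best); the 0.8 ratio is the commonly quoted default:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    # desc_a, desc_b: arrays of shape (N, D) of SIFT-like descriptors.
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Ratio test: reject ambiguous matches whose best and
        # second-best distances are too close.
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```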

Page 54: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT is great!


Page 55: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT is great!

invariant to affine transformations

Page 56: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT is great!

invariant to affine transformations

easy to understand

Page 57: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

SIFT is great!

invariant to affine transformations

easy to understand

fast to compute

Page 58: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Extension example: Spatial Pyramid Matching using SIFT


Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik
[email protected]
Beckman Institute, University of Illinois

Cordelia Schmid
[email protected]
INRIA Rhone-Alpes, Montbonnot, France

Jean Ponce
[email protected]
Ecole Normale Superieure, Paris, France

CVPR 2006
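A rough sketch of the idea behind spatial pyramid matching: quantized local features (e.g. SIFT visual words) are histogrammed over increasingly fine spatial grids and compared by weighted histogram intersection. The function name and the level weighting below are a simplified approximation of the kernel in Lazebnik et al., not their exact formulation:

```python
import numpy as np

def pyramid_match(pts_a, pts_b, labels_a, labels_b, n_words, levels=2):
    # pts_*: (N, 2) point coordinates in [0, 1); labels_*: visual-word
    # index per point. Coarse levels get smaller weights.
    score = 0.0
    for lvl in range(levels + 1):
        g = 2 ** lvl                     # g x g spatial grid at this level
        ha = np.zeros((g, g, n_words))
        hb = np.zeros((g, g, n_words))
        for (x, y), w in zip(pts_a, labels_a):
            ha[int(y * g), int(x * g), w] += 1
        for (x, y), w in zip(pts_b, labels_b):
            hb[int(y * g), int(x * g), w] += 1
        inter = np.minimum(ha, hb).sum()  # histogram intersection
        weight = 1.0 / 2 ** (levels - lvl) if lvl > 0 else 1.0 / 2 ** levels
        score += weight * inter
    return score
```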

Page 59: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]


Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction
Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work
There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method
This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or


Dalal and Triggs (2005)
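The HOG chain summarized in the paper's §3 can be roughly sketched as: unsigned gradient orientations voted into 8x8-pixel cells with 9 bins, then overlapping 2x2-cell blocks L2-normalized and concatenated. This is a simplified approximation of the pipeline, not the authors' implementation:

```python
import numpy as np

def hog(img, cell=8, bins=9):
    dy, dx = np.gradient(img.astype(float))
    mag = np.hypot(dx, dy)
    ori = np.degrees(np.arctan2(dy, dx)) % 180.0   # unsigned gradients
    H, W = img.shape
    cells = np.zeros((H // cell, W // cell, bins))
    for i in range(cells.shape[0]):
        for j in range(cells.shape[1]):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            cells[i, j], _ = np.histogram(ori[sl], bins=bins,
                                          range=(0, 180), weights=mag[sl])
    # overlapping 2x2-cell blocks, L2-normalized and concatenated
    feats = []
    for i in range(cells.shape[0] - 1):
        for j in range(cells.shape[1] - 1):
            block = cells[i:i+2, j:j+2].ravel()
            feats.append(block / (np.linalg.norm(block) + 1e-6))
    return np.concatenate(feats)
```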

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro FelzenszwalbUniversity of [email protected]

David McAllesterToyota Technological Institute at Chicago

[email protected]

Deva RamananUC Irvine

[email protected]

Abstract

This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our sys-tem achieves a two-fold improvement in average precisionover the best performance in the 2006 PASCAL person de-tection challenge. It also outperforms the best results in the2007 challenge in ten out of twenty categories. The systemrelies heavily on deformable parts. While deformable partmodels have become quite popular, their value had not beendemonstrated on difficult benchmarks such as the PASCAL

Figure 1. Example detection obtained with the person model. Themodel is defined by a coarse template, several higher resolution

Felzenszwalb et al.(2008)

Page 60: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

first of all, let me put this paper in context

Page 61: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

histograms of local image measurements have been quite successful

Swain & Ballard 1991 - Color Histograms

Schiele & Crowley 1996 - Receptive Fields Histograms

Lowe 1999 - SIFT

Schneiderman & Kanade 2000 - Localized Histograms of Wavelets

Leung & Malik 2001 - Texton Histograms

Belongie et al. 2002 - Shape Context

Dalal & Triggs 2005 - Dense Orientation Histograms

...


Page 62: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

tons of “feature sets” have been proposed

features

Gavrila & Philomen 1999 - Edge Templates + Nearest Neighbor

Papageorgiou & Poggio 2000, Mohan et al. 2001, DePoortere et al. 2002 - Haar Wavelets + SVM

Viola & Jones 2001 - Rectangular Differential Features + AdaBoost

Mikolajczyk et al. 2004 - Parts Based Histograms + AdaBoost

Ke & Sukthankar 2004 - PCA-SIFT

...

Page 63: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr

Abstract

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

1 Introduction

Detecting humans in images is a challenging task owing to their variable appearance and the wide range of poses that they can adopt. The first need is a robust feature set that allows the human form to be discriminated cleanly, even in cluttered backgrounds under difficult illumination. We study the issue of feature sets for human detection, showing that locally normalized Histogram of Oriented Gradient (HOG) descriptors provide excellent performance relative to other existing feature sets including wavelets [17,22]. The proposed descriptors are reminiscent of edge orientation histograms [4,5], SIFT descriptors [12] and shape contexts [1], but they are computed on a dense grid of uniformly spaced cells and they use overlapping local contrast normalizations for improved performance. We make a detailed study of the effects of various implementation choices on detector performance, taking “pedestrian detection” (the detection of mostly visible people in more or less upright poses) as a test case. For simplicity and speed, we use linear SVM as a baseline classifier throughout the study. The new detectors give essentially perfect results on the MIT pedestrian test set [18,17], so we have created a more challenging set containing over 1800 pedestrian images with a large range of poses and backgrounds. Ongoing work suggests that our feature set performs equally well for other shape-based object classes.

We briefly discuss previous work on human detection in §2, give an overview of our method in §3, describe our data sets in §4 and give a detailed description and experimental evaluation of each stage of the process in §5–6. The main conclusions are summarized in §7.

2 Previous Work

There is an extensive literature on object detection, but here we mention just a few relevant papers on human detection [18,17,22,16,20]. See [6] for a survey. Papageorgiou et al [18] describe a pedestrian detector based on a polynomial SVM using rectified Haar wavelets as input descriptors, with a parts (subwindow) based variant in [17]. Depoortere et al give an optimized version of this [2]. Gavrila & Philomen [8] take a more direct approach, extracting edge images and matching them to a set of learned exemplars using chamfer distance. This has been used in a practical real-time pedestrian detection system [7]. Viola et al [22] build an efficient moving person detector, using AdaBoost to train a chain of progressively more complex region rejection rules based on Haar-like wavelets and space-time differences. Ronfard et al [19] build an articulated body detector by incorporating SVM based limb classifiers over 1st and 2nd order Gaussian filters in a dynamic programming framework similar to those of Felzenszwalb & Huttenlocher [3] and Ioffe & Forsyth [9]. Mikolajczyk et al [16] use combinations of orientation-position histograms with binary-thresholded gradient magnitudes to build a parts based method containing detectors for faces, heads, and front and side profiles of upper and lower body parts. In contrast, our detector uses a simpler architecture with a single detection window, but appears to give significantly higher performance on pedestrian images.

3 Overview of the Method

This section gives an overview of our feature extraction chain, which is summarized in fig. 1. Implementation details are postponed until §6. The method is based on evaluating well-normalized local histograms of image gradient orientations in a dense grid. Similar features have seen increasing use over the past decade [4,5,12,15]. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or

localizing humans in images is a challenging task...

difficult!

Wide variety of articulated poses

Variable appearance/clothing

Complex backgrounds

Unconstrained illuminations

Occlusions

Different scales

...

Page 64: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Approach

Page 65: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Approach

•robust feature set (HOG)

Page 66: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Approach

•robust feature set (HOG)

Page 67: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Approach

•robust feature set (HOG)

•simple classifier (linear SVM)

Page 68: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Approach

•robust feature set (HOG)

•simple classifier (linear SVM)

•fast detection (sliding window)

Page 69: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

adapted from Bill Triggs

Page 70: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

• Gamma normalization

• Space: RGB, LAB or Gray

• Method: SQRT or LOG
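The two gamma-compression options named on the slide (SQRT or LOG) can be sketched as follows. This is a minimal illustration; the function name and the rescaling of the log variant are my choices, not the paper's:

```python
import numpy as np

def gamma_normalize(image, method="sqrt"):
    """Compress the dynamic range of a float image in [0, 1].

    "sqrt" and "log" correspond to the two compression methods
    listed on the slide; everything else here is illustrative."""
    image = np.clip(np.asarray(image, dtype=np.float64), 0.0, 1.0)
    if method == "sqrt":
        return np.sqrt(image)
    if method == "log":
        # log1p keeps the output finite at zero intensity;
        # dividing by log(2) rescales so that 1.0 maps to 1.0
        return np.log1p(image) / np.log(2.0)
    raise ValueError(f"unknown method: {method}")
```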

Page 71: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

Sobel


* centered performs the best

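The best-performing mask in the comparison above, the centered [-1, 0, 1] derivative, can be sketched like this. The border handling (edge replication) is my assumption, not something the slide specifies:

```python
import numpy as np

def centered_gradients(img):
    """Gradient magnitude and unsigned orientation using the centered
    [-1, 0, 1] mask in x and y (the best-performing choice above).
    Image borders are handled by replication; that boundary choice
    is an assumption of this sketch."""
    img = np.asarray(img, dtype=np.float64)
    padded = np.pad(img, 1, mode="edge")
    gx = 0.5 * (padded[1:-1, 2:] - padded[1:-1, :-2])
    gy = 0.5 * (padded[2:, 1:-1] - padded[:-2, 1:-1])
    magnitude = np.hypot(gx, gy)
    # unsigned orientation in [0, 180) degrees, as used for HOG cells
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation
```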

Page 72: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

• Filtering with simple masks

uncentered

centered

cubic-corrected

diagonal

Sobel

remember SIFT?

Page 73: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

...after filtering, each “pixel” represents an oriented gradient...

Page 74: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

...pixels are grouped into “cells”; each pixel casts a weighted vote for an orientation histogram...

HOG (Histogram of Oriented Gradients)
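The cell-voting step can be sketched as below: each pixel votes into one of the orientation bins of its cell, weighted by its gradient magnitude. The paper additionally interpolates votes bilinearly between neighbouring bins and cells; this minimal version omits that refinement:

```python
import numpy as np

def cell_histograms(magnitude, orientation, cell=8, bins=9):
    """Accumulate per-cell orientation histograms.

    Each pixel casts a vote, weighted by its gradient magnitude, into
    one of `bins` unsigned-orientation bins over [0, 180). The cell
    size of 8 pixels and 9 bins follow common HOG settings; the
    bilinear vote interpolation used in the paper is omitted here."""
    h, w = magnitude.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((orientation / (180.0 / bins)).astype(int), bins - 1)
    for y in range(ch * cell):
        for x in range(cw * cell):
            hist[y // cell, x // cell, bin_idx[y, x]] += magnitude[y, x]
    return hist
```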

Page 75: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

a window can be represented like that

Page 76: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

then, cells are locally normalized using overlapping “blocks”

Page 77: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

they used two types of blocks

Page 78: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

Page 79: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

they used two types of blocks

• rectangular

• similar to SIFT (but dense)

• circular

• similar to Shape Context

Page 80: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

and four different types of block normalization

Page 81: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

and four different types of block normalization
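The four block-normalization schemes compared in the paper are L2-norm, L2-Hys (L2-norm followed by clipping and renormalizing), L1-norm, and L1-sqrt. A sketch over a concatenated block vector; the clipping threshold 0.2 is the value used by Dalal & Triggs:

```python
import numpy as np

def normalize_block(v, scheme="L2-Hys", eps=1e-6):
    """Normalize a block's concatenated histogram vector with one of
    the four schemes studied by Dalal & Triggs. `eps` guards against
    division by zero on empty blocks."""
    v = np.asarray(v, dtype=np.float64)
    if scheme == "L2":
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if scheme == "L2-Hys":
        v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
        v = np.minimum(v, 0.2)  # clip large values (threshold from the paper)
        return v / np.sqrt(np.sum(v ** 2) + eps ** 2)
    if scheme == "L1":
        return v / (np.sum(np.abs(v)) + eps)
    if scheme == "L1-sqrt":
        # histogram entries are non-negative, so the sqrt is safe
        return np.sqrt(v / (np.sum(np.abs(v)) + eps))
    raise ValueError(f"unknown scheme: {scheme}")
```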

Page 82: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

like SIFT, they gain invariance...

...to illuminations, small deformations, etc.

Page 83: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

finally, a sliding window is classified by a simple linear SVM
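The whole pipeline (dense features scored by a linear SVM over a sliding window) can be sketched as follows. `feature_fn`, the window size, stride, and threshold are illustrative stand-ins, not the paper's exact settings:

```python
import numpy as np

def sliding_window_detect(feature_fn, image, svm_w, svm_b,
                          win=(128, 64), stride=8, thresh=0.0):
    """Scan the image with a fixed-size window and score each window's
    feature vector with a linear SVM (score = w . x + b). Windows
    scoring above `thresh` are kept as detections. `feature_fn` stands
    in for the HOG extractor; all names here are illustrative."""
    detections = []
    H, W = image.shape
    wh, ww = win
    for y in range(0, H - wh + 1, stride):
        for x in range(0, W - ww + 1, stride):
            feats = feature_fn(image[y:y + wh, x:x + ww])
            score = float(np.dot(svm_w, feats) + svm_b)
            if score > thresh:
                detections.append((score, x, y))
    return detections
```

In practice the scan is repeated over a scale pyramid and overlapping detections are merged, steps this sketch leaves out.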

Page 84: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

during the learning phase, the algorithm “looked” for hard examples

Training

adapted from Martial Hebert
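The "hard examples" step works by scanning negative training images with the current detector, keeping the windows it wrongly scores as positive, and retraining on them. A NumPy-only sketch of the mining step, with illustrative names:

```python
import numpy as np

def mine_hard_negatives(w, b, negative_feats):
    """Given current linear-SVM parameters (w, b) and a matrix of
    feature vectors from negative (person-free) windows, return the
    false positives: the 'hard examples' that get appended to the
    training set before the SVM is retrained."""
    scores = negative_feats @ w + b
    return negative_feats[scores > 0]
```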

Page 85: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

average gradients

positive weights negative weights

Page 86: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Example

Page 87: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Example

adapted from Bill Triggs

Page 88: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Example

adapted from Martial Hebert

Page 89: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Further Development

Page 90: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Further Development

• Detection on Pascal VOC (2006)

Page 91: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

Page 92: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

• US Patent by MERL (2006)

Page 93: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Further Development

• Detection on Pascal VOC (2006)

• Human Detection in Movies (ECCV 2006)

• US Patent by MERL (2006)

• Stereo Vision HoG (ICVES 2008)

Page 94: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Extension example:

Pyramid HoG++

Page 95: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Extension example:

Pyramid HoG++

Page 96: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Extension example:

Pyramid HoG++

Page 97: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 98: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 99: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

VIDEO HERE

Page 100: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

VIDEO HERE

Page 101: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)
Page 102: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

so, it doesn’t work ?!?

Page 103: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

so, it doesn’t work ?!?

no no, it works...

Page 104: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

Page 105: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Object Recognition from Local Scale-Invariant Features

David G. Lowe

Computer Science Department

University of British Columbia

Vancouver, B.C., V6T 1Z4, Canada

[email protected]

Abstract

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in infe-

Lowe (1999)

Histograms of Oriented Gradients for Human Detection
Navneet Dalal and Bill Triggs

INRIA Rhone-Alps, 655 avenue de l’Europe, Montbonnot 38334, France
{Navneet.Dalal,Bill.Triggs}@inrialpes.fr, http://lear.inrialpes.fr


Dalal and Triggs (2005)

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb
University of Chicago
[email protected]

David McAllesterToyota Technological Institute at Chicago

[email protected]

Deva RamananUC Irvine

[email protected]

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty

This material is based upon work supported by the National Science Foundation under Grant No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

object categories. Figure 1 shows an example detection obtained with our person model.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by “conceptually weaker” models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining “hard negative” examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a


Felzenszwalb et al. (2008)

Page 106: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

This paper describes one of the best algorithms in object detection...

Page 107: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

They used the following methods:

HOG Features

Deformable Part Model

Latent SVM

Page 108: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

They used the following methods:

HOG Features

Introduced by Dalal & Triggs (2005)

Page 109: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

They used the following methods:

Deformable Part Model

Introduced by Fischler & Elschlager (1973)

Page 110: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

They used the following methods:

Latent SVM

Introduced by the authors

Page 111: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

HOG Features

Page 112: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Model Overview

detection, root filter, part filters, deformation models

Page 113: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

HOG Features

// 8x8 pixel blocks window

// features computed at different resolutions (pyramid)

Page 114: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

HOG Pyramid
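One way to sketch the pyramid: compute features over a geometric ladder of image scales, stopping once the detection window no longer fits. The scale step below is illustrative, not the paper's exact value:

```python
import numpy as np

def pyramid_scales(min_dim, win_dim, scale_step=1.07):
    """Geometric scale ladder for a feature pyramid: start at full
    resolution and shrink by `scale_step` per level until the
    detection window no longer fits the image's smaller dimension.
    The step value is an illustrative choice."""
    scales, s = [], 1.0
    while min_dim * s >= win_dim:
        scales.append(s)
        s /= scale_step
    return scales
```

In the deformable part model, part filters are placed at twice the resolution of the root filter, so pyramid levels one octave apart are paired during scoring.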

Page 115: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Deformable Part Model

Page 116: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Deformable Part Model

// each part is a local property

// springs capture spatial relationships

// here, the springs can be “negative”

Page 117: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 118: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

root filter

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 119: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

root filter

part filters

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 120: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

root filter

part filters

deformable model

Deformable Part Model

detection score = sum of filter responses - deformation cost

Page 121: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Deformable Part Model

annotations on the equation: filters; feature vector (at position p in the pyramid H); position relative to the root location; coefficients of a quadratic function on the placement; score of a placement
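The score of a placement is the sum of each filter's response at its position in the HOG pyramid, minus a quadratic deformation cost on each part's displacement from its anchor. A simplified sketch; the exact parameterization of the quadratic penalty here is a stand-in for the paper's:

```python
import numpy as np

def placement_score(filter_responses, displacements, deform_coeffs):
    """Score of one placement of the model.

    `filter_responses` holds each filter's dot product with the HOG
    features at its chosen pyramid position; `displacements` holds
    each part's (dx, dy) offset from its anchor; `deform_coeffs`
    holds per-part coefficients (ax, bx, ay, by) of the quadratic
    deformation penalty. This parameterization is a simplified
    illustration of the paper's quadratic cost."""
    score = float(np.sum(filter_responses))
    for (dx, dy), (ax, bx, ay, by) in zip(displacements, deform_coeffs):
        score -= ax * dx * dx + bx * dx + ay * dy * dy + by * dy
    return score
```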

Page 122: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Latent SVM

Page 123: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Latent SVM

filters and deformation parameters

features; part displacements
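A latent SVM scores an example as the maximum over latent values z (here, part placements) of the dot product between the model parameters and the feature map: f(x) = max_z beta . Phi(x, z). A minimal scoring sketch:

```python
import numpy as np

def latent_score(beta, candidate_features):
    """Latent-SVM scoring: f(x) = max over latent z of beta . Phi(x, z).

    Each row of `candidate_features` is Phi(x, z) for one candidate
    part placement z; `beta` concatenates the filters and deformation
    parameters. Enumerating candidates as a dense matrix is an
    illustration; the paper uses dynamic programming instead."""
    return float(np.max(candidate_features @ beta))
```

Training alternates between choosing the best z for each positive example (which makes the objective convex) and fitting beta, the "semi-convex" structure the abstract refers to.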

Page 124: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Latent SVM

Page 125: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Bonus

// Data Mining Hard Negatives

// Model Initialization

Page 126: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Results

Pascal VOC 2006

Page 127: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Results

Models learned

Page 128: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Experiments

~ Dalal’s model
~ Dalal’s + LSVM

Page 129: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Examples

errors

Page 130: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 131: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 132: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 133: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

A simple demo...

Page 134: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusions

Page 135: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusions

so, it doesn’t work ?!?

Page 136: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusions

so, it doesn’t work ?!?

no no, it works...

Page 137: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusions

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

Page 138: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusions

so, it doesn’t work ?!?

no no, it works...

...it just doesn’t work well...

...or there is a problem with the seat-computer interface...

Page 139: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)

Conclusion

Page 140: MIT 6.870 - Template Matching and Histograms (Nicolas Pinto, MIT)