Categorization by Learning and Combining Object Parts

B. Heisele, T. Serre, M. Pontil, T. Vetter, T. Poggio.

Presented by Manish Jethwa

Overview

Learn discriminatory components of objects with Support Vector Machine (SVM) classifiers.

Background

Global Approach
– Attempt to classify the entire object
– Successful when applied to problems in which the object pose is fixed.

Component-based techniques
– Individual components vary less than the whole object when the object pose changes
– Usable even when some of the components are occluded.

Linear Support Vector Machines

Linear SVMs are used to discriminate between two classes by determining the separating hyperplane.

Support Vectors

Decision function

The decision function of the SVM has the form:

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i \langle x_i \cdot x \rangle + b$$

where
– l is the number of training data points
– x_i are the training data points
– y_i ∈ {-1, 1} is the class label
– α_i are adjustable coefficients: the solution of a quadratic programming problem, positive weights for support vectors, zero for all other data points
– x is the new data point
– b is the bias

f(x) defines a hyperplane dividing the data. The sign of f(x) indicates the class of x.
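As a minimal sketch of the decision function above, the following snippet evaluates f(x) for a toy two-point training set. The data, the coefficient values, and the names `dot` and `decision_function` are illustrative assumptions, not taken from the slides.

```python
def dot(u, v):
    """Inner product <u . v> of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(x, train_x, train_y, alphas, bias):
    """f(x) = sum_i alpha_i * y_i * <x_i . x> + b."""
    return sum(a * y * dot(xi, x)
               for a, y, xi in zip(alphas, train_y, train_x)) + bias


# Toy problem (assumed): one training point per class on the x-axis.
train_x = [(1.0, 0.0), (-1.0, 0.0)]
train_y = [1, -1]
alphas = [0.5, 0.5]   # both points act as support vectors
bias = 0.0

value = decision_function((2.0, 3.0), train_x, train_y, alphas, bias)
label = 1 if value >= 0 else -1   # the sign of f(x) gives the class
```

Only points with a non-zero α_i contribute to the sum, so in practice the loop needs to run over the support vectors alone.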

Significance of α_i

Correspond to the weights of the support vectors.

Learned from training data set.

Used to compute the margin M of the support vectors to the hyperplane:

$$M = \left( \sqrt{\sum_{i=1}^{l} \alpha_i} \right)^{-1}$$
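The margin formula can be checked on a small example. The sketch below assumes the separable (hard-margin) case and a hand-picked toy solution. It compares the value from the α_i with the geometric margin 1/||w||, where w = Σ α_i y_i x_i. All values and the name `margin_from_alphas` are assumptions for illustration.

```python
import math


def margin_from_alphas(alphas):
    """M = 1 / sqrt(sum_i alpha_i), the margin formula from the slide."""
    return 1.0 / math.sqrt(sum(alphas))


# Assumed toy solution: support vectors (1, 0) and (-1, 0), labels +1 and -1.
train_x = [(1.0, 0.0), (-1.0, 0.0)]
train_y = [1, -1]
alphas = [0.5, 0.5]

margin = margin_from_alphas(alphas)

# Cross-check against 1 / ||w||, with w = sum_i alpha_i * y_i * x_i.
w = [sum(a * y * xi[d] for a, y, xi in zip(alphas, train_y, train_x))
     for d in range(2)]
geometric_margin = 1.0 / math.sqrt(sum(c * c for c in w))
```

Here both routes give a margin of 1, the distance from each support vector to the hyperplane x = 0.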

Non-separable Data

The notion of a margin extends to non-separable data also.

Misclassified points result in errors.

The hyperplane is now defined by maximizing the margin while minimizing the summed error.

The expected error probability of the SVM satisfies the following bound:

$$E P_{err} \le l^{-1} E[D^2 / M^2]$$

where D is the diameter of the sphere containing all training data.

Measuring Error

Error is dependent on the following ratio:

$$\rho = D^2 / M^2$$

Taking the ratio renders ρ invariant to scale:

$$\frac{D_1}{M_1} = \frac{D_2}{M_2} \quad \Rightarrow \quad \rho_1 = \rho_2$$
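The scale invariance of ρ can be illustrated numerically. This is a minimal sketch with assumed values for D and M: rescaling the data uniformly multiplies both lengths by the same factor, so the ratio is unchanged.

```python
def rho(diameter, margin):
    """rho = D^2 / M^2."""
    return diameter ** 2 / margin ** 2


# Assumed illustrative values for the diameter D and the margin M.
d1, m1 = 8.0, 2.0
scale = 4.0                        # uniform rescaling of the data
d2, m2 = scale * d1, scale * m1    # both lengths grow by the same factor

rho1 = rho(d1, m1)
rho2 = rho(d2, m2)
```

With these numbers both `rho1` and `rho2` evaluate to 16.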

Learning Components

[Figure: expansion of a component region (expansion left shown) and the resulting ρ]
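The slide shows only one expansion step and the quantity ρ, so the sketch below is a guess at how such a procedure could work. It assumes a greedy scheme: the region grows by one pixel in whichever direction gives the lowest ρ, and stops when no expansion lowers ρ. The callback `estimate_rho` is hypothetical and stands in for training an SVM on the candidate region and measuring ρ. The function `toy_rho` and the `TARGET` box are made up so that the example runs.

```python
def grow_component(seed, estimate_rho, width, height, max_steps=100):
    """Greedily grow a box (left, top, right, bottom), inclusive pixel indices."""
    box = seed
    best = estimate_rho(box)
    for _ in range(max_steps):
        left, top, right, bottom = box
        candidates = [
            (left - 1, top, right, bottom),   # expansion left
            (left, top - 1, right, bottom),   # expansion up
            (left, top, right + 1, bottom),   # expansion right
            (left, top, right, bottom + 1),   # expansion down
        ]
        # Keep only candidates that stay inside the image window.
        inside = [c for c in candidates
                  if c[0] >= 0 and c[1] >= 0 and c[2] < width and c[3] < height]
        if not inside:
            break
        cand_rho, cand = min((estimate_rho(c), c) for c in inside)
        if cand_rho >= best:
            break                             # no expansion lowers rho
        box, best = cand, cand_rho
    return box, best


# Toy stand-in for rho (assumed): distance of the box from an arbitrary target.
TARGET = (12, 14, 24, 22)


def toy_rho(box):
    return sum(abs(a - b) for a, b in zip(box, TARGET))


seed = (16, 16, 20, 20)   # a 5x5 seed region inside a 58x58 window
final_box, final_rho = grow_component(seed, toy_rho, 58, 58)
```

In a real run every call to `estimate_rho` would involve training a classifier, so the number of candidate expansions per step dominates the cost.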

Learning Facial Components

Extracting face components is time consuming
– Requires manually extracting each component from all training images.

Use textured head models instead
– Automatically produce a large number of faces under differing illumination and poses

Seven textured head models used to generate 2,457 face images of size 58x58

Negative Training Set

Extract 58x58 patches from 502 non-face images to give 10,209 negative training points.

Train SVM classifier on this data, then add false positives to the negative training set.

This enlarges the negative training set with those images which look most like faces.
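The bootstrapping step can be sketched as a loop: train, collect the non-face patches the classifier wrongly accepts, add them to the negatives, and retrain. The sketch uses a hypothetical `train` callback that returns a classifier function. `train_threshold` is a one-dimensional toy stand-in for SVM training, and all data values are assumptions.

```python
def bootstrap_negatives(positives, negatives, pool, train, max_rounds=5):
    """Add false positives from a pool of true negatives, then retrain."""
    negatives = list(negatives)
    for _ in range(max_rounds):
        classify = train(positives, negatives)
        false_positives = [x for x in pool
                           if classify(x) and x not in negatives]
        if not false_positives:
            break
        negatives.extend(false_positives)
    return train(positives, negatives), negatives


def train_threshold(positives, negatives):
    """Toy stand-in for SVM training: threshold midway between the classes."""
    threshold = (min(positives) + max(negatives)) / 2.0
    return lambda x: x > threshold


positives = [8, 9, 10]
initial_negatives = [0, 1]
non_face_pool = [2, 5, 6, 7]   # every pool item is a true negative

classifier, negatives = bootstrap_negatives(
    positives, initial_negatives, non_face_pool, train_threshold)
```

In this toy run the first classifier accepts 5, 6 and 7. After they are added as negatives, the retrained threshold rejects the whole pool.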

Learned Components

Start with fourteen manually selected 5x5 seed regions.

– The eyes (17x17 pixels)
– The nose (15x20 pixels)
– The mouth (31x15 pixels)
– The cheeks (21x20 pixels)
– The lip (13x16 pixels)
– The nostrils (22x12 pixels)
– The corners of the mouth (15x20 pixels)
– The eyebrows (15x20 pixels)
– The bridge of the nose (15x20 pixels)

Combining Components

[Figure: two-level detection architecture]
– Shift 58x58 window over input image
– Shift component experts over 58x58 window: Left Eye Expert, Nose Expert, Mouth Expert (each a linear SVM)
– Determine maximum output and its location
– Combining classifier (linear SVM)
– Final decision: face / background
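The "determine maximum output and its location" step can be sketched as follows. A linear component expert is slid over the window and the strongest response is kept together with its position. The tiny 4x4 window, the 2x2 weight template, and the function names are assumptions for illustration. The slides use 58x58 windows and SVM-trained experts.

```python
def expert_output(patch, weights, bias):
    """Linear expert: w . patch + b over a 2-D patch."""
    return sum(w * p
               for w_row, p_row in zip(weights, patch)
               for w, p in zip(w_row, p_row)) + bias


def best_response(window, weights, bias):
    """Slide the expert over the window; return (max output, x, y)."""
    h, w = len(weights), len(weights[0])
    best = None
    for top in range(len(window) - h + 1):
        for left in range(len(window[0]) - w + 1):
            patch = [row[left:left + w] for row in window[top:top + h]]
            out = expert_output(patch, weights, bias)
            if best is None or out > best[0]:
                best = (out, left, top)
    return best


# Toy input (assumed): a single bright pixel at column 1, row 2.
window = [[0.0] * 4 for _ in range(4)]
window[2][1] = 1.0
weights = [[1.0, 0.0], [0.0, 0.0]]   # responds to the patch's top-left pixel

result = best_response(window, weights, 0.0)
```

Running one such search per expert yields the outputs and positions that the combining classifier takes as input.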

Experiments

Training data for 3 classifiers:
– 2,457 faces, 13,654 non-face grey images

Test data
– 1,834 faces, 24,464 non-face grey images

Components vs. Whole face
– Components method performs better than benchmark: whole face detector, with 2nd degree polynomial SVM.

Results

Faces detected by the component-based classifier

Computer Examples

The Data

Used images from MIT CBCL faces dataset:
– Images are 19x19 pixels
– Greyscale
– Histogram equalized

Full training set contains
– 2,429 faces, 4,548 non-faces
– Only used 100 face examples, 300 non-face

Test set contains
– 472 faces, 23,573 non-faces
– All 24,045 images used

Learning ComponentsLearning Components

Left eye component:

Extract three negative examples for every positive one

– This biases the classifier toward correctly classifying non-face examples, which minimizes false positives.

Training set contains 400 examples:

– 100 left eye examples

– 300 non-left eye examples

Results

Left Eye learned using 400 examples:
– Training set classified 97% correctly
– Test set classified 96.2% correctly (of 24,045)

Right Eye learned using 400 examples:
– Training set classified 100% correctly
– Test set classified 96.5% correctly (of 24,045)

Mouth learned using 400 examples:
– Training set classified 95% correctly
– Test set classified 95.7% correctly (of 24,045)

Locating Learned Components

[Figure: histograms of the located left-eye position]

Left eye is reasonably well localized

Locating Learned Components

[Figure: histograms of component location; axes X Center vs. Frequency and Y Center vs. Frequency]

Histograms are fairly flat compared to face images

Face detector

Learned components
– Left eye 8x6 pixels
– Right eye 8x6 pixels
– Mouth 11x4 pixels

Each learned from 400 examples
– 100 positive
– 300 negative

Trained SVM with fixed location for components
– No shifting

Face Detector Training

The output O_i from each component i (distance from the hyperplane) is used together with the center (X_i, Y_i) location of the corresponding component.

All components are combined into one input feature vector:

$$X = (O_{left}, X_{left}, Y_{left}, O_{right}, X_{right}, Y_{right}, O_{mouth}, X_{mouth}, Y_{mouth})$$
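Assembling the combined feature vector is a matter of concatenating each component's output and center in a fixed order. The sketch below assumes that order is left eye, right eye, mouth, as in the formula. The numeric values are placeholders, not results from the slides.

```python
def build_feature_vector(components):
    """Concatenate (O, X, Y) for left eye, right eye, and mouth, in that order."""
    order = ("left", "right", "mouth")
    vector = []
    for name in order:
        output, x, y = components[name]
        vector.extend([output, x, y])
    return vector


# Placeholder values (assumed): expert output and component center per part.
components = {
    "left": (1.2, 5, 6),
    "right": (0.9, 13, 6),
    "mouth": (0.4, 9, 14),
}

features = build_feature_vector(components)
```

The resulting 9-dimensional vector is what the combining SVM is trained on.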

Results

The resulting SVM correctly classified all 400 training examples

For non-face examples:

– Placing component centers in random locations resulted in 100% correct classification.

– Placing component centers in identical positions to face examples resulted in 98.4% accuracy.

Is this the best face detector ever?

Performed well on the given dataset.
– Low resolution

But…
– Dataset contains centered faces
– Component positions were given
– Did not shift component window and look for maximum
– Did not have the opportunity to test against other algorithms

So…
– We will never know.
