Eigen Representations: Detecting faces in images
11-755 Machine Learning for Signal Processing
Eigen Representations: Detecting faces in images. Class 6, 15 Sep 2011
Instructor: Bhiksha Raj
20 Sep 2011 1
11755/18797
Administrivia Project teams? Project proposals?
TAs have updated timings and locations (on webpage)
20 Sep 2011 2
11755/18797
Last Lecture: Representing Audio
Basic DFT Computing a Spectrogram Computing additional features from a
spectrogram
20 Sep 2011 3
11755/18797
What about images?
DCT of small segments 8x8 Each image becomes a matrix of DCT vectors
DCT of the image Haar transform (checkerboard) Or data-driven representations..
DCT
Npixels / 64 columns
20 Sep 2011 4
11755/18797
Returning to Eigen Computation
A collection of faces All normalized to 100x100 pixels
What is common among all of them? Do we have a common descriptor?
20 Sep 2011 5
11755/18797
A least squares typical face
Can we do better than a blank screen to find the most common portion of faces? The first checkerboard; the zeroth frequency component..
Assumption: There is a “typical” face that captures most of what is common to all faces Every face can be represented by a scaled version of a typical face
What is this face?
Approximate every face f as f ≈ w_f V. Estimate V to minimize the squared error
How? What is V?
The typical face
20 Sep 2011 6
11755/18797
A collection of least squares typical faces
Assumption: There are a set of K “typical” faces that captures most of all faces
Approximate every face f as f ≈ w_f,1 V_1 + w_f,2 V_2 + w_f,3 V_3 + .. + w_f,K V_K
V_2 is used to “correct” errors resulting from using only V_1
So the total energy in w_f,2 (i.e. w_f,2²) must be less than the total energy in w_f,1 (w_f,1²)
V_3 corrects errors remaining after correction with V_2
The total energy in w_f,3 must be less than even that in w_f,2
And so on.. V = [V_1 V_2 V_3 ...]
Estimate V to minimize the squared error How? What is V?
20 Sep 2011 7
11755/18797
A recollection
M =
W =
V=PINV(W)*M
U = ?
20 Sep 2011 8
11755/18797
How about the other way?
W = M * Pinv(V)
M =
W = ??
V =
U =
20 Sep 2011 9
11755/18797
How about the other way?
U = W V ≈ M
M =
W = ??
V =
U =
?
20 Sep 2011 10
11755/18797
Eigen Faces!
Here W, V and U are ALL unknown and must be determined Such that the squared error between U and M is minimum
Eigen analysis allows you to find W and V such that U = WV has the least squared error with respect to the original data M
If the original data are a collection of faces, the columns of W represent the space of eigen faces.
M = Data Matrix
U = Approximation
V
W
20 Sep 2011 11
11755/18797
Eigen faces
Lay all faces side by side in vector form to form a matrix In my example: 300 faces. So the matrix is 10000 x 300
Multiply the matrix by its transpose The correlation matrix is 10000x10000
M = Data Matrix
M^T = Transposed Data Matrix (300 x 10000)
Correlation = M M^T : (10000 x 300) x (300 x 10000) = 10000 x 10000
20 Sep 2011 12
11755/18797
Eigen faces
Compute the eigen vectors Only 300 of the 10000 eigen values are non-zero
Why? Retain eigen vectors with high eigen values (>0)
Could use a higher threshold
[U,S] = eig(correlation)
[Figure: S is the diagonal matrix of eigenvalues (only 300 of the 10000 are non-zero); the columns of U are eigenface 1, eigenface 2, ...]
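A minimal Matlab/Octave sketch of this computation, for illustration only; the variable M is assumed to already hold one vectorized 100x100 face per column (10000 x 300), as described above.

C = M * M';                          % 10000 x 10000 correlation matrix
[U, S] = eig(C);                     % columns of U: eigenvectors; diag(S): eigenvalues
[vals, order] = sort(diag(S), 'descend');
U = U(:, order);                     % largest eigenvalue first
K = sum(vals > 1e-10);               % at most 300 non-zero eigenvalues for 300 faces
eigenfaces = U(:, 1:K);              % column k is the k-th eigenface
% View eigenface k as an image: imagesc(reshape(eigenfaces(:,k), 100, 100)); colormap gray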
20 Sep 2011 13
11755/18797
Eigen Faces
The eigen vector with the highest eigen value is the first typical face
The vector with the second highest eigen value is the second typical face.
Etc.
[Figure: the columns of U are eigenface 1, eigenface 2, eigenface 3, ...]
20 Sep 2011 14
11755/18797
Representing a face
The weights with which the eigen faces must be combined to compose the face are used to represent the face!
Face ≈ w1 × (eigenface 1) + w2 × (eigenface 2) + w3 × (eigenface 3) + ...
Representation = [w1 w2 w3 …. ]T
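Continuing the sketch above (the names eigenfaces and face are illustrative), the representation is just the projection of the face onto the retained eigenfaces:

w = eigenfaces' * face;              % weights [w1 w2 w3 ...]': the representation of the face
approx = eigenfaces * w;             % face reconstructed from the weights
err = sum((face - approx).^2);       % squared approximation error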
20 Sep 2011 15
11755/18797
Principal Component Analysis
Eigen analysis: Computing the “Principal” directions of a dataset. What do they mean? Why do we care?
20 Sep 2011 16
11755/18797
Principal Components == Eigen Vectors
Principal Component Analysis is the same as Eigen analysis
The “Principal Components” are the Eigen Vectors
20 Sep 2011 17
11755/18797
Principal Component Analysis
Which line through the mean leads to the smallest reconstruction error (sum of squared lengths of the blue lines) ?
20 Sep 2011 18
11755/18797
Principal Components
The first principal component is the first Eigen (“typical”) vector: X ≈ w_1(X) E_1
The first Eigen face. For non-zero-mean data sets, the average of the data
The second principal component is the second “typical” (or correction) vector: X ≈ w_1(X) E_1 + w_2(X) E_2
[Figure: a data point X decomposed along the principal directions E_1 and E_2, with X = w_1 E_1 + w_2 E_2]
20 Sep 2011 19
11755/18797
SVD instead of Eigen
Do we need to compute a 10000 x 10000 correlation matrix and then perform Eigen analysis? Will take a very long time on your laptop
SVD Only need to perform “Thin” SVD. Very fast
U = 10000 x 300 The columns of U are the eigen faces! The Us corresponding to the “zero” eigen values are not computed
S = 300 x 300 V = 300 x 300
M = Data Matrix
10000x300
M (10000 x 300) = U (10000 x 300) × S (300 x 300) × V^T (300 x 300)
[Figure: the columns of U are eigenface 1, eigenface 2, ...]
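A minimal sketch of the thin-SVD route, again assuming M is the 10000 x 300 data matrix from before; the columns of U come out in order of decreasing singular value, and the eigenvalues of M*M' are the squared singular values.

[U, S, V] = svd(M, 'econ');          % "thin"/economy SVD: U is 10000 x 300, S and V are 300 x 300
eigenfaces = U;                      % the columns of U are the eigenfaces
% S(k,k)^2 equals the k-th eigenvalue of the correlation matrix M*M'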
20 Sep 2011 20
11755/18797
NORMALIZING OUT VARIATIONS
20 Sep 2011 21
11755/18797
Images: Accounting for variations
What are the obvious differences in the above images
How can we capture these differences Hint – image histograms..
20 Sep 2011 22
11755/18797
Images -- Variations
Pixel histograms: what are the differences
20 Sep 2011 23
11755/18797
Normalizing Image Characteristics Normalize the pictures
Eliminate lighting/contrast variations All pictures must have “similar” lighting
How?
Lighting and contrast are represented in the image histograms:
20 Sep 2011 24
11755/18797
Histogram Equalization Normalize histograms of images
Maximize the contrast Contrast is defined as the “flatness” of the histogram For maximal contrast, every greyscale must happen as frequently as every other
greyscale
Maximizing the contrast: Flattening the histogram. Doing it for every image ensures that every image has the same contrast
I.e. exactly the same histogram of pixel values Which should be flat
0 255
20 Sep 2011 25
11755/18797
Histogram Equalization
Modify pixel values such that histogram becomes “flat”.
For each pixel New pixel value = f(old pixel value) What is f()?
Easy way to compute this function: map cumulative counts
20 Sep 2011 26
11755/18797
Cumulative Count Function
The histogram (count) of a pixel value X is the number of pixels in the image that have value X E.g. in the above image, the count of pixel value 180 is
about 110 The cumulative count at pixel value X is the total
number of pixels that have values in the range 0 <= x <= X CCF(X) = H(1) + H(2) + .. H(X)
20 Sep 2011 27
11755/18797
Cumulative Count Function
The cumulative count function of a uniform histogram is a line
We must modify the pixel values of the image so that its cumulative count is a line
20 Sep 2011 28
11755/18797
Mapping CCFs
CCF(f(x)) -> a*f(x) [or a*(f(x)+1) if pixels can take value 0] x = pixel value f() is the function that converts the old pixel value to a new
(normalized) pixel value a = (total no. of pixels in image) / (total no. of pixel levels)
The no. of pixel levels is 256 in our examples Total no. of pixels is 10000 in a 100x100 image
Move x axis levels around until the plot to the left looks like the plot to the right
20 Sep 2011 29
11755/18797
Mapping CCFs
For each pixel value x: Find the location on the red line that has the closest Y value
to the observed CCF at x
20 Sep 2011 30
11755/18797
Mapping CCFs
For each pixel value x: Find the location on the red line that has the closest Y value
to the observed CCF at x
x1
x2
f(x1) = x2
x3
x4
f(x3) = x4
Etc.
20 Sep 2011 31
11755/18797
Mapping CCFs
For each pixel in the image to the left The pixel has a value x Find the CCF at that pixel value CCF(x) Find x’ such that CCF(x’) in the function to the right equals
CCF(x) x’ such that CCF_flat(x’) = CCF(x)
Modify the pixel value to x’
Move x axis levels around until the plot to the left looks like the plot to the right
20 Sep 2011 32
11755/18797
Doing it Formulaically
CCFmin is the smallest non-zero value of CCF(x) The value of the CCF at the smallest observed pixel value
Npixels is the total no. of pixels in the image 10000 for a 100x100 image
Max.pixel.value is the highest pixel value 255 for 8-bit pixel representations
f(x) = round( (CCF(x) - CCF_min) / (Npixels - CCF_min) × Max.pixel.value )
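A minimal sketch of this mapping for an 8-bit greyscale image img (assumed already loaded); the built-in histeq on the next slide does essentially the same thing.

counts = histc(double(img(:)), 0:255);   % histogram H(x), one bin per grey level
CCF    = cumsum(counts);                 % cumulative count function
CCFmin = min(CCF(CCF > 0));              % CCF at the smallest observed pixel value
Npix   = numel(img);
f      = round((CCF - CCFmin) / (Npix - CCFmin) * 255);  % the mapping f(x) above
f      = max(f, 0);                      % values below the smallest observed level map to 0
eqimg  = uint8(f(double(img) + 1));      % apply f() to every pixel (+1 for Matlab indexing)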
20 Sep 2011 33
11755/18797
Or even simpler Matlab:
Newimage = histeq(oldimage)
20 Sep 2011 34
11755/18797
Histogram Equalization
Left column: Original image Right column: Equalized image All images now have similar contrast levels
20 Sep 2011 35
11755/18797
Eigenfaces after Equalization
Left panel : Without HEQ Right panel: With HEQ
Eigen faces are more face like.. Need not always be the case
20 Sep 2011 36
11755/18797
Detecting Faces in Images
20 Sep 2011 37
11755/18797
Detecting Faces in Images
Finding face like patterns How do we find if a picture has faces in it Where are the faces?
A simple solution: Define a “typical face” Find the “typical face” in the image
20 Sep 2011 38
11755/18797
Finding faces in an image
Picture is larger than the “typical face” E.g. typical face is 100x100, picture is 600x800
First convert to greyscale R + G + B Not very useful to work in color
20 Sep 2011 39
11755/18797
Finding faces in an image
Goal .. To find out if and where images that look like the “typical” face occur in the picture
20 Sep 2011 40
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 41
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 42
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 43
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 44
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 45
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 46
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 47
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 48
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
20 Sep 2011 49
11755/18797
Finding faces in an image
Try to “match” the typical face to each location in the picture
The “typical face” will explain some spots on the image much better than others These are the spots at which we probably have a
face!
20 Sep 2011 50
11755/18797
How to “match”
What exactly is the “match” What is the match “score”
The DOT Product Express the typical face as a vector Express the region of the image being evaluated as a vector
But first histogram equalize the region Just the section being evaluated, without considering the rest of the image
Compute the dot product of the typical face vector and the “region” vector
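A rough sketch of this scan, assuming a 100x100 typical face and a helper equalize() implementing the histogram-equalization mapping from earlier; all names are illustrative.

t = typicalface(:); t = t / norm(t);     % typical face as a unit-length vector
[R, C] = size(img);
score = -inf(R - 99, C - 99);            % one match score per candidate top-left corner
for r = 1 : R - 99
  for c = 1 : C - 99
    patch = equalize(double(img(r:r+99, c:c+99)));  % equalize just this region
    p = patch(:);
    score(r, c) = (p' * t) / norm(p);    % normalized dot product = match score
  end
end
% Peaks in "score" mark the likely face locations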
20 Sep 2011 51
11755/18797
What do we get
The right panel shows the dot product at various locations Redder is higher
The locations of peaks indicate locations of faces!
20 Sep 2011 52
11755/18797
What do we get
The right panel shows the dot product at various locations Redder is higher
The locations of peaks indicate locations of faces! Correctly detects all three faces
Likes George’s face most He looks most like the typical face
Also finds a face where there is none! A false alarm
20 Sep 2011 53
11755/18797
Scaling and Rotation Problems
Scaling Not all faces are the same size Some people have bigger faces The size of the face on the image
changes with perspective Our “typical face” only represents
one of these sizes
Rotation The head need not always be
upright! Our typical face image was
upright
20 Sep 2011 54
11755/18797
Solution
Create many “typical faces” One for each scaling factor One for each rotation
How will we do this? Match them all
Does this work Kind of .. Not well enough at all We need more sophisticated models
20 Sep 2011 55
11755/18797
Face Detection: A Quick Historical Perspective
Many more complex methods Use edge detectors and search for face like patterns Find “feature” detectors (noses, ears..) and employ them in
complex neural networks..
The Viola Jones method Boosted cascaded classifiers
But first, what is boosting
20 Sep 2011 56
11755/18797
And even before that – what is classification? Given “features” describing an entity, determine the
category it belongs to Walks on two legs, has no hair. Is this
A Chimpanzee A Human
Has long hair, is 5’4” tall, is this A man A woman
Matches “eye” pattern with score 0.5, “mouth pattern” with score 0.25, “nose” pattern with score 0.1. Are we looking at A face Not a face?
20 Sep 2011 57
11755/18797
Classification Multi-class classification
Many possible categories E.g. Sounds “AH, IY, UW, EY..” E.g. Images “Tree, dog, house, person..”
Binary classification Only two categories
Man vs. Woman Face vs. not a face..
Face detection: Recast as binary face classification For each little square of the image, determine if the square
represents a face or not
20 Sep 2011 58
11755/18797
Face Detection as Classification
Faces can be many sizes They can happen anywhere in the image For each face size
For each location Classify a rectangular region of the face size, at that location, as a
face or not a face This is a series of binary classification problems
For each square, run a classifier to find out if it is a face or not
20 Sep 2011 59
11755/18797
Introduction to Boosting An ensemble method that sequentially combines many simple
BINARY classifiers to construct a final complex classifier Simple classifiers are often called “weak” learners The complex classifiers are called “strong” learners
Each weak learner focuses on instances where the previous classifier failed
Give greater weight to instances that have been incorrectly classified by previous learners
Restrictions for weak learners Better than 50% correct
Final classifier is weighted sum of weak classifiers
20 Sep 2011 60
11755/18797
Boosting: A very simple idea One can come up with many rules to classify
E.g. Chimpanzee vs. Human classifier: If arms == long, entity is chimpanzee If height > 5’6” entity is human If lives in house == entity is human If lives in zoo == entity is chimpanzee
Each of them is a reasonable rule, but makes many mistakes Each rule has an intrinsic error rate
Combine the predictions of these rules But not equally Rules that are less accurate should be given lesser weight
20 Sep 2011 61
11755/18797
Boosting and the Chimpanzee Problem
The total confidence in all classifiers that classify the entity as a chimpanzee is
Score_chimpanzee = Σ (over classifiers that favor “chimpanzee”) α_classifier
The total confidence in all classifiers that classify it as a human is
Score_human = Σ (over classifiers that favor “human”) α_classifier
If Score_chimpanzee > Score_human, then our belief that we have a chimpanzee is greater than the belief that we have a human
[Figure: four weak classifiers with weights α_armlength (Arm length?), α_height (Height?), α_house (Lives in house?), α_zoo (Lives in zoo?), voting human, human, chimp, chimp respectively]
20 Sep 2011 62
11755/18797
Boosting as defined by Freund A gambler wants to write a program to predict winning horses. His
program must encode the expertise of his brilliant winner friend
The friend has no single, encodable algorithm. Instead he has many rules of thumb
He uses a different rule of thumb for each set of races E.g. “in this set, go with races that have black horses with stars on
their foreheads”
But cannot really enumerate what rules of thumbs go with what sets of races: he simply “knows” when he encounters a set
A common problem that faces us in many situations
Problem: How best to combine all of the friend’s rules of thumb What is the best set of races to present to the friend, to
extract the various rules of thumb
20 Sep 2011 63
11755/18797
Boosting. The basic idea: Can a “weak” learning
algorithm that performs just slightly better than random guessing be boosted into an arbitrarily accurate “strong” learner Each of the gambler’s rules may be just better
than random guessing
This is a “meta” algorithm, that poses no constraints on the form of the weak learners themselves
The gambler’s rules of thumb can be anything
20 Sep 2011 64
11755/18797
Boosting: A Voting Perspective Boosting can be considered a form of voting
Let a number of different classifiers classify the data Go with the majority Intuition says that as the number of classifiers increases,
the dependability of the majority vote increases
The corresponding algorithms were called Boosting by majority
A (weighted) majority vote taken over all the classifiers How do we compute weights for the classifiers? How do we actually train the classifiers
20 Sep 2011 65
11755/18797
ADA Boost: Adaptive algorithm for learning the weights. ADA Boost: Not named after Ada Lovelace. An adaptive algorithm that learns the weights
of each classifier sequentially Learning adapts to the current accuracy
Iteratively: Train a simple classifier from training data
It will make errors even on training data Train a new classifier that focuses on the training data
points that have been misclassified
20 Sep 2011 66
11755/18797
Red dots represent training data from Red class Blue dots represent training data from Blue class
Boosting: An Example
20 Sep 2011 67
11755/18797
Very simple weak learner A line that is parallel to one of the two axes
Blue classRed class
Boosting: An Example
20 Sep 2011 68
11755/18797
First weak learner makes many mistakes Errors coloured black
Blue classRed class
Boosting: An Example
20 Sep 2011 69
11755/18797
Second weak learner focuses on errors made by first learner
Blue classRed class
Blue class
Red class
Boosting: An Example
20 Sep 2011 70
11755/18797
Second strong learner: weighted combination of first and second weak learners‒ Decision boundary shown by black lines
BLUE
RED
RED
RED
Boosting: An Example
20 Sep 2011 71
11755/18797
The second strong learner also makes mistakes Errors colored black
BLUE
RED
RED
RED
Boosting: An Example
20 Sep 2011 72
11755/18797
Third weak learner concentrates on errors made by second strong learner
Blue class
Red class
BLUE
RED
RED
RED
Boosting: An Example
20 Sep 2011 73
11755/18797
Third weak learner concentrates on errors made by combination of previous weak learners
Continue adding weak learners until….
Blue classRed class
Blue class
Red class
Blue class
Red class
Boosting: An Example
20 Sep 2011 74
11755/18797
Boosting: An Example
Voila! Final strong learner: very few errors on the training data
20 Sep 2011 75
11755/18797
Boosting: An Example
The final strong learner has learnt a complicated decision boundary
20 Sep 2011 76
11755/18797
Boosting: An Example
The final strong learner has learnt a complicated decision boundary
Decision boundaries in areas with low density of training points assumed inconsequential
20 Sep 2011 77
11755/18797
Overall Learning Pattern Strong learner increasingly accurate with increasing
number of weak learners
Residual errors increasingly difficult to correct‒ Additional weak learners less and less effective
[Figure: error of the nth weak learner and error of the nth strong learner, plotted against the number of weak learners]
20 Sep 2011 78
11755/18797
ADABoost Cannot just add new classifiers that work well only
on the previously misclassified data
Problem: The new classifier will make errors on the points that the earlier classifiers got right Not good On test data we have no way of knowing which points were
correctly classified by the first classifier
Solution: Weight the data when training the second classifier Use all the data but assign them weights
Data that are already correctly classified have less weight Data that are currently incorrectly classified have more weight
20 Sep 2011 79
11755/18797
ADA Boost
The red and blue points (correctly classified) will have a weight < 1 Black points (incorrectly classified) will have a weight > 1 To compute the optimal second classifier, we minimize the total
weighted error Each data point contributes its weight to the total count of correctly and
incorrectly classified points E.g. if one of the red points is misclassified by the new classifier, the total
error of the new classifier goes up by that point's weight
20 Sep 2011 80
11755/18797
ADA Boost
Each new classifier modifies the weights of the data points based on the accuracy of the current classifier
The final classifier too is a weighted combination of all component classifiers
20 Sep 2011 81
11755/18797
Formalizing the Boosting Concept Given a set of instances (x1, y1), (x2, y2),… (xN, yN)
xi is the set of attributes of the ith instance
yi is the class for the ith instance yi can be 1 or -1 (binary classification only)
Given a set of classifiers h1, h2, … , hT
hi classifies an instance with attributes x as hi(x)
hi(x) is either -1 or +1 (for a binary classifier)
y*h(x) is 1 for all correctly classified points and -1 for incorrectly classified points
Devise a function f (h1(x), h2(x),…, hT(x)) such that classification based on f () is superior to classification by any hi(x)
The function is succinctly represented as f(x)
20 Sep 2011 82
11755/18797
The Boosting Concept A simple combiner function: Voting
f(x) = Σ_i h_i(x)
Classifier H(x) = sign(f(x)) = sign(Σ_i h_i(x)) Simple majority classifier
A simple voting scheme
A better combiner function: Boosting f(x) = Σ_i α_i h_i(x)
α_i can be any real number Classifier H(x) = sign(f(x)) = sign(Σ_i α_i h_i(x)) A weighted majority classifier
The weight α_i for any h_i(x) is a measure of our trust in h_i(x)
20 Sep 2011 83
11755/18797
Adaptive Boosting. As before:
y is either -1 or +1 H(x) is +1 or -1 If the instance is correctly classified, both y and
H(x) will have the same sign The product y.H(x) is 1 For incorrectly classified instances the product is -1
Define the error for x : ½(1 – yH(x)) For a correctly classified instance, this is 0 For an incorrectly classified instance, this is 1
20 Sep 2011 84
11755/18797
The ADABoost Algorithm. Given: a set (x1, y1), … (xN, yN) of training
instances xi is the set of attributes for the ith instance yi is the class for the ith instance and can be either
+1 or -1
20 Sep 2011 85
11755/18797
The ADABoost Algorithm
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
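A compact sketch of this loop; X (N x d features), y (labels in {-1,+1}) and the helper train_stump(), which returns a function handle for the weak classifier trained on the current distribution, are assumptions for illustration, not part of the slide.

N = size(X, 1);
D = ones(N, 1) / N;                          % D1(xi) = 1/N
T = 10;
alpha = zeros(T, 1); h = cell(T, 1);
for t = 1:T
  h{t}  = train_stump(X, y, D);              % weak learner trained on Dt (hypothetical helper)
  pred  = h{t}(X);                           % N x 1 predictions in {-1,+1}
  eps_t = sum(D .* 0.5 .* (1 - y .* pred));  % weighted training error
  alpha(t) = 0.5 * log((1 - eps_t) / eps_t);
  D = D .* exp(-alpha(t) * y .* pred);       % re-weight: up for mistakes, down for correct
  D = D / sum(D);                            % normalize to a distribution
end
% Final classifier on new points Xnew:
% H = sign(cell2mat(cellfun(@(f) f(Xnew), h', 'UniformOutput', false)) * alpha)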
20 Sep 2011 86
11755/18797
First, some example data
Face detection with multiple Eigen faces Step 0: Derived top 2 Eigen faces from eigen face training data Step 1: On a (different) set of examples, express each image as a linear
combination of Eigen faces Examples include both faces and non-faces Even the non-face images are explained in terms of the eigen faces
E1
E2
= 0.3 E1 - 0.6 E2
= 0.5 E1 - 0.5 E2
= 0.7 E1 - 0.1 E2
= 0.6 E1 - 0.4 E2
= 0.2 E1 + 0.4 E2
= -0.8 E1 - 0.1 E2
= 0.4 E1 - 0.9 E2
= 0.2 E1 + 0.5 E2
Image = a*E1 + b*E2 a = Image.E1/|Image|
20 Sep 2011 87
11755/18797
Training Data
ID E1 E2. Class
A 0.3 -0.6 +1
B 0.5 -0.5 +1
C 0.7 -0.1 +1
D 0.6 -0.4 +1
E 0.2 0.4 -1
F -0.8 -0.1 -1
G 0.4 -0.9 -1
H 0.2 0.5 -1
= 0.3 E1 - 0.6 E2
= 0.5 E1 - 0.5 E2
= 0.7 E1 - 0.1 E2
= 0.6 E1 - 0.4 E2
= 0.2 E1 + 0.4 E2
= -0.8 E1 - 0.1 E2
= 0.4 E1 - 0.9 E2
= 0.2 E1 + 0.5 E2
Face = +1, Non-face = -1
20 Sep 2011 88
11755/18797
The ADABoost Algorithm
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
20 Sep 2011 89
11755/18797
Training Data
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
= 0.3 E1 - 0.6 E2
= 0.5 E1 - 0.5 E2
= 0.7 E1 - 0.1 E2
= 0.6 E1 - 0.4 E2
= 0.2 E1 + 0.4 E2
= -0.8 E1 - 0.1 E2
= 0.4 E1 - 0.9 E2
= 0.2 E1 + 0.5 E2
20 Sep 2011 90
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 91
11755/18797
The E1“Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
[Figure: the eight instances placed on a number line by their wt(E1) value, in sorted order F (-0.8), E (0.2), H (0.2), A (0.3), G (0.4), B (0.5), D (0.6), C (0.7), each with weight 1/8; a candidate threshold position is marked]
Classifier based on E1: if sign*(wt(E1) - thresh) > 0, face = true
sign = +1 or -1
Sign = +1, error = 3/8; Sign = -1, error = 5/8
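A minimal sketch of how the best “stump” for a single feature can be found: sort the values, try a threshold between every pair of neighbours with both signs, and keep the combination with the lowest weighted error. This is one way to realise the train_stump() helper assumed earlier, restricted to one feature; all names are illustrative.

function [thresh, sgn, besterr] = best_stump_1d(v, y, D)
  % v: feature values (e.g. wt(E1)), y: labels in {-1,+1}, D: weights summing to 1 (column vectors)
  sv = sort(v);
  cand = [sv(1) - 1; (sv(1:end-1) + sv(2:end)) / 2; sv(end) + 1];  % candidate thresholds
  besterr = inf;
  for k = 1:numel(cand)
    for s = [1, -1]
      pred = sign(s * (v - cand(k)));
      pred(pred == 0) = 1;
      err = sum(D .* (pred ~= y));       % error = total weight of the misclassified instances
      if err < besterr
        besterr = err; thresh = cand(k); sgn = s;
      end
    end
  end
end
% On the table above this finds sign = +1, threshold = 0.45, error = 1/8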
20 Sep 2011 92
11755/18797
The E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 2/8Sign = -1, error = 6/8
threshold
20 Sep 2011 93
11755/18797
The E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 1/8Sign = -1, error = 7/8
threshold
20 Sep 2011 94
11755/18797
The E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 2/8Sign = -1, error = 6/8
threshold
20 Sep 2011 95
11755/18797
The E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 1/8Sign = -1, error = 7/8
threshold
20 Sep 2011 96
11755/18797
The E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 2/8Sign = -1, error = 6/8
threshold
20 Sep 2011 97
11755/18797
The Best E1 “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
Sign = +1, Threshold = 0.45, error = 1/8
threshold
20 Sep 2011 98
11755/18797
The E2“Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
-0.4 -0.1 0.4 0.5-0.6-0.9 -0.1-0.5
G A B D C F E H
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E2:if ( sign*wt(E2) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 3/8Sign = -1, error = 5/8
threshold
Note order
20 Sep 2011 99
11755/18797
The Best E2“Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 -0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
-0.4 -0.1 0.4 0.5-0.6-0.9 -0.1-0.5
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Classifier based on E2:if ( sign*wt(E2) > thresh) > 0) face = true
sign = -1Threshold = 0.15
Sign = -1, error = 2/8
threshold
G A B D C F E H
20 Sep 2011 100
11755/18797
The Best “Stump”
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
The Best overall classifier based on a single feature is based on E1
If (wt(E1) > 0.45) Face
Sign = +1, error = 1/8
threshold
20 Sep 2011 101
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 102
11755/18797
The Best Error
ID E1 E2. Class Weight
A 0.3 -0.6 +1 1/8
B 0.5 -0.5 +1 1/8
C 0.7 -0.1 +1 1/8
D 0.6 -0.4 +1 1/8
E 0.2 0.4 -1 1/8
F -0.8 0.1 -1 1/8
G 0.4 -0.9 -1 1/8
H 0.2 0.5 -1 1/8
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
The Error of the classifier is the sum of the weights of the misclassified instances
Sign = +1, error = 1/8
threshold
NOTE: THE ERROR IS THE SUM OF THE WEIGHTS OF MISCLASSIFIED INSTANCES
20 Sep 2011 103
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 104
11755/18797
Computing Alpha
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Alpha = 0.5ln((1-1/8) / (1/8))
= 0.5 ln(7) = 0.97
Sign = +1, error = 1/8
threshold
20 Sep 2011 105
11755/18797
The Boosted Classifier Thus Far
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
Alpha = 0.5ln((1-1/8) / (1/8))
= 0.5 ln(7) = 0.97
Sign = +1, error = 1/8
threshold
h1(X) = wt(E1) > 0.45 ? +1 : -1
H(X) = sign(0.97 * h1(X))
It’s the same as h1(x)
20 Sep 2011 106
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 107
11755/18797
The Best Error
ID E1 E2. Class Weight Weight
A 0.3 -0.6 +1 1/8 * 2.63 0.33
B 0.5 -0.5 +1 1/8 * 0.38 0.05
C 0.7 -0.1 +1 1/8 * 0.38 0.05
D 0.6 -0.4 +1 1/8 * 0.38 0.05
E 0.2 0.4 -1 1/8 * 0.38 0.05
F -0.8 0.1 -1 1/8 * 0.38 0.05
G 0.4 -0.9 -1 1/8 * 0.38 0.05
H 0.2 0.5 -1 1/8 * 0.38 0.05
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
threshold
D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
exp(α_t) = exp(0.97) = 2.63; exp(-α_t) = exp(-0.97) = 0.38
Multiply the correctly classified instances by 0.38Multiply incorrectly classified instances by 2.63
20 Sep 2011 108
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 109
11755/18797
The Best Error
ID E1 E2. Class Weight Weight Weight
A 0.3 -0.6 +1 1/8 * 2.63 0.33 0.48
B 0.5 -0.5 +1 1/8 * 0.38 0.05 0.074
C 0.7 -0.1 +1 1/8 * 0.38 0.05 0.074
D 0.6 -0.4 +1 1/8 * 0.38 0.05 0.074
E 0.2 0.4 -1 1/8 * 0.38 0.05 0.074
F -0.8 0.1 -1 1/8 * 0.38 0.05 0.074
G 0.4 -0.9 -1 1/8 * 0.38 0.05 0.074
H 0.2 0.5 -1 1/8 * 0.38 0.05 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
threshold
D’ = D / sum(D)
Multiply the correctly classified instances by 0.38Multiply incorrectly classified instances by 2.63Normalize to sum to 1.0
20 Sep 2011 110
11755/18797
The Best Error
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
1/8 1/8 1/8 1/8 1/8 1/8 1/8 1/8
threshold
D’ = D / sum(D)
Multiply the correctly classified instances by 0.38Multiply incorrectly classified instances by 2.63Normalize to sum to 1.0
20 Sep 2011 111
11755/18797
Initialize D_1(x_i) = 1/N
For t = 1, …, T:
Train a weak classifier h_t using distribution D_t
Compute total error on training data: ε_t = Σ_i D_t(x_i) ½(1 - y_i h_t(x_i))
Set α_t = ½ ln((1 - ε_t) / ε_t)
For i = 1… N, set D_{t+1}(x_i) = D_t(x_i) exp(-α_t y_i h_t(x_i))
Normalize D_{t+1} to make it a distribution
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The ADABoost Algorithm
20 Sep 2011 112
11755/18797
E1 classifier
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .48 .074 .074 .074 .074
threshold
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 0.222Sign = -1, error = 0.778
20 Sep 2011 113
11755/18797
E1 classifier
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 0.148Sign = -1, error = 0.852
.48 .074
20 Sep 2011 114
11755/18797
The Best E1 classifier
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
Classifier based on E1:if ( sign*wt(E1) > thresh) > 0) face = true
sign = +1 or -1
Sign = +1, error = 0.074
.48 .074
20 Sep 2011 115
11755/18797
The Best E2 classifier
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 -0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
-0.4 -0.1 0.4 0.5-0.6-0.9 -0.1-0.5
G A B D C F E H
.074 .48 .074 .074 .074 .074 .074 .074
threshold
Classifier based on E2:if ( sign*wt(E2) > thresh) > 0) face = true
sign = +1 or -1
Sign = -1, error = 0.148
20 Sep 2011 116
11755/18797
The Best Classifier
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48
B 0.5 -0.5 +1 0.074
C 0.7 -0.1 +1 0.074
D 0.6 -0.4 +1 0.074
E 0.2 0.4 -1 0.074
F -0.8 0.1 -1 0.074
G 0.4 -0.9 -1 0.074
H 0.2 0.5 -1 0.074
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
Classifier based on E1:if (wt(E1) > 0.45) face = true
Sign = +1, error = 0.074
.48 .074
Alpha = 0.5ln((1-0.074) / 0.074) = 1.26
20 Sep 2011 117
11755/18797
The Boosted Classifier Thus Far
h1(X) = wt(E1) > 0.45 ? +1 : -1
h2(X) = wt(E1) > 0.25 ? +1 : -1
H(X) = sign(0.97 * h1(X) + 1.26 * h2(X))
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
.48 .074
threshold
20 Sep 2011 118
11755/18797
Reweighting the Data
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48*0.28 0.32
B 0.5 -0.5 +1 0.074*0.28 0.05
C 0.7 -0.1 +1 0.074*0.28 0.05
D 0.6 -0.4 +1 0.074*0.28 0.05
E 0.2 0.4 -1 0.074*0.28 0.05
F -0.8 0.1 -1 0.074*0.28 0.05
G 0.4 -0.9 -1 0.074*3.5 0.38
H 0.2 0.5 -1 0.074*0.28 0.05
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
Sign = +1, error = 0.074
.48 .074
exp(alpha) = exp(1.26) = 3.5; exp(-alpha) = exp(-1.26) = 0.28
RENORMALIZE
20 Sep 2011 119
11755/18797
Reweighting the Data
0.3 0.5 0.6 0.70.2-0.8 0.40.2
F E H A G B C D
.074 .074 .074 .074 .074 .074
threshold
Sign = +1, error = 0.074
.48 .074
RENORMALIZE
NOTE: THE WEIGHT OF “G”, WHICH WAS MISCLASSIFIED BY THE SECOND CLASSIFIER, IS NOW SUDDENLY HIGH
20 Sep 2011 120
ID E1 E2. Class Weight
A 0.3 -0.6 +1 0.48*0.28 0.32
B 0.5 -0.5 +1 0.074*0.28 0.05
C 0.7 -0.1 +1 0.074*0.28 0.05
D 0.6 -0.4 +1 0.074*0.28 0.05
E 0.2 0.4 -1 0.074*0.28 0.05
F -0.8 0.1 -1 0.074*0.28 0.05
G 0.4 -0.9 -1 0.074*3.5 0.38
H 0.2 0.5 -1 0.074*0.28 0.05
11755/18797
AdaBoost In this example both of our first two classifiers were
based on E1 Additional classifiers may switch to E2
In general, the reweighting of the data will result in a different feature being picked for each classifier
This also automatically gives us a feature selection strategy In this data the wt(E1) is the most important feature
20 Sep 2011 121
11755/18797
AdaBoost NOT required to go with the best classifier so far For instance, for our second classifier, we might use
the best E2 classifier, even though it's worse than the E1 classifier So long as it's right more than 50% of the time
We can continue to add classifiers even after we get 100% classification of the training data Because the weights of the data keep changing Adding new classifiers beyond this point is often a
good thing to do
20 Sep 2011 122
11755/18797
ADA Boost
The final classifier is H(x) = sign(Σ_t α_t h_t(x))
The output is 1 if the total weight of all weak learners that classify x as 1 is greater than the total weight of all weak learners that classify it as -1
E1 E2
= 0.4 E1 - 0.4 E2
20 Sep 2011 123
11755/18797
Boosting and Face Detection Boosting forms the basis of the most
common technique for face detection today: The Viola-Jones algorithm.
20 Sep 2011 124
11755/18797
The problem of face detection 1. Defining Features
Should we be searching for noses, eyes, eyebrows etc.? Nice, but expensive
Or something simpler
2. Selecting Features Of all the possible features we can think of, which ones
make sense
3. Classification: Combining evidence How does one combine the evidence from the different
features?
20 Sep 2011 125
11755/18797
Features: The Viola Jones Method
Integral Features!! Like the Checkerboard
The same principle as we used to decompose images in terms of checkerboards: The image of any object has changes at various scales These can be represented coarsely by a checkerboard pattern
The checkerboard patterns must however now be localized Stay within the region of the face
B1 B2 B3 B4 B5 B6
Image ≈ w1 B1 + w2 B2 + w3 B3 + ...
20 Sep 2011 126
Features Checkerboard Patterns to represent facial features
The white areas are subtracted from the black ones. Each checkerboard explains a localized portion of the
image Four types of checkerboard patterns (only)
20 Sep 2011 127
11755/18797
Explaining a portion of the face with a checker..
How much is the difference in average intensity of the image in the black and white regions Sum(pixel values in white region) – Sum(pixel values in black
region) This is actually the dot product of the region of the face covered
by the rectangle and the checkered pattern itself White = 1, Black = -1
20 Sep 2011 128
11755/18797
“Integral” features
Each checkerboard has the following characteristics Length Width Type
Specifies the number and arrangement of bands
The four checkerboards above are the four used by Viola and Jones
20 Sep 2011 129
Integral images Summed area tables
For each pixel store the sum of ALL pixels to the left of and above it.
20 Sep 2011 130
11755/18797
Fast Computation of Pixel Sums
20 Sep 2011 131
11755/18797
Store pixel table for every pixel in the image The sum of all pixel values to the left of and above the pixel
Let A, B, C, D, E, F be the pixel table values at the locations shown Total pixel value of black area = D + A – B – C Total pixel value of white area = F + C – D – E Feature value = (F + C – D – E) – (D + A – B – C)
A Fast Way to Compute the Feature
[Figure: integral-image values A, B, C, D, E, F at the corners of the black and white rectangles]
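A small sketch of the summed-area table and the four-corner rectangle sum described above; img is an assumed greyscale image, and the example feature at the end is illustrative.

ii = cumsum(cumsum(double(img), 1), 2);                   % ii(r,c) = sum of img(1:r, 1:c)
ii = [zeros(1, size(ii,2)+1); zeros(size(ii,1),1), ii];   % pad so border rectangles work
% Sum over rows r1..r2 and columns c1..c2 from four look-ups (cf. D + A - B - C above)
rectsum = @(r1, r2, c1, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);
% A two-band checkerboard feature on a 24x24 window at (r, c): white half minus black half
feat = @(r, c) rectsum(r, r+23, c, c+11) - rectsum(r, r+23, c+12, c+23);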
20 Sep 2011 132
11755/18797
How many features?
Each checker board of width P and height H can start at (0,0), (0,1),(0,2), … (0, N-P) (1,0), (1,1),(1,2), … (1, N-P) .. (M-H,0), (M-H,1), (M-H,2), … ( M-H, N-P)
(M-H)*(N-P) possible starting locations Each is a unique checker feature
E.g. at one location it may measure the forehead, at another the chin
[Figure: an M x N image and a P x H checkerboard]
20 Sep 2011 133
11755/18797
How many features
Each feature can have many sizes Width from (min) to (max) pixels Height from (min ht) to (max ht) pixels
At each size, there can be many starting locations Total number of possible checkerboards of one type:
No. of possible sizes x No. of possible locations There are four types of checkerboards
Total no. of possible checkerboards: VERY VERY LARGE!
20 Sep 2011 134
11755/18797
Learning: No. of features Analysis performed on images of 24x24
pixels only Reduces the no. of possible features to about
180000 Restrict checkerboard size
Minimum of 8 pixels wide Minimum of 8 pixels high
Other limits, e.g. 4 pixels may be used too Reduces no. of checkerboards to about 50000
20 Sep 2011 135
11755/18797
No. of features
Each possible checkerboard gives us one feature A total of up to 180000 features derived from a 24x24 image! Every 24x24 image is now represented by a set of 180000
numbers This is the set of features we will use for classifying if it is a face
or not!
F1 F2 F3 F4 ….. F180000
7 9 2 -1 ….. 12
-11 3 19 17 ….. 2
14 Sep 2010 136
11755/18797
The Classifier The Viola-Jones algorithm uses a simple Boosting
based classifier Each “weak learner” is a simple threshold At each stage find the best feature to classify the
data with I.e the feature that gives us the best classification of all the
training data Training data includes many examples of faces and non-face
images The classification rule is of the kind
If feature > threshold, face (or if feature < threshold, face) The optimal value of “threshold” must also be determined.
20 Sep 2011 137
11755/18797
The Weak Learner Training (for each weak learner):
For each feature f (of all 180000 features): Find a threshold θ(f) and polarity p(f) (p(f) = -1 or p(f) = 1) such that the test
(p(f)·f > p(f)·θ(f)) performs the best classification of faces Lowest overall error in classifying all training data
Error counted over weighted samples Let the optimal overall error for f be error(f)
Find the feature f’ such that error(f’) is lowest The weak learner is the test (p(f’)·f’ > p(f’)·θ(f’)) ⇒ face
Note that the procedure for learning weak learners also identifies the most useful features for face recognition
20 Sep 2011 138
11755/18797
The Viola Jones Classifier A boosted threshold-based classifier First weak learner: Find the best feature, and
its optimal threshold Second weak learner: Find the best feature, for
the weighted training data, and its threshold (weighting from one weak learner) Third weak learner: Find the best feature for the
weighted data and its optimal threshold (weighting from two weak learners) Fourth weak learner: Find the best feature for the weighted
data and its optimal threshold (weighting from three weak learners) ..
20 Sep 2011 139
11755/18797
To Train Collect a large number of histogram
equalized facial images Resize all of them to 24x24 These are our “face” training set
Collect a much much much larger set of 24x24 non-face images of all kinds Each of them is histogram equalized These are our “non-face” training set
Train a boosted classifier
20 Sep 2011 140
11755/18797
The Viola Jones Classifier
During tests: Given any new 24x24 image
R = Σ_f α_f · (f > p_f θ(f)) Only a small number of features (typically fewer than 100) used
Problems: Only classifies 24 x 24 images entirely as faces or non-faces
Typical pictures are much larger They may contain many faces Faces in pictures can be much larger or smaller
Not accurate enough
20 Sep 2011 141
11755/18797
Multiple faces in the picture
Scan the image Classify each 24x24 rectangle from the photo All rectangles that get classified as having a face indicate the
location of a face For an NxM picture, we will perform (N-24)*(M-24) classifications If overlapping 24x24 rectangles are found to have faces, merge
them
14 Sep 2010 142
11755/18797
Multiple faces in the picture
Scan the image Classify each 24x24 rectangle from the photo All rectangles that get classified as having a face indicate the
location of a face For an NxM picture, we will perform (N-24)*(M-24) classifications If overlapping 24x24 rectangles are found to have faces, merge
them
14 Sep 2010 143
11755/18797
Multiple faces in the picture
Scan the image Classify each 24x24 rectangle from the photo All rectangles that get classified as having a face indicate the
location of a face For an NxM picture, we will perform (N-24)*(M-24) classifications If overlapping 24x24 rectangles are found to have faces, merge
them
14 Sep 2010 144
11755/18797
Multiple faces in the picture
Scan the image Classify each 24x24 rectangle from the photo All rectangles that get classified as having a face indicate the
location of a face For an NxM picture, we will perform (N-24)*(M-24) classifications If overlapping 24x24 rectangles are found to have faces, merge
them
14 Sep 2010 145
11755/18797
Picture size solution We already have a
classifier That uses weak
learners Scale each classifier
Every weak learner: Scale its size up by
factor α. Scale the threshold up to α²θ (the feature is a sum over an area, so it grows as the square of the scaling factor)
Do this for many scaling factors
[Figure: a weak learner (f, θ) scaled 2x becomes (f, α²θ)]
20 Sep 2011 146
11755/18797
Overall solution
Scan the picture with classifiers of size 24x24 Scale the classifier to 26x26 and scan Scale to 28x28 and scan etc.
Faces of different sizes will be found at different scales
20 Sep 2011 147
11755/18797
False Rejection vs. False detection False Rejection: There’s a face in the image, but the classifier
misses it Rejects the hypothesis that there’s a face
False detection: Recognizes a face when there is none.
Classifier: Standard boosted classifier: H(x) = sign(Σ_t α_t h_t(x))
Modified classifier: H(x) = sign(Σ_t α_t h_t(x) + Y) Σ_t α_t h_t(x) is a measure of certainty
The higher it is, the more certain we are that we found a face If Y is large, then we assume the presence of a face even when we
are not sure By increasing Y, we can reduce false rejection, while increasing
false detection
20 Sep 2011 148
11755/18797
ROC
Ideally false rejection will be 0%, false detection will also be 0%
As Y increases, we reject faces less and less But accept increasing amounts of garbage as faces
Can set Y so that we rarely miss a face
[Figure: ROC curve of % False Rejection (0 to 100) vs. % False Detection (0 to 100); the false positive vs. false negative trade-off is determined by Y, and the operating point moves along the curve as Y increases]
20 Sep 2011 149
11755/18797
Problem: Not accurate enough, too slow
If we set Y high enough, we will never miss a face But will classify a lot of junk as faces
Solution: Classify the output of the first classifier with a second classifier And so on.
[Figure: a cascade: Classifier 1 rejects windows as “Not a face”; survivors go to Classifier 2, which again rejects “Not a face”; and so on]
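A sketch of this cascade idea: each stage is a boosted classifier with its own bias Y, and a window is rejected as soon as any stage scores it below zero. The stages structure and its fields are assumptions for illustration.

function isface = cascade_classify(window, stages)
  % window: a candidate 24x24 region; stages(s).score returns sum_t alpha_t*h_t(window),
  % stages(s).Y is that stage's bias (set high so true faces are rarely rejected)
  isface = true;
  for s = 1:numel(stages)
    if stages(s).score(window) + stages(s).Y < 0
      isface = false;                  % rejected by this stage; later stages are never run
      return;
    end
  end
end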
20 Sep 2011 150
11755/18797
Useful Features Learned by Boosting
20 Sep 2011 152
11755/18797
A Cascade of Classifiers
20 Sep 2011 153
11755/18797
Detection in Real Images Basic classifier operates on 24 x 24 subwindows
Scaling: Scale the detector (rather than the images) Features can easily be evaluated at any scale Scale by factors of 1.25
Location: Move detector around the image (e.g., 1 pixel increments)
Final Detections A real face may result in multiple nearby detections Postprocess detected subwindows to combine overlapping
detections into a single detection
20 Sep 2011 154
Training In paper, 24x24 images of faces and non faces (positive and
negative examples).
20 Sep 2011 155
11755/18797
Sample results using the Viola-Jones Detector Notice detection at multiple scales
20 Sep 2011 156
11755/18797
More Detection Examples
20 Sep 2011 157
11755/18797
Practical implementation Details discussed in Viola-Jones paper
Training time = weeks (with 5k faces and 9.5k non-faces)
Final detector has 38 layers in the cascade, 6060 features
700 Mhz processor: Can process a 384 x 288 image in 0.067 seconds (in 2003
when paper was written)
20 Sep 2011 158