Robust real-time face detection Paul A. Viola and Michael J. Jones Intl. J. Computer Vision 57(2),...

Robust real-time face detection

Paul A. Viola and Michael J. Jones

Intl. J. Computer Vision

57(2), 137–154, 2004

(originally in CVPR’2001)(slides adapted from Bill Freeman, MIT 6.869, April 2005)

Robust Real-Time Face Detection

Scan classifier over locs. & scales


“Learn” classifier from data


Training Data
• 5000 faces (frontal)
• 10^8 non-faces
• Faces are normalized: scale, translation

Many variations
• Across individuals
• Illumination
• Pose (rotation both in plane and out)

Characteristics of Algorithm


• Feature set (…is huge, about 16M features)
• Efficient feature selection using AdaBoost
• New image representation: Integral Image
• Cascaded Classifier for rapid detection

Fastest known frontal face detector for gray scale images

Integral Image


Allows for fast feature evaluation:
• Do not work directly on image intensities
• Use features similar to Haar basis functions
• Compute the integral image using a few operations per pixel

Simple and Efficient Classifier


Select a small number of important features from a huge library of potential features using AdaBoost [Freund and Schapire,1995]

AdaBoost, Adaptive Boosting


Formulated by Yoav Freund and Robert Schapire.[1] It is a meta-algorithm that can be used in conjunction with many other learning algorithms to improve their performance.

AdaBoost is adaptive: subsequent classifiers are tweaked in favor of instances misclassified by previous classifiers.

It is sensitive to noisy data and outliers, but in some problems it is less susceptible to overfitting than most learning algorithms.

AdaBoost calls a weak classifier repeatedly in a series of rounds t = 1, ..., T. On each call, a distribution of weights Dt is updated that indicates the importance of each example in the data set.

On each round, the weights of incorrectly classified examples are increased (or, alternatively, the weights of correctly classified examples are decreased), so that the new classifier focuses more on those examples.

AdaBoost


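Given $(x_1, y_1), \ldots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$

Initialize $D_1(i) = \frac{1}{m}$, $i = 1, \ldots, m$

For $t = 1, \ldots, T$:

• Find the classifier $h_t : X \to \{-1, +1\}$ that minimizes the error with respect to the distribution $D_t$:

$$h_t = \arg\min_{h_j \in H} \epsilon_j, \qquad \epsilon_j = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h_j(x_i)]$$

  where $\epsilon_j$ is the weighted error rate of classifier $h_j$

• If $\epsilon_t \geq 0.5$, then stop

• Choose $\alpha_t \in \mathbb{R}$, typically $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$

• Update

$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$

  where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ sums to 1)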

AdaBoost


Output the final classifier:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

The equation to update the distribution $D_t$ is constructed so that

$$-\alpha_t y_i h_t(x_i) \begin{cases} < 0, & y_i = h_t(x_i) \\ > 0, & y_i \neq h_t(x_i) \end{cases}$$

After selecting an optimal classifier $h_t$ for the distribution $D_t$:
• Examples that the classifier identified correctly are weighted less
• Examples that it identified incorrectly are weighted more

When the algorithm tests the classifiers on the distribution $D_{t+1}$, it will therefore select a classifier that better identifies the examples that the previous classifier missed.
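The reweighting loop can be illustrated with a minimal sketch of discrete AdaBoost. It assumes the pool of candidate classifiers is supplied as precomputed ±1 prediction vectors; the function names `adaboost` and `predict` and the toy data are illustrative only and do not come from the slides.

```python
import numpy as np

def adaboost(preds, y, T):
    """Discrete AdaBoost over a fixed pool of candidate classifiers.

    preds: (n_classifiers, m) array of +/-1 predictions (assumed input format)
    y:     (m,) array of +/-1 labels
    Returns a list of (classifier_index, alpha) pairs.
    """
    m = y.shape[0]
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    chosen = []
    for _ in range(T):
        errs = (preds != y).astype(float) @ D      # weighted error of every candidate
        j = int(np.argmin(errs))
        eps = errs[j]
        if eps >= 0.5:                             # no candidate beats chance: stop
            break
        eps = max(eps, 1e-12)                      # avoid division by zero for a perfect classifier
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        chosen.append((j, alpha))
        D = D * np.exp(-alpha * y * preds[j])      # misclassified examples gain weight
        D /= D.sum()                               # divide by Z_t so the weights sum to 1
    return chosen

def predict(chosen, preds):
    """H(x) = sign(sum_t alpha_t h_t(x)), evaluated for every column of preds."""
    return np.sign(sum(a * preds[j] for j, a in chosen))

# Toy data: three candidates, each wrong on exactly one (different) example.
y = np.array([1, 1, -1, -1, 1, -1])
preds = np.array([
    [1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1],
    [-1, 1, -1, -1, 1, -1],
])
chosen = adaboost(preds, y, T=3)
```

On this toy set no single candidate is perfect, but the weighted vote of all three classifies every example correctly, because each round shifts weight onto the example the previous pick got wrong.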

Cascaded Classifier


Combining successively more complex classifiers in a cascade structure dramatically increases the speed of the detector by focusing attention on promising regions of the image.

Focus-of-attention approaches:
• It is often possible to rapidly determine where in an image a face might occur (Tsotsos et al., 1995; Itti et al., 1998; Amit and Geman, 1999; Fleuret and Geman, 2001)
• More complex processing is reserved only for these promising regions
• The key measure of such an approach is the “false negative” rate of the attentional process

Cascaded Classifier


Training process:
• An extremely simple and efficient classifier is trained
• It is used as a “supervised” focus-of-attention operator

A face detection attentional operator:
• Filters out over 50% of the image
• Preserves 99% of the faces (evaluated over a large dataset)

This filter is exceedingly efficient: it can be evaluated in 20 simple operations per location/scale.

Overview


• Features: form and computation
• Combining features to form a classifier: AdaBoost
• Constructing a cascade of classifiers
• Experimental results
• Discussion

Features


Using features rather than image pixels

Features act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data

Much faster than a pixel-based system

Image features


• “Rectangle filters” [Papageorgiou et al. 1998], similar to Haar wavelets

• Differences between sums of pixels in adjacent rectangles

• About 160000 rectangle features for a 24x24 sub-window

Integral Image


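Partial sum: the integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y).

The pixel sum within any rectangle D can be computed from the integral image values at its four corners (1 = top-left, 2 = top-right, 3 = bottom-left, 4 = bottom-right): D = 1 + 4 - (2 + 3)

Also known as:
• summed area tables [Crow84]
• boxlets [Simard98]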
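A minimal sketch of the integral image and the four-reference rectangle sum, using NumPy. The zero padding on the top and left edges and the function names (`integral_image`, `rect_sum`, `two_rect_feature`) are choices made for this illustration, not taken from the slides.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; a zero row and column pad the top and left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Rectangle sum from four array references: D = 1 + 4 - (2 + 3)."""
    bottom, right = top + height, left + width
    return ii[top, left] + ii[bottom, right] - (ii[top, right] + ii[bottom, left])

def two_rect_feature(ii, top, left, height, width):
    """Illustrative two-rectangle feature: left half minus right half (even width assumed)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
```

The padding removes the special cases at the image border: every rectangle, including one touching the top or left edge, uses the same four lookups regardless of its size.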

Huge library of filters


Feature Discussion


Rectangle features are primitive when compared with alternatives such as steerable filters.

Steerable filters are excellent for the detailed analysis of boundaries, image compression, and texture analysis.

Rectangle features are also sensitive to the presence of edges, bars, and other simple image structure, but they are:
• Quite coarse: only three orientations (|, X, --)
• Overcomplete: about 400 times, varying in aspect ratio and location

Computational Advantage


The face detector scans the input at many scales:
• Starting at the base scale, faces are detected at a size of 24 × 24 pixels
• A 384 × 288 pixel image is scanned at 12 scales, each a factor of 1.25 larger than the last

The conventional approach:
• Compute a pyramid of 12 images (each smaller than the previous one)
• Scan a fixed-scale detector across each image

Computing the pyramid directly requires significant time:
• Even implemented efficiently on conventional hardware (using bilinear interpolation to scale each level of the pyramid), it takes around 0.05 seconds to compute a 12-level pyramid of this size (on an Intel PIII 700 MHz processor)

Computational Advantage


Define a meaningful set of rectangle features:
• A single feature can be evaluated at any scale and location in a few operations
• Effective face detectors can be constructed with as few as two rectangle features

Computational efficiency of the features:
• The face detection process can be completed for an entire image at every scale at 15 frames per second
• This is about the same time required to evaluate the 12-level image pyramid alone

Learning Classification Functions


Given the feature set and a training set, any machine learning method could be used:
• Mixture of Gaussians model (Sung and Poggio, 1998)
• Simple image features and a neural network (Rowley et al. 1998)
• Support Vector Machine (Osuna et al. 1997b)
• Winnow learning procedure (Roth et al. 2000)

160000 features: even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive.

AdaBoost


A very small number of features can be combined to form an effective classifier.

Boosting the classification performance:
• Combine a collection of weak classification functions to form a stronger classifier
• Weak learner: do not expect even the best classification function to classify the training data well
• After the first round of learning, examples are re-weighted in order to emphasize those which were incorrectly classified by the previous weak classifier

The final strong classifier takes the form of a perceptron: a weighted combination of weak classifiers followed by a threshold.

The training error of the strong classifier approaches zero exponentially in the number of rounds.

AdaBoost


Selecting a small set of good classification functions:
• Select effective features which nevertheless have significant variety
• Restrict the weak learner to classification functions that each depend on a single feature
• Select the single rectangle feature which best separates the positive and negative examples

$$h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p f(x) < p \theta \\ 0 & \text{otherwise} \end{cases}$$

where $x$ is a 24x24 sub-window, $f$ is the feature, $\theta$ is the threshold, and $p$ is the polarity indicating the direction of the inequality.

AdaBoost


No single feature can perform the classification task with low error:
• Features selected early: error rates 0.1~0.3
• Features selected later: error rates 0.4~0.5

Weak classifiers are thresholded single features, that is, single-node decision trees (decision stumps).

Constructing the classifier


A perceptron yields a sufficiently powerful classifier.

Use AdaBoost to efficiently choose the best features:
• Add a new hi(x) at each round
• Each hi(xk) is a “decision stump” with threshold θ on the feature value x, taking the value a = Ew(y [x < θ]) below the threshold and b = Ew(y [x > θ]) above it

Constructing the Classifier


For each round of boosting:
• Evaluate each rectangle filter on each example
• Sort examples by filter values
• Select best threshold for each filter (min error); sorting makes it possible to quickly scan for the optimal threshold
• Select best filter/threshold combination
• Weight is a simple function of error rate
• Reweight examples

(There are many tricks to make this more efficient.)

AdaBoost using single rectangular feature


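Given example images $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively

Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for $y_i = 0, 1$ respectively, where $m$ and $l$ are the numbers of negative and positive examples

For $t = 1, \ldots, T$:

• Normalize the weights

$$w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$$

• Select the best weak classifier with respect to the weighted error

$$\epsilon_t = \min_{f, p, \theta} \sum_i w_{t,i} \, |h(x_i, f, p, \theta) - y_i|$$

• Define $h_t(x) = h(x, f_t, p_t, \theta_t)$, where $f_t$, $p_t$, $\theta_t$ are the parameters minimizing $\epsilon_t$

• Update the weights

$$w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i}$$

  where $e_i = 0$ if example $x_i$ is classified correctly, $e_i = 1$ otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$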

AdaBoost using single rectangular feature


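The final strong classifier:

$$C(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$

where $\alpha_t = \log \frac{1}{\beta_t}$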
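The procedure with 0/1 labels, single-feature stumps and the half-sum threshold can be sketched as follows. This is an illustrative toy version, not the authors' implementation: it assumes the feature values are precomputed in a matrix `F` (one row per example), uses a brute-force stump search instead of the sorted scan, and clamps the error away from 0 and 1 so that β stays finite. The names `best_stump`, `train` and `classify` are hypothetical.

```python
import numpy as np

def best_stump(F, y, w):
    """Brute-force search for the (error, feature, polarity, threshold) with least weighted error."""
    best = (np.inf, 0, 1, 0.0)
    for f in range(F.shape[1]):
        for theta in np.unique(F[:, f]):
            for p in (1, -1):
                h = (p * F[:, f] < p * theta).astype(int)   # h = 1 if p*f(x) < p*theta
                err = np.sum(w * np.abs(h - y))
                if err < best[0]:
                    best = (err, f, p, theta)
    return best

def train(F, y, T):
    m, l = np.sum(y == 0), np.sum(y == 1)           # numbers of negatives and positives
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))
    stumps = []
    for _ in range(T):
        w = w / w.sum()                             # normalize the weights
        err, f, p, theta = best_stump(F, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)       # assumption: clamp so beta is finite and non-zero
        beta = err / (1.0 - err)
        h = (p * F[:, f] < p * theta).astype(int)
        e = (h != y).astype(int)                    # e_i = 0 if correct, 1 otherwise
        w = w * beta ** (1 - e)                     # correctly classified examples lose weight
        stumps.append((f, p, theta, np.log(1.0 / beta)))
    return stumps

def classify(stumps, F):
    """C(x) = 1 if sum_t alpha_t h_t(x) >= 0.5 * sum_t alpha_t, else 0."""
    score = sum(a * (p * F[:, f] < p * theta) for f, p, theta, a in stumps)
    return (score >= 0.5 * sum(a for *_, a in stumps)).astype(int)

# Toy data: feature 1 separates the classes, feature 0 does not.
F = np.array([[3, 1], [1, 2], [2, 5], [0, 6]])
y = np.array([0, 0, 1, 1])
stumps = train(F, y, T=2)
```

On the toy data the first stump picks feature 1 with polarity -1 (values above the threshold are labelled positive), since that is the only split with zero weighted error.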

Good Reference on Boosting


Friedman, J., Hastie, T. and Tibshirani, R. Additive Logistic Regression: a Statistical View of Boosting

http://www-stat.stanford.edu/~hastie/Papers/boost.ps

“We show that boosting fits an additive logistic regression model by stagewise optimization of a criterion very similar to the log-likelihood, and present likelihood based alternatives. We also propose a multi-logit boosting procedure which appears to have advantages over other methods proposed so far.”

Learning Discussion


The set of weak classifiers is extraordinarily large:
• One weak classifier for each distinct feature/threshold combination
• KN weak classifiers, where K is the number of features and N is the number of examples
• Others have larger classifier sets

Cost of learning a classifier with M weak classifiers:
• Wrapper method: O(MNKN), about 10^16 operations
• AdaBoost: O(MKN), about 10^11 operations

Learning Discussion


Dependency on N?
• Suppose that the examples are sorted by a given feature value
• Any two thresholds that lie between the same pair of sorted examples are equivalent
• Therefore the total number of distinct thresholds is N

For each feature, sort the examples based on feature value, then compute the optimal threshold for that feature in a single pass over this sorted list. For each element in the list, compute:
• T+: the total sum of positive example weights
• T-: the total sum of negative example weights
• S+: the sum of positive weights below the current example
• S-: the sum of negative weights below the current example

Learning Discussion


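The error of a threshold which splits the sorted list between the current and the previous example: $e = \min\left(S^+ + (T^- - S^-),\; S^- + (T^+ - S^+)\right)$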
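The single-pass search over the sorted list can be sketched with cumulative sums. The sketch assumes the split error is min(S+ + (T- - S-), S- + (T+ - S+)), which follows from the four sums defined above: either everything below the split is labelled negative, or everything below it is labelled positive. The function name `best_threshold` is illustrative, and ties between equal feature values are not treated specially.

```python
import numpy as np

def best_threshold(values, y, w):
    """Return (minimum weighted error, threshold value) for one feature.

    values: feature value per example, y: 0/1 labels, w: example weights.
    """
    order = np.argsort(values)
    v_s, y_s, w_s = values[order], y[order], w[order]
    t_pos = w_s[y_s == 1].sum()                                  # T+
    t_neg = w_s[y_s == 0].sum()                                  # T-
    # S+ and S-: weight strictly below each element of the sorted list
    s_pos = np.concatenate(([0.0], np.cumsum(w_s * (y_s == 1))[:-1]))
    s_neg = np.concatenate(([0.0], np.cumsum(w_s * (y_s == 0))[:-1]))
    err = np.minimum(s_pos + (t_neg - s_neg),                    # below = negative, above = positive
                     s_neg + (t_pos - s_pos))                    # below = positive, above = negative
    k = int(np.argmin(err))
    return float(err[k]), float(v_s[k])

values = np.array([0.2, 0.9, 0.5, 0.7])
y = np.array([0, 1, 0, 1])
w = np.full(4, 0.25)
err, theta = best_threshold(values, y, w)
```

After the sort, every candidate threshold is evaluated with a constant amount of work, so the cost per feature is dominated by the O(N log N) sort.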

The final application demanded a very aggressive process which would discard the vast majority of features.

Other feature selection


Papageorgiou et al. 1998:
• Feature selection based on feature variance
• 37 features selected out of 1734, evaluated for every image sub-window: still large

Roth et al. 2000:
• Feature selection process based on the Winnow exponential perceptron learning rule
• A very large and unusual feature set: each pixel is mapped into a binary vector of d dimensions, and the vectors of all n pixels are concatenated into an nd-dimensional vector
• Perceptron: assigns a weight to each dimension
• Winnow learning process: converges to a solution where many of the weights are zero
• A very large number of features are still retained (perhaps a few hundred or thousand)

Learning Results


A classifier constructed from 200 features yields reasonable results: a false positive rate of 1 in 14084.

For a face detector to be practical for real applications, the false positive rate must be closer to 1 in 1,000,000.

Learning Results


Features selected by AdaBoost are meaningful and easily interpreted.

In terms of detection:
• Results are compelling but not sufficient for many real-world tasks

In terms of computation:
• Very fast, requiring 0.7 seconds to scan a 384 by 288 pixel image

Attentional Cascade


Achieves increased detection performance while radically reducing computation time.

Construct boosted classifiers that:
• Reject many of the negative sub-windows
• Detect almost all positive instances

Adjust the strong classifier threshold to minimize false negatives: a lower threshold yields higher detection rates (and higher false positive rates).

Attentional Cascade


Further processing

1. Evaluate the rectangle features (requires between 6 and 9 array references per feature).

2. Compute the weak classifier for each feature (requires one threshold operation per feature)

3. Combine the weak classifiers (requires one multiply per feature, an addition, and finally a threshold).
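The three steps, together with the early rejection that gives the cascade its speed, can be sketched as below. The data layout is an assumption made for illustration: each stage is a pair `(stumps, stage_threshold)`, and each stump is a tuple `(feature, polarity, theta, alpha)` whose `feature` is any callable on a sub-window. The toy "features" here operate on a two-element list rather than on an integral image.

```python
def stage_passes(window, stumps, stage_threshold):
    """One stage: evaluate features, threshold each one, combine the weighted votes."""
    score = 0.0
    for feature, polarity, theta, alpha in stumps:
        value = feature(window)                                   # 1. evaluate the feature
        h = 1 if polarity * value < polarity * theta else 0       # 2. weak classifier
        score += alpha * h                                        # 3. weighted combination
    return score >= stage_threshold                               #    followed by a threshold

def cascade_accepts(window, stages):
    """Reject as soon as any stage fails; only survivors reach the later stages."""
    for stumps, stage_threshold in stages:
        if not stage_passes(window, stumps, stage_threshold):
            return False
    return True

# Hypothetical two-stage cascade over toy windows of the form [a, b].
stages = [
    ([(lambda w: w[0] - w[1], -1, 5, 1.0)], 0.5),   # passes if a - b > 5
    ([(lambda w: w[0], 1, 8, 1.0)], 0.5),           # passes if a < 8
]
```

With these toy stages, `[7, 1]` passes both stages, `[10, 3]` is rejected by the second stage, and `[4, 3]` is rejected by the first stage without the second feature ever being evaluated.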

Attentional Cascade


Subsequent classifiers

Trading speed for accuracy


Given a nested set of classifier hypothesis classes

Computational Risk Minimization

Training a Cascade of Classifiers


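Detection goals:
• Good detection rates (85%~95%)
• Extremely low false positive rates (on the order of 10^-5 or 10^-6)

False positive rate of the cascade:

$$F = \prod_{i=1}^{K} f_i$$

Detection rate:

$$D = \prod_{i=1}^{K} d_i$$

where $K$ is the number of stages, and $f_i$ and $d_i$ are the false positive rate and detection rate of the $i$-th stage.

To achieve a detection rate of 0.9 with a 10-stage classifier:
• Each stage needs a detection rate of 0.99 (0.99^10 ≈ 0.9)
• Each stage can have a false positive rate of about 30% (0.30^10 ≈ 6 × 10^-6)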
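The arithmetic behind the 10-stage example can be checked in a few lines, assuming every stage has the same per-stage rates so that the products reduce to powers:

```python
stages = 10
d_stage, f_stage = 0.99, 0.30   # per-stage detection and false positive rates

D = d_stage ** stages           # overall detection rate
F = f_stage ** stages           # overall false positive rate
print(f"D = {D:.3f}, F = {F:.2e}")   # D = 0.904, F = 5.90e-06
```

The asymmetry is the point of the design: a per-stage false positive rate as poor as 30% still compounds to about six in a million, while the per-stage detection rate has to stay very close to 1.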

Training a Cascade of Classifiers


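The expected number of features evaluated per sub-window:

$$N = n_0 + \sum_{i=1}^{K} \left( n_i \prod_{j<i} p_j \right)$$

where $p_i$ is the positive rate of the $i$-th classifier and $n_i$ is the number of features in the $i$-th classifier.

A scheme for trading off these errors is to adjust the threshold of the perceptron produced by AdaBoost.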
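As a worked example of the expected feature count, the sketch below assumes the form N = n_0 + Σ_i n_i Π_{j<i} p_j, where each stage's cost is weighted by the fraction of sub-windows that survive all earlier stages. The stage sizes and positive rates are made-up numbers for illustration, not measurements.

```python
def expected_features(n, p):
    """n[i]: features in stage i; p[i]: fraction of sub-windows that stage i passes on."""
    total, reach = 0.0, 1.0
    for n_i, p_i in zip(n, p):
        total += reach * n_i    # stage i is only evaluated by windows that reached it
        reach *= p_i
    return total

# Hypothetical three-stage cascade: 2 + 0.5 * 10 + (0.5 * 0.2) * 25 = 9.5
N = expected_features([2, 10, 25], [0.5, 0.2, 0.1])
```

Because most sub-windows are rejected early, the expected cost stays close to the size of the first few stages, even if later stages are much larger.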

Tradeoffs in Training


Classifiers with more features:
• Achieve higher detection rates and lower false positive rates
• Require more time to compute

An optimization framework in which
• the number of classifier stages,
• the number of features, ni, of each stage, and
• the threshold of each stage
are traded off in order to minimize the expected number of features N given a target for F and D.

Finding this optimum is a tremendously difficult problem.

Training Cascaded Detector


A simple framework to produce an effective and efficient classifier:
• The user selects the maximum acceptable rate for fi and the minimum acceptable rate for di
• Each layer of the cascade is trained by AdaBoost, with the number of features used being increased until the target detection and false positive rates are met for this level
• The rates are determined by testing the current detector on a validation set
• If the overall target false positive rate is not yet met, then another layer is added to the cascade
• The negative set for training subsequent layers is obtained by collecting all false detections found by running the current detector on a set of images which do not contain any instances of faces

Training Cascaded Detector


• User selects values for f, the maximum acceptable false positive rate per layer, and d, the minimum acceptable detection rate per layer.

• User selects target overall false positive rate, F_target.

• P = set of positive examples, N = set of negative examples

• F0 = 1.0; D0 = 1.0; i = 0

• while Fi > F_target
  – i ← i + 1
  – ni = 0; Fi = Fi−1
  – while Fi > f × Fi−1
    ∗ ni ← ni + 1
    ∗ Use P and N to train a classifier with ni features using AdaBoost
    ∗ Evaluate current cascaded classifier on validation set to determine Fi and Di
    ∗ Decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d × Di−1 (this also affects Fi)
  – N ← ∅
  – If Fi > F_target, evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N

Simple Experiment


Compare a monolithic 200-feature classifier with a cascade of ten 20-feature classifiers.

Training uses 5000 faces + 10000 non-face sub-windows.

There is little difference between them in terms of accuracy, but the cascaded classifier is nearly 10 times faster, since its first stage throws out most non-faces so that they are never evaluated by subsequent stages.

Detector Cascade Discussion


Similar to Rowley et al. (1998), who trained two neural networks to build a fast detector:
• One was moderately complex, focused on a small region of the image, and detected faces with a low false positive rate
• The second was much faster, focused on a larger region of the image, and detected faces with a higher false positive rate

This method extends the two-stage cascade to 38 stages.

Training Dataset


4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels.

Structure of the Detector Cascade


The final detector is a 38-layer cascade of classifiers which includes a total of 6060 features:
• The first classifier, constructed using two features, rejects about 50% of non-faces while correctly detecting close to 100% of faces
• The next classifier has ten features and rejects 80% of non-faces while detecting almost 100% of faces
• The next two layers are 25-feature classifiers
• Then three 50-feature classifiers
• Then classifiers with a variety of different numbers of features, chosen according to the cascade training algorithm

Speed of Face Detector


Speed is proportional to the average number of features computed per sub-window.

On the MIT+CMU test set, an average of 9 features (out of 6061) are computed per sub-window.

On a 700 MHz Pentium III, a 384x288 pixel image takes about 0.067 seconds to process (15 fps).

Roughly 15 times faster than Rowley-Baluja-Kanade and 600 times faster than Schneiderman-Kanade.

Scanning The Detector


Multiple scales:
• Scaling is achieved by scaling the detector itself, rather than scaling the image
• The features can be evaluated at any scale with the same cost

Locations:
• Subsequent locations are obtained by shifting the window some number of pixels Δ
• The choice of Δ affects both speed and accuracy
• A step size > 1 pixel tends to decrease the detection rate slightly while also decreasing the number of false positives
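A minimal sketch of the scan: the detector window grows by a fixed factor while the image stays untouched. The base size of 24 and the factor 1.25 come from the slides; scaling the shift with the detector size (rounded, at least one pixel) is an assumption of this sketch, as is the generator name `scan_windows`.

```python
def scan_windows(img_w, img_h, base=24, scale_step=1.25, shift=1.0):
    """Yield (x, y, size) for every sub-window; the detector, not the image, is scaled."""
    scale = 1.0
    while True:
        size = int(round(base * scale))
        if size > min(img_w, img_h):
            break
        step = max(1, int(round(shift * scale)))    # assumed: shift grows with the scale
        for y in range(0, img_h - size + 1, step):
            for x in range(0, img_w - size + 1, step):
                yield x, y, size
        scale *= scale_step

# Distinct detector sizes visited on a 384 x 288 image.
sizes = sorted({s for _, _, s in scan_windows(384, 288, shift=4.0)})
```

For a 384 × 288 image this visits 12 detector sizes, from 24 up to 279 pixels, which matches the 12 scales quoted earlier in the slides.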

Integration of Multiple Detections


Postprocess: combine overlapping detections into a single detection.
• The set of detections is first partitioned into disjoint subsets
• Two detections are in the same subset if their bounding regions overlap
• Each partition yields a single final detection
• The corners of the final bounding region are the average of the corners of all detections in the set

This decreases the number of false positives.
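The partition-and-average step can be sketched with a small union-find. Two assumptions are made here: boxes are `(x1, y1, x2, y2)` tuples, and "same subset" is taken transitively, so a chain of pairwise-overlapping boxes ends up in one group. The function names are illustrative.

```python
def overlaps(a, b):
    """Boxes are (x1, y1, x2, y2); True if their interiors intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def merge_detections(boxes):
    """Group overlapping boxes (transitively) and average the corners of each group."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)   # union the two groups

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups.values()]

merged = merge_detections([(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)])
```

In the example the first two boxes overlap and collapse to their corner-wise average, while the isolated third box is returned unchanged.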

Integration of Multiple Detections


A simple voting scheme further improves results:
• Three detectors performed similarly on the final task, but in some cases their errors were different
• Retain only those detections where at least 2 out of 3 detectors agree
• This improves the final detection rate as well as eliminating more false positives
• Since detector errors are not uncorrelated, the combination results in a measurable, but modest, improvement over the best single detector

Sample results


MIT + CMU test set

Failure Cases


Trained on frontal, upright faces:
• The faces were only very roughly aligned, so there is some variation in rotation both in plane and out of plane
• Detects faces that are tilted up to about ±15 degrees in plane and about ±45 degrees out of plane (toward a profile view)
• The detector becomes unreliable with more rotation

Harsh backlighting, in which the faces are very dark while the background is relatively light, sometimes causes failures:
• Nonlinear variance normalization based on robust statistics could be used to remove outliers
• The problem with such a normalization is the greatly increased computational cost within the integral image framework

Fails on significantly occluded faces:
• Occluded eyes: detection usually fails
• A face with a covered mouth will usually still be detected

Summary (Viola-Jones)


• Fastest known face detector for gray images
• Three contributions with broad applicability:
  • Cascaded classifier yields rapid classification
  • AdaBoost as an extremely efficient feature selector
  • Rectangle Features + Integral Image can be used for rapid image analysis

Face detector comparison


Informal study by Andrew Gallagher, CMU, for CMU 16-721 Learning-Based Methods in Vision, Spring 2007:
• For Viola-Jones, the OpenCV implementation was used (<2 sec per image)
• For Schneiderman and Kanade, Object Detection Using the Statistics of Parts [IJCV’04], the www.pittpatt.com demo was used (~10-15 seconds per image, including web transmission)

[Figure: detection results, Schneiderman-Kanade vs. Viola-Jones]
