
PROJECT REPORT

On

Improvement of Viola Jones Face Detector using Pre-Processing

Submitted in partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY

in ELECTRONICS AND COMMUNICATION ENGINEERING

By

SHREYAS SESHADRI (10408611)

Under the guidance of

Mrs. G. REVATHI, M.E., (Assistant Professor (O.G.), School of Electronics and Communication Engineering)

FACULTY OF ENGINEERING AND TECHNOLOGY SRM UNIVERSITY, RAMAPURAM CAMPUS.

(Under section 3 of UGC Act, 1956) Chennai – 600089.

MAY, 2012


BONAFIDE CERTIFICATE

Certified that this project report titled Improvement of Viola Jones Face Detector using Pre-Processing is the bonafide work of SHREYAS SESHADRI (10408611), who carried out the project under my supervision.

H.O.D Internal Guide

Date:

Internal Examiner External Examiner


Certificate from Germany to be added


Attendance certificate from Germany to be added


ACKNOWLEDGEMENT

I place on record my deep sense of gratitude to our beloved Chancellor, Dr. T. R. PACHAMUTHU, for providing us with the requisite infrastructure throughout the course.

I take the opportunity to extend my hearty thanks to our Chairman, SRM

University, Ramapuram, Mr. R. SHIVAKUMAR, for his constant encouragement.

I convey my sincere thanks to our Dean, Dr. K. ABDUL GHANI, and Vice Principal, Dr. L. ANTONY MICHAEL RAJ, for their interest and support.

I take the privilege to extend my hearty thanks to the Head of the Department, Mrs. T. BEULA CHITRA PRIYA, for her suggestions, support and encouragement towards the successful completion of the project.

I thank my Internal Guide, Mrs. G. REVATHI, for her timely help and guidance throughout the project.

I would like to express my sincere thanks to all the staff members of the Department of Electronics and Communication, who gave many suggestions from time to time that made the project work better and more polished.

Finally, I am indebted to Prof. Dr.-Ing. BODO ROSENHAHN of Leibniz Universität Hannover and his team, who made me feel absolutely at home at the university. They were very helpful with technical inputs whenever required; but for their guidance, this project work would not have been possible.


ABSTRACT

This report describes my final-year Bachelor's project, undertaken at Leibniz Universität Hannover under the supervision of Prof. Dr.-Ing. Bodo Rosenhahn from November 2011 to March 2012.

Face detection has become an important feature in many mobile devices these days and is fundamental to a variety of human-computer interface systems. The project presented here proposes a method to improve the detection performance and computation time of the widely used Viola-Jones face detection framework. This is done using skin colour detection and Canny edge detection algorithms as pre-processing steps.

The project involves the creation of an iOS 5 app which detects faces in a live video feed. The same algorithm is implemented as a Mac OS project for testing. The performance of the proposed method was tested and showed good preliminary results.


TABLE OF CONTENTS

CHAPTER TITLE PAGE

ABSTRACT vi

TABLE OF CONTENTS vii

LIST OF FIGURES x

LIST OF TABLES xi

LIST OF CHARTS xii

1 INTRODUCTION 1

1.1 FACE DETECTION 1

1.2 VIOLA-JONES 1

1.3 PRE-PROCESSING 1

1.4 APPLICATIONS AND USES 2

2 THEORY 3

2.1 MACHINE LEARNING 3

2.1.1 ARTIFICIAL NEURAL NETWORK 3

2.1.2 PERCEPTRON 4

2.1.3 ADABOOSTING 5

2.2 VIOLA-JONES 5

2.2.1 INTEGRAL IMAGE 6

2.2.2 ADABOOST LEARNING 8

2.2.3 CASCADE ARCHITECTURE 9

2.3 CANNY EDGE DETECTION 10

2.3.1 NOISE REDUCTION 10

2.3.2 INTENSITY GRADIENTS 11

2.3.3 NON-MAXIMUM SUPPRESSION 11

2.3.4 ADDITIONAL STEPS 11

2.3.5 EXAMPLES 12

2.4 SKIN COLOUR DETECTION 13

Page 8: Improvement of Viola Jones Face Detector using Pre-Processingseshads1/papers/BachelorsThesis_ShreyasSes… · ii BONAFIDE CERTIFICATE Certified that this project report titled Improvement

viii

2.4.1 HSI MODEL 13

2.4.2 SKIN COLOUR 14

2.4.3 EXAMPLES 14

3 DETAILED EXPLANATION OF PROJECT 16

3.1 STRUCTURE OF PROJECT 16

3.1.1 iOS PROJECT 16

3.1.2 MAC PROJECT 17

3.2 TRAINING OF CLASSIFIERS 18

3.3 IMAGE PROCESSING 19

3.3.1 SKIN COLOUR DETECTION 19

3.3.2 RGB TO GREY 21

3.3.3 CANNY PRUNING 21

3.3.4 SLIDING WINDOW 23

3.3.5 PRE-PROCESSING 24

3.3.6 FACE DETECTION 25

4 EXPERIMENTAL ANALYSIS 27

4.1 THEORY 27

4.2 ANALYSIS 28

4.2.1 TEST SET OF IMAGES 28

4.2.2 DIFFERENT CASES 29

4.2.3 FPR AND TPR CALCULATION 30

4.2.4 ANALYSIS OF SPEED 31

4.3 EXPECTED RESULTS 31

5 SPECIFICATIONS 33

5.1 SOFTWARE 33

5.1.1 XCODE 33

5.1.2 MATLAB 33

5.1.3 LIBRARIES 33

5.2 HARDWARE 34

5.2.1 iMAC/ MACBOOK PRO 34

5.2.2 iPOD TOUCH 4/ iPAD 2 34

5.3 COMPUTERS USED IN TRAINING 34

Page 9: Improvement of Viola Jones Face Detector using Pre-Processingseshads1/papers/BachelorsThesis_ShreyasSes… · ii BONAFIDE CERTIFICATE Certified that this project report titled Improvement

ix

5.4 GIT REPOSITORY 35

5.5 SIMD 36

6 RESULTS 37

6.1 PERFORMANCE COMPARISON 37

6.2 SPEED COMPARISON 38

6.3 EXAMPLE IMAGES 39

7 APPENDIX 43

7.1 SKIN COLOUR DETECTION 43

7.2 CANNY PRUNING 44

8 CONCLUSION 51

9 FUTURE ENHANCEMENTS 52

REFERENCES 53

WEBSITES 54


LIST OF FIGURES

FIGURE NO. FIGURE TITLE PAGE

2.1 ARTIFICIAL NEURAL NETWORK 3

2.2 PERCEPTRON 4

2.3 EXAMPLE OF RECTANGULAR FEATURE 6

2.4 VALUE OF INTEGRAL IMAGE AT POINT (x,y) 7

2.5 SUM OF PIXELS IN RECTANGLE 8

2.6 FIRST 2 FEATURES SELECTED BY ADABOOST 9

2.7 CASCADED ARCHITECTURE 9

2.8 EXAMPLES 12

2.9 HSI MODEL 13

2.10 EXAMPLES 15

3.1 APP RUNNING ON iPOD TOUCH 4 17

3.2 EXAMPLE 20

3.3 EXAMPLE 21

3.4 EXAMPLE 22

3.5 EXAMPLE OF SCALING 23

3.6 VARIANCE NORMALIZATION 25

4.1 CONFUSION MATRIX 27

4.2 TEST IMAGES 29

5.1 STORAGE LEVELS OF GIT REPOSITORY 35

5.2 SIMD 36

6.1 EXAMPLE - 1 39

6.2 EXAMPLE - 2 40

6.3 EXAMPLE - 3 41

6.4 EXAMPLE - 4 42


LIST OF TABLES

TABLE NO. TABLE NAME PAGE

3.1 NUMBER OF RECTANGULAR FEATURES PER STAGE 18

6.1 PERFORMANCE COMPARISON 37

LIST OF CHARTS

CHART NO. CHART NAME PAGE

6.1 SPEED-UP COMPARISON 38


CHAPTER - I

INTRODUCTION

1.1 FACE DETECTION

Face detection is the computer technology by which the locations and sizes

of human faces in arbitrary digital images are obtained. Face detection can be regarded as a

specific case of object detection. In object detection, the task is to find the locations and

sizes of all objects in an image that belong to a particular class. Examples include cars, traffic lights and faces.

Face detection has become an important feature in most mobile devices

these days and is fundamental to a variety of human-computer interface systems. Face

detection is used in biometrics, often as an initial step to a face recognition system. It is

also used in video surveillance, human computer interface and image database

management. Some recent mobile devices use face detection for autofocus.

1.2 VIOLA JONES FACE DETECTION FRAMEWORK

The Viola-Jones face detection framework is one of the most widely used

methodologies for face detection. The method was proposed by Paul Viola and Michael J. Jones in 2001 [10]. The framework is based on the computation of rectangular features, which refer to differences between the pixel sums of two or more adjacent, equally sized

rectangles within a sub-window of an image. The problem with this method is that it is

computationally intensive when run on mobile devices. This is not a big problem for

dedicated devices such as digital cameras, which are designed to perform this task, but on a general-purpose device such as a smartphone the method has to be made more efficient.

1.3 PRE-PROCESSING

The most obvious choice for improvement is the use of pre-processing. The

idea behind pre-processing is that regions in the image which can be easily classified as

non-faces are eliminated before entering the actual Viola Jones face detection framework.

1

Page 13: Improvement of Viola Jones Face Detector using Pre-Processingseshads1/papers/BachelorsThesis_ShreyasSes… · ii BONAFIDE CERTIFICATE Certified that this project report titled Improvement

This increases the computational efficiency as well as reduces the number of false

positives in the final output.

This project analyzes the performance of the Viola-Jones face detection

framework with skin colour detection and Canny edge detection as the pre-processing steps. Skin colour detection is used to quickly reject sub-windows of the image which have too few skin-coloured pixels. Edge detection is used so that sub-windows having too few or too many edges to be identified as a face are eliminated early.
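To make the gating idea concrete, here is a rough Python sketch of such a pre-processing gate using OpenCV. The HSV skin range, the edge-count bounds and the function name are illustrative assumptions, not the project's exact implementation; the project itself uses an HSI-based skin rule and its own thresholds, described in later chapters.

```python
import cv2
import numpy as np

def window_may_contain_face(bgr_window, min_skin_frac=0.4,
                            min_edges=50, max_edges=5000):
    """Cheap pre-processing gate: reject a sub-window before running the
    Viola-Jones cascade if it has too few skin-coloured pixels or an
    implausible number of edge pixels. All thresholds are illustrative."""
    # Skin test: fraction of pixels inside a simple HSV skin range
    # (a stand-in for the HSI rule used in the project).
    hsv = cv2.cvtColor(bgr_window, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))
    if np.count_nonzero(skin) / skin.size < min_skin_frac:
        return False
    # Edge test: the Canny edge count must fall in a band plausible for a face.
    gray = cv2.cvtColor(bgr_window, cv2.COLOR_BGR2GRAY)
    edge_count = np.count_nonzero(cv2.Canny(gray, 50, 150))
    return min_edges <= edge_count <= max_edges
```

Only sub-windows passing both tests are handed to the detector, so the expensive feature evaluation runs on a fraction of the image.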

1.4 APPLICATIONS AND USES

This is the age of smartphones. The smartphone functions as one integrated tool for personal and business needs and is becoming an extended arm of the modern user. It is therefore evident that any work taken up to improve an application relevant to smartphones will be immensely useful.


CHAPTER - II

THEORY

2.1 MACHINE LEARNING

Machine learning is a branch of artificial intelligence concerned with the

design and development of algorithms that allow computers to evolve behaviors based on

empirical data, i.e. data produced by observation or experiment [a]. A learner can capture

characteristics of interest from the data set. This data can be seen as examples that illustrate

relations between observed variables. The major focus of machine learning research is to make computers automatically learn to recognize complex patterns and make intelligent decisions based on data.

2.1.1 ARTIFICIAL NEURAL NETWORK

An Artificial Neural Network (ANN) is a mathematical or computational

model that is inspired by the way biological nervous systems, such as the brain, process

information. A neural network consists of an interconnected group of artificial neurons. An

ANN is configured for a specific application, such as pattern recognition or data

classification, through a learning process.

Conventional computers use an algorithmic approach i.e. they follow a set

of instructions in order to solve a problem. Unless the specific steps that the computer

needs to follow are known the computer cannot solve the problem. That restricts the

problem solving capability of conventional computers to problems that we already

understand and know how to solve. Neural networks, on the other hand, learn by example.

They cannot be programmed to perform a specific task. They are trained with a set of

specific examples of a particular problem, and learn how to solve that problem.

Fig. 2.1 - Artificial Neural Network (ANN) [l]


Fig. 2.1 shows an example of a simple ANN. Every ANN consists of the

input, output and hidden layers. Each of these layers has a set of units which represent

weights that manipulate the data in the calculations. ANNs can be classified as [p, c]:

• Feedforward ANNs, where information travels only in one direction, i.e. from the input

layer to the output layer. Fig. 2.1 is an example of a simple feedforward network.

• Feedback ANNs, where signals travel in both directions.

2.1.2 PERCEPTRON

A perceptron is the simplest form of a feedforward network. It is a binary

classifier which maps its input x (a real-valued vector) to an output value f(x) (a single

binary value) as shown in Eq. (2.1):

\[ f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.1} \]

Here w is a vector of real-valued weights, w.x is the dot product which

computes a weighted sum, and b is the 'bias', a constant term that does not depend on any

input value. The value of f(x) (0 or 1) is used to classify x as either a positive or a negative

instance. If b is negative, then the weighted combination of inputs must produce a positive

value greater than | b | in order to push the classifier neuron over the 0 threshold.

Fig. 2.2 - Perceptron [h]

Fig. 2.2 shows a simple perceptron with seven inputs, each of which is

multiplied by a corresponding weight and passed through the function f(x) to give an

output y.
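As a minimal sketch of Eq. (2.1) in Python (the weights, bias and inputs below are made-up illustrative values, not taken from the project):

```python
import numpy as np

def perceptron(x, w, b):
    """Binary perceptron of Eq. (2.1): returns 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Two inputs with illustrative weights and bias.
x = np.array([1.0, 0.5])
w = np.array([0.5, -0.4])
b = -0.1
print(perceptron(x, w, b))  # 1, since 0.5*1.0 - 0.4*0.5 - 0.1 = 0.2 > 0
```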


2.1.3 ADABOOSTING

Boosting, a machine learning algorithm, refers to a general and provably

effective method of producing a very accurate prediction rule by combining rough and

moderately inaccurate rules of thumb[12]. It is the method of combining a set of weak

learners to create a single strong learner. A weak learner is defined to be a classifier which

is only slightly correlated with the true classification i.e. for a given problem the weak

learner may only classify the training data correctly 51% of the time. It is only slightly

better than random guessing. In contrast, a strong learner is a classifier that is arbitrarily

well-correlated with the true classification.[10]

Kearns and Valiant [5,6] were the first to pose the question of whether a

“weak” learning algorithm which performs just slightly better than random guessing can be

“boosted” into an arbitrarily accurate “strong” learning algorithm. Schapire [8] came up

with the first provable polynomial-time boosting algorithm in 1989.

AdaBoost is short for Adaptive Boosting, formulated by Yoav Freund and Robert E. Schapire in 1995 [12]. AdaBoost is adaptive in the sense that subsequent

classifiers built are tweaked in favor of those instances misclassified by previous

classifiers.

AdaBoost is an algorithm for constructing a strong classifier as linear

combination of simple weak classifiers ht(x), as shown in Eq. (2.2):

\[ H(x) = \operatorname{sign}\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right) \tag{2.2} \]

Here ht(x) is a weak classifier, H(x) is the strong or final classifier, and αt is the weight or importance assigned to each weak classifier ht(x).
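A minimal sketch of Eq. (2.2) in Python, with three hand-made threshold stumps standing in for learned weak classifiers (all values are illustrative assumptions):

```python
def strong_classify(x, weak_classifiers, alphas):
    """Eq. (2.2): sign of the alpha-weighted vote of the weak classifiers.
    Each weak classifier maps x to -1 or +1."""
    vote = sum(a * h(x) for h, a in zip(weak_classifiers, alphas))
    return 1 if vote >= 0 else -1

# Illustrative stumps on a scalar feature, with their weights.
h1 = lambda x: 1 if x > 0.2 else -1
h2 = lambda x: 1 if x > 0.5 else -1
h3 = lambda x: 1 if x < 0.8 else -1
print(strong_classify(0.6, [h1, h2, h3], [0.9, 0.6, 0.4]))  # prints 1
```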

2.2 VIOLA-JONES FACE DETECTION FRAMEWORK

The Viola-Jones face detection framework is one of the most widely used

methodologies for face detection. It was proposed by Paul Viola and Michael J. Jones in 2001 [10] and describes a face detection framework based on the computation of


rectangular features. It achieves high detection rates with relatively low computation time.

It has the following major innovations:

• Integral Image - a new image representation for fast computation of rectangular features

• AdaBoost learning algorithm - a set of classifiers built using the AdaBoost algorithm

• Cascaded Architecture - An efficient method for combining the classifiers to reduce

computation.

The following sections take an in-depth look at each of these.

2.2.1 INTEGRAL IMAGE

This face detection procedure classifies images based on the value of

simple rectangular features. Three kinds of features are used, as shown in Fig. 2.3.

Fig. 2.3 - Example of rectangular features [j]

In Fig. 2.3 the features A and B show a two-rectangle feature that is the

difference between the sum of the pixel values within two adjacent rectangular

regions (shown as grey and white). Feature C shows a three-rectangle feature, which computes the sum of pixel values within the two white (outer) rectangles subtracted from the sum in the grey (central) rectangle. Feature D shows a four-rectangle feature, which computes the difference of pixel sums between diagonal pairs of rectangles.

These rectangle features can be computed very rapidly using an

intermediate representation of the image called the integral image. It is based on the summed-area table, an algorithm for quickly and efficiently generating the

sum of values in a rectangular subset of a grid. It was first introduced to the computer


graphics world in 1984, for texture mapping [1]. The value of the integral image at point

(x,y) contains the sum of the pixels above and to the left of (x,y), inclusive, on the original image, as shown in Eq. (2.3):

\[ ii(x, y) = \sum_{x' \le x, \, y' \le y} i(x', y') \tag{2.3} \]

Here ii(x, y) is the integral image and i(x, y) is the original image. This can be computed in one pass over the original image using the pair of recursive equations shown in Eq. (2.4):

\[ s(x, y) = s(x, y - 1) + i(x, y), \qquad ii(x, y) = ii(x - 1, y) + s(x, y) \tag{2.4} \]

Here s(x, y) is the cumulative row sum, where s(x, −1) = 0, and ii(−1, y) =

0. Fig. 2.4 shows the value of the integral image at a point (x,y) as the sum of the pixels in

the shaded region.

Fig. 2.4 - Value of integral image at a point (x,y) [10]

Using the integral image any rectangular sum can be computed very easily

as shown in Fig. 2.5. Here the sum of the pixels within rectangle D is to be calculated. The

value of the integral image at location 1 is the sum of the pixels in rectangle A. The value

at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within

D can be computed as 4 + 1 − (2 + 3).

Fig. 2.5 - Sum of pixels in rectangle D is computed as integral image values at locations 4 + 1 − (2 + 3) [10]
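As a minimal sketch of Eqs. (2.3) and (2.4) and the four-reference rectangle sum of Fig. 2.5 (function names and the test image are illustrative, not taken from the project code):

```python
import numpy as np

def integral_image(img):
    """Eq. (2.4) in effect: cumulative sums along rows, then along columns."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle [top..bottom] x [left..right],
    computed as 4 + 1 - (2 + 3) from four integral-image values (Fig. 2.5)."""
    total = ii[bottom, right]                      # location 4: A+B+C+D
    if top > 0:
        total -= ii[top - 1, right]                # location 2: A+B
    if left > 0:
        total -= ii[bottom, left - 1]              # location 3: A+C
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]             # location 1: A
    return total

# Illustrative 4x4 image; verify against a direct pixel sum.
img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()
```

A two-rectangle feature is then just the difference of two such rectangle sums, so each feature costs only a handful of array references regardless of its size.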

2.2.2 ADABOOST LEARNING ALGORITHM

The weak classifier in the Viola-Jones face detection model is shown in Eq. (2.5), where a weak classifier h(x, f, p, θ) consists of a rectangular feature f, a threshold θ and a polarity p indicating the direction of the inequality, and where x is a 24 × 24 pixel sub-window of the image:

\[ h(x, f, p, \theta) = \begin{cases} 1 & \text{if } p f(x) < p \theta \\ 0 & \text{otherwise} \end{cases} \tag{2.5} \]

This framework is based on the AdaBoost learning algorithm and uses it to

both select the features and train the classifier. The learning algorithm is as shown below [10]; a sketch of this loop in code follows the list.

• Given example images (x1, y1),...,(xn, yn) where yi = 0, 1 for negative and positive examples respectively.

• Initialize weights \( w_{1,i} = \frac{1}{2m}, \frac{1}{2l} \) for yi = 0, 1 respectively, where m and l are the number of negatives and positives respectively.

• For t = 1,...,T:

a. Normalize the weights, \( w_{t,i} \leftarrow w_{t,i} / \sum_{j=1}^{n} w_{t,j} \).

b. Select the best weak classifier with respect to the weighted error, \( \epsilon_t = \min_{f,p,\theta} \sum_i w_i \, | h(x_i, f, p, \theta) - y_i | \).

c. Define \( h_t(x) = h(x, f_t, p_t, \theta_t) \), where ft, pt, and θt are the minimizers of εt.

d. Update the weights as \( w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i} \), where ei = 0 if example xi is classified correctly, ei = 1 otherwise, and \( \beta_t = \epsilon_t / (1 - \epsilon_t) \).

• The final strong classifier is

\[ C(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases} \]

where \( \alpha_t = \log \frac{1}{\beta_t} \).
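The following Python sketch implements this loop with thresholded single-feature stumps as the weak classifiers. The exhaustive stump search and the data layout are assumptions for illustration; the project's actual training over the full rectangular-feature set is described in Chapter 3.

```python
import numpy as np

def train_adaboost(F, y, T):
    """F: (n_samples, n_features) array of feature values; y: 0/1 labels.
    Returns T stumps as (feature index, polarity, threshold, alpha)."""
    n, k = F.shape
    m, l = np.sum(y == 0), np.sum(y == 1)
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))  # initial weights
    stumps = []
    for _ in range(T):
        w = w / w.sum()                                  # a. normalize
        best, best_err = None, np.inf
        for f in range(k):                               # b. best weak classifier
            for theta in np.unique(F[:, f]):
                for p in (1, -1):
                    h = (p * F[:, f] < p * theta).astype(int)  # Eq. (2.5)
                    err = np.sum(w * np.abs(h - y))
                    if err < best_err:
                        best, best_err = (f, p, theta), err
        f, p, theta = best                               # c. define h_t
        beta = max(best_err, 1e-12) / (1.0 - best_err)   # guard against err = 0
        h = (p * F[:, f] < p * theta).astype(int)
        e = (h != y).astype(int)
        w = w * beta ** (1 - e)                          # d. reweight examples
        stumps.append((f, p, theta, np.log(1.0 / beta)))
    return stumps

def classify(x, stumps):
    """Final strong classifier C(x): weighted vote against half the alphas."""
    vote = sum(a * int(p * x[f] < p * t) for f, p, t, a in stumps)
    return int(vote >= 0.5 * sum(a for _, _, _, a in stumps))
```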

8140 Viola and Jones

Figure 3. The sum of the pixels within rectangle D can be computedwith four array references. The value of the integral image at location1 is the sum of the pixels in rectangle A. The value at location 2 isA + B, at location 3 is A + C , and at location 4 is A + B + C + D.The sum within D can be computed as 4 + 1 ! (2 + 3).

(1999). The authors point out that in the case of linearoperations (e.g. f · g), any invertible linear operationcan be applied to f or g if its inverse is applied to theresult. For example in the case of convolution, if thederivative operator is applied both to the image and thekernel the result must then be double integrated:

f " g =! !

( f # " g#).

The authors go on to show that convolution can besignificantly accelerated if the derivatives of f and gare sparse (or can be made so). A similar insight is thatan invertible linear operation can be applied to f if itsinverse is applied to g:

( f ##) "" ! !

g#

= f " g.

Viewed in this framework computation of the rect-angle sum can be expressed as a dot product, i ·r , wherei is the image and r is the box car image (with value1 within the rectangle of interest and 0 outside). Thisoperation can be rewritten

i · r =" ! !

i#

· r ##.

The integral image is in fact the double integral of theimage (first along rows and then along columns). Thesecond derivative of the rectangle (first in row and thenin column) yields four delta functions at the corners of

the rectangle. Evaluation of the second dot product isaccomplished with four array accesses.

2.2. Feature Discussion

Rectangle features are somewhat primitive whencompared with alternatives such as steerable filters(Freeman and Adelson, 1991; Greenspan et al., 1994).Steerable filters, and their relatives, are excellent for thedetailed analysis of boundaries, image compression,and texture analysis. While rectangle features are alsosensitive to the presence of edges, bars, and other sim-ple image structure, they are quite coarse. Unlike steer-able filters, the only orientations available are vertical,horizontal and diagonal. Since orthogonality is not cen-tral to this feature set, we choose to generate a verylarge and varied set of rectangle features. Typically therepresentation is about 400 times overcomplete. Thisovercomplete set provides features of arbitrary aspectratio and of finely sampled location. Empirically it ap-pears as though the set of rectangle features providea rich image representation which supports effectivelearning. The extreme computational efficiency of rect-angle features provides ample compensation for theirlimitations.

In order to appreciate the computational advantageof the integral image technique, consider a more con-ventional approach in which a pyramid of images iscomputed. Like most face detection systems, our de-tector scans the input at many scales; starting at thebase scale in which faces are detected at a size of24 $ 24 pixels, a 384 by 288 pixel image is scannedat 12 scales each a factor of 1.25 larger than the last.The conventional approach is to compute a pyramid of12 images, each 1.25 times smaller than the previousimage. A fixed scale detector is then scanned acrosseach of these images. Computation of the pyramid,while straightforward, requires significant time. Imple-mented efficiently on conventional hardware (using bi-linear interpolation to scale each level of the pyramid) ittakes around .05 seconds to compute a 12 level pyramidof this size (on an Intel PIII 700 MHz processor).5

In contrast we have defined a meaningful set of rect-angle features, which have the property that a singlefeature can be evaluated at any scale and location in afew operations. We will show in Section 4 that effec-tive face detectors can be constructed with as few as tworectangle features. Given the computational efficiencyof these features, the face detection process can be com-pleted for an entire image at every scale at 15 frames per

Fig. 2.5 - Sum of pixels in rectangle D is computed as integral image values at locations 4 + 1 − (2 + 3) [10]

Robust Real-Time Face Detection 141

second, about the same time required to evaluate the 12level image pyramid alone. Any procedure which re-quires a pyramid of this type will necessarily run slowerthan our detector.

3. Learning Classification Functions

Given a feature set and a training set of positive andnegative images, any number of machine learning ap-proaches could be used to learn a classification func-tion. Sung and Poggio use a mixture of Gaussian model(Sung and Poggio, 1998). Rowley et al. (1998) use asmall set of simple image features and a neural net-work. Osuna et al. (1997b) used a support vector ma-chine. More recently Roth et al. (2000) have proposeda new and unusual image representation and have usedthe Winnow learning procedure.

Recall that there are 160,000 rectangle features as-sociated with each image sub-window, a number farlarger than the number of pixels. Even though eachfeature can be computed very efficiently, computingthe complete set is prohibitively expensive. Our hy-pothesis, which is borne out by experiment, is that avery small number of these features can be combinedto form an effective classifier. The main challenge is tofind these features.

In our system a variant of AdaBoost is used bothto select the features and to train the classifier (Freundand Schapire, 1995). In its original form, the AdaBoostlearning algorithm is used to boost the classificationperformance of a simple learning algorithm (e.g., itmight be used to boost the performance of a simple per-ceptron). It does this by combining a collection of weakclassification functions to form a stronger classifier. Inthe language of boosting the simple learning algorithmis called a weak learner. So, for example the percep-tron learning algorithm searches over the set of possibleperceptrons and returns the perceptron with the lowestclassification error. The learner is called weak becausewe do not expect even the best classification function toclassify the training data well (i.e. for a given problemthe best perceptron may only classify the training datacorrectly 51% of the time). In order for the weak learnerto be boosted, it is called upon to solve a sequence oflearning problems. After the first round of learning, theexamples are re-weighted in order to emphasize thosewhich were incorrectly classified by the previous weakclassifier. The final strong classifier takes the form of aperceptron, a weighted combination of weak classifiersfollowed by a threshold.6

The formal guarantees provided by the AdaBoostlearning procedure are quite strong. Freund andSchapire proved that the training error of the strongclassifier approaches zero exponentially in the numberof rounds. More importantly a number of resultswere later proved about generalization performance(Schapire et al., 1997). The key insight is that gen-eralization performance is related to the margin of theexamples, and that AdaBoost achieves large marginsrapidly.

The conventional AdaBoost procedure can be eas-ily interpreted as a greedy feature selection process.Consider the general problem of boosting, in which alarge set of classification functions are combined usinga weighted majority vote. The challenge is to associatea large weight with each good classification functionand a smaller weight with poor functions. AdaBoost isan aggressive mechanism for selecting a small set ofgood classification functions which nevertheless havesignificant variety. Drawing an analogy between weakclassifiers and features, AdaBoost is an effective pro-cedure for searching out a small number of good “fea-tures” which nevertheless have significant variety.

One practical method for completing this analogy isto restrict the weak learner to the set of classificationfunctions each of which depend on a single feature.In support of this goal, the weak learning algorithm isdesigned to select the single rectangle feature whichbest separates the positive and negative examples (thisis similar to the approach of Tieu and Viola (2000) inthe domain of image database retrieval). For each fea-ture, the weak learner determines the optimal thresholdclassification function, such that the minimum num-ber of examples are misclassified. A weak classifier(h(x, f, p, ! )) thus consists of a feature ( f ), a thresh-old (! ) and a polarity (p) indicating the direction of theinequality:

h(x, f, p, ! ) =!

1 if p f (x) < p!

0 otherwise

Here x is a 24 ! 24 pixel sub-window of an image.In practice no single feature can perform the classifi-

cation task with low error. Features which are selectedearly in the process yield error rates between 0.1 and0.3. Features selected in later rounds, as the task be-comes more difficult, yield error rates between 0.4 and0.5. Table 1 shows the learning algorithm.

The weak classifiers that we use (thresholded singlefeatures) can be viewed as single node decision trees.

(2.5)142 Viola and Jones

Table 1. The boosting algorithm for learning a query online.T hypotheses are constructed each using a single feature. Thefinal hypothesis is a weighted linear combination of the T hy-potheses where the weights are inversely proportional to thetraining errors.

• Given example images (x1, y1), . . . , (xn, yn) whereyi = 0, 1 for negative and positive examples respectively.

• Initialize weights w1,i = 12m , 1

2l for yi = 0, 1 respectively,where m and l are the number of negatives and positivesrespectively.

• For t = 1, . . . , T :

1. Normalize the weights, wt,i ! wt,i!nj=1 wt, j

2. Select the best weak classifier with respect to theweighted error

!t = min f,p,"

"

i

wi | h(xi , f, p, " ) " yi | .

See Section 3.1 for a discussion of an efficientimplementation.

3. Define ht (x) = h(x, ft , pt , "t ) where ft , pt , and "t

are the minimizers of !t .4. Update the weights:

wt+1,i = wt,i #1"eit

where ei = 0 if example xi is classified correctly, ei = 1otherwise, and #t = !t

1"!t.

• The final strong classifier is:

C(x) =

#1

T"

t=1

$t ht (x) # 12

T"

t=1

$t

0 otherwise

where $t = log 1#t

Such structures have been called decision stumps inthe machine learning literature. The original work ofFreund and Schapire (1995) also experimented withboosting decision stumps.

3.1. Learning Discussion

The algorithm described in Table 1 is used to selectkey weak classifiers from the set of possible weakclassifiers. While the AdaBoost process is quite effi-cient, the set of weak classifier is extraordinarily large.Since there is one weak classifier for each distinct fea-ture/threshold combination, there are effectively KNweak classifiers, where K is the number of featuresand N is the number of examples. In order to appre-ciate the dependency on N , suppose that the examplesare sorted by a given feature value. With respect to thetraining process any two thresholds that lie between thesame pair of sorted examples is equivalent. Therefore

the total number of distinct thresholds is N . Given atask with N = 20000 and K = 160000 there are 3.2billion distinct binary weak classifiers.

The wrapper method can also be used to learn a per-ceptron which utilizes M weak classifiers (John et al.,1994) The wrapper method also proceeds incremen-tally by adding one weak classifier to the perceptron ineach round. The weak classifier added is the one whichwhen added to the current set yields a perceptron withlowest error. Each round takes at least O(NKN) (or 60Trillion operations); the time to enumerate all binaryfeatures and evaluate each example using that feature.This neglects the time to learn the perceptron weights.Even so, the final work to learn a 200 feature classi-fier would be something like O(MNKN) which is 1016

operations.The key advantage of AdaBoost as a feature selec-

tion mechanism, over competitors such as the wrappermethod, is the speed of learning. Using AdaBoost a200 feature classifier can be learned in O(MNK) orabout 1011 operations. One key advantage is that ineach round the entire dependence on previously se-lected features is efficiently and compactly encodedusing the example weights. These weights can then beused to evaluate a given weak classifier in constant time.

The weak classifier selection algorithm proceeds asfollows. For each feature, the examples are sorted basedon feature value. The AdaBoost optimal threshold forthat feature can then be computed in a single pass overthis sorted list. For each element in the sorted list, foursums are maintained and evaluated: the total sum ofpositive example weights T +, the total sum of negativeexample weights T ", the sum of positive weights belowthe current example S+ and the sum of negative weightsbelow the current example S". The error for a thresholdwhich splits the range between the current and previousexample in the sorted list is:

e = min$S+ + (T " " S"), S" + (T + " S+%

,

or the minimum of the error of labeling all examplesbelow the current example negative and labeling the ex-amples above positive versus the error of the converse.These sums are easily updated as the search proceeds.

Many general feature selection procedures have beenproposed (see chapter 8 of Webb (1999) for a review).Our final application demanded a very aggressive pro-cess which would discard the vast majority of features.For a similar recognition problem Papageorgiou et al.(1998) proposed a scheme for feature selection based

142 Viola and Jones

Table 1. The boosting algorithm for learning a query online.T hypotheses are constructed each using a single feature. Thefinal hypothesis is a weighted linear combination of the T hy-potheses where the weights are inversely proportional to thetraining errors.

• Given example images (x1, y1), . . . , (xn, yn) whereyi = 0, 1 for negative and positive examples respectively.

• Initialize weights w1,i = 12m , 1

2l for yi = 0, 1 respectively,where m and l are the number of negatives and positivesrespectively.

• For t = 1, . . . , T :

1. Normalize the weights, wt,i ! wt,i!nj=1 wt, j

2. Select the best weak classifier with respect to theweighted error

!t = min f,p,"

"

i

wi | h(xi , f, p, " ) " yi | .

See Section 3.1 for a discussion of an efficientimplementation.

3. Define ht (x) = h(x, ft , pt , "t ) where ft , pt , and "t

are the minimizers of !t .4. Update the weights:

wt+1,i = wt,i #1"eit

where ei = 0 if example xi is classified correctly, ei = 1otherwise, and #t = !t

1"!t.

• The final strong classifier is:

C(x) =

#1

T"

t=1

$t ht (x) # 12

T"

t=1

$t

0 otherwise

where $t = log 1#t

Such structures have been called decision stumps inthe machine learning literature. The original work ofFreund and Schapire (1995) also experimented withboosting decision stumps.

3.1. Learning Discussion

The algorithm described in Table 1 is used to selectkey weak classifiers from the set of possible weakclassifiers. While the AdaBoost process is quite effi-cient, the set of weak classifier is extraordinarily large.Since there is one weak classifier for each distinct fea-ture/threshold combination, there are effectively KNweak classifiers, where K is the number of featuresand N is the number of examples. In order to appre-ciate the dependency on N , suppose that the examplesare sorted by a given feature value. With respect to thetraining process any two thresholds that lie between thesame pair of sorted examples is equivalent. Therefore

the total number of distinct thresholds is N . Given atask with N = 20000 and K = 160000 there are 3.2billion distinct binary weak classifiers.

The wrapper method can also be used to learn a per-ceptron which utilizes M weak classifiers (John et al.,1994) The wrapper method also proceeds incremen-tally by adding one weak classifier to the perceptron ineach round. The weak classifier added is the one whichwhen added to the current set yields a perceptron withlowest error. Each round takes at least O(NKN) (or 60Trillion operations); the time to enumerate all binaryfeatures and evaluate each example using that feature.This neglects the time to learn the perceptron weights.Even so, the final work to learn a 200 feature classi-fier would be something like O(MNKN) which is 1016

operations.The key advantage of AdaBoost as a feature selec-

tion mechanism, over competitors such as the wrappermethod, is the speed of learning. Using AdaBoost a200 feature classifier can be learned in O(MNK) orabout 1011 operations. One key advantage is that ineach round the entire dependence on previously se-lected features is efficiently and compactly encodedusing the example weights. These weights can then beused to evaluate a given weak classifier in constant time.

The weak classifier selection algorithm proceeds asfollows. For each feature, the examples are sorted basedon feature value. The AdaBoost optimal threshold forthat feature can then be computed in a single pass overthis sorted list. For each element in the sorted list, foursums are maintained and evaluated: the total sum ofpositive example weights T +, the total sum of negativeexample weights T ", the sum of positive weights belowthe current example S+ and the sum of negative weightsbelow the current example S". The error for a thresholdwhich splits the range between the current and previousexample in the sorted list is:

e = min$S+ + (T " " S"), S" + (T + " S+%

,

or the minimum of the error of labeling all examplesbelow the current example negative and labeling the ex-amples above positive versus the error of the converse.These sums are easily updated as the search proceeds.

Many general feature selection procedures have beenproposed (see chapter 8 of Webb (1999) for a review).Our final application demanded a very aggressive pro-cess which would discard the vast majority of features.For a similar recognition problem Papageorgiou et al.(1998) proposed a scheme for feature selection based

142 Viola and Jones

Table 1. The boosting algorithm for learning a query online.T hypotheses are constructed each using a single feature. Thefinal hypothesis is a weighted linear combination of the T hy-potheses where the weights are inversely proportional to thetraining errors.

• Given example images (x1, y1), . . . , (xn, yn) whereyi = 0, 1 for negative and positive examples respectively.

• Initialize weights w1,i = 12m , 1

2l for yi = 0, 1 respectively,where m and l are the number of negatives and positivesrespectively.

• For t = 1, . . . , T :

1. Normalize the weights, wt,i ! wt,i!nj=1 wt, j

2. Select the best weak classifier with respect to theweighted error

!t = min f,p,"

"

i

wi | h(xi , f, p, " ) " yi | .

See Section 3.1 for a discussion of an efficientimplementation.

3. Define ht (x) = h(x, ft , pt , "t ) where ft , pt , and "t

are the minimizers of !t .4. Update the weights:

wt+1,i = wt,i #1"eit

where ei = 0 if example xi is classified correctly, ei = 1otherwise, and #t = !t

1"!t.

• The final strong classifier is:

C(x) =

#1

T"

t=1

$t ht (x) # 12

T"

t=1

$t

0 otherwise

where $t = log 1#t

Such structures have been called decision stumps inthe machine learning literature. The original work ofFreund and Schapire (1995) also experimented withboosting decision stumps.

3.1. Learning Discussion

The algorithm described in Table 1 is used to select key weak classifiers from the set of possible weak classifiers. While the AdaBoost process is quite efficient, the set of weak classifiers is extraordinarily large. Since there is one weak classifier for each distinct feature/threshold combination, there are effectively KN weak classifiers, where K is the number of features and N is the number of examples. In order to appreciate the dependency on N, suppose that the examples are sorted by a given feature value. With respect to the training process, any two thresholds that lie between the same pair of sorted examples are equivalent. Therefore the total number of distinct thresholds is N. Given a task with N = 20000 and K = 160000 there are 3.2 billion distinct binary weak classifiers.

The wrapper method can also be used to learn a perceptron which utilizes M weak classifiers (John et al., 1994). The wrapper method also proceeds incrementally by adding one weak classifier to the perceptron in each round. The weak classifier added is the one which, when added to the current set, yields a perceptron with lowest error. Each round takes at least O(NKN) (or 60 trillion operations): the time to enumerate all binary features and evaluate each example using that feature. This neglects the time to learn the perceptron weights. Even so, the final work to learn a 200 feature classifier would be something like O(MNKN), which is 10^16 operations.

The key advantage of AdaBoost as a feature selection mechanism, over competitors such as the wrapper method, is the speed of learning. Using AdaBoost a 200 feature classifier can be learned in O(MNK) or about 10^11 operations. One key advantage is that in each round the entire dependence on previously selected features is efficiently and compactly encoded using the example weights. These weights can then be used to evaluate a given weak classifier in constant time.



The first two features selected by the AdaBoost algorithm are as shown in Fig. 2.6. The first is a two-rectangle feature which measures the difference in intensity between the darker region of the eyes and the lighter region of the cheekbones. The second is a three-rectangle feature which compares the intensities in the darker eye regions to the intensity across the lighter bridge of the nose.

Fig. 2.6 - First two features selected by the AdaBoost learning algorithm

2.2.3 CASCADE ARCHITECTURE

The evaluation of the strong classifiers generated by the learning process can be done quickly. But to further improve performance, the strong classifiers are arranged in a cascade in order of complexity, i.e. the number and complexity of the features increases from one stage to the next. If at any stage in the cascade a classifier rejects the sub-window under inspection, no further processing is performed and the search continues with the next sub-window. Fig. 2.7 shows a 4-stage cascaded architecture where the sub-window under inspection has to pass through each of the 4 stages to be detected as a face.

Fig. 2.7 - Cascaded architecture [r]
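A minimal sketch of this early-rejection logic is given below; Stage and the evaluator are placeholders, not the project's actual classes.

```cpp
// A sub-window is reported as a face only if every cascade stage accepts it.
#include <functional>
#include <vector>

struct Stage { int id; /* trained rectangular features and stage threshold */ };

bool isFace(const std::vector<Stage>& cascade,
            const std::function<bool(const Stage&)>& evaluateStage) {
    for (const Stage& s : cascade)
        if (!evaluateStage(s))
            return false;   // rejected: stop immediately and move to the next sub-window
    return true;            // passed every stage: classified as a face
}
```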

How face detection works (In Depth: You can do it, and your camera can do it - but how?) - Julian M Bucknall, TechRadar, 18 July 2010

The first time I looked at the rear display of a camera that had face-detection software, it was an interesting experience. Point it at a person, and the software would superimpose a coloured square over that person's face.

This would enable you to more easily frame the photo, ensure correct exposure for the face compared to the rest of the scene and make sure that the face was properly focused. So how did it manage it?

What's so special about a face that enables the camera to identify that this set of pixels is a face, but that set isn't? And in real-time too?

The camera doesn't have a chip with great processing power, either, so the algorithm must be extremely efficient. We should also remember that over the years, camera face-detection software has become pretty advanced. You can now expect the software in your point-and-shoot camera to work out not just the location of a face but also whether the person is smiling, and to take the photo automatically if so.

Back in 2001, Paul Viola and Michael Jones invented a new framework for detecting arbitrary objects and refined it for face detection. The algorithm is now known as the Viola-Jones framework. The first thing to realise is that face detection, whether by Viola-Jones or not, is not an exact science.

Just like we humans can be fooled by images that seem to contain a face when in reality they do not, so face-detection software can be hoodwinked. This phenomenon is known as pareidolia: the apparent recognition of something significant (usually a face or a human form) in something that doesn't have it naturally.

There are many examples of this, the most prominent being perhaps the Face on Mars – a photo taken in the Cydonia region of Mars that appeared to contain a human face in the rock – or the image of the Virgin Mary that an American lady found in a grilled cheese sandwich.

Face detection software can be fooled too, so we talk about the algorithm's rate of false positives (detecting a face when there is none) or false negatives (not detecting a face that's present).

A good guess

The Viola-Jones method has a very high accuracy rate – the researchers report a false negative rate of less than one per cent and a false positive rate of under 40 per cent, even when used with the simplest filter. (The full framework uses up to 32 filters, or 'classifiers'.) But we're getting ahead of ourselves.

The breakthrough for Viola and Jones came when they didn't try to analyse the image directly: instead, they started to analyse rectangular 'features' in the image. These features are known as 'Haar-like features', due to the similarity of


pass through 100 per cent of the faces with a 40 per cent false positive rate (60 per cent of the non-faces would be rejected by this classifier).

Figure 3 shows this simple classifier in action. It uses two features to test the image: a horizontal feature that measures the difference between the darker eyes and the lighter cheekbones, and the three-rectangle feature that tests for the darker eyes against the lighter bridge of the nose.

FIGURE 3: The first classifier on a 24 x 24 image of the author's face showing the two features in use

Although they had been trying to implement a strong classifier from a combination of 200 or so weak classifiers, this early success prompted them to build a cascade of classifiers instead of a single large one (see Figure 4).

Each subwindow of the original image is tested against the first classifier. If it passes that classifier, it's tested against the second. If it passes that one, it's then tested against the third, and so on. If it fails at any stage of the testing, the subwindow is rejected as a possible face. If it passes through all the classifiers then the subwindow is classified as a face.

FIGURE 4: The Viola-Jones cascade of classifiers

The interesting thing is that the second and subsequent classifiers are not trained with the full training set. Instead they are trained on the images that


2.3 CANNY EDGE DETECTION

In general, the purpose of edge detection is to significantly reduce the amount of data in an image, while preserving its structural properties [o]. This is done as an initial step in many image processing algorithms so that the result can be used for further processing. Several edge detection algorithms exist, but the Canny edge detector is one of the most popular.

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images [d]. It was developed by John F. Canny in 1986. His aim was to develop an algorithm that met the following criteria [3, 4]:

• Good Detection: The detection of real edges should be maximized while that of non-

edges should be minimized.

• Good Localization: The edges marked by the Canny edge detector should be as close as

possible to the real edges in the original image.

• Minimum number of responses: An edge should be detected only once, and image noise

should not be detected as edges.

The Canny edge detection algorithm operates in five separate steps, which are described below.

2.3.1 NOISE REDUCTION BY GAUSSIAN BLUR

All images taken from a camera will contain some amount of noise. To prevent single-pixel noise from being mistaken for edges, this noise must first be reduced. Therefore the image is blurred by convolving it with a Gaussian filter; this is called a Gaussian blur. The kernel of a Gaussian filter with a standard deviation of σ = 1.4 is usually used and is shown in Eq. (2.6).


B = \frac{1}{159}
\begin{bmatrix}
2 & 4 & 5 & 4 & 2 \\
4 & 9 & 12 & 9 & 4 \\
5 & 12 & 15 & 12 & 5 \\
4 & 9 & 12 & 9 & 4 \\
2 & 4 & 5 & 4 & 2
\end{bmatrix} \qquad (2.6)
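As an illustration, a direct (unoptimized) convolution with this kernel might look like the following sketch; the function name and image layout are assumptions.

```cpp
// Sketch of convolving a grayscale image with the 5x5 kernel B of Eq. (2.6).
#include <vector>

// img is row-major, width w, height h; returns the blurred image (borders left as-is).
std::vector<unsigned char> gaussianBlur(const std::vector<unsigned char>& img, int w, int h) {
    static const int K[5][5] = { { 2, 4, 5, 4, 2 }, { 4, 9, 12, 9, 4 },
                                 { 5, 12, 15, 12, 5 }, { 4, 9, 12, 9, 4 },
                                 { 2, 4, 5, 4, 2 } };              // entries sum to 159
    std::vector<unsigned char> out(img);
    for (int y = 2; y < h - 2; ++y)
        for (int x = 2; x < w - 2; ++x) {
            int acc = 0;
            for (int ky = -2; ky <= 2; ++ky)
                for (int kx = -2; kx <= 2; ++kx)
                    acc += K[ky + 2][kx + 2] * img[(y + ky) * w + (x + kx)];
            out[y * w + x] = static_cast<unsigned char>(acc / 159); // divide by kernel sum
        }
    return out;
}
```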


2.3.2 FINDING INTENSITY GRADIENTS

The Canny algorithm finds edges where there is a high variation in the grayscale intensity of the image. These regions are found by calculating the gradients at each pixel of the image. This can be done by applying various operators such as Roberts Cross, Scharr, Prewitt etc., but the most commonly used one is the Sobel operator. This uses two 3×3 kernels, shown in Eq. (2.7), which are convolved with the original image to calculate approximate derivatives in the x and y directions respectively.

K_{GX} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad
K_{GY} = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \qquad (2.7)

The gradient magnitude G and the direction of the edges θ can then be calculated as shown in Eq. (2.8), where G_X and G_Y are the gradients in the x- and y-directions respectively, derived from the Sobel operator:

G = \sqrt{G_X^2 + G_Y^2}, \qquad \theta = \arctan\!\left(\frac{G_Y}{G_X}\right) \qquad (2.8)

The edge direction θ is then rounded to one of four angles representing vertical, horizontal and the two diagonals (usually 0, 45, 90 and 135 degrees, respectively).
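A small sketch of Eq. (2.8) at a single pixel, including the rounding of θ to the four canonical directions, might look as follows (illustrative names).

```cpp
// gx and gy are the Sobel responses K_GX * I and K_GY * I at one pixel.
#include <cmath>

double gradientMagnitude(double gx, double gy, int& quantizedDir) {
    const double PI = 3.14159265358979323846;
    double mag = std::sqrt(gx * gx + gy * gy);       // G = sqrt(Gx^2 + Gy^2)
    double theta = std::atan2(gy, gx) * 180.0 / PI;  // direction in degrees
    if (theta < 0.0) theta += 180.0;                 // fold into [0, 180)
    // Round to the nearest of the four canonical directions: 0, 45, 90 or 135.
    quantizedDir = (static_cast<int>(theta + 22.5) % 180) / 45 * 45;
    return mag;
}
```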

2.3.3 NON-MAXIMUM SUPPRESSION

Each calculated image gradient is then checked to see whether it is a local maximum in the gradient direction. For example, if the rounded gradient angle is zero degrees (i.e. the edge is in the east-west direction), the point is considered to be on the edge if its gradient magnitude is greater than the magnitudes of its neighbours in the north-south direction. Otherwise this gradient value is suppressed (removed). This check is carried out for all the gradient values to get the final binary image consisting of what are called thin edges.
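For example, the zero-degree case described above might be checked as in this sketch (names are assumptions; the other three directions are analogous).

```cpp
// Non-maximum suppression for the zero-degree (east-west edge) case: compare
// the magnitude against the north and south neighbours; suppress unless largest.
// mag is the gradient magnitude image, row-major with width w; (x, y) is interior.
bool isLocalMaximum0Deg(const float* mag, int w, int x, int y) {
    float m = mag[y * w + x];
    return m >= mag[(y - 1) * w + x] && m >= mag[(y + 1) * w + x];
}
```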

2.3.4 ADDITIONAL STEPS

Additional steps such as double thresholding and edge tracking by hysteresis can be carried out to remove insignificant edges. Double thresholding does this




by setting a threshold for the gradients. If the edges have a gradient value higher than the

threshold they are marked as strong edges and are otherwise marked as weak edges. Edge

tracking by hysteresis is then done to retain only those weak edges that are connected to

strong edges.

2.3.5 EXAMPLE

Fig. 2.8 (a), (b) and (c) - Examples

Fig. 2.8 (d) - Examples

Fig 2.8 illustrates the Canny edge detection algorithm through each step. Fig 2.8 (a) is a grayscale example image. Fig 2.8 (b) shows a Gaussian blur applied on the original image. Fig 2.8 (c) shows the image after finding the gradients and turning the pixels with a high enough gradient white and the rest of the pixels black. Fig 2.8 (d) shows the final Canny edge detected image after applying non-maximum suppression.

2.4 SKIN COLOUR DETECTION

Skin colour detection is a popular method used for face detection. It is a known fact that human skin has a characteristic colour, which is easily recognized by humans. Colour detection also allows for fast processing and is highly robust to geometric variations of the face [11]. So it is only logical to create an algorithm that detects faces based on colour.

There are various models for classifying pixels as skin coloured. This report uses the Hue, Saturation and Intensity (HSI) model, which is discussed in the following sections.

2.4.1 HSI MODEL

HSV is a three-dimensional color space very different from RGB or CMY. The three key elements are defined as follows [2, f].

• Hue - An attribute associated with the dominant wavelength in a mixture of light waves. It represents the human sensation according to which an area appears to be similar to one, or a combination of two, of the perceived colours red, yellow, green and blue.

• Saturation - The colorfulness of an area relative to its own brightness. It gives a measure of how much a pure colour (hue) is diluted with white light.

• Value - The visual sensation according to which an area appears to emit more or less light.

Fig. 2.9 - Double cone representation of HSI Space [f]


Fig. 2.9 illustrates a common representation of the HSI space. The double cone has one central axis representing value, V, which varies from 0 to 1. It has the value 0, representing black, at the lower pointed end of the double cone, and 1 at the upper end, representing white. Along this axis lie all the gray values. If this double cone is viewed from the top, it becomes a circle. Different colors, or hues, are arranged around this circle. Hues are determined by their angular location on this circle, with the hue at red being 0°, green at 120° and blue at 240°. Saturation is the distance perpendicular to the value axis. Colors near the central axis have low saturation and have a washed out look. Colors near the surface of the cone have high saturation. Another point to be noted is that when saturation is 0, the hue is undefined.

2.4.2 SKIN COLOUR

Kjeldson and Kender [7] defined a model in HSV color space to separate skin colour regions from the background for a hand-gesture-based user interface. In order to reduce the effects of lighting, the proposed model relies heavily on hue and saturation, giving only minor importance to intensity.

In the model used in this report the value of intensity is ignored, and the threshold values shown in Eq. (2.9) [n] are used to distinguish between skin coloured and non-skin coloured pixels. Here 0° and 50° are the lower and upper thresholds of hue respectively. Similarly, 0.23 and 0.68 are the lower and upper thresholds for saturation.

0^\circ \le \mathrm{Hue}_{skin} \le 50^\circ, \qquad 0.23 \le \mathrm{Saturation}_{skin} \le 0.68 \qquad (2.9)
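A one-function sketch of this test (hypothetical name):

```cpp
// Eq. (2.9): a pixel is skin coloured if its hue lies in [0, 50] degrees
// and its saturation lies in [0.23, 0.68].
bool isSkinColour(double hueDegrees, double saturation) {
    return hueDegrees >= 0.0 && hueDegrees <= 50.0 &&
           saturation >= 0.23 && saturation <= 0.68;
}
```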

2.4.3 EXAMPLES

In Fig 2.10 (a) each pixel is checked to see if it satisfies the threshold values given in Eq. (2.9). If a particular pixel passes the threshold it is turned white, else it is turned black, as shown in Fig 2.10 (b).




Fig 2.10 (a) and (b) - Example


CHAPTER - III

DETAILED EXPLANATION OF PROJECT

The main aim of the project is to quickly reject regions of an image which can't possibly be faces. This increases the speed of computation and reduces the number of false positives in the output. This is done using skin colour detection and Canny edge detection as pre-processing techniques. An iOS 5 App was created with the above-mentioned face detection methodology. A detailed description of the project is given below.

3.1 STRUCTURE OF THE PROJECT

The project has two parts. One is the operation of the proposed face detector

on the iMac. The other is an App running this algorithm on the iPod Touch (4th

generation). Both of these are discussed in detail below.

3.1.1 iOS PROJECT

This part of the project is basically an App that takes live video information from the camera of the mobile device, stores it and runs it through the face detection algorithm. The algorithm returns the positions and the dimensions of the faces. The App then creates a transparent view on top of the current view, on which the boxes surrounding the faces are drawn.

The image coming from the camera is in the standard BGRA format. The first three letters (B, G and R) refer to the primary colour components, i.e. blue, green and red. The A represents the alpha channel. The alpha channel is used in computer graphics to blend images with their background in order to create the appearance of partial or full transparency. If the A value is at its minimum, that particular pixel is completely transparent; if it has the maximum value, the pixel is opaque.

The image has a resolution of 640x480 pixels. It is stored in memory as a one-dimensional uint8_t array. The uint8_t type is similar to an unsigned char: it occupies one byte of memory and holds values in the range 0-255. This means that each pixel has four uint8_t values assigned to it, i.e. one for every colour component.
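As an illustration, one pixel of such a buffer can be addressed as in the following sketch (names are assumptions).

```cpp
// Sketch of addressing one pixel in the 640x480 BGRA buffer described above.
#include <cstdint>

// frame points to width*height*4 bytes laid out in B, G, R, A order.
void getPixel(const uint8_t* frame, int width, int x, int y,
              uint8_t& b, uint8_t& g, uint8_t& r, uint8_t& a) {
    const uint8_t* p = frame + 4 * (y * width + x);  // 4 bytes per pixel
    b = p[0]; g = p[1]; r = p[2]; a = p[3];
}
```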

The App basically has three views. Topmost is a transparent view that displays boxes around the detected faces. The other two views show the live video feed and the current frame being processed respectively. The user can switch between these two using swipe (left/right) gestures. The App also has a toolbar with options to

• Switch cameras - Choose between the front and back cameras of the device.

• Toggle face detection - Turn the face detection ON and OFF.

• Take a picture - A button to store the current frame in the saved photos library.

• Help option - This brings up a scroll view with options to choose between the different types of pre-processing (see Section 4.2.2).

Fig 3.1- App running on iPod touch 4th generation

3.1.2 MAC PROJECT

The Mac project runs the same face detection algorithm as the iOS project and is used for testing. This is useful as images can be displayed and verified from any point in the program, which is not always possible on a mobile device.

The Mac project uses the CImg C++ image library for storing and displaying the images [b]. The test images taken from the mobile device are stored and run through the Mac algorithm. The CImg image library accepts images in the Windows Bitmap format. The program then produces an output file containing the name of each image with the coordinates and dimensions of the faces. It also saves images at various stages of the processing.


3.2 TRAINING OF CLASSIFIERS

Initially 10,920 training images are taken, with 8,004 faces and 2,916 non-faces. All the images with faces have fixed eye positions. Then the training algorithm of the Viola-Jones framework is used (see Section 2.2.2). Each stage is trained until the stage goal of a 99.5% true positive rate and a 60% true negative rate (for definitions see Section 4.1) is fulfilled, or until the stage limit of 100 classifiers is reached. This means that 99.5% of the images with faces have to be correctly classified as faces and 60% of the images without any faces have to be correctly rejected as non-faces. Each cascade stage can have a maximum of 100 rectangular features; if these values (TPR = 99.5% and TNR = 60%) are not obtained by the 100th rectangular feature, the stage is stopped there irrespective of the values of TPR and TNR.

This is followed by what is called bootstrapping. Here, after each stage, non-faces that were correctly rejected by the stage are removed from the negative training set. Then a collection of 19,774 higher resolution scene images (containing no faces) is used to randomly sample new non-face training images. Only those random samples that are falsely accepted (i.e. not rejected) by all previous stages are added to the negative training set. In this way the following stage is focused on the weaknesses of the previous stages. 10,000 of these samples are collected and added to the negative training set; thus the negative set normally grows during training. After the 7th stage the training set contained the 8,004 faces and 15,129 non-faces. The number of rectangular features for each stage is shown in Table 3.1.

Table 3.1 - Number of rectangular features per stage.


Stage number    Number of rectangular features
1               5
2               3
3               10
4               18
5               32
6               50
7               68

Total = 186


3.3 IMAGE PROCESSING

The 640x480 image is passed on to the face detection algorithm, which detects the position and dimensions of each face in the image. The algorithm works in the following steps.

3.3.1 SKIN COLOUR DETECTION

Research on object detection based on skin colour classification has gained popularity in recent years [9, 11]. However, object detection based solely on this method has a lot of shortcomings, as it is difficult to get perfect classification under different lighting conditions, etc. Moreover, when this method is employed for face detection, the detector fails when there are other skin coloured regions like legs, arms, etc., or when the background contains skin-colour-like pixels. Hence it is better to use it as a pre-processing step.

This step is carried out using the algorithm mentioned in the theory section. This algorithm uses images in the HS colour model, but the representation of the image is in the RGB model. Hence a conversion is required from RGB to HS. The steps followed are discussed below.

• RGB to HS conversion -

The set of equations in Eq. (3.1) [n] shows the conversion from the RGB model to the HS model. Here the values R, G and B represent the red, green and blue intensities of each pixel respectively. As can be seen, hue is an angle denoted in degrees and saturation is a value varying between 0 and 1. (A sketch of this conversion is given after this list.)


M = \max(R, G, B), \qquad m = \min(R, G, B)

C_r = \frac{M - R}{M - m}, \qquad C_g = \frac{M - G}{M - m}, \qquad C_b = \frac{M - B}{M - m}

H' = \begin{cases} \text{undefined}, & \text{if } Saturation = 0 \\ C_b - C_g, & \text{if } M = R \\ 2 + C_r - C_b, & \text{if } M = G \\ 4 + C_g - C_r, & \text{if } M = B \end{cases}

Saturation = \frac{M - m}{M}, \qquad Hue = 60^\circ \times H' \qquad (3.1)


• Thresholding - Each pixel in the HS image is then checked against the threshold values given in Eq. (2.9) to decide whether it has a skin coloured value. If a pixel passes the threshold it is set to 255 (white), and to 0 (black) otherwise.

• Nearest neighbour scale down - This image is then scaled down to half its dimensions using a nearest neighbour scale down algorithm, to reduce the computation time of the integral image. In this algorithm the value of the pixel at each new point is approximated by the value of the pixel which is nearest to it in the original image. It does not consider all the surrounding neighbours, resulting in a piecewise-constant interpolant.

The final image is stored separately in memory to be used later for pre-processing.
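The following sketch illustrates the conversion and scale-down steps above; the names are illustrative, not the project's actual code, and the folding of negative hue angles into [0, 360) is an added convention.

```cpp
// Sketches of Eq. (3.1) and the nearest-neighbour halving. R, G, B in [0, 255].
#include <algorithm>
#include <vector>

// Eq. (3.1): returns false when M = m (saturation 0, hue undefined).
bool rgbToHS(double R, double G, double B, double& hue, double& sat) {
    double M = std::max(R, std::max(G, B));
    double m = std::min(R, std::min(G, B));
    if (M <= 0.0 || M == m) return false;          // black or grey pixel
    sat = (M - m) / M;
    double Cr = (M - R) / (M - m), Cg = (M - G) / (M - m), Cb = (M - B) / (M - m);
    double Hp = (M == R) ? Cb - Cg : (M == G) ? 2.0 + Cr - Cb : 4.0 + Cg - Cr;
    hue = 60.0 * Hp;
    if (hue < 0.0) hue += 360.0;                   // fold negative angles into [0, 360)
    return true;
}

// Nearest-neighbour halving of the binary skin map (row-major, width w, height h).
std::vector<unsigned char> halveNearest(const std::vector<unsigned char>& src, int w, int h) {
    int w2 = w / 2, h2 = h / 2;
    std::vector<unsigned char> dst(static_cast<size_t>(w2) * h2);
    for (int y = 0; y < h2; ++y)
        for (int x = 0; x < w2; ++x)
            dst[y * w2 + x] = src[(2 * y) * w + (2 * x)];  // copy the nearest source pixel
    return dst;
}
```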

Fig 3.2 (a), (b) and (c) - Example

Fig 3.2(a) shows an image taken from the test set (see Section 4.2.1). The image is first sent through the skin colour detection algorithm, and all the pixels that are classified as skin coloured are set to white while the rest are set to black. As can be seen, some of the pixels on the reddish building are also misclassified as skin coloured. The image is then reduced to half its dimensions. It has to be noted that the images shown here have the same dimension as


3.3.2 RGB TO GREY CONVERSION

The face detection algorithm requires a greyscale image. Hence the original image is converted as shown in Eq. (3.2):

GreyVal = \frac{(R \times 77) + (G \times 151) + (B \times 28)}{256} \qquad (3.2)

Here each pixel in the grayscale image is the sum of approximately 30% of the red value, 59% of the green value and 11% of the blue value. This image too is stored separately in memory, and the original image can then be released from memory. Fig 3.3 shows Fig 3.2(a) converted to greyscale using the above equation.
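In integer arithmetic, Eq. (3.2) can be sketched as follows (77/256 ≈ 30%, 151/256 ≈ 59%, 28/256 ≈ 11%):

```cpp
// Sketch of Eq. (3.2); the shift by 8 bits divides by 256.
unsigned char toGrey(unsigned char r, unsigned char g, unsigned char b) {
    return static_cast<unsigned char>((r * 77 + g * 151 + b * 28) >> 8);
}
```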

Fig 3.3 - Example

3.3.3 CANNY PRUNING

Canny pruning is used as a pre-processing step to remove the regions in the

image that have too many or too few edges to be a face.

The steps followed by the Canny pruning algorithm are discussed below.

• Bilinear scale down - The grey scale image is initially scaled down to half its dimensions, i.e. 320x240, using a bilinear scale down function. This is done to reduce the computation time of the Canny edge detection algorithm and also that of the integral image calculation. This algorithm finds the value of the pixel intensity at a position in the scaled image by taking the weighted average of the four nearest neighbours in the original




image. The weight assigned to each neighbour is inversely proportional to the distance between the new pixel and that neighbour (a sketch of this sampling step is given after this list).

• Histogram normalization - The scaled grey image is then contrast stretched so that it uses the entire grayscale range, to improve the quality of the detected edges. This is done using histogram equalization: the normalized histogram of the image is first calculated and then stretched to cover the entire grayscale range, as described in Eq. (3.3) [2], [m].

Here p_n denotes the normalized histogram of an image f and g denotes the histogram normalized image. The summation term calculates the cumulative distribution of the normalized histogram, where L is 256 (the range of the grayscale values). The function floor() rounds down to the nearest integer, in order to avoid going out of range.

• Canny edge detection - This image is now passed through the Canny edge detection algorithm, which is the same as the one described in the theory section. The upper and lower thresholds used for the gradient values are 60 and 30 respectively [q]. This means that the edge tracking starts if the gradient value is above the upper threshold and continues while the gradient value is above the lower threshold.

Steps such as double thresholding and edge tracking by hysteresis are skipped so as not to lose edge information. The final output of the algorithm is an unsigned char image with the detected edges having value 255 (white) and the rest of the pixels having value 0 (black). This image is stored separately in memory.
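The bilinear sampling used in the first step of this list can be sketched as follows (illustrative names; border handling simplified):

```cpp
// Sample src (row-major, width w, height h) at fractional position (fx, fy).
#include <algorithm>
#include <vector>

unsigned char bilinearSample(const std::vector<unsigned char>& src, int w, int h,
                             double fx, double fy) {
    int x0 = static_cast<int>(fx), y0 = static_cast<int>(fy);
    int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
    double ax = fx - x0, ay = fy - y0;  // weights fall off with distance to each neighbour
    double v = (1 - ax) * (1 - ay) * src[y0 * w + x0] + ax * (1 - ay) * src[y0 * w + x1]
             + (1 - ax) * ay * src[y1 * w + x0] + ax * ay * src[y1 * w + x1];
    return static_cast<unsigned char>(v + 0.5);
}
```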

Fig 3.4 (a), (b) and (c) - Examples


Histogram Equalization

Histogram equalization is a technique for adjusting image intensities to enhance contrast.

Let f be a given image represented as an m_r \times m_c matrix of integer pixel intensities ranging from 0 to L - 1, where L is the number of possible intensity values, often 256. Let p denote the normalized histogram of f with a bin for each possible intensity, so

p_n = \frac{\text{number of pixels with intensity } n}{\text{total number of pixels}}, \qquad n = 0, 1, \ldots, L - 1.

The histogram equalized image g will be defined by

g_{i,j} = \operatorname{floor}\!\Big( (L - 1) \sum_{n=0}^{f_{i,j}} p_n \Big), \qquad (1)

where floor() rounds down to the nearest integer. This is equivalent to transforming the pixel intensities, k, of f by the function

T(k) = \operatorname{floor}\!\Big( (L - 1) \sum_{n=0}^{k} p_n \Big).

The motivation for this transformation comes from thinking of the intensities of f and g as continuous random variables X, Y on [0, L - 1] with Y defined by

Y = T(X) = (L - 1) \int_0^X p_X(x)\, dx, \qquad (2)

where p_X is the probability density function of f. T is the cumulative distribution function of X multiplied by (L - 1). Assume for simplicity that T is differentiable and invertible. It can then be shown that Y defined by T(X) is uniformly distributed on [0, L - 1], namely that p_Y(y) = \frac{1}{L - 1}:

\int_0^y p_Y(z)\, dz = \text{probability that } 0 \le Y \le y = \text{probability that } 0 \le X \le T^{-1}(y) = \int_0^{T^{-1}(y)} p_X(w)\, dw,

so, differentiating both sides with respect to y,

p_Y(y) = p_X\big(T^{-1}(y)\big)\, \frac{d}{dy}\, T^{-1}(y),

which equals \frac{1}{L - 1} since T'(x) = (L - 1)\, p_X(x).

g_{i,j} = \operatorname{floor}\!\Big( (L - 1) \sum_{n=0}^{f_{i,j}} p_n \Big) \qquad (3.3)
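A compact sketch of Eq. (3.3) applied to a whole image (illustrative name):

```cpp
// Build the normalized histogram, then map each pixel through the scaled
// cumulative distribution, as in Eq. (3.3).
#include <cmath>
#include <vector>

void equalize(std::vector<unsigned char>& img) {
    const int L = 256;
    std::vector<double> p(L, 0.0);
    for (unsigned char v : img) p[v] += 1.0;
    for (double& pn : p) pn /= img.size();        // normalized histogram p_n

    std::vector<unsigned char> lut(L);
    double cdf = 0.0;
    for (int n = 0; n < L; ++n) {                 // cumulative sum up to intensity n
        cdf += p[n];
        lut[n] = static_cast<unsigned char>(std::floor((L - 1) * cdf));
    }
    for (unsigned char& v : img) v = lut[v];      // g_{i,j} = floor((L-1) * sum p_n)
}
```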


Fig 3.4(a) shows the image in Fig 3.3 shrunk using bilinear scaling. This is then contrast stretched as shown in Fig 3.4 (b). Finally, Fig 3.4 (c) shows the Canny edge detected image.

3.3.4 SLIDING WINDOW

The face detection algorithm tries to detect a face in a rectangular sub-window of the image. As the classifiers used for this project are trained with images of dimension 110x128, the size of the sub-window used is also 110x128. In order to detect larger faces the image is scaled down and a sub-window of the same size is used. The scale down is done by bilinear scaling for the grayscale image and using nearest neighbour for the skin colour detected and Canny edge detected images.

The scaling factor used is 1.25 and the scaling is repeated seven times. This is illustrated in Fig 3.5, which shows the scaling of the grayscale image. The window (shown in red at the top left position) has a constant size in each image, but covers a large area in the small image, thereby detecting big faces. Similarly, in the large image the window covers a smaller area, detecting small faces.

Fig 3.5 - Example of scaling where window size remains constant


For each image size the sub-window starts from the top left corner and moves through the entire image with a step size of two pixels. The step size for the Canny image and the skin colour detected image is hence one pixel, as they have half the dimensions of the original image.

Hence, in effect, the smallest sub-window used to detect a face has dimension 110x128 and the largest corresponding sub-window has dimension 413x480.
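The overall scan can be sketched as follows; preProcess and evaluateCascade are placeholders for the project's routines, not actual function names.

```cpp
// A fixed 110x128 window slides over a pyramid of images, each 1.25x smaller
// than the last (the original plus seven rescalings), with a 2-pixel step.
void scanPyramid(int imgW, int imgH) {
    const int winW = 110, winH = 128, step = 2;
    double scale = 1.0;
    for (int level = 0; level <= 7; ++level, scale *= 1.25) {
        int w = static_cast<int>(imgW / scale), h = static_cast<int>(imgH / scale);
        for (int y = 0; y + winH <= h; y += step)
            for (int x = 0; x + winW <= w; x += step) {
                // preProcess(x, y, level) would run the skin/edge checks here,
                // and only accepted windows would go to evaluateCascade(x, y, level).
            }
    }
}
```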

3.3.5 PRE-PROCESSING

Pre-processing is done before the actual Viola-Jones face detector so that a significant number of windows which can be easily classified as non-faces are rejected without much computation. This is done by counting the number of white pixels in the Canny edge detected image and the skin colour detected image for each sub-window. This white pixel count is passed through a threshold to check if it is within an acceptable range for a face.

The threshold for the skin colour detected image is chosen to be 45% to 90%. This means that the number of pixels detected as skin coloured (white) has to be within 45% to 90% of the total number of pixels in the current sub-window. Sub-windows having less than 45% skin coloured pixels are not considered to be faces, and sub-windows having more than 90% skin coloured pixels are also regarded as non-faces, because faces are not made up entirely of skin coloured pixels (consider the regions of the eyes, hair etc.). Similarly, the threshold values for the Canny edge detected image are 16.7% to 22.7%. Here sub-windows having less than 16.7% white pixels are considered to have too few edges to be a face, and those having more than 22.7% white pixels cannot be faces as they have too many edges.

The white pixel count is done at each scale for the skin colour detected and Canny edge detected images by calculating their integral images. The value of the integral image at any point is the sum of all the pixels above and to the left of that point, as discussed in the theory. Using this, the total sum of pixels within each sub-window can be calculated from the integral image values at its four vertices, as shown in Fig (2.5). This sum is then divided by 255 (the value of each white pixel) to get the actual number of white pixels in the sub-window.
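A sketch of this count, assuming an integral image ii padded with a leading row and column of zeros (so that ii[y][x] is the sum of all pixels above and to the left of (x, y)):

```cpp
// White-pixel count for a w x h sub-window at (x, y) via four integral-image lookups.
#include <vector>
typedef std::vector<std::vector<long> > Integral;

long whitePixels(const Integral& ii, int x, int y, int w, int h) {
    long sum = ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
    return sum / 255;   // each white pixel contributes 255 to the sum
}
```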


3.3.6 FACE DETECTION

The Viola-Jones face detection framework as discussed in the theory is used here, with the trained cascaded classifiers (see Section 3.2). All the sub-windows that pass the pre-processing step are checked for faces. Initially the integral images of the grayscale image and of its square are calculated.

The square is calculated for variance normalization, as suggested in the original Viola-Jones face detection framework. This is done to reduce the effect of different lighting conditions and is described by Eq. (3.4):

v = \sigma^2 = \frac{1}{n} \sum x^2 - M^2 \qquad (3.4)

Here v is the variance of the image sub-window, σ is the standard deviation, n is the number of pixels in the sub-window, M is the mean and x is the pixel value within the sub-window. The normalized value x_{normalized} of a pixel x is calculated as shown in Eq. (3.5):

x_{normalized} = \frac{x - M}{\sqrt{v}} \qquad (3.5)

Fig 3.6 (a) shows 1200 of 1520 images used for training and Fig 3.6 (b) shows the same images variance normalized. As can be seen, this compensates for the different lighting conditions found in the training images.

Fig 3.6 (a) and (b) - Variance normalization




This value can be easily calculated with the help of the integral image, as

shown in Fig.(2.5). Now, the normalized values of the rectangular features within the sub-

window are compared to the threshold obtained from the classifier training. If they pass

this threshold they go on to the next stage of the cascaded architecture and the same

procedure is followed. If any stage rejects a given sub-window, it is classified as a non-face

and the process is carried out for the next sub-window.
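A sketch of Eqs. (3.4) and (3.5), with the sub-window sums assumed to come from the two integral images (names are assumptions):

```cpp
// sumPix and sumSq are the sub-window sums of x and x^2 obtained from the
// integral images; n is the number of pixels in the sub-window.
#include <cmath>

double normalise(double pixel, double sumPix, double sumSq, double n) {
    double M = sumPix / n;                        // mean of the sub-window
    double v = sumSq / n - M * M;                 // v = sigma^2 = (1/n) sum(x^2) - M^2
    double sigma = std::sqrt(v > 0.0 ? v : 1.0);  // guard against flat sub-windows
    return (pixel - M) / sigma;                   // x_normalized = (x - M) / sqrt(v)
}
```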

All the sub-windows that pass through all the stages are classified as containing a face. The position and size of these sub-windows are stored for drawing the rectangles around the faces. In the Mac project these values (the size and position of each sub-window in which a face is detected) are stored in a separate document to be used later for testing.


CHAPTER - IV

EXPERIMENTAL ANALYSIS

4.1 THEORY

The face detection classification model is a mapping of instances into classes of faces and non-faces. This is a two-class prediction problem, also called binary classification, in which the outcomes are labeled either as positive (p), i.e. faces, or negative (n), i.e. non-faces. There are four possible outcomes from any binary classifier, as shown below.

• True Positive (TP) - If the outcome of a prediction is that a given image is a face and it actually is a face, then the prediction is called a true positive.

• False Positive (FP) - However, if the image is not a face but it is predicted to be a face, then it is said to be a false positive.

• True Negative (TN) - A true negative occurs when the prediction is that the image is a non-face and it is actually not a face.

• False Negative (FN) - Similarly, a prediction is called a false negative when the prediction is that the image is a non-face while it is actually a face.

This can be formulated as a confusion matrix, as shown in Fig 4.1. A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Here the values on the horizontal axis denote the actual values, i.e. p for faces and n for non-faces. Similarly, the values on the vertical axis denote the outcome of the prediction, i.e. p´ for faces and n´ for non-faces.

Fig 4.1 - Confusion Matrix


• sensitivity or true positive rate (TPR), eqv. with hit rate, recall: TPR = TP / P = TP / (TP + FN)
• false positive rate (FPR), eqv. with fall-out: FPR = FP / N = FP / (FP + TN)
• accuracy (ACC): ACC = (TP + TN) / (P + N)
• specificity (SPC) or true negative rate: SPC = TN / N = TN / (FP + TN) = 1 − FPR
• positive predictive value (PPV), eqv. with precision: PPV = TP / (TP + FP)
• negative predictive value (NPV): NPV = TN / (TN + FN)
• false discovery rate (FDR): FDR = FP / (FP + TP)
• Matthews correlation coefficient (MCC)
• F1 score: F1 = 2TP / (P + P') = 2TP / (2TP + FP + FN)

Source: Fawcett (2006).


Let us consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (p) or negative (n) class. There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however if the actual value is n then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are n, and false negative (FN) is when the prediction outcome is n while the actual value is p.

To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive, but actually does not have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.

Let us define an experiment from P positive instances and N negative instances. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:

                      actual value
                      p                  n                  total
prediction   p'       True Positive      False Positive     P'
outcome      n'       False Negative     True Negative      N'
total                 P                  N

ROC space

The contingency table can derive several evaluation "metrics" (see infobox). To draw an ROC curve, only the true positive rate (TPR) and false positive rate (FPR) are needed. TPR determines a classifier or a diagnostic test performance on classifying positive instances correctly among all positive samples available during the test. FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.

A ROC space is defined by FPR and TPR as x and y axes respectively, which depicts relative trade-offs between true positive (benefits) and false positive (costs). Since TPR is equivalent with sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result or one instance of a confusion matrix represents one point in the ROC space.

The best possible prediction method would yield a point in the upper left corner


There are certain values that can be obtained from the above confusion matrix. These are useful for the analysis of the operation of the face detector and are as follows.

• True positive rate (TPR) - It determines the performance of the face detector in classifying positive instances correctly (true positives) among all positive samples available (faces) during the test. It is also called sensitivity.

• False positive rate (FPR) - On the other hand, the false positive rate defines how many false positive results occur among all negative samples (non-faces) available during the test.

These two values are defined as shown in Eq. (4.1), where TP denotes true positives, P the total number of actual positives, FN false negatives, FP false positives, TN true negatives and N the total number of negatives. A further value, the specificity, is also defined there.

4.2 ANALYSIS

The following sections describe the different cases that are to be analyzed and the methodologies employed for the analysis.

4.2.1 TEST SET OF IMAGES

To analyze the performance of the proposed face detector, a test set of 29 images was taken under different lighting conditions and with different backgrounds. This was done with an iPod Touch 4th generation using both the front and back cameras. Ground truth values of each of the test images were then recorded. These consist of the eye positions of all the faces present in each test image and also the names of the images which don't contain any face. These ground truth values are very important for the analysis of performance, to check whether the detected windows actually do contain faces.


TPR = \frac{TP}{P} = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{N} = \frac{FP}{FP + TN}, \qquad Specificity = 1 - FPR \qquad (4.1)


The analysis of the test images is done using the Mac project. Some of the test images

taken are shown below in Fig 4.2 (a) to (d).

Fig 4.2 (a) to (d) - Test images

4.2.2 DIFFERENT CASES TO BE ANALYZED

The analysis of the face detector is done for five different cases, as described below.

• Without pre-processing - This is just the normal Viola-Jones face detection framework running the first seven stages of the cascaded architecture.

• With skin colour detection as pre-processor - Here the skin colour detection algorithm is used as a pre-processing step to reject a number of sub-windows before the actual Viola-Jones face detection framework is carried out.

• With Canny edge detection as pre-processor - The Canny edge detection algorithm is used here to reject the sub-windows of the image that have too many or too few edges to contain a face.

• With both skin colour and Canny edge detection with an AND gate - Here both the skin colour detection and Canny edge detection algorithms are used. The sub-window is passed to the Viola-Jones face detection framework only if it passes through both of these pre-processing steps.

• With both skin colour and Canny edge detection with an OR gate - This is similar to the previous case, but the sub-window is rejected by pre-processing only if both the skin colour detection and the Canny edge detection algorithms reject it.

4.2.3 CALCULATION OF FPR AND TPR USING MATLAB

As mentioned earlier, all the test images are run through the Mac project under each of the above discussed cases. This returns a text document containing the position and size of the sub-windows in which a face was detected. The calculation of FPR and TPR of all the images under each case is done using a Matlab script, which requires the output text document from the Mac project and the ground truth values as inputs.

The Matlab script works by calculating the ideal image sub-window around each face in all the images using the ground truth data. This can be done because all the images that were used in the training of the classifiers have a fixed eye position; hence, with the eye positions from the ground truth data, the ideal size and position of the sub-window can be calculated. The script allows for a 1.5 times variation in size and a 33% variation in position from the calculated ideal sub-window. Hence all the possible sub-window positions and sizes are known for each face. This data is compared with the output of the Mac project to calculate the TPR and FPR values.

It is assumed that each face has to be detected only once, i.e. by at least one sub-window, to have a TPR of 1.0. Also, for the calculation of the FPR, the total number of windows is used in the denominator rather than the total number of possible negatives. This is because the total number of possible positives is negligibly small compared to the total number of windows and won't have a significant effect on the FPR. This is shown in Eq. (4.2), where Tot is the total number of windows, which is the sum of the number of positives (sub-windows containing faces) and the number of negatives (sub-windows containing no faces). As the number of positives is negligible, the total number of negatives can be approximated by the total number of windows:

FPR = \frac{FP}{N} \approx \frac{FP}{Tot}, \qquad Tot = P + N \approx N \qquad (4.2)


4.2.4 ANALYSIS OF SPEED

Each of the test images is loaded onto the iPod Touch 4th generation and the proposed face detection algorithm is run under each of the above described cases. The time for the face detection algorithm to calculate the output values, i.e. the position and size of each sub-window detected to contain a face, is measured. This is done using functions from the header file <sys/time.h>. Each of these time values is noted and compared.
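A minimal timing sketch using <sys/time.h> is shown below; runFaceDetector is a placeholder for the actual detection call, not a function from the project.

```cpp
#include <sys/time.h>
#include <cstdio>

static double nowSeconds() {
    timeval tv;
    gettimeofday(&tv, 0);               // wall-clock time with microsecond resolution
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void timeDetection() {
    double t0 = nowSeconds();
    // runFaceDetector(image);          // placeholder for the detection pipeline
    double t1 = nowSeconds();
    std::printf("detection took %.3f s\n", t1 - t0);
}
```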

4.3 EXPECTED RESULTS

Each case of the proposed face detection algorithm is expected to show the following results.

• Without pre-processing - This is expected to produce a number of false positives and hence the FPR should have a high value, because only the first seven stages of the Viola-Jones face detection framework are used. However, it is expected to detect all the faces in the image and thus have a TPR value close to 1.0. This algorithm is also expected to take the longest time amongst the five cases when run on the iPod Touch 4th generation.

• With skin colour detection as pre-processor - Here the number of false positives is expected to be reduced, so the FPR will be lower than in the first case. But this algorithm won't work well on dark, shadowy images where it is difficult to detect skin colour in the faces. Hence some of the faces will be missed and the TPR value will be slightly lower. This case is also expected to take much less time than the first method, as many of the sub-windows will be rejected by pre-processing.

• With Canny edge detection as pre-processor - Similar to the previous case, the FPR will be lower than in the first case but the TPR will be slightly less than 1.0. The processing time too will be lower than in the first case due to pre-processing.

• With both skin colour and Canny edge detection with an AND gate - Here the FPR will be even lower than in the previous two cases, as only very few sub-windows will pass the pre-processing stage. The TPR value will also be significantly lower, as will the time taken for processing on the iPod Touch 4th generation.




• With both skin colour and Canny edge detection with an OR gate - The number of false positives will be lower than in the case without pre-processing but higher than in the cases using skin colour detection and Canny edge detection separately. The TPR will also be closer to 1.0 than in those two cases. But the speed of processing will be lower than in the previous pre-processing cases, as the number of sub-windows allowed to pass through the pre-processing stage will have increased.


CHAPTER - V

SPECIFICATIONS

5.1 SOFTWARE USED FOR PROGRAMMING AND TESTING

5.1.1 XCODE

Xcode is a suite of tools developed by Apple Inc., used for developing software for Mac OS X and iOS. It was first released in 2003. The Xcode suite also includes most of Apple's developer documentation and the built-in Interface Builder, an application used to construct graphical user interfaces. It supports C, C++, Objective-C, Objective-C++, Java, AppleScript, Python and Ruby source code with a variety of programming models. Xcode version 4.3 was used for the project. This was the main application used for creating and running both the iOS 5 App and the Mac project used for testing.

5.1.2 MATLAB

MATLAB (Matrix Laboratory) is a numerical computing environment used for various mathematical and engineering analyses. Developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, etc. MATLAB version R2010a was used, for computing the statistical results (true positive rate and false positive rate) needed to compare all five cases.

5.1.3 LIBRARIES

The languages used for the project were C, C++ and Objective-C/C++. The following additional libraries were used beyond those found in Apple's developer documentation for Xcode.

• CImg Library - The CImg Library [b] is a small, open-source C++ toolkit for image processing. It mainly consists of a single header file, CImg.h, providing a set of C++ classes and functions that can be used to load/save, manage/process and display generic images. It was used in the Mac project for storing and displaying images at different parts of the program. A small usage sketch is given below.


• pugixml - pugixml is a fast, light-weight C++ XML processing library [k]. It is used in both the Mac and iOS projects for parsing the XML files that store the classifier information. A small usage sketch is given below.

5.2 HARDWARE

5.2.1 iMAC/ MACBOOK PRO

Macbook Pro - 2.3 GHz Intel Core i5,

4 GB RAM (1333 MHz DDR3),

Mac OS X Lion 10.7.3

iMAC - 3.06 GHz Intel Core i3,

4 GB RAM (1333 MHz DDR3),

Mac OS X Lion 10.7.2

5.2.2 iPOD TOUCH 4th GENERATION/ iPAD 2

iPod Touch 4th generation - 1GHz ARM Cortex - A8 processor,

256 MB DRAM,

iOS 5

iPad 2 - 1GHz Dual core Apple A5 processor

512 MB DDR2 RAM

iOS 5

5.3 SPECIFICATION OF COMPUTERS USED IN TRAINING

The RRZN (Regionales Rechenzentrum für Niedersachsen) computer cluster system offers massively parallel computing systems and computers with large main memory. These computers are available to all employees and students of the Leibniz University of Hannover. The "Tane" cluster of the RRZN was used for training. The specifications of the cluster system are as follows:

Number of nodes - 96

Processors per node - 12

Processor - Intel Xeon CPU X5670, 2.93GHz


Main memory per node - 48 GB

Local disk space - 80 GB per node

File systems - Lustre

Operating system - Scientific Linux

The training uses 10 nodes, i.e. 120 processors in total. A training round selecting one classifier takes approximately 20 minutes for a training set of approximately 24000 images. In later stages the bootstrapping process can take up to 1 hour, so training the classifiers takes approximately 100 hours in total.

5.4 GIT REPOSITORY

Git is a distributed version control system (DVCS) written in C [s]. It records the history of a collection of files and includes the functionality to revert them to another state. The collection of files is usually called source code. In a DVCS, every user has a complete copy of the source code, including its complete history, called a local repository. Each user can perform version control operations against this local copy, for example reverting to a previous version of the source code, or merging the current version with one created by a different user.

If a user makes changes to the source code, he/she can mark them as relevant for version control by adding them to the index (cache) and then committing them to the local repository. Git maintains all versions, so a user can revert to any point in the source code history.

Git synchronizes these local repositories with other (remote) repositories.

Owners of local repositories can synchronize changes via push (transferring changes to the

remote repository) or pull (getting changes from the remote repository). This is illustrated

in Fig. 5.1.

Fig. 5.1 - Storage levels in a Git repository [e]


A Git repository was used to maintain a complete history of the program. This was useful for reverting to previous versions, including changes made by others working on the same code, etc.

5.5 SIMD

SIMD stands for Single Instruction, Multiple Data. It is a class of parallel computing that is very useful for quickly applying the same instruction to a bulk of data, as shown in Fig. 5.2.

Fig. 5.2 - SIMD [i]

Consider, for example, changing the brightness of an image. Here, the R, G and B values of each pixel are read from memory, a value is added to or subtracted from them, and the resulting values are written back out to memory. A SIMD processor can improve this process: instead of a series of instructions saying "get this pixel, now get the next pixel", it has a single instruction that effectively says "get lots of pixels". This takes much less time than fetching each pixel individually, as with a traditional CPU design, and all operations on this block of pixels are then performed in parallel.

ARM's NEON technology, used in many mobile devices (including iPods, iPhones, etc.), is a 64/128-bit hybrid SIMD architecture designed to accelerate the performance of multimedia and signal processing applications, including video encoding and decoding, audio encoding and decoding, 3D graphics, and speech and image processing [a]. This technique is used in the iOS project to accelerate a few simple functions, such as squaring the image and BGRA to greyscale conversion. It could be extended to other parts of the algorithm, which could drastically improve performance. A minimal sketch of a NEON-accelerated operation is given below.


CHAPTER - VI

RESULTS

6.1 PERFORMANCE COMPARISON

The true positive rates and false positive rates of all five cases are shown in Table 6.1. The following conclusions can be drawn:

• Without pre-processing - All the faces are detected in this case (TPR is 1.0), but there is a high number of false positives.

• With Skin colour detection as pre-processor - The number of false positives is reduced by almost a factor of 4.5. But this case is unfavourable, as too many faces go undetected, i.e. the TPR is too low.

• With Canny edge detection as pre-processor - Similar to the previous case, the FPR is reduced by a large factor (3.5), but the TPR is still too low.

• With both Skin colour and Canny edge detection combined with an AND gate - The number of false positives is the lowest among all the cases, but this configuration also fails to detect the most faces. Hence this case too is unfavourable.

• With both Skin colour and Canny edge detection combined with an OR gate - This case has a considerably lower FPR, reduced by almost a factor of 3. The TPR is also at an acceptable level of around 84%.

Pre-processing                TPR     FPR (x 10^-4)
Without Pre-processing        1.0     1.42
With Skin Colour detection    0.65    0.32
With Canny Edge detection     0.57    0.41
With Both using AND gate      0.34    0.16
With Both using OR gate       0.84    0.57

Table 6.1 - Performance comparison


From this analysis it was found that the case using both skin colour and Canny edge detection combined with an OR gate yields the best results.

6.2 SPEED COMPARISON

The speed-up factors of all five cases, relative to the case without pre-processing, are shown in Chart 6.1. The following conclusions can be drawn:

• Without pre-processing - This is the reference case, with a value of 1.0.

• With Skin colour detection as pre-processor - As expected, the speed-up factor is high, as a large number of sub-windows are rejected by pre-processing.

• With Canny edge detection as pre-processor - The speed-up factor is high, though not as high as in the previous case, since the Canny edge detection algorithm is more time-consuming than skin colour detection.

• With both Skin colour and Canny edge detection combined with an AND gate - This case has the highest speed-up factor, as it rejects the most sub-windows. This is consistent with the low TPR seen in the previous section.

• With both Skin colour and Canny edge detection combined with an OR gate - This case still has a considerably high speed-up factor of almost 2.3.

Without Pre-processing        1.0
With Skin colour detection    4.0
With Canny edge detection     2.86
With both using AND gate      4.07
With both using OR gate       2.28

Chart 6.1 - Speed-up comparison


6.3 EXAMPLE IMAGES

Fig. 6.1 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f) With both using OR gate


Fig. 6.2 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f) With both using OR gate


Fig. 6.3 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f) With both using OR gate


Fig. 6.4 (a) to (f) - (a) Original Image, (b) Without Pre-processing, (c) With Skin colour detection, (d) With Canny edge detection, (e) With both using AND gate, (f) With both using OR gate


CHAPTER - VII

APPENDIX

7.1 SKIN COLOUR DETECTION

#include <iostream>
#include <cstdio>
#include "SkinColourDetect.h"

void doSkinColourDetection(unsigned char *imageSkinColourDst,
                           unsigned char *imageSkinColourSrc, int W, int H)
{
    for (int i = 0, k = 0; i < W * H * 4; i += 4, k++)
    {
        unsigned char Value, tmp;
        double Saturation, Hue, Cr, Cg, Cb;
        unsigned char B = imageSkinColourSrc[i],     // Read blue value of pixel k (BGRA layout)
                      G = imageSkinColourSrc[i + 1], // Read green value
                      R = imageSkinColourSrc[i + 2]; // Read red value

        /* VALUE - find the maximum of R, G and B */
        if ((R >= G) && (R >= B))      Value = R;
        else if ((G >= R) && (G >= B)) Value = G;
        else                           Value = B;

        /* SATURATION - find the minimum of R, G and B */
        if ((R <= G) && (R <= B))      tmp = R;
        else if ((G <= R) && (G <= B)) tmp = G;
        else                           tmp = B;

        if (Value == 0)
            Saturation = 0;
        else
            Saturation = ((double)Value - (double)tmp) / (double)Value;

        /* HUE */
        if (Saturation == 0)
            Hue = -1;
        else
        {
            Cr = ((double)Value - (double)R) / ((double)Value - (double)tmp);
            Cg = ((double)Value - (double)G) / ((double)Value - (double)tmp);
            Cb = ((double)Value - (double)B) / ((double)Value - (double)tmp);
            if (R == Value) Hue = Cb - Cg;
            if (G == Value) Hue = 2 + Cr - Cb;
            if (B == Value) Hue = 4 + Cg - Cr;
            Hue *= 60;
            if (Hue < 0) Hue += 360;
        }

        /* DETECT SKIN COLOUR - mark the pixel if hue and saturation
           fall inside the skin colour range */
        if ((Hue >= 0.0) && (Hue <= 50.0) &&
            (Saturation >= 0.23) && (Saturation <= 0.68))
            imageSkinColourDst[k] = 1;
        else
            imageSkinColourDst[k] = 0;
    }
}

7.2 CANNY PRUNING

#include <iostream>
#include <cmath>
#include "CannyPruning.h"

int edgeDir[240][320];    // Stores the edge direction of each pixel
float gradient[240][320]; // Stores the gradient strength of each pixel

void doCanny(unsigned char *imgData, int W, int H)
{
    unsigned int row, col;   // Pixel's row and col positions
    unsigned long i;         // Index into the row-column vector
    int upperThreshold = 60; // Gradient strength necessary to start an edge
    int lowerThreshold = 30; // Minimum gradient strength to continue an edge
    int rowOffset;           // Row offset from the current pixel
    int colOffset;           // Col offset from the current pixel
    int rowTotal = 0;        // Row position of the offset pixel
    int colTotal = 0;        // Col position of the offset pixel
    int Gx;                  // Sum of Sobel mask products in the x direction
    int Gy;                  // Sum of Sobel mask products in the y direction
    float thisAngle;         // Gradient direction based on Gx and Gy
    int newAngle;            // Approximation of the gradient direction
    bool edgeEnd;            // Whether the edge has reached the image border
    int GxMask[3][3];        // Sobel mask in the x direction
    int GyMask[3][3];        // Sobel mask in the y direction
    int gaussianMask[5][5];  // Gaussian smoothing mask
    int newPixel;            // Accumulator for the Gaussian convolution

    for (row = 0; row < H; row++)
        for (col = 0; col < W; col++)
            edgeDir[row][col] = 0;

    Hist_Eq(imgData, W, H);

    /* Declare Sobel masks */
    GxMask[0][0] = -1; GxMask[0][1] = 0; GxMask[0][2] = 1;
    GxMask[1][0] = -2; GxMask[1][1] = 0; GxMask[1][2] = 2;
    GxMask[2][0] = -1; GxMask[2][1] = 0; GxMask[2][2] = 1;

    GyMask[0][0] =  1; GyMask[0][1] =  2; GyMask[0][2] =  1;
    GyMask[1][0] =  0; GyMask[1][1] =  0; GyMask[1][2] =  0;
    GyMask[2][0] = -1; GyMask[2][1] = -2; GyMask[2][2] = -1;

    /* Declare Gaussian mask */
    gaussianMask[0][0] = 2; gaussianMask[0][1] = 4;  gaussianMask[0][2] = 5;  gaussianMask[0][3] = 4;  gaussianMask[0][4] = 2;
    gaussianMask[1][0] = 4; gaussianMask[1][1] = 9;  gaussianMask[1][2] = 12; gaussianMask[1][3] = 9;  gaussianMask[1][4] = 4;
    gaussianMask[2][0] = 5; gaussianMask[2][1] = 12; gaussianMask[2][2] = 15; gaussianMask[2][3] = 12; gaussianMask[2][4] = 5;
    gaussianMask[3][0] = 4; gaussianMask[3][1] = 9;  gaussianMask[3][2] = 12; gaussianMask[3][3] = 9;  gaussianMask[3][4] = 4;
    gaussianMask[4][0] = 2; gaussianMask[4][1] = 4;  gaussianMask[4][2] = 5;  gaussianMask[4][3] = 4;  gaussianMask[4][4] = 2;

    /* Gaussian smoothing */
    for (row = 2; row < H - 2; row++) {
        unsigned long val = row * W;
        for (col = 2; col < W - 2; col++) {
            newPixel = 0;
            for (rowOffset = -2; rowOffset <= 2; rowOffset++) {
                rowTotal = row + rowOffset;
                const unsigned char *const rowTemp = &imgData[rowTotal * W];
                for (colOffset = -2; colOffset <= 2; colOffset++) {
                    colTotal = col + colOffset;
                    newPixel += rowTemp[colTotal] * gaussianMask[2 + rowOffset][2 + colOffset];
                }
            }
            i = (unsigned long)(val + col);
            *(imgData + i) = newPixel / 159;
        }
    }

    /* Determine edge directions and gradient strengths */
    for (row = 1; row < H - 1; row++) {
        for (col = 1; col < W - 1; col++) {
            Gx = 0;
            Gy = 0;
            /* Sum of the Sobel masks times the nine surrounding pixels */
            for (rowOffset = -1; rowOffset <= 1; rowOffset++) {
                rowTotal = row + rowOffset;
                const unsigned char *const rowTemp = &imgData[rowTotal * W];
                for (colOffset = -1; colOffset <= 1; colOffset++) {
                    colTotal = col + colOffset;
                    Gx += rowTemp[colTotal] * GxMask[rowOffset + 1][colOffset + 1];
                    Gy += rowTemp[colTotal] * GyMask[rowOffset + 1][colOffset + 1];
                }
            }
            gradient[row][col] = sqrt(pow(Gx, 2.0) + pow(Gy, 2.0)); // Gradient strength
            thisAngle = (atan2(Gx, Gy) / 3.14159) * 180.0;          // Actual direction of the edge

            /* Convert the actual edge direction to an approximate value */
            if (((thisAngle < 22.5) && (thisAngle > -22.5)) || (thisAngle > 157.5) || (thisAngle < -157.5))
                newAngle = 0;
            if (((thisAngle > 22.5) && (thisAngle < 67.5)) || ((thisAngle < -112.5) && (thisAngle > -157.5)))
                newAngle = 45;
            if (((thisAngle > 67.5) && (thisAngle < 112.5)) || ((thisAngle < -67.5) && (thisAngle > -112.5)))
                newAngle = 90;
            if (((thisAngle > 112.5) && (thisAngle < 157.5)) || ((thisAngle < -22.5) && (thisAngle > -67.5)))
                newAngle = 135;

            edgeDir[row][col] = newAngle; // Store the approximate edge direction of each pixel
        }
    }

    /* Trace along all the edges in the image */
    for (row = 1; row < H - 1; row++) {
        unsigned long val = row * W;
        for (col = 1; col < W - 1; col++) {
            edgeEnd = false;
            /* Check whether the current pixel's gradient is strong enough to start an edge */
            if (gradient[row][col] > upperThreshold) {
                switch (edgeDir[row][col]) {
                    case 0:   findEdge(0, 1, row, col, 0, lowerThreshold, imgData);    break;
                    case 45:  findEdge(1, 1, row, col, 45, lowerThreshold, imgData);   break;
                    case 90:  findEdge(1, 0, row, col, 90, lowerThreshold, imgData);   break;
                    case 135: findEdge(1, -1, row, col, 135, lowerThreshold, imgData); break;
                    default:
                        i = (unsigned long)(val + col);
                        *(imgData + i) = 0;
                        break;
                }
            } else {
                i = (unsigned long)(val + col);
                *(imgData + i) = 0;
            }
        }
    }

    /* Suppress any pixels not changed by the edge tracing:
       if a pixel is neither edge (1) nor background (0), make it background */
    for (row = 0; row < H; row++) {
        unsigned long val = row * W;
        for (col = 0; col < W; col++) {
            i = (unsigned long)(val + col);
            if ((*(imgData + i) != 1) && (*(imgData + i) != 0))
                *(imgData + i) = 0;
        }
    }

    /* Non-maximum suppression */
    for (row = 1; row < H - 1; row++) {
        unsigned long val = row * W;
        for (col = 1; col < W - 1; col++) {
            i = (unsigned long)(val + col);
            if (*(imgData + i) == 1) { // The current pixel is an edge
                switch (edgeDir[row][col]) {
                    case 0:   suppressNonMax(1, 0, row, col, 0, lowerThreshold, imgData);   break;
                    case 45:  suppressNonMax(1, -1, row, col, 45, lowerThreshold, imgData); break;
                    case 90:  suppressNonMax(0, 1, row, col, 90, lowerThreshold, imgData);  break;
                    case 135: suppressNonMax(1, 1, row, col, 135, lowerThreshold, imgData); break;
                    default: break;
                }
            }
        }
    }
}

void findEdge(int rowShift, int colShift, int row, int col, int dir,
              int lowerThreshold, unsigned char *imgData)
{
    int W = 320;
    int H = 240;
    int newRow;
    int newCol;
    unsigned long i;
    bool edgeEnd = false;

    /* Find the row and column of the next possible pixel on the edge;
       if the next pixel would be off the image, skip the while loop */
    if (colShift < 0) {
        if (col > 0) newCol = col + colShift;
        else         edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true;

    if (rowShift < 0) {
        if (row > 0) newRow = row + rowShift;
        else         edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    /* Follow the edge while the direction matches and the gradient stays strong */
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd &&
           (gradient[newRow][newCol] > lowerThreshold)) {
        /* Mark the new pixel as an edge */
        i = (unsigned long)(newRow * W + newCol);
        *(imgData + i) = 1;
        if (colShift < 0) {
            if (newCol > 0) newCol = newCol + colShift;
            else            edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0) newRow = newRow + rowShift;
            else            edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
    }
}

void suppressNonMax(int rowShift, int colShift, int row, int col, int dir,
                    int lowerThreshold, unsigned char *imgData)
{
    int W = 320;
    int H = 240;
    int newRow = 0;
    int newCol = 0;
    unsigned long i;
    bool edgeEnd = false;
    float nonMax[320][3]; // Temporarily stores gradients and positions of pixels in parallel edges
    int pixelCount = 0;   // Number of pixels in parallel edges
    int count;            // Loop counter
    int max[3];           // Maximum point in a wide edge

    if (colShift < 0) {
        if (col > 0) newCol = col + colShift;
        else         edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true; // If the next pixel would be off the image, skip the while loop
    if (rowShift < 0) {
        if (row > 0) newRow = row + rowShift;
        else         edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    /* Find non-maximum parallel edges tracing up */
    i = (unsigned long)(newRow * W + newCol);
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd && (*(imgData + i) == 1)) {
        if (colShift < 0) {
            if (newCol > 0) newCol = newCol + colShift;
            else            edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0) newRow = newRow + rowShift;
            else            edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
        nonMax[pixelCount][0] = newRow;
        nonMax[pixelCount][1] = newCol;
        nonMax[pixelCount][2] = gradient[newRow][newCol];
        pixelCount++;
        i = (unsigned long)(newRow * W + newCol);
    }

    /* Find non-maximum parallel edges tracing down */
    edgeEnd = false;
    colShift *= -1;
    rowShift *= -1;
    if (colShift < 0) {
        if (col > 0) newCol = col + colShift;
        else         edgeEnd = true;
    } else if (col < W - 1) {
        newCol = col + colShift;
    } else
        edgeEnd = true;
    if (rowShift < 0) {
        if (row > 0) newRow = row + rowShift;
        else         edgeEnd = true;
    } else if (row < H - 1) {
        newRow = row + rowShift;
    } else
        edgeEnd = true;

    i = (unsigned long)(newRow * W + newCol);
    while ((edgeDir[newRow][newCol] == dir) && !edgeEnd && (*(imgData + i) == 1)) {
        if (colShift < 0) {
            if (newCol > 0) newCol = newCol + colShift;
            else            edgeEnd = true;
        } else if (newCol < W - 1) {
            newCol = newCol + colShift;
        } else
            edgeEnd = true;
        if (rowShift < 0) {
            if (newRow > 0) newRow = newRow + rowShift;
            else            edgeEnd = true;
        } else if (newRow < H - 1) {
            newRow = newRow + rowShift;
        } else
            edgeEnd = true;
        nonMax[pixelCount][0] = newRow;
        nonMax[pixelCount][1] = newCol;
        nonMax[pixelCount][2] = gradient[newRow][newCol];
        pixelCount++;
        i = (unsigned long)(newRow * W + newCol);
    }

    /* Suppress non-maximum edges: find the strongest recorded pixel,
       then clear the recorded parallel-edge pixels */
    max[0] = 0;
    max[1] = 0;
    max[2] = 0;
    for (count = 0; count < pixelCount; count++) {
        if (nonMax[count][2] > max[2]) {
            max[0] = nonMax[count][0];
            max[1] = nonMax[count][1];
            max[2] = nonMax[count][2];
        }
    }
    for (count = 0; count < pixelCount; count++) {
        i = (unsigned long)(nonMax[count][0] * W + nonMax[count][1]);
        *(imgData + i) = 0;
    }
}

void Hist_Eq(unsigned char *img_data, int width, int height)
{
    unsigned long hist[256];
    double s_hist_eq[256] = {0.0}, sum_of_hist[256] = {0.0};
    long i, k, l, n;
    n = width * height;

    for (i = 0; i < 256; i++)
        hist[i] = 0;
    for (i = 0; i < n; i++) { // Histogram of the image
        l = img_data[i];
        hist[l]++;
    }
    for (i = 0; i < 256; i++) // pdf of the image
        s_hist_eq[i] = (double)hist[i] / (double)n;
    sum_of_hist[0] = s_hist_eq[0];
    for (i = 1; i < 256; i++) // cdf of the image
        sum_of_hist[i] = sum_of_hist[i - 1] + s_hist_eq[i];
    for (i = 0; i < n; i++) { // Map each grey level through the cdf
        k = img_data[i];
        img_data[i] = (unsigned char)round(sum_of_hist[k] * 255.0);
    }
}


CHAPTER - VIII

CONCLUSION

As can be clearly observed from the analysis, the case with both Skin colour and Canny edge detection combined with an OR gate gives the best results: it reduces the FPR by a factor of almost 3 while maintaining a high true positive rate of 84%, and gives a speed-up by almost a factor of 2.3.

The value of TPR obtained is not a true indication of the efficiency of the face detector, as the number of images used for testing was very low. Within this set, the method failed to detect only four faces, all of which were taken under low lighting. One example of an undetected face is shown in Fig. 3.2. As can be seen, this is a particularly difficult case, so it is understandable why neither skin colour nor Canny edge detection works satisfactorily on it. If these aberrations are eliminated from the analysis, the TPR obtained will be much higher.

It was a great pleasure working in the laboratory atmosphere at Leibniz University. It is an

ambience which is very conducive to thesis work. I am indebted to the guidance of Prof

Dr-Ing Bodo Rosenhahn and the valuable inputs given by my supervisors Mr. Arne Ehlers,

Mr Björn Scheuermann and Mr. Florian Baumann.


CHAPTER - IX

FUTURE ENHANCEMENTS

Some of the future developments that could enhance the working of the project are mentioned below:

• Analysis of exact thresholds for Canny and Skin colour detection - The threshold values used for Canny edge detection and Skin colour detection (45% - 90% for Canny and 16.7% - 22.7% for skin colour detection) were chosen simply by trying out various values on the 29 test images. A proper analysis to determine the exact threshold values for the white count within each sub-window should be done; this may drastically improve the results.

• Speed-up using NEON intrinsics - As mentioned before, the SIMD processing of the ARM processor can be used to greatly improve the speed of computation.

• Speed-up using the GPU - The graphics processing unit of the mobile device may be used to further speed up the processing.

• Pre-processing used in training - The pre-processing techniques may be included in the training of the features to create better classifiers.


REFERENCES

1. Crow. F, (1984), ‘Summed-area tables for texture mapping’, Proceedings of SIGGRAPH, Vol.18, No.3, pp.207–212.

2. Gonzalez and Woods, (2008), ‘Digital Image Processing’, Third edition, Prentice Hall, Noida.

3. John Canny, (1986), ‘A computational approach to edge detection’, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol.8, No.6, pp.679–698.

4. Masoud Nosrati, Ronak Karimi, Mehdi Hariri, (2012), ‘Detecting circular shapes from aerial images using median filter and CHT’, World Applied Programming, Vol.2, pp.49-54.

5. Michael Kearns and Leslie G. Valiant, (1994), ‘Cryptographic limitations on learning Boolean formulae and finite automata’, Journal of the Association for Computing Machinery, Vol.41, No.1, pp.67–95.

6. Michael Kearns and Leslie G. Valiant, (1988), ‘Learning Boolean formulae or finite automata is as hard as factoring’, Technical Report TR-14-88, Harvard University Aiken Computation Laboratory.

7. Rick Kjeldsen and John R. Kender, (1996), ‘Finding skin in color images’, in proceedings of 2nd International Conference on Automatic Face and Gesture Recognition 96, pp.312-317.

8. Robert E. Schapire, (1990), ‘The strength of weak learnability’, Machine Learning, Vol.5, No.2, pp.197–227.

9. Sanjay Kr. Singh and D. S. Chauhan and Mayank Vatsa and Richa Singh, (2003), ‘A Robust Skin Color Based Face Detection Algorithm’, Tamkang Journal of Science and Engineering, Vol.6, pp. 227-234.

10. Viola, P., Jones, M.J, (2004), ‘Robust real-time face detection’, International Journal of Computer Vision, Vol. 57, pp.137–154.

11. Vladimir Vezhnevets, Vassili Sazonov, and Alla Andreeva, (2003), ‘A survey on pixel-based skin color detection techniques’, Proc. Graphicon-2003, pp.85- 92.

12. Yoav Freund and Robert E. Schapire, (1999), ‘A short introduction to boosting’, Journal of Japanese Society for Artificial Intelligence, Vol.14, No.5, pp.771-780.


WEBSITES

a. http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
b. http://cimg.sourceforge.net/index.shtml
c. http://en.wikipedia.org/wiki/Artificial_neural_network
d. http://en.wikipedia.org/wiki/Canny_edge_detector
e. http://en.wikipedia.org/wiki/Git_(software)#Source_code_hosting
f. http://en.wikipedia.org/wiki/HSL_and_HSV
g. http://en.wikipedia.org/wiki/Machine_learning
h. http://en.wikipedia.org/wiki/Perceptron
i. http://en.wikipedia.org/wiki/SIMD
j. http://en.wikipedia.org/wiki/Viola–Jones_object_detection_framework
k. http://pugixml.org/
l. http://smig.usgs.gov/SMIG/features_0902/tualatin_ann.html
m. http://www.chasanc.com/index.php/Image-Processing/Histogram-Equalization-HE.html
n. http://www.chasanc.com/index.php/Image-Processing/Skin-Color-Detection-with-HSV-Lookup.html
o. http://www.classle.net/sites/default/files/text/36461/cannyedge_detection_0.pdf
p. http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
q. http://www.pages.drexel.edu/~nk752/cannyTut2.html
r. http://www.techradar.com/news/software/applications/how-face-detection-works-703173
s. http://www.vogella.de/articles/Git/article.html
