Trabajo Fin de Grado
Ingeniería Electrónica, Robótica y Mecatrónica
Indoor Facial Detection and Recognition Application
Autor: Pătraşcu Viorica Andreea
Tutor: Jesus Capitan Fernandez
Dep. Ingeniería de Sistemas y Automática
Escuela Técnica Superior de Ingeniería
Universidad de Sevilla
Sevilla, 2016
Trabajo Fin de Grado
Ingeniería Electrónica, Robótica y Mecatrónica
Indoor Facial Detection and Recognition
Application
Autor:
Pătraşcu Viorica Andreea
Tutor:
Jesus Capitan Fernandez
Profesor Ayudante Doctor
Dep. Ingeniería de Sistemas y Automática
Escuela Técnica Superior de Ingeniería
Universidad de Sevilla
Sevilla, 2016
Trabajo Fin de Grado: Indoor Facial Detection and Recognition Application
Autor: Pătraşcu Viorica Andreea
Tutor: Jesus Capitan Fernandez
El tribunal nombrado para juzgar el Trabajo arriba indicado, compuesto por los siguientes miembros:
Presidente:
Vocales:
Secretario:
Acuerdan otorgarle la calificación de:
Sevilla, 2016
El Secretario del Tribunal
Acknowledgements
Through my educational journey, I was lucky enough to meet several mentors who had a great influence on
me and truly shaped the way I am today. To all of them, I say thank you!
Sincere appreciation to my family and my dearest friends for supporting me through the past year and always
encouraging me not to give up.
And last but not least, I want to express my gratitude towards all the beautiful people I have met here, in
Seville. Thank you for giving me a sense of belonging during my Erasmus experience!
Viorica Andreea Pătraşcu
Seville, 2016
Abstract
Automatic face recognition represents a fascinating biometric identification method which uses the same
identifier as humans do to distinguish one person from another: their faces. Although it is a new concept, this
technology has evolved incredibly fast and has reached such a level where it can even distinguish between
identical twins. Face recognition has many applications in our daily life, such as: security surveillance, access
control, smart payment cards and even helping individuals who suffer from Prosopagnosia disorder.
For my Final Year Project I implemented a C# software application which detects and recognizes human faces
in an input image or video frame from a live video source. This system could be successfully used inside an
office, a household or any indoor space, where the working environment is a constrained one.
As a background for our topic, this report gives a brief introduction about Artificial Intelligence, Machine
Learning and Computer Vision concepts. Then it covers in detail the theory behind the Viola-Jones method for
Face Detection and the Eigenfaces algorithm for Face Recognition. Last but not least, it describes the
application implementation, the experimental results and it finally outlines the advantages of the system
created.
Table of Contents
Acknowledgements 7
Abstract 8
Table of Contents 9
List of Tables 11
List of Figures 12
1 Introduction 15
1.1 Overview 15
1.2 Motivation 15
1.3 About the project 16
1.4 Achievements 16
1.5 Objectives 16
1.6 About the report 16
2 Background 17
2.1 Artificial Intelligence 17
2.2 Computer Vision 18
2.3 Machine Learning (ML) 19
2.3.1 Supervised and Unsupervised Dataset 19
2.3.2 Discriminative and Generative Models 20
2.3.3 Machine Learning and Computer Vision 21
2.3.4 Variable Significance 21
2.3.5 Common Problems 21
2.3.6 Cross-validation, Bootstrapping, ROC curves and Confusion matrices 22
2.3.7 Binary Decision Trees 23
2.3.8 Boosting 24
3 Face Detection 25
3.1 The Viola-Jones Detection Framework - Theory 25
3.1.1 The Haar-type Features 26
3.1.2 The Integral Image 26
3.1.3 Adaboost 30
3.1.4 Cascade Filter 34
3.2 The Viola-Jones Detection Framework – Experimental Results 36
4 Face Recognition 37
4.1 Face Recognition Overview 37
4.2 The Eigenfaces Recognizer - Theory 38
4.2.1 Steps to obtain the eigenfaces 40
4.2.2 Preprocessing the images 40
4.2.3 The Covariance Matrix 42
4.2.4 Eigenvectors and Eigenvalues 44
4.2.5 The Eigendecomposition of a Covariance Matrix 45
4.2.6 Principal Component Analysis (PCA) 48
4.2.7 PCA in Computer Vision 49
4.2.8 Pseudocode Eigenfaces Recognizer 53
4.3 The Eigenfaces Recognizer – Experimental Results 53
5 Implementation and Testing Results 55
5.1 Required Software 55
5.1.1 Microsoft Visual Studio 55
5.1.2 The EmguCV Library 58
5.1.3 Microsoft Access 59
5.2 Application Configuration and Testing Results 60
5.2.1 Face Detection Form 61
5.2.2 Face Recognition Form 66
5.3 Some Implementation Details 70
Bibliography 75
LIST OF TABLES
Table 4-1. Matrix of pixel values. 41
Table 4-2. Histogram equalization steps. 42
LIST OF FIGURES
Figure 2-1. Inference with the scope of propositional logic. 17
Figure 2-2. The Turing machine structure. 18
Figure 2-3. Machine learning algorithms available in the OpenCV library. 20
Figure 2-4. Underfitting, best fitting and overfitting. 21
Figure 2-5. The ROC curve and the confusion matrix. 23
Figure 2-6. Decision tree impurity measures. 24
Figure 3-1. Some Haar-type features. 26
Figure 3-2. The input image and the integral image. 27
Figure 3-3. The four references needed in the integral image. 28
Figure 3-4. The integral image and the input image. 28
Figure 3-5. Edge feature – 6 references. 29
Figure 3-6. Line feature – 8 references. 29
Figure 3-7. Four rectangular feature – 9 references. 29
Figure 3-8. DS1(original dataset), C1(trained classifier 1), C1’s mistakes. 31
Figure 3-9. DS2 (weighted dataset 2), C2 (trained classifier 2), C2’s mistakes. 31
Figure 3-10. DS3 (weighted dataset 3), C3 (trained classifier 3) 32
Figure 3-11. The final classifier. 32
Figure 3-12. Cascade classifier. 35
Figure 3-13. “haarcascade_frontalface_default”, available in the EmguCV library. 35
Figure 3-14. ROC characteristic. 36
Figure 4-1. Structured light. 38
Figure 4-2. XBOX 360. 38
Figure 4-3. Example of eigenfaces. 39
Figure 4-4. Histogram equalization. 41
Figure 4-5. The effect of a transformation matrix T. 44
Figure 4-6. A dataset in a 2-dimensional space. 45
Figure 4-7. Diagonal Covariance Matrix. 46
Figure 4-8. Non-diagonal Covariance Matrix. 47
Figure 4-9. White data and Observed data. 48
Figure 4-10. Changing the data space with PCA. 49
Figure 4-11. Digital image stored as a vector. 50
Figure 4-12. Database matrix T. 50
Figure 4-13. The covariance matrix of the database matrix T 51
Figure 4-14. The covariance matrix of the database matrix TT 51
Figure 4-15. The “Eigenfaces for recognition” experiment (1991). 54
Figure 5-1. Microsoft Visual Studio 2010 Environment. 56
Figure 5-2. C# code stages. 57
Figure 5-3. EmguCV layers. 58
Figure 5-4. Microsoft Access 2010. 59
Figure 5-5. Start Form. 60
Figure 5-6. Face Detection Form. 61
Figure 5-7. The scan window. 62
Figure 5-8. Detection rectangles overlapping. 62
Figure 5-9. Minimum Neighbors = 1. 63
Figure 5-10. Minimum Neighbors = 10. 63
Figure 5-11. Minimum Neighbors = 4. 64
Figure 5-12. Minimum Neighbors = 0. 64
Figure 5-13. Scale increase rate = 1.2. 65
Figure 5-14. Scale increase rate = 1.4. 65
Figure 5-15. The database used. 66
Figure 5-16. Face Recognition Form. 67
Figure 5-17. Threshold value = 500. 67
Figure 5-18. Threshold value = 3500. 68
Figure 5-19. Threshold value = 5000. 68
Figure 5-20. Normal light conditions. 69
Figure 5-21. Strong light on the right side of the face. 69
1 INTRODUCTION
1.1 Overview
Nowadays, biometrics plays a vital role in our everyday life. Since it is highly secure and convenient, our
society makes use of this technology almost everywhere, from airport surveillance to intelligent houses.
Compared to other biometric solutions, face recognition yields greater advantages because it does not
require the subject’s interaction or permission. From this point of view, it represents a fast and effective way to
increase our security level.
Automated facial recognition is a modern concept. It was born in the 1960s and it is still under constant
development today. In 2006, the Face Recognition Grand Challenge (FRGC) project evaluated the facial
recognition algorithms available at that time. Tests involved 3D scans, high quality images and iris
photographs. The FRGC proved that the algorithms available then were 10 times more precise than those of
2002 and 100 times more precise than those of 1995. Some recognition methods were able to outperform humans in
recognizing faces and could even distinguish between identical twins.
1.2 Motivation
I was fascinated by the fast pace at which facial recognition technologies have developed and the progress
they have made to reach their current level. In high school, when I first heard about this field, it seemed
extremely compelling, yet rather difficult to grasp as a whole. However, as the years went by, my
academic studies helped me see concepts like Artificial Intelligence in a new light. For this reason, over the
past year I wanted to take a closer look at how a face recognition system works and implement one myself.
I have no special talent. I am only passionately curious.
- Albert Einstein -
1.3 About the project
For my Final Year Project I implemented a C# software application which detects and recognizes human faces
in an input image or video frame from a live video source. I developed it in Microsoft Visual Studio 2010
using EmguCV (a cross platform .NET wrapper for the OpenCV image processing library) and a Microsoft
Access database file.
Despite the fact that face recognition builds upon face detection, they represent different concepts. This is
why I have chosen to divide my application into two modules, one for each problem. The system can detect,
count and extract faces in a given image or video frame. The program can also recognize people’s faces if they
are registered in the Microsoft Access database. The user can add new faces to the database together with their
names or delete existing faces.
The algorithms behind the software application are the following: Viola-Jones method for face detection and
Eigenfaces for face recognition.
1.4 Achievements
This report aims to present in detail how a face detection and recognition system can be built, to outline the
theoretical frameworks used and, last but not least, to describe the connections between such systems and
related fields, including Machine Learning and Computer Vision.
I claim to have succeeded in implementing a face recognition software application which can be used for
security purposes in constrained environments. For instance, one can install such a program in order to control
the access inside an office or household, without any concerns regarding identity theft.
1.5 Objectives
This project represents my first attempt to implement a facial detection and recognition system. In the near
future, I definitely intend to study other algorithms available today and improve my application. Furthermore,
I look forward to deepening my knowledge and potentially working in the field of Artificial Intelligence someday.
1.6 About the report
The remainder of my Final Year Project report is divided into 6 chapters. They are as follows:
Chapter 2 presents concepts such as Artificial Intelligence, Machine Learning and Computer Vision.
Chapter 3 covers the theory behind Viola-Jones method for Face Detection.
Chapter 4 explains the Eigenfaces algorithm for Face Recognition.
Chapter 5 describes implementation and testing details about the software application.
Chapter 6 concludes this report and highlights the advantages of the system.
2 BACKGROUND
2.1 Artificial Intelligence
Artificial intelligence (AI) can be defined as the intelligence exhibited by computers. An intelligent
agent perceives its surrounding environment and acts accordingly, even in unknown scenarios. It is a
flexible machine capable of performing “cognitive” tasks such as learning in order to solve different
problems. Some modern examples of artificially intelligent computers include systems that can play and win
chess games (“Deep Blue”) or even self-driving cars that can navigate through crowded roads.
Today, many goals have been reached in this scientific area: learning, planning, reasoning, communication,
perception and manipulation. However, Artificial General Intelligence (AGI) still remains a primary goal and a
topic for science-fiction writers and futurists all over the world.
Going back to the origins, the first functional calculating machine was constructed by the scientist Wilhelm
Schickard around 1623. In the 19th century, the mathematician and philosopher George Boole formulated the
“propositional calculus” (or “sentential logic”) and Gottlob Frege developed what is known today as
“first-order (predicate) logic”, both representations still being used nowadays.
Figure 2-1. Inference with the scope of propositional logic.
Success in creating Artificial Intelligence would be the
biggest event in human history.
-Stephen Hawking -
In 1936, Alan Turing introduced the mathematical model of “computation”, inventing an abstract machine
named after him. Nowadays, nearly all programming languages are Turing-complete, i.e. capable of simulating
a Turing machine. This invention inspired researchers all over the world to build an “electronic brain”.
Figure 2-2. The Turing machine structure.
The area of artificial intelligence research was officially founded at a conference on the campus of Dartmouth
College in 1956. At the beginning of the 21st century, AI reached its apogee and started to be used in the
industrial sector, in logistics and in medical diagnosis.
Today, we can proudly talk about IBM’s question-answering system (“Watson”), about the “Kinect”,
which provides a 3D body-motion interface for the Xbox One and Xbox 360, and, last but not least, about the
intelligent personal assistants in our smartphones. Although this discipline has evolved so much lately, still no
agent has managed to pass the Turing test formulated in 1950, exhibiting the same intelligence as a human being.
2.2 Computer Vision
Computer Vision is a scientific discipline that includes methods for acquiring, processing, analyzing and
finally interpreting images from the world we live in. This field is concerned with the theory of extracting
information from input images using models and learning algorithms. These days, we can enumerate
many subdomains related to computer vision, such as object recognition, image restoration, scene
reconstruction, event detection, video tracking or motion estimation.
Computer Vision is concerned with constructing systems that obtain data from a set of given images.
Understanding an image means transforming it into a “description”, using patterns constructed with the
aid of geometry, statistics, physics and machine learning theory.
An important part of artificial intelligence covers planning the mechanical movements of a robot through an
environment. In order to achieve this kind of task, the robot needs input data provided by a computer
vision module acting as its “eyes”, a bridge between its world and ours.
The input information can appear in many forms, such as video sequences, representations from multiple
cameras or even multi-dimensional data from a medical scanner. We could think of computer vision as
the reverse of computer graphics. While computer graphics creates image data from tridimensional
models, computer vision produces such models from the images received. There is also a tendency to
combine these two technologies, as explored in augmented reality.
A standard problem in computer vision is determining whether the input image contains a certain feature, object
or activity. Here we could divide computer vision into detection (the image in question is scanned for a
specific condition, for example scanning a tissue sample to search for abnormal cells), object classification or
recognition (programs such as “Blippar”, “LikeThat” and “Google Goggles”) and identification (a
specific instance of an object is recognized, such as a person’s fingerprint or face).
Among the numerous applications of computer vision we could mention industrial process control, quality
inspection in manufacturing, navigation of mobile robots, event detection for security purposes, object
modeling as in medical image analysis, human-machine interaction and missile guidance in military
applications.
2.3 Machine Learning (ML)
As we can imagine, the supreme goal in the Computer Vision field is to build machines that perfectly
emulate human vision, being able to take actions based only on visual input. However, making
decisions would be impossible without a learning technique. Machine Learning (ML) aims to transform
data into information. A system can learn from a dataset by extracting certain patterns and then be able to
answer questions about a new set of data. In 1959, Arthur Samuel defined machine learning as a
"field of study that gives computers the ability to learn without being explicitly programmed".
When it comes to binary decisions, one usually breaks an original dataset of, say, 10,000 faces into a large
set for training (for example 9,000 faces) and another one for testing (the remaining 1,000 faces). The
classifier runs over the first set, constructing its own model of what a face looks like. Then the classifier
is tested on the smaller dataset to see how well it performs. If the results are poor, we might consider
adding more features to our first dataset or even trying a different type of classifier. Sometimes,
jumping directly from training to testing is too hasty. Instead, we could split the
samples into three sets: 8,000 faces for learning, 1,000 for validation and the last 1,000 for the final test.
During the validation phase, we can “sneak a peek” at the results and gauge performance. Only when we are
completely satisfied with this middle stage should we run the classifier on the final test set.
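As an illustration, the 8,000/1,000/1,000 split described above can be sketched in a few lines. Python is used here purely for illustration; the project itself is written in C#.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the samples, then split them into training,
    validation and final-test subsets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 10,000 face samples split into 8000 / 1000 / 1000, as in the text.
faces = list(range(10000))
train, val, test = split_dataset(faces)
print(len(train), len(val), len(test))  # 8000 1000 1000
```

Shuffling before splitting matters: if the samples are stored grouped by person or by lighting condition, an unshuffled split would give training and test sets with different distributions.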
2.3.1 Supervised and Unsupervised Dataset
In supervised learning, the input given to the machine has labels that accompany the data feature vectors. For
example, we could associate a name with a face (categorical data) or even a numerical value, such as the age of
a person. When the labels are categorical, the system performs what is known as
“classification”, whereas when they are numeric, the learner is doing “regression”.
Supervised learning can involve, as in the examples above, a one-to-one pairing of labels with data vectors, or it
can be “deferred/reinforcement learning”. In the latter case, the labels (called “punishment” or “reward”)
arrive only after the data vectors have been observed. The machine thus receives a delayed signal, from which it
infers a decision-making plan for future runs.
Conversely, we may attach no labels at all to the input data if we are interested in seeing how the data falls
naturally into categories. ML algorithms of this type are called “clustering algorithms”. The machine might,
for example, group the given faces into short, long, thin or wide faces.
These two forms of machine learning overlap with two frequent tasks in Computer Vision: recognition
(“what?”) and segmentation (“where?”). We need our computer both to identify the object in an image and to
estimate its position.
Since Computer Vision makes such great use of machine learning, OpenCV includes many machine learning
algorithms in its ML library, as we can see in the image below (Multilayer perceptron (MLP), Boosting,
Decision trees, Random trees, K-nearest neighbors, Normal Bayes, Support Vector Machine (SVM),
Mahalanobis, K-means, Face detector/Haar classifier). Although it is frequently used for vision tasks, the
OpenCV ML code is general-purpose.
Figure 2-3. Machine learning algorithms available in the OpenCV library.
2.3.2 Discriminative and Generative Models
OpenCV implements the most commonly used statistical approaches to machine learning. Probabilistic
approaches such as graphical models or Bayesian networks are still being improved.
OpenCV gives support for discriminative models, rather than generative ones. A discriminative algorithm will
give us the probability of the signal y (label) given the data x, so the machine will learn the conditional
probability distribution p(y|x) , whereas a generative model will help learn the joint probability distribution
p(x,y). To be more specific, p(y|x) helps us classify a given input x into a class y and p(x,y) could be used to
generate likely pairs (x,y).
Example 2–1. Given the pairs (1,0), (1,0), (3,0), (3,1), the conditional probability distribution p(y|x) and the
joint probability distribution p(x,y) are as follows:
Figure 2-4. The conditional probability distribution p(y|x) and the joint probability distribution p(x,y).
We can notice that in the first case the sum of the values at every row equals 1, whereas in the second case the
total sum equals 1.
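The two distributions of Example 2-1 can be computed directly from the four pairs. The following Python sketch (illustrative only, not part of the application) builds both tables and confirms the two normalization properties just stated:

```python
from collections import Counter

# The four observed (x, y) pairs from Example 2-1.
pairs = [(1, 0), (1, 0), (3, 0), (3, 1)]
n = len(pairs)

# Joint distribution p(x, y): each cell is a fraction of all samples.
joint = {pair: count / n for pair, count in Counter(pairs).items()}

# Conditional distribution p(y | x): normalize each row (fixed x) to 1.
x_counts = Counter(x for x, _ in pairs)
conditional = {(x, y): joint[(x, y)] * n / x_counts[x] for (x, y) in joint}

print(joint)        # {(1, 0): 0.5, (3, 0): 0.25, (3, 1): 0.25}
print(conditional)  # {(1, 0): 1.0, (3, 0): 0.5, (3, 1): 0.5}
```

As expected, the joint table sums to 1 overall, while each row of the conditional table (one row per value of x) sums to 1 on its own.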
Generative models are easier to interpret. If you imagine a car, you are generating data given the condition
“car”. By contrast, discriminative learning comes down to making decisions based on thresholds, which can
often be confusing. For instance, if we are trying to detect a car in a given image and the image passes a stage
of detection, this does not mean that there is certainly a car, but only that there is a “candidate” for a car. [1]
2.3.3 Machine Learning and Computer Vision
All the algorithms mentioned before take as input a vector of many features. When trying to detect whether a
certain object is present in an image, the first problem we encounter is how to gather training data divided into
negative and positive cases. Another issue is that objects may appear at different scales or in different
postures. You also need to define what you mean when you say that the wanted object is present in the image.
When collecting data, it is important to train your system under the same conditions in which it will later work:
same camera, same lighting conditions. In other words, any variation in the data must be taken into account
and captured.
There exist many ways to improve the system’s performance, such as subtracting the background and then
processing the image with normalization techniques (histogram equalization, rescaling, rotating).
The next step after collecting the data is to break it up into training, validation and testing sets.
Finally, choose an appropriate classifier depending on time, data, accuracy or memory constraints.
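As an illustrative sketch of one of the normalization techniques mentioned above, the following Python function applies the standard histogram equalization mapping to a flat list of 8-bit grayscale pixels (a textbook version; in practice a library routine would be used):

```python
def equalize_histogram(pixels, levels=256):
    """Histogram equalization for a flat list of 8-bit grayscale pixels:
    remap intensities via the cumulative distribution so that they
    spread over the full [0, levels-1] range."""
    n = len(pixels)
    # Histogram of intensity counts.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function.
    cdf = []
    total = 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    # Standard equalization mapping.
    def remap(p):
        return round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
    return [remap(p) for p in pixels]

# A low-contrast patch: values huddled between 100 and 103.
patch = [100, 100, 101, 102, 102, 103]
print(equalize_histogram(patch))  # [0, 0, 64, 191, 191, 255]
```

The low-contrast patch is stretched over the whole intensity range, which is exactly why this step helps a classifier cope with varying lighting conditions.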
2.3.4 Variable Significance
Some algorithms let you assign more or less importance to each variable, since some features might matter
more than others for classification accuracy. For instance, binary decision trees select the feature that best
splits the data at each node: the top-node variable is the most important one, and the variables below it decrease
in importance. The biggest advantage of binary trees is that they reduce the number of features considered by
our classifier. The training starts with many variables and finds the importance of each variable relative to the
others. You can then eliminate the irrelevant features, improving speed.
2.3.5 Common Problems
Firstly, I would like to mention that “more data beats less data and better features beat better algorithms” [1].
It is important to maximize the independence between features and to minimize their variation under
different conditions.
Apart from this, we can enumerate some common problems. First of all, bias appears due to wrong
assumptions made in the learning phase, when the model does not fit the data well. Variance appears when the
model memorizes the training information, including its noise, so it will not generalize. High bias causes
underfitting and high variance causes overfitting.
A possible solution to bias is to collect more features or even try a more powerful algorithm. In the variance
case, we might need more training data and fewer features, or a less powerful algorithm.
Figure 2-4. Underfitting, best fitting and overfitting.
2.3.6 Cross-validation, Bootstrapping, ROC curves and Confusion matrices
Sometimes, in order to know whether your classifier operates well, running a validation test might not be
enough. In real life the classifier will meet noise and sampling errors, so the data distribution will not be the
same as in the tests. To get closer to the true behavior of our algorithm and improve its stability, there exist two
popular techniques: cross-validation and bootstrapping.
2.3.6.1 Cross-validation
Cross-validation involves dividing the data into K subsets, learning on K-1 of them and then testing on the
remaining one. The key is that, over the K iterations, each fold gets a turn at being the validation set:
Pseudocode
1. Distribute the n training samples randomly
2. Divide the samples into K chunks
3. For i = 1, ..., K
3.1 Train the classifier on all the samples that do not belong to subset i
3.2 Test the classifier on subset i
3.3 Compute the number of wrongly classified samples e_i
4. Calculate the classifier error as:
E = (1/n) · Σ_{i=1..K} e_i
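The pseudocode above can be turned into a small generic routine. This Python sketch is illustrative only; the toy sign-rule “training” function is a made-up stand-in for a real classifier:

```python
import random

def cross_validation_error(samples, train_fn, k, seed=0):
    """K-fold cross-validation.
    samples  -- list of (features, label) pairs
    train_fn -- trains on a list of samples, returns predict(features)
    Returns E = (1/n) * sum of the misclassification counts e_i,
    where each fold serves exactly once as the test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    # k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    total_errors = 0
    for i in range(k):
        test_fold = folds[i]
        train_set = [s for j in range(k) if j != i for s in folds[j]]
        predict = train_fn(train_set)
        total_errors += sum(1 for x, y in test_fold if predict(x) != y)
    return total_errors / len(samples)

# Toy demo: the label is 1 exactly when x >= 0, and the "trained"
# classifier happens to implement that rule perfectly.
def train_sign_rule(train_set):
    return lambda x: 1 if x >= 0 else 0

data = [(x, 1 if x >= 0 else 0) for x in range(-50, 50)]
print(cross_validation_error(data, train_sign_rule, k=5))  # 0.0
```

A perfect classifier yields E = 0; plugging in a classifier that always answers 0 on this balanced dataset yields E = 0.5, the expected chance level.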
2.3.6.2 Bootstrapping
Bootstrapping is quite similar, except that the validation set is chosen at random from the training samples. The
points selected for a given round are used only for testing, not for training. The procedure then starts again
from scratch, repeated N times, each time randomly selecting a new chunk of validation data. It is easy to
notice that many data points are reused across different validation stages.
2.3.6.3 ROC curves
Another way to tune a classifier is to use the receiver operating characteristic (ROC) curve or the confusion
matrix. The ROC curve plots the classifier’s response over the entire range of settings of some performance
parameter. For example, suppose we are trying to recognize the color blue in an input
image. Obviously, we have a threshold that defines what counts as a blue color and what does not.
Setting the blue threshold too high might end in failing to recognize any shade of blue in the image,
yielding a false positive rate of 0, but at the cost of a true positive rate of 0 as well (lower left part of the graphic).
On the other hand, if we set the blue threshold to zero, every color in the image will be “blue” for our
detector, so we will have the maximum false positive rate (upper right area of the curve). The ideal ROC curve
is the one that goes along the y axis up to 100% and then cuts horizontally over to the upper right corner. The
ratio of the area under the curve to the total area is a figure of merit: the closer this ratio is to one, the better our
classifier will be.
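The blue-detector thought experiment can be made concrete by sweeping the threshold over a set of “blueness” scores and recording (FPR, TPR) at each setting. All scores and labels in this Python sketch are made up for illustration:

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, a pixel is called 'blue' when its blueness
    score meets the threshold; return the (FPR, TPR) operating points."""
    points = []
    positives = sum(labels)
    negatives = len(labels) - positives
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical blueness scores (higher = more blue) and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   1,   0,   1,   0]
for fpr, tpr in roc_points(scores, labels, [0.0, 0.5, 1.0]):
    print(fpr, tpr)
# threshold 0.0 -> (1.0, 1.0)  everything is "blue"
# threshold 0.5 -> (0.0, 0.75) a useful middle operating point
# threshold 1.0 -> (0.0, 0.0)  nothing is "blue"
```

The two extreme thresholds reproduce exactly the two corners of the ROC curve described in the text.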
2.3.6.4 Confusion Matrices
The confusion matrix of the operating point OP illustrated in the image below is another way to assess the
performance of a classifier, being equivalent to the ROC curve at its left. The matrix of a perfect classifier
would have 100% along the principal diagonal and 0% elsewhere.
Figure 2-5. The ROC curve and the confusion matrix.
2.3.7 Binary Decision Trees
Binary decision trees are widely used in the OpenCV machine learning library. Their inventor, Leo Breiman,
named them “classification and regression tree” (CART) algorithms.
The essence of this method is to define an impurity metric relative to the data in every node of the tree. For
instance, when regression is used to fit a function, the sum of squared distances between the true values
and the predicted values is used. The basic idea is to minimize this sum of differences, known as “the impurity”,
in each and every node of the tree.
When dealing with categorical labels, we define a measure which is minimal when most of the values inside a
node belong to the same class. There are three standard measures: Misclassification, Gini index and Entropy.
Once we have chosen a metric, the binary tree algorithm searches through the feature vector to find
which feature, together with which threshold, best purifies the data. By convention, feature values below the
threshold are branched to the right and values above the threshold are branched to the left. In this way, we
recursively follow down each branch of the tree until the data is pure enough for our needs. The impurity
measure i(N) is given below [1]:
2.3.7.1 First case - Regression Impurity
We try to minimize the squared distance between the node value y and each data value x_k:
i(N) = Σ_k (y − x_k)²     (2-1)
2.3.7.2 Second case - Classification Impurity
There are three common measures of impurity. P(ω_k) denotes the fraction of patterns at node
N that belong to class ω_k.
Misclassification impurity
i(N) = 1 − max_k P(ω_k)     (2-2)
Entropy impurity
i(N) = − Σ_k P(ω_k) log P(ω_k)     (2-3)
Gini index
i(N) = Σ_{k ≠ p} P(ω_k) P(ω_p)     (2-4)
Figure 2-6. Decision tree impurity measures.
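To get a feel for the three measures, the following Python sketch (illustrative only) evaluates each impurity on a perfectly pure node and on a perfectly mixed two-class node:

```python
from math import log2

def misclassification(p):
    """1 - max_k P(w_k): fraction of samples a majority vote misses."""
    return 1 - max(p)

def entropy(p):
    """-sum_k P(w_k) * log2 P(w_k); 0 log 0 is taken as 0."""
    return sum(-q * log2(q) for q in p if q > 0)

def gini(p):
    """Sum of P(w_k) * P(w_p) over distinct pairs k != p,
    equivalently 1 - sum of the squared class fractions."""
    return 1 - sum(q * q for q in p)

# A pure node (all samples in one class) vs. a perfectly mixed node.
pure, mixed = [1.0, 0.0], [0.5, 0.5]
for measure in (misclassification, entropy, gini):
    print(measure.__name__, measure(pure), measure(mixed))
```

All three measures vanish on the pure node and peak on the mixed node, which is exactly the property the split-selection step relies on; they differ only in how sharply they penalize partial mixing.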
In classification, decision trees are probably the most widely used method due to their straightforward
implementation. Moreover, they are simple to interpret and flexible when it comes to working with different
data types. Decision trees also form the basis of other algorithms such as random trees and boosting.
2.3.8 Boosting
Although decision trees are effective, they are not the best-performing classifiers. The boosting technique
inherits a lot from decision trees, using them in its inner loop. Note that boosting uses fewer decision
variables than a full decision tree, so it saves memory and reduces computation cost.
Within the category of supervised classification methods available in the OpenCV library there is a meta-
learning algorithm named “statistical boosting”. It was first described by the computer scientist
Michael Kearns in 1988, who wondered whether it is possible to create a strong classifier out of many
weak ones.
AdaBoost, the first boosting algorithm, was formulated shortly afterwards by Robert Schapire and Yoav
Freund, who won the Gödel Prize in 2003 for their work.
3 FACE DETECTION
3.1 The Viola-Jones Detection Framework - Theory
This chapter presents the Viola-Jones face detection method. The Viola-Jones detection framework was
proposed for the first time at a computer vision conference in 2001 by Paul Viola and Michael Jones.
Their approach outclassed any existing face detector at that moment. Although it can be trained to
identify many types of rigid objects, it is mostly used for face detection.
Viola and Jones claim that, when it comes to face detection, their algorithm yields detection rates comparable
to those of previous approaches. Used in real-time situations, however, their detector is capable of running at 15
frames per second without resorting to techniques such as image differencing or skin-color detection. Moreover,
adding these alternative sources of information can result in even higher frame rates. [2]
This detector is based on three strong concepts. The first one is known as the “Integral Image”. It allows the
features used by this detector to be computed very fast. The second one is a machine learning algorithm,
“Adaboost”, which selects only the important features from a larger dataset. The third concept is the
construction of a “cascade” structure by combining complex classifiers, which will reject background regions
of the input image while spending more computation time on the areas that might contain the object of our
interest.
The algorithm works best on frontal views and not so well on side views, because side views bring variations
into the template that the Haar features (mouth, eyes, hairline) used in the detector cannot handle well. For
instance, the side view of an object inevitably catches part of the changing scene behind the object’s profile,
so the classifier is forced to learn the background variability at the edge of the side view.
The first requirement before starting to train our system is to collect data appropriate for our situation.
“Good” data means data cleanly divided into categories; for example, we should not mix tilted objects with
upright objects. “Well-segmented” data is also vital, meaning that the objects are consistently boxed [1]. For
instance, varying placement of the eye locations inside the face box can make the classifier assume that the
eyes are not fixed and can move around, so the pictures should be normalized and the eyes aligned as much as
possible. Performance drops dramatically when a classifier tries to correct for unreal variability in the data.
Mathematics is the language in which God has written
the universe.
- Galileo Galilei -
3.1.1 The Haar-type Features
The facial detection algorithm will hunt for specific features that are common to every human face. These
“features” are basically black and white rectangles, as those illustrated below. For example, in the first image,
the white rectangle represents the lighted area of the cheeks, whereas the black one, the shadow of the eyes. In
the second image, the white area corresponds to the bridge of the nose which is brighter than the region of the
eyes.
Figure 3-1. Some Haar-type features.
Before starting to look for these matches in a given image, we need to convert it to grayscale, so that every
pixel has a single value between 0 (black) and 255 (white), depending on its intensity. In order to convert an
RGB digital image to grayscale, the (R,G,B) triplet is mapped to a single value using different methods, such
as the lightness method, the average method or the luminosity method. The last one, used (with slightly
different weights) by OpenCV, is closer to the human perception of colors. Because humans are more
sensitive to green, this channel weighs more, as we can see in the formula below:
Y = 0.2126·R + 0.7152·G + 0.0722·B (3-1)
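As a quick illustration, formula (3-1) can be applied per pixel with NumPy. This is only a sketch of the luminosity method as stated above, not the thesis implementation (the exact coefficients a given library uses may differ):

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image (uint8) to grayscale
    using the luminosity weights of formula (3-1)."""
    rgb = rgb.astype(np.float64)
    y = 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]
    return np.clip(np.round(y), 0, 255).astype(np.uint8)

# A 1x2 image: pure green maps to a brighter gray than pure red,
# because the green channel weighs more.
img = np.array([[[0, 255, 0], [255, 0, 0]]], dtype=np.uint8)
print(to_grayscale(img))
```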
The following step is to calculate the sum of all pixel values under the black area and the sum of all pixel
values under the white area. Then, the second sum is subtracted from the first one and, if the result falls within
a specified threshold, we can affirm that the feature is present in our image.
3.1.2 The Integral Image
Frank Crow introduced the "summed area table" to computer graphics in 1984, and John Lewis later used
this concept in computer vision. In 2001, Paul Viola and Michael Jones brought the equivalent term
"integral image" into their object detection framework; it refers to a fast and efficient method to calculate
the sum of pixel values in any rectangular zone of a given image.
Obtaining the Integral Image
The value of any point (m,n) in the integral image equals the sum of the values above it and to its left,
including the pixel itself, as we can see in the formula below:
I(m,n) = Σ_{m'≤m} Σ_{n'≤n} i(m',n') (3-2)
Example 3–1. Given the pixel values of an image we can calculate the integral image as follows:
Figure 3-2. The input image and the integral image.
We infer from the general formula (3-2) that the integral image can be calculated in a single pass over the
initial image, using the following relation:
I(m,n) = i(m,n) + I(m−1,n) + I(m,n−1) − I(m−1,n−1) (3-3)
Once we have the integral image computed, we can easily calculate the sum of pixel values in any rectangular
zone of our input image, using only four references from the summed area table, E(m0,n0), F(m1,n0),
G(m0,n1), H(m1,n1), and the formula below:
Σ_{m0<m≤m1, n0<n≤n1} i(m,n) = I(E) + I(H) − I(F) − I(G) (3-4)
In Example 3–1, for instance, I(2,2) = i(1,1) + i(1,2) + i(2,1) + i(2,2) = 1+1+1+1 = 4.
Figure 3-3. The four references needed in the integral image.
Example 3–2. Determine the sum of the pixel values within the gray area in our input image, using
the integral image method:
Figure 3-4. The integral image and the input image.
Using the integral image method (3-4), the sum of the highlighted area would be:
S = I(E) + I(H) − I(F) − I(G) = 12 + 57 − 22 − 32 = 15 (3-5)
Now, going back to our input image, we can check the result. The sum of pixel values equals:
S = 3+9+2+1=15 (3-6)
The results (3-5) and (3-6) are indeed equivalent.
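The single-pass construction and the four-reference lookup can be sketched in a few lines of NumPy. This is an illustrative example with made-up pixel values (not the data of the figures above):

```python
import numpy as np

def integral_image(img):
    """Summed area table: I(m,n) = sum of i(m',n') for m' <= m, n' <= n,
    built in a single pass with cumulative sums (relation 3-3)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, top, left, bottom, right):
    """Sum of pixels in the rectangle [top..bottom] x [left..right]
    (inclusive), using only the four references of formula (3-4)."""
    total = I[bottom, right]                  # reference H
    if top > 0:
        total -= I[top - 1, right]            # reference F
    if left > 0:
        total -= I[bottom, left - 1]          # reference G
    if top > 0 and left > 0:
        total += I[top - 1, left - 1]         # reference E
    return total

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
I = integral_image(img)
print(rect_sum(I, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The bottom-right entry of the table always equals the sum of the whole image, which is a handy sanity check.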
3.1.2.1 Using the Integral Image for the Viola-Jones detector
When working with an algorithm in real-time applications, we should remember that efficiency is a crucial
aspect. Summing up all the pixels under every Haar feature, for all the possible combinations in a given
image, is not a quick task. So, whenever we need to calculate a Haar feature value we won't just sum up
all the values inside that area; we will use only four memory lookups and formula (3-4). As a consequence,
this method provides a constant calculation time, independent of the size of the queried rectangle: after a
single pass to build the integral image, each rectangular sum costs O(1) instead of O(n²).
The examples below illustrate how many references are needed to calculate the corresponding value for each
type of Haar feature:
Figure 3-5. Edge feature – 6 references.
∑i = (A+E-D-B)-(F+B-E-C) = A+2E-D-2B-F+C (3-7)
Figure 3-6. Line feature – 8 references.
∑i = (E+H-F-G)+(D+A-B-C)-(F+C-E-D) = 2E+H-2F-G+2D+A-B-2C (3-8)
Figure 3-7. Four-rectangle feature – 9 references.
∑i = (H+D-G-E)+(F+B-E-C)-(I+E-H-F)-(E+A-D-B) = 2H+2D-G-4E+2F+2B-C-I-A (3-9)
3.1.3 Adaboost
The term Adaboost comes from “Adapting Boosting” and stands for a machine learning meta-algorithm
created my Robert Schapire and Yoav Freund in 2003. Adaboost can be used together with other types of
learning algorithms in order to improve their results. The main idea is to combine the output of some weak
classifiers (learners) into a weighted sum, in this way creating a strong final classifier, which error rate tends
exponentially to zero.
Moreover, if we consider all the possible combinations of the previously presented Haar features (different
orientations and different sizes) that can be applied in a 24x24 scan window, we end up with more
than 180.000 features [3]. This large number is a real disadvantage in terms of efficiency, and besides, there
are linear dependencies between those features. So, the main challenge is to find only those critical features
which, combined together, will form an effective classifier. This is another problem solved by Adaboost: it
eliminates all the redundant features, keeping only the critical ones.
3.1.3.1 Training weak classifiers
A weak classifier is a classifier that performs only slightly better than chance, having an error rate just below
50%. All the Haar-type features can be seen as weak classifiers (the simple rectangles have an error rate
between 10-30%, whereas the complex ones between 40-50%) [4]. Usually these classifiers are decision trees
with a single split ("decision stumps") or at most three splits. Each of these classifiers is assigned a weighted
vote in the final decision.
The weak learning algorithm is designed to choose only the rectangle features that best separate the positive
and negative samples. For each Haar feature, the weak learner calculates the optimal threshold classification
function that minimizes the number of misclassified samples. Therefore, a weak classifier consists of a Haar
feature f_j, a threshold value t_j and a parity p_j (±1), as we can see in the formula below:
h_j = { +1, if p_j·f_j < p_j·t_j
      { −1, otherwise
(3-10)
Creating a strong classifier
We have as input a dataset of images X = [x1, x2, ... xi, ... xn] together with a vector of scalar labels
Y = [y1, y2, ... yi, ... yn], where yi = ±1, depending on whether the image xi is a positive or a negative
example. In the following example, I will present the training process using an input consisting of 5 positive
samples and 5 negative samples. Below, I illustrate the procedure in three steps, although in practice one
trains as many weak classifiers as needed until the wanted performance is obtained on a validation dataset.
Step 1:
At the beginning, all the samples in the original data set have a uniform weight, meaning that the detector
will focus equally on each and every data point in our set:
Figure 3-8. DS1(original dataset), C1(trained classifier 1), C1’s mistakes.
Step 2:
At the second step, we modify the previous dataset, assigning different weights to the points. For instance,
the points gotten wrong at the previous step will weigh more this time, meaning that the classifier will pay
more attention to these particular samples:
Figure 3-9. DS2 (weighted dataset 2), C2 (trained classifier 2), C2’s mistakes.
Step 3:
At this phase, we again increase the weights of the samples gotten wrong earlier and decrease the weights
of those gotten right. So, classifier number 3 will focus on the mistakes made by classifier number 2:
Figure 3-10. DS3 (weighted dataset 3), C3 (trained classifier 3)
Final step:
At the end, the strong classifier is a linear combination of all the weak classifiers trained previously:
Figure 3-11. The final classifier.
Pseudocode for training a strong classifier
1. Given the pairs (x1,y1), (x2,y2), ... (xn,yn), where yi ∈ {±1}.
2. Initialize the weights:
   w_{1,i} = 1/(2m), for yi = −1, where m is the number of negative samples
   w_{1,i} = 1/(2p), for yi = +1, where p is the number of positive samples
3. For t = 1, 2, ... T:
   3.1 Normalize the weights:
       w_{t,i} ← w_{t,i} / Σ_{j=1..n} w_{t,j}
   3.2 Choose the classifier h_t with the lowest error rate ε_t:
       ε_t = min_j Σ_i w_{t,i}·[h_j(x_i) ≠ y_i]
   3.3 Update the weights:
       w_{t+1,i} = w_{t,i}·β_t^(1−e_i), where e_i = 0/1 if x_i was correctly/wrongly classified
       and β_t = ε_t / (1 − ε_t)
4. Calculate the final classifier as:
   h(x) = sign( Σ_{t=1..T} α_t·h_t(x) ), where α_t = ½·log(1/β_t) = ½·log((1 − ε_t)/ε_t)
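The training loop above can be sketched with NumPy, using one-dimensional decision stumps as weak learners on a synthetic toy dataset. This is only an illustration of the same β_t and weight-update rules, not the Haar-feature training used by the detector:

```python
import numpy as np

def stump_predict(x, t, p):
    """A one-split decision stump: h(x) = p if x >= t else -p."""
    return np.where(x >= t, p, -p)

def best_stump(x, y, w):
    """Step 3.2: pick the threshold/parity pair with the lowest weighted error."""
    best_err, best_t, best_p = np.inf, 0.0, 1
    for t in x:
        for p in (1, -1):
            err = np.sum(w[stump_predict(x, t, p) != y])
            if err < best_err:
                best_err, best_t, best_p = err, t, p
    return best_err, best_t, best_p

def adaboost(x, y, T=5):
    w = np.full(len(x), 1.0 / len(x))      # uniform initial weights
    classifiers = []
    for _ in range(T):
        w = w / w.sum()                    # step 3.1: normalize
        eps, t, p = best_stump(x, y, w)    # step 3.2
        eps = max(eps, 1e-10)              # guard log(0) for a perfect stump
        beta = eps / (1 - eps)
        correct = stump_predict(x, t, p) == y
        w = w * beta ** correct            # step 3.3: shrink correct weights
        classifiers.append((t, p, 0.5 * np.log(1 / beta)))
    return classifiers

def strong_classify(classifiers, x):
    """Step 4: sign of the alpha-weighted vote of the weak classifiers."""
    votes = sum(a * stump_predict(x, t, p) for t, p, a in classifiers)
    return np.sign(votes)

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1])
print(strong_classify(adaboost(x, y), x))  # [-1. -1. -1.  1.  1.  1.]
```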
3.1.3.2 Interesting Observations
a) The next weight w_{t+1,i} will remain equal to the previous weight w_{t,i} if x_i was misclassified, or will
decrease by the factor β_t if x_i was correctly classified.
At step 3.3 in the pseudocode, we have:
Right prediction case:
w_{t+1,i} = w^norm_{t,i} · β_t^(1−0) = β_t · w^norm_{t,i} (3-11)
Wrong prediction case:
w_{t+1,i} = w^norm_{t,i} · β_t^(1−1) = w^norm_{t,i} (3-12)
b) The new generation of weights is just a scaled version of the old generation.
If z represents the normalizing factor of the next generation of weights, then:
w^norm_{t+1,i} = (1/z) · w_{t+1,i} (3-13)
We know that the normalized weights sum up to 1:
Σ_i w^norm_{t+1,i} = 1 (3-14)
(3-13), (3-14) => (1/z) · Σ_i w_{t+1,i} = 1 => z = Σ_i w_{t+1,i}
We can split the sum above into the sum of the weights for the correct predictions and the sum of the
weights for the wrong ones, as follows:
(1/z) · ( Σ_{correct,i} w_{t+1,i} + Σ_{wrong,i} w_{t+1,i} ) = 1 (3-15)
Using the last relation (3-15), we infer that:
z = Σ_{correct,i} w_{t+1,i} + Σ_{wrong,i} w_{t+1,i} = [by (3-11),(3-12)] = β_t · Σ_{correct,i} w^norm_{t,i} + Σ_{wrong,i} w^norm_{t,i} (3-16)
The weights in the wrong cases summed up equal the error rate, as we can see in the Adaboost
pseudocode (step 3.2). So:
Σ_{wrong,i} w^norm_{t,i} = ε_t (3-17)
Σ_{correct,i} w^norm_{t,i} = 1 − ε_t (3-18)
Replacing (3-17) and (3-18) in (3-16), we obtain:
z = β_t · (1 − ε_t) + ε_t = (ε_t / (1 − ε_t)) · (1 − ε_t) + ε_t = 2·ε_t (3-19)
Summing up the new generation of weights for both cases:
Σ_{wrong,i} w^norm_{t+1,i} = [by (3-12)] = (1/z) · Σ_{wrong,i} w^norm_{t,i} = ε_t / (2·ε_t) = 1/2
Σ_{correct,i} w^norm_{t+1,i} = [by (3-11)] = (1/z) · β_t · Σ_{correct,i} w^norm_{t,i} = (1/(2·ε_t)) · (ε_t/(1 − ε_t)) · (1 − ε_t) = 1/2
This result leads us to a great conclusion in terms of computation: to obtain the new generation of weights, all
we need to do is scale the current weights so that the correctly classified samples and the wrongly classified
samples each sum up to ½.
3.1.4 Cascade Filter
The Viola-Jones detector uses the Adaboost technique, but organizes the classifiers as a rejection cascade of
nodes. "Cascading" means that, at every node, a candidate classified as "not in class" instantly terminates
the computation. Only a candidate that makes it through the entire cascade is classified as a face. In this
way, the computational cost is significantly reduced, because most of the areas that do not contain the object
of interest are rejected at an early stage of the cascade.
When it comes to face detection, a scan window translates all over the input image, at different scales,
looking for a face. Every time this window shifts, the new area within its boundaries goes through the
cascade classifier stage by stage. If the current region fails to pass the threshold of a stage, the classifier
instantly rejects that area, and no further tests are applied to it. On the other hand, if an area successfully
passes all the stages, there might be a face in it.
Figure 3-12. Cascade classifier.
Each stage represents a multi-tree Adaboost classifier tuned to a high detection rate (few missed faces)
at the cost of many false positives. This means that almost 99.9% of the faces are found at each node, but also
that about 50% of the non-faces are wrongly accepted. Even so, a strong classifier formed out of 20 such nodes
results in a detection rate of 98% with a false positive rate of 0.0001%. Furthermore, when the detection
window is swept over the test image at different scales, 70-80% of the non-faces are eliminated in the
first two nodes (which use about ten decision stumps) [1].
The learning process can take from several hours to a day, even on a fast machine, depending on the size of
the data set. After training, all the detector's information is saved to an .xml file, like the one below. It usually
contains from 20 to 30 stages: the first stage, stage 0, is a superficial scan through the image and the following
ones, as I already mentioned, become more and more detailed. The more complex examples that make it
further through the cascade push the ROC characteristic curve downward. New stages are added until the
overall target for detection rate and false positives is reached.
Figure 3-13. "haarcascade_frontalface_default", available in the EmguCV library.
As a matter of fact, this type of cascade training comes with a compromise: classifiers that contain more
features return higher detection rates and lower false positive rates, but they also need more time to compute.
Unfortunately, finding the balance is close to an art and represents the real problem for the engineer.
3.2 The Viola-Jones Detection Framework – Experimental Results
Viola and Jones described the performance of their detector in the article “Rapid Object Detection using a
Boosted Cascade of Simple Features”, presented at a conference on computer vision and pattern recognition in
2001 [2].
The complete detection cascade has 38 stages and more than 6000 features. However, on a dataset with 507
faces and 75 million sub-windows, an average of only about 10 features is evaluated per scan window.
The training set used by Viola and Jones consisted of 4916 labeled faces at a resolution of 24x24 pixels. The
non-face sub-windows came from 9544 images. Each classifier in the cascade was learned using the set of
labeled faces and another 10.000 non-face sub-windows.
On a 700 MHz Pentium III processor, the detector processed a 384 by 288 pixel image in about 0.067
seconds.
The detector was tested on a set of 130 pictures from the real world with 507 labeled frontal faces. Below we
can see the ROC curve that illustrates the performance achieved by the detector. It was run with a step size of
1.0 and a starting scale of 1.0.
Figure 3-14. ROC characteristic.
Viola and Jones affirm that when it comes to face detection, their algorithm has detection rates comparable to
the previous algorithms. But used in real-time situations, their detector can run at 15 frames per second without
resorting to techniques such as image differencing or skin color detection.
4 FACE RECOGNITION
4.1 Face Recognition Overview
A facial recognition application can identify a human being in a given digital image or video frame.
Such systems are mostly used in security areas together with other biometric authentication
technologies.
Facial recognition methods can be divided into "geometric" and "photometric" procedures. The first
category is based on extracting specific features from an image of a person: for instance, the system may
analyze the relative position, dimension and shape of the eyes, mouth, nose, cheekbones and so on. The
second category is more of a statistical approach. It uses a database of images from which the face data is
extracted, normalized and compressed; the test image is then quantified in terms of that data.
Among the most popular recognition methods I would mention the Eigenfaces method, the Fisherfaces
algorithm, Linear Discriminant Analysis, the Hidden Markov model, Multilinear Subspace Learning and
Dynamic Link Matching.
The latest trend in facial recognition is three-dimensional face recognition. This method uses 3D
cameras to capture data about someone's face. This technology achieves better results than the classical 2D
recognition because it is not sensitive to light changes, different facial expressions or make-up, and it can
even identify profile views. For example, Microsoft's Kinect sensor for the Xbox 360 video game console
implements this technology. It works by projecting "structured light" onto the subject, as in the image
below; the system then infers depth information from how the projected pattern is deformed. Lately,
engineers have tried to create an even more powerful system by combining three cameras pointing at
different angles, so that a moving person can be tracked and recognized with high precision.
Creativity is just connecting things.
- Steve Jobs -
Figure 4-1 Structured light.
Figure 4-2. XBOX 360.
Another interesting approach in facial recognition is skin texture recognition. The skin texture is captured
into what is called a "skinprint". This patch is then divided into smaller blocks and, through specific
algorithms, the skin is turned into a mathematical space that can be measured.
Thermal cameras also deserve a mention in this chapter. They are a great tool for detecting and
identifying people 24 hours a day, 7 days a week. These cameras create representations based on the heat that
any object radiates. A thermal camera is immune to light changes and performs best indoors. However, a
drawback of this technology is the limited availability of thermal picture databases.
Compared to other biometric techniques, face recognition is not the most reliable and practical method. On
the other hand, it does not require the subject's permission to be tested. Thus, face recognition systems are
installed in public places everywhere, such as airports, where mass identification is necessary in order to
prevent terrorist attacks.
Lately, some people have complained about their privacy rights and civil liberties. They claim this kind of
surveillance can be used not only to identify a subject, but also to reveal personal data, such as social
networking profiles. For instance, Facebook's DeepFace is considered to have violated the Biometric
Information Privacy Act. Facebook used the world's biggest photo library to create a deep-learning face-
recognition system trained on more than 4 million images uploaded by its users. The system has 97%
accuracy, whereas the FBI's Next Generation Identification system reaches only 85%. The Huffington Post
described the technology as "creepy" and announced that some European governments had already forced
Facebook to delete its face-recognition database [5].
4.2 The Eigenfaces Recognizer - Theory
The first face recognition system was developed by Woody Bledsoe, Helen Chan Wolf and Charles
Bisson in the 1960s. They created a semi-automated program that needed an administrator to locate
characteristic features in a given image, such as the eyes, nose, mouth or ears. Their system computed
relative distances between these features and created a list of specific ratios for each subject in the
database. However, such an approach has proven quite fragile over the years. The Eigenfaces procedure was
first introduced in 1987 by Kirby and Sirovich and later developed by Matthew Turk and Alex
Pentland in 1991. The term "eigen" refers to a set of eigenvectors, also known in linear algebra as
"characteristic/proper vectors". The main advantage of this method is that we can represent a set of
images using a base formed of "eigen" pictures whose dimension is a lot smaller than the original set.
Identification can be achieved by comparing two images, both represented in the eigen base of the training
set.
The Eigenfaces approach started with the need to find a low-dimensional representation of face images.
Kirby and Sirovich demonstrated that Principal Component Analysis (PCA) can be used on a
group of face images to form a set of basic features. This set is known as "eigenpictures" and can be used
to reconstruct the original collection of images: each original face can be rebuilt as a linear
combination of the base set. We want to extract only the critical information in a test image and encode it
as effectively as possible, then compare the selected information with a database of models encoded in
the same manner. Face photographs are projected onto a feature space that best illustrates the variation
among known/learned face images. This feature space is defined by the eigenvectors/eigenfaces, and the
vector of weights expresses the contribution of each eigenface to the input image.
These results were expanded and improved by two computer scientists who found an efficient way
to compute the eigenvectors of a covariance matrix. Initially, a face image would occupy a high-
dimensional space and the PCA method could not be applied to large data sets. But Matthew Turk and
Alex Pentland discovered a way to extract the eigenvectors based on the number of input images, rather
than the number of pixels.
After performing the eigendecomposition on a set of given photos, one obtains through statistical
analysis the specific "ingredients" that represent our data set. The features that the original collection of
images has in common are found in what is called an "average/mean face". On the other hand, the
differences between the images appear in the eigenfaces. Furthermore, we can reverse the process
and reconstruct any initial image from the eigenfaces together with the average image. In this way, every
face can be stored as a list of values (each corresponding to an eigenpicture in the database), instead of
a digital photograph, saving memory space.
The eigen technique is also used in other types of recognition: medical imaging, voice recognition,
gesture interpretation, lip reading, handwriting analysis. For this reason, some prefer the term
"eigenimage" instead of "eigenface", although they basically refer to the same thing.
Figure 4-3. Example of eigenfaces.
4.2.1 Steps to obtain the eigenfaces
1. When collecting training images one should keep in mind a couple of general rules. Firstly, the
photographs must all be taken under the same lighting conditions and then normalized so that the
mouths and the eyes are aligned across all images. Secondly, they must also have the same
resolution (r x c). Each picture is treated as a vector with r x c elements after concatenating its
rows of pixels. The entire training set is stored in a single matrix T in which each column
represents a different input image.
2. Calculate the average image M and subtract it from each original picture in the database:
M = (1/d)·(im_1 + im_2 + ... + im_d), where d is the number of images
3. Determine the eigenvectors and eigenvalues of the covariance matrix C of the probability distribution
over the high-dimensional vector space of a face image. Each resulting eigenvector/eigenface has the
same size as the original images and therefore can be seen as an image itself. The eigenfaces are
basically directions in which each input image differs from the mean.
4. Select from the eigenvectors only the principal components: sort the eigenvalues in decreasing order
and organize the eigenvectors accordingly. The number of principal components k is obtained by
setting a threshold value ε on the total variance V = n·(λ_1 + λ_2 + ... + λ_n), where n is the number
of input images. The value k is the smallest number which satisfies:
n·(λ_1 + λ_2 + ... + λ_k) ≥ ε·V
Now, these eigenimages help us represent not only the faces in the database, but also new faces. When
projecting a new (mean-subtracted) face onto the eigenface space, we actually see how the new face differs
from the mean image. The characteristic values associated with the eigenfaces tell us how much the images
in the training set differ from the average image in each direction. Even though we lose information by
projecting the images onto another subspace, we minimize this loss by picking only those eigenimages with
the largest eigenvalues. As a consequence, we keep only the striking differences from the mean image. For
instance, at a resolution of 100 x 100 pixels, 10.000 eigenvectors could be obtained from the images; in fact,
only 100-150 vectors are needed [6].
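The steps above, including the Turk-Pentland trick of diagonalizing the small d x d matrix instead of the huge pixel-space covariance matrix, can be sketched with NumPy. Random data stands in for the aligned grayscale faces, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, d = 20, 20, 8                  # image size and number of images
T = rng.random((r * c, d))           # stand-in for d flattened face images

M = T.mean(axis=1, keepdims=True)    # step 2: the average face
A = T - M                            # mean-subtracted image columns

# Turk-Pentland trick: eigendecompose the small d x d matrix A^T A
# instead of the huge (r*c) x (r*c) covariance matrix A A^T.
vals, vecs = np.linalg.eigh(A.T @ A)
order = np.argsort(vals)[::-1]       # step 4: decreasing eigenvalues
vals, vecs = vals[order], vecs[:, order]

k = int(np.sum(vals > 1e-8))         # drop the zero mode from mean subtraction
eigenfaces = A @ vecs[:, :k]         # if v is an eigenvector of A^T A,
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)  # then A v is one of A A^T

weights = eigenfaces.T @ A[:, 0]     # project the first face onto the basis
print(np.allclose(eigenfaces @ weights, A[:, 0]))  # True
```

With all k non-zero components kept, the projection reconstructs the training image exactly; keeping fewer components trades reconstruction error for compactness, as described above.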
4.2.2 Preprocessing the images
Both the training images and the test images must be converted to grayscale, histogram-equalized and have
their background removed. In section 3.1.1 (The Haar-type Features) I already explained what a grayscale
image is. Now, I am going to present the histogram equalization concept:
Figure 4-4. Histogram equalization.
In image processing, the term "histogram of an image" usually refers to a graphical representation of the
intensity distribution of that image. The histogram of an image can be obtained by plotting pixel intensity
versus pixel frequency. In the case of an 8-bit grayscale image there are 256 shades of gray, from 0 to 255,
so 256 possible values for the intensity. Therefore, the histogram graphically displays 256 numbers showing
the distribution of pixels amongst these values.
Histogram equalization is a procedure for adjusting image intensities in order to increase contrast. It implies
remapping the initial histogram to another distribution (a wider and more uniform one). As we can see in
figure 4-4, the effect of histogram equalization is to stretch out the range of intensities. [7]
Example 4–1. I will explain the histogram equalization process on a digital image represented by a matrix of
pixels, like the one below, where each value is the intensity of the pixel found in that corresponding position:
3 2 4 5
7 7 8 2
3 1 2 3
5 4 6 7
Table 4-1. Matrix of pixel values.
We can notice that the intensities in our image vary between 1 and 8. I mentioned previously that the main
idea behind this procedure is to widen the range of intensities; for example, we could scale them to a range of
1-20. The first step is to count how many pixels have each intensity value. The second step is to calculate
the probability of each pixel intensity in the given image. Using the resulting values, we then calculate the
cumulative probability. Since we want to change the intensity range from 1-8 to 1-20, we multiply the
cumulative probability by 20. Finally, the resulting values are floor-rounded.
Table 4-2. Histogram equalization steps.
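The steps of the table can be reproduced in a few lines of NumPy, applied to the image of Table 4-1 (a sketch of the worked example, not a general equalization routine):

```python
import numpy as np

# The 4 x 4 image from Table 4-1, with intensities in the range 1-8.
img = np.array([[3, 2, 4, 5],
                [7, 7, 8, 2],
                [3, 1, 2, 3],
                [5, 4, 6, 7]])

levels = np.arange(1, 11)                      # intensities 1..10
counts = np.array([(img == v).sum() for v in levels])
prob = counts / img.size                       # probability of each intensity
cdf = np.cumsum(prob)                          # cumulative probability
new_levels = np.floor(cdf * 20).astype(int)    # remap to the range 1-20

print(counts[:8])      # [1 3 3 2 2 1 3 1]
print(new_levels[:8])  # [ 1  5  8 11 13 15 18 20]
```

The two printed rows match the "Number of pixels" and "Floor rounding" rows of Table 4-2.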
4.2.3 The Covariance Matrix
4.2.3.1 Standard Deviation (σ)
The Standard Deviation (σ) defines how spread out numbers are in a certain dataset. A low value
indicates that the samples tend to be close to the expected value (the mean), while a high value indicates
that the samples are dispersed over a wider range of values.
σ = √V (4-1)
4.2.3.2 Variance (V)
The variance is a concept used in statistics and probability theory and it measures the squared deviation
of a random variable from its mean value. For a data set X of equally likely values x_i, the variance can be
calculated as:
V(X) = (1/n) · Σ_{i=1..n} (x_i − x̄)² (4-2)
The mean value or the expected value is:
x̄ = (1/n) · Σ_{i=1..n} x_i (4-3)
Pixel intensity             1       2       3       4       5       6       7       8       9   10
Number of pixels            1       3       3       2       2       1       3       1       0   0
Probability                 0.0625  0.1875  0.1875  0.125   0.125   0.0625  0.1875  0.0625  0   0
Cumulative probability      0.0625  0.25    0.4375  0.5625  0.6875  0.75    0.9375  1       1   1
Cumulative probability × 20 1.25    5       8.75    11.25   13.75   15      18.75   20      20  20
Floor rounding              1       5       8       11      13      15      18      20      20  20
In other words, the variance expresses how far a given set of numbers are spread out from their average
value.
4.2.3.3 Covariance (C)
The covariance between two random variables is a way to measure how much they change together. If
each one has a finite set of equal-probability values xi and yi , then the covariance can be calculated as:
C(X,Y) = (1/n) · Σ_{i=1..n} (x_i − x̄)·(y_i − ȳ) (4-4)
If two variables are independent, their covariance is 0; however, the converse is not true. The
variance of a variable equals the covariance of the variable with itself:
V(X) = C(X,X) = σ²(X) (4-5)
4.2.3.4 The Covariance Matrix
A variance-covariance matrix (covariance/dispersion matrix) is a matrix whose element (p,q) represents
the covariance between the p-th and the q-th elements of a random vector. A random vector can be seen as a
multi-dimensional random variable. Each element of this vector has either a finite number of values
resulting from empirical observations, or a finite or infinite number of possible values defined by a joint
probability distribution. This concept is used when analyzing multivariate data.
Example 4–2. We can measure three variables, such as the length, the width and the height of a certain object.
The results of all the measurements taken (in our example, 5) are arranged in a 3 x 5 matrix, where each row
represents a different variable and each column a different observation:
X = | 4.00  4.30  3.90  4.20  4.10 |
    | 1.00  1.10  1.00  1.10  1.20 |
    | 0.50  0.49  0.48  0.51  0.49 |
Obviously, the mean matrix is:
X̄ = | 4.10  4.10  4.10  4.10  4.10 |
    | 1.08  1.08  1.08  1.08  1.08 |
    | 0.49  0.49  0.49  0.49  0.49 |
So, the variance-covariance matrix is computed as:
C(X) = (X − X̄)(X − X̄)^T = | 0.1000  0.0300  0.0030 |
                           | 0.0300  0.0280  0.0004 |
                           | 0.0030  0.0004  0.0006 |
We can notice that the resulting matrix is symmetric, because the covariance between the element i and
the element j is the same as the covariance between the element j and the element i. In the main diagonal
there are the variances of our variables and in the remaining positions the covariances between each pair
of variables. For example, 0.1000 is the variance of the first variable (length) and 0.0300 is the covariance
between the first variable (length) and the second one (width).
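Example 4–2 can be checked with NumPy. Note that the example rounds the height mean to 0.49; with the exact mean of 0.494, the bottom-right entry comes out as 0.0005 instead of 0.0006, while all other entries match:

```python
import numpy as np

# Five measurements of length, width and height (Example 4-2):
# one variable per row, one observation per column.
X = np.array([[4.00, 4.30, 3.90, 4.20, 4.10],
              [1.00, 1.10, 1.00, 1.10, 1.20],
              [0.50, 0.49, 0.48, 0.51, 0.49]])

mean = X.mean(axis=1, keepdims=True)   # centroid of the data
D = X - mean                           # deviations from the mean
C = D @ D.T                            # the scatter form used in the example

# The diagonal holds the variances, the off-diagonal the covariances;
# the matrix is symmetric, as discussed above.
print(np.round(C, 4))
```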
The average vector is often called the "centroid" and the covariance matrix is referred to as the "dispersion"
matrix. This method is often used to build estimators that model the errors or differences
between a set of empirical results and a set of expected values. In feature extraction, this concept models
the spectral variability of a signal.
4.2.4 Eigenvectors and Eigenvalues
A characteristic or proper vector is a vector whose direction remains the same after a linear
transformation is applied to it. The German term "eigen" ("own") was introduced by the mathematician David
Hilbert to denote the eigenvalues and eigenvectors. This concept has many practical applications in
physics and engineering, especially in computer vision.
In the image below, we used the transformation T = [0.5 0; 0 2], meaning that we scaled by a factor of 2
vertically and by a factor of 0.5 horizontally. We can notice that the red vectors did not change their
direction, whereas the green vector did. So, the vectors that were not affected by T are the characteristic
vectors of the transformation T.
Figure 4-5. The effect of a transformation matrix T.
In general, an eigenvector v of a matrix T satisfies the relation:
Tv = λv, where λ represents a scalar called "eigenvalue". (4-6)
If v is not the null vector, then we can calculate the eigenvalues by solving the equation:
| T − λI | = 0 (4-7)
Example 4–3. Calculate the eigenvalues and eigenvectors of a given matrix:
T = | 2  3 |
    | 2  1 |
(4-7) => (2−λ)(1−λ) − 6 = 0 => λ1 = −1 and λ2 = 4.
Now, by substituting these results in (4-6), we obtain:
| 2  3 | |v11| = −1 · |v11|      and      | 2  3 | |v21| = 4 · |v21|
| 2  1 | |v12|        |v12|               | 2  1 | |v22|       |v22|
So, v11 = −v12 and 2·v21 = 3·v22.
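Relations (4-6) and (4-7) are exactly what `np.linalg.eig` solves, so the example can be verified numerically. Note that from the second row of (4-6) the ratio for the second eigenvector is 2·v21 = 3·v22:

```python
import numpy as np

T = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# np.linalg.eig solves T v = lambda v, i.e. relations (4-6)/(4-7).
vals, vecs = np.linalg.eig(T)
print(np.sort(vals))  # [-1.  4.]

# Check the eigenvector conditions derived in Example 4-3
# (eigenvectors are returned normalized, but the ratios are preserved).
v1 = vecs[:, np.argmin(vals)]            # eigenvector for lambda = -1
v2 = vecs[:, np.argmax(vals)]            # eigenvector for lambda = 4
print(np.isclose(v1[0], -v1[1]))         # True:  v11 = -v12
print(np.isclose(2 * v2[0], 3 * v2[1]))  # True:  2*v21 = 3*v22
```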
In section 4.2.3 we established that the variance-covariance matrix models the shape of the data: the
variance values define the spread of the data in the horizontal and vertical directions, whereas the
covariances define the diagonal spread. For example, in the figure below, C = [5 4; 4 6]:
Figure 4-6. A dataset in a 2-dimensional space.
4.2.5 The Eigendecomposition of a Covariance Matrix
We would like to graphically represent the dispersion matrix using a vector whose orientation points in
the direction of the largest spread of the data and whose magnitude equals the spread in that direction.
In the previous section we established that the pairs of eigenvectors and eigenvalues uniquely define
a matrix. Applying this to a dispersion matrix, we could say that the shape of our data can be described in
terms of eigenvectors and eigenvalues.
For a data set D and a direction vector v, the projection of our data onto the vector v is vᵀD. The
covariance of the projected data is vᵀCv. We are searching for the vector v that indicates the direction of
the largest variance, so the task is to maximize vᵀCv with respect to v. By the Rayleigh quotient, the
maximizing vector v is the largest eigenvector of the covariance matrix C.
To sum up, the largest proper vector of the covariance matrix always indicates the direction of the
largest variance of the data, and its length equals the corresponding proper value. The second largest
proper vector is always orthogonal to the largest eigenvector and points in the direction of the second
largest spread of the data.
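The claim that the largest eigenvector maximizes the projected variance can be checked empirically. In this sketch (synthetic data, with C chosen as in Figure 4-6), we scan unit vectors over all directions and compare the best one against the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data (rows are observations), C as in Figure 4-6.
D = rng.multivariate_normal(mean=[0, 0], cov=[[5, 4], [4, 6]], size=2000)
C = np.cov(D, rowvar=False)

vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
v_max = vecs[:, -1]              # largest eigenvector of C

# Scan unit vectors over all directions and measure the projected variance v^T C v.
angles = np.linspace(0, np.pi, 360)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
variances = np.einsum('ij,jk,ik->i', candidates, C, candidates)

best = candidates[np.argmax(variances)]

# The best direction coincides (up to sign) with the largest eigenvector.
assert min(np.linalg.norm(best - v_max),
           np.linalg.norm(best + v_max)) < 0.02
```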
The cases below show the relation between the eigenvalues of a covariance matrix C and the shape
of the data [8]:
Case 1: C is diagonal, so the covariances are zero and the variances are equal to the proper values λ.
Figure 4-7. Diagonal Covariance Matrix.
Case 2: C is not diagonal, so the proper values now represent the variance of the data along the proper
vectors’ directions, whereas the elements of the covariance matrix C define the spread along the x and y
axes. If the covariances are zero, Case 2 reduces to Case 1.
Figure 4-8. Non-diagonal Covariance Matrix.
Let’s consider the case of a white dataset D, which means its covariance matrix C equals the identity
matrix I and a transformation matrix T consisting of a scale matrix S and a rotation matrix R:
C = I = [1 0; 0 1]
T = RS, with R = [cos θ  -sin θ; sin θ  cos θ] and S = [sx 0; 0 sy]
So, the transformed data set D’ can be expressed as:
D’ = TD
Previously, we demonstrated that the covariance matrix can be uniquely represented by its proper vectors
and values:
Cv = λv
In a two-dimensional space, the covariance matrix has two pairs of eigenvectors and eigenvalues. The
resulting system can be written as:
CV = VL, where the columns of the matrix V are the eigenvectors of C and L is a diagonal
matrix whose non-zero values are the corresponding eigenvalues.
Now, we can rewrite the covariance matrix as a function of its proper values and vectors:
C = VLV⁻¹
The equation above is called the eigendecomposition of the covariance matrix. The proper vectors give
the directions of the largest variance of the data, and the corresponding proper values give the magnitude
of this variance in those directions. For this reason, V can be seen as a rotation matrix and √L as a
scaling matrix. Using these observations, we can rewrite C as:
C = RSSR⁻¹, where R = V and S = √L.
Moreover, R is orthogonal (R⁻¹ = Rᵀ) and S is symmetric (S = Sᵀ). With these last observations,
and knowing that the transformation matrix T consists of a scale matrix S and a rotation matrix R, C
becomes:
C = TTᵀ
Figure 4-9. White data and Observed data.
To conclude, by applying a linear transformation T = RS to a white data set D, we get scaled and rotated
data D’ whose covariance matrix C equals TTᵀ. So we have shown that the covariance matrix
of observed data is in fact a linear transformation of uncorrelated (white) data. Moreover, the proper
vectors of the transformation correspond to the rotation and the proper values correspond to the scaling.
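This conclusion is easy to verify numerically. The sketch below (arbitrary made-up angle and scales) draws white data, transforms it with T = RS and compares the empirical covariance of the result with TTᵀ:

```python
import numpy as np

rng = np.random.default_rng(1)

# White data: zero mean, identity covariance (approximately).
D = rng.standard_normal((2, 100000))

theta = np.pi / 6                       # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2.0, 0.5])                 # arbitrary scales sx, sy
T = R @ S

D2 = T @ D                              # observed data D' = TD
C = np.cov(D2)                          # empirical covariance of D'

# The covariance of the observed data approaches T T^T.
assert np.allclose(C, T @ T.T, atol=0.1)
```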
4.2.6 Principal Component Analysis (PCA)
Principal component analysis is a statistical method applied to a dataset in order to emphasize variation
and identify strong patterns. It is frequently used in order to visualize and explore data more easily than in
its original form.
In the example below we have a two-dimensional dataset. If we are particularly interested in how
those data points vary, we look for another coordinate system in which the variation can be seen more
clearly. After mapping the initial points to the new system, every point (x, y) receives a new value
(pc1, pc2). The new axes have no physical meaning; they were selected only to emphasize variation.
We can notice that the first and the second principal components (pc1 and pc2) were chosen along the
directions in which the samples vary the most (the red line and the green line).
Figure 4-10. Changing the data space with PCA.
Looking more closely at the principal components, we notice that the second one could be dropped,
because it contributes very little to the variation. In conclusion, the PCA procedure helped us reduce the
number of dimensions needed to describe our dataset.
PCA is even more useful in three dimensions, because it is difficult to see through a cloud of data.
Instead, we can project the original data onto a two-dimensional space after finding the best viewing
angle (eliminating the dimension with the lowest variation).
Using PCA one can detect patterns in data and then present the data in such a way as to highlight
differences and similarities. The biggest advantage of PCA is that, once those patterns are found, the
data can be compressed without much loss.
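The procedure described above can be sketched in a few lines of NumPy (the dataset is synthetic; the covariance used to generate it is a made-up value):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D points that vary mostly along one direction.
D = rng.multivariate_normal([0, 0], [[5, 4], [4, 6]], size=500)
D = D - D.mean(axis=0)                 # center the data

C = np.cov(D, rowvar=False)
vals, vecs = np.linalg.eigh(C)         # ascending eigenvalues

pc1 = vecs[:, -1]                      # first principal component
scores = D @ pc1                       # 1-D description of every point
reconstruction = np.outer(scores, pc1) # back-projection into the original space

# Share of the total variance kept after dropping pc2.
explained = vals[-1] / vals.sum()
print(f"variance explained by pc1: {explained:.2f}")
```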
4.2.7 PCA in Computer Vision
Let’s suppose each photograph in the training database has a resolution of N×N pixels. By concatenating
the rows of pixels of an image, starting from the top, an N²-dimensional vector is obtained,
as the image below suggests:
Figure 4-11. Digital image stored as a vector.
If our database is a collection of “d” images, the preprocessed images will be stored as one matrix, where
each column is a mean-subtracted image from our database:
Figure 4-12. Database matrix T.
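The flattening and mean-subtraction steps can be sketched as follows (a toy case with random 8×8 “images”, not real photographs):

```python
import numpy as np

rng = np.random.default_rng(3)

N, d = 8, 5                            # toy case: five 8x8 "images"
images = rng.random((d, N, N))

# Concatenate the rows of each image into an N^2-dimensional vector.
columns = images.reshape(d, N * N).T   # shape (N^2, d)

mean_image = columns.mean(axis=1, keepdims=True)
T = columns - mean_image               # each column: a mean-subtracted image

assert T.shape == (N * N, d)
assert np.allclose(T.mean(axis=1), 0)  # every pixel now has zero mean
```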
Now, the next step is to calculate the covariance matrix and perform the PCA algorithm on it. In this case,
the covariance matrix is the product of the database matrix and its transpose: C = TTᵀ. So we would have
to work with an N² × N² dispersion matrix, which is very expensive in terms of computation.
Figure 4-13. The covariance matrix of the database matrix T
But here comes the brilliant idea of Matthew Turk and Alex Pentland, who thought of performing PCA on
TᵀT instead of TTᵀ. In this case, the covariance matrix is d × d, much smaller than in the first
case. This is the reason why in the first chapter I mentioned that Alex Pentland and Matthew Turk
discovered a way to extract the eigenvectors based on the number of input images rather than the
number of pixels.
Figure 4-14. The covariance matrix TᵀT of the database matrix T.
For instance, in a database made of 200 pictures, each with a resolution of 100×100 pixels, we could
perform PCA on a matrix of 100,000,000 elements or on a matrix of 40,000 elements, which is 2,500
times cheaper.
So, in order to obtain the eigendecomposition of the covariance matrix, we would have to solve the
following equation:
First case (working with the large matrix C(T) = TTᵀ):
C(T)vi = λivi, where vi is a proper vector and λi a proper value of C(T) (4-8)
Second case (working with the small matrix C’(T) = TᵀT):
C’(T)wi = λiwi, i.e. TᵀTwi = λiwi, where wi is a proper vector and λi a proper value of C’(T) (4-9)
By pre-multiplying equation (4-9) by the matrix T, we get:
TTᵀTwi = λiTwi => C(T)(Twi) = λi(Twi)
From this we infer that if wi is an eigenvector of C’(T) = TᵀT, then vi = Twi is an eigenvector of
C(T) = TTᵀ.
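This derivation can be checked numerically. In the sketch below (random mean-subtracted “images”, sizes made up), the largest eigenvector wi of TᵀT is mapped by T to an eigenvector of TTᵀ with the same eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(4)

Npix, d = 400, 10                       # 400 pixels, 10 images
T = rng.standard_normal((Npix, d))
T = T - T.mean(axis=1, keepdims=True)   # mean-subtracted columns

small = T.T @ T                         # d x d matrix (cheap to decompose)
big = T @ T.T                           # Npix x Npix matrix (expensive)

lam, W = np.linalg.eigh(small)          # eigenpairs of T^T T (ascending)
w = W[:, -1]                            # largest eigenvector w_i
v = T @ w                               # candidate eigenvector of T T^T

# v = T w is an eigenvector of T T^T with the same eigenvalue λ_i.
assert np.allclose(big @ v, lam[-1] * v)
```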
4.2.7.1 Singular Value Decomposition (SVD)
Another technique that helps us increase efficiency is the singular value decomposition (factorization)
of a matrix. We are going to apply this algorithm to our covariance matrix in order to obtain the eigenvectors
and eigenvalues without actually solving equation (4-9).
C’(T) = TᵀT, (4-10)
where T represents the database matrix, each of whose columns xi is a mean-subtracted image vector, and d the
number of images in the database (the number of observations).
We define the singular value decomposition of T as:
T = UΣVᵀ, (4-11)
where U is a unitary matrix (U*U = UU* = I), Σ is a rectangular diagonal matrix and V is also a unitary
matrix.
We can substitute T in equation (4-10) using equation (4-11):
C’(T) = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀUᵀUΣVᵀ = V(ΣᵀΣ)Vᵀ,
where the non-zero components of Σ are the square roots of the non-zero proper values of C’(T) (or C(T))
and the columns of V are the proper vectors of C’(T).
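The relations above can be verified with a small numerical sketch (random database matrix of made-up size): the squared singular values of T equal the eigenvalues of TᵀT, and the columns of V are its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(5)
T = rng.standard_normal((400, 10))       # database matrix (pixels x images)

U, s, Vt = np.linalg.svd(T, full_matrices=False)

# Squared singular values = non-zero eigenvalues of T^T T (and of T T^T).
lam = np.linalg.eigvalsh(T.T @ T)        # ascending order
assert np.allclose(sorted(s ** 2), lam)

# The rows of Vt (columns of V) are the eigenvectors of T^T T.
for sigma, v in zip(s, Vt):
    assert np.allclose((T.T @ T) @ v, sigma ** 2 * v)
```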
4.2.8 Pseudocode Eigenfaces Recognizer
0. Calculate the eigenvectors and keep only the most significant ones (the first k), as seen in
Section 4.2.1 (Steps to obtain the eigenfaces).
1. Subtract the mean image M from the input image In and calculate the weight of each eigenface Eigi:
for i = 1: k
wi = EigiT * (In-M)
2. Gather all the weights calculated previously and form a vector W that reflects the contribution of
each eigenface in the input image (this is equivalent to projecting the input image onto the face-
space).
W = [ w1 … wi ….wk ]
3. Calculate the distances between the test image In and every image in the database.
for j = 1:d
Disj = ||W - Wj||2
4. Choose the minimum distance.
minDis = minj=1:d (Disj)
5. Determine if the input image In is “known” or not, depending on a threshold t.
if (minDis < t) then In is “known”, else In is “unknown”
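The pseudocode above can be sketched in Python/NumPy. This is a toy illustration on random “images”; the database size, the number of eigenfaces k and the threshold t are made-up values, not those of the thesis application:

```python
import numpy as np

rng = np.random.default_rng(6)

Npix, d, k = 100, 6, 3                   # pixels, database size, kept eigenfaces
database = rng.random((d, Npix))         # flattened training images (rows)

# Step 0: eigenfaces via the T^T T trick, keeping the k most significant ones.
M = database.mean(axis=0)                # mean image
A = (database - M).T                     # Npix x d, mean-subtracted columns
lam, W = np.linalg.eigh(A.T @ A)
eigenfaces = A @ W[:, ::-1][:, :k]       # top-k eigenvectors of A A^T
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

def project(img):
    # Steps 1-2: the weight vector W of the image in the face-space.
    return eigenfaces.T @ (img - M)

db_weights = np.array([project(img) for img in database])

def recognize(img, t=5.0):               # t: made-up acceptance threshold
    # Steps 3-5: nearest database image; "known" if the distance is below t.
    dist = np.linalg.norm(db_weights - project(img), axis=1)
    j = int(np.argmin(dist))
    return (j, "known") if dist[j] < t else (j, "unknown")

# A training image matches itself with distance 0.
j, verdict = recognize(database[2])
assert j == 2 and verdict == "known"
```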
4.3 The Eigenfaces Recognizer – Experimental Results
In 1991 Matthew Turk and Alex Pentland published an article called “Eigenfaces for recognition” in the
Journal of Cognitive Neuroscience. They tested their method on a database of 2,500 images, all
digitized under constrained conditions. Sixteen persons were photographed multiple times at all combinations
of three head sizes, three orientations and three lighting conditions. The training images were reduced from a
resolution of 512 x 512 pixels to a resolution of 16 x 16.
During the first experiment, the acceptance threshold was kept at its maximum value. Matthew Turk and Alex
Pentland tried to see how changes in lighting, head position and scale would affect the performance of their
system on the entire database. Different groups of sixteen pictures were selected and used as the training set.
Each group consisted of one photo of each subject, all taken under the same conditions of lighting, head size
and head rotation. During this experiment, no face was labeled as “unknown”. The system achieved 96%
correct classification averaged over lighting variation, 85% averaged over position variation and 64% over
size variation.
For the second experiment, the same process was followed, but this time the threshold was also varied. At
lower values, few mistakes were made when identifying the subjects, but many faces were rejected as
“unknown”. On the contrary, at a high value almost all faces were classified as “known”, at the cost of many
wrong associations. Finally, adjusting the acceptance threshold to achieve 100% accuracy lowered the correct
classification rates as follows: 81% for lighting variation, 61% for orientation and 40% for scale.
Figure 4-15. The “Eigenfaces for recognition” experiment (1991).
Each graph presented above depicts averaged performance. The y axis represents the number of correct
classifications (out of 16). The x axis represents the condition that was varied during the experiment:
(a) lighting
(b) head scale
(c) head orientation
(d) lighting and orientation
(e) size and orientation _1
(f) size and orientation_2
(g) lighting and size_1
(h) lighting and size_2
In the same paper, Matthew Turk and Alex Pentland mentioned that this method could also be used to locate
faces in a given image. It starts with the same procedure: project the image onto the face-space. Real faces
do not change very much when projected onto the low-dimensional space, whereas non-faces do
[9].
5 IMPLEMENTATION AND TESTING RESULTS
My project consists of a C# software application that detects and recognizes human faces in an input
image or video frame from a live video source. The program was developed in Microsoft Visual
Studio 2010 using EmguCV (a cross-platform .NET wrapper for the OpenCV image processing
library) and a Microsoft Access database file.
The application has two modules, one for face detection and the other for face recognition. The system can
detect, count and extract faces in a secondary window. It can also recognize people’s faces if they are
registered in the Microsoft Access database. The user can extend the database by adding new faces or delete
existing records directly from the Visual Studio application.
5.1 Required Software
5.1.1 Microsoft Visual Studio
Microsoft Visual Studio is an IDE (Integrated Development Environment) from Microsoft. A
developer can use it to build computer programs for Microsoft Windows, web applications, web sites
or web services. Visual Studio can generate both native and managed code. This environment incorporates
a code editor supporting IntelliSense as well as code refactoring. The integrated debugger works both as a
source-level debugger and a machine-level debugger.
Visual Studio has a forms designer for developing graphical user interface applications and other
built-in tools, such as a class designer, a web designer and a database designer. It supports numerous
programming languages: C, C++, C#, Visual C++, VB.NET, F#, JavaScript, HTML/XHTML,
XML/XSLT and CSS. To code in other languages, such as Ruby, Node.js or Python, we need to
install language services.
Tell me and I forget, teach me and I may remember,
involve me and I learn.
- Benjamin Franklin -
Figure 5-1. Microsoft Visual Studio 2010 Environment.
The Windows Forms designer is used to develop GUI applications which are basically event-driven
applications. In this type of programming, the user’s actions, a sensor output or a signal from outside, can act
as a trigger and influence the flow of our program.
5.1.1.1 C# Language
C# is a type-safe object-oriented language that allows programmers to build safe and robust applications that
run on the .NET Framework. C# can be used to develop Windows client applications, database applications,
XML web services or client-server applications.
Its syntax is easy to understand for developers who have previously worked with C, C++ or Java. Object-oriented
concepts such as inheritance, encapsulation and polymorphism are supported in C#. While reducing some
complexities of C++, it also provides powerful elements which cannot be found in Java, such as nullable
value types, delegates, attributes, properties, enumerations, XML documentation comments, LINQ (Language-
Integrated Query) and direct memory access. It provides generic methods and types as well as iterators,
increasing in this way type safety and performance.
Applications written in C# run on the .NET Framework, an integral element of Windows which includes a
virtual execution system named CLR (Common Language Runtime) and a unified set of class libraries. The
CLR is Microsoft’s commercial implementation of the CLI (Common Language Infrastructure).
The C# code is compiled into a language called IL (Intermediate Language) that matches the CLI
specification. The code and resources are stored on the hard disk as an “assembly” (an executable file with the
extension .dll or .exe). When a software application is executed, the assembly is loaded into the CLR.
Then the code is compiled again, turning the IL code into machine instructions. The next image shows the
connections between the C# source code, the assemblies, the .NET Framework and the class libraries:
Figure 5-2. C# code stages.
5.1.2 The EmguCV Library
EmguCV is a cross platform .NET wrapper for the OpenCV image processing library. As it is shown below,
this platform incorporates two layers. The first one is the basic layer having the same functionalities as
OpenCV and the second one contains classes from the .NET universe:
Figure 5-3. EmguCV layers.
5.1.3 Microsoft Access
Microsoft Access is part of the Microsoft Office suite of applications and represents a database
management system (DBMS) that connects the Microsoft Jet Database Engine with a graphical user
interface and other software-development tools.
It stores information in its own format based on the Access Jet Database Engine. Microsoft Access
can also import or connect to data available in other databases. Microsoft Access supports VBA
(Visual Basic for Applications), an object-oriented language which can work with DAO (Data Access
Objects) and ActiveX Data Objects.
Figure 5-4. Microsoft Access 2010.
5.1.3.1 Database overview
A database represents an organized collection of data. It is composed of a group of tables, schemas, reports,
queries and views. The information inside is usually structured in a manner that supports processes requiring
information. When dealing with large amounts of data, it is desirable to be able to easily access, manage and
update it.
Databases can be classified according to their content: statistical, bibliographic, document-text or multimedia
objects. They can also be categorized depending on their organizational approach. A well-known type of
database is the relational database. In this case the information is modeled in a form that allows us to reorganize
and access it in many different ways. Data appears in two-dimensional tables that are “normalized” so the
information is not repeated more often than necessary. Each table has what is called a “primary key”, a
field that uniquely identifies each record in the table.
An object-oriented database stores objects rather than integer, string or real-number data. The
objects consist of attributes and methods and are used in object-oriented languages such as C++, Java and
many others. So, objects contain both data and executable code.
A database management system (DBMS) is a software application able to interact with the database, the user
and other applications in order to extract and analyze data. A DBMS lets us define, create, query, update and
administrate databases. MySQL, Microsoft SQL Server, PostgreSQL, Sybase, SAP HANA, Oracle and IBM
DB2 are common examples of DBMSs.
A distributed database is a database whose data is split and stored in multiple physical locations
called “nodes”. A distributed database management system (DDBMS) manages the data as if it
were stored in a single location: what happens at the DDBMS level is reflected elsewhere in the
structure.
5.1.3.2 OLEDB
OLEDB or OLE DB (“Object Linking and Embedding, Database“) is an application programming
interface designed by Microsoft that lets developers access data from numerous sources in a uniform way. It
provides a collection of interfaces implemented using the Component Object Model (COM). OLEDB is part
of the Microsoft Data Access Components (MDAC) stack.
Basically, OLEDB detaches the data (“provider”) from the software application (“consumer”). The latter
accesses the information stored in the database through special tools which provide a level of
abstraction. In this way one can work with different sources of data without necessarily knowing technology-
specific methods.
OLEDB has a set of routines for writing and reading data. The OLEDB objects can be data source objects,
session objects, command objects and rowset objects. A program which uses OLEDB will typically follow
this sequence: initialize the object, connect to the database, issue a command, process the results and release
the data source object.
5.2 Application Configuration and Testing Results
The WinForm application is divided into a face detection module and a face recognition module. Although
recognition builds upon detection, they are different concepts. Detection means identifying any human face in
a digital image, whereas recognition means identifying a known face and correlating it with a known name.
From the parent form illustrated below, the user can choose a single task at a time by clicking on a specific
button: Face Detection / Face Recognition:
Figure 5-5. Start Form.
5.2.1 Face Detection Form
If the user has chosen “Face Detection”, another form will open. Here we can detect faces using an
input image from the hard disk (“Browse Image” button) or from a live video stream (“Start Web” button), as
shown in the example below.
On the right, there is a groupBox named “Tuning Parameters”, used to adjust the detector
depending on what we are interested in within the input image. In the second groupBox, called “Detection Results”,
we can see the number of detected faces and also navigate through them.
Figure 5-6. Face Detection Form.
The tuning parameters are:
The scan window size
This parameter basically represents the dimension of the smallest face to look for in a given image. In the
example below, we can see a red square sliding from left to right, moving across the entire
picture. At the next run, the square increases its size and scans the image again, this time searching
for bigger faces, and so on.
Figure 5-7. The scan window.
The Minimum Neighbors Threshold
If, at some point, the area within the scan window passes all the cascade stages, this indicates that there might
be a human face, and that area is marked with a rectangle around it. But false detections sometimes
happen, so we cannot take the detector’s result after only one scan as a final response. We need to let the
detector run the detection window over the test image more than once and at different scales. Finally, if a
zone is marked with a group of overlapping rectangles, then we can consider the area covered by those
rectangles a human face, as in figure 5-8. On the other hand, we should reject every isolated detection.
Figure 5-8. Detection rectangles overlapping.
The Minimum Neighbors Threshold parameter specifies how many rectangles we should find in a group after
running the detector over the image before considering that group a true human face. For example, if we set
this parameter to 3 or less, the detection will return an affirmative response for image 5-8. On the other
hand, setting it higher than 3, the detector will find zero faces in this particular image.
The following examples illustrate the effects of adjusting this threshold at different values:
Figure 5-9. Minimum Neighbors = 1.
Figure 5-10. Minimum Neighbors = 10.
Figure 5-11. Minimum Neighbors = 4.
Figure 5-12. Minimum Neighbors = 0.
Setting this parameter too low will detect all the faces at the cost of some false detections, as figure 5-9
illustrates. Setting it too high will result in missed faces (figure 5-10). Leaving the threshold at zero lets us
see all the raw detections, as in figure 5-12. Finally, there exists a range of values that identifies all the
faces in our image (figure 5-11).
The Scale Increase Rate (“Window Size”)
This value establishes how fast the detector increases its scale when running over a given image. A
lower value makes the detector run more slowly but with more precision. On the contrary, a higher value
might result in missed faces. By default, the scale increases by 10% (1.1) at each step.
The user can choose between four values (1.1, 1.2, 1.3 or 1.4) from the application interface.
Figure 5-13. Scale increase rate = 1.2.
Figure 5-14. Scale increase rate = 1.4.
5.2.2 Face Recognition Form
If the user has chosen “Face Recognition”, a new form opens from the start form. Here the system
can recognize faces in an image from the hard disk (“Browse Image” button) or from a live
video stream (“Start Web” button).
For the recognition task I used a database consisting of 80 images of 8 subjects (10 pictures per
subject). The images capture different facial expressions, as we can see below:
Figure 5-15. The database used.
The groupBox named “Tuning Parameters” is used to adjust the recognizer depending on what we are
interested in within the current frame. The second groupBox, called “Update Database”, lets us add a new
face to the training database and label it, or delete an existing face from the database.
Figure 5-16. Face Recognition Form.
Setting the threshold at a low value results in high accuracy at the cost of many unrecognized faces, as we
can see in figure 5-17. In figure 5-18 we can notice what happens when the threshold is increased. On
the other hand, if the threshold is set too high, the system will respond to any similarity between the test image
and an image in the database. For instance, the subject in figure 5-19 does not even exist in the
database.
Figure 5-17. Threshold value = 500.
Figure 5-18. Threshold value = 3500.
Figure 5-19. Threshold value = 5000.
In the example below (figures 5-20 and 5-21) we can see how the performance is affected when the lighting
conditions change. At a fixed threshold value, the system no longer recognizes the same subject. This happens
because the eigenfaces are constructed from the information available in the training images, rather than
from specific features of the subject. So, when the testing environment changes, the system fails the test.
Figure 5-20. Normal light conditions.
Figure 5-21. Strong light on the right side of the face.
In conclusion, the previous tests show that there is no universal “recipe” for tuning the detector or the
recognizer. We should adjust all the parameters involved according to the given situation and our needs.
5.3 Some Implementation Details
Below I explain some code sequences that are relevant to our topic and some aspects that should be taken into
consideration when implementing a similar application.
The HaarCascade Object
This is how we create a HaarCascade object for frontal face detection using a classifier provided by the
EmguCV library. This object contains all the information stored in an .xml file during the learning process.
The DetectHaarCascade Method
The method “DetectHaarCascade” searches for areas in the input grayscale image (in our case, the object
named “gray”) that may contain entities the cascade classifier has been trained for. It returns those parts as a
sequence of rectangles. This function scans the input picture multiple times at different sizes. Finally, it
collects all the candidate rectangles from the photo (areas that successfully passed the cascade filter) and
groups them, returning a sequence of average rectangles for each large enough group. This method needs five
parameters, as we can see in the code below. The HaarCascade object (“face”) is the cascade
classifier itself. The scale factor (“ScaleIncreaseRate”) is the factor by which the detection window is
enlarged at the next pass over the image. The minimum number of neighbors (“MinNeighbors”) is the
minimum number of overlapping rectangles that indicate the wanted object; when this parameter is 0,
the detection method does not group the candidate rectangles at all, acting as a “raw” detection. The flag
parameter indicates possible ways to optimize the operation. In my case, I chose “DO_CANNY_PRUNING”,
which rejects areas that contain too many or too few edges. When the flag is left at its default value, no
optimization is done. And finally, the minimum scan window size represents the size at which the
detector starts to run over the input image.
The EigenObjectRecognizer
In the following code sequence a recognizer object is instantiated. As the name of the class suggests
(“Eigen”), the recognizer uses the Principal Component Analysis method. Between the brackets we can see four
parameters. The first of them is an array of images used for training (“trainingPictures”). The second one is a
collection of names (“labels”) that matches the array of images mentioned previously. The third parameter is
the eigen distance threshold, the maximum distance at which a face is still considered “known”.
Setting it at a high value will make the system “react” to any similarity between the test image and another
image in the database, whereas setting it at a low value will cause many faces to be rejected as
“unknown”. The last parameter (“termCrit”) is the termination criterion, something specific to any iterative
algorithm. The stopping criterion consists of two other parameters: the number of iterations (usually set to
the number of images in our database) and the tolerance of the algorithm (the epsilon error).
The OLEDB Interface
In order to make the C# Winform Application and the Microsoft Access Database file work together, the
OleDbConnection class will be used. Its instance represents a unique connection between the provider and the
consumer. As a consequence, the OleDbConnection object will know where the database is located on the
hard disk.
Because the OLEDB interface supports accessing information stored in any format (text file, spreadsheet or
database), an OLEDB Provider is needed. The provider will expose data from a certain type of source, in our
case a Microsoft Access Database file.
Anytime we want to make a change to the database, we won’t work directly on it for security reasons. Instead,
we will use a DataTable object, which could be seen as a local database. It contains rows and columns exactly
as a “mirror” of the original database. We can select, add or iterate over stored data.
Another entity needed is an OLEDB Data Adapter, which serves as a bridge between the local table and the
data source. Any change made to the local table can be loaded into the original database and vice versa.
The OLEDB Data Adapter object has read/write properties in order to fill or update the database.
However, the Data Adapter doesn’t automatically generate queries such as INSERT, UPDATE or
DELETE. It works together with an OLEDB Command Builder object. We can see this association in the
code below. One can only correlate an OleDbDataAdapter object with one OleDbCommandBuilder object
at a time.
The sequence above presents a method that implements the connection between the winform and the
database:
Training images are stored inside our database as OLE objects. Because of this, their format must be
converted from “Emgu.CV.Image” to an array of bytes and this is exactly what the next function does:
The same steps are taken in reverse in order to read images from the database file:
The following OleDbCommand object represents an insert statement to be executed against our database:
The Dispose Method
When working with a video camera that acts as a resource for our application, we must manage it properly.
The next function (“FormClosingFct”) takes care of this crucial aspect: it releases the web camera just
before the winform closes. It is absolutely necessary to implement such a procedure, because the .NET
Framework garbage collector won’t allocate or release unmanaged memory zones. In the code below, “grabb”
is a “Capture” object that provides information about the current frame. Without disposing of the
unmanaged resources used by “grabb”, memory access errors will arise.
6 CONCLUSIONS
6.1 Face recognition in our everyday life
Nowadays, facial recognition, along with other biometric identification solutions, is accepted by our
society for security reasons. For instance, the police and the military, as well as other organizations,
make use of it in order to keep our community safe. Without a doubt, its main advantage remains
the fact that this procedure does not require the subject’s cooperation. As a consequence, face recognition
systems can be found in airports and other public places all around the world in order to prevent terrorism and
criminality. However, there are many other fields where this technology is employed, such as:
Searching for missing children
More than 1.2 million children disappear annually worldwide. The Android application called
“Helping Faceless” helps kidnapped children reunite with their families. A user takes photos
of a street child and uploads them to a server; the application then tries to find the child in
question in a database of lost children.
Online examination
Today one can take an exam simply by sitting in front of one’s personal computer. Moreover, with
face recognition the teacher has no reason to doubt the identity of the student being examined.
Eliminating duplicates in a voting system
Sometimes the same person uses two or even more accounts in order to vote or
register on a network multiple times. With a facial recognition system in place, this kind of
fraud becomes much harder to commit.
I am still learning.
- Michelangelo -
Prosopagnosia disorder
Prosopagnosia, or “face blindness”, is the inability of a person to recognize faces. No cure
has been found yet. Therefore, a portable face recognition system could definitely help the
patient avoid unpleasant situations: the system would identify the face of a given person
and pass descriptive information along to the patient.
Video banking
Credit card companies are now creating ways to pay using nothing but facial recognition, so we
will no longer need to worry about identity theft or simply forgetting a password.
6.2 Indoor Facial Detection and Recognition application
The facial recognition system I implemented could be successfully used inside an office or any indoor
space where real-time performance is needed. When working in a controlled environment, this application
represents an effective solution in terms of computational resources. The program would also be appropriate
for a home surveillance system: one cannot “feel at home” without feeling safe. Hence, my application
would provide an accurate way to limit the access of strangers or simply notify the householder about
somebody’s arrival.
6.3 Conclusions
All in all, facial recognition techniques have evolved remarkably in the last decade. Nevertheless, face
recognition remains an important field in the computer vision research area. We can easily imagine a future where
this technology will be able to recognize any face in a crowded place. “Creepy” as it may seem, I believe we can
all agree that our safety is far more important than the desire to stay anonymous. And, not to forget, just like
every technological advance, it will save precious time and make life easier for all of us.
BIBLIOGRAPHY
[1] G. Bradski and A. Kaehler, Learning OpenCV, 2008.
[2] P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features,” in
Conference on Computer Vision and Pattern Recognition, 2001.
[3] P. Viola and M. Jones, “Robust Real-Time Face Detection,” International Journal of Computer Vision,
vol. 57, no. 2, pp. 137–154, 2004.
[4] P. H. Winston, “Lecture 17: Learning: Boosting,” MIT 6.034 Artificial Intelligence, 2010. [Online].
Available: https://www.youtube.com/watch?v=UHBmv7qCey4. [Accessed: 01 03 2016].
[5] H. Geiger, “Facial Recognition and Privacy,” Center for Democracy & Technology, 2012.
[6] R. Kimmel and G. Sapiro, “The Mathematics of Face Recognition,” SIAM News, 2003.
[7] [Online]. Available: https://en.wikipedia.org/wiki/Histogram_equalization. [Accessed: 02 05 2016].
[8] [Online]. Available: http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-
matrix/. [Accessed: 12 06 2016].
[9] M. Turk and A. Pentland, “Face recognition using eigenfaces,” in Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 1991.
[10] F. Crow, “Summed-area tables for texture mapping,” in Proceedings of the 11th Annual Conference on
Computer Graphics and Interactive Techniques, 1984.
[11] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, pp. 71–86, 1991.
[12] L. I. Smith, “A tutorial on Principal Components Analysis,” 26 02 2002. [Online]. Available:
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf. [Accessed: 10 04
2016].