Trabajo Fin de Grado
Ingeniería Electrónica, Robótica y Mecatrónica
Indoor Facial Detection and Recognition Application
Autor: Pătraşcu Viorica Andreea
Tutor: Jesus Capitan Fernandez
Dep. Ingeniería de Sistemas y Automática
Escuela Técnica Superior de Ingeniería
Universidad de Sevilla
Sevilla, 2016
Trabajo Fin de Grado
Ingeniería Electrónica, Robótica y Mecatrónica
Indoor Facial Detection and Recognition
Application
Autor:
Pătraşcu Viorica Andreea
Tutor:
Jesus Capitan Fernandez
Profesor Ayudante Doctor
Dep. Ingeniería de Sistemas y Automática
Escuela Técnica Superior de Ingeniería
Universidad de Sevilla
Sevilla, 2016
Trabajo Fin de Grado: Indoor Facial Detection and Recognition Application
Autor: Pătraşcu Viorica Andreea
Tutor: Jesus Capitan Fernandez
El tribunal nombrado para juzgar el Trabajo arriba indicado, compuesto por los siguientes miembros:
Presidente:
Vocales:
Secretario:
Acuerdan otorgarle la calificación de:
Sevilla, 2016
El Secretario del Tribunal
Acknowledgements
Through my educational journey, I was lucky enough to meet several mentors who had a great influence on
me and truly shaped the way I am today. To all of them, I say thank you!
Sincere appreciation to my family and my dearest friends for supporting me through the past year and always
encouraging me not to give up.
And last but not least, I want to express my gratitude towards all the beautiful people I have met here, in
Seville. Thank you for giving me a sense of belonging during my Erasmus experience!
Viorica Andreea Pătraşcu
Seville, 2016
Abstract
Automatic face recognition represents a fascinating biometric identification method which uses the same
identifier as humans do to distinguish one person from another: their faces. Although it is a new concept, this
technology has evolved incredibly fast and has reached such a level where it can even distinguish between
identical twins. Face recognition has many applications in our daily life, such as: security surveillance, access
control, smart payment cards and even helping individuals who suffer from Prosopagnosia disorder.
For my Final Year Project I implemented a C# software application which detects and recognizes human faces
in an input image or video frame from a live video source. This system could be successfully used inside an
office, a household or any indoor space, where the working environment is a constrained one.
As a background for our topic, this report gives a brief introduction about Artificial Intelligence, Machine
Learning and Computer Vision concepts. Then it covers in detail the theory behind the Viola-Jones method for
Face Detection and the Eigenfaces algorithm for Face Recognition. Last but not least, it describes the
application implementation, the experimental results and it finally outlines the advantages of the system
created.
Table of Contents
Acknowledgements 7
Abstract 8
Table of Contents 9
List of Tables 11
List of Figures 12
1 Introduction 15
1.1 Overview 15
1.2 Motivation 15
1.3 About the project 16
1.4 Achievements 16
1.5 Objectives 16
1.6 About the report 16
2 Background 17
2.1 Artificial Intelligence 17
2.2 Computer Vision 18
2.3 Machine Learning (ML) 19
2.3.1 Supervised and Unsupervised Dataset 19
2.3.2 Discriminative and Generative Models 20
2.3.3 Machine Learning and Computer Vision 21
2.3.4 Variable Significance 21
2.3.5 Common Problems 21
2.3.6 Cross-validation, Bootstrapping, ROC curves and Confusion matrices 22
2.3.7 Binary Decision Trees 23
2.3.8 Boosting 24
3 Face Detection 25
3.1 The Viola-Jones Detection Framework - Theory 25
3.1.1 The Haar-type Features 26
3.1.2 The Integral Image 26
3.1.3 Adaboost 30
3.1.4 Cascade Filter 34
3.2 The Viola-Jones Detection Framework – Experimental Results 36
4 Face Recognition 37
4.1 Face Recognition Overview 37
4.2 The Eigenfaces Recognizer - Theory 38
4.2.1 Steps to obtain the eigenfaces 40
4.2.2 Preprocessing the images 40
4.2.3 The Covariance Matrix 42
4.2.4 Eigenvectors and Eigenvalues 44
4.2.5 The Eigendecomposition of a Covariance Matrix 45
4.2.6 Principal Component Analysis (PCA) 48
4.2.7 PCA in Computer Vision 49
4.2.8 Pseudocode Eigenfaces Recognizer 53
4.3 The Eigenfaces Recognizer – Experimental Results 53
5 Implementation and Testing Results 55
5.1 Required Software 55
5.1.1 Microsoft Visual Studio 55
5.1.2 The EmguCV Library 58
5.1.3 Microsoft Access 59
5.2 Application Configuration and Testing Results 60
5.2.1 Face Detection Form 61
5.2.2 Face Recognition Form 66
5.3 Some Implementation Details 70
Bibliography 75
LIST OF TABLES
Table 4-1. Matrix of pixel values. 41
Table 4-2. Histogram equalization steps. 42
LIST OF FIGURES
Figure 2-1. Inference with the scope of propositional logic. 17
Figure 2-2. The Turing machine structure. 18
Figure 2-3. Machine learning algorithms available in the OpenCV library. 20
Figure 2-4. Underfitting, best fitting and overfitting. 21
Figure 2-5. The ROC curve and the confusion matrix. 23
Figure 2-6. Decision tree impurity measures. 24
Figure 3-1. Some Haar-type features. 26
Figure 3-2. The input image and the integral image. 27
Figure 3-3. The four references needed in the integral image. 28
Figure 3-4. The integral image and the input image. 28
Figure 3-5. Edge feature – 6 references. 29
Figure 3-6. Line feature – 8 references. 29
Figure 3-7. Four rectangular feature – 9 references. 29
Figure 3-8. DS1(original dataset), C1(trained classifier 1), C1’s mistakes. 31
Figure 3-9. DS2 (weighted dataset 2), C2 (trained classifier 2), C2’s mistakes. 31
Figure 3-10. DS3 (weighted dataset 3), C3 (trained classifier 3) 32
Figure 3-11. The final classifier. 32
Figure 3-12. Cascade classifier. 35
Figure 3-13. “haarcascade_frontalface_default”, available in the EmguCV library. 35
Figure 3-14. ROC characteristic. 36
Figure 4-1. Structured light. 38
Figure 4-2. XBOX 360. 38
Figure 4-3. Example of eigenfaces. 39
Figure 4-4. Histogram equalization. 41
Figure 4-5. The effect of a transformation matrix T. 44
Figure 4-6. A dataset in a 2-dimensional space. 45
Figure 4-7. Diagonal Covariance Matrix. 46
Figure 4-8. Non-diagonal Covariance Matrix. 47
Figure 4-9. White data and Observed data. 48
Figure 4-10. Changing the data space with PCA. 49
Figure 4-11. Digital image stored as a vector. 50
Figure 4-12. Database matrix T. 50
Figure 4-13. The covariance matrix of the database matrix T 51
Figure 4-14. The covariance matrix of the database matrix TT 51
Figure 4-15. The “Eigenfaces for recognition” experiment (1991). 54
Figure 5-1. Microsoft Visual Studio 2010 Environment. 56
Figure 5-2. C# code stages. 57
Figure 5-3. EmguCV layers. 58
Figure 5-4. Microsoft Access 2010. 59
Figure 5-5. Start Form. 60
Figure 5-6. Face Detection Form. 61
Figure 5-7. The scan window. 62
Figure 5-8. Detection rectangles overlapping. 62
Figure 5-9. Minimum Neighbors = 1. 63
Figure 5-10. Minimum Neighbors = 10. 63
Figure 5-11. Minimum Neighbors = 4. 64
Figure 5-12. Minimum Neighbors = 0. 64
Figure 5-13. Scale increase rate = 1.2. 65
Figure 5-14. Scale increase rate = 1.4. 65
Figure 5-15. The database used. 66
Figure 5-16. Face Recognition Form. 67
Figure 5-17. Threshold value = 500. 67
Figure 5-18. Threshold value = 3500. 68
Figure 5-19. Threshold value = 5000. 68
Figure 5-20. Normal light conditions. 69
Figure 5-21. Strong light on the right side of the face. 69
1 INTRODUCTION
1.1 Overview
Nowadays, biometrics plays a vital role in our everyday life. Since it is highly secure and convenient, our
society makes use of this technology almost everywhere, from airport surveillance to intelligent houses.
Compared to other biometric solutions, face recognition yields greater advantages because it does not
require the subject’s interaction or permission. From this point of view, it represents a fast and effective way to
increase our security level.
Automated facial recognition is a modern concept. It was born in the 1960s and it is still under constant
development today. In 2006, the Face Recognition Grand Challenge (FRGC) project evaluated the facial
recognition algorithms available at that time. Tests involved 3D scans, high quality images and iris
photographs. The FRGC proved that the algorithms available then were 10 times more precise than those of
2002 and 100 times more precise than those of 1995. Some recognition methods were able to outperform humans in
recognizing faces and could even distinguish between identical twins.
1.2 Motivation
I was fascinated by the fast pace at which facial recognition technologies have developed and the progress
they have made to reach their current level. In high school, when I first heard about this field, it seemed
extremely compelling, yet rather difficult to grasp as a whole. However, as the years went by, my
academic studies helped me see concepts like Artificial Intelligence in a new light. For this reason, over the
past year I wanted to take a closer look at how a face recognition system works and implement one myself.
I have no special talent. I am only passionately curious.
- Albert Einstein -
1.3 About the project
For my Final Year Project I implemented a C# software application which detects and recognizes human faces
in an input image or video frame from a live video source. I developed it in Microsoft Visual Studio 2010
using EmguCV (a cross platform .NET wrapper for the OpenCV image processing library) and a Microsoft
Access database file.
Despite the fact that face recognition builds upon face detection, they represent different concepts. This is
why I have chosen to divide my application into two modules, one for each problem. The system can detect,
count and extract faces in a given image or video frame. The program can also recognize people’s faces if they
are registered in the Microsoft Access database. The user can add new faces to the database together with their
names or delete existing faces.
The algorithms behind the software application are the following: Viola-Jones method for face detection and
Eigenfaces for face recognition.
1.4 Achievements
This report aims to present in detail how a face detection and recognition system can be built, to outline the
theoretical frameworks used and, last but not least, to describe the connections between such systems and
related fields, including Machine Learning and Computer Vision.
I claim to have succeeded in implementing a face recognition software application which can be used for
security purposes in constrained environments. For instance, one can install such a program in order to control
the access inside an office or household, without any concerns regarding identity theft.
1.5 Objectives
This project represents my first attempt to implement a facial detection and recognition system. In the near
future, I definitely intend to study other algorithms available today and improve my application. Furthermore,
I look forward to deepening my knowledge and potentially working in the field of Artificial Intelligence someday.
1.6 About the report
The remainder of my Final Year Project report is divided into 6 chapters. They are as follows:
Chapter 2 presents concepts such as Artificial Intelligence, Machine Learning and Computer Vision.
Chapter 3 covers the theory behind Viola-Jones method for Face Detection.
Chapter 4 explains the Eigenfaces algorithm for Face Recognition.
Chapter 5 describes implementation and testing details about the software application.
Chapter 6 concludes this report and highlights the advantages of the system.
2 BACKGROUND
2.1 Artificial Intelligence
Artificial intelligence (AI) can be defined as the intelligence exhibited by computers. An intelligent
agent perceives its surrounding environment and acts accordingly, even in unknown scenarios. It is a
flexible machine capable of performing “cognitive” tasks such as learning in order to solve different
problems. Some modern examples of artificially intelligent computers include systems that can play and win
chess games (“Deep Blue”) or even self-driving cars that can navigate through crowded roads.
Today, many goals have been reached in this scientific area: learning, planning, reasoning, communication,
perception and manipulation. However, Artificial General Intelligence (AGI) still remains a primary goal and a
topic for science-fiction writers and futurists all over the world.
Going back to the origins, the first functional calculating machine was constructed by the scientist Wilhelm
Schickard around 1623. In the 19th century, the mathematician and philosopher George Boole formulated the
“propositional calculus” (or “sentential logic”) and Gottlob Frege developed what is known today as
“first-order (predicate) logic”, both representations still being used nowadays.
Figure 2-1. Inference with the scope of propositional logic.
Success in creating Artificial Intelligence would be the
biggest event in human history.
-Stephen Hawking -
In 1936, Alan Turing introduced the mathematical model of “computation”, inventing an abstract machine
named after him. Nowadays, nearly all programming languages are Turing-complete, i.e. capable of simulating
a Turing machine. This invention inspired researchers all over the world to build an “electronic brain”.
Figure 2-2. The Turing machine structure.
The area of artificial intelligence research was officially founded at a conference on the campus of Dartmouth
College in 1956. At the beginning of the 21st century, AI reached its apogee and started to be used in the
industrial sector, in logistics and in medical diagnosis.
Today, we can proudly talk about IBM’s question-answering system (“Watson”), about the “Kinect”,
which provides a 3D body-motion interface for the Xbox One and Xbox 360, and, last but not least, about the
intelligent personal assistants in our smartphones. Although this discipline has evolved so much lately, still no
agent has managed to pass the Turing test formulated in 1950, exhibiting the same intelligence as a human being.
2.2 Computer Vision
Computer Vision is a scientific discipline that includes methods for acquiring, processing, analyzing and
finally interpreting images from the world we live in. This field is concerned with the theory of extracting
information from input images using models and learning algorithms. These days, we can enumerate
many subdomains related to computer vision, such as object recognition, image restoration, scene
reconstruction, event detection, video tracking or motion estimation.
Computer Vision is concerned with constructing systems that obtain data from a set of given images.
Understanding an image means transforming it into a “description”, using patterns constructed with the
aid of geometry, statistics, physics and machine learning theory.
An important part of artificial intelligence covers planning the mechanical movements of a robot through an
environment. In order to achieve this kind of task, the robot needs input data provided by a computer
vision module acting as its “eyes”, a bridge between its world and ours.
The input information can appear in many forms, such as video sequences, representations from multiple
cameras or even multi-dimensional data from a medical scanner. We could think of computer vision as
the reverse of computer graphics. While computer graphics creates image data from tridimensional
models, computer vision produces such models from the images received. There is also a tendency to
combine these two technologies, as explored in augmented reality.
A standard problem in computer vision is determining whether the input image contains a certain feature, object
or activity. Here we could divide computer vision into detection (the image in question is scanned for a
specific condition, for example scanning a tissue sample to search for abnormal cells), object classification or
recognition (programs such as “Blippar”, “LikeThat” and “Google Goggles”) and identification (a
specific instance of an object is recognized, such as a person’s fingerprint or face).
Among the numerous applications of computer vision we could mention industrial process control, quality
inspection in manufacturing, navigation of mobile robots, event detection for security purposes, object
modeling as in medical image analysis, human-machine interaction and missile guidance in military
applications.
2.3 Machine Learning (ML)
As we can imagine, the supreme goal in the Computer Vision field is to build machines that perfectly
emulate human vision, being able to take actions based only on visual input. However, making
decisions would be impossible without a learning technique. Machine Learning (ML) aims to transform
data into information. A system can learn from a dataset by extracting certain patterns and then be able to
answer questions about a new set of data. In 1959, Arthur Samuel defined machine learning as a
"field of study that gives computers the ability to learn without being explicitly programmed".
When it comes to binary decisions, one usually breaks an original dataset of, say, 10,000 faces into a large
set for training (for example 9,000 faces) and another one for testing (the remaining 1,000 faces). The
classifier runs over the first set, constructing its own model of what a face looks like. Then the classifier
is tested on the smaller dataset to see how well it performs. If the results are poor, we might consider
adding more features to our first dataset or even trying a different type of classifier. Sometimes,
jumping directly from training to testing is too hasty. Instead, we could split the
samples into three sets: 8,000 faces for learning, 1,000 for validation and the last 1,000 for the final test.
During the validation phase, we can “sneak a peek” at the results and gauge performance. Only when we are
completely satisfied with this middle stage should we run the classifier on the final test set.
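As an illustration, the 8,000/1,000/1,000 split described above can be sketched in a few lines. Python is used here purely for illustration; the project itself is written in C#.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the samples, then split them into training,
    validation and final-test subsets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# 10,000 face samples split into 8000 / 1000 / 1000, as in the text.
faces = list(range(10000))
train, val, test = split_dataset(faces)
print(len(train), len(val), len(test))  # 8000 1000 1000
```

Shuffling before splitting matters: if the samples are stored grouped by person or by lighting condition, an unshuffled split would give training and test sets with different distributions.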
2.3.1 Supervised and Unsupervised Dataset
In supervised learning, the input given to the machine has labels that accompany the data feature vectors. For
example, we could associate a name with a face (categorical data) or even a numerical value, such as the age of
a person. When the labels are categorical, the system performs what is known as
“classification”, whereas when they are numeric, the learner is doing “regression”.
Supervised learning can involve, as in the examples above, a one-to-one pairing of labels with data vectors, or it
can be “deferred/reinforcement learning”. In the latter case, the labels (called “punishment” or “reward”)
arrive only after the data vectors have been observed. The machine thus receives a delayed signal, from which it
infers a decision-making plan for future runs.
Conversely, we may attach no labels at all to the input data if we are interested in seeing how the data falls
naturally into categories. ML algorithms of this type are called “clustering algorithms”. The machine might,
for example, group the given faces into short, long, thin or wide faces.
These two forms of machine learning overlap with two frequent tasks in Computer Vision: recognition
(“what?”) and segmentation (“where?”). We need our computer both to identify the object in an image and to
estimate its position.
Since Computer Vision makes such great use of machine learning, OpenCV includes many machine learning
algorithms in its ML library, as we can see in the image below (Multilayer perceptron (MLP), Boosting,
Decision trees, Random trees, K-nearest neighbors, Normal Bayes, Support Vector Machine (SVM),
Mahalanobis, K-means, Face detector/Haar classifier). Although it is frequently used for vision tasks, the
OpenCV ML code is general-purpose.
Figure 2-3. Machine learning algorithms available in the OpenCV library.
2.3.2 Discriminative and Generative Models
OpenCV implements the most commonly used statistical approaches to machine learning. Probabilistic
approaches such as graphical models or Bayesian networks are still being improved.
OpenCV gives support for discriminative models, rather than generative ones. A discriminative algorithm will
give us the probability of the signal y (label) given the data x, so the machine will learn the conditional
probability distribution p(y|x) , whereas a generative model will help learn the joint probability distribution
p(x,y). To be more specific, p(y|x) helps us classify a given input x into a class y and p(x,y) could be used to
generate likely pairs (x,y).
Example 2–1. Given the pairs (1,0), (1,0), (3,0), (3,1), the conditional probability distribution p(y|x) and the
joint probability distribution p(x,y) are as follows:
Figure 2-4. The conditional probability distribution p(y|x) and the joint probability distribution p(x,y).
We can notice that in the first case the sum of the values at every row equals 1, whereas in the second case the
total sum equals 1.
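The two distributions of Example 2-1 can be computed directly from the four pairs. The following Python sketch (illustrative only, not part of the application) builds both tables and confirms the two normalization properties just stated:

```python
from collections import Counter

# The four observed (x, y) pairs from Example 2-1.
pairs = [(1, 0), (1, 0), (3, 0), (3, 1)]
n = len(pairs)

# Joint distribution p(x, y): each cell is a fraction of all samples.
joint = {pair: count / n for pair, count in Counter(pairs).items()}

# Conditional distribution p(y | x): normalize each row (fixed x) to 1.
x_counts = Counter(x for x, _ in pairs)
conditional = {(x, y): joint[(x, y)] * n / x_counts[x] for (x, y) in joint}

print(joint)        # {(1, 0): 0.5, (3, 0): 0.25, (3, 1): 0.25}
print(conditional)  # {(1, 0): 1.0, (3, 0): 0.5, (3, 1): 0.5}
```

As expected, the joint table sums to 1 overall, while each row of the conditional table (one row per value of x) sums to 1 on its own.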
Generative models are easier to interpret. If you imagine a car, you are generating data given the condition
“car”. By contrast, discriminative learning comes down to making decisions based on thresholds, which can
often be confusing. For instance, if we are trying to detect a car in a given image and the image passes a stage
of detection, this does not mean that there is certainly a car, but only that there is a “candidate” for a car. [1]
2.3.3 Machine Learning and Computer Vision
All the algorithms mentioned before take as input a vector of many features. When trying to detect whether a
certain object is present in an image, the first problem we encounter is how to gather training data divided into
negative and positive cases. Another issue is that objects may appear at different scales or in different
postures. You also need to define what you mean when you say that the wanted object is present in the image.
When collecting data, it is important to train your system under the same conditions in which it will later work:
same camera, same lighting conditions. In other words, any variation in the data must be taken into account
and captured.
There exist many ways to improve the system’s performance, such as subtracting the background and then
processing the image with normalization techniques (histogram equalization, rescaling, rotating).
The next step after collecting the data is to break it up into training, validation and testing sets.
Finally, choose an appropriate classifier depending on time, data, accuracy or memory constraints.
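As an illustrative sketch of one of the normalization techniques mentioned above, the following Python function applies the standard histogram equalization mapping to a flat list of 8-bit grayscale pixels (a textbook version; in practice a library routine would be used):

```python
def equalize_histogram(pixels, levels=256):
    """Histogram equalization for a flat list of 8-bit grayscale pixels:
    remap intensities via the cumulative distribution so that they
    spread over the full [0, levels-1] range."""
    n = len(pixels)
    # Histogram of intensity counts.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution function.
    cdf = []
    total = 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    # Standard equalization mapping.
    def remap(p):
        return round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1))
    return [remap(p) for p in pixels]

# A low-contrast patch: values huddled between 100 and 103.
patch = [100, 100, 101, 102, 102, 103]
print(equalize_histogram(patch))  # [0, 0, 64, 191, 191, 255]
```

The low-contrast patch is stretched over the whole intensity range, which is exactly why this step helps a classifier cope with varying lighting conditions.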
2.3.4 Variable Significance
Some algorithms let you assign more or less importance to each variable, since some features might matter
more than others for classification accuracy. For instance, binary decision trees select the feature that best
splits the data at each node: the top-node variable is the most important one, and the variables below it decrease
in importance. The biggest advantage of binary trees is that they reduce the number of features considered by
our classifier. The training starts with many variables and finds the importance of each variable relative to the
others. You can then eliminate the irrelevant features, improving speed.
2.3.5 Common Problems
Firstly, I would like to mention that “more data beats less data and better features beat better algorithms” [1].
It is important to maximize the independence between features and to minimize their variation under
different conditions.
Apart from this, we can enumerate some common problems. First of all, bias appears due to wrong
assumptions made in the learning phase, when the model does not fit the data well. Variance appears when the
model memorizes the training information, including its noise, so it will not generalize. High bias causes
underfitting and high variance causes overfitting.
A possible solution to bias is to collect more features or even try a more powerful algorithm. In the variance
case, we might need more training data and fewer features, or a less powerful algorithm.
Figure 2-4. Underfitting, best fitting and overfitting.
2.3.6 Cross-validation, Bootstrapping, ROC curves and Confusion matrices
Sometimes, in order to know whether your classifier operates well, running a validation test might not be
enough. In real life the classifier will meet noise and sampling errors, so the data distribution will not be the
same as in the tests. To get closer to the true behavior of our algorithm and improve its stability, there exist two
popular techniques: cross-validation and bootstrapping.
2.3.6.1 Cross-validation
Cross-validation involves dividing the data into K subsets, learning on K-1 of them and then testing on the
remaining one. The key is that, over the K iterations, each fold gets a turn at being the validation set:
Pseudocode
1. Distribute the n training samples randomly
2. Divide the samples into K chunks
3. For i = 1, ..., K
3.1 Train the classifier on all the samples that do not belong to subset i
3.2 Test the classifier on subset i
3.3 Compute the number of wrongly classified samples e_i
4. Calculate the classifier error as:
E = (1/n) · Σ_{i=1..K} e_i
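The pseudocode above can be turned into a small generic routine. This Python sketch is illustrative only; the toy sign-rule “training” function is a made-up stand-in for a real classifier:

```python
import random

def cross_validation_error(samples, train_fn, k, seed=0):
    """K-fold cross-validation.
    samples  -- list of (features, label) pairs
    train_fn -- trains on a list of samples, returns predict(features)
    Returns E = (1/n) * sum of the misclassification counts e_i,
    where each fold serves exactly once as the test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    # k folds of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    total_errors = 0
    for i in range(k):
        test_fold = folds[i]
        train_set = [s for j in range(k) if j != i for s in folds[j]]
        predict = train_fn(train_set)
        total_errors += sum(1 for x, y in test_fold if predict(x) != y)
    return total_errors / len(samples)

# Toy demo: the label is 1 exactly when x >= 0, and the "trained"
# classifier happens to implement that rule perfectly.
def train_sign_rule(train_set):
    return lambda x: 1 if x >= 0 else 0

data = [(x, 1 if x >= 0 else 0) for x in range(-50, 50)]
print(cross_validation_error(data, train_sign_rule, k=5))  # 0.0
```

A perfect classifier yields E = 0; plugging in a classifier that always answers 0 on this balanced dataset yields E = 0.5, the expected chance level.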
2.3.6.2 Bootstrapping
Bootstrapping is quite similar, except that the validation set is chosen at random from the training samples. The
points selected for a given round are used only for testing, not for training. The procedure then starts again
from scratch, repeated N times, each time randomly selecting a new chunk of validation data. It is easy to
notice that many data points are reused across different validation stages.
2.3.6.3 ROC curves
Another way to tune a classifier is to use the receiver operating characteristic (ROC) curve or the confusion
matrix. The ROC curve plots the classifier’s response over the entire range of settings of some performance
parameter. For example, suppose we are trying to recognize the color blue in an input
image. Obviously, we have a threshold that defines what counts as a blue color and what does not.
Setting the blue threshold too high might end in failing to recognize any shade of blue in the image,
yielding a false positive rate of 0, but at the cost of a true positive rate of 0 as well (lower left part of the graphic).
On the other hand, if we set the blue threshold to zero, every color in the image will be “blue” for our
detector, so we will have the maximum false positive rate (upper right area of the curve). The ideal ROC curve
is the one that goes along the y axis up to 100% and then cuts horizontally over to the upper right corner. The
ratio of the area under the curve to the total area is a figure of merit: the closer this ratio is to one, the better our
classifier will be.
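The blue-detector thought experiment can be made concrete by sweeping the threshold over a set of “blueness” scores and recording (FPR, TPR) at each setting. All scores and labels in this Python sketch are made up for illustration:

```python
def roc_points(scores, labels, thresholds):
    """For each threshold, a pixel is called 'blue' when its blueness
    score meets the threshold; return the (FPR, TPR) operating points."""
    points = []
    positives = sum(labels)
    negatives = len(labels) - positives
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical blueness scores (higher = more blue) and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   1,   0,   1,   0]
for fpr, tpr in roc_points(scores, labels, [0.0, 0.5, 1.0]):
    print(fpr, tpr)
# threshold 0.0 -> (1.0, 1.0)  everything is "blue"
# threshold 0.5 -> (0.0, 0.75) a useful middle operating point
# threshold 1.0 -> (0.0, 0.0)  nothing is "blue"
```

The two extreme thresholds reproduce exactly the two corners of the ROC curve described in the text.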
2.3.6.4 Confusion Matrices
The confusion matrix of the operating point OP illustrated in the image below is another way to assess the
performance of a classifier, being equivalent to the ROC curve at its left. The matrix of a perfect classifier
would have 100% along the principal diagonal and 0% elsewhere.
Figure 2-5. The ROC curve and the confusion matrix.
2.3.7 Binary Decision Trees
Binary decision trees are widely used in the OpenCV machine learning library. Their inventor, Leo Breiman,
named them “classification and regression tree” (CART) algorithms.
The essence of this method is to define an impurity metric relative to the data in every node of the tree. For
instance, when regression is used to fit a function, the sum of squared distances between the true values
and the predicted values is used. The basic idea is to minimize this sum of differences, known as “the impurity”,
in each and every node of the tree.
When dealing with categorical labels, we define a measure which is minimal when most of the values inside a
node belong to the same class. There are three standard measures: Misclassification, Gini index and Entropy.
Once we have chosen a metric, the binary tree algorithm searches through the feature vector to find
which feature, together with which threshold, best purifies the data. By convention, feature values below the
threshold are branched to the right and values above the threshold are branched to the left. In this way, we
recursively follow down each branch of the tree until the data is pure enough for our needs. The impurity
measure i(N) is given below [1]:
2.3.7.1 First case - Regression Impurity
We try to minimize the squared distance between the node value y and each data value x_k:
i(N) = Σ_k (y − x_k)²     (2-1)
2.3.7.2 Second case - Classification Impurity
There are three common measures of impurity. P(ω_k) denotes the fraction of patterns at node
N that belong to class ω_k.
Misclassification impurity
i(N) = 1 − max_k P(ω_k)     (2-2)
Entropy impurity
i(N) = − Σ_k P(ω_k) log P(ω_k)     (2-3)
Gini index
i(N) = Σ_{k ≠ p} P(ω_k) P(ω_p)     (2-4)
Figure 2-6. Decision tree impurity measures.
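To get a feel for the three measures, the following Python sketch (illustrative only) evaluates each impurity on a perfectly pure node and on a perfectly mixed two-class node:

```python
from math import log2

def misclassification(p):
    """1 - max_k P(w_k): fraction of samples a majority vote misses."""
    return 1 - max(p)

def entropy(p):
    """-sum_k P(w_k) * log2 P(w_k); 0 log 0 is taken as 0."""
    return sum(-q * log2(q) for q in p if q > 0)

def gini(p):
    """Sum of P(w_k) * P(w_p) over distinct pairs k != p,
    equivalently 1 - sum of the squared class fractions."""
    return 1 - sum(q * q for q in p)

# A pure node (all samples in one class) vs. a perfectly mixed node.
pure, mixed = [1.0, 0.0], [0.5, 0.5]
for measure in (misclassification, entropy, gini):
    print(measure.__name__, measure(pure), measure(mixed))
```

All three measures vanish on the pure node and peak on the mixed node, which is exactly the property the split-selection step relies on; they differ only in how sharply they penalize partial mixing.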
In classification, decision trees are probably the most widely used method due to their straightforward
implementation. Moreover, they are simple to interpret and flexible when it comes to working with different
data types. Decision trees also form the basis of other algorithms such as random trees and boosting.
2.3.8 Boosting
Although decision trees are effective, they are not the best-performing classifiers. The boosting technique
inherits a lot from decision trees, using them in its inner loop. Note that boosting uses fewer decision
variables than a full decision tree, so it saves memory and reduces computation cost.
Within the category of supervised classification methods available in the OpenCV library there is a meta-
learning algorithm named “statistical boosting”. It was first described by the computer scientist
Michael Kearns in 1988, who wondered whether it is possible to create a strong classifier out of many
weak ones.
AdaBoost, the first boosting algorithm, was formulated shortly afterwards by Robert Schapire and Yoav
Freund, who won the Gödel Prize in 2003 for their work.
3 FACE DETECTION
3.1 The Viola-Jones Detection Framework - Theory
This chapter presents the Viola-Jones face detection method. The Viola-Jones detection framework was
proposed for the first time at a computer vision conference in 2001 by Paul Viola and Michael Jones.
Their approach outclassed any existing face detector at that moment. Although it can be trained to
identify many types of rigid objects, it is mostly used for face detection.
Viola and Jones claim that, when it comes to face detection, their algorithm yields detection rates comparable
to those of previous approaches. Used in real-time situations, however, their detector is capable of running at 15
frames per second without resorting to techniques such as image differencing or skin-color detection. Moreover,
adding these alternative sources of information can result in even higher frame rates. [2]
This detector is based on three strong concepts. The first one is known as the “Integral Image”. It allows the
features used by this detector to be computed very fast. The second one is a machine learning algorithm,
“Adaboost”, which selects only the important features from a larger dataset. The third concept is the
construction of a “cascade” structure by combining complex classifiers, which will reject background regions
of the input image while spending more computation time on the areas that might contain the object of our
interest.
The algorithm works best on frontal views and not so well on side views, because side views bring variations
into the template that the Haar features (mouth, eyes, hairline) used in the detector cannot handle well. For
instance, the side view of an object inevitably catches part of the changing scene behind the object’s profile,
so the classifier is forced to learn the background variability at the edge of the side view.
The first requirement before starting to train our system is to collect data appropriate for our situation.
“Good” data means data cleanly divided into categories; for example, we should not mix tilted objects with
upright objects. “Well-segmented” data is also vital, meaning that the objects are consistently boxed [1]. For
instance, varying placement of the eye locations inside the face box can make the classifier assume that the
eyes are not fixed and can move around, so the pictures should be normalized and the eyes aligned as much as
possible. Performance drops dramatically when a classifier tries to correct for unreal variability in the data.
Mathematics is the language in which God has written
the universe.
- Galileo Galilei -
3.1.1 The Haar-type Features
The facial detection algorithm will hunt for specific features that are common to every human face. These
“features” are basically black and white rectangles, as those illustrated below. For example, in the first image,
the white rectangle represents the lighted area of the cheeks, whereas the black one, the shadow of the eyes. In
the second image, the white area corresponds to the bridge of the nose which is brighter than the region of the
eyes.
Figure 3-1. Some Haar-type features.
Before starting to look for these matches in a given image, we need to convert it to grayscale, so that every
pixel has a single value between 0 (black) and 255 (white), depending on its intensity. In order to convert an
RGB digital image to grayscale, the (R,G,B) triplet is mapped to a single value using different methods, such
as the lightness method, the average method or the luminosity method. The last one, used (with slightly
different weights) by OpenCV, is closer to the human perception of colors. Because humans are more
sensitive to green, this channel weighs more, as we can see in the formula below:
Y = 0.2126·R + 0.7152·G + 0.0722·B (3-1)
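As a quick illustration, formula (3-1) can be applied per pixel with NumPy. This is only a sketch of the luminosity method as stated above, not the thesis implementation (the exact coefficients a given library uses may differ):

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image (uint8) to grayscale
    using the luminosity weights of formula (3-1)."""
    rgb = rgb.astype(np.float64)
    y = 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]
    return np.clip(np.round(y), 0, 255).astype(np.uint8)

# A 1x2 image: pure green maps to a brighter gray than pure red,
# because the green channel weighs more.
img = np.array([[[0, 255, 0], [255, 0, 0]]], dtype=np.uint8)
print(to_grayscale(img))
```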
The following step is to calculate the sum of all pixel values under the black area and the sum of all pixel
values under the white area. Then, the second sum is subtracted from the first one and, if the result falls within
a specified threshold, we can affirm that the feature is present in our image.
3.1.2 The Integral Image
Frank Crow introduced the "summed area table" to computer graphics in 1984, and John Lewis later used
this concept in computer vision. In 2001, Paul Viola and Michael Jones brought the equivalent term
"integral image" into their object detection framework; it refers to a fast and efficient method to calculate
the sum of pixel values in any rectangular zone of a given image.
Obtaining the Integral Image
The value of any point (m,n) in the integral image equals the sum of the values above it and to its left,
including the pixel itself, as we can see in the formula below:
I(m,n) = Σ_{m'≤m} Σ_{n'≤n} i(m',n') (3-2)
Example 3–1. Given the pixel values of an image we can calculate the integral image as follows:
Figure 3-2. The input image and the integral image.
We infer from the general formula (3-2) that the integral image can be calculated in a single pass over the
initial image, using the following relation:
I(m,n) = i(m,n) + I(m−1,n) + I(m,n−1) − I(m−1,n−1) (3-3)
Once we have the integral image computed, we can easily calculate the sum of pixel values in any rectangular
zone of our input image, using only four references from the summed area table, E(m0,n0), F(m1,n0),
G(m0,n1), H(m1,n1), and the formula below:
Σ_{m0<m≤m1, n0<n≤n1} i(m,n) = I(E) + I(H) − I(F) − I(G) (3-4)
In Example 3–1, for instance, I(2,2) = i(1,1) + i(1,2) + i(2,1) + i(2,2) = 1+1+1+1 = 4.
Figure 3-3. The four references needed in the integral image.
Example 3–2. Determine the sum of the pixel values within the gray area in our input image, using
the integral image method:
Figure 3-4. The integral image and the input image.
Using the integral image method (3-4), the sum of the highlighted area would be:
S = I(E) + I(H) − I(F) − I(G) = 12 + 57 − 22 − 32 = 15 (3-5)
Now, going back to our input image, we can check the result. The sum of pixel values equals:
S = 3+9+2+1=15 (3-6)
The results (3-5) and (3-6) are indeed equivalent.
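The single-pass construction and the four-reference lookup can be sketched in a few lines of NumPy. This is an illustrative example with made-up pixel values (not the data of the figures above):

```python
import numpy as np

def integral_image(img):
    """Summed area table: I(m,n) = sum of i(m',n') for m' <= m, n' <= n,
    built in a single pass with cumulative sums (relation 3-3)."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, top, left, bottom, right):
    """Sum of pixels in the rectangle [top..bottom] x [left..right]
    (inclusive), using only the four references of formula (3-4)."""
    total = I[bottom, right]                  # reference H
    if top > 0:
        total -= I[top - 1, right]            # reference F
    if left > 0:
        total -= I[bottom, left - 1]          # reference G
    if top > 0 and left > 0:
        total += I[top - 1, left - 1]         # reference E
    return total

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
I = integral_image(img)
print(rect_sum(I, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

The bottom-right entry of the table always equals the sum of the whole image, which is a handy sanity check.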
3.1.2.1 Using the Integral Image for the Viola-Jones detector
When working with an algorithm in real-time applications, we should remember that efficiency is a crucial
aspect. Summing up all the pixels under every Haar feature, for all the possible combinations in a given
image, is not a quick task. So, whenever we need to calculate a Haar feature value we won't just sum up
all the values inside that area; we will use only four memory lookups and formula (3-4). As a consequence,
this method provides a constant calculation time, independent of the size of the queried rectangle: after a
single pass to build the integral image, each rectangular sum costs O(1) instead of O(n²).
The examples below illustrate how many references are needed to calculate the corresponding value for each
type of Haar feature:
Figure 3-5. Edge feature – 6 references.
∑i = (A+E-D-B)-(F+B-E-C) = A+2E-D-2B-F+C (3-7)
Figure 3-6. Line feature – 8 references.
∑i = (E+H-F-G)+(D+A-B-C)-(F+C-E-D) = 2E+H-2F-G+2D+A-B-2C (3-8)
Figure 3-7. Four-rectangle feature – 9 references.
∑i = (H+D-G-E)+(F+B-E-C)-(I+E-H-F)-(E+A-D-B) = 2H+2D-G-4E+2F+2B-C-I-A (3-9)
3.1.3 Adaboost
The term Adaboost comes from “Adapting Boosting” and stands for a machine learning meta-algorithm
created my Robert Schapire and Yoav Freund in 2003. Adaboost can be used together with other types of
learning algorithms in order to improve their results. The main idea is to combine the output of some weak
classifiers (learners) into a weighted sum, in this way creating a strong final classifier, which error rate tends
exponentially to zero.
Moreover, if we consider all the possible combinations of the previously presented Haar features (different
orientations and different sizes) that can be applied in a 24x24 scan window, we end up with more
than 180.000 features [3]. This large number is a real disadvantage in terms of efficiency, and besides, there
are linear dependencies between those features. So, the main challenge is to find only those critical features
which, combined together, will form an effective classifier. This is another problem solved by Adaboost: it
eliminates all the redundant features, keeping only the critical ones.
3.1.3.1 Training weak classifiers
A weak classifier is a classifier that performs only slightly better than chance, having an error rate just below
50%. All the Haar-type features can be seen as weak classifiers (the simple rectangles have an error rate
between 10-30%, whereas the complex ones between 40-50%) [4]. Usually these classifiers are decision trees
with a single split ("decision stumps") or at most three splits. Each of these classifiers is assigned a weighted
vote in the final decision.
The weak learning algorithm is designed to choose only the rectangle features that best separate the positive
and negative samples. For each Haar feature, the weak learner calculates the optimal threshold classification
function that minimizes the number of misclassified samples. Therefore, a weak classifier consists of a Haar
feature f_j, a threshold value t_j and a parity p_j (±1), as we can see in the formula below:
h_j = { +1, if p_j·f_j < p_j·t_j
      { −1, otherwise
(3-10)
Creating a strong classifier
We have as input a dataset of images X = [x1, x2, ... xi, ... xn] together with a vector of scalar labels
Y = [y1, y2, ... yi, ... yn], where yi = ±1, depending on whether the image xi is a positive or a negative
example. In the following example, I will present the training process using an input consisting of 5 positive
samples and 5 negative samples. Below, I illustrate the procedure in three steps, although in practice one
trains as many weak classifiers as needed until the wanted performance is obtained on a validation dataset.
Step 1:
At the beginning, all the samples in the original data set have a uniform weight, meaning that the detector
will focus equally on each and every data point in our set:
Figure 3-8. DS1(original dataset), C1(trained classifier 1), C1’s mistakes.
Step 2:
At the second step, we modify the previous dataset, assigning different weights to the points. For instance,
the points gotten wrong at the previous step will weigh more this time, meaning that the classifier will pay
more attention to these particular samples:
Figure 3-9. DS2 (weighted dataset 2), C2 (trained classifier 2), C2’s mistakes.
Step 3:
At this phase, we again increase the weights of the samples gotten wrong earlier and decrease the weights
of those gotten right. So, classifier number 3 will focus on the mistakes made by classifier number 2:
Figure 3-10. DS3 (weighted dataset 3), C3 (trained classifier 3)
Final step:
At the end, the strong classifier is a linear combination of all the weak classifiers trained previously:
Figure 3-11. The final classifier.
Pseudocode for training a strong classifier
1. Given the pairs (x1,y1), (x2,y2), ... (xn,yn), where yi ∈ {±1}.
2. Initialize the weights:
   w_{1,i} = 1/(2m), for yi = −1, where m is the number of negative samples
   w_{1,i} = 1/(2p), for yi = +1, where p is the number of positive samples
3. For t = 1, 2, ... T:
   3.1 Normalize the weights:
       w_{t,i} ← w_{t,i} / Σ_{j=1..n} w_{t,j}
   3.2 Choose the classifier h_t with the lowest error rate ε_t:
       ε_t = min_j Σ_i w_{t,i}·[h_j(x_i) ≠ y_i]
   3.3 Update the weights:
       w_{t+1,i} = w_{t,i}·β_t^(1−e_i), where e_i = 0/1 if x_i was correctly/wrongly classified
       and β_t = ε_t / (1 − ε_t)
4. Calculate the final classifier as:
   h(x) = sign( Σ_{t=1..T} α_t·h_t(x) ), where α_t = ½·log(1/β_t) = ½·log((1 − ε_t)/ε_t)
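The training loop above can be sketched with NumPy, using one-dimensional decision stumps as weak learners on a synthetic toy dataset. This is only an illustration of the same β_t and weight-update rules, not the Haar-feature training used by the detector:

```python
import numpy as np

def stump_predict(x, t, p):
    """A one-split decision stump: h(x) = p if x >= t else -p."""
    return np.where(x >= t, p, -p)

def best_stump(x, y, w):
    """Step 3.2: pick the threshold/parity pair with the lowest weighted error."""
    best_err, best_t, best_p = np.inf, 0.0, 1
    for t in x:
        for p in (1, -1):
            err = np.sum(w[stump_predict(x, t, p) != y])
            if err < best_err:
                best_err, best_t, best_p = err, t, p
    return best_err, best_t, best_p

def adaboost(x, y, T=5):
    w = np.full(len(x), 1.0 / len(x))      # uniform initial weights
    classifiers = []
    for _ in range(T):
        w = w / w.sum()                    # step 3.1: normalize
        eps, t, p = best_stump(x, y, w)    # step 3.2
        eps = max(eps, 1e-10)              # guard log(0) for a perfect stump
        beta = eps / (1 - eps)
        correct = stump_predict(x, t, p) == y
        w = w * beta ** correct            # step 3.3: shrink correct weights
        classifiers.append((t, p, 0.5 * np.log(1 / beta)))
    return classifiers

def strong_classify(classifiers, x):
    """Step 4: sign of the alpha-weighted vote of the weak classifiers."""
    votes = sum(a * stump_predict(x, t, p) for t, p, a in classifiers)
    return np.sign(votes)

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, -1, 1, 1, 1])
print(strong_classify(adaboost(x, y), x))  # [-1. -1. -1.  1.  1.  1.]
```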
3.1.3.2 Interesting Observations
a) The next weight w_{t+1,i} will remain equal to the previous weight w_{t,i} if x_i was misclassified, or will
decrease by the factor β_t if x_i was correctly classified.
At step 3.3 in the pseudocode, we have:
Right prediction case:
w_{t+1,i} = w^norm_{t,i} · β_t^(1−0) = β_t · w^norm_{t,i} (3-11)
Wrong prediction case:
w_{t+1,i} = w^norm_{t,i} · β_t^(1−1) = w^norm_{t,i} (3-12)
b) The new generation of weights is just a scaled version of the old generation.
If z represents the normalizing factor of the next generation of weights, then:
w^norm_{t+1,i} = (1/z) · w_{t+1,i} (3-13)
We know that the normalized weights sum up to 1:
Σ_i w^norm_{t+1,i} = 1 (3-14)
(3-13), (3-14) => (1/z) · Σ_i w_{t+1,i} = 1 => z = Σ_i w_{t+1,i}
We can split the sum above into the sum of the weights for the correct predictions and the sum of the
weights for the wrong ones, as follows:
(1/z) · ( Σ_{correct,i} w_{t+1,i} + Σ_{wrong,i} w_{t+1,i} ) = 1 (3-15)
Using the last relation (3-15), we infer that:
z = Σ_{correct,i} w_{t+1,i} + Σ_{wrong,i} w_{t+1,i} = [by (3-11),(3-12)] = β_t · Σ_{correct,i} w^norm_{t,i} + Σ_{wrong,i} w^norm_{t,i} (3-16)
The weights in the wrong cases summed up equal the error rate, as we can see in the Adaboost
pseudocode (step 3.2). So:
Σ_{wrong,i} w^norm_{t,i} = ε_t (3-17)
Σ_{correct,i} w^norm_{t,i} = 1 − ε_t (3-18)
Replacing (3-17) and (3-18) in (3-16), we obtain:
z = β_t · (1 − ε_t) + ε_t = (ε_t / (1 − ε_t)) · (1 − ε_t) + ε_t = 2·ε_t (3-19)
Summing up the new generation of weights for both cases:
Σ_{wrong,i} w^norm_{t+1,i} = [by (3-12)] = (1/z) · Σ_{wrong,i} w^norm_{t,i} = ε_t / (2·ε_t) = 1/2
Σ_{correct,i} w^norm_{t+1,i} = [by (3-11)] = (1/z) · β_t · Σ_{correct,i} w^norm_{t,i} = (1/(2·ε_t)) · (ε_t/(1 − ε_t)) · (1 − ε_t) = 1/2
This result leads us to a great conclusion in terms of computation: to obtain the new generation of weights, all
we need to do is scale the current weights so that the correctly classified samples and the wrongly classified
samples each sum up to ½.
3.1.4 Cascade Filter
The Viola-Jones detector uses the Adaboost technique, but organizes the classifiers as a rejection cascade of
nodes. "Cascading" means that, at every node, a candidate classified as "not in class" instantly terminates
the computation. Only a candidate that makes it through the entire cascade is classified as a face. In this
way, the computational cost is significantly reduced, because most of the areas that do not contain the object
of interest are rejected at an early stage of the cascade.
When it comes to face detection, a scan window translates all over the input image, at different scales,
looking for a face. Every time this window shifts, the new area within its boundaries goes through the
cascade classifier stage by stage. If the current region fails to pass the threshold of a stage, the classifier
instantly rejects that area, and no further tests are applied to it. On the other hand, if an area successfully
passes all the stages, there might be a face in it.
Figure 3-12. Cascade classifier.
Each stage represents a multi-tree Adaboost classifier tuned to a high detection rate (few missed faces)
at the cost of many false positives. This means that almost 99.9% of the faces are found at each node, but also
that about 50% of the non-faces are wrongly accepted. Even so, a strong classifier formed out of 20 such nodes
results in a detection rate of 98% with a false positive rate of 0.0001%. Furthermore, when the detection
window is swept over the test image at different scales, 70-80% of the non-faces are eliminated in the
first two nodes (which use about ten decision stumps) [1].
The learning process can take from several hours to a day, even on a fast machine, depending on the size of
the data set. After training, all the detector's information is saved to an .xml file, like the one below. It usually
contains from 20 to 30 stages: the first stage, stage 0, is a superficial scan through the image and the following
ones, as I already mentioned, become more and more detailed. The more complex examples that make it
further through the cascade push the ROC characteristic curve downward. New stages are added until the
overall target for detection rate and false positives is reached.
Figure 3-13. "haarcascade_frontalface_default", available in the EmguCV library.
As a matter of fact, this type of cascade training comes with a compromise: classifiers that contain more
features return higher detection rates and lower false positive rates, but they also need more time to compute.
Unfortunately, finding the balance is close to an art and represents the real problem for the engineer.
3.2 The Viola-Jones Detection Framework – Experimental Results
Viola and Jones described the performance of their detector in the article “Rapid Object Detection using a
Boosted Cascade of Simple Features”, presented at a conference on computer vision and pattern recognition in
2001 [2].
The complete detection cascade has 38 stages and more than 6000 features. However, on a dataset with 507
faces and 75 million sub-windows, an average of only about 10 features is evaluated per scan window.
The training set used by Viola and Jones consisted of 4916 labeled faces at a resolution of 24x24 pixels. The
non-face sub-windows came from 9544 images. Each classifier in the cascade was learned using the set of
labeled faces and another 10.000 non-face sub-windows.
On a 700 MHz Pentium III processor, the detector processed a 384 by 288 pixel image in about 0.067
seconds.
The detector was tested on a set of 130 pictures from the real world with 507 labeled frontal faces. Below we
can see the ROC curve that illustrates the performance achieved by the detector. It was run with a step size of
1.0 and a starting scale of 1.0.
Figure 3-14. ROC characteristic.
Viola and Jones affirm that when it comes to face detection, their algorithm has detection rates comparable to
the previous algorithms. But used in real-time situations, their detector can run at 15 frames per second without
resorting to techniques such as image differencing or skin color detection.
4 FACE RECOGNITION
4.1 Face Recognition Overview
A facial recognition application can identify a human being in a given digital image or video frame.
Such systems are mostly used in security areas together with other biometric authentication
technologies.
Facial recognition methods can be divided into "geometric" and "photometric" procedures. The first
category is based on extracting specific features from an image of a person: for instance, the system may
analyze the relative position, dimension and shape of the eyes, mouth, nose, cheekbones and so on. The
second category is more of a statistical approach. It uses a database of images from which the face data is
extracted, normalized and compressed; the test image is then quantified in terms of that data.
Among the most popular recognition methods I would mention the Eigenfaces method, the Fisherfaces
algorithm, Linear Discriminant Analysis, the Hidden Markov model, Multilinear Subspace Learning and
Dynamic Link Matching.
The latest trend in facial recognition is three-dimensional face recognition. This method uses 3D
cameras to capture data about someone's face. This technology achieves better results than the classical 2D
recognition because it is not sensitive to light changes, different facial expressions or make-up, and it can
even identify profile views. For example, Microsoft's Kinect sensor for the Xbox 360 video game console
implements this technology. It works by projecting "structured light" onto the subject, as in the image
below; the system then infers depth information from how the projected pattern is deformed. Lately,
engineers have tried to create an even more powerful system by combining three cameras pointing at
different angles, so that a moving person can be tracked and recognized with high precision.
Creativity is just connecting things.
- Steve Jobs -
Figure 4-1 Structured light.
Figure 4-2. XBOX 360.
Another interesting approach in facial recognition is skin texture recognition. The skin texture is captured
into what is called a "skinprint". This patch is then divided into smaller blocks and, through specific
algorithms, the skin is turned into a mathematical space that can be measured.
Thermal cameras also deserve a mention in this chapter. They are a great tool for detecting and
identifying people 24 hours a day, 7 days a week. These cameras create representations based on the heat that
any object radiates. A thermal camera is immune to light changes and performs best indoors. However, a
drawback of this technology is the limited availability of thermal picture databases.
Compared to other biometric techniques, face recognition is not the most reliable and practical method. On
the other hand, it does not require the subject's permission to be tested. Thus, face recognition systems are
installed in public places everywhere, such as airports, where mass identification is necessary in order to
prevent terrorist attacks.
Lately, some people have complained about their privacy rights and civil liberties. They claim this kind of
surveillance can be used not only to identify a subject, but also to reveal personal data, such as social
networking profiles. For instance, Facebook's DeepFace is considered to have violated the Biometric
Information Privacy Act. Facebook used the world's biggest photo library to create a deep-learning face-
recognition system trained on more than 4 million images uploaded by its users. The system has 97%
accuracy, whereas the FBI's Next Generation Identification system reaches only 85%. The Huffington Post
described the technology as "creepy" and announced that some European governments had already forced
Facebook to delete its face-recognition database [5].
4.2 The Eigenfaces Recognizer - Theory
The first face recognition system was developed by Woody Bledsoe, Helen Chan Wolf and Charles
Bisson in the 1960s. They created a semi-automated program that needed an administrator to locate
characteristic features in a given image, such as the eyes, nose, mouth or ears. Their system computed
relative distances between these features and created a list of specific ratios for each subject in the
database. However, such an approach has proven quite fragile over the years. The Eigenfaces procedure was
first introduced in 1987 by Kirby and Sirovich and later developed by Matthew Turk and Alex
Pentland in 1991. The term "eigen" refers to a set of eigenvectors, also known in linear algebra as
"characteristic/proper vectors". The main advantage of this method is that we can represent a set of
images using a base formed of "eigen" pictures whose dimension is a lot smaller than the original set.
Identification can be achieved by comparing two images, both represented in the eigen base of the training
set.
The Eigenfaces approach started with the need to find a low-dimensional representation of face images.
Kirby and Sirovich demonstrated that Principal Component Analysis (PCA) can be used on a
group of face images to form a set of basic features. This set is known as "eigenpictures" and can be used
to reconstruct the original collection of images: each original face can be rebuilt as a linear
combination of the base set. We want to extract only the critical information in a test image and encode it
as effectively as possible, then compare the selected information with a database of models encoded in
the same manner. Face photographs are projected onto a feature space that best illustrates the variation
among known/learned face images. This feature space is defined by the eigenvectors/eigenfaces, and the
vector of weights expresses the contribution of each eigenface to the input image.
These results were expanded and improved by two computer scientists who found an efficient way
to compute the eigenvectors of a covariance matrix. Initially, a face image would occupy a high-
dimensional space and the PCA method could not be applied to large data sets. But Matthew Turk and
Alex Pentland discovered a way to extract the eigenvectors based on the number of input images, rather
than the number of pixels.
After performing the eigendecomposition on a set of given photos, one obtains through statistical
analysis the specific "ingredients" that represent our data set. The features that the original collection of
images has in common are found in what is called an "average/mean face". On the other hand, the
differences between the images appear in the eigenfaces. Furthermore, we can reverse the process
and reconstruct any initial image from the eigenfaces together with the average image. In this way, every
face can be stored as a list of values (each corresponding to an eigenpicture in the database), instead of
a digital photograph, saving memory space.
The eigen technique is also used in other types of recognition: medical imaging, voice recognition,
gesture interpretation, lip reading, handwriting analysis. For this reason, some prefer the term
"eigenimage" instead of "eigenface", although they basically refer to the same thing.
Figure 4-3. Example of eigenfaces.
4.2.1 Steps to obtain the eigenfaces
1. When collecting training images one should keep in mind a couple of general rules. Firstly, the
photographs must all be taken under the same lighting conditions and then normalized so that the
mouths and the eyes are aligned across all images. Secondly, they must also have the same
resolution (r x c). Each picture is treated as a vector with r x c elements after concatenating its
rows of pixels. The entire training set is stored in a single matrix T in which each column
represents a different input image.
2. Calculate the average image M and subtract it from each original picture in the database:
M = (1/d)·(im_1 + im_2 + ... + im_d), where d is the number of images
3. Determine the eigenvectors and eigenvalues of the covariance matrix C of the probability distribution
over the high-dimensional vector space of a face image. Each resulting eigenvector/eigenface has the
same size as the original images and therefore can be seen as an image itself. The eigenfaces are
basically directions in which each input image differs from the mean.
4. Select from the eigenvectors only the principal components: sort the eigenvalues in decreasing order
and organize the eigenvectors accordingly. The number of principal components k is obtained by
setting a threshold value ε on the total variance V = n·(λ_1 + λ_2 + ... + λ_n), where n is the number
of input images. The value k is the smallest number which satisfies:
n·(λ_1 + λ_2 + ... + λ_k) ≥ ε·V
Now, these eigenimages help us represent not only the faces in the database, but also new faces. When
projecting a new (mean-subtracted) face onto the eigenface space, we actually see how the new face differs
from the mean image. The characteristic values associated with the eigenfaces tell us how much the images
in the training set differ from the average image in each direction. Even though we lose information by
projecting the images onto another subspace, we minimize this loss by picking only those eigenimages with
the largest eigenvalues. As a consequence, we keep only the striking differences from the mean image. For
instance, at a resolution of 100 x 100 pixels, 10.000 eigenvectors could be obtained from the images; in fact,
only 100-150 vectors are needed [6].
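The steps above, including the Turk-Pentland trick of diagonalizing the small d x d matrix instead of the huge pixel-space covariance matrix, can be sketched with NumPy. Random data stands in for the aligned grayscale faces, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, d = 20, 20, 8                  # image size and number of images
T = rng.random((r * c, d))           # stand-in for d flattened face images

M = T.mean(axis=1, keepdims=True)    # step 2: the average face
A = T - M                            # mean-subtracted image columns

# Turk-Pentland trick: eigendecompose the small d x d matrix A^T A
# instead of the huge (r*c) x (r*c) covariance matrix A A^T.
vals, vecs = np.linalg.eigh(A.T @ A)
order = np.argsort(vals)[::-1]       # step 4: decreasing eigenvalues
vals, vecs = vals[order], vecs[:, order]

k = int(np.sum(vals > 1e-8))         # drop the zero mode from mean subtraction
eigenfaces = A @ vecs[:, :k]         # if v is an eigenvector of A^T A,
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)  # then A v is one of A A^T

weights = eigenfaces.T @ A[:, 0]     # project the first face onto the basis
print(np.allclose(eigenfaces @ weights, A[:, 0]))  # True
```

With all k non-zero components kept, the projection reconstructs the training image exactly; keeping fewer components trades reconstruction error for compactness, as described above.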
4.2.2 Preprocessing the images
Both the training images and the test images must be converted to grayscale, histogram-equalized and have
their background removed. In section 3.1.1 (The Haar-type Features) I already explained what a grayscale
image is. Now, I am going to present the histogram equalization concept:
Figure 4-4. Histogram equalization.
In image processing, the term "histogram of an image" usually refers to a graphical representation of the
intensity distribution of that image. The histogram of an image can be obtained by plotting pixel intensity
versus pixel frequency. In the case of an 8-bit grayscale image there are 256 shades of gray, from 0 to 255,
so 256 possible values for the intensity. Therefore, the histogram graphically displays 256 numbers showing
the distribution of pixels amongst these values.
Histogram equalization is a procedure for adjusting image intensities in order to increase contrast. It implies
remapping the initial histogram to another distribution (a wider and more uniform one). As we can see in
figure 4-4, the effect of histogram equalization is to stretch out the range of intensities. [7]
Example 4–1. I will explain the histogram equalization process on a digital image represented by a matrix of
pixels, like the one below, where each value is the intensity of the pixel found in that corresponding position:
3 2 4 5
7 7 8 2
3 1 2 3
5 4 6 7
Table 4-1. Matrix of pixel values.
We can notice that the intensities in our image vary between 1 and 8. I mentioned previously that the main
idea behind this procedure is to widen the range of intensities; for example, we could scale them to a range of
1-20. The first step is to count how many pixels have each intensity value. The second step is to calculate
the probability of each pixel intensity in the given image. Using the resulting values, we then calculate the
cumulative probability. Since we want to change the intensity range from 1-8 to 1-20, we multiply the
cumulative probability by 20. Finally, the resulting values are floor-rounded.
Table 4-2. Histogram equalization steps.
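The steps of the table can be reproduced in a few lines of NumPy, applied to the image of Table 4-1 (a sketch of the worked example, not a general equalization routine):

```python
import numpy as np

# The 4 x 4 image from Table 4-1, with intensities in the range 1-8.
img = np.array([[3, 2, 4, 5],
                [7, 7, 8, 2],
                [3, 1, 2, 3],
                [5, 4, 6, 7]])

levels = np.arange(1, 11)                      # intensities 1..10
counts = np.array([(img == v).sum() for v in levels])
prob = counts / img.size                       # probability of each intensity
cdf = np.cumsum(prob)                          # cumulative probability
new_levels = np.floor(cdf * 20).astype(int)    # remap to the range 1-20

print(counts[:8])      # [1 3 3 2 2 1 3 1]
print(new_levels[:8])  # [ 1  5  8 11 13 15 18 20]
```

The two printed rows match the "Number of pixels" and "Floor rounding" rows of Table 4-2.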
4.2.3 The Covariance Matrix
4.2.3.1 Standard Deviation (σ)
The Standard Deviation (σ) defines how spread out numbers are in a certain dataset. A low value
indicates that the samples tend to be close to the expected value (the mean), while a high value indicates
that the samples are dispersed over a wider range of values.
σ = √V (4-1)
4.2.3.2 Variance (V)
The variance is a concept used in statistics and probability theory and it measures the squared deviation
of a random variable from its mean value. For a data set X of equally likely values x_i, the variance can be
calculated as:
V(X) = (1/n) · Σ_{i=1..n} (x_i − x̄)² (4-2)
The mean value or the expected value is:
x̄ = (1/n) · Σ_{i=1..n} x_i (4-3)
Pixel intensity             1       2       3       4       5       6       7       8       9   10
Number of pixels            1       3       3       2       2       1       3       1       0   0
Probability                 0.0625  0.1875  0.1875  0.125   0.125   0.0625  0.1875  0.0625  0   0
Cumulative probability      0.0625  0.25    0.4375  0.5625  0.6875  0.75    0.9375  1       1   1
Cumulative probability × 20 1.25    5       8.75    11.25   13.75   15      18.75   20      20  20
Floor rounding              1       5       8       11      13      15      18      20      20  20
In other words, the variance expresses how far a given set of numbers are spread out from their average
value.
4.2.3.3 Covariance (C)
The covariance between two random variables is a way to measure how much they change together. If
each one has a finite set of equal-probability values xi and yi , then the covariance can be calculated as:
C(X,Y) = (1/n) · Σ_{i=1..n} (x_i − x̄)·(y_i − ȳ) (4-4)
If two variables are independent, their covariance is 0; however, the converse is not true. The
variance of a variable equals the covariance of the variable with itself:
V(X) = C(X,X) = σ²(X) (4-5)
4.2.3.4 The Covariance Matrix
A variance-covariance matrix (covariance/dispersion matrix) is a matrix whose element (p,q) represents
the covariance between the p-th and the q-th elements of a random vector. A random vector can be seen as a
multi-dimensional random variable. Each element of this vector has either a finite number of values
resulting from empirical observations, or a finite or infinite number of possible values defined by a joint
probability distribution. This concept is used when analyzing multivariate data.
Example 4–2. We can measure three variables, such as the length, the width and the height of a certain object.
The results of all the measurements taken (in our example, 5) are arranged in a 3 x 5 matrix, where each row
represents a different variable and each column a different observation:
X = | 4.00  4.30  3.90  4.20  4.10 |
    | 1.00  1.10  1.00  1.10  1.20 |
    | 0.50  0.49  0.48  0.51  0.49 |
Obviously, the mean matrix is:
X̄ = | 4.10  4.10  4.10  4.10  4.10 |
    | 1.08  1.08  1.08  1.08  1.08 |
    | 0.49  0.49  0.49  0.49  0.49 |
So, the variance-covariance matrix is computed as:
C(X) = (X − X̄)(X − X̄)^T = | 0.1000  0.0300  0.0030 |
                           | 0.0300  0.0280  0.0004 |
                           | 0.0030  0.0004  0.0006 |
We can notice that the resulting matrix is symmetric, because the covariance between the element i and
the element j is the same as the covariance between the element j and the element i. In the main diagonal
there are the variances of our variables and in the remaining positions the covariances between each pair
of variables. For example, 0.1000 is the variance of the first variable (length) and 0.0300 is the covariance
between the first variable (length) and the second one (width).
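Example 4–2 can be checked with NumPy. Note that the example rounds the height mean to 0.49; with the exact mean of 0.494, the bottom-right entry comes out as 0.0005 instead of 0.0006, while all other entries match:

```python
import numpy as np

# Five measurements of length, width and height (Example 4-2):
# one variable per row, one observation per column.
X = np.array([[4.00, 4.30, 3.90, 4.20, 4.10],
              [1.00, 1.10, 1.00, 1.10, 1.20],
              [0.50, 0.49, 0.48, 0.51, 0.49]])

mean = X.mean(axis=1, keepdims=True)   # centroid of the data
D = X - mean                           # deviations from the mean
C = D @ D.T                            # the scatter form used in the example

# The diagonal holds the variances, the off-diagonal the covariances;
# the matrix is symmetric, as discussed above.
print(np.round(C, 4))
```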
The average vector is often called the "centroid" and the covariance matrix is referred to as the "dispersion"
matrix. This method is often used to build estimators that model the errors or differences
between a set of empirical results and a set of expected values. In feature extraction, this concept models
the spectral variability of a signal.
4.2.4 Eigenvectors and Eigenvalues
A characteristic or proper vector is a vector whose direction remains the same after a linear
transformation is applied to it. The German term "eigen" ("own") was introduced by the mathematician David
Hilbert to denote the eigenvalues and eigenvectors. This concept has many practical applications in
physics and engineering, especially in computer vision.
In the image below, we used the transformation T = [0.5 0; 0 2], meaning that we scaled by a factor of 2
vertically and by a factor of 0.5 horizontally. We can notice that the red vectors did not change their
direction, whereas the green vector did. So, the vectors that were not affected by T are the characteristic
vectors of the transformation T.
Figure 4-5. The effect of a transformation matrix T.
In general, an eigenvector v of a matrix T satisfies the relation:
Tv = λv, where λ represents a scalar called "eigenvalue". (4-6)
If v is not the null vector, then we can calculate the eigenvalues by solving the equation:
| T − λI | = 0 (4-7)
Example 4–3. Calculate the eigenvalues and eigenvectors of a given matrix:
T = | 2  3 |
    | 2  1 |
(4-7) => (2−λ)(1−λ) − 6 = 0 => λ1 = −1 and λ2 = 4.
Now, by substituting these results in (4-6), we obtain:
| 2  3 | |v11| = −1 · |v11|      and      | 2  3 | |v21| = 4 · |v21|
| 2  1 | |v12|        |v12|               | 2  1 | |v22|       |v22|
So, v11 = −v12 and 2·v21 = 3·v22.
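Relations (4-6) and (4-7) are exactly what `np.linalg.eig` solves, so the example can be verified numerically. Note that from the second row of (4-6) the ratio for the second eigenvector is 2·v21 = 3·v22:

```python
import numpy as np

T = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# np.linalg.eig solves T v = lambda v, i.e. relations (4-6)/(4-7).
vals, vecs = np.linalg.eig(T)
print(np.sort(vals))  # [-1.  4.]

# Check the eigenvector conditions derived in Example 4-3
# (eigenvectors are returned normalized, but the ratios are preserved).
v1 = vecs[:, np.argmin(vals)]            # eigenvector for lambda = -1
v2 = vecs[:, np.argmax(vals)]            # eigenvector for lambda = 4
print(np.isclose(v1[0], -v1[1]))         # True:  v11 = -v12
print(np.isclose(2 * v2[0], 3 * v2[1]))  # True:  2*v21 = 3*v22
```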
In section 4.2.3 we established that the variance-covariance matrix models the shape of the data: the
variance values define the spread of the data in the horizontal and vertical directions, whereas the
covariances define the diagonal spread. For example, in the figure below, C = [5 4; 4 6]:
Figure 4-6. A dataset in a 2-dimensional space.
4.2.5 The Eigendecomposition of a Covariance Matrix
We would like to graphically represent the dispersion matrix using a vector whose orientation points in
the direction of the largest spread of the data and whose magnitude equals the spread in that direction.
In the previous section we established that the pairs of eigenvectors and eigenvalues uniquely define
a matrix. Applying this to a dispersion matrix, we could say that the shape of our data can be described in
terms of eigenvectors and eigenvalues.
For a data set D and a direction vector v, the projection of our data onto the vector v is vᵀD. The
covariance of the projected data is vᵀCv. We are searching for the vector v that indicates the direction of
the largest variance, so the task is to maximize vᵀCv with respect to v. By the Rayleigh quotient, the
maximizing vector v is the largest eigenvector of the covariance matrix C.
To sum up, the largest proper vector of the covariance matrix always indicates the direction of the
largest variance of the data, and its length equals the corresponding proper value. The second largest
proper vector is always orthogonal to the largest eigenvector and points in the direction of the second
largest spread of the data.
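The claim that the largest eigenvector maximizes the projected variance can be checked empirically. In this sketch (synthetic data, with C chosen as in Figure 4-6), we scan unit vectors over all directions and compare the best one against the top eigenvector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data (rows are observations), C as in Figure 4-6.
D = rng.multivariate_normal(mean=[0, 0], cov=[[5, 4], [4, 6]], size=2000)
C = np.cov(D, rowvar=False)

vals, vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
v_max = vecs[:, -1]              # largest eigenvector of C

# Scan unit vectors over all directions and measure the projected variance v^T C v.
angles = np.linspace(0, np.pi, 360)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
variances = np.einsum('ij,jk,ik->i', candidates, C, candidates)

best = candidates[np.argmax(variances)]

# The best direction coincides (up to sign) with the largest eigenvector.
assert min(np.linalg.norm(best - v_max),
           np.linalg.norm(best + v_max)) < 0.02
```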
The cases below show the relation between the eigenvalues of a covariance matrix C and the shape
of the data [8]:
Case 1: C is diagonal, so the covariances are zero and the variances are equal to the proper values λ.
Figure 4-7. Diagonal Covariance Matrix.
Case 2: C is not diagonal, so the proper values now represent the variance of the data along the proper
vectors’ directions, whereas the elements of the covariance matrix C define the spread along the x and y
axes. If the covariances are zero, Case 2 reduces to Case 1.
Figure 4-8. Non-diagonal Covariance Matrix.
Let’s consider the case of a white dataset D, which means its covariance matrix C equals the identity
matrix I and a transformation matrix T consisting of a scale matrix S and a rotation matrix R:
C = I = [1 0; 0 1]
T = RS, with R = [cos θ  -sin θ; sin θ  cos θ] and S = [sx 0; 0 sy]
So, the transformed data set D’ can be expressed as:
D’ = TD
Previously, we demonstrated that the covariance matrix can be uniquely represented by its proper vectors
and values:
Cv = λv
In a two-dimensional space, the covariance matrix has two pairs of eigenvectors and eigenvalues. The
resulting system can be written as:
CV = VL, where the columns of the matrix V are the eigenvectors of C and L is a diagonal
matrix whose non-zero values are the corresponding eigenvalues.
Now, we can rewrite the covariance matrix as a function of its proper values and vectors:
C = VLV⁻¹
The equation above is called the eigendecomposition of the covariance matrix. The proper vectors give
the directions of the largest variance of the data, and the corresponding proper values give the magnitude
of this variance in those directions. For this reason, V can be seen as a rotation matrix and √L as a
scaling matrix. Using these observations, we can rewrite C as:
C = RSSR⁻¹, where R = V and S = √L.
Moreover, R is orthogonal (R⁻¹ = Rᵀ) and S is symmetric (S = Sᵀ). With these last observations,
and knowing that the transformation matrix T consists of a scale matrix S and a rotation matrix R, C
becomes:
C = TTᵀ
Figure 4-9. White data and Observed data.
To conclude, by applying a linear transformation T = RS to a white data set D, we get scaled and rotated
data D’ whose covariance matrix C equals TTᵀ. So we have shown that the covariance matrix
of observed data is in fact a linear transformation of uncorrelated (white) data. Moreover, the proper
vectors of the transformation correspond to the rotation and the proper values correspond to the scaling.
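This conclusion is easy to verify numerically. The sketch below (arbitrary made-up angle and scales) draws white data, transforms it with T = RS and compares the empirical covariance of the result with TTᵀ:

```python
import numpy as np

rng = np.random.default_rng(1)

# White data: zero mean, identity covariance (approximately).
D = rng.standard_normal((2, 100000))

theta = np.pi / 6                       # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S = np.diag([2.0, 0.5])                 # arbitrary scales sx, sy
T = R @ S

D2 = T @ D                              # observed data D' = TD
C = np.cov(D2)                          # empirical covariance of D'

# The covariance of the observed data approaches T T^T.
assert np.allclose(C, T @ T.T, atol=0.1)
```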
4.2.6 Principal Component Analysis (PCA)
Principal component analysis is a statistical method applied to a dataset in order to emphasize variation
and identify strong patterns. It is frequently used in order to visualize and explore data more easily than in
its original form.
In the example below we have a two-dimensional dataset. If we are particularly interested in how
those data points vary, we look for another coordinate system in which the variation can be seen more
clearly. After mapping the initial points to the new system, every point (x, y) receives a new value
(pc1, pc2). The new axes have no physical meaning; they were selected only to emphasize variation.
We can notice that the first and the second principal components (pc1 and pc2) were chosen along the
directions in which the samples vary the most (the red line and the green line).
Figure 4-10. Changing the data space with PCA.
Looking more closely at the principal components, we notice that the second one could be dropped,
because it contributes very little to the variation. In conclusion, the PCA procedure helped us reduce the
number of dimensions needed to describe our dataset.
PCA is even more useful in three dimensions, because it is difficult to see through a cloud of data.
Instead, we can project the original data onto a two-dimensional space after finding the best viewing
angle (eliminating the dimension with the lowest variation).
Using PCA one can detect patterns in data and then present the data in such a way as to highlight
differences and similarities. The biggest advantage of PCA is that, once those patterns are found, the
data can be compressed without much loss.
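The procedure described above can be sketched in a few lines of NumPy (the dataset is synthetic; the covariance used to generate it is a made-up value):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 2-D points that vary mostly along one direction.
D = rng.multivariate_normal([0, 0], [[5, 4], [4, 6]], size=500)
D = D - D.mean(axis=0)                 # center the data

C = np.cov(D, rowvar=False)
vals, vecs = np.linalg.eigh(C)         # ascending eigenvalues

pc1 = vecs[:, -1]                      # first principal component
scores = D @ pc1                       # 1-D description of every point
reconstruction = np.outer(scores, pc1) # back-projection into the original space

# Share of the total variance kept after dropping pc2.
explained = vals[-1] / vals.sum()
print(f"variance explained by pc1: {explained:.2f}")
```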
4.2.7 PCA in Computer Vision
Let’s suppose each photograph in the training database has a resolution of N×N pixels. By concatenating
the rows of pixels of an image, starting from the top, an N²-dimensional vector is obtained,
as the image below suggests:
Figure 4-11. Digital image stored as a vector.
If our database is a collection of “d” images, the preprocessed images will be stored as one matrix, where
each column is a mean-subtracted image from our database:
Figure 4-12. Database matrix T.
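The flattening and mean-subtraction steps can be sketched as follows (a toy case with random 8×8 “images”, not real photographs):

```python
import numpy as np

rng = np.random.default_rng(3)

N, d = 8, 5                            # toy case: five 8x8 "images"
images = rng.random((d, N, N))

# Concatenate the rows of each image into an N^2-dimensional vector.
columns = images.reshape(d, N * N).T   # shape (N^2, d)

mean_image = columns.mean(axis=1, keepdims=True)
T = columns - mean_image               # each column: a mean-subtracted image

assert T.shape == (N * N, d)
assert np.allclose(T.mean(axis=1), 0)  # every pixel now has zero mean
```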
Now, the next step is to calculate the covariance matrix and perform the PCA algorithm on it. In this case,
the covariance matrix is the product of the database matrix and its transpose: C = TTᵀ. So we would have
to work with an N² × N² dispersion matrix, which is very expensive in terms of computation.
Figure 4-13. The covariance matrix of the database matrix T
But here comes the brilliant idea of Matthew Turk and Alex Pentland, who thought of performing PCA on
TᵀT instead of TTᵀ. In this case, the covariance matrix is d × d, much smaller than in the first
case. This is the reason why in the first chapter I mentioned that Alex Pentland and Matthew Turk
discovered a way to extract the eigenvectors based on the number of input images rather than the
number of pixels.
Figure 4-14. The covariance matrix TᵀT of the database matrix T.
For instance, in a database made of 200 pictures, each with a resolution of 100×100 pixels, we could
perform PCA on a matrix of 100,000,000 elements or on a matrix of 40,000 elements, which is 2,500
times cheaper.
So, in order to obtain the eigendecomposition of the covariance matrix, we would have to solve the
following equation:
First case (working with the large matrix C(T) = TTᵀ):
C(T)vi = λivi, where vi is a proper vector and λi a proper value of C(T) (4-8)
Second case (working with the small matrix C’(T) = TᵀT):
C’(T)wi = λiwi, i.e. TᵀTwi = λiwi, where wi is a proper vector and λi a proper value of C’(T) (4-9)
By pre-multiplying equation (4-9) by the matrix T, we get:
TTᵀTwi = λiTwi => C(T)(Twi) = λi(Twi)
From this we infer that if wi is an eigenvector of C’(T) = TᵀT, then vi = Twi is an eigenvector of
C(T) = TTᵀ.
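This derivation can be checked numerically. In the sketch below (random mean-subtracted “images”, sizes made up), the largest eigenvector wi of TᵀT is mapped by T to an eigenvector of TTᵀ with the same eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(4)

Npix, d = 400, 10                       # 400 pixels, 10 images
T = rng.standard_normal((Npix, d))
T = T - T.mean(axis=1, keepdims=True)   # mean-subtracted columns

small = T.T @ T                         # d x d matrix (cheap to decompose)
big = T @ T.T                           # Npix x Npix matrix (expensive)

lam, W = np.linalg.eigh(small)          # eigenpairs of T^T T (ascending)
w = W[:, -1]                            # largest eigenvector w_i
v = T @ w                               # candidate eigenvector of T T^T

# v = T w is an eigenvector of T T^T with the same eigenvalue λ_i.
assert np.allclose(big @ v, lam[-1] * v)
```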
4.2.7.1 Singular Value Decomposition (SVD)
Another technique that helps us increase efficiency is the singular value decomposition (factorization)
of a matrix. We are going to apply this algorithm to our covariance matrix in order to obtain the eigenvectors
and eigenvalues without actually solving equation (4-9).
C’(T) = TᵀT, (4-10)
where T represents the database matrix, each of whose columns xi is a mean-subtracted image vector, and d the
number of images in the database (the number of observations).
We define the singular value decomposition of T as:
T = UΣVᵀ, (4-11)
where U is a unitary matrix (U*U = UU* = I), Σ is a rectangular diagonal matrix and V is also a unitary
matrix.
We can substitute T in equation (4-10) using equation (4-11):
C’(T) = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀUᵀUΣVᵀ = V(ΣᵀΣ)Vᵀ,
where the non-zero components of Σ are the square roots of the non-zero proper values of C’(T) (or C(T))
and the columns of V are the proper vectors of C’(T).
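The relations above can be verified with a small numerical sketch (random database matrix of made-up size): the squared singular values of T equal the eigenvalues of TᵀT, and the columns of V are its eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(5)
T = rng.standard_normal((400, 10))       # database matrix (pixels x images)

U, s, Vt = np.linalg.svd(T, full_matrices=False)

# Squared singular values = non-zero eigenvalues of T^T T (and of T T^T).
lam = np.linalg.eigvalsh(T.T @ T)        # ascending order
assert np.allclose(sorted(s ** 2), lam)

# The rows of Vt (columns of V) are the eigenvectors of T^T T.
for sigma, v in zip(s, Vt):
    assert np.allclose((T.T @ T) @ v, sigma ** 2 * v)
```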
4.2.8 Pseudocode Eigenfaces Recognizer
0. Calculate the eigenvectors and keep only the most significant ones (the first k), as seen in
Section 4.2.1 (Steps to obtain the eigenfaces).
1. Subtract the mean image M from the input image In and calculate the weight of each eigenface Eigi:
for i = 1: k
wi = EigiT * (In-M)
2. Gather all the weights calculated previously and form a vector W that reflects the contribution of
each eigenface in the input image (this is equivalent to projecting the input image onto the face-
space).
W = [ w1 … wi ….wk ]
3. Calculate the distances between the test image In and every image in the database.
for j = 1:d
Disj = ||W - Wj||2
4. Choose the minimum distance.
minDis = minj=1:d (Disj)
5. Determine if the input image In is “known” or not, depending on a threshold t.
if (minDis < t) then In is “known”, else In is “unknown”
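The pseudocode above can be sketched in Python/NumPy. This is a toy illustration on random “images”; the database size, the number of eigenfaces k and the threshold t are made-up values, not those of the thesis application:

```python
import numpy as np

rng = np.random.default_rng(6)

Npix, d, k = 100, 6, 3                   # pixels, database size, kept eigenfaces
database = rng.random((d, Npix))         # flattened training images (rows)

# Step 0: eigenfaces via the T^T T trick, keeping the k most significant ones.
M = database.mean(axis=0)                # mean image
A = (database - M).T                     # Npix x d, mean-subtracted columns
lam, W = np.linalg.eigh(A.T @ A)
eigenfaces = A @ W[:, ::-1][:, :k]       # top-k eigenvectors of A A^T
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

def project(img):
    # Steps 1-2: the weight vector W of the image in the face-space.
    return eigenfaces.T @ (img - M)

db_weights = np.array([project(img) for img in database])

def recognize(img, t=5.0):               # t: made-up acceptance threshold
    # Steps 3-5: nearest database image; "known" if the distance is below t.
    dist = np.linalg.norm(db_weights - project(img), axis=1)
    j = int(np.argmin(dist))
    return (j, "known") if dist[j] < t else (j, "unknown")

# A training image matches itself with distance 0.
j, verdict = recognize(database[2])
assert j == 2 and verdict == "known"
```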
4.3 The Eigenfaces Recognizer – Experimental Results
In 1991 Matthew Turk and Alex Pentland published an article called “Eigenfaces for recognition” in the
Journal of Cognitive Neuroscience. They tested their method on a database of 2,500 images, all
digitized under constrained conditions. Sixteen persons were photographed multiple times at all combinations
of three head sizes, three orientations and three lighting conditions. The training images were reduced from a
resolution of 512 x 512 pixels to a resolution of 16 x 16.
During the first experiment, the acceptance threshold was kept at its maximum value. Matthew Turk and Alex
Pentland tried to see how changes in lighting, head position and scale would affect the performance of their
system on the entire database. Different groups of sixteen pictures were selected and used as the training set.
Each group consisted of one photo of each subject, all taken under the same conditions of lighting, head size
and head rotation. During this experiment, no face was labeled as “unknown”. The system achieved 96%
correct classification averaged over lighting variation, 85% averaged over position variation and 64% over
size variation.
For the second experiment, the same process was followed, but this time the threshold was also varied. At
lower values, few mistakes were made when identifying the subjects, but many faces were rejected as
“unknown”. On the contrary, at a high value almost all faces were classified as “known”, at the cost of many
wrong associations. Finally, adjusting the acceptance threshold to achieve 100% accuracy lowered the correct
classification rates as follows: 81% for lighting variation, 61% for orientation and 40% for scale.
Figure 4-15. The “Eigenfaces for recognition” experiment (1991).
Each graph presented above depicts averaged performance. The y axis represents the number of correct
classifications (out of 16). The x axis represents the condition that was varied during the experiment:
(a) lighting
(b) head scale
(c) head orientation
(d) lighting and orientation
(e) size and orientation _1
(f) size and orientation_2
(g) lighting and size_1
(h) lighting and size_2
In the same paper, Matthew Turk and Alex Pentland mentioned that this method could also be used to locate
faces in a given image. It starts with the same procedure: project the image onto the face-space. Real faces
do not change very much when projected onto the low-dimensional space, whereas non-faces do
[9].
5 IMPLEMENTATION AND TESTING RESULTS
My project consists of a C# software application that detects and recognizes human faces in an input
image or video frame from a live video source. The program was developed in Microsoft Visual
Studio 2010 using EmguCV (a cross-platform .NET wrapper for the OpenCV image processing
library) and a Microsoft Access database file.
The application has two modules, one for face detection and the other for face recognition. The system can
detect, count and extract faces in a secondary window. It can also recognize people’s faces if they are
registered in the Microsoft Access database. The user can extend the database by adding new faces or delete
existing records directly from the Visual Studio application.
5.1 Required Software
5.1.1 Microsoft Visual Studio
Microsoft Visual Studio is an IDE (Integrated Development Environment) from Microsoft. A
developer can use it to build computer programs for Microsoft Windows, web applications, web sites
or web services. Visual Studio can generate both native and managed code. This environment incorporates
a code editor supporting IntelliSense as well as code refactoring. The integrated debugger works both as a
source-level debugger and a machine-level debugger.
Visual Studio has a forms designer for developing graphical user interface applications and other
built-in tools, such as a class designer, a web designer and a database designer. It supports numerous
programming languages: C, C++, C#, Visual C++, VB.NET, F#, JavaScript, HTML/XHTML,
XML/XSLT and CSS. To code in other languages, such as Ruby, Node.js or Python, we need to
install language services.
Tell me and I forget, teach me and I may remember,
involve me and I learn.
- Benjamin Franklin -
Figure 5-1. Microsoft Visual Studio 2010 Environment.
The Windows Forms designer is used to develop GUI applications which are basically event-driven
applications. In this type of programming, the user’s actions, a sensor output or a signal from outside, can act
as a trigger and influence the flow of our program.
5.1.1.1 C# Language
C# is a type-safe object-oriented language that allows programmers to build safe and robust applications that
run on the .NET Framework. C# can be used to develop Windows client applications, database applications,
XML web services or client-server applications.
Its syntax is easy to understand for developers who have previously worked with C, C++ or Java. Object-oriented
concepts such as inheritance, encapsulation and polymorphism are supported in C#. While reducing some
complexities of C++, it also provides powerful elements which cannot be found in Java, such as nullable
value types, delegates, attributes, properties, enumerations, XML documentation comments, LINQ (Language-
Integrated Query) and direct memory access. It provides generic methods and types as well as iterators,
increasing in this way type safety and performance.
Applications written in C# run on the .NET Framework, an integral element of Windows which includes a
virtual execution system named CLR (Common Language Runtime) and a unified set of class libraries. The
CLR is Microsoft’s commercial implementation of the CLI (Common Language Infrastructure).
The C# code is compiled into a language called IL (Intermediate Language) that matches the CLI
specification. The code and resources are stored on the hard disk as an “assembly” (an executable file with the
extension .dll or .exe). When a software application is executed, the assembly is loaded into the CLR.
Then the code is compiled again, turning the IL code into machine instructions. The next image shows the
connections between the C# source code, the assemblies, the .NET Framework and the class libraries:
Figure 5-2. C# code stages.
5.1.2 The EmguCV Library
EmguCV is a cross platform .NET wrapper for the OpenCV image processing library. As it is shown below,
this platform incorporates two layers. The first one is the basic layer having the same functionalities as
OpenCV and the second one contains classes from the .NET universe:
Figure 5-3. EmguCV layers.
5.1.3 Microsoft Access
Microsoft Access is part of the Microsoft Office suite of applications and represents a database
management system (DBMS) that connects the Microsoft Jet Database Engine with a graphical user
interface and other software-development tools.
It stores information in its own format based on the Access Jet Database Engine. Microsoft Access
can also import or connect to data available in other databases. Microsoft Access supports VBA
(Visual Basic for Applications), an object-oriented language which can work with DAO (Data Access
Objects) and ActiveX Data Objects.
Figure 5-4. Microsoft Access 2010.
5.1.3.1 Database overview
A database represents an organized collection of data. It is composed of a group of tables, schemas, reports,
queries and views. The information inside is usually structured in a manner that supports processes requiring
information. When dealing with large amounts of data, it is desirable to be able to easily access, manage and
update it.
Databases can be classified according to their content: statistical, bibliographic, document-text or multimedia
objects. They can also be categorized depending on their organizational approach. A well-known type of
database is the relational database. In this case the information is modeled in a form that allows us to reorganize
and access it in many different ways. Data appears in two-dimensional tables that are “normalized” so the
information is not repeated more often than necessary. Each table has what is called a “primary key”, a
field that uniquely identifies each record in the table.
An object-oriented database stores objects rather than integer, string or real-number data. The
objects consist of attributes and methods and are used in object-oriented languages such as C++, Java and
many others. So, objects contain both data and executable code.
A database management system (DBMS) is a software application able to interact with the database, the user
and other applications in order to extract and analyze data. A DBMS lets us define, create, query, update and
administrate databases. MySQL, Microsoft SQL Server, PostgreSQL, Sybase, SAP HANA, Oracle and IBM
DB2 are common examples of DBMSs.
A distributed database is a database whose data is split and stored in multiple physical locations
called “nodes”. A distributed database management system (DDBMS) manages the data as if it
were stored in a single location: what happens at the DDBMS level is reflected elsewhere in the
structure.
5.1.3.2 OLEDB
OLEDB or OLE DB (“Object Linking and Embedding, Database“) is an application programming
interface designed by Microsoft that lets developers access data from numerous sources in a uniform way. It
provides a collection of interfaces implemented using the Component Object Model (COM). OLEDB is part
of the Microsoft Data Access Components (MDAC) stack.
Basically, OLEDB detaches the data (“provider”) from the software application (“consumer”). The latter
accesses the information stored in the database through special tools which provide a level of
abstraction. In this way one can work with different sources of data without necessarily knowing technology-
specific methods.
OLEDB has a set of routines for writing and reading data. The OLEDB objects can be data source objects,
session objects, command objects and rowset objects. A program which uses OLEDB will typically follow
this sequence: initialize the object, connect to the database, issue a command, process the results and release
the data source object.
5.2 Application Configuration and Testing Results
The WinForm application is divided into a face detection module and a face recognition module. Although
recognition builds upon detection, they are different concepts. Detection means identifying any human face in
a digital image, whereas recognition means identifying a known face and correlating it with a known name.
From the parent form illustrated below, the user can choose a single task at a time by clicking on a specific
button: Face Detection / Face Recognition:
Figure 5-5. Start Form.
5.2.1 Face Detection Form
If the user has chosen “Face Detection”, another form will open. Here we can detect faces using an
input image from the hard disk (“Browse Image” button) or from a live video stream (“Start Web” button), as
shown in the example below.
On the right, there is a groupBox named “Tuning Parameters”, used to adjust the detector
depending on what we are interested in within the input image. In the second groupBox, called “Detection Results”,
we can see the number of detected faces and also navigate through them.
Figure 5-6. Face Detection Form.
The tuning parameters are:
The scan window size
This parameter basically represents the dimension of the smallest face to look for in a given image. In the
example below, we can see a red square sliding from left to right, moving across the entire
picture. At the next run, the square increases its size and scans the image again, this time searching
for bigger faces, and so on.
Figure 5-7. The scan window.
The Minimum Neighbors Threshold
If, at some point, the area within the scan window passes all the cascade stages, this indicates that there might
be a human face, and that area is marked with a rectangle around it. But false detections sometimes
happen, so we cannot take the detector’s result after only one scan as a final response. We need to let the
detector run the detection window over the test image more than once and at different scales. Finally, if a
zone is marked with a group of overlapping rectangles, then we can consider the area covered by those
rectangles a human face, as in figure 5-8. On the other hand, we should reject every isolated detection.
Figure 5-8. Detection rectangles overlapping.
The Minimum Neighbors Threshold parameter specifies how many rectangles we should find in a group after
running the detector over the image before considering that group a true human face. For example, if we set
this parameter to 3 or less, the detection will return an affirmative response for image 5-8. On the other
hand, setting it higher than 3, the detector will find zero faces in this particular image.
The following examples illustrate the effects of adjusting this threshold at different values:
Figure 5-9. Minimum Neighbors = 1.
Figure 5-10. Minimum Neighbors = 10.
Figure 5-11. Minimum Neighbors = 4.
Figure 5-12. Minimum Neighbors = 0.
Setting this parameter too low will detect all the faces at the cost of some false detections, as figure 5-9
illustrates. Setting it too high will result in missed faces (figure 5-10). Leaving the threshold at zero lets us
see all the raw detections, as in figure 5-12. Finally, there exists a range of values that identifies all the
faces in our image (figure 5-11).
The Scale Increase Rate (“Window Size”)
This value establishes how fast the detector increases its scale when running over a given image. A
lower value makes the detector run more slowly but with more precision. On the contrary, a higher value
might result in missed faces. By default, the scale increases by 10% (1.1) at each step.
The user can choose between four values (1.1, 1.2, 1.3 or 1.4) from the application interface.
Figure 5-13. Scale increase rate = 1.2.
Figure 5-14. Scale increase rate = 1.4.
5.2.2 Face Recognition Form
If the user has chosen “Face Recognition”, a new form opens from the start form. Here the system
can recognize faces in an image from the hard disk (“Browse Image” button) or from a live
video stream (“Start Web” button).
For the recognition task I used a database consisting of 80 images of 8 subjects (10 pictures per
subject). The images capture different facial expressions, as we can see below:
Figure 5-15. The database used.
The groupBox named “Tuning Parameters” is used to adjust the recognizer depending on what we are
interested in within the current frame. The second groupBox, called “Update Database”, lets us add a new
face to the training database and label it, or delete an existing face from the database.
Figure 5-16. Face Recognition Form.
Setting the threshold at a low value results in high accuracy at the cost of many unrecognized faces, as we
can see in figure 5-17. In figure 5-18 we can notice what happens when the threshold is increased. On
the other hand, if the threshold is set too high, the system will respond to any similarity between the test image
and an image in the database. For instance, the subject in figure 5-19 does not even exist in the
database.
Figure 5-17. Threshold value = 500.
Figure 5-18. Threshold value = 3500.
Figure 5-19. Threshold value = 5000.
In the example below (figures 5-20 and 5-21) we can see how the performance is affected when the lighting
conditions change. At a fixed threshold value, the system no longer recognizes the same subject. This happens
because the eigenfaces are constructed from the information available in the training images, rather than
from specific features of the subject. So, when the testing environment changes, the system fails the test.
Figure 5-20. Normal light conditions.
Figure 5-21. Strong light on the right side of the face.
In conclusion, the previous tests show that there is no universal “recipe” for tuning the detector or the
recognizer. We should adjust all the parameters involved according to the given situation and our needs.
5.3 Some Implementation Details
Below I explain some code sequences that are relevant to our topic and some aspects that should be taken into
consideration when implementing a similar application.
The HaarCascade Object
This is how we create a HaarCascade object for frontal face detection using a classifier provided by the
EmguCV library. This object contains all the information stored in an .xml file during the learning process.
The DetectHaarCascade Method
The method “DetectHaarCascade” searches for areas in the input grayscale image (in our case, the object
named “gray”) that may contain entities the cascade classifier has been trained for. It returns those parts as a
sequence of rectangles. This function scans the input picture multiple times at different sizes. Finally, it
collects all the candidate rectangles from the photo (areas that successfully passed the cascade filter) and
groups them, returning a sequence of average rectangles for each large enough group. This method needs five
parameters, as we can see in the code below. The HaarCascade object (“face”) is the cascade
classifier itself. The scale factor (“ScaleIncreaseRate”) is the factor by which the detection window is
enlarged at the next pass over the image. The minimum number of neighbors (“MinNeighbors”) is the
minimum number of overlapping rectangles that indicate the wanted object; when this parameter is 0,
the detection method does not group the candidate rectangles at all, acting as a “raw” detection. The flag
parameter indicates possible ways to optimize the operation. In my case, I chose “DO_CANNY_PRUNING”,
which rejects areas that contain too many or too few edges. When the flag is left at its default value, no
optimization is done. And finally, the minimum scan window size represents the size at which the
detector starts to run over the input image.
The EigenObjectRecognizer
In the following code sequence a recognizer object is instantiated. As the name of the class suggests
(“Eigen”), the recognizer uses the Principal Component Analysis method. Between the brackets we can see four
parameters. The first of them is an array of images used for training (“trainingPictures”). The second one is a
collection of names (“labels”) that matches the array of images mentioned previously. The third parameter is
the eigen distance threshold, the maximum distance at which a face is still considered “known”.
Setting it at a high value will make the system “react” to any similarity between the test image and another
image in the database, whereas setting it at a low value will cause many faces to be rejected as
“unknown”. The last parameter (“termCrit”) is the termination criterion, something specific to any iterative
algorithm. The stopping criterion consists of two other parameters: the number of iterations (usually set to
the number of images in our database) and the tolerance of the algorithm (the epsilon error).
The OLEDB Interface
In order to make the C# Winform Application and the Microsoft Access Database file work together, the
OleDbConnection class will be used. Its instance represents a unique connection between the provider and the
consumer. As a consequence, the OleDbConnection object will know where the database is located on the
hard disk.
Because the OLEDB interface supports accessing information stored in any format (text file, spreadsheet or
database), an OLEDB Provider is needed. The provider will expose data from a certain type of source, in our
case a Microsoft Access Database file.
Anytime we want to make a change to the database, we won’t work directly on it for security reasons. Instead,
we will use a DataTable object, which could be seen as a local database. It contains rows and columns exactly
as a “mirror” of the original database. We can select, add or iterate over stored data.
Another entity needed is an OLEDB Data Adapter, which serves as a bridge between the local table and the
data source. Any change made to the local table can be loaded into the original database and vice versa.
The OLEDB Data Adapter object has read/write properties in order to fill or update the database.
However, the Data Adapter doesn’t automatically generate queries such as INSERT, UPDATE or
DELETE. It works together with an OLEDB Command Builder object. We can see this association in the
code below. One can only correlate an OleDbDataAdapter object with one OleDbCommandBuilder object
at a time.
The sequence above presents a method that implements the connection between the winform and the
database:
Training images are stored inside our database as OLE objects. Because of this, their format must be
converted from “Emgu.CV.Image” to an array of bytes and this is exactly what the next function does:
The same steps are taken in reverse in order to read images from the database file:
The following OleDbCommand object represents an insert statement to be executed against our database:
The Dispose Method
When working with a video camera that acts as a resource for our application, we must manage it properly.
The next function (“FormClosingFct”) takes care of this crucial aspect: it releases the web camera just
before the winform closes. It is absolutely necessary to implement such a procedure, because the .NET
Framework garbage collector won’t allocate or release unmanaged memory zones. In the code below, “grabb”
is a “Capture” object that provides information about the current frame. Without disposing of the
unmanaged resources used by “grabb”, memory access errors will arise.
6 CONCLUSIONS
6.1 Face recognition in our everyday life
Nowadays, facial recognition, along with other biometric identification solutions, is accepted by our
society for security reasons. For instance, the police and the military, as well as other organizations,
make use of it in order to keep our community safe. Without a doubt, its main advantage remains
the fact that this procedure does not require the subject’s cooperation. As a consequence, face recognition
systems can be found in airports and other public places all around the world in order to prevent terrorism and
criminality. However, there are many other fields where this technology is employed, such as:
Searching for missing children
More than 1.2 million children disappear annually worldwide. The Android application called
“Helping Faceless” helps kidnapped children reunite with their families. A user takes photos
of a street child and uploads them to a server; the application then tries to find the child in
question in a database of lost children.
Online examination
Today one can take an exam simply by sitting in front of one’s personal computer. Moreover, with
face recognition the teacher has no reason to doubt the identity of the student being examined.
Eliminating duplicates in a voting system
Sometimes the same person uses two or even more accounts in order to vote or
register on a network multiple times. With a facial recognition system in place, this kind of
fraud becomes much harder to commit.
I am still learning.
- Michelangelo -
Prosopagnosia disorder
Prosopagnosia, or “face blindness”, is the inability of a person to recognize faces. No cure
has been found yet. Therefore, a portable face recognition system could definitely help the
patient avoid unpleasant situations: the system would identify the face of a given person
and pass descriptive information along to the patient.
Video banking
Credit card companies are now creating ways to pay using nothing but facial recognition, so we
will no longer need to worry about identity theft or simply forgetting a password.
6.2 Indoor Facial Detection and Recognition application
The facial recognition system I implemented could be successfully used inside an office or any indoor
space where real-time performance is needed. When working in a controlled environment, this application
represents an effective solution in terms of computational resources. The program would also be appropriate
for a home surveillance system: one cannot “feel at home” without feeling safe. Hence, my application
would provide an accurate way to limit the access of strangers or simply notify the householder about
somebody’s arrival.
6.3 Conclusions
All in all, facial recognition techniques have evolved remarkably in the last decade. Nevertheless, face
recognition remains an important field in the computer vision research area. We can easily imagine a future where
this technology will be able to recognize any face in a crowded place. “Creepy” as it may seem, I believe we can
all agree that our safety is far more important than the desire to stay anonymous. And, not to forget, just like
every technological advance, it will save precious time and make life easier for all of us.
BIBLIOGRAPHY
[1] G. Bradski and A. Kaehler, Learning OpenCV, 2008.
[2] P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features,” in
Conference on Computer Vision and Pattern Recognition, 2001.
[3] P. Viola and M. Jones, “Robust Real-Time Face Detection,” International Journal of Computer Vision,
vol. 57, no. 2, pp. 137–154, 2004.
[4] P. H. Winston, “Lecture 17: Learning: Boosting,” MIT 6.034 Artificial Intelligence, 2010. [Online].
Available: https://www.youtube.com/watch?v=UHBmv7qCey4. [Accessed: 01 03 2016].
[5] H. Geiger, “Facial Recognition and Privacy,” Center for Democracy & Technology, 2012.
[6] R. Kimmel and G. Sapiro, “The Mathematics of Face Recognition,” SIAM News, 2003.
[7] [Online]. Available: https://en.wikipedia.org/wiki/Histogram_equalization. [Accessed: 02 05 2016].
[8] [Online]. Available: http://www.visiondummy.com/2014/04/geometric-interpretation-covariance-
matrix/. [Accessed: 12 06 2016].
[9] M. Turk and A. Pentland, “Face recognition using eigenfaces,” in Proc. IEEE Conference on Computer Vision and
Pattern Recognition, 1991.
[10] F. Crow, “Summed-area tables for texture mapping,” in Proceedings of the 11th Annual Conference on
Computer Graphics and Interactive Techniques, 1984.
[11] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, pp. 71–86, 1991.
[12] L. I. Smith, “A tutorial on Principal Components Analysis,” 26 02 2002. [Online]. Available:
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf. [Accessed: 10 04
2016].