“Automatic Facial Expression Recognition”
A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and
Physical Sciences
2016
by
Hugo Gamboa Valero
School of Computer Science
Content
Abstract .......................................................................................................................... 8
Declaration ..................................................................................................................... 9
Copyright ..................................................................................................................... 10
Acknowledgements ..................................................................................................... 11
Chapter 1. Introduction .............................................................................................. 12
1.1 Motivation ......................................................................................................... 12
1.2 Aim ................................................................................................................... 13
1.3 Objectives ........................................................................................................ 14
1.3.1 Learning objectives .................................................................................... 14
1.3.2 Deliverable objectives ................................................................................ 14
1.4 Structure of the dissertation ............................................................................. 14
Chapter 2. Literature Review ..................................................................................... 16
2.1 General approach for facial expression recognition ......................................... 16
2.2 Classification .................................................................................................... 18
2.2.1 Linear classifiers ........................................................................................ 19
2.2.2 Nonlinear classifiers .................................................................................. 19
2.3 Subtleties of creating a model .......................................................................... 32
2.3.1 Bias-Variance trade-off, overfitting and underfitting ................................... 32
2.3.2 Vanishing gradient and exploding gradient ................................................ 33
Chapter 3. Deep Convolutional Neural Network ...................................................... 35
3.1 Overview .......................................................................................................... 35
3.1.1 Convolutional layer .................................................................................... 35
3.1.2 Pooling/subsampling layer ......................................................................... 38
3.1.3 Fully-connected layer ................................................................................ 40
3.2 Backpropagation in DCNNs ............................................................................. 40
3.3 Softmax classifier ............................................................................................. 41
3.4 Discrete convolution ......................................................................................... 42
3.5 Deep Convolutional Neural Network Architectures .......................................... 43
3.5.1 Architecture Considerations ....................................................................... 43
3.5.2 Improvement Strategies............................................................................. 44
3.6 Related work .................................................................................................... 48
3.6.1 AlexNet ...................................................................................................... 48
3.6.2 Visual Geometry Group ............................................................................. 48
3.6.3 GoogLeNet ................................................................................................ 49
3.6.4 Convolutional Neural Network for Facial Expression Recognition ............. 50
Chapter 4. Research methodology ........................................................................... 52
4.1 Project phases ................................................................................................. 52
4.1.1 Data collection ........................................................................................... 52
4.1.2 Pre-processing .......................................................................................... 54
4.1.3 System design ........................................................................................... 58
4.1.4 Development ............................................................................................. 62
4.1.5 Training and Testing .................................................................................. 63
4.1.6 Evaluation .................................................................................................. 66
Chapter 5. Implementation ........................................................................................ 67
5.1 Training modules .............................................................................................. 67
5.1.1 Input module .............................................................................................. 67
5.1.2 DCNN module ........................................................................................... 68
5.2 AFER modules ................................................................................................. 68
5.2.1 Graphical user interface module ................................................................ 68
5.2.2 Input processing module............................................................................ 72
5.2.3 DCNN application module ......................................................................... 72
Chapter 6. Experiments ............................................................................................. 74
6.1 Training stage .................................................................................................. 74
6.1.1 Hyper parameters ...................................................................................... 74
6.1.2 Other alternatives to improve the DCNN ................................................... 81
6.2 Test stage ........................................................................................................ 86
6.2.1 Difficult scenarios ...................................................................................... 87
6.2.2 Innovative architectures ............................................................................. 90
6.2.3 Additional test ............................................................................................ 92
Chapter 7. Discussion ............................................................................................... 97
7.1 Creating a Deep Convolutional Neural Networks ............................................. 97
7.2 Limitations ........................................................................................................ 99
7.3 Building a Graphical User Interface ................................................................ 100
7.4 Future Work ................................................................................................... 101
Chapter 8. Conclusions ........................................................................................... 102
Reference ................................................................................................................... 104
Appendix .................................................................................................................... 112
Setting the environment on Ubuntu 14.04.3 LTS with GPU support ........................ 112
Installing OpenCV on Ubuntu 14.04.3 LTS .............................................................. 114
Running experiments using Theano, Lasagne, and nolearn ....................... 115
Number of words: 17819
Appendix: 699
List of figures
Figure 2-1: General facial expression recognition framework . ..................................... 16
Figure 2-2: Comparison between standard momentum and Nesterov Accelerated
Gradient . ................................................................................................................ 27
Figure 2-3: Bias and variance representation . .............................................................. 32
Figure 3-1: Local connectivity and receptive field. ......................................................... 36
Figure 3-2: Parameter sharing. ..................................................................................... 36
Figure 3-3: Activation maps . ......................................................................................... 37
Figure 3-4: Local translation invariance and spatial size reduction. .............................. 39
Figure 3-5: Downsampling (left) and MAX pooling (right) . ............................................ 39
Figure 3-6: Example of discrete convolution. ................................................................ 42
Figure 3-7: Inception module: naïve version (left) and with dimensionality reduction
(right). ..................................................................................................................... 50
Figure 3-8: Architectures from left to right: AlexNet, VGG, GoogLeNet, and CNN for
facial recognition . ................................................................................................... 51
Figure 4-1: Sample images from the KDEF dataset. ..................................................... 53
Figure 4-2: Sample images from the JAFFE dataset. ................................................... 54
Figure 4-3: Haar features . ............................................................................................ 55
Figure 4-4: Area calculation using Integral Image. ........................................................ 55
Figure 5-1: Face detection and numpy array creation. .................................................. 67
Figure 5-2: AFER Graphical User Interface. .................................................................. 69
Figure 5-3: Result from input image. ............................................................................. 70
Figure 5-4: Result from input video. .............................................................................. 70
Figure 5-5: AFER process. ............................................................................................ 71
Figure 6-1: Accuracy of applying diverse GDO algorithms after varying the learning rate.
................................................................................................................................ 75
Figure 6-2: Validation and test accuracy for different learning rate values. ................... 76
Figure 6-3: Highest accuracy result of each GDO algorithms. ...................................... 77
Figure 6-4: Accuracy after varying the size of the max-pooling filter. ............................ 78
Figure 6-5: Accuracy for different activation functions. .................................................. 79
Figure 6-6: Validation and test accuracy for FC layers with varying number of neurons.
................................................................................................................................ 80
Figure 6-7: Validation and test accuracy for different number of FC layers. .................. 81
Figure 6-8: Accuracy of different pre-processing methods. ........................................... 82
Figure 6-9: Transformations applied to one image. ....................................................... 83
Figure 6-10: Accuracy of using different data augmentation techniques. ...................... 84
Figure 6-11: Accuracy of using normalisation and dropout. .......................................... 86
Figure 6-12: Images with 10% (top), 30% (middle) and 50% (bottom) occlusion. ......... 87
Figure 6-13: Original images (first row) and images with low (second row), medium
(third row), and high (fourth row) levels of gamma variation. .................................. 88
Figure 6-14: Accuracy for each DCNN model. .............................................................. 90
Figure 6-15: Architectures: AlexNet (Left), VGG (Centre), and GoogLeNet (Right). ..... 91
Figure 6-16: Test accuracy for different models using different datasets. ..................... 93
Figure 6-17: Confusion matrix for the new model trained with both KDEF and JAFFE
datasets (0: Fear, 1: Anger, 2: Disgust, 3: Happiness, 4: Neutral, 5: Sadness, and 6:
Surprise). ................................................................................................................ 96
Figure 7-1: False positive face detection. ...................................................................... 99
Figure 7-2: Incorrect face detection in noisy background. ........................................... 100
List of tables
Table 3-1: Data augmentation transformations. ............................................................ 45
Table 6-1: Number of images in each dataset. .............................................................. 84
Table 6-2: Occlusion test results. .................................................................................. 89
Table 6-3: Illumination test results. ................................................................................ 90
Table 6-4: AlexNet, GoogLeNet, and VGG accuracy tests. ............................ 92
Table 6-5: Validation and test accuracy for improved model and ensemble. ................ 94
Table 6-6: Precision, recall and f1-score for both models tested against the KDEF
dataset. ................................................................................................................... 95
Table 6-7: Precision, recall and f1-score for both models tested against the JAFFE
dataset. ................................................................................................................... 95
Abstract
Automatic Facial Expression Recognition
This dissertation presents the design, implementation, test, and evaluation of an
Automatic Facial Expression Recognition (AFER) system that applies a machine
learning algorithm based on Deep Convolutional Neural Networks (DCNNs) with the aim
of correctly classifying seven facial expressions (namely surprise, happiness, sadness,
fear, anger, disgust, and neutral). The DCNN module and the AFER system were built
in Python, but only the training module exploited the Graphics Processing Unit (GPU)
computational power in order to accelerate this process.
Facial expressions convey helpful information that is difficult to detect for an ordinary
system. However, being capable of recognising them could lead to more responsive
and intelligent systems that might improve the user experience. By experimenting with
different models and architectures on different benchmark facial datasets such as the
Japanese Female Facial Expression (JAFFE) and the Karolinska Directed Emotional
Faces (KDEF), the most suitable hyper parameters that yielded a good level of
performance were obtained. Additionally, a deep understanding of the strengths and
limitations of DCNNs was gained.
Results from these experiments show that special care must be taken during different
parts of the development process such as architecture selection or hyper parameter
tuning. By selecting the correct combination of these two elements, the accuracy
of the model and the convergence time improve.
Declaration
No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Copyright
i. The author of this thesis (including any appendices and/or schedules to this
thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to
time. This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other
intellectual property (the “Intellectual Property”) and any reproductions of
copyright works in the thesis, for example graphs and tables (“Reproductions”),
which may be described in this thesis, may not be owned by the author and may
be owned by third parties. Such Intellectual Property and Reproductions cannot
and must not be made available for use without the prior written permission of
the owner(s) of the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in
any relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s
policy on presentation of Theses.
Acknowledgements
First and foremost, I would like to thank my supervisor, Dr Ke Chen. I am grateful for his expert
advice and guidance during this dissertation.
I would also like to thank my family; their constant support has been the cornerstone of every
project.
Finally, I would like to thank Bere for her financial support; and Coco, Dimitra and Lucas for
showing me a different perspective of life.
This study was done with financial support of the Mexican Council of Science and Technology
(CONACyT) under the scholarship number CVU 669499.
Chapter 1. Introduction
In this chapter, an explanation about the motivation behind the project is provided.
Afterwards, the major aim and objectives are listed. Finally, in section 1.4, the structure of
the dissertation is described.
1.1 Motivation
As computers become increasingly ubiquitous and their relationship with users
changes, they need new tools to obtain feedback from their interactions with those
users, and respond accordingly. Nowadays, there are different alternatives to extract
feedback from users such as heart rate, tone of voice, body movement, body language,
etc. However, some of those alternatives are obtrusive to users; or do not provide
enough or accurate feedback in order for a system to be reliable.
Feedback, in the form of user’s emotions, offers valuable information that could have a
positive impact in different areas such as e-marketing, robotics, smart products, etc. For
instance, developers could create a music application that gradually adjusted the type of
music that is being played according to the emotions detected so that people feeling a
negative mood such as sadness or anger would change those emotions and feel better.
One unobtrusive alternative that offers a reasonable amount of feedback is the face,
particularly facial expressions. Facial expressions have been considered a good source
of information to determine the true emotions of an individual [1]. Even before Charles
Darwin conducted “studies on how people recognize emotion in faces” [2], ancient
thinkers, such as Aristotle, already knew the importance of facial expressions [3].
However, it was not until Paul Ekman conducted cross-cultural experiments around the
world that a set of universal emotions, namely surprise, happiness, sadness, fear,
anger, and disgust, was finally accepted [4, 5].
In the past, automating facial expression recognition accurately was unimaginable not
only because the computational power was limited and expensive, but also because the
techniques used performed poorly on image recognition from raw pixels [12]. However,
with the advances in faster Graphic Processing Units (GPUs) and parallelization, the
development of special-purpose machine learning models, and the availability of
sizeable amounts of data, the unimaginable has become possible.
To a certain extent, faster GPUs and parallelization are helpful, but getting to a specific
answer faster does not guarantee that it is the correct response. Finding the answer
which is closer to the correct one or, in other words, learning from data is the purpose of
machine learning models.
Recently, a branch of machine learning that has become popular is deep learning. Deep
learning models have achieved better accuracy than traditional approaches such as
SVM or kNN [13, 26]. For instance, models trained using deep learning in the ImageNet
Challenge took the first three places in the competition. Therefore, a reasonable
approach would be to use a deep learning model in order to train an automatic facial
expression recognition (AFER) system and improve the accuracy of this task.
This project is concerned with developing an AFER system using a state-of-the-art deep
learning model, i.e. Deep Convolutional Neural Networks (DCNNs) in the facial
expression domain. The prototype should analyse images captured in real time via
webcam and files from the local disk, and display the result through the Graphical User
Interface (GUI).
1.2 Aim
Design, implement, test, and analyse an automatic system that classifies correctly
seven facial expressions (namely surprise, happiness, sadness, fear, anger, disgust,
and neutral) by using a machine learning algorithm model based on DCNNs trained with
facial expression datasets.
1.3 Objectives
The main objectives of this project are:
1.3.1 Learning objectives
Investigate and understand the advantages of using deep convolutional neural
network models over other deep learning models.
Research the advantages and disadvantages of applying DCNNs in the domain
of facial expression images.
Study the hyper parameters, properties, and methods for training DCNN.
Investigate and understand the process of facial expression recognition from
images.
Analyse efficient and rapid hardware and software alternatives in order to train
DCNNs.
1.3.2 Deliverable objectives
Design and implement a GUI application using Python that is able to recognise
seven facial expressions online.
Train a DCNN using Python and Theano on Amazon Web Services
infrastructure, and utilise this model in the GUI application to predict facial
expressions.
Test and measure the application accuracy under different conditions such as
illumination and pose variation, and partial facial occlusion.
Experiment with different hyper parameters and the impact of these elements on
the accuracy of the models.
Analyse the test results from the previous experiments.
1.4 Structure of the dissertation
The structure of this dissertation is as follows:
Chapter 1 specifies the motivations to solve the problem domain along with the
aim and objectives to achieve in the project.
Chapter 2 provides a literature review of the topics related to the project including
the general approach that has been used to solve this problem; a description of
neural networks, and the complexity of creating a model due to the bias-variance
trade-off.
Chapter 3 describes characteristics and components of the DCNN, the way in
which these components interact with each other and the crucial values that must
be taken into consideration when designing this type of network.
Chapter 4 focuses on the research methodology and the stages in this project,
namely data collection, pre-processing, system design, development, training,
testing, and evaluation.
Chapter 5 details the implementation process and provides information about the
training and Automatic Facial Expression Recognition (AFER) modules.
Chapter 6 specifies the experiments that were performed during the training
stage in order to select the best hyper parameters; and the test stage to improve
the accuracy.
Chapter 7 describes the lessons that were learned during the development of the
system, including the challenges faced in the implementation, training and test
stages. Additionally, it describes new directions for future work.
Chapter 8 summarizes the project, and analyses the objectives that were fulfilled.
Chapter 2. Literature Review
This chapter describes the background material for this project. Section 2.1 explains the
general approach to facial expression recognition. Then, section 2.2 describes linear
and nonlinear classifiers, particularly, neural networks, their features, and important
elements that need to be considered when a model is built. Finally, section 2.3 details
the problems faced when a model is created.
2.1 General approach for facial expression recognition
The general framework to approach Facial Expression Recognition is shown in Figure
2-1. It consists of five steps: data acquisition, pre-processing, feature extraction,
classification, and post-processing [10].
Figure 2-1: General facial expression recognition framework [10].
The first step is responsible for obtaining static images or image sequences showing
different perspectives of the face. Both types of images can be two- or three-
dimensional, but image sequences provide more information because they are able to
represent the temporal characteristics of an expression [10].
In general, object recognition is a challenging task because of the following elements:
segmentation problems, deformation, illumination, affordance, and viewpoints.
Segmentation problems occur because real-life images are cluttered with other objects
which complicates the segmentation task; deformation happens when objects are
modified in non-affine ways, such as a hand-written two, which can have a large loop or
cusp; illumination is related to the source of lighting and its effect on the intensity of
each pixel; affordance refers to how the relationship between objects and actions
defines the classes of objects, e.g. chairs have different physical shapes, but all chairs
are used for sitting. Finally, viewpoints involve the different perspectives of an object
which some learning methods cannot handle [13].
Pre-processing helps counter the aforementioned problems. In this stage,
images are modified against head translation, rotation, and scale using geometric
normalization; moreover, there are different techniques used in order to solve
illumination problems, e.g. histogram equalization. Additionally, image segmentation is
achieved using different alternatives such as Gaussian mixture model of the skin or
deformable models of face parts [10].
In FER, face detection is a crucial task. There are different approaches for this, namely
appearance based approach, template based approach, feature based approach, and
the local-global graph approach [9].
● In the appearance based approach, the face is recognized as a whole, and the
classifier is trained with face and non-face patterns. Nevertheless, one limitation
of this approach is that it is only accurate in frontal images that are well-
illuminated and have a simple background.
● In the template based approach, a standard face pattern is generated and
correlated with the image to find the face. The limitation with this approach is the
difficulty to generalize to different shapes, sizes and poses.
● In the feature based approach, facial features that do not change are detected. Then,
candidate faces are grouped and verified. The limitation with this approach is that
it is difficult to find features in pictures that have cluttered backgrounds or poor
illumination.
● Finally, the local-global graph approach generates a graph with nodes created
according to colour similarity which are, to an extent, invariant to translation,
rotation and scale. The limitations of this approach are its sensitivity to small image
sizes and poor image quality.
In the feature extraction phase, the objective is to obtain discriminatory and stable facial
features according to one of two perspectives: geometric feature-based methods and
appearance-based methods. On the one hand, geometric feature-based methods
involve defining geometric relationships of facial components in terms of their location
and shape. One disadvantage of the facial feature extraction using this approach is the
complexity of selecting precise and reliable detection techniques to find those geometric
relationships. On the other hand, appearance-based methods apply image filters or filter
banks to the face or a section of the face. For these methods, Principal Component
Analysis (PCA), Independent Component Analysis (ICA) and Gabor wavelets have
been used with varying degrees of success [6, 8, 9, 10].
The next step, classification, is performed using one or more parametric or non-
parametric classifiers such as Hidden Markov models, k-Nearest Neighbour, Support
Vector Machine (SVM), etc. The goal of this step is to determine the type of facial
expression that was input using action units (AU) or prototypic facial expressions [10].
Finally, the objective of post-processing is to improve FER accuracy by taking
advantage of the domain knowledge or combining several levels of a classification
hierarchy [10].
2.2 Classification
Learning a model that is able to classify facial expressions requires two important
components: a score function and a loss function. The first component maps the input
data to class scores; and the second component quantifies how close the prediction of
the model is to the true value. For any model, then, it is crucial to establish a
relationship between these functions by creating an optimization problem that minimizes
the loss function with respect to the parameters of the score function [15, 16].
2.2.1 Linear classifiers
Linear classifiers are score functions that produce a linear mapping of the input data.
They are based on linear combinations of fixed nonlinear basis functions 𝜙𝑗(𝒙) and
adopt the form [19]:
$y(\mathbf{x}, \mathbf{w}) = f\left(\sum_{j=1}^{M} w_j \phi_j(\mathbf{x})\right)$ (2-1)
Where $f(\cdot)$ is a nonlinear activation function in the case of classification problems,
and the identity in the case of regression problems [19]; $\phi_j(\mathbf{x})$ is a nonlinear
basis function that may itself depend on parameters; and $\mathbf{w}$ comprises the
weight coefficients and biases.
Although linear classifiers are simple to understand, their linearity presents an important
limitation when they are used to model real-life data given that, often, this data is
nonlinearly separable.
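To make equation (2-1) concrete, the following minimal Python sketch (using NumPy) evaluates a linear classifier with Gaussian basis functions and a logistic-sigmoid f; the basis centres, width, and weights are hypothetical values chosen only for illustration:

import numpy as np

def gaussian_basis(x, centres, width=1.0):
    # Fixed nonlinear basis functions phi_j(x), one Gaussian per centre.
    return np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * width ** 2))

def linear_classifier(x, w, b, centres):
    # y(x, w) = f(sum_j w_j * phi_j(x) + b), with f the logistic sigmoid.
    a = w @ gaussian_basis(x, centres) + b
    return 1.0 / (1.0 + np.exp(-a))

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # M = 3 basis functions
w = np.array([0.5, -1.2, 0.8])                            # weight coefficients
print(linear_classifier(np.array([1.0, 0.5]), w, 0.1, centres))

The decision surface is linear in the weights, but only in the transformed space defined by the basis functions, which is exactly the limitation discussed above.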
2.2.2 Nonlinear classifiers
In order to solve this problem, an alternative is to use models that learn nonlinear
features such as neural networks.
2.2.2.1 Neural Networks
Neural networks are a series of functional transformations that are formed by a set of
computational units, also known as neurons, that were inspired by biological neural
networks. These networks use basis functions similar to (2-1), in which each of these is
a nonlinear function from a set of input variables to a set of output variables controlled
by a vector 𝑊 of adjustable parameters [16, 19].
According to [19], neural network models can be “significantly more compact, and
hence faster to evaluate, than a support vector machine having the same generalization
performance.” However, the cost of this compactness is that the error function is no
longer a convex function of the model parameters.
2.2.2.1.1 Feed-forward Network function
The basic neural network model is built using M linear combinations of the input
variables 𝑥1, … , 𝑥𝐷, in the form:
$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$ (2-2)

Where $j = 1, \ldots, M$; $a_j$ refers to activations; the superscript $(1)$ refers to
elements in the first layer of the network; $w_{ji}^{(1)}$ refers to the weights and
$w_{j0}^{(1)}$ refers to the biases [19].
Activations are transformed by applying a differentiable, nonlinear activation function ℎ(∙)
that creates:
𝑧𝑗 = ℎ(𝑎𝑗) (2-3)
The output quantities of this basis function are known as hidden units which perform
feature extraction or feature construction in order to learn nonlinear combinations of the
original input data. This process is “useful for problems where the original input features
are not very individually informative” [21].
The activation functions correspond to the basis functions, and are usually sigmoidal
functions such as the logistic sigmoid or the tanh function, or piecewise-linear ones
such as the Rectified Linear Unit (ReLU). One important reason for this is that the neural network function
becomes differentiable with respect to the network parameters. These activation units
are grouped into layers which are classified into input, hidden, and output layers. For the
output units, the choice of the activation function is determined by “the data and the
assumed distribution of target variables” [19]. In the case of regression problems, an
identity function is used; in the case of binary classification problems, a logistic sigmoid
function is used; and in the case of multiclass problems, a softmax activation function is
used [19, 21].
Following (2-1), the outputs from the hidden layers can be combined to give the output
unit activations:
$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$ (2-4)

Where $k = 1, \ldots, K$, and $K$ is the total number of outputs; $a_k$ are the output
unit activations; $w_{kj}^{(2)}$ are the weights and $w_{k0}^{(2)}$ are the bias
parameters.
Finally, by combining the previous equations, the overall neural network function can be
represented in the following form:
$y_k(\mathbf{x}, \mathbf{w}) = f\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$ (2-5)
Where all the weights and bias parameters were grouped into a vector 𝒘; and 𝑓 and ℎ
are nonlinear activation functions.
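Equation (2-5) can be translated almost directly into code. The sketch below is a toy example with hypothetical layer sizes and random weights, computing a forward pass with h = tanh and a softmax output f for a multiclass problem:

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Equation (2-5): first-layer activations (2-2), hidden units (2-3),
    # output activations (2-4), then f = softmax for multiclass outputs.
    a = W1 @ x + b1
    z = np.tanh(a)
    o = W2 @ z + b2
    e = np.exp(o - o.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
D, M, K = 4, 5, 3                # inputs, hidden units, outputs (toy sizes)
x = rng.normal(size=D)
y = forward(x, rng.normal(size=(M, D)), np.zeros(M),
            rng.normal(size=(K, M)), np.zeros(K))
print(y, y.sum())                # class probabilities summing to 1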
The main idea of supervised neural networks is that using the training data, the
algorithm learns a model that represents the relationship between the input vector and
the target. Thus, the network becomes more accurate the closer the output values are
to the target values. This is closely related to the selection of the activation and the error
functions which are defined by the type of problem being solved. In the case of
regression, linear outputs and a sum-of-square error are used; in the case of binary
classification, logistic sigmoid outputs and a cross-entropy error function are used; and
in the case of multiclass classification, softmax outputs and a multi-class cross-entropy
error function are used. In turn, these objective functions are utilized to evaluate the
model by searching for the weights that minimize this function [19, 20, 21].
2.2.2.1.2 Activation functions
An activation function, or non-linearity, performs a fixed mathematical operation on an
input. There are different activation functions; however, the four most common
activation functions are: sigmoid, tanh, ReLU, and maxout.
2.2.2.1.2.1 Sigmoid
The sigmoid activation function fits the input value within the range [0, 1]. It becomes 0
for large negative numbers, and 1 for large positive numbers. The function has the
following mathematical form:
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$ (2-6)
Where 𝑥 is an input value; and 𝑒 refers to Euler’s number.
The function presents two disadvantages: 1) the sigmoid saturates and kills gradients
(see section 2.3.2); and 2) sigmoid outputs are not zero-centred [58].
2.2.2.1.2.2 Tanh
The tanh activation function fits the input value within the range [-1, 1]. The
mathematical formula of this function is:
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (2-7)
Like the sigmoid function, tanh activation units saturate; however, their output is
zero-centred [58].
2.2.2.1.2.3 ReLU and Leaky ReLU
The Rectified Linear Unit is another activation function that does not squash the input
value within a range; instead, it thresholds the values at zero by using the following
mathematical form [17, 58]:
𝑓(𝑥) = max (0, 𝑥) (2-8)
The advantages of the ReLU function are: 1) it accelerates the convergence of the
Gradient Descent Optimization algorithm; and 2) it has a simple implementation. The
main disadvantage of this function is that the activation units could die when the
learning rate is too high.
An alternative that tries to fix the dying units of ReLU is Leaky ReLU. It does so by
introducing a small negative slope for values where 𝑥 < 0. This function is defined as:
𝑓(𝑥) = 1(𝑥 < 0)(𝛼𝑥) + 1(𝑥 ≥ 0)(𝑥) (2-9)
Where 𝛼 is a small constant (approximately 0.01).
The disadvantage with this function is that the results presented by some researchers
have not been consistent [58].
2.2.2.1.2.4 Maxout
Maxout activation function has “beneficial characteristics both for optimization and
model averaging with dropout” [71]. The maxout function is defined as:
$f(x) = \max_{j \in [1, k]} \left(x^{T} W_{j} + b_{j}\right)$ (2-10)
Where 𝑊 ∈ ℝ𝑑×𝑚×𝑘 and 𝑏 ∈ ℝ𝑚×𝑘 are learned parameters.
In terms of advantages, maxout: 1) is well-suited for training with dropout; 2) is capable
of training deeper networks than is possible using ReLU; 3) ensures that every
parameter in the model benefits from dropout and emulates bagging training; and 4)
has the benefits of ReLU without its drawbacks [58, 71].
One important disadvantage is that maxout “doubles the number of parameters for
every single neuron, leading to a high total number of parameters” [58].
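The four activation functions above can be summarised in a few lines of NumPy. This is a minimal sketch; the maxout version treats a single unit with k linear pieces, so the shapes of W and b here are illustrative rather than the full tensors of [71]:

import numpy as np

def sigmoid(x):                 # (2-6): squashes input into [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # (2-7): zero-centred, squashes into [-1, 1]
    return np.tanh(x)

def relu(x):                    # (2-8): thresholds at zero
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):  # (2-9): small negative slope where x < 0
    return np.where(x < 0, alpha * x, x)

def maxout(x, W, b):            # (2-10): maximum over k linear pieces
    return np.max(x @ W + b)    # x: (d,), W: (d, k), b: (k,)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(leaky_relu(x))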
2.2.2.1.3 Network training and Gradient Descent
The purpose of network training is to find the vector 𝑤 that produces the smallest value
𝐸(𝒘). That smallest value occurs at a point in the weight space where the gradient of
the error function vanishes, or in other words:
∇𝐸(𝒘) = 0 (2-11)
The points where this happens are called stationary points, and are further subdivided
into minima, maxima, and saddle points. Moreover, as the objective function has a highly
nonlinear dependence on the weights and bias parameters, there will be many such
points. In the case of minima, any point whose error is merely lower than in its
surrounding neighbourhood is called a local minimum, while the point with the smallest
value overall is called the global minimum [19].
According to [19], for a successful neural network, “it may not be necessary to find the
global minimum, but it may be necessary to compare several local minima in order to
find a sufficiently good solution.”
In order to find the solution for ∇𝐸(𝒘) = 0, most techniques require selecting initial
weight values and then moving through the weight space iteratively in successive steps
of the form:
𝑤(𝜏+1) = 𝑤𝜏 + ∆𝑤𝜏 (2-12)
Where 𝜏 refers to the iteration step.
The updates can be performed using gradient information. This information is used to
improve the speed with which the weight vector that produces the sufficiently good
solution is located. There are three common approaches for this update: batch, mini-
batch, and online method. The batch method uses the whole data set at once; the mini-
batch method uses part of the data set; and the online method uses one data point at a
time.
When the batch method is used, the approach is known as gradient descent or steepest
descent. It is defined by the form:
𝑤(𝜏+1) = 𝑤𝜏 − 𝜂∇𝐸(𝑤𝜏) (2-13)
However, for batch optimization, gradient descent is less robust and slower than
conjugate gradient and quasi-Newton methods. The advantage of these methods over
gradient descent is that the error function always decreases at each iteration unless the
weight vector is at a local or global minimum [19].
When the mini-batch method is used, the approach is known as mini-batch gradient
descent. As it was mentioned before, the difference is that mini-batch uses b examples
in each iteration instead of the whole set.
Finally, when the online method is used, the approach is known as sequential gradient
descent or stochastic gradient descent. This method makes an update of the weight
vector using only one data point at a time either by cycling through the data in sequence
or selecting points at random with replacement [15, 19]. It is defined by:
𝑤(𝜏+1) = 𝑤𝜏 − 𝜂∇𝐸𝑛(𝑤𝜏) (2-14)
There are two important advantages of stochastic gradient descent (SGD) over batch
and mini-batch gradient descent: 1) SGD handles redundancy in the data more
efficiently; and 2) it can escape local minima.
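The three update schemes differ only in how much data each gradient evaluation sees. A minimal sketch follows, assuming a user-supplied grad_fn that returns ∇E for a batch of examples (a hypothetical callable, not part of the system described later):

import numpy as np

def gradient_descent_epoch(w, X, T, grad_fn, eta=0.01, batch_size=1):
    # batch_size = len(X):     batch gradient descent (2-13)
    # 1 < batch_size < len(X): mini-batch gradient descent
    # batch_size = 1:          stochastic gradient descent (2-14)
    idx = np.random.permutation(len(X))    # visit data points in random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - eta * grad_fn(w, X[batch], T[batch])
    return w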
2.2.2.1.3.1 Gradient Descent Optimization Algorithms
Gradient Descent optimization algorithms were developed in order to avoid getting
trapped in suboptimal points. There are numerous algorithms to solve this problem;
however, the most commonly used are: Momentum, Nesterov Accelerated Gradient,
Adagrad, Adadelta, RMSprop, and Adam. The following subsections describe these
algorithms.
2.2.2.1.3.1.1 Momentum
This method helps gradient descent to move in the correct direction by reducing the
effect of ravines which are areas where the surface “curves more steeply in one
dimension than in another” [65], and are common around local minima.
In order to reduce this effect, the method adds a fraction of the previous update to the
current update which can be expressed as:
𝑣𝜏+1 = 𝜇𝑣𝜏 − 𝜂∇𝐸(𝑤𝜏)
(2-15)
𝑤𝜏+1 = 𝑤𝜏 + 𝑣𝜏+1 (2-16)
Where the momentum 𝜇 ∈ [0,1] represents the weight of the previous update; and 𝑣𝜏+1
and 𝑣𝜏 are the updated values at the current and previous iteration, respectively.
In other words, the update grows when successive updates point in the same direction,
and shrinks when they change direction. The result is faster convergence and reduced
oscillation [65].
2.2.2.1.3.1.2 Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) is an improvement over the momentum method
as it takes into account the future value of the gradient, and makes a correction before
jumping towards the value [62, 65].
Figure 2-2: Comparison between standard momentum and Nesterov Accelerated
Gradient [13].
In practice, it “consistently works slightly better than standard momentum” [62] as a
result of the “lookahead” gradient step. This is expressed as:
𝑣𝜏+1 = 𝜇𝑣𝜏 − 𝜂∇𝐸(𝑤𝜏 + 𝜇𝑣𝜏)
(2-17)
𝑤𝜏+1 = 𝑤𝜏 + 𝑣𝜏+1 (2-18)
Where ∇𝐸(𝑤𝜏 + 𝜇𝑣𝜏) is the “lookahead” gradient step, and the other elements are the
same as before.
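Equations (2-15) to (2-18) amount to a two-line change in the update loop. In the sketch below, grad_fn is again an assumed callable returning ∇E(w); the only difference between the two methods is where the gradient is evaluated:

def momentum_step(w, v, grad_fn, mu=0.9, eta=0.01):
    # (2-15)-(2-16): add a fraction mu of the previous update.
    v = mu * v - eta * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, mu=0.9, eta=0.01):
    # (2-17)-(2-18): evaluate the gradient at the "lookahead" point.
    v = mu * v - eta * grad_fn(w + mu * v)
    return w + v, v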
2.2.2.1.3.1.3 Adagrad
It is an adaptive learning rate method that adjusts the dynamic learning rate of each
parameter by updating infrequent parameters using a larger learning rate, and updating
frequent parameters using a smaller learning rate. A nice property of this approach is
that the progress on infrequent and frequent parameters evens out over time.
$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{\sum_{t=1}^{\tau} \nabla E(w_t)^2 + \epsilon}} \cdot \nabla E(w_\tau)$ (2-19)
Where 𝜖 is a smoothing term (set somewhere between 1e-4 and 1e-8) in order to avoid
division by zero.
An advantage of Adagrad is that the learning rate is tuned automatically. Due to this, the
learning rate value used in the implementation is usually 0.01 [65].
Adagrad has a couple of drawbacks. The first disadvantage is that this method is
sensitive to the choice of the learning rate given that this value becomes low when the
initial gradient is large. The second disadvantage is that the accumulation of squared
gradients in the denominator makes the monotonically decreasing learning rate too
aggressive. For deep learning, this means that the learning rate vanishes as the
accumulated sum keeps growing [62, 65, 66].
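A sketch of the Adagrad update of equation (2-19); the ever-growing cache g_sq is exactly the accumulated sum that eventually makes the effective learning rate vanish (grad_fn is an assumed callable returning ∇E):

import numpy as np

def adagrad_step(w, g_sq, grad_fn, eta=0.01, eps=1e-8):
    # (2-19): per-parameter learning rate shrinks with the accumulated
    # sum of squared gradients, which only ever grows.
    g = grad_fn(w)
    g_sq = g_sq + g ** 2
    return w - eta * g / np.sqrt(g_sq + eps), g_sq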
2.2.2.1.3.1.4 Adadelta
This method is an extension of Adagrad, but it tries to reduce the aggressive monotonic
learning rate by restricting the number of accumulated past gradients to a fixed number
of past gradients defined by a window of size 𝜔.
In order to define the sum of gradients over a window, Adadelta does not inefficiently
store 𝜔 previous squared gradients. Instead, the method uses a decaying average of all
past squared gradients. This decaying running average depends on the previous
average and the current gradient.
Adadelta is defined as follows:
$D[\phi]_\tau = \rho D[\phi]_{\tau-1} + (1 - \rho)\phi_\tau$ (2-20)

$\Delta w_\tau = \dfrac{\sqrt{D[\Delta w]_{\tau-1} + \epsilon}}{\sqrt{D[\nabla E(w_\tau)^2]_\tau + \epsilon}} \cdot \nabla E(w_\tau)$ (2-21)

$w_{\tau+1} = w_\tau - \Delta w_\tau$ (2-22)
Where 𝐷[ϕ]𝜏 is the running average at iteration 𝜏; 𝜌 is a constant that controls the
decay of the previous parameter updates; and Δ𝑤 refers to the previous weight update.
The advantage of Adadelta is that it solves the drawbacks of Adagrad. Additionally, it
makes it unnecessary to tune the hyper parameters despite variations in input data
types, number of hidden units, and nonlinearities. This makes Adadelta a robust
learning rate method [66].
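Equations (2-20) to (2-22) in code form; note that no learning rate η appears, which is what makes tuning unnecessary. A sketch, with rho and eps values chosen only for illustration:

import numpy as np

def adadelta_step(w, avg_g2, avg_dw2, grad_fn, rho=0.95, eps=1e-6):
    # (2-20)-(2-22): decaying averages of squared gradients and squared
    # updates replace the global learning rate entirely.
    g = grad_fn(w)
    avg_g2 = rho * avg_g2 + (1 - rho) * g ** 2
    dw = np.sqrt(avg_dw2 + eps) / np.sqrt(avg_g2 + eps) * g
    avg_dw2 = rho * avg_dw2 + (1 - rho) * dw ** 2
    return w - dw, avg_g2, avg_dw2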
2.2.2.1.3.1.5 RMSprop
This unpublished adaptive learning rate method, proposed by Geoffrey Hinton in [67], is
a different approach to improving Adagrad. RMSprop utilizes a moving average of
squared gradients in order to modulate the learning rate of each weight [62].
It is defined as:
$D[\phi]_\tau = DR \cdot D[\phi]_{\tau-1} + (1 - DR)\phi_\tau$ (2-23)

$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{D[\nabla E(w_\tau)^2]_\tau + \epsilon}} \cdot \nabla E(w_\tau)$ (2-24)
Where $D[\nabla E(w_\tau)^2]_\tau$ is the moving average of squared gradients at
iteration $\tau$; and $DR$ (decay rate) is a hyper parameter whose usual values are
0.9, 0.99, or 0.999 [62, 65, 67].
In the same way as Adadelta, RMSprop solves the vanishing learning rate caused by
the monotonically smaller updates [62].
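RMSprop is a small modification of Adagrad: the growing sum of squared gradients is replaced by a moving average, as in this sketch of equations (2-23) and (2-24):

import numpy as np

def rmsprop_step(w, avg_g2, grad_fn, eta=0.001, decay=0.9, eps=1e-8):
    # (2-23)-(2-24): a moving average of squared gradients modulates the
    # per-parameter learning rate, so old gradients are gradually forgotten.
    g = grad_fn(w)
    avg_g2 = decay * avg_g2 + (1 - decay) * g ** 2
    return w - eta * g / np.sqrt(avg_g2 + eps), avg_g2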
2.2.2.1.3.1.6 Adam
Adaptive Moment Estimation (Adam) is an alternative method that calculates the
adaptive learning rates for each parameter. Adam combines “the ability of Adagrad to
deal with sparse gradients, and the ability of RMSprop to deal with non-stationary
objectives” [68] by keeping track of both an exponentially decaying average of past
square gradients 𝑣𝜏, and an exponentially decaying average of past gradients 𝑚𝜏. The
first element corresponds to the second moment of the gradients (variance). The
second element represents the first moment of the gradients (mean). 𝑣𝜏 and 𝑚𝜏 values
correspond to 𝐷[ϕ2]𝜏 and 𝐷[ϕ]𝜏, respectively [65].
As mentioned in [68], $v_\tau$ and $m_\tau$ are initialised as vectors of zeros. This
causes the moment estimates to be biased towards zero, particularly when the decay
rates are small and during the initial time steps. The authors counteract these biases in
the following way:
$\hat{m}_\tau = \dfrac{m_\tau}{1 - \beta_1^{\tau}}$ (2-25)

$\hat{v}_\tau = \dfrac{v_\tau}{1 - \beta_2^{\tau}}$ (2-26)
Where $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates;
and $\hat{m}_\tau$ and $\hat{v}_\tau$ are the bias-corrected first moment estimate and
second raw moment estimate, respectively.
These exponentially decaying averages are then used to update the parameters as
follows:
$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{\hat{v}_\tau} + \epsilon} \cdot \hat{m}_\tau$ (2-27)
According to [68], Adam 1) requires little memory; 2) outperforms other adaptive
learning methods for a variety of models and datasets; and 3) scales to large-scale high
dimensional machine learning problems.
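Putting equations (2-25) to (2-27) together gives the full Adam step. The sketch below assumes the iteration counter t starts at 1, so the bias-correction denominators are never zero; grad_fn is again an assumed gradient callable:

import numpy as np

def adam_step(w, m, v, t, grad_fn, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Decaying averages of gradients (m, first moment) and squared
    # gradients (v, second moment), followed by bias correction.
    g = grad_fn(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)    # bias-corrected first moment (2-25)
    v_hat = v / (1 - b2 ** t)    # bias-corrected second moment (2-26)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v    # (2-27)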
2.2.2.1.4 Backpropagation
Backpropagation provides a computationally efficient method for evaluating the
derivatives of the error function with respect to the weights [19]. The purpose of this
algorithm is to calculate the contribution of each node to the error at the output.
The following block describes the algorithm:
Algorithm 1 Backpropagation algorithm [19, 26]
1. Apply a forward pass to an input vector $x_n$ using $a_j = \sum_i w_{ji} x_i$ and
$z_j = h(a_j)$ to find the activations of all the hidden and output units.
2. Evaluate $\delta_k$ for all the output units using $\delta_k = y_k - t_k$.
3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to
obtain $\delta_j$ for each hidden unit in the network.
4. Use $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required
derivatives.
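For a single hidden layer with h = tanh and softmax outputs, Algorithm 1 reduces to a few lines of NumPy. This sketch returns the weight and bias gradients for one training pair; it is a toy illustration, not the DCNN implementation of later chapters:

import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    a = W1 @ x + b1                        # forward pass (step 1)
    z = np.tanh(a)
    o = W2 @ z + b2
    e = np.exp(o - o.max())
    y = e / e.sum()                        # softmax outputs
    d_out = y - t                          # delta_k = y_k - t_k (step 2)
    d_hid = (1 - z ** 2) * (W2.T @ d_out)  # h'(a_j) * sum_k w_kj delta_k (step 3)
    # dE/dw_ji = delta_j * z_i, formed as outer products (step 4)
    return np.outer(d_out, z), d_out, np.outer(d_hid, x), d_hid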
2.3 Subtleties of creating a model
Building a machine learning model is not a simple task. This subsection describes
general problems that affect every model and particular drawbacks of gradient descent
that need to be considered during the training and testing process.
2.3.1 Bias-Variance trade-off, overfitting and underfitting
It is important to notice that the objective of modelling the data is not to find a model that
fits the training data perfectly, but to find one that generalises to unseen data. This goal
is closely related to the concepts of bias and variance.
In machine learning, the first element, bias, refers to the difference between the
expected value of the predictions and the true value; the second element, variance,
refers to the variability of the model's predictions around their expected value [75].
The figure below illustrates these concepts:
Figure 2-3: Bias and variance representation [76].
When the model shows low bias and high variance, the model is overfitting the data.
This means that the predicted values follow the true values closely (the blue dots are
close to the target in red), but the variability of these predictions across different trials is
big (the blue dots are spread). Conversely, when it displays high bias and low variance,
the model is underfitting the data. This means that the error between the predicted
values and the true values is large (the blue dots are far from the target in red), but the
variability of the predictions across different trials is small (the blue dots are clustered).
There are two alternatives to combat overfitting: reducing the number of dimensions of
the parameter space or reducing the effective size of each dimension.
The methods applied to reduce the number of dimensions of the parameters are:
pruning and weight sharing. Pruning involves removing information from the model
once it is over-fitted. This process is different according to the model, but the idea
remains the same. For instance, when pruning is applied to a decision tree model, it
means that the subtrees are replaced with a leaf; when pruning a neural network, the
unimportant weights are removed. Weight sharing is the method in which a single
weight is shared among many connections in the network. This means that the number
of adjustable weights in the network is less than the number of connections [13, 75, 77].
The methods applied to reduce the size of each dimension are: regularisation and early
stopping. Regularisation is a method that reduces overfitting by adding a complexity
penalty to the loss function. Early stopping refers to a method in which the training
process is stopped as soon as the model’s performance ceases to improve or
decreases [74, 75].
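Early stopping in particular is simple to implement. The sketch below assumes hypothetical train_epoch and validation_error callables and a model exposing get_weights/set_weights; none of these names come from the system described in later chapters:

def train_with_early_stopping(model, train_epoch, validation_error, patience=10):
    # Stop once validation error has not improved for `patience` epochs,
    # then restore the best weights seen so far.
    best_err, best_weights, stale_epochs = float("inf"), None, 0
    while stale_epochs < patience:
        train_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_weights = err, model.get_weights()
            stale_epochs = 0
        else:
            stale_epochs += 1
    model.set_weights(best_weights)
    return model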
2.3.2 Vanishing gradient and exploding gradient
Two important problems arise when neural networks and deep neural networks are
trained using gradient descent: vanishing gradient and exploding gradient. The first
problem occurs when the gradient becomes weaker in the earlier hidden layers as the
gradient signal passes back through multiple layers causing the learning algorithm to
get stuck in poor local minima. In other words, neurons in later layers learn faster than
neurons in earlier layers. Conversely, the second problem occurs when neurons in later
layers learn slower than neurons in earlier layers [21].
A way to prevent these problems is to initialize the parameters using unsupervised
learning, also called generative pre-training. The advantage of applying unsupervised
learning is that the model is forced to represent a “high-dimensional response, namely
the input feature vector, rather than just predicting a scalar response. This acts like a
data-induced regularizer, and helps backpropagation find local minima with good
generalization properties” [21].
Another way to prevent that is to change the activation function. The sigmoid function
was the preferred function in the past; however, due to the disadvantages discussed in
section 2.2.2.1.2.1, this tendency has changed. As it was mentioned, the alternative
activation functions to the sigmoid function are: tanh, ReLU, and maxout. Tanh solves
the zero-centred inconvenience, but it still suffers from vanishing and exploding
gradient; ReLU presents one important drawback: the units in the network die during
training when the learning rate is too high. Finally, maxout seems to be the best
alternative since it generalises the ReLU and leaky ReLU; and has the benefits of ReLU
without its drawbacks [58].
Chapter 3. Deep Convolutional Neural Network
This chapter describes the important elements of Deep Convolutional Neural Networks.
In section 3.1, an overview is presented; afterwards, backpropagation applied to DCNN
is detailed. Section 3.3 and 3.4 provide information about the softmax classifier and the
discrete convolution, respectively. Then, DCNN architectures and strategies to improve
their results are given. Finally, section 3.6 contains information about related work.
3.1 Overview
CNNs, a form of multilayer perceptron (MLP), are neural networks that are specially
designed to exploit the structure of the data, namely 1d signals such as speech or text,
or 2d signals such as images [16, 19, 21]. CNNs are similar to DCNNs; the only
difference between them is that DCNNs have a larger number of hidden layers than
CNNs. Both architectures use three types of layers: the convolutional layer,
pooling/subsampling layer, and fully-connected layer.
3.1.1 Convolutional layer
The convolutional layer is an important element of the network that transforms one
volume of activations into another by convolving small filters with the input volume. This
operation is called discrete convolution and is defined in section 3.4. Convolutional
layers provide the network with two important features: Local connectivity and
parameter sharing.
Local connectivity is achieved when neurons are connected to a local region of the input
volume. This local region is a hyper parameter called receptive field with dimensions r x
r. The neuron's connections are local along the spatial dimensions of the receptive
field, but full along the entire depth of the input volume. Local connectivity
drastically reduces the number of connections on a neural network. This not only
diminishes the processing time, but also improves the performance of the model by
preventing overfitting [16, 24, 25].
Figure 3-1: Local connectivity and receptive field.
Parameter sharing occurs when neurons within a group, called a feature map or
activation map, share the same parameters and cover different parts of the image
by using different receptive fields. This provides the network with translation invariance
which means that useful features that are learnt in some portion of the image can be
used everywhere else without independently learning those features [16, 21, 24].
Figure 3-2: Parameter sharing.
In each convolutional layer, there are filters that represent sets of weights. These filters,
also known as kernels, are convolved with an input volume to generate an output
volume. Each convolution generates an activation map which detects a specific type of
feature. The number of these activation maps is proportional to the number of filters in
the corresponding layer [16, 24, 25].
Figure 3-3: Activation maps [16].
Activation maps can be thought of as the output of neurons after convolving the same
kernel. Their dimensions are determined by three hyper parameters: depth, stride and
zero-padding.
Depth controls the number of neurons that connect to the same region of the input
volume, but activate due to different features in the input such as edges or blobs of
colour. These neurons form a depth column [16, 25].
Stride determines the distance between depth columns. A smaller stride means less
distance between columns, and more overlapping receptive fields between columns
which translates into a larger output volume. Conversely, a higher stride results in a
smaller output volume.
Zero-padding is a hyper parameter that is used to pad the border of the input volume
with zeros in order to control the spatial size of the output volume.
Once these hyper parameters have been defined, it is possible to calculate how many
neurons can be arranged in the output volume using:
(W − F + 2P) / S + 1    (3-1)

Where W is the size of the input volume; F is the receptive field size; P is the amount of
zero-padding used on the border; and S is the stride. It is important to notice that this
result cannot be a decimal number [16, 25].
Similarly, the output volume can be calculated using:
V = Width · Height · Depth    (3-2)

Where:

Width = (W_1 − F + 2P) / S + 1    (3-3)

Height = (H_1 − F + 2P) / S + 1    (3-4)

Depth = K    (3-5)
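As an illustration of equations 3-1 to 3-5, the following Python sketch (a hypothetical helper, not part of the system described in this dissertation) computes the output volume dimensions from these hyper parameters:

```python
def conv_output_volume(w, h, f, p, s, k):
    """Output dimensions of a convolutional layer (equations 3-1 to 3-5).

    w, h: input width and height; f: receptive field size (f x f);
    p: zero-padding per border; s: stride; k: number of filters (depth).
    """
    width = (w - f + 2 * p) / s + 1
    height = (h - f + 2 * p) / s + 1
    # The result must be an integer, otherwise the hyper parameters are invalid.
    if width != int(width) or height != int(height):
        raise ValueError("Hyper parameters do not fit the input volume")
    return int(width), int(height), k

# 48x48 input, 5x5 filters, padding 2, stride 1, 32 filters -> (48, 48, 32)
print(conv_output_volume(48, 48, 5, 2, 1, 32))
```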
3.1.2 Pooling/subsampling layer
The pooling layer reduces the spatial size of the input, decreasing the number of
parameters and the computation in the network. As a consequence, the network gains
local translation invariance and a mechanism to control overfitting [16]. The spatial size
reduction is achieved by taking a set of hidden units within a neighbourhood and
aggregating their activations using a pooling function (MAX pooling, average pooling,
L2-norm pooling or fractional max pooling) [16, 25]. It is worth noting that the depth
dimension remains unchanged.
Figure 3-4: Local translation invariance and spatial size reduction.
There are two configurations commonly used: the overlapping and non-overlapping
pooling. The former applies a 3x3 MAX-pooling filter, while the latter applies a 2x2 MAX-
pooling filter, both with a stride of 2 [16].
Figure 3-5: Downsampling (left) and MAX pooling (right) [25].
The pooling layer operates independently on every depth slice of the input and resizes
this input spatially using the MAX-pooling operation, selecting the maximum value
within a local neighbourhood. In the figure above, the neighbourhood is formed by 4
elements.
The MAX-pooling operation can be expressed using the following equation:
y_{ijk} = max_{p,q} x_{i, j+p, k+q}    (3-6)

Where x_{ijk} is the value of the i-th feature map at position (j, k); p is the vertical index in
the local neighbourhood; q is the horizontal index in the local neighbourhood; and y is
the result of this operation in the pooling layer [25].
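A minimal numpy sketch of equation 3-6 for a single feature map, assuming 2x2 non-overlapping neighbourhoods and even input dimensions, might look as follows:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 non-overlapping MAX pooling with stride 2 over one feature map.

    The depth dimension of a volume would be handled by applying this to
    every slice independently.
    """
    h, w = x.shape
    # Group each 2x2 neighbourhood and take its maximum (equation 3-6).
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 4, 2, 9]])
print(max_pool_2x2(x))  # [[6 7], [8 9]]
```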
An alternative to the pooling layer is a convolutional layer with larger strides which
would also reduce the spatial size. According to [16], given the aggressive reduction in
the size of the representation when the pooling layer is used, “the trend in the literature
is towards discarding the pooling layer in modern ConvNets.”
3.1.3 Fully-connected layer
After the previous layers, there can be any number of fully-connected layers. These
layers are regular neural networks with neurons that have full connections to all
activations in the previous layer.
3.2 Backpropagation in DCNNs
In a DCNN model, the convolutional and pooling operations are applied to the gradient
as part of the backpropagation algorithm.
For the convolutional operation, the gradient is calculated as follows:
∇_{W_k^{(l)}} J(W, b; x, y) = ∑_{i=1}^{m} (a_i^{(l)}) ∗ flip(δ_k^{(l+1)})    (3-7)

∇_{b_k^{(l)}} J(W, b; x, y) = ∑_{i=1}^{m} ∑_{a,b} (δ_k^{(l+1)})_{a,b}    (3-8)
Where J is the cost function; (W, b) are the parameters; and (x, y) are the training data
and label pairs [15, 26].
Similarly, for the pooling layer, the gradient is calculated as follows:
δ_k^{(l)} = upsample((W_k^{(l)})^T δ_k^{(l+1)}) · h′(a_j)    (3-9)
Where k is the index of the filter and h′(a_j) is the derivative of the activation function.
Additionally, the upsample operation propagates the error through the pooling layer by
calculating the error with respect to each unit incoming to this layer [15, 26].
3.3 Softmax classifier
There are complex scenarios where binary classification presents limitations as real-life
data is usually classified into multiple classes. In those cases, multi-class classification
is required. This can be attained by using a classifier such as softmax. Unlike other
classifiers that treat the output as uncalibrated scores for each class, the softmax
classifier provides K mutually exclusive normalized class probabilities.
The outputs of this classifier are interpreted as y_k(x, w) = p(t_k = 1 | x), with the following
error function:
E(w) = − ∑_{n=1}^{N} ∑_{k=1}^{K} t_{kn} ln y_k(x_n, w)    (3-10)
The softmax classifier transforms a K-dimensional vector 𝒛 of arbitrary real-valued
scores into a vector of values between zero and one that sum to one. This is achieved
by using the softmax function:
y_k(x, w) = exp(a_k(x, w)) / ∑_j exp(a_j(x, w))    (3-11)

Which satisfies 0 ≤ y_k ≤ 1 and ∑_k y_k = 1.
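A small numpy sketch of equation 3-11 follows; subtracting the maximum score first is a standard numerical-stability trick that is not discussed above and does not change the result:

```python
import numpy as np

def softmax(z):
    """Softmax over a K-dimensional score vector z (equation 3-11)."""
    e = np.exp(z - np.max(z))   # shift scores for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())       # probabilities in [0, 1] that sum to 1
```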
3.4 Discrete convolution
An important element in the CNN is the discrete convolution operation which is used to
compute pre-activation values. These pre-activation quantities are then passed through
a non-linearity to get the values of the hidden units in the feature maps.
The discrete convolution of an input image 𝑥 with a (𝑟 × 𝑟)-kernel 𝑘 is expressed as
follows:
(x ∗ k)_{ij} = ∑_{p,q} x_{i+p, j+q} k_{r−p, r−q}    (3-12)
Where (x ∗ k)_{ij} is the convolution at position (i, j); x_{i+p, j+q} is the value of the input
image at position (i + p, j + q); and k_{r−p, r−q} is the value of the kernel at position
(r − p, r − q). The image below shows an example of the discrete convolution operation:
Figure 3-6: Example of discrete convolution.
The values of the hidden units are calculated using:
y_j = g_j tanh(∑_i k_{ij} ∗ x_i)    (3-13)

Where k_{ij} is the convolution kernel, x_i is the i-th channel of the input, and g(⋅) is the
ReLU activation function described in section 2.2.2.1.2.3.
It is worth noting that, with a non-linearity, the discrete convolution operation provides
feature detection to the hidden layers by emphasizing a correspondence between a
learned filter and a particular region [25]. This combination helps to detect features such
as edges or wrinkles.
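The following numpy sketch implements equation 3-12 directly as a "valid" convolution. It is a naive illustration of the definition; real frameworks use far faster routines:

```python
import numpy as np

def discrete_conv2d(x, k):
    """'Valid' 2-D discrete convolution of image x with an r x r kernel k
    (equation 3-12). The kernel is flipped, which is what distinguishes
    convolution from cross-correlation."""
    r = k.shape[0]
    flipped = k[::-1, ::-1]                 # k_{r-p, r-q}
    h, w = x.shape
    out = np.zeros((h - r + 1, w - r + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + r, j:j + r] * flipped)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[0., 1.], [2., 3.]])
print(discrete_conv2d(x, k))
```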
3.5 Deep Convolutional Neural Network Architectures
DCNN architectures are created by combining the layers described in section 3.1. The
most common form of DCNN stacks a number of convolutional and pooling layers
together until the input image has been spatially reduced. This intermediate output is
followed by fully-connected layers, the last of which outputs a value such as the class
score [16, 25].
3.5.1 Architecture Considerations
DCNNs are powerful neural networks, but it is important to consider some elements
during their design:
3.5.1.1 Input layer
The size of the input volume should be divisible by two several times. These size values
range from 32 to 512 [16].
3.5.1.2 Convolutional layer
Small filters are commonly used with a stride of one and zero-padding. The zero-padding
is selected using the formula P = (F − 1) / 2, so that the size of the input volume is
preserved. It is worth noting that small filters such as 3x3 and 5x5 can be used in any
layer, but large filters such as 7x7 should only be used in the first convolutional layer
[16].
It is preferable to stack several small filters rather than use one equivalent large filter,
because the small filters express more powerful features of the input by preserving the
non-linearities, and require fewer parameters [16].
3.5.1.3 Pooling layer
It is common to use 2x2 or 3x3 filters. The reason to prefer small filters is that large
filters are too lossy and aggressive which causes poor performance [16].
3.5.1.4 Strides and Zero-Padding
According to [16], a stride of one is preferred because it preserves the spatial size of the
input volume and works better in practice. Similarly, zero-padding maintains the spatial
size of the input and prevents the information at the border from disappearing too quickly.
3.5.2 Improvement Strategies
There are some strategies to improve the performance of the network:
Data augmentation: This method creates additional versions of the original
image that help the network to be resistant against small changes that do not
affect the structure of the image [21].
The transformations usually applied to increase the data set are: rotation,
translation, zooming, flipping, and random cropping. These transformations are
not applied to each image; instead, they are randomly applied to some of them in
order to prevent overfitting. The table below shows the result of applying these
transformations to an image.
Table 3-1: Data augmentation transformations applied to an original image: rotation,
horizontal flipping, translation, vertical flipping, zooming, and random cropping.
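As a rough illustration of how such transformations might be applied at random in Python (a simplified numpy sketch covering rotation and flipping only; the project itself relied on OpenCV routines), consider:

```python
import random
import numpy as np

def augment(image):
    """Randomly apply one transformation from Table 3-1 to a 2-D numpy
    image, or leave it unchanged; applying transformations only to a
    random subset of images helps prevent overfitting."""
    choice = random.choice(["rotate", "hflip", "vflip", "none"])
    if choice == "rotate":
        return np.rot90(image)      # 90-degree rotation
    if choice == "hflip":
        return image[:, ::-1]       # horizontal flip
    if choice == "vflip":
        return image[::-1, :]       # vertical flip
    return image
```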
Dropout method: This method prevents neurons from being helpful only in the
context of several other specific neurons, by randomly dropping neurons, along
with their connections, from the neural network during training. In other words, it
stops two or more neurons from repeatedly detecting the same feature and
wasting not only the network’s capacity, but also computational resources [23,
61].
Therefore, the main idea of dropout is to remove individual activations at random
during training which makes the model “more robust to the loss of individual
pieces of evidence and thus less likely to rely on particular idiosyncrasies of the
training data” [61].
The dropout neural network model is defined as follows:
r_j^{(l)} ~ Bernoulli(p)    (3-14)

ỹ^{(l)} = r^{(l)} ∗ y^{(l)}    (3-15)

z_i^{(l+1)} = w_i^{(l+1)} ỹ^{(l)} + b_i^{(l+1)}    (3-16)

y_i^{(l+1)} = f(z_i^{(l+1)})    (3-17)

Where r is a vector of independent Bernoulli random variables, each of which has
probability p of being 1; ỹ is the thinned output, calculated by multiplying r
element-wise with the output from the layer, y^{(l)}; and the operator ∗ represents
an element-wise product [61].
At test time, it is necessary to adjust the weights according to:
W_test^{(l)} = p W^{(l)}    (3-18)
It is mentioned in [23] and [61] that although dropout alone improves the
performance of the network, combining dropout with max-norm regularisation,
large decaying learning rates and high momentum improves the performance of
SGD even further.
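A minimal numpy sketch of equations 3-14 to 3-18, showing both the training-time mask and the test-time scaling (an illustration, not the implementation used in this project), is:

```python
import numpy as np

def dropout_forward(y, p, train=True):
    """Dropout on the activations y of one layer.

    p is the probability that a unit is kept. At training time, units are
    dropped at random (equations 3-14 and 3-15); at test time the output
    is scaled by p, as in equation 3-18.
    """
    if train:
        r = np.random.binomial(1, p, size=y.shape)  # r ~ Bernoulli(p)
        return r * y                                # thinned output
    return p * y                                    # expected activation

activations = np.array([0.5, 1.2, 0.0, 2.3, 0.7])
print(dropout_forward(activations, p=0.5))
print(dropout_forward(activations, p=0.5, train=False))
```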
Early stopping: It is a procedure that controls overfitting by tracking the
performance of the model on a validation set. Once this performance stops
improving, or decreases, gradient descent is terminated. This method is
widely used because it is simple to understand and implement, and has
proved better than other regularisation methods in numerous cases [53,
74].
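A generic sketch of the procedure, where train_one_epoch and evaluate are hypothetical callables standing in for the actual training and validation routines:

```python
def train_with_early_stopping(train_one_epoch, evaluate, patience=10, max_epochs=500):
    """Stop gradient descent once validation accuracy stops improving.

    train_one_epoch() runs one pass of gradient descent and returns the
    current model state; evaluate(state) returns validation accuracy.
    """
    best_acc, best_state, stale = 0.0, None, 0
    for epoch in range(max_epochs):
        state = train_one_epoch()
        acc = evaluate(state)
        if acc > best_acc:
            best_acc, best_state, stale = acc, state, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation accuracy has stopped improving
    return best_state, best_acc
```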
Ensemble method: It is a method based on the law of large numbers,
which establishes that the average of the predicted values tends to get closer to
the expected value as the number of trials increases [63]. In other words, by
creating different models and averaging their predictions, the likelihood of
predicting the correct expression increases. However, the models must be
different from each other so that the bias for specific values decreases.
In order to create different models, there are a couple of alternatives: 1) Same
model, different initializations; and 2) Top models obtained during cross-
validation. The first alternative involves generating the best model using cross-
validation, then training multiple models based on this model, but utilizing
different random initializations. The drawback of this approach is that variety is
only introduced through random initialization. The second alternative requires
determining the best hyper parameters, then selecting the top models to form the
ensemble. The disadvantage of this approach is that it can choose suboptimal
models and affect the accuracy [61, 62].
The disadvantage of this method is that it takes longer to evaluate, as it
involves more than one model. Recent work from Geoff Hinton on “dark
knowledge” suggests that it is possible to extract the essence of the ensemble
model and create a new single model by “incorporating the ensemble log
likelihoods into a modified objective" [62, 64].
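A compact numpy sketch of the averaging step, assuming hypothetical model objects that expose a predict_proba method:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average class probabilities over several independently trained
    models and return the index of the most likely facial expression."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return int(np.argmax(probs)), probs
```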
3.6 Related work
This section describes briefly some CNN architectures that have been used to classify
different types of images. The first three architectures were developed to participate in
the ImageNet Challenge, while the last architecture was applied in facial expression
recognition.
3.6.1 AlexNet
AlexNet was proposed by Alex Krizhevsky to classify images into 1000 different
classes as an entry for the ImageNet 2012 challenge. It contains eight learned layers, of
which five are convolutional layers, two are fully-connected layers, and one is a fully-
connected layer with a 1000-way softmax. Moreover, this model implements a local
normalization scheme which aids generalisation [35].
There are three important elements used in this model in order to reduce overfitting: 1)
neurons with the ReLU nonlinearity (since DCNNs with ReLU train “several times faster
than their equivalents with tanh units” [35]); 2) dropout regularisation method; and 3)
data augmentation (image translation, horizontal reflection, and PCA on the set of RGB
pixel values) [35].
This architecture is displayed in Figure 3-8. It can be noticed that some of the
convolutional layers are followed by max-pooling and normalisation layers. Moreover,
the first two fully-connected layers apply the dropout regularisation method.
3.6.2 Visual Geometry Group
The Visual Geometry Group (VGG) architecture was developed by Karen Simonyan
and Andrew Zisserman as an entry for the ImageNet 2014 challenge. These DCNNs
push the number of weight layers to 19, demonstrating that increasing the depth of the
network is beneficial for the classification accuracy. The novel approach of this
architecture was using very small 3x3 receptive fields throughout the whole network.
There are three advantages of very small over large receptive fields: 1) given that more
non-linear rectification layers are used, the decision function is more discriminative; 2)
the number of parameters is greatly reduced; and 3) as a consequence of the previous
point, the training time is reduced [79].
This VGG architecture is shown in Figure 3-8. Similarly to AlexNet, the VGG model
applied dropout regularisation (p = 0.5) in the first two fully-connected layers [79].
3.6.3 GoogLeNet
The DCNN architecture named Inception was developed by Christian Szegedy and his
team at Google, and was entered in the ImageNet 2014 challenge.
The novel idea was to stack inception modules, shown below, upon each other with
occasional max-pooling layers, which improves the utilization of computing resources as
a result of the “ubiquitous use of dimensionality reduction prior to expensive
convolutions with larger patch sizes” [80]. In other words, by halving the resolution of
the grid, it is possible to increase the number of layers without creating computational
difficulties [80].
The inception module can be observed in the next figure:
Figure 3-7: Inception module: naïve version (left) and with dimensionality
reduction (right).
The model follows the "practical intuition that visual information should be processed at
various scales and then aggregated so that the next stage can abstract features from
the different scales simultaneously" [80].
As a result, when compared with AlexNet, the model used 12 times fewer parameters
(only five million) and improved the top-5 error by almost 9.73% [24, 35, 80].
This architecture can be observed in Figure 3-8. It is worth noting that the network
eliminates the fully-connected layers completely.
3.6.4 Convolutional Neural Network for Facial Expression Recognition
This architecture is similar to AlexNet, but the authors applied regularisation heavily
after each convolutional layer and each fully-connected layer. The authors omitted the
dropout value applied in each layer, but provided information about each convolutional
filter and the number of neurons in each fully-connected layer [81].
The dataset used in their project was taken from Kaggle and consists of about 37,000
gray-scale images of centred faces with dimensions 48x48. Moreover, the purpose of
their work was to recognise the same seven facial expressions recognised in this
dissertation [81].
The structure of the deep network used in this project is also shown in Figure 3-8. It is
important to notice that this model also increases the number of convolutional filters as
the size of the images is reduced after each max-pooling layer [81].
Figure 3-8: Architectures from left to right: AlexNet, VGG, GoogLeNet, and CNN
for facial recognition [35,79,80, 81].
Chapter 4. Research methodology
This chapter focuses on the research methodology used during this project and
describes its different phases. The chapter also details the components that were
selected during the development of this system and the justification for those decisions,
including an overview of the datasets used to train the DCNN model.
4.1 Project phases
This section elaborates on the phases the project went through. These phases were as
follows: data collection; pre-processing; system design; development; training and
testing; and evaluation phase.
4.1.1 Data collection
Considering that all machine learning algorithms are data driven, this stage was critical
for the system development given that selecting the incorrect dataset would have
increased the difficulty of the project. For instance, datasets with noisy backgrounds
would have unnecessarily imposed an additional challenge to the face detector
algorithm since the main objective of this project is not to detect faces.
There are different facial expression datasets available with varying image quality;
however, most high-quality datasets require permission for usage to be granted. This
permission is fairly easy to obtain, since most dataset owners only require the user to fill
in a form as a mere formality.
Therefore, after careful consideration, the KDEF and JAFFE databases were selected
to train the DCNN for this project, because they offered labelled images that contain the
required facial expressions against noise-free backgrounds. These databases are
described below:
4.1.1.1 KDEF database
The Karolinska Directed Emotional Faces (KDEF) database is a set of 4900
photographs displaying seven different human facial expressions. These images were
obtained by photographing 70 amateur actors (35 females and 35 males) twice from
five different angles [47], and contain people without beards, moustaches, earrings,
eyeglasses, or visible make-up.
Figure 4-1: Sample images from the KDEF dataset.
4.1.1.2 JAFFE database
The Japanese Female Facial Expression (JAFFE) database is a set of 213 photographs
of 10 Japanese female models displaying seven facial expressions (six basic facial
expressions and one neutral) [48].
These databases contain posed expression images, in which the subject is instructed to
show a specific emotion, and do not display any occlusion or varying levels of
illumination. It is important to notice that, although the datasets do not show occlusion, it
was possible to modify the images to include this element and, as a result, to increase
the size of the training set.
Figure 4-2: Sample images from the JAFFE dataset.
4.1.2 Pre-processing
In the second phase the data was cleaned and transformed. This phase was divided into
two parts: face detection and data modification.
4.1.2.1 Face detection
Face detection was employed to extract faces from the datasets using a boosted
cascade of simple features, as described by Paul Viola and Michael Jones in [52].
In this process, the method proposed by Viola and Jones was used. The approach trains
a machine learning model using Haar-like features in order to extract important
characteristics from an image. The effect of these features was to reduce by “over one
half the number of locations where the final detector must be evaluated” [52]. For face
detection, the authors utilised three types of features: the two-rectangle feature, which
calculates the difference between the sums of the pixels within two rectangular regions;
the three-rectangle feature, which calculates the difference between the sum of the
pixels within the central rectangular region and the sum of the pixels within the two
outside rectangles; and the four-rectangle feature, which calculates the difference
between the pixels within the diagonal pairs of rectangles [52]. These Haar-like features
can be observed in Figure 4-3.
Figure 4-3: Haar features [42].
To improve the evaluation time, they used an image representation, called integral
image, to obtain the sum of values in a rectangular area. The integral image is defined
as:
I(x, y) = ∑_{x′ ≤ x, y′ ≤ y} i(x′, y′)    (4-1)

Where I(x, y) is the integral image and i(x′, y′) is the original image. This represents
the sum of the pixels from the origin to the point (x, y). Therefore, as shown in
Figure 4-4, in order to calculate the sum of values in a rectangular area delimited by
(x_0, y_0) and (x_1, y_1), where Area_0 < Area_1, the following operation is performed:

sum = I(x_1, y_1) − I(x_1, y_0) − I(x_0, y_1) + I(x_0, y_0)    (4-2)
Figure 4-4: Area calculation using Integral Image.
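A short numpy sketch of equations 4-1 and 4-2 (an illustration only; OpenCV ships its own integral-image routine):

```python
import numpy as np

def integral_image(i):
    """Integral image of i (equation 4-1): each entry holds the sum of
    all pixels above and to the left, inclusive."""
    return i.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, x0, y0, x1, y1):
    """Sum of pixels in the rectangle delimited by (x0, y0) and (x1, y1),
    exclusive of row x0 and column y0, using equation 4-2."""
    return I[x1, y1] - I[x1, y0] - I[x0, y1] + I[x0, y0]

img = np.ones((6, 6))
I = integral_image(img)
print(rect_sum(I, 1, 1, 4, 4))  # 3x3 block of ones -> 9.0
```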
Despite simplifying the rectangular area calculation, performing this operation over the
whole image would still require a significant amount of processing. Ideally, it would be
preferable to focus on objects of interest and ignore parts of the image that do not
contain valuable information. The solution to this problem was a machine learning
model that combined “increasingly more complex classifiers in a ‘cascade’” [52], which
substantially decreased the face detection time. Given that the system processes
images in real time, it was reasonable to select this approach.
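In practice, OpenCV ships a pre-trained Viola-Jones cascade; a minimal usage sketch (the file and image names are illustrative and depend on the local installation) looks like this:

```python
import cv2

# Load OpenCV's pre-trained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

image = cv2.imread("subject.jpg")              # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]              # crop each detected face
```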
4.1.2.2 Data modification
After the previous step, the images were resized to either 48x48 or 96x96 (depending
on the experiment performed) and converted to grayscale images. Then, the data was
modified by applying one or a combination of the following methods: PCA, whitening,
and normalisation.
4.1.2.2.1 Principal Component Analysis
PCA is a pre-processing method that decorrelates the input data by projecting it into a
new coordinate system. PCA uses the eigenvectors of the covariance matrix to project
the data onto the new basis (which is assumed to be orthonormal). Then, by selecting
the components with the largest variance, the dimensionality of the input data can be
reduced.
In [73], the steps to calculate PCA are described:
Centralise data
The first step is to subtract the mean vector from all the instances in a given d × N
dataset X. The centred value is denoted by x̄ and defined as:

x̄ = x − μ    (4-3)

Where μ is the mean vector of x.
Calculate covariance matrix
This matrix is calculated with x̄ in the following form:

S = X̄ X̄^T / N    (4-4)
Perform Eigen analysis
Then, the eigenvalues are calculated using this matrix. Afterwards, the eigenvalues and
eigenvectors are ordered so that λ_1 ≥ … ≥ λ_d.
Find principal components
The eigenvectors of S corresponding to the M largest eigenvalues, λ_1 ≥ … ≥ λ_M, are
selected to form the projection matrix U_M = [u_1, …, u_M].
Encode data
In order to encode the data, the following formula is applied:
z = U_M^T (x − μ)    (4-5)

Where z is the encoded M-dimensional representation of x.
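The steps above translate directly into numpy; the following sketch (illustrative only, with instances stored as columns) follows equations 4-3 to 4-5:

```python
import numpy as np

def pca_encode(X, M):
    """PCA encoding of a d x N data matrix X, keeping M components."""
    mu = X.mean(axis=1, keepdims=True)
    X_bar = X - mu                              # centralise (4-3)
    S = X_bar @ X_bar.T / X.shape[1]            # covariance matrix (4-4)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]           # sort so l_1 >= ... >= l_d
    U_M = eigvecs[:, order[:M]]                 # projection matrix
    Z = U_M.T @ (X - mu)                        # encode (4-5)
    return Z, U_M, mu
```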
4.1.2.2.2 Whitening
Whitening is another process to standardise input data. According to [57], the geometric
interpretation of the whitening transformation is that "if the input data is a multivariable
Gaussian, then the whitened data will be a Gaussian with zero mean and identity
covariance matrix."
The transformation is defined as follows:
X_white = X_rot / S    (4-6)

Where X_rot is the decorrelated data (the dot product between the eigenvectors and
the input data), and S is a one-dimensional array that contains the singular values
(which, for the covariance matrix, correspond to its eigenvalues).
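A numpy sketch of this transformation, following the SVD-based formulation in [57], which divides by the square roots of the singular values and adds a small constant to avoid division by zero:

```python
import numpy as np

def whiten(X_bar):
    """Whiten centred d x N data X_bar via SVD of its covariance matrix."""
    cov = X_bar @ X_bar.T / X_bar.shape[1]
    U, S, _ = np.linalg.svd(cov)
    X_rot = U.T @ X_bar                       # decorrelate the data
    return X_rot / np.sqrt(S[:, None] + 1e-5)
```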
4.1.2.2.3 Normalisation
Normalisation is a process that involves zero-centring the input data, and dividing this
value by the standard deviation. This method is defined by the following operation:
x̄ = (x − μ_x) / σ_x    (4-7)

Where x ∈ R and x̄ ∈ R are the original and normalised feature vectors respectively, μ_x
is the mean of x, and σ_x represents the standard deviation of x.
After the pre-processing stage, the values were split into training and test sets in a
proportion of four to one.
4.1.3 System design
The decisions taken during this phase were crucial, given the repercussions that bad
design choices have in the software industry. For instance, the Obamacare website had
glitches that prevented people from signing up and increased the total development cost
to amounts ranging from $840 million to over $2 billion [55, 56].
With that in mind, a careful analysis of different elements was made in order to select
the most suitable components.
4.1.3.1 Elements involved
Sections 4.1.3.1.1 - 4.1.3.1.3 describe the elements that were used to train the DCNN
model, and create the GUI application.
4.1.3.1.1 Amazon Web Services and Graphics Processing Unit
Amazon Web Services (AWS) is a cloud services platform that offers database
solutions, storage, application hosting, compute power, etc. [28]. As the name suggests,
these services are provided by Amazon.com and are used by popular companies such
as Netflix, Spotify, Unilever, Yelp, to name a few.
The system was trained in an Elastic Compute Cloud (EC2) instance on AWS running
the Ubuntu operating system. There were some reasons for this: the first reason was
that the Ubuntu instance on AWS provides access to NVIDIA GPUs (which train
machine learning models faster than CPUs) with up to 1,536 CUDA cores and 4GB of
video memory per GPU [29]; the second reason was that, according to [35], training
machine learning models, particularly deep learning models, is a time-consuming
process ranging from a couple of hours to several days depending on the model
complexity and the processing power available; finally, another reason was the higher
cost of buying a GPU compared with the cost of renting an instance in AWS.
The GPU installed on the server is an NVIDIA GRID K520, based on the Kepler
architecture, which was designed to address "the most daunting challenges in High
Performance Computing (HPC)" [30], improving the performance of previous
architectures by up to three times. This card contains two GPUs; in other words, it
provides 3,072 CUDA cores and 8GB of video memory in total [31, 32].
4.1.3.1.2 Programming language
Selecting the correct programming language to develop the system was a daunting task
since, nowadays, there are more than 2000 high level programming languages [33].
Fortunately, this problem was simplified by the project requirements.
During the training phase, the programming language needed to be simple to learn and
apply, and, since the project involves neural networks, it needed to be compatible with
the GPU libraries. Moreover, it was important that it had enough documentation and
community support so that problems during the development cycle were solved within a
reasonable time.
During the GUI design and neural network implementation, the programming language
required to be also simple and accessible. Moreover, since the application classifies
images in real time, this language had to be fast enough to deal with this type of input
and the classification process.
After considering these factors, the list was reduced substantially. Among the remaining
languages, R, C++, Python, and Matlab are some of the most popular for machine
learning. R is regarded as a powerful language, but it has a steep learning curve. C++
is considered the best option when performance is critical, although it is difficult to learn
and even more difficult to optimise; moreover, its development cycle is longer than in
other programming languages. Python is a user-friendly programming language with a
variety of libraries that allow transparent use of GPUs. Matlab is a balanced alternative,
being user-friendly, fast, and acceptably performant, but the purchase of a licence is
required in order to use the main product and some of its libraries [34, 36].
After a reasonable analysis, Python was selected to train the DCNN, and to create the
GUI given that it has good performance, offers a gentle learning curve, and does not
require a licence to be bought.
4.1.3.1.3 Libraries and development environments
According to the programming languages selected, the following section describes the
elements used to develop the system.
Anaconda: Anaconda is a Python distribution that includes 400 packages for
science, math, engineering, and data analysis. It is easy to install and provides
flexibility when interacting with other languages [37].
CUDA and CuDNN: CUDA is “a parallel computing platform and programming
model invented by NVIDIA” that allows to take advantage of the power of the
GPU [38]. Similarly, CUDA Deep Neural Network (CuDNN) library is a collection
of primitives for DNN that provides “highly tuned implementations for standard
routines such as forward and backward convolution, pooling, normalization, and
activation layers” [39].
Theano: It is a Python library that helps to “define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently” [40]. It
features, among other things, efficient symbolic differentiation, transparent use
of a GPU, speed and stability optimizations, and dynamic C code generation [40].
Lasagne: Lasagne is a lightweight library that is used to build and train neural
networks using Theano [41].
OpenCV: It is an open source library with C++, Python, and Java interfaces that
offers more than 2500 optimized computer vision and machine learning
algorithms. These algorithms facilitate tasks such as: face detection, object
recognition, object tracking, etc. [43].
PyQt: It is a set of python bindings which are application programming interfaces
that provide code to use the Qt application framework. The Qt application
framework is a toolkit that contains abstractions not only of GUI components, but
also of network sockets, threads, etc. [78].
4.1.3.1.4 Justification
There were some technical factors that were considered in order to select these
elements, but special attention was given to three attributes: simplicity, familiarity, and
flexibility.
In the case of Anaconda, it was selected to simplify the installation process, and to
avoid disrupting other libraries and dependencies.
Regarding CUDA and CuDNN, these elements were selected to take advantage of the
NVIDIA GPUs on the servers.
Concerning Theano and Lasagne, they were selected because they offer a simpler and
more efficient implementation of deep learning algorithms. There were other alternatives
to Theano and Lasagne such as TensorFlow, Torch and Caffe; however, the
alternatives presented some disadvantages: TensorFlow is up to four times slower than
other deep learning tools because it does not support inline matrix operations [45];
Torch is a powerful computing framework widely regarded as one of the best options,
but due to some constraints in the available time, learning LuaJIT was not a viable
alternative [46]; finally, Caffe has good performance, but it is not as simple to use as
Lasagne [45].
In terms of image processing, OpenCV was selected because it contains the code
necessary to read webcam images. Moreover, it includes not only the functions required
for data augmentation, but also the Viola and Jones model for face detection which
significantly reduces the pre-processing phase.
Finally, PyQt was selected because it simplifies the development of the GUI and
provides a way to create appealing interfaces.
4.1.4 Development
The development phase was another important part of this process, in which
underestimating the development time or producing poor, complex code were
potential risks. Therefore, in order to mitigate these risks, a methodology based on
modular design was used. Modular design simplifies the development of software by
subdividing a system into smaller parts. The key idea of this methodology is that
each part performs a specific task, so that modifying a particular module affects the
other modules minimally or not at all.
This phase was divided in two parts: the training part and the application part.
In the training part the following modules were developed:
● The input module: It is in charge of extracting the features and labels.
● The DCNN module: It receives the features and labels, and creates the DCNN
model.
In the application part, the following modules were developed:
The GUI module: It is the main component. This module displays the input data,
the classification of the facial expression, and the likelihood of each expression.
The input processing module: it reads the input data depending on the source
(image or webcam) and processes the image following the same steps as the
input module in the training part.
The DCNN application module: it receives the pre-process vector and outputs the
classification and probabilities for all the expressions by using the previously
trained DCNNs.
4.1.5 Training and Testing
This stage focuses on training the models by varying the hyper parameters, applying
data augmentation techniques, and utilising different datasets; and studying the DCNN
by performing different experiments with them.
4.1.5.1 Model selection
Model selection requires choosing the DCNN architecture and the hyper parameters.
Selecting the best elements is not as simple as following a recipe given that there are
several variables involved in this process. For instance, using different datasets means
that hyper parameters need to be adjusted to achieve similar performance.
Subsections 4.1.5.1.1 and 4.1.5.1.2 list the elements that were considered during the
training and testing stage.
4.1.5.1.1 Hyper parameter tuning
Training an accurate model involved an important process named hyper parameter
tuning. This process is concerned with adjusting the hyper parameters, which are the
values that define the model, in order to achieve good performance. During this phase,
the following hyper parameters were tuned:
Learning rate: It conditions the number of iterations the model takes to converge
to a satisfactory solution by controlling the step size when moving towards the
global minimum or local minima. According to [74], this is often the “single most
important hyper-parameter and one should always make sure that it has been
tuned.” Values for a neural network can vary greatly, ranging from more than
10^{-6} to less than 1. However, a default value that works for standard neural
networks is 0.01 [62, 74].
Gradient descent optimization algorithm: This hyper parameter affects the
convergence speed as it specifies the algorithm used during gradient descent. As
it was mentioned in Chapter 2, there are six algorithms commonly utilised:
Momentum, Nesterov Accelerated Gradient, Adagrad, Adadelta, RMSprop, and
Adam.
Number of epochs: It controls the number of iterations the training process
will last. On the one hand, a small number will prevent convergence because the
training process will stop prematurely; on the other hand, a large number will
waste processing time without achieving better performance.
It is important to notice that when early stopping is used, the number of epochs
does not need to be set, given that, as described in section 2.3.1, the model will
stop as soon as its performance ceases to improve or decreases [74, 75].
Number of neurons in the FC layer: It specifies the number of neurons in the
fully-connected layer. A small number of neurons will prevent the network from
modelling complex data, while an excessive number of neurons will increase the
training time and overfit the data.
Number of FC layers: It specifies the number of layers which affects the type of
decision boundary that the architecture is able to model. However, the greater
the number of layers, the more complex the model and the more difficult it is to
select the hyper parameters so that the architecture can converge.
Size of max-pooling filters: Varying this value affects the spatial dimensions of
the input; however, the most common and recommended setting for this filter is a
2x2 filter. Bigger filters, as it was mentioned in section 3.5.1.3, are uncommon
because they affect the performance [16].
Size of convolutional filters: It controls how the DCNN responds to certain types
of features. When the filter is big, meaning 7x7 or more, the network can
overlook essential details in the input. When the filter is small, more details are
considered by the network, which is not always desirable either. 3x3 and 5x5
filters are the recommended sizes in any layer; however, 7x7 filters should only
be used in the first layer [16].
Number of convolutional filters: It controls the number of activation maps and
the features to which the network responds. This corresponds to the output
volume described in section 3.1.1.
Weight initialization: It is suggested that the weights be initialized randomly to
ensure symmetry breaking. If all the weights were initialized to zero, the
gradients during backpropagation would be identical, resulting, incorrectly, in the
same parameter updates [57].
Activation function: It controls the firing of the neurons. This is an important
hyper parameter when the architecture is formed by more than one layer. As
mentioned in section 2.3.2, vanishing and exploding gradients are two
consequences of backpropagation that are closely linked to the weight
initialization and the activation function. According to [58], the ReLU and maxout
activation functions are preferred as they mitigate these problems.
4.1.5.1.2 Architecture
As mentioned in section 3.6, different architectures have been proposed by
academics and companies. The experiments in this phase involved training
three models, namely AlexNet, VGG, and GoogLeNet, in order to compare their results.
4.1.6 Evaluation
This phase focuses on analysing and contrasting the results from the previous stage.
Different tools such as figures, tables and plots were used in order to do this and
answer the research questions described previously.
Chapter 5. Implementation
This chapter details the implementation process of both the training and application
modules. Section 5.1 describes the main purpose of the modules used to train the
DCNNs, while section 5.2 provides information about the AFER modules used in the
main system.
5.1 Training modules
The section outlines the main purpose of each module in order to create the DCNN
model. The developed modules in this section were: input module and DCNN module.
5.1.1 Input module
The input module is in charge of extracting features and labels from the dataset, and of
pre-processing the images.
First, the images are converted to grayscale. Then, the face is located in each
image and the values are transformed into numpy arrays. This process is shown in
Figure 5-1.
Figure 5-1.
Figure 5-1: Face detection and numpy array creation.
After that, each column in this array is normalised by removing the mean and dividing
by the standard deviation. Additionally, as it will be described in Chapter 6, some
experiments involved decorrelating and whitening the data by computing the SVD
factorization of the covariance matrix, and dividing the decorrelated data by the
eigenvalues. Finally, this data is sent to the DCNN module where it is divided into train,
validation, and test sets, and used to train the DCNN model.
5.1.2 DCNN module
The DCNN module creates the model using the features and labels created by the
previous module. As part of the creation of the DCNN architecture, the input data is
separated into training, validation and testing sets. The validation set is used to obtain
the best performing model by tuning the hyper parameters. Without this set, the model
risks overfitting, given that the hyper parameters would be tuned specifically for
the test data and would generalise poorly [27]. The test set is used to assess the
performance of the fully-trained DCNN.
This module combines Lasagne and nolearn in order to abstract the creation of the
DCNN and train the network. While Lasagne provides various classes that represent the
layers of a neural network and simplify the use of theano variables, nolearn abstracts
the creation of neural networks by combining different Lasagne layers. So, instead of
creating various Lasagne layer objects, and combining them, the only object that is
created is a nolearn neural network [41, 44].
The result of the DCNN module is a trained model that is stored as a pickle file so that
the AFER modules can use it to predict the facial expression.
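As an illustration of this abstraction, a network declaration in the nolearn/Lasagne style might look as follows; the layer sizes here are illustrative and are not the exact architecture used in this project:

```python
from lasagne import layers
from lasagne.nonlinearities import softmax
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('hidden1', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    input_shape=(None, 1, 48, 48),          # grayscale 48x48 face crops
    conv1_num_filters=32, conv1_filter_size=(3, 3),
    pool1_pool_size=(2, 2),
    hidden1_num_units=128,
    output_num_units=7,                      # seven facial expressions
    output_nonlinearity=softmax,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=100,
)
# net.fit(X_train, y_train) trains the model, which can then be stored
# with pickle as described above.
```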
5.2 AFER modules
After the DCNN model was created, it was used to identify facial expressions in images.
There are three components of the AFER system: Graphical User Interface module,
input processing module, and DCNN application module.
5.2.1 Graphical user interface module
This module is designed with simplicity in mind so that any user can quickly interact with
the system. As shown in the figure below, the interface is austere. The main window
has an image area, a button to load the data, an output section and a lateral part that
contains seven meters representing all the emotions the system is capable of
recognizing.
Figure 5-2: AFER Graphical User Interface.
When the “Select input Data” button is pressed, the user is prompted with a dialog box
to decide whether the input should come from a file or webcam. Clicking on the
“Images” button displays a file selector that returns the image filename, and adds two
buttons to the interface: “Detect face” and “Show emotion”. Similarly, clicking on the
“Webcam” button creates a webcam object that is returned to the GUI.
In the case of using files, the new buttons control the rest of the process. Pressing
“Detect face” calls the face detector function from the input processing module which
locates the face and outputs the coordinates of the face in order to enclose it in a
rectangle. Then, clicking “Show emotion” calls the DCNN application module.
In the case of utilising the frames from the webcam, the system calls the input
processing module for each frame collected at equally spaced intervals in order to
transform this frame into a numpy array. Then, it continues to the next module.
After clicking “Show emotion” or calling the input processing module, the DCNN
application module is called to identify the facial expression. The output values of this
function are: a vector containing the likelihood of the expression and an index that
indicates the position of the highest ranking expression. After that, the system displays
the result of this classification (i.e. the name and emoji of the emotion) in the lower
section of the GUI. Additionally, the meters and labels are modified according to the
likelihood of each expression. Each path described above is shown in figures 5-3 and 5-
4; and the whole process is summarised in Figure 5-5.
Figure 5-3: Result from input image.

Figure 5-4: Result from input video.

Figure 5-5: AFER process.
5.2.2 Input processing module
This module is necessary to process the input data in the same fashion as the data
used to train the model. Unlike the pre-processing module described before, this
component is part of the GUI module and is only called at specific intervals when the
webcam is used as input, or after pressing “Detect face” when images are used as input.
The module begins by receiving the image or a frame from a webcam. If the input is an
image, the process follows the same steps as described in section 5.1.1; if the
input is obtained from a webcam, it is necessary to convert the frame from BGR
(Blue-Green-Red) to 24-bit RGB (Red-Green-Blue) before proceeding with the same
steps. This conversion reorders the colour channels without modifying the original
values. The OpenCV library works with the BGR colour format because it used to be
popular “among camera manufacturers and software providers” [59, 60], which forced
the developers to adopt it.
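A minimal OpenCV sketch of this step:

```python
import cv2

capture = cv2.VideoCapture(0)                      # default webcam
ret, frame = capture.read()                        # OpenCV delivers frames in BGR order
if ret:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # reorder channels; values unchanged
capture.release()
```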
Once the image is transformed into a numpy array, the same steps described in the
input module are followed. After that, the DCNN application module is automatically
called if the webcam is used or it is called after the user presses “Show emotion” if input
files are used.
5.2.3 DCNN application module
This module receives the numpy array containing the face and identifies the facial
expression. The module returns two values: the index of the most likely expression, and
the likelihood of each expression.
There were two approaches taken to identify the facial expression. The first approach
consisted of using the single most accurate DCNN, while the second consisted of using
an ensemble of the most accurate models and taking the most voted facial expression.
It is important to notice that the accuracy increased as more models were added.
However, the models used in the ensemble were required to differ from each other, by
varying the hyper parameters and the architecture, in order to avoid overfitting.
Chapter 6. Experiments
This chapter details the experiments performed during the training stage to evaluate the
best hyper parameters and DCNN model; and the experiments carried out during the
test stage in order to assess the accuracy of the models.
6.1 Training stage
In order to evaluate the DCNN model and hyper parameters during this phase, the
accuracy of the model was tested against popular benchmark facial expression
datasets, namely the KDEF and JAFFE databases described in section 4.1.1. This
section is divided into two parts. In section 6.1.1, the objective of the experiments was to
test different combinations of hyper parameters. In section 6.1.2, the purpose of the
experiments was to test additional steps to improve the performance of the DCNN.
6.1.1 Hyper parameters
In this section, each set of experiments was performed five times using the same
architecture and, depending on the test, the same dataset. The results were obtained by
averaging the outputs. The experiments are described next.
6.1.1.1 Learning rate
The tests in this part consisted of training a model using a specific Gradient Descent
Optimization (GDO) algorithm while varying the learning rate. These tests were
performed to find the ideal combination of GDO algorithm and learning rate.
The experiments were performed using a model containing two convolutional layers
with 32 and 64 filters, respectively; and two fully-connected layers with 512 neurons
each and dropout with 𝑝 = 0.5 after the first FC layer. The learning rate values were
varied from 0.001 to 0.5 according to the recommended values [16, 49].
The results of these tests are displayed in the following image:
Figure 6-1: Accuracy of applying diverse GDO algorithms after varying the
learning rate.
In Figure 6-1, three things can be noticed: 1) RMSprop and Adam perform better when
the learning rate is very small; 2) SGD, Adagrad, and Adadelta show ordinary results
throughout the experiments; and 3) Nesterov momentum and Momentum display more
stable accuracy for medium learning rates, but they show poor performance when the
learning rates are either very small or very big.
Moreover, these results can be explained by two factors: the first is that a
training set can be biased and contain images that favour specific weight values; the
second is that very small or very big learning rates affect the weight
convergence.
Finally, it can be observed that the learning rate should be selected carefully since it
controls the convergence speed of the model and can also affect the performance of the
GDO algorithm.
6.1.1.1.1 KDEF dataset learning rate
The tests in this subsection were performed to determine whether the same learning
rate applied to the JAFFE dataset was also useful in the KDEF dataset or not. For these
experiments, a network was formed using seven layers: two convolutional layers
followed by max-pooling layers, two fully-connected layers with 128 neurons with
dropout in the first layer, and one fully-connected layer with 7-way softmax in the output
layer. Moreover, the Nesterov momentum GDO algorithm and the ReLU activation
function were used in these networks.
The results of these experiments are shown in the following figure:
Figure 6-2: Validation and test accuracy for different learning rate values.
It can be seen that a learning rate of 0.001 achieves the highest validation accuracy and
the third highest test accuracy. This suggests that the model finds it difficult to generalise
when tested on the JAFFE dataset. However, as in the previous section, Nesterov
momentum with a medium learning rate works better than the same GDO algorithm with
a higher or lower learning rate.
6.1.1.2 Gradient descent optimization algorithm
As mentioned in Chapter 2, GDO algorithms are used to avoid suboptimal points
and find values that achieve high performance. In this section, the highest accuracy
that each GDO algorithm achieved in section 6.1.1.1 is selected and compared.
The results are shown in the following plot:
Figure 6-3: Highest accuracy result of each GDO algorithms.
It can be observed in Figure 6-3 that Momentum, RMSprop, Adam and Nesterov
momentum achieved the highest accuracy. While Adam and RMSprop reached that
performance using a learning rate of 0.001, Nesterov momentum and Momentum
obtained the same performance with a learning rate of 0.015.
6.1.1.3 Size of max-pooling filters
In order to identify the best max-pooling-filter size, the experiments in this section were
performed by varying the size of the filter. They involved three sizes: small (2x2),
medium (4x4), and big (10x10). The outcome of this experiment can be observed in the
following figure:
Figure 6-4: Accuracy after varying the size of the max-pooling filter.
It can be noticed that the best size for the max-pooling filter is 2x2 as it achieved the
highest accuracy. This value is confirmed in [16] where it is noted that "the most
common setting is to use max-pooling with 2x2 receptive fields and with a stride of 2"
given that bigger filters are aggressive which, as it can also be observed with the 10x10
filter (red line), "leads to worse performance."
6.1.1.4 Activation function
The experiments in this section involved applying the activation functions described in
section 2.2.2.1.2. These tests were performed using the same pre-processed dataset,
but some hyper parameters were modified to speed up the process. The modified
values were: the number of iterations which was reduced to 50 epochs; the number of
neurons in the fully-connected layers which was reduced to 128, and the dropout layer
which was removed.
The results of these experiments are displayed in the next figure:
Figure 6-5: Accuracy for different activation functions.
Figure 6-5 shows the behaviour of three activation functions, namely tanh, sigmoid and
ReLU. The ReLU function achieved the highest accuracy among the activation
functions; the sigmoid function failed to converge; and the tanh function displayed an
erratic behaviour. This information is in line with the results in [35] and [58]. Particularly,
in the second reference, it is suggested never to use the sigmoid or tanh function to
train a DCNN; instead, ReLU or maxout should be preferred in order to obtain better
performance.
6.1.1.5 Number of neurons in the FC layer
The experiments performed in this section involved building models with different
number of neurons in the fully-connected layers. The network consisted of two fully-
connected layers with dropout in the first layer and one fully-connected layer with 7-way
softmax in the output layer. Additionally, the activation function used was ReLU, and the
max-pooling-filter size was 2x2.
The results from these tests are presented in the next table:
Figure 6-6: Validation and test accuracy for FC layers with varying number of
neurons.
As Figure 6-6 shows, using more than 128 neurons achieves similar validation
accuracies. The fact that the validation accuracy is greater than the test accuracy could
be explained by the model having overfitted the data, which makes generalisation
harder to obtain.
An important point to observe is that a small number of neurons achieves lower accuracy,
given that the function to model is more complex than what such networks are capable
of modelling.
6.1.1.6 Number of fully-connected layers
This experiment involved adding convolutional neural networks and evaluating their
accuracy. The network was formed by two convolutional layers followed by max-pooling
layers; a varying number of 128-fully-connected layers with dropout regularization
method; and one fully-connected layer with 7-way softmax in the output layer.
The results from these tests are shown in the following table:
Figure 6-7: Validation and test accuracy for different number of FC layers.
As can be noticed in this figure, the accuracy decreases when the network is formed
by 10 layers or more. This is an expected result, as a network becomes more difficult to
train the more layers it has. In order to solve this problem, care should be taken
when selecting other hyper parameters such as the learning rate or the size of the max-
pooling filters [54, 62, 74, 75].
Figure 6-7 shows that five FC layers or fewer are capable of achieving high accuracy.
In particular, three FC layers obtained a good validation accuracy and the highest test
accuracy.
6.1.2 Other alternatives to improve the DCNN
In addition to modifying the hyper parameters listed in the previous section, other
methods were used to improve the performance of the DCNN, namely data pre-
processing, data augmentation and the dropout technique.
6.1.2.1 Data Pre-processing
The first alternative was to pre-process the input data. [57] describes three methods
(detailed in Section 4.1.2.2) that were used to perform this task, namely normalisation,
PCA and whitening.
In this set of experiments, normalisation, PCA, and whitening were used to modify the
input data. The first experiment normalised the data; the second, applied PCA to the
data; the third, whitened the data; and the last applied a combination of all of them.
The results can be observed in the following figure:
Figure 6-8: Accuracy of different pre-processing methods.
Based on these results, it is worth noting that PCA and whitening performed slightly
better than the scenario in which no pre-processing was applied. However, they
performed worse than normalising the input data alone. In [57], it is mentioned that PCA
and whitening are not used with CNNs, and the previous result confirms this assertion.
Additionally, combining the different techniques was expected to increase the performance,
since “scaling the data has a substantial effect on the result obtained” [69]. However, it
did not perform better than normalisation alone. Instead, applying all the
techniques seems to average the accuracies.
6.1.2.2 Data augmentation
This process was applied to the original datasets in order to increase the number of images. The transformations used on the input data were rotation, translation, zooming, flipping, random cropping, gamma correction, and occlusion.
Rotation was applied by turning the image 90 degrees clockwise and counter-clockwise; translation consisted of randomly shifting images between -4 and 4 pixels; zooming was the result of randomly extracting sections of images and enlarging them; flipping was applied both vertically and horizontally; random cropping involved arbitrarily trimming the images; gamma correction was applied by modifying the luminance of the image; and, finally, occlusion was used to obstruct parts of the image. A code sketch of some of these transformations is given below.
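As an illustration, the following sketch shows how some of these transformations can be generated with OpenCV and NumPy (the shift range mirrors the ±4 pixels mentioned above; the crop size is an assumption):

    import cv2
    import numpy as np

    # img is assumed to be a greyscale face image stored as a NumPy array.

    def rotate90(img, clockwise=True):
        return np.rot90(img, k=3 if clockwise else 1)   # np.rot90 is counter-clockwise

    def translate(img, max_shift=4):
        tx, ty = np.random.randint(-max_shift, max_shift + 1, size=2)
        M = np.float32([[1, 0, tx], [0, 1, ty]])        # affine shift matrix
        return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

    def flip(img, horizontal=True):
        return cv2.flip(img, 1 if horizontal else 0)    # 1: horizontal, 0: vertical

    def zoom(img, crop=40):
        # extract a random section and scale it back up to the original size
        h, w = img.shape[:2]
        y, x = np.random.randint(0, h - crop), np.random.randint(0, w - crop)
        return cv2.resize(img[y:y + crop, x:x + crop], (w, h))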
These transformations can be observed in the following figure:
(Panels: Original; Zoom and Translation; Flip and Rotation; 30% Occlusion; Gamma variation; Random crop)
Figure 6-9: Transformations applied to one image.
The transformations were added to the original dataset in order to create six datasets, namely the zoom, gamma variation, flip, medium occlusion, and random crop datasets; and a dataset with all the modifications combined. Given that these transformations were applied randomly to the original dataset in order to prevent overfitting, the size of each dataset is different. For these tests, the number of images in each dataset is described in the following table:
Table 6-1: Number of images in each dataset.

Transformation      Original   Flip   Gamma   Occlusion   Random Crop   Zoom   All
Number of images    213        864    973     639         1059          333    6727
Given that the dataset was increased, more layers were added to the DCNN. The
architecture was formed by nine learned layers: three convolutional layers followed by
max-pooling layers, two fully-connected layers with 512 neurons and dropout, and one
fully-connected layer with 7-way softmax.
The results of these experiments are shown below:
Figure 6-10: Accuracy of using different data augmentation techniques.
In this figure, it can be seen that different data augmentation techniques affect the accuracy of the model in different ways. While some techniques, such as gamma correction or occlusion, improved the accuracy, other techniques, such as flipping or zooming, achieved lower accuracy levels.
On the one hand, the higher accuracy in the case of occlusion and gamma correction might be explained by the fact that the modified images are more similar to the original than those produced by the other transformations. While the other methods change the position or the size of features, which in turn creates a different activation map, occlusion and gamma correction modify the features but preserve their position and size, which creates a similar activation map and reinforces what the neurons in the same positions have learned.
On the other hand, the lower accuracy obtained with the other transformations can be explained by the fact that they decrease overfitting and improve generalisation: creating different versions of the input image forces the model not to favour any particular characteristic of the data.
Moreover, an interesting outcome was observed when all the data augmentation methods were combined: although the accuracy was lower, the convergence rate in these experiments was faster than in the other experiments.
6.1.2.3 Dropout
Dropout is a regularisation technique used to decrease the tendency to overfit. The basic idea of this layer is that it randomly removes individual activations during the training process. The experiments in this stage consisted in comparing the results of models trained using normalisation alone against models trained using normalisation and dropout combined. The dropout layer retained each activation with probability 𝑝 = 0.5, drawn from a Bernoulli distribution.
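As a minimal illustration of the idea (inverted dropout in NumPy, following the description in [57]; this is a sketch, not the lasagne implementation actually used in the experiments):

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        if not training:
            return activations                            # no change at test time
        mask = (np.random.rand(*activations.shape) < p) / p  # keep with probability p
        return activations * mask                         # rescale so expectations match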
The results from these experiments are shown in the next figure:
Figure 6-11: Accuracy of using normalisation and dropout.
Figure 6-11 shows that normalisation alone achieves a higher accuracy than normalisation and dropout combined. This is expected: dropout breaks co-adaptation by ignoring certain neurons at a given time, which means that a different subset of neurons is active in every iteration, breaking the dependency on learning specific features in the presence of other neurons [23, 61]. As a consequence, a dropout network needs more training iterations to reach the same accuracy. This result is in line with the observations made in [61]: "One of the drawbacks of dropout is that it increases training time. A dropout network typically takes 2-3 times longer to train than a standard neural network of the same architecture."
6.2 Test stage
In this section, the objective was to test the trained networks built in the previous part. Section 6.2.1 describes the occlusion and illumination tests. Section 6.2.2 presents the tests performed using other architectures. Finally, section 6.2.3 details additional tests performed in order to improve the accuracy of the models created in section 6.1.1.
6.2.1 Difficult scenarios
The accuracy a model obtains under perfect conditions is not as important as the accuracy it achieves under difficult conditions. This section tests the networks against difficult scenarios in order to study their effect on the accuracy of the models.
6.2.1.1 Occlusion and gamma variation tests
In these experiments, three models were tested against three sets of facial expression images with varying levels of occlusion and illumination. As mentioned before, the datasets did not contain images showing different levels of occlusion and illumination. Therefore, it was necessary to modify the images to artificially include these characteristics, by inserting random black rectangles covering 10, 30 and 50% of the image, and by varying the level of illumination from low to high.
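A sketch of how these two modifications can be generated is shown below (the rectangle placement and the gamma values are illustrative assumptions):

    import numpy as np

    def occlude(img, fraction=0.3):
        # insert a random black rectangle covering roughly `fraction` of the image
        h, w = img.shape[:2]
        rh, rw = int(h * np.sqrt(fraction)), int(w * np.sqrt(fraction))
        y, x = np.random.randint(0, h - rh + 1), np.random.randint(0, w - rw + 1)
        out = img.copy()
        out[y:y + rh, x:x + rw] = 0
        return out

    def gamma_variation(img, gamma=0.5):
        # gamma < 1 brightens the image, gamma > 1 darkens it
        norm = img.astype(np.float32) / 255.0
        return np.uint8(np.clip(norm ** gamma, 0.0, 1.0) * 255)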
The next figure shows the different types of occlusion applied to the dataset.
Figure 6-12: Images with 10% (top), 30% (middle) and 50% (bottom) occlusion.
Similarly, the figure below contains a sample of the gamma variation applied to the
images in the dataset. The first row from top to bottom has the original images; the
second row contains low levels of illumination; the third row corresponds to medium
levels; and the last row is formed by images with high levels of illumination.
Figure 6-13: Original images (first row) and images with low (second row),
medium (third row), and high (fourth row) levels of gamma variation.
The models contained 13 layers, of which five were convolutional layers, each followed by a max-pooling layer; two were fully-connected layers with dropout; and one was a fully-connected layer with a 7-way softmax. They were trained with datasets formed by KDEF images occluded by 10, 30 or 50%; by images with low, medium and high levels of illumination; and by the original image dataset. Then, each model was tested against datasets formed by JAFFE images occluded by 10, 30 or 50%, and one set without any modifications, which was used as a baseline.
The results of the occlusion tests are shown in the following table:
Table 6-2: Occlusion test results.

Model \ Dataset   No Occlusion   Occlusion 10%   Occlusion 30%   Occlusion 50%
Occlusion 10%     34.74          33.57           28.64           24.18
Occlusion 30%     39.44          38.26           37.32           24.65
Occlusion 50%     38.97          39.44           39.91           29.58
This table shows that, in most cases, the models performed better when the dataset did not contain occlusion and that, as would be expected, the accuracy decreased as the level of occlusion increased. This decrease was especially severe in the case of images occluded by 50%, where the accuracy was reduced by an average of 9% with respect to the accuracy on images occluded by 30%.
Additionally, it can be noticed that the third model, the one trained with images containing 50% occlusion, obtained the best accuracy of all the models on all the datasets containing occlusion, particularly the dataset with 30% occlusion. This can be explained by the fact that an occlusion of 50% is a drastic reduction of information that forces the DCNN to learn more robust features and reduces overfitting. However, the disadvantage of images occluded by 50% is that recognising facial expressions becomes more difficult. For instance, discerning between anger and disgust, or between surprise and fear, is harder when the mouth is covered.
These results can also be visualised in the figure below:
Figure 6-14: Accuracy for each DCNN model.
Similarly, the results of the illumination tests are displayed in the next table:
Table 6-3: Illumination test results.

Dataset   Original   Low illumination   Medium illumination   High illumination
Model     38.97      35.65              38.41                 40.69
From the previous table, it can be noticed that varying the level of illumination does not have a strong impact on the accuracy of the model. This can be explained by the fact that the modified datasets contain a similar amount of information to the original dataset. When the illumination is low, some features are hidden and the accuracy decreases, since this is similar to occluding the image. However, when the illumination is high, most features are preserved and the accuracy increases, in a way similar to augmenting the data.
6.2.2 Innovative architectures
This section tests the architectures described in section 3.6, namely AlexNet, VGG, and GoogLeNet.
The architectures used in each experiment are shown in the following figure:
Figure 6-15: Architectures: AlexNet (Left), VGG (Centre), and GoogLeNet (Right).
These models were configured using the values described in their respective papers, with the exception of the output layer, which was modified to match the seven facial expressions. The architectures were trained using the augmented KDEF dataset for 40 iterations and tested using the JAFFE dataset. In the case of AlexNet, three models were trained and tested, both separately and collectively (as an ensemble); in the case of GoogLeNet and VGG, only one model of each was created.
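The exact combination rule of the ensemble is not detailed here; a common approach, shown as an assumption in the sketch below, is to average the per-class softmax probabilities of the individual models and pick the most likely expression:

    import numpy as np

    def ensemble_predict(models, X):
        # models: trained classifiers exposing predict_proba (as nolearn's NeuralNet does)
        probs = np.mean([m.predict_proba(X) for m in models], axis=0)
        return np.argmax(probs, axis=1)   # most likely of the seven expressions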
The results from these experiments are shown in the following table:
Table 6-4: AlexNet, GoogLeNet, and VGG accuracy tests.

Model              Accuracy
AlexNet 1          14.91
AlexNet 2          14.05
AlexNet 3          13.63
AlexNet Ensemble   14.26
VGG                32.96
GoogLeNet          29.75
As this table shows, after 40 iterations the accuracies of the AlexNet models and of the ensemble are lower than those of VGG and GoogLeNet. Moreover, the ensemble obtained a lower accuracy than AlexNet 1 because AlexNet 3 had a lower accuracy and, once the models were combined, the errors of the third model affected the overall performance. It is expected that the more models with different hyperparameters are added to the ensemble, the better the accuracy.
6.2.3 Additional tests
This section was added in an attempt to improve accuracy, given that in the previous sections the validation accuracy was high but, once the models were assessed using the test datasets, the accuracy diminished notably.
The first experiment was to test the models against different datasets, using both versions (original and augmented), in order to determine whether the models were overfitted or underfitted.
The following figure shows the results of these experiments:
Figure 6-16: Test accuracy for different models using different datasets.
It can be noticed that models 1 and 2 have high accuracy when tested with the augmented KDEF dataset, as this was the dataset used to train them. However, the other values show low accuracy, which might mean that the models overfitted the data. This seems to be the case: the accuracy on the original KDEF dataset is lower than on the augmented KDEF dataset, which means that the models did not generalise correctly to images outside of the training data. They learned the modified dataset better than the original dataset because there were numerous modified images for each original image. Therefore, in order to improve this result, it was necessary to use techniques to reduce overfitting.
One alternative to reduce overfitting is to enlarge the training set. In order to do this, it is possible to combine the JAFFE and KDEF datasets, while leaving some images out of the training set so that the accuracy of the model can still be tested on them.
The new model was formed by 14 layers, of which five were convolutional layers, each followed by a max-pooling layer; three were fully-connected layers; and one was a fully-connected layer with a 7-way softmax. Each layer used the ReLU activation function and the dropout regularisation technique. Additionally, a learning rate of 0.008 and Nesterov momentum were used.
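A sketch of how such a network can be declared with the lasagne and nolearn libraries used in this project is given below; the input size, filter counts, neuron counts and momentum value are assumptions, and not all of the convolutional and fully-connected blocks are written out:

    from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                                DenseLayer, DropoutLayer)
    from lasagne.nonlinearities import rectify, softmax
    from lasagne.updates import nesterov_momentum
    from nolearn.lasagne import NeuralNet

    net = NeuralNet(
        layers=[
            (InputLayer,     {'shape': (None, 1, 96, 96)}),   # assumed input size
            (Conv2DLayer,    {'num_filters': 32, 'filter_size': (3, 3),
                              'nonlinearity': rectify}),
            (MaxPool2DLayer, {'pool_size': (2, 2)}),
            (Conv2DLayer,    {'num_filters': 64, 'filter_size': (3, 3),
                              'nonlinearity': rectify}),
            (MaxPool2DLayer, {'pool_size': (2, 2)}),
            # ... three further convolution/max-pooling blocks ...
            (DenseLayer,     {'num_units': 512, 'nonlinearity': rectify}),
            (DropoutLayer,   {'p': 0.5}),
            (DenseLayer,     {'num_units': 512, 'nonlinearity': rectify}),
            (DropoutLayer,   {'p': 0.5}),
            (DenseLayer,     {'num_units': 7, 'nonlinearity': softmax}),  # 7-way softmax
        ],
        update=nesterov_momentum,      # Nesterov momentum, as in the experiments
        update_learning_rate=0.008,    # learning rate used in this experiment
        update_momentum=0.9,           # assumed momentum value
        max_epochs=40,
    )
    # net.fit(X_train, y_train) trains the model; net.predict(X_test) classifies new faces.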
The accuracies for this experiment are shown in the following table:
Table 6-5: Validation and test accuracy for improved model and ensemble.

Models                   Validation (Combined)   Test (JAFFE)   Test (KDEF)
Combined dataset model   84.33                   75.07          64.89
All models combined      n/a                     44.74          78.95
In Table 6-5, it can be noticed that the test accuracy improved substantially. One reason for this is that care was taken when selecting the hyperparameters, particularly the learning rate, the number of convolutional and fully-connected layers, and the value of the dropout probability after each layer. As a result, a deeper network was created.
Another reason is that the datasets used to train the model share similar characteristics with the images utilised to test the network, such as the quality of the pictures or the race of the people in the dataset. This caused the DCNN to learn these features as part of the model, instead of learning more relevant facial expression features.
In addition, the accuracy on the KDEF test set is higher for the combined ensemble of all models than for the new model, given that all of those models were trained using this dataset.
Another way to analyse this is to calculate performance metrics for both models. Two of these metrics are precision and recall. Precision is the proportion of predictions of a facial expression that are correct, i.e. TP / (TP + FP); recall is the proportion of actual instances of a facial expression that are correctly identified, i.e. TP / (TP + FN), where TP, FP and FN denote true positives, false positives and false negatives.
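These per-emotion metrics can be computed with scikit-learn; a minimal sketch, assuming y_true and y_pred are hypothetical arrays holding the true and predicted labels (0: Fear, ..., 6: Surprise):

    import numpy as np
    from sklearn.metrics import classification_report, confusion_matrix

    emotions = ['Fear', 'Anger', 'Disgust', 'Happiness',
                'Neutral', 'Sadness', 'Surprise']

    # hypothetical labels; in practice y_pred would come from net.predict(X_test)
    y_true = np.array([0, 1, 2, 3, 4, 5, 6, 3, 4, 0])
    y_pred = np.array([0, 1, 2, 3, 4, 5, 0, 3, 4, 6])

    print(classification_report(y_true, y_pred, target_names=emotions))
    print(confusion_matrix(y_true, y_pred))   # rows: true labels, columns: predictions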
The following tables show the precision and recall per emotion obtained by each model when tested against both datasets:
Table 6-6: Precision and recall for both models tested against the KDEF dataset.

              Combined dataset model     All models combined
Emotion       precision    recall        precision    recall
Fear          0.71         0.54          0.79         0.81
Anger         0.52         0.55          0.72         0.81
Disgust       0.68         0.58          0.92         0.78
Happiness     0.81         0.83          0.87         0.89
Neutral       0.63         0.71          0.96         0.77
Sadness       0.48         0.66          0.54         0.87
Surprise      0.83         0.69          0.98         0.61
Avg/total     0.67         0.65          0.83         0.79

Table 6-7: Precision and recall for both models tested against the JAFFE dataset.

              Combined dataset model     All models combined
Emotion       precision    recall        precision    recall
Fear          0.95         0.66          0.54         0.59
Anger         0.68         0.81          0.40         0.29
Disgust       0.91         0.68          0.81         0.19
Happiness     0.79         0.79          0.66         0.49
Neutral       0.71         0.83          0.50         0.61
Sadness       0.53         0.83          0.23         0.67
Surprise      0.86         0.70          0.81         0.33
Avg/total     0.79         0.75          0.57         0.45

It can be noticed that the new model achieved higher results than all the models combined on the JAFFE dataset. Conversely, the new model obtained lower values
when the KDEF dataset was tested. As mentioned above, both results are explained by the datasets used to train these models.
For the new model, the confusion matrix is shown in the following figure:
Figure 6-17: Confusion matrix for the new model trained with both KDEF and
JAFFE datasets (0: Fear, 1: Anger, 2: Disgust, 3: Happiness, 4: Neutral, 5:
Sadness, and 6: Surprise).
Figure 6-17 shows that the new model is capable of predicting happiness, sadness, surprise and the neutral state with high accuracy; however, the network has some problems when it tries to classify anger, disgust and fear. The first two, anger and disgust, are sometimes predicted in place of each other, while fear is confused mostly with surprise. Overall, the new model obtains reasonable test accuracy on both datasets.
Chapter 7. Discussion
This chapter describes the strengths and weaknesses of the project and the knowledge acquired through its development. Additionally, it discusses directions for future work.
7.1 Creating a Deep Convolutional Neural Network
Building a machine learning model based on Deep Convolutional Neural Networks was a challenging but interesting journey. Throughout this project, the ideas developed by academics such as Geoffrey Hinton, Alex Krizhevsky, Andrew Ng, Yoshua Bengio and Yann LeCun were instrumental not only in improving the system, but also in making complex concepts approachable.
During the development phase, designing, training and testing the DCNN proved to be hard: the literature was abundant, and building a good model was not as simple as replacing the input data with the new datasets used in this project. In fact, applying the same settings to two different datasets will not achieve the same performance.
In the design phase, the literature offered a copious set of alternatives, including the work of academics such as Alex Krizhevsky, Yann LeCun and Karen Simonyan. The main limitation to implementing the different types of architectures was the training time, which increased with factors such as the number of images or the number of convolutional filters. In some cases, these architectures required days of training in order to achieve good performance.
In fact, there is a trade-off between the complexity of the network and the processing
time. The greater the number of layers, number of neurons, or number of filters the
network has, the higher the processing time. Therefore, it is important to evaluate
whether increasing the number of elements is worth the processing time, but it is also
important to consider that adding more elements does not guarantee that the
performance of the network will be better.
During the training stage, the selection of the hyperparameters was based on trial and error within a range of recommended values and methods [16, 49, 57]. However, these recommendations were derived from different types of datasets than the ones used in this project, which meant that, in some cases, the results obtained from the experiments diverged from what the literature reported.
As far as possible, all the experiments were performed by varying only one element at a time, namely the hyperparameters, the input data, or the architecture; however, in some tests one of these elements had to be modified from one experiment to the next. For instance, during the data augmentation and occlusion tests, the input data was re-created every time a new transformation was applied, which added some uncertainty to the process since, as mentioned in chapter 6, the input data had to be transformed randomly.
The next stage, testing, was performed using a separate dataset created from images that were excluded from the original dataset. A model was then trained using the best hyperparameters discovered during the experiments, and the test dataset was used to evaluate its performance. This whole process proved to be time-consuming, given that a mistake at any point could affect other experiments. For example, if the experiment to select the gradient descent optimisation algorithm had been performed incorrectly, the experiment to select the learning rate of the model would have had to be repeated.
Moreover, based on the results, it seems that some models overfitted, given that the validation accuracy was much higher than the test accuracy, which was below 50%. The models failed to generalise because they learned the noise in the training data as patterns, and when new data was processed this negatively impacted their ability to recognise facial expressions.
In an attempt to fix the discrepancy between these values, a new model was trained
combining KDEF and JAFFE datasets. This created a DCNN that generalised better
than previous models and obtained reasonable accuracy when tested against unseen
images from these datasets.
7.2 Limitations
In both the training and test stages, one important drawback that affected the accuracy was the pre-processing stage, particularly face detection. As can be noticed in the figure below, the face detection algorithm did not always perform correctly, failing to identify the face and instead confusing it with other parts of the body, such as ears or skin with moles. This error was most noticeable when using the KDEF dataset, since it contained facial images from different angles, and it increased the number of junk images that provided no information and affected the convergence point.
Figure 7-1: False positive face detection.
The problem was solved during training and testing by manually removing the problematic images, but this alternative was not possible during the evaluation of the system, as can be seen in Figure 7-2, where, due to the noisy background, a false positive was found in the column:
Figure 7-2: Incorrect face detection in noisy background.
Moreover, the algorithm was incapable of recognising the face in profile pictures, which
decreased the number of images in the dataset. However, this problem was mitigated
by performing data augmentation.
7.3 Building a Graphical User Interface
The creation of the GUI was divided into two phases: design and implementation. As mentioned, during the design phase the idea behind the GUI was simplicity. The interface was designed so that the user could quickly test the application using the webcam or images from a file, without the need for complex menus. This was achieved by putting oneself in the place of the user and thinking about how to create an intuitive interface.
Designing the GUI was not as complicated as implementing it, given that thinking of a new design is faster than actually transforming that idea into a real application. The first problem found was that Python offered a limited number of alternatives for creating an appealing interface. Even after finding a reasonable alternative, PyQt, this Python binding did not implement all the capabilities of the Qt library, which meant that the appearance of the GUI had to be adapted to these limitations.
7.4 Future Work
Deep Convolutional Neural Networks are powerful models once their hyperparameters have been tuned correctly. Therefore, one approach to further enhance the results of this project is to improve the tuning process. As suggested in [74], an alternative for hyperparameter selection is to wrap a "pure" learning algorithm around the DCNN model to be trained. This would require either modifying the lasagne and nolearn implementation or building the implementation from the ground up.
Additionally, modifying the facial detection process could improve the accuracy of the
system given that, during the training stage, junk images would be reduced, and during
the AFER system test stage, faces in images containing noisy backgrounds would be
recognised.
The application needs to run on a computer with OpenCV, CUDA, lasagne, nolearn, and Python installed; its portability is therefore limited. One option to avoid installing all these components is to create a website that replicates the functionality of the application.
Finally, another direction in which to improve this project is to experiment with different architectures. In this dissertation, some of them were trained and tested; however, training these complex architectures was time-consuming and, due to this constraint, it was only possible to test a limited number of them.
Chapter 8. Conclusions
The purpose of this dissertation was to automate the facial expression recognition process. This was achieved by extensively training and testing several Deep Convolutional Neural Networks and finding the most suitable hyperparameters.
The dissertation focused on seven stages required to build the Automatic Facial
Expression Recognition (AFER) system, namely data collection, pre-processing, system
design, development, training, testing, and evaluation.
One of the most crucial parts of this dissertation was the background, which defined and delimited the project. Moreover, it provided the ideas from experts in the field of machine learning, and in particular neural networks, that were implemented during the training and testing stages.
The experimental section was performed using two benchmark datasets: the JAFFE and KDEF datasets. The results obtained from these tests demonstrated that it is critical to apply different techniques in order to avoid overfitting and increase accuracy. Moreover, it was shown that the greater the number of layers a DCNN has, the more care has to be taken when selecting the hyperparameters to prevent a decrease in accuracy. This holds particularly true when selecting the learning rate and the gradient descent optimisation algorithm, which, as was shown, are closely interdependent. Additionally, the experiments indicated that the ReLU activation function should be the preferred option over sigmoid or tanh. All these experiments allowed for a deeper understanding of the advantages and limitations of the models.
The deliverables of this project were described and documented in this dissertation,
including the research methodology, system implementation, experimental results, and
analysis of these results in order to grasp a deeper understanding of DCNN.
Finally, the document presented some alternatives to expand this project based on the
experimental results and the experience acquired during the development of this
system.
References
[1] Ekman, P. and Rosenberg, E. (1997). What the face reveals. New York: Oxford
University Press.
[2] Jabr, Ferris. (2010) "The Evolution Of Emotion: Charles Darwin's Little-Known
Psychology Experiment". Scientific American Blog Network, 2 Mar. 2016.
[3] Russell, James A, and Jose Miguel Fernandez Dols. (1997) The Psychology Of
Facial Expression. Cambridge: Cambridge University Press.
[4] Bettadapura, Vinay. (n.d.) "Face Expression Recognition And Analysis: The State Of
The Art". arXiv:1203.6722 [cs.CV]
[5] Siegman, Aron Wolfe, and Stanley Feldstein. Nonverbal Behavior And
Communication. Chapter 4. Facial Expression. Hillsdale, N.J.: L. Erlbaum Associates,
1978.
[6] Liu, Ping et al. (n.d.) "Facial Expression Recognition Via A Boosted Deep Belief
Network".
[7] Zhao, Xiaoming, Xugan Shi, and Shiqing Zhang. (2015). "Facial Expression
Recognition Via Deep Learning". IETE Technical Review 32.5: 347-355.
[8] Fasel, B. and Luettin, J. (2003). Automatic facial expression analysis: a survey.
Pattern Recognition, 36(1), pp.259-275.
[9] Y. Lv, Z. Feng and C. Xu, "Facial expression recognition via deep learning," Smart
Computing (SMARTCOMP), 2014 International Conference on, Hong Kong, 2014, pp.
303-308. doi: 10.1109/SMARTCOMP.2014.7043872
[10] Chibelushi, C. and Bourel, F. (2016). Facial Expression Recognition: A Brief
Tutorial Overview.
[11] P. K. Manglik, U. Misra, Prashant and H. B. Maringanti, "Facial expression
recognition," Systems, Man and Cybernetics, 2004 IEEE International Conference on,
2004, pp. 2220-2224 vol.3. doi: 10.1109/ICSMC.2004.1400658
[12] LeCun, Y. and Huang, F. (n.d.). Large-scale Learning with SVM and Convolutional
Nets for Generic Object Categorization.
[13] Hinton, G. (2012). Neural Networks for Machine Learning. [online] Coursera.
Available at: https://www.coursera.org/course/neuralnets [Accessed 8 Mar. 2016].
[14] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap5.html [Accessed
24 Apr. 2016].
[15] Ufldl.stanford.edu. (2016). Unsupervised Feature Learning and Deep Learning
Tutorial. [online] Available at: http://ufldl.stanford.edu/tutorial/supervised/Convolutional
NeuralNetwork/ [Accessed 15 Apr. 2016].
[16] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/convolutional-networks/
[Accessed 15 Apr. 2016].
[17] Poczos, B. and Singh, A. (2016). Introduction to Machine Learning. Deep Learning.
CMU.
[18] Six Novel Machine Learning Applications (2016). Forbes. [online] Available at:
http://www.forbes.com/sites/85broads/2014/01/06/six-novel-machine-learning-applicati
ons/#1d3c954367bf [Accessed 17 Apr. 2016].
[19] Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer,
pp.227 - 249 and 256 - 272.
[20] MacKay, D. (2003). Information theory, inference, and learning algorithms.
Cambridge, UK: Cambridge University Press, pp. 467 - 492.
[21] Murphy, K. (2012). Machine learning. Cambridge, Mass.: MIT Press, pp. 563 - 579.
[22] LeCun, Y. et al. (1998). Object recognition with gradient-based learning. doi:
10.1007/3-540-46805-6_19
[23] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, pp.1929-1958.
[24] YouTube. (2016). CS231n Winter 2016: Lecture 7: Convolutional Neural Networks.
[online] Available at: https://www.youtube.com/watch?v=LxfUGhug-iQ&feature=youtu.be
[Accessed 18 Apr. 2016].
[25] YouTube. (2016). Convolutional Neural networks. [online] Available at:
https://www.youtube.com/watch?v=rxKrCa4bg1I [Accessed 15 Apr. 2016].
[26] Rodrigues, C. (2015). Exploring the transfer learning aspect of deep neural
networks in facial information processing. University of Manchester., p.23.
[27] Win-Vector Blog. (2014). Estimating Generalization Error with the PRESS statistic. [online] Available at: http://www.win-vector.com/blog/2014/09/estimating-generalization-error-with-the-press-statistic/ [Accessed 30 Apr. 2016].
[28] Amazon Web Services. (2016) Amazon Web Services. [online] Available at:
https://aws.amazon.com [Accessed 25 Apr. 2016].
[29] Docs.aws.amazon.com. (2016). Linux GPU Instances - Amazon Elastic Compute
Cloud. [online] Available at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
using_cluster_computing.html [Accessed 25 Apr. 2016].
[30] Nvidia.com. (2016). NVIDIA KEPLER GK110 Next-Generation CUDA Compute
Architecture. [online] Available at: http://www.nvidia.com/content/PDF/kepler
/NV_DS_Tesla_KCompute_Arch_May_2012_LR.pdf [Accessed 25 Apr. 2016].
[31] Nvidia.com. (2016). Specs & Features of GRID Cloud Gaming GPUs | NVIDIA.
[online] Available at: http://www.nvidia.com/object/cloud-gaming-gpu-boards.html
[Accessed 25 Apr. 2016].
[32] Nvidia.com. (2016). NVIDIA Kepler Compute Architecture | High Performance
Computing | NVIDIA. [online] Available at: http://www.nvidia.com/object/nvidia-
kepler.html [Accessed 25 Apr. 2016].
[33] School of Computer Science (2016). Computer Languages. [online] Available at:
http://www.cs.man.ac.uk/~pjj/cs1001/software/node3.html [Accessed 27 Apr. 2016].
[34] Brownlee, J. (2014). Best Programming Language for Machine Learning - Machine
Learning Mastery. [online] Machine Learning Mastery. Available at: http://
machinelearningmastery.com/best-programming-language-for-machine-learning/
[Accessed 25 Apr. 2016].
[35] Krizhevsky, A., Sutskever, I. and Hinton, G. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
[36] Fernández Villaverde, J. and Aruoba, B. (2014). An empirical comparison of seven
programming languages. P.15.
[37] Docs.continuum.io. (2016). Anaconda | Continuum Analytics: Documentation.
[online] Available at: https://docs.continuum.io/anaconda/index [Accessed 26 Apr.
2016].
[38] Nvidia.com. (2016). Parallel Programming and Computing Platform | CUDA |
NVIDIA | NVIDIA. [online] Available at: http://www.nvidia.com/object/cuda_home
_new.html [Accessed 26 Apr. 2016].
[39] NVIDIA Developer. (2014). NVIDIA cuDNN. [online] Available at: https://developer
.nvidia.com/cudnn [Accessed 26 Apr. 2016].
[40] Deeplearning.net. (2016). Welcome — Theano 0.8.0 documentation. [online]
Available at: http://deeplearning.net/software/theano/ [Accessed 26 Apr. 2016].
[41] Lasagne.readthedocs.io. (2016). Welcome to Lasagne — Lasagne 0.2.dev1
documentation. [online] Available at: http://lasagne.readthedocs.io/en/latest/ [Accessed
26 Apr. 2016].
[42] Docs.opencv.org. (2016). OpenCV: Face Detection using Haar Cascades. [online]
Available at: http://docs.opencv.org/master/d7/d8b/tutorial_py_face_detection.html#
gsc.tab=0 [Accessed 18 Jul. 2016].
[43] Opencv.org. (2016). OpenCV. [online] Available at: http://opencv.org/ [Accessed 24
Apr. 2016].
[44] Pythonhosted.org. (2016). nolearn.lasagne — nolearn 0.6 documentation. [online]
Available at: https://pythonhosted.org/nolearn/lasagne.html [Accessed 26 Apr. 2016].
[45] Chris Nicholson, A. (2016). Deep Learning Comp Sheet: Deeplearning4j vs. Torch
vs. Theano vs. Caffe vs. TensorFlow - Deeplearning4j: Open-source, distributed deep
learning for the JVM. [online] Deeplearning4j.org. Available at:
http://deeplearning4j.org/compare-dl4j-torch7-pylearn.html [Accessed 27 Apr. 2016].
[46] Torch.ch. (2016). Torch | Scientific computing for LuaJIT. [online] Available at:
http://torch.ch/ [Accessed 27 Apr. 2016].
[47] The Karolinska Directed Emotional Faces. Lundqvist, D., Flykt, A., & Öhman, A.
(1998). The Karolinska Directed Emotional Faces - KDEF, CD ROM from Department of
Clinical Neuroscience, Psychology section, Karolinska Institutet, ISBN 91-630-7164-9.
[48] Michael J. Lyons, Shigeru Akamatsu, Miyuki Kamachi, Jiro Gyoba. Coding Facial
Expressions with Gabor Wavelets, 3rd IEEE International Conference on Automatic
Face and Gesture Recognition, pp. 200-205 (1998). doi: 10.1109/AFGR.1998.670949
[49] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap3.html [Accessed
30 Apr. 2016].
[50] Deeplearning.net. (2016). Convolutional Neural Networks (LeNet) — DeepLearning
0.1 documentation. [online] Available at: http://deeplearning.net/tutorial/lenet.html
[Accessed 31 Apr. 2016].
[51] Simard, P., Steinkraus, D. and Platt, J. (2003). Best Practices for Convolutional
Neural Networks Applied to Visual Document Analysis. Proceedings of the Seventh
International Conference on Document Analysis and Recognition. IEEE Computer
Society. ISBN: 0-7695-1960-1
[52] Viola, P. and Jones, M. (2001). Rapid Object Detection using a Boosted Cascade
of Simple Features. doi: 10.1109/CVPR.2001.990517
[53] Deeplearning.net. (2016). Getting Started — DeepLearning 0.1 documentation.
[online] Available at: http://deeplearning.net/tutorial/gettingstarted.html [Accessed 9 May
2016].
[54] Ufldl.stanford.edu. (2016). Data Preprocessing - Ufldl. [online] Available at:
http://ufldl.stanford.edu/wiki/index.php/Data_Preprocessing [Accessed 18 Jul. 2016].
[55] Baker, S. (2014). Obamacare Website Has Cost $840 Million. [online] The Atlantic.
Available at: http://www.theatlantic.com/politics/archive/2014/07/obamacare-website-
has-cost-840-million/440478/ [Accessed 24 Jul. 2016].
[56] Bloomberg.com. (2016). Obamacare Website Costs Exceed $2 Billion, Study Finds.
[online] Available at: http://www.bloomberg.com/news/articles/2014-09-24/obamacare-
website-costs-exceed-2-billion-study-finds [Accessed 24 Jul. 2016].
[57] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-2/ [Accessed
26 Jul. 2016].
[58] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-1/ [Accessed
26 Jul. 2016].
[59] kaurdavinder, V. (2015). Difference between RGB and BGR. [online] Davinder
Kaur. Available at: https://lifearoundkaur.wordpress.com/2015/08/04/difference-
between-rgb-and-bgr/ [Accessed 29 Jul. 2016].
[60] RGB, W. (2016). Why OpenCV Using BGR Colour Space Instead of RGB. [online]
Stackoverflow.com. Available at: http://stackoverflow.com/questions/14556545/why-
opencv-using-bgr-colour-space-instead-of-rgb [Accessed 30 Jul. 2016].
[61] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap6.html [Accessed
31 Jul. 2016].
[62] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-3/ [Accessed
31 Jul. 2016].
[63] Wolframalpha.com. (2016). Wolfram|Alpha: Computational Knowledge Engine.
[online] Available at: http://www.wolframalpha.com/input/?i=law+of+large+numbers
[Accessed 31 Jul. 2016].
[64] YouTube. (2016). TTIC Distinguished Lecture Series - Geoffrey Hinton. [online]
Available at: https://www.youtube.com/watch?v=EK61htlw8hY [Accessed 31 Jul. 2016].
[65] Sebastian Ruder. (2016). An overview of gradient descent optimization algorithms.
[online] Available at: http://sebastianruder.com/optimizing-gradient-descent/index.html
[Accessed 31 Jul. 2016].
[66] Zeiler, M. (2012). Adadelta: An adaptive learning rate method. Google Inc., USA -
New York University, USA. arXiv:1212.5701 [cs.LG]
[67] Hinton, G. (2016). Lecture 6e. rmsprop: Divide the gradient by a running average of
its recent magnitude. [online] Available at: http://www.cs.toronto.edu/~tijmen
/csc321/slides/lecture_slides_lec6.pdf [Accessed 02 Aug. 2016].
[68] Kingma, D. and Ba, J. (2015). Adam: A method for Stochastic Optimization.
arXiv:1412.6980 [cs.LG]
[69] James, G. (2014). An introduction to statistical learning. New York, NY: Springer,
pp.374-385.
[70] Alpaydin, E. (2010). Introduction to machine learning. Cambridge, Mass.: MIT
Press, pp. 113-120.
[71] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y. (2013).
Maxout Networks. JMLR WCP 28, pp.1319-1327. arXiv:1302.4389 [stat.ML]
[72] Shlens, J. (2003). A tutorial on Principal Component Analysis: Derivation,
discussion and Singular Value Decomposition. [online] princeton.edu. Available at:
https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf [Accessed 8
Aug. 2016].
[73] Chen, K. (2015). Principal Component Analysis (PCA).
[74] Yoshua Bengio. (2012) Early Stopping – But When? In Neural Networks: Tricks of
the Trade. Springer, pp. 53 – 69.
[75] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The elements of statistical
learning. New York: Springer, pp.219-228.
[76] Duncan, B. Bias and variance. [online] Fliptop blog. Available at:
http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-
overview/ [Accessed 11 Aug. 2016].
[77] Le Cun, Y., Denker, J. and Solla, S. (1990). Optimal Brain Damage. AT&T Bell
Laboratories. ISBN:1-55860-100-7
[78] Riverbankcomputing.com. (2016). Riverbank | Software | PyQt | What is PyQt?.
[online] Available at: https://riverbankcomputing.com/software/pyqt/intro [Accessed 12
Jun. 2016].
[79] Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for
large-scale image recognition. arXiv:1409.1556 [cs.CV].
[80] Szegedy, C. et al. (2014). Going Deeper with Convolutions. arXiv:1409.4842
[cs.CV].
[81] Alizadeh, S. and Fazel, A. (2015). Convolutional Neural Networks for Facial
Expression Recognition.
[82] Whitehead, N. and Fit-Florea, A. (n.d.). Precision & performance: Floating point and
IEEE 754 Compliance for NVIDIA GPUs.
Appendix
Setting the environment on Ubuntu 14.04.3 LTS with GPU
support
The following steps will install theano, lasagne, and nolearn with GPU support:
1. Update the default packages: sudo apt-get update
2. Install Ubuntu updates: sudo apt-get -y dist-upgrade
3. Create a Download folder: mkdir Downloads
4. Change directory: cd Downloads
5. Download Anaconda:
o wget https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
6. Change file permissions: chmod 770 Anaconda2-4.1.1-Linux-x86_64.sh
7. Install Anaconda: ./Anaconda2-4.1.1-Linux-x86_64.sh
8. Reload the profile to update the PATH variable changed by the Anaconda
installation process: source ~/.bashrc
9. Install dependencies:
o sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-nose
10. Install LAPACK: sudo apt-get install -y liblapack-dev
11. Download CUDA Toolkit:
o wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
12. Disable nouveau drivers:
o Create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following content:
blacklist nouveau
options nouveau modeset=0
o Regenerate the kernel initramfs: sudo update-initramfs -u
13. Reboot the system: sudo reboot
14. Install repository meta-data: sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
15. Update the Apt repository cache: sudo apt-get update
16. Install CUDA: sudo apt-get install cuda
17. Add the environment variables to the .bashrc file:
o export PATH="/usr/local/cuda-7.5/bin:$PATH"
o export LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH"
18. Reload the profile: source ~/.bashrc
19. Install theano:
o conda install theano
o pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
20. Install nolearn: pip install nolearn
21. Install lasagne: pip install https://github.com/Lasagne/Lasagne/archive/master.zip
To test if the installation was successful:
22. Type: python
23. Import theano: import theano
o The next message should be displayed:
“Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN not available)”
Additionally, in order to install cuDNN libraries, it is necessary to:
24. Create an NVIDIA developer account.
25. Download the cuDNN tar file.
26. Decompress the file: tar -xvf cudnn-7.5-linux-x64-v5.1.tgz
27. Change directory to the extracted directory: cd cuda
28. Copy the files to the corresponding directories:
o sudo cp lib64/* /usr/local/cuda-7.5/lib64/
o sudo cp include/* /usr/local/cuda-7.5/include/
To test if the installation was successful:
29. Type: python
30. Import theano: import theano
o The next message should be displayed:
“Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5103)”
Installing OpenCV on Ubuntu 14.04.3 LTS
The following steps will install OpenCV on Ubuntu:
1. Install cmake and git: sudo apt-get install cmake git
2. Install the image IO package: sudo apt-get install libjpeg8-dev
libtiff4-dev libjasper-dev libpng12-dev
3. Install the library to display the image on the screen: sudo apt-get install
libgtk2.0-dev
4. Install libraries to handle video stream: sudo apt-get install libavcodec-
dev libavformat-dev libswscale-dev libv4l-dev
5. Install libraries to optimise some openCV routines: sudo apt-get install
libatlas-base-dev gfortran
6. Pull down openCV from GitHub: git clone https://github.com/Itseez/opencv.git
7. Checkout the 3.0.0 version: git checkout 3.0.0
8. Pull down openCV contrib from GitHub: git clone
https://github.com/Itseez/opencv_contrib.git
9. Checkout the 3.0.0 version: git checkout 3.0.0
10. Create a build directory inside the opencv directory and change to it:
o cd ~/opencv
o mkdir build
o cd build
11. Setup the build with the following command:
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/home/<user>/opencv \
      -D PYTHON_INCLUDE_DIR=/home/<user>/anaconda2/include/python2.7/ \
      -D PYTHON_INCLUDE_DIR2=/home/<user>/anaconda2/include/python2.7 \
      -D PYTHON_LIBRARY=/home/<user>/anaconda2/lib/libpython2.7.so \
      -D PYTHON_PACKAGES_PATH=/home/<user>/anaconda2/lib/python2.7/site-packages/ \
      -D BUILD_EXAMPLES=ON \
      -D BUILD_NEW_PYTHON_SUPPORT=ON \
      -D PYTHON2_LIBRARY=/home/<user>/anaconda2/lib/libpython2.7.so \
      -D BUILD_opencv_python3=OFF \
      -D BUILD_opencv_python2=ON ..
12. Make the file: make -j4 (4 refers to the number of cores, so it can be replaced
with another number)
13. Install: sudo make install
14. Configure the necessary links and cache: sudo ldconfig
15. Move the cv2.so file to the site-packages directory: cp build/lib/cv2.so
~/anaconda2/lib/python2.7/site-packages/
To test if the installation was successful:
1. Type: python
2. Import the library: import cv2
o If no error message is displayed, the installation was correct.
Running experiments using theano, lasagne, and nolearn
In order to test the code, it is necessary to follow these steps:
1. Download the datasets JAFFE and KDEF.
2. Install unzip: sudo apt-get install unzip
3. Extract the content:
o unzip jaffe.zip
o unzip KDEF.zip
4. Download the Output.tar folder.
5. Extract the folders: tar -xvf Output.tar
6. Change directory to the newly extracted pyCodeP: cd pyCodeP
7. Modify CNN.py according to the desired network.
8. Run CNN.py: python CNN.py
9. Remove inputDataX and inputDataY: rm inputData*
10. Test the network saved in convNetPickles: python calculateAccuracy.py
convNetPickles directory_with_test_dataset