“Automatic Facial Expression Recognition”
A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and
Physical Sciences
2016
by
Hugo Gamboa Valero
School of Computer Science
Content
Abstract .......................................................................................................................... 8
Declaration ..................................................................................................................... 9
Copyright ..................................................................................................................... 10
Acknowledgements ..................................................................................................... 11
Chapter 1. Introduction .............................................................................................. 12
1.1 Motivation ......................................................................................................... 12
1.2 Aim ................................................................................................................... 13
1.3 Objectives ........................................................................................................ 14
1.3.1 Learning objectives .................................................................................... 14
1.3.2 Deliverable objectives ................................................................................ 14
1.4 Structure of the dissertation ............................................................................. 14
Chapter 2. Literature Review ..................................................................................... 16
2.1 General approach for facial expression recognition ......................................... 16
2.2 Classification .................................................................................................... 18
2.2.1 Linear classifiers ........................................................................................ 19
2.2.2 Nonlinear classifiers .................................................................................. 19
2.3 Subtleties of creating a model .......................................................................... 32
2.3.1 Bias-Variance trade-off, overfitting and underfitting ................................... 32
2.3.2 Vanishing gradient and exploding gradient ................................................ 33
Chapter 3. Deep Convolutional Neural Network ...................................................... 35
3.1 Overview .......................................................................................................... 35
3.1.1 Convolutional layer .................................................................................... 35
3.1.2 Pooling/subsampling layer ......................................................................... 38
3.1.3 Fully-connected layer ................................................................................ 40
3.2 Backpropagation in DCNNs ............................................................................. 40
3.3 Softmax classifier ............................................................................................. 41
3.4 Discrete convolution ......................................................................................... 42
3.5 Deep Convolutional Neural Network Architectures .......................................... 43
3.5.1 Architecture Considerations ....................................................................... 43
3.5.2 Improvement Strategies............................................................................. 44
3.6 Related work .................................................................................................... 48
3.6.1 AlexNet ...................................................................................................... 48
3.6.2 Visual Geometry Group ............................................................................. 48
3.6.3 GoogLeNet ................................................................................................ 49
3.6.4 Convolutional Neural Network for Facial Expression Recognition ............. 50
Chapter 4. Research methodology ........................................................................... 52
4.1 Project phases ................................................................................................. 52
4.1.1 Data collection ........................................................................................... 52
4.1.2 Pre-processing .......................................................................................... 54
4.1.3 System design ........................................................................................... 58
4.1.4 Development ............................................................................................. 62
4.1.5 Training and Testing .................................................................................. 63
4.1.6 Evaluation .................................................................................................. 66
Chapter 5. Implementation ........................................................................................ 67
5.1 Training modules .............................................................................................. 67
5.1.1 Input module .............................................................................................. 67
5.1.2 DCNN module ........................................................................................... 68
5.2 AFER modules ................................................................................................. 68
5.2.1 Graphical user interface module ................................................................ 68
5.2.2 Input processing module............................................................................ 72
5.2.3 DCNN application module ......................................................................... 72
Chapter 6. Experiments ............................................................................................. 74
6.1 Training stage .................................................................................................. 74
6.1.1 Hyper parameters ...................................................................................... 74
6.1.2 Other alternatives to improve the DCNN ................................................... 81
6.2 Test stage ........................................................................................................ 86
6.2.1 Difficult scenarios ...................................................................................... 87
6.2.2 Innovative architectures ............................................................................. 90
6.2.3 Additional test ............................................................................................ 92
Chapter 7. Discussion ............................................................................................... 97
7.1 Creating a Deep Convolutional Neural Networks ............................................. 97
7.2 Limitations ........................................................................................................ 99
7.3 Building a Graphical User Interface ................................................................ 100
7.4 Future Work ................................................................................................... 101
Chapter 8. Conclusions ........................................................................................... 102
Reference ................................................................................................................... 104
Appendix .................................................................................................................... 112
Setting the environment on Ubuntu 14.04.3 LTS with GPU support ........................ 112
Installing OpenCV on Ubuntu 14.04.3 LTS .............................................................. 114
Running experiments using Theano, Lasagne, and nolearn ....................... 115
Number of words: 17819
Appendix: 699
List of figures
Figure 2-1: General facial expression recognition framework . ..................................... 16
Figure 2-2: Comparison between standard momentum and Nesterov Accelerated
Gradient . ................................................................................................................ 27
Figure 2-3: Bias and variance representation . .............................................................. 32
Figure 3-1: Local connectivity and receptive field. ......................................................... 36
Figure 3-2: Parameter sharing. ..................................................................................... 36
Figure 3-3: Activation maps . ......................................................................................... 37
Figure 3-4: Local translation invariance and spatial size reduction. .............................. 39
Figure 3-5: Downsampling (left) and MAX pooling (right) . ............................................ 39
Figure 3-6: Example of discrete convolution. ................................................................ 42
Figure 3-7: Inception module: naïve version (left) and with dimensionality reduction
(right). ..................................................................................................................... 50
Figure 3-8: Architectures from left to right: AlexNet, VGG, GoogLeNet, and CNN for
facial recognition . ................................................................................................... 51
Figure 4-1: Sample images from the KDEF dataset. ..................................................... 53
Figure 4-2: Sample images from the JAFFE dataset. ................................................... 54
Figure 4-3: Haar features . ............................................................................................ 55
Figure 4-4: Area calculation using Integral Image. ........................................................ 55
Figure 5-1: Face detection and numpy array creation. .................................................. 67
Figure 5-2: AFER Graphical User Interface. .................................................................. 69
Figure 5-3: Result from input image. ............................................................................. 70
Figure 5-4: Result from input video. .............................................................................. 70
Figure 5-5: AFER process. ............................................................................................ 71
Figure 6-1: Accuracy of applying diverse GDO algorithms after varying the learning rate.
................................................................................................................................ 75
Figure 6-2: Validation and test accuracy for different learning rate values. ................... 76
Figure 6-3: Highest accuracy result of each GDO algorithms. ...................................... 77
Figure 6-4: Accuracy after varying the size of the max-pooling filter. ............................ 78
Figure 6-5: Accuracy for different activation functions. .................................................. 79
Figure 6-6: Validation and test accuracy for FC layers with varying number of neurons.
................................................................................................................................ 80
Figure 6-7: Validation and test accuracy for different number of FC layers. .................. 81
Figure 6-8: Accuracy of different pre-processing methods. ........................................... 82
Figure 6-9: Transformations applied to one image. ....................................................... 83
Figure 6-10: Accuracy of using different data augmentation techniques. ...................... 84
Figure 6-11: Accuracy of using normalisation and dropout. .......................................... 86
Figure 6-12: Images with 10% (top), 30% (middle) and 50% (bottom) occlusion. ......... 87
Figure 6-13: Original images (first row) and images with low (second row), medium
(third row), and high (fourth row) levels of gamma variation. .................................. 88
Figure 6-14: Accuracy for each DCNN model. .............................................................. 90
Figure 6-15: Architectures: AlexNet (Left), VGG (Centre), and GoogLeNet (Right). ..... 91
Figure 6-16: Test accuracy for different models using different datasets. ..................... 93
Figure 6-17: Confusion matrix for the new model trained with both KDEF and JAFFE
datasets (0: Fear, 1: Anger, 2: Disgust, 3: Happiness, 4: Neutral, 5: Sadness, and 6:
Surprise). ................................................................................................................ 96
Figure 7-1: False positive face detection. ...................................................................... 99
Figure 7-2: Incorrect face detection in noisy background. ........................................... 100
List of tables
Table 3-1: Data augmentation transformations. ............................................................ 45
Table 6-1: Number of images in each dataset. .............................................................. 84
Table 6-2: Occlusion test results. .................................................................................. 89
Table 6-3: Illumination test results. ................................................................................ 90
Table 6-4: AlexNet, GoogLeNet, and VGG accuracy tests. ............................ 92
Table 6-5: Validation and test accuracy for improved model and ensemble. ................ 94
Table 6-6: Precision, recall and f1-score for both models tested against the KDEF
dataset. ................................................................................................................... 95
Table 6-7: Precision, recall and f1-score for both models tested against the JAFFE
dataset. ................................................................................................................... 95
Abstract
Automatic Facial Expression Recognition
This dissertation presents the design, implementation, test, and evaluation of an
Automatic Facial Expression Recognition (AFER) system that applies a machine
learning algorithm based on Deep Convolutional Neural Networks (DCNNs) with the aim
of correctly classifying seven facial expressions (namely surprise, happiness, sadness,
fear, anger, disgust, and neutral). The DCNN module and the AFER system were built
in Python, but only the training module exploited the Graphics Processing Unit (GPU)
computational power in order to accelerate this process.
Facial expressions convey helpful information that is difficult to detect for an ordinary
system. However, being capable of recognising them could lead to more responsive
and intelligent systems that might improve the user experience. By experimenting with
different models and architectures on different benchmark facial datasets such as the
Japanese Female Facial Expression (JAFFE) and the Karolinska Directed Emotional
Faces (KDEF), the most suitable hyper parameters that yielded a good level of
performance were obtained. Additionally, a deep understanding of the strengths and
limitations of DCNNs was gained.
Results from these experiments show that special care must be taken during different
parts of the development process such as architecture selection or hyper parameter
tuning. By selecting the correct combination of these two elements, the accuracy
of the model and the convergence time improve.
Declaration
No portion of the work referred to in this dissertation has been submitted in
support of an application for another degree or qualification of this or any other
university or other institute of learning.
Copyright
i. The author of this thesis (including any appendices and/or schedules to this
thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright,
including for administrative purposes.
ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents
Act 1988 (as amended) and regulations issued under it or, where appropriate, in
accordance with licensing agreements which the University has from time to
time. This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other
intellectual property (the “Intellectual Property”) and any reproductions of
copyright works in the thesis, for example graphs and tables (“Reproductions”),
which may be described in this thesis, may not be owned by the author and may
be owned by third parties. Such Intellectual Property and Reproductions cannot
and must not be made available for use without the prior written permission of
the owner(s) of the relevant Intellectual Property and/or Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this thesis, the Copyright and any Intellectual Property
and/or Reproductions described in it may take place is available in the University
IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in
any relevant Thesis restriction declarations deposited in the University Library,
The University Library’s regulations (see
http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s
policy on presentation of Theses.
Acknowledgements
First and foremost, I would like to thank my supervisor, Dr Ke Chen. I am grateful for his expert
advice and guidance during this dissertation.
I would also like to thank my family; their constant support has been the cornerstone of every
project.
Finally, I would like to thank Bere for her financial support; and Coco, Dimitra and Lucas for
showing me a different perspective of life.
This study was done with financial support of the Mexican Council of Science and Technology
(CONACyT) under the scholarship number CVU 669499.
Chapter 1. Introduction
In this chapter, an explanation about the motivation behind the project is provided.
Afterwards, the major aim and objectives are listed. Finally, in section 1.4, the structure of
the dissertation is described.
1.1 Motivation
As computers become increasingly ubiquitous and their relationship with users
changes, they need new tools to obtain feedback from their interactions with those
users, and respond accordingly. Nowadays, there are different alternatives to extract
feedback from users such as heart rate, tone of voice, body movement, body language,
etc. However, some of those alternatives are obtrusive to users; or do not provide
enough or accurate feedback in order for a system to be reliable.
Feedback, in the form of user’s emotions, offers valuable information that could have a
positive impact in different areas such as e-marketing, robotics, smart products, etc. For
instance, developers could create a music application that gradually adjusted the type of
music that is being played according to the emotions detected so that people feeling a
negative mood such as sadness or anger would change those emotions and feel better.
One unobtrusive alternative that offers a reasonable amount of feedback is the face,
particularly facial expressions. Facial expressions have been considered a good source
of information to determine the true emotions of an individual [1]. Even before Charles
Darwin conducted “studies on how people recognize emotion in faces” [2], ancient
thinkers, such as Aristotle, already knew the importance of facial expressions [3].
However, it was not until Paul Ekman conducted cross-cultural experiments around the
world that a set of universal emotions, namely surprise, happiness, sadness, fear,
anger, and disgust, was finally accepted [4, 5].
In the past, automating facial expression recognition accurately was unimaginable not
only because the computational power was limited and expensive, but also because the
techniques used performed poorly on image recognition from raw pixels [12]. However,
with the advances in faster Graphic Processing Units (GPUs) and parallelization, the
development of special-purpose machine learning models, and the availability of
sizeable amounts of data, the unimaginable has become possible.
To a certain extent, faster GPUs and parallelization are helpful, but getting to a specific
answer faster does not guarantee that it is the correct response. Finding the answer
which is closer to the correct one or, in other words, learning from data is the purpose of
machine learning models.
Recently, a branch of machine learning that has become popular is deep learning. Deep
learning models have achieved better accuracy than traditional approaches such as
SVM or kNN [13, 26]. For instance, models trained using deep learning in the ImageNet
Challenge took the first three places in the competition. Therefore, a reasonable
approach would be to use a deep learning model in order to train an automatic facial
expression recognition (AFER) system and improve the accuracy of this task.
This project is concerned with developing an AFER system using a state-of-the-art deep
learning model, i.e. Deep Convolutional Neural Networks (DCNNs) in the facial
expression domain. The prototype should analyse images captured in real time via
webcam and files from the local disk, and display the result through the Graphical User
Interface (GUI).
1.2 Aim
Design, implement, test, and analyse an automatic system that classifies correctly
seven facial expressions (namely surprise, happiness, sadness, fear, anger, disgust,
and neutral) by using a machine learning algorithm model based on DCNNs trained with
facial expression datasets.
1.3 Objectives
The main objectives of this project are:
1.3.1 Learning objectives
Investigate and understand the advantages of using deep convolutional neural
network models over other deep learning models.
Research the advantages and disadvantages of applying DCNNs in the domain
of facial expression images.
Study the hyper parameters, properties, and methods for training DCNN.
Investigate and understand the process of facial expression recognition from
images.
Analyse efficient and rapid hardware and software alternatives in order to train
DCNNs.
1.3.2 Deliverable objectives
Design and implement a GUI application using Python that is able to recognise
seven facial expressions online.
Train a DCNN using Python and Theano on Amazon Web Services
infrastructure, and utilise this model in the GUI application to predict facial
expressions.
Test and measure the application accuracy under different conditions such as
illumination and pose variation, and partial facial occlusion.
Experiment with different hyper parameters and the impact of these elements on
the accuracy of the models.
Analyse the test results from the previous experiments.
1.4 Structure of the dissertation
The structure of this dissertation is as follows:
Chapter 1 specifies the motivations to solve the problem domain along with the
aim and objectives to achieve in the project.
Chapter 2 provides a literature review of the topics related to the project including
the general approach that has been used to solve this problem; a description of
neural networks, and the complexity of creating a model due to the bias-variance
trade-off.
Chapter 3 describes characteristics and components of the DCNN, the way in
which these components interact with each other and the crucial values that must
be taken into consideration when designing this type of network.
Chapter 4 focuses on the research methodology and the stages in this project,
namely data collection, pre-processing, system design, development, training,
testing, and evaluation.
Chapter 5 details the implementation process and provides information about the
training and Automatic Facial Expression Recognition (AFER) modules.
Chapter 6 specifies the experiments that were performed during the training
stage in order to select the best hyper parameters; and the test stage to improve
the accuracy.
Chapter 7 describes the lessons that were learned during the development of the
system, including the challenges faced in the implementation, training and test
stages. Additionally, it describes new directions for future work.
Chapter 8 summarizes the project, and analyses the objectives that were fulfilled.
Chapter 2. Literature Review
This chapter describes the background material for this project. Section 2.1 explains the
general approach to facial expression recognition. Then, section 2.2 describes linear
and nonlinear classifiers, particularly, neural networks, their features, and important
elements that need to be considered when a model is built. Finally, section 2.3 details
the problems faced when a model is created.
2.1 General approach for facial expression recognition
The general framework to approach Facial Expression Recognition is shown in Figure
2-1. It consists of five steps: data acquisition, pre-processing, feature extraction,
classification, and post-processing [10].
Figure 2-1: General facial expression recognition framework [10].
The first step is responsible for obtaining static images or image sequences showing
different perspectives of the face. Both types of images can be two- or three-
dimensional, but image sequences provide more information because they are able to
represent the temporal characteristics of an expression [10].
In general, object recognition is a challenging task because of the following elements:
segmentation problems, deformation, illumination, affordance, and viewpoints.
Segmentation problems occur because real-life images are cluttered with other objects
which complicates the segmentation task; deformation happens when objects are
modified in non-affine ways, such as a hand-written two, which can have a large loop or
cusp; illumination is related to the source of lighting and its effect on the intensity of
each pixel; affordance refers to how the relationship between objects and actions
defines the classes of objects, e.g. chairs have different physical shapes, but all chairs
are used for sitting. Finally, viewpoints involve the different perspectives of an object
which some learning methods cannot handle [13].
Pre-processing helps counter the aforementioned problems. In this stage,
images are modified against head translation, rotation, and scale using geometric
normalization; moreover, there are different techniques used in order to solve
illumination problems, e.g. histogram equalization. Additionally, image segmentation is
achieved using different alternatives such as Gaussian mixture model of the skin or
deformable models of face parts [10].
In FER, face detection is a crucial task. There are different approaches for this, namely
appearance based approach, template based approach, feature based approach, and
the local-global graph approach [9].
● In the appearance based approach, the face is recognized as a whole, and the
classifier is trained with face and non-face patterns. Nevertheless, one limitation
of this approach is that it is only accurate in frontal images that are well-
illuminated and have a simple background.
● In the template based approach, a standard face pattern is generated and
correlated with the image to find the face. The limitation with this approach is the
difficulty to generalize to different shapes, sizes and poses.
● In the feature based approach, facial features that do not change are detected. Then,
candidate faces are grouped and verified. The limitation with this approach is that
it is difficult to find features in pictures that have cluttered backgrounds or poor
illumination.
● Finally, the local-global graph approach generates a graph with nodes created
according to colour similarity which are, to an extent, invariant to translation,
rotation and scale. The limitations of this approach are its sensitivity to small image
sizes and poor image quality.
In the feature extraction phase, the objective is to obtain discriminatory and stable facial
features according to one of two perspectives: geometric feature-based methods and
appearance-based methods. On the one hand, geometric feature-based methods
involve defining geometric relationships of facial components in terms of their location
and shape. One disadvantage of the facial feature extraction using this approach is the
complexity of selecting precise and reliable detection techniques to find those geometric
relationships. On the other hand, appearance-based methods apply image filters or filter
banks to the face or a section of the face. For these methods, Principal Component
Analysis (PCA), Independent Component Analysis (ICA) and Gabor wavelets have
been used with varying degrees of success [6, 8, 9, 10].
The next step, classification, is performed using one or more parametric or non-
parametric classifiers such as Hidden Markov models, k-Nearest Neighbour, Support
Vector Machine (SVM), etc. The goal of this step is to determine the type of facial
expression that was input using action units (AU) or prototypic facial expressions [10].
Finally, the objective of post-processing is to improve FER accuracy by taking
advantage of the domain knowledge or combining several levels of a classification
hierarchy [10].
2.2 Classification
Learning a model that is able to classify facial expressions requires two important
components: a score function and a loss function. The first component maps the input
data to class scores; and the second component quantifies how close the prediction of
the model is to the true value. For any model, then, it is crucial to establish a
relationship between these functions by creating an optimization problem that minimizes
the loss function with respect to the parameters of the score function [15, 16].
2.2.1 Linear classifiers
Linear classifiers are score functions that produce a linear mapping of the input data.
They are based on linear combinations of fixed nonlinear basis functions 𝜙𝑗(𝒙) and
adopt the form [19]:
$y(\mathbf{x}, \mathbf{w}) = f\left(\sum_{j=1}^{M} w_j \phi_j(\mathbf{x})\right)$ (2-1)
Where $f(\cdot)$ is a nonlinear activation function in the case of classification problems,
and the identity in the case of regression problems [19]; $\phi_j(\mathbf{x})$ is a nonlinear
basis function that may itself depend on parameters; and $\mathbf{w}$ comprises the
weight coefficients and biases.
Although linear classifiers are simple to understand, their linearity presents an important
limitation when they are used to model real-life data given that, often, this data is
nonlinearly separable.
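To make equation (2-1) concrete, the following minimal Python sketch (using NumPy) evaluates a linear classifier with Gaussian basis functions and a logistic-sigmoid f; the basis centres, width, and weights are hypothetical values chosen only for illustration:

import numpy as np

def gaussian_basis(x, centres, width=1.0):
    # Fixed nonlinear basis functions phi_j(x), one Gaussian per centre.
    return np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * width ** 2))

def linear_classifier(x, w, b, centres):
    # y(x, w) = f(sum_j w_j * phi_j(x) + b), with f the logistic sigmoid.
    a = w @ gaussian_basis(x, centres) + b
    return 1.0 / (1.0 + np.exp(-a))

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # M = 3 basis functions
w = np.array([0.5, -1.2, 0.8])                            # weight coefficients
print(linear_classifier(np.array([1.0, 0.5]), w, 0.1, centres))

The decision surface is linear in the weights, but only in the transformed space defined by the basis functions, which is exactly the limitation discussed above.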
2.2.2 Nonlinear classifiers
In order to solve this problem, an alternative is to use models that learn nonlinear
features such as neural networks.
2.2.2.1 Neural Networks
Neural networks are a series of functional transformations that are formed by a set of
computational units, also known as neurons, that were inspired by biological neural
networks. These networks use basis functions similar to (2-1), in which each of these is
a nonlinear function from a set of input variables to a set of output variables controlled
by a vector 𝑊 of adjustable parameters [16, 19].
According to [19], neural network models can be “significantly more compact, and
hence faster to evaluate, than a support vector machine having the same generalization
performance.” However, the cost of this compactness is that the error function is no
longer a convex function of the model parameters.
2.2.2.1.1 Feed-forward Network function
The basic neural network model is built using M linear combinations of the input
variables 𝑥1, … , 𝑥𝐷, in the form:
$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$ (2-2)

Where $j = 1, \ldots, M$; $a_j$ refers to activations; the superscript $(1)$ refers to
elements in the first layer of the network; $w_{ji}^{(1)}$ refers to the weights and
$w_{j0}^{(1)}$ refers to the biases [19].
Activations are transformed by applying a differentiable, nonlinear activation function ℎ(∙)
that creates:
𝑧𝑗 = ℎ(𝑎𝑗) (2-3)
The output quantities of this basis function are known as hidden units which perform
feature extraction or feature construction in order to learn nonlinear combinations of the
original input data. This process is “useful for problems where the original input features
are not very individually informative” [21].
The activation functions correspond to the basis functions, and are usually sigmoidal
functions such as the logistic sigmoid or the tanh function, or piecewise-linear ones
such as the Rectified Linear Unit (ReLU). One important reason for this is that the neural network function
becomes differentiable with respect to the network parameters. These activation units
are grouped into layers which are classified into input, hidden, and output layers. For the
output units, the choice of the activation function is determined by “the data and the
assumed distribution of target variables” [19]. In the case of regression problems, an
identity function is used; in the case of binary classification problems, a logistic sigmoid
function is used; and in the case of multiclass problems, a softmax activation function is
used [19, 21].
Following (2-1), the outputs from the hidden layers can be combined to give the output
unit activations:
$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$ (2-4)

Where $k = 1, \ldots, K$, and $K$ is the total number of outputs; $a_k$ are the output
unit activations; $w_{kj}^{(2)}$ are the weights and $w_{k0}^{(2)}$ are the bias
parameters.
Finally, by combining the previous equations, the overall neural network function can be
represented in the following form:
$y_k(\mathbf{x}, \mathbf{w}) = f\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$ (2-5)
Where all the weights and bias parameters were grouped into a vector 𝒘; and 𝑓 and ℎ
are nonlinear activation functions.
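Equation (2-5) can be translated almost directly into code. The sketch below is a toy example with hypothetical layer sizes and random weights, computing a forward pass with h = tanh and a softmax output f for a multiclass problem:

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Equation (2-5): first-layer activations (2-2), hidden units (2-3),
    # output activations (2-4), then f = softmax for multiclass outputs.
    a = W1 @ x + b1
    z = np.tanh(a)
    o = W2 @ z + b2
    e = np.exp(o - o.max())      # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
D, M, K = 4, 5, 3                # inputs, hidden units, outputs (toy sizes)
x = rng.normal(size=D)
y = forward(x, rng.normal(size=(M, D)), np.zeros(M),
            rng.normal(size=(K, M)), np.zeros(K))
print(y, y.sum())                # class probabilities summing to 1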
The main idea of supervised neural networks is that using the training data, the
algorithm learns a model that represents the relationship between the input vector and
the target. Thus, the network becomes more accurate the closer the output values are
to the target values. This is closely related to the selection of the activation and the error
functions which are defined by the type of problem being solved. In the case of
regression, linear outputs and a sum-of-square error are used; in the case of binary
classification, logistic sigmoid outputs and a cross-entropy error function are used; and
in the case of multiclass classification, softmax outputs and a multi-class cross-entropy
error function are used. In turn, these objective functions are utilized to evaluate the
model by searching for the weights that minimize this function [19, 20, 21].
2.2.2.1.2 Activation functions
An activation function, or non-linearity, performs a fixed mathematical operation on an
input. There are different activation functions; however, the four most common
activation functions are: sigmoid, tanh, ReLU, and maxout.
2.2.2.1.2.1 Sigmoid
The sigmoid activation function fits the input value within the range [0, 1]. It becomes 0
for large negative numbers, and 1 for large positive numbers. The function has the
following mathematical form:
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$ (2-6)
Where 𝑥 is an input value; and 𝑒 refers to Euler’s number.
The function presents two disadvantages: 1) the sigmoid saturates and kills gradients
(see section 2.3.2); and 2) sigmoid outputs are not zero-centred [58].
2.2.2.1.2.2 Tanh
The tanh activation function fits the input value within the range [-1, 1]. The
mathematical formula of this function is:
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ (2-7)
Like the sigmoid function, tanh activation units saturate; however, their output is
zero-centred [58].
2.2.2.1.2.3 ReLU and Leaky ReLU
The Rectified Linear Unit is another activation function that does not squash the input
value within a range; instead, it thresholds the values at zero by using the following
mathematical form [17, 58]:
𝑓(𝑥) = max (0, 𝑥) (2-8)
The advantages of the ReLU function are: 1) it accelerates the convergence of the
Gradient Descent Optimization algorithm; and 2) it has a simple implementation. The
main disadvantage of this function is that the activation units could die when the
learning rate is too high.
An alternative that tries to fix the dying units of ReLU is Leaky ReLU. It does so by
introducing a small negative slope for values where 𝑥 < 0. This function is defined as:
𝑓(𝑥) = 1(𝑥 < 0)(𝛼𝑥) + 1(𝑥 ≥ 0)(𝑥) (2-9)
Where 𝛼 is a small constant (approximately 0.01).
The disadvantage with this function is that the results presented by some researchers
have not been consistent [58].
2.2.2.1.2.4 Maxout
Maxout activation function has “beneficial characteristics both for optimization and
model averaging with dropout” [71]. The maxout function is defined as:
$f(x) = \max_{j \in [1, k]} \left(x^{T} W_{j} + b_{j}\right)$ (2-10)
Where 𝑊 ∈ ℝ𝑑×𝑚×𝑘 and 𝑏 ∈ ℝ𝑚×𝑘 are learned parameters.
In terms of advantages, maxout: 1) is well-suited for training with dropout; 2) is capable
of training deeper networks than is possible using ReLU; 3) ensures that every
parameter in the model benefits from dropout and emulates bagging training; and 4)
has the benefits of ReLU without its drawbacks [58, 71].
One important disadvantage is that maxout “doubles the number of parameters for
every single neuron, leading to a high total number of parameters” [58].
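The four activation functions above can be summarised in a few lines of NumPy. This is a minimal sketch; the maxout version treats a single unit with k linear pieces, so the shapes of W and b here are illustrative rather than the full tensors of [71]:

import numpy as np

def sigmoid(x):                 # (2-6): squashes input into [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # (2-7): zero-centred, squashes into [-1, 1]
    return np.tanh(x)

def relu(x):                    # (2-8): thresholds at zero
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):  # (2-9): small negative slope where x < 0
    return np.where(x < 0, alpha * x, x)

def maxout(x, W, b):            # (2-10): maximum over k linear pieces
    return np.max(x @ W + b)    # x: (d,), W: (d, k), b: (k,)

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))
print(leaky_relu(x))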
2.2.2.1.3 Network training and Gradient Descent
The purpose of network training is to find the vector 𝑤 that produces the smallest value
𝐸(𝒘). That smallest value occurs at a point in the weight space where the gradient of
the error function vanishes, or in other words:
∇𝐸(𝒘) = 0 (2-11)
The points where this happens are called stationary points, and are further subdivided
into minima, maxima, and saddle points. Moreover, as the objective function has a highly
nonlinear dependence on the weights and bias parameters, there will be many such
points. In the case of minima, any point whose error is merely lower than in its
surrounding neighbourhood is called a local minimum, while the point with the smallest
value overall is called the global minimum [19].
According to [19], for a successful neural network, “it may not be necessary to find the
global minimum, but it may be necessary to compare several local minima in order to
find a sufficiently good solution.”
In order to find the solution for ∇𝐸(𝒘) = 0, most techniques require selecting initial
weight values and then moving through the weight space iteratively in successive steps
of the form:
𝑤(𝜏+1) = 𝑤𝜏 + ∆𝑤𝜏 (2-12)
Where 𝜏 refers to the iteration step.
The updates can be performed using gradient information. This information is used to
improve the speed with which the weight vector that produces the sufficiently good
solution is located. There are three common approaches for this update: batch, mini-
batch, and online method. The batch method uses the whole data set at once; the mini-
batch method uses part of the data set; and the online method uses one data point at a
time.
When the batch method is used, the approach is known as gradient descent or steepest
descent. It is defined by the form:
𝑤(𝜏+1) = 𝑤𝜏 − 𝜂∇𝐸(𝑤𝜏) (2-13)
However, for batch optimization, gradient descent is less robust and slower than
conjugate gradient and quasi-Newton methods. The advantage of these methods over
gradient descent is that the error function always decreases at each iteration unless the
weight vector is at a local or global minimum [19].
When the mini-batch method is used, the approach is known as mini-batch gradient
descent. As it was mentioned before, the difference is that mini-batch uses b examples
in each iteration instead of the whole set.
Finally, when the online method is used, the approach is known as sequential gradient
descent or stochastic gradient descent. This method makes an update of the weight
vector using only one data point at a time either by cycling through the data in sequence
or selecting points at random with replacement [15, 19]. It is defined by:
𝑤(𝜏+1) = 𝑤𝜏 − 𝜂∇𝐸𝑛(𝑤𝜏) (2-14)
There are two important advantages of stochastic gradient descent (SGD) over batch
and mini-batch gradient descent: 1) SGD handles redundancy in the data more
efficiently; and 2) it can escape local minima.
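The three update schemes differ only in how much data each gradient evaluation sees. A minimal sketch follows, assuming a user-supplied grad_fn that returns ∇E for a batch of examples (a hypothetical callable, not part of the system described later):

import numpy as np

def gradient_descent_epoch(w, X, T, grad_fn, eta=0.01, batch_size=1):
    # batch_size = len(X):     batch gradient descent (2-13)
    # 1 < batch_size < len(X): mini-batch gradient descent
    # batch_size = 1:          stochastic gradient descent (2-14)
    idx = np.random.permutation(len(X))    # visit data points in random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - eta * grad_fn(w, X[batch], T[batch])
    return w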
2.2.2.1.3.1 Gradient Descent Optimization Algorithms
Gradient Descent optimization algorithms were developed in order to avoid getting
trapped in suboptimal points. There are numerous algorithms to solve this problem;
however, the most commonly used are: Momentum, Nesterov Accelerated Gradient,
Adagrad, Adadelta, RMSprop, and Adam. The following subsections describe these
algorithms.
2.2.2.1.3.1.1 Momentum
This method helps gradient descent to move in the correct direction by reducing the
effect of ravines which are areas where the surface “curves more steeply in one
dimension than in another” [65], and are common around local minima.
In order to reduce this effect, the method adds a fraction of the previous update to the
current update which can be expressed as:
𝑣𝜏+1 = 𝜇𝑣𝜏 − 𝜂∇𝐸(𝑤𝜏)
(2-15)
𝑤𝜏+1 = 𝑤𝜏 + 𝑣𝜏+1 (2-16)
Where the momentum 𝜇 ∈ [0,1] represents the weight of the previous update; and 𝑣𝜏+1
and 𝑣𝜏 are the updated values at the current and previous iteration, respectively.
In other words, the update grows when successive updates point in the same direction,
and shrinks when they change direction. The result is faster convergence and reduced
oscillation [65].
2.2.2.1.3.1.2 Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) is an improvement over the momentum method
as it takes into account the future value of the gradient, and makes a correction before
jumping towards the value [62, 65].
Figure 2-2: Comparison between standard momentum and Nesterov Accelerated
Gradient [13].
In practice, it “consistently works slightly better than standard momentum” [62] as a
result of the “lookahead” gradient step. This is expressed as:
𝑣𝜏+1 = 𝜇𝑣𝜏 − 𝜂∇𝐸(𝑤𝜏 + 𝜇𝑣𝜏)
(2-17)
𝑤𝜏+1 = 𝑤𝜏 + 𝑣𝜏+1 (2-18)
Where ∇𝐸(𝑤𝜏 + 𝜇𝑣𝜏) is the “lookahead” gradient step, and the other elements are the
same as before.
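Equations (2-15) to (2-18) amount to a two-line change in the update loop. In the sketch below, grad_fn is again an assumed callable returning ∇E(w); the only difference between the two methods is where the gradient is evaluated:

def momentum_step(w, v, grad_fn, mu=0.9, eta=0.01):
    # (2-15)-(2-16): add a fraction mu of the previous update.
    v = mu * v - eta * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, mu=0.9, eta=0.01):
    # (2-17)-(2-18): evaluate the gradient at the "lookahead" point.
    v = mu * v - eta * grad_fn(w + mu * v)
    return w + v, v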
2.2.2.1.3.1.3 Adagrad
It is an adaptive learning rate method that adjusts the dynamic learning rate of each
parameter by updating infrequent parameters using a larger learning rate, and updating
frequent parameters using a smaller learning rate. A nice property of this approach is
that the progress on infrequent and frequent parameters evens out over time.
$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{\sum_{t=1}^{\tau} \nabla E(w_t)^2 + \epsilon}} \cdot \nabla E(w_\tau)$ (2-19)
Where 𝜖 is a smoothing term (set somewhere between 1e-4 and 1e-8) in order to avoid
division by zero.
An advantage of Adagrad is that the learning rate is tuned automatically. Due to this, the
learning rate value used in the implementation is usually 0.01 [65].
Adagrad has a couple of drawbacks. The first disadvantage is that this method is
sensitive to the choice of the learning rate given that this value becomes low when the
initial gradient is large. The second disadvantage is that the accumulation of squared
gradients in the denominator makes the monotonically decreasing learning rate too
aggressive. For deep learning, this means that the learning rate vanishes as the
accumulated sum keeps growing [62, 65, 66].
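A sketch of the Adagrad update of equation (2-19); the ever-growing cache g_sq is exactly the accumulated sum that eventually makes the effective learning rate vanish (grad_fn is an assumed callable returning ∇E):

import numpy as np

def adagrad_step(w, g_sq, grad_fn, eta=0.01, eps=1e-8):
    # (2-19): per-parameter learning rate shrinks with the accumulated
    # sum of squared gradients, which only ever grows.
    g = grad_fn(w)
    g_sq = g_sq + g ** 2
    return w - eta * g / np.sqrt(g_sq + eps), g_sq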
2.2.2.1.3.1.4 Adadelta
This method is an extension of Adagrad, but it tries to reduce the aggressive monotonic
learning rate by restricting the number of accumulated past gradients to a fixed number
of past gradients defined by a window of size 𝜔.
In order to define the sum of gradients over a window, Adadelta does not inefficiently
store 𝜔 previous squared gradients. Instead, the method uses a decaying average of all
past squared gradients. This decaying running average depends on the previous
average and the current gradient.
Adadelta is defined as follows:
$D[\phi]_\tau = \rho D[\phi]_{\tau-1} + (1 - \rho)\phi_\tau$ (2-20)

$\Delta w_\tau = \dfrac{\sqrt{D[\Delta w]_{\tau-1} + \epsilon}}{\sqrt{D[\nabla E(w_\tau)^2]_\tau + \epsilon}} \cdot \nabla E(w_\tau)$ (2-21)

$w_{\tau+1} = w_\tau - \Delta w_\tau$ (2-22)
Where 𝐷[ϕ]𝜏 is the running average at iteration 𝜏; 𝜌 is a constant that controls the
decay of the previous parameter updates; and Δ𝑤 refers to the previous weight update.
The advantage of Adadelta is that it solves the drawbacks of Adagrad. Additionally, it
makes it unnecessary to tune the hyper parameters despite variations in input data
types, number of hidden units, and nonlinearities. This makes Adadelta a robust
learning rate method [66].
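Equations (2-20) to (2-22) in code form; note that no learning rate η appears, which is what makes tuning unnecessary. A sketch, with rho and eps values chosen only for illustration:

import numpy as np

def adadelta_step(w, avg_g2, avg_dw2, grad_fn, rho=0.95, eps=1e-6):
    # (2-20)-(2-22): decaying averages of squared gradients and squared
    # updates replace the global learning rate entirely.
    g = grad_fn(w)
    avg_g2 = rho * avg_g2 + (1 - rho) * g ** 2
    dw = np.sqrt(avg_dw2 + eps) / np.sqrt(avg_g2 + eps) * g
    avg_dw2 = rho * avg_dw2 + (1 - rho) * dw ** 2
    return w - dw, avg_g2, avg_dw2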
2.2.2.1.3.1.5 RMSprop
This unpublished adaptive learning rate method, proposed by Geoffrey Hinton in [67], is
a different approach to improving Adagrad. RMSprop utilizes a moving average of
squared gradients in order to modulate the learning rate of each weight [62].
It is defined as:
$D[\phi]_\tau = DR \cdot D[\phi]_{\tau-1} + (1 - DR)\phi_\tau$ (2-23)

$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{D[\nabla E(w_\tau)^2]_\tau + \epsilon}} \cdot \nabla E(w_\tau)$ (2-24)
Where $D[\nabla E(w_\tau)^2]_\tau$ is the moving average of squared gradients at
iteration $\tau$; and $DR$ (decay rate) is a hyper parameter whose usual values are
0.9, 0.99, or 0.999 [62, 65, 67].
In the same way as Adadelta, RMSprop solves the vanishing learning rate caused by
the monotonically smaller updates [62].
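RMSprop is a small modification of Adagrad: the growing sum of squared gradients is replaced by a moving average, as in this sketch of equations (2-23) and (2-24):

import numpy as np

def rmsprop_step(w, avg_g2, grad_fn, eta=0.001, decay=0.9, eps=1e-8):
    # (2-23)-(2-24): a moving average of squared gradients modulates the
    # per-parameter learning rate, so old gradients are gradually forgotten.
    g = grad_fn(w)
    avg_g2 = decay * avg_g2 + (1 - decay) * g ** 2
    return w - eta * g / np.sqrt(avg_g2 + eps), avg_g2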
2.2.2.1.3.1.6 Adam
Adaptive Moment Estimation (Adam) is an alternative method that calculates the
adaptive learning rates for each parameter. Adam combines “the ability of Adagrad to
deal with sparse gradients, and the ability of RMSprop to deal with non-stationary
objectives” [68] by keeping track of both an exponentially decaying average of past
square gradients 𝑣𝜏, and an exponentially decaying average of past gradients 𝑚𝜏. The
first element corresponds to the second moment of the gradients (variance). The
second element represents the first moment of the gradients (mean). 𝑣𝜏 and 𝑚𝜏 values
correspond to 𝐷[ϕ2]𝜏 and 𝐷[ϕ]𝜏, respectively [65].
As mentioned in [68], $v_\tau$ and $m_\tau$ are initialised as vectors of zeros. This
causes the moment estimates to be biased towards zero, particularly when the decay
rates are small and during the initial time steps. The authors counteract these biases in
the following way:
$\hat{m}_\tau = \dfrac{m_\tau}{1 - \beta_1^{\tau}}$ (2-25)

$\hat{v}_\tau = \dfrac{v_\tau}{1 - \beta_2^{\tau}}$ (2-26)
Where $\beta_1$ and $\beta_2$ are exponential decay rates for the moment estimates;
and $\hat{m}_\tau$ and $\hat{v}_\tau$ are the bias-corrected first moment estimate and
second raw moment estimate, respectively.
These exponentially decaying averages are then used to update the parameters as
follows:
$w_{\tau+1} = w_\tau - \dfrac{\eta}{\sqrt{\hat{v}_\tau} + \epsilon} \cdot \hat{m}_\tau$ (2-27)
According to [68], Adam 1) requires little memory; 2) outperforms other adaptive
learning methods for a variety of models and datasets; and 3) scales to large-scale high
dimensional machine learning problems.
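Putting equations (2-25) to (2-27) together gives the full Adam step. The sketch below assumes the iteration counter t starts at 1, so the bias-correction denominators are never zero; grad_fn is again an assumed gradient callable:

import numpy as np

def adam_step(w, m, v, t, grad_fn, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Decaying averages of gradients (m, first moment) and squared
    # gradients (v, second moment), followed by bias correction.
    g = grad_fn(w)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)    # bias-corrected first moment (2-25)
    v_hat = v / (1 - b2 ** t)    # bias-corrected second moment (2-26)
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v    # (2-27)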
2.2.2.1.4 Backpropagation
Backpropagation provides a computationally efficient method for evaluating the
derivatives of the error function with respect to the weights [19]. The purpose of this
algorithm is to calculate the contribution of each node to the error at the output.
The following block describes the algorithm:
Algorithm 1 Backpropagation algorithm [19, 26]
1. Apply a forward pass to an input vector $x_n$ using $a_j = \sum_i w_{ji} x_i$ and
$z_j = h(a_j)$ to find the activations of all the hidden and output units.
2. Evaluate $\delta_k$ for all the output units using $\delta_k = y_k - t_k$.
3. Backpropagate the $\delta$'s using $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$ to
obtain $\delta_j$ for each hidden unit in the network.
4. Use $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ to evaluate the required
derivatives.
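For a single hidden layer with h = tanh and softmax outputs, Algorithm 1 reduces to a few lines of NumPy. This sketch returns the weight and bias gradients for one training pair; it is a toy illustration, not the DCNN implementation of later chapters:

import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    a = W1 @ x + b1                        # forward pass (step 1)
    z = np.tanh(a)
    o = W2 @ z + b2
    e = np.exp(o - o.max())
    y = e / e.sum()                        # softmax outputs
    d_out = y - t                          # delta_k = y_k - t_k (step 2)
    d_hid = (1 - z ** 2) * (W2.T @ d_out)  # h'(a_j) * sum_k w_kj delta_k (step 3)
    # dE/dw_ji = delta_j * z_i, formed as outer products (step 4)
    return np.outer(d_out, z), d_out, np.outer(d_hid, x), d_hid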
2.3 Subtleties of creating a model
Building a machine learning model is not a simple task. This subsection describes
general problems that affect every model and particular drawbacks of gradient descent
that need to be considered during the training and testing process.
2.3.1 Bias-Variance trade-off, overfitting and underfitting
It is important to notice that the objective of modelling the data is not to find a model that
fits the training data perfectly, but to find one that generalises to unseen data. This goal
is closely related to the concepts of bias and variance.
In machine learning, the first element, bias, refers to the difference between the
expected value of the predictions and the true value; the second element, variance,
refers to the variability of the model's predictions around their expected value [75].
The figure below illustrates these concepts:
Figure 2-3: Bias and variance representation [76].
When the model shows low bias and high variance, the model is overfitting the data.
This means that the predicted values follow the true values closely (the blue dots are
close to the target in red), but the variability of these predictions across different trials is
big (the blue dots are spread). Conversely, when it displays high bias and low variance,
the model is underfitting the data. This means that the error between the predicted
values and the true values is large (the blue dots are far from the target in red), but the
variability of the predictions across different trials is small (the blue dots are clustered).
There are two alternatives to combat overfitting: reducing the number of dimensions of
the parameter space or reducing the effective size of each dimension.
The methods applied to reduce the number of dimensions of the parameters are:
pruning and weight sharing. Pruning involves removing information from the model
once it is over-fitted. This process is different according to the model, but the idea
remains the same. For instance, when pruning is applied to a decision tree model, it
means that the subtrees are replaced with a leaf; when pruning a neural network, the
unimportant weights are removed. Weight sharing is the method in which a single
weight is shared among many connections in the network. This means that the number
of adjustable weights in the network is less than the number of connections [13, 75, 77].
The methods applied to reduce the size of each dimension are: regularisation and early
stopping. Regularisation is a method that reduces overfitting by adding a complexity
penalty to the loss function. Early stopping refers to a method in which the training
process is stopped as soon as the model’s performance ceases to improve or
decreases [74, 75].
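Early stopping in particular is simple to implement. The sketch below assumes hypothetical train_epoch and validation_error callables and a model exposing get_weights/set_weights; none of these names come from the system described in later chapters:

def train_with_early_stopping(model, train_epoch, validation_error, patience=10):
    # Stop once validation error has not improved for `patience` epochs,
    # then restore the best weights seen so far.
    best_err, best_weights, stale_epochs = float("inf"), None, 0
    while stale_epochs < patience:
        train_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_weights = err, model.get_weights()
            stale_epochs = 0
        else:
            stale_epochs += 1
    model.set_weights(best_weights)
    return model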
2.3.2 Vanishing gradient and exploding gradient
Two important problems arise when neural networks and deep neural networks are
trained using gradient descent: vanishing gradient and exploding gradient. The first
problem occurs when the gradient becomes weaker in the earlier hidden layers as the
gradient signal passes back through multiple layers causing the learning algorithm to
get stuck in poor local minima. In other words, neurons in later layers learn faster than
neurons in earlier layers. Conversely, the second problem occurs when neurons in later
layers learn slower than neurons in earlier layers [21].
A way to prevent these problems is to initialize the parameters using unsupervised
learning, also called generative pre-training. The advantage of applying unsupervised
learning is that the model is forced to represent a “high-dimensional response, namely
the input feature vector, rather than just predicting a scalar response. This acts like a
data-induced regularizer, and helps backpropagation find local minima with good
generalization properties” [21].
Another way to prevent that is to change the activation function. The sigmoid function
was the preferred function in the past; however, due to the disadvantages discussed in
section 2.2.2.1.2.1, this tendency has changed. As it was mentioned, the alternative
activation functions to the sigmoid function are: tanh, ReLU, and maxout. Tanh solves
the zero-centred inconvenience, but it still suffers from vanishing and exploding
gradient; ReLU presents one important drawback: the units in the network die during
training when the learning rate is too high. Finally, maxout seems to be the best
alternative since it generalises the ReLU and leaky ReLU; and has the benefits of ReLU
without its drawbacks [58].
Chapter 3. Deep Convolutional Neural Network
This chapter describes the important elements of Deep Convolutional Neural Networks.
In section 3.1, an overview is presented; afterwards, backpropagation applied to DCNN
is detailed. Section 3.3 and 3.4 provide information about the softmax classifier and the
discrete convolution, respectively. Then, DCNN architectures and strategies to improve
their results are given. Finally, section 3.6 contains information about related work.
3.1 Overview
CNNs, a form of multilayer perceptron (MLP), are neural networks that are specially
designed to exploit the structure of the data, namely 1d signals such as speech or text,
or 2d signals such as images [16, 19, 21]. CNNs are similar to DCNNs; the only
difference between them is that DCNNs have a larger number of hidden layers than
CNNs. Both architectures use three types of layers: the convolutional layer,
pooling/subsampling layer, and fully-connected layer.
3.1.1 Convolutional layer
The convolutional layer is an important element of the network that transforms one
volume of activations into another by convolving small filters with the input volume. This
operation is called discrete convolution and is defined in section 3.4. Convolutional
layers provide the network with two important features: Local connectivity and
parameter sharing.
Local connectivity is achieved when neurons are connected to a local region of the input
volume. This local region is a hyper parameter called receptive field with dimensions r x
r. The neuron's connections are local along the spatial dimensions of the receptive
field, but full along the entire depth of the input volume. Local connectivity
drastically reduces the number of connections on a neural network. This not only
diminishes the processing time, but also improves the performance of the model by
preventing overfitting [16, 24, 25].
Figure 3-1: Local connectivity and receptive field.
Parameter sharing occurs when neurons within a group, called a feature map or
activation map, share the same parameters and cover different parts of the image
by using different receptive fields. This provides the network with translation invariance
which means that useful features that are learnt in some portion of the image can be
used everywhere else without independently learning those features [16, 21, 24].
Figure 3-2: Parameter sharing.
In each convolutional layer, there are filters that represent sets of weights. These filters,
also known as kernels, are convolved with an input volume to generate an output
volume. Each convolution generates an activation map which detects a specific type of
feature. The number of these activation maps is proportional to the number of filters in
the corresponding layer [16, 24, 25].
Figure 3-3: Activation maps [16].
Activation maps can be thought of as the output of neurons after convolving the same
kernel. Their dimensions are determined by three hyper parameters: depth, stride and
zero-padding.
Depth controls the number of neurons that connect to the same region of the input
volume, but activate due to different features in the input such as edges or blobs of
colour. These neurons form a depth column [16, 25].
Stride determines the distance between depth columns. A smaller stride means less
distance between columns, and more overlapping receptive fields between columns
which translates into a larger output volume. Conversely, a higher stride results in a
smaller output volume.
Zero-padding is a hyper parameter that is used to pad the border of the input volume
with zeros in order to control the spatial size of the output volume.
Once these hyper parameters have been defined, it is possible to calculate how many
neurons can be arranged in the output volume using:
(W − F + 2P) / S + 1    (3-1)

Where W is the size of the input volume; F is the receptive field size; P is the amount of
zero-padding used on the border; and S is the stride. It is important to notice that this
result cannot be a decimal number [16, 25].
Similarly, the output volume can be calculated using:
V = Width · Height · Depth    (3-2)

Where:

Width = (W_1 − F + 2P) / S + 1    (3-3)

Height = (H_1 − F + 2P) / S + 1    (3-4)

Depth = K    (3-5)
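As an illustration of equations 3-1 to 3-5, the following Python sketch (a hypothetical helper, not part of the system described in this dissertation) computes the output volume dimensions from these hyper parameters:

```python
def conv_output_volume(w, h, f, p, s, k):
    """Output dimensions of a convolutional layer (equations 3-1 to 3-5).

    w, h: input width and height; f: receptive field size (f x f);
    p: zero-padding per border; s: stride; k: number of filters (depth).
    """
    width = (w - f + 2 * p) / s + 1
    height = (h - f + 2 * p) / s + 1
    # The result must be an integer, otherwise the hyper parameters are invalid.
    if width != int(width) or height != int(height):
        raise ValueError("Hyper parameters do not fit the input volume")
    return int(width), int(height), k

# 48x48 input, 5x5 filters, padding 2, stride 1, 32 filters -> (48, 48, 32)
print(conv_output_volume(48, 48, 5, 2, 1, 32))
```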
3.1.2 Pooling/subsampling layer
The pooling layer reduces the spatial size of the input, decreasing the number of
parameters and the computation in the network. As a consequence, the network gains
local translation invariance and a mechanism to control overfitting [16]. The spatial size
reduction is achieved by taking a set of hidden units within a neighbourhood and
aggregating their activations using a pooling function (MAX pooling, average pooling,
L2-norm pooling or fractional max pooling) [16, 25]. It is worth noting that the depth
dimension remains unchanged.
Figure 3-4: Local translation invariance and spatial size reduction.
There are two configurations commonly used: the overlapping and non-overlapping
pooling. The former applies a 3x3 MAX-pooling filter, while the latter applies a 2x2 MAX-
pooling filter, both with a stride of 2 [16].
Figure 3-5: Downsampling (left) and MAX pooling (right) [25].
The pooling layer operates independently on every depth slice of the input and resizes
this input spatially using the MAX-pooling operation, selecting the maximum value
within a local neighbourhood. In the figure above, the neighbourhood is formed by 4
elements.
The MAX-pooling operation can be expressed using the following equation:
y_{ijk} = max_{p,q} x_{i, j+p, k+q}    (3-6)

Where x_{ijk} is the value of the i-th feature map at position (j, k); p is the vertical index in
the local neighbourhood; q is the horizontal index in the local neighbourhood; and y is
the result of this operation in the pooling layer [25].
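A minimal numpy sketch of equation 3-6 for a single feature map, assuming 2x2 non-overlapping neighbourhoods and even input dimensions, might look as follows:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 non-overlapping MAX pooling with stride 2 over one feature map.

    The depth dimension of a volume would be handled by applying this to
    every slice independently.
    """
    h, w = x.shape
    # Group each 2x2 neighbourhood and take its maximum (equation 3-6).
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 4, 2, 9]])
print(max_pool_2x2(x))  # [[6 7], [8 9]]
```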
An alternative to the pooling layer is a convolutional layer with larger strides which
would also reduce the spatial size. According to [16], given the aggressive reduction in
the size of the representation when the pooling layer is used, “the trend in the literature
is towards discarding the pooling layer in modern ConvNets.”
3.1.3 Fully-connected layer
After the previous layers, there can be any number of fully-connected layers. These
layers are regular neural networks with neurons that have full connections to all
activations in the previous layer.
3.2 Backpropagation in DCNNs
In a DCNN model, the convolutional and pooling operations are applied to the gradient
as part of the backpropagation algorithm.
For the convolutional operation, the gradient is calculated as follows:
∇_{W_k^{(l)}} J(W, b; x, y) = ∑_{i=1}^{m} (a_i^{(l)}) ∗ flip(δ_k^{(l+1)})    (3-7)

∇_{b_k^{(l)}} J(W, b; x, y) = ∑_{i=1}^{m} ∑_{a,b} (δ_k^{(l+1)})_{a,b}    (3-8)
Where J is the cost function; (W, b) are the parameters; and (x, y) are the training data
and label pairs [15, 26].
Similarly, for the pooling layer, the gradient is calculated as follows:
δ_k^{(l)} = upsample((W_k^{(l)})^T δ_k^{(l+1)}) · h′(a_j)    (3-9)
Where k is the index of the filter and h′(a_j) is the derivative of the activation function.
Additionally, the upsample operation propagates the error through the pooling layer by
calculating the error with respect to each unit incoming to this layer [15, 26].
3.3 Softmax classifier
There are complex scenarios where binary classification presents limitations as real-life
data is usually classified into multiple classes. In those cases, multi-class classification
is required. This can be attained by using a classifier such as softmax. Unlike other
classifiers that treat the output as uncalibrated scores for each class, the softmax
classifier provides K mutually exclusive normalized class probabilities.
The outputs of this classifier are interpreted as y_k(x, w) = p(t_k = 1 | x), with the following
error function:
E(w) = − ∑_{n=1}^{N} ∑_{k=1}^{K} t_{kn} ln y_k(x_n, w)    (3-10)
The softmax classifier transforms a K-dimensional vector 𝒛 of arbitrary real-valued
scores into a vector of values between zero and one that sum to one. This is achieved
by using the softmax function:
y_k(x, w) = exp(a_k(x, w)) / ∑_j exp(a_j(x, w))    (3-11)

Which satisfies 0 ≤ y_k ≤ 1 and ∑_k y_k = 1.
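A small numpy sketch of equation 3-11 follows; subtracting the maximum score first is a standard numerical-stability trick that is not discussed above and does not change the result:

```python
import numpy as np

def softmax(z):
    """Softmax over a K-dimensional score vector z (equation 3-11)."""
    e = np.exp(z - np.max(z))   # shift scores for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())       # probabilities in [0, 1] that sum to 1
```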
3.4 Discrete convolution
An important element in the CNN is the discrete convolution operation which is used to
compute pre-activation values. These pre-activation quantities are then passed through
a non-linearity to get the values of the hidden units in the feature maps.
The discrete convolution of an input image 𝑥 with a (𝑟 × 𝑟)-kernel 𝑘 is expressed as
follows:
(x ∗ k)_{ij} = ∑_{p,q} x_{i+p, j+q} k_{r−p, r−q}    (3-12)
Where (x ∗ k)_{ij} is the convolution at position (i, j); x_{i+p, j+q} is the value of the input
image at position (i + p, j + q); and k_{r−p, r−q} is the value of the kernel at position
(r − p, r − q). The image below shows an example of the discrete convolution operation:
Figure 3-6: Example of discrete convolution.
The values of the hidden units are calculated using:
y_j = g_j tanh(∑_i k_{ij} ∗ x_i)    (3-13)

Where k_{ij} is the convolution kernel, x_i is the i-th channel of the input, and g(⋅) is the
ReLU activation function described in section 2.2.2.1.2.3.
It is worth noting that, with a non-linearity, the discrete convolution operation provides
feature detection to the hidden layers by emphasizing a correspondence between a
learned filter and a particular region [25]. This combination helps to detect features such
as edges or wrinkles.
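The following numpy sketch implements equation 3-12 directly as a "valid" convolution. It is a naive illustration of the definition; real frameworks use far faster routines:

```python
import numpy as np

def discrete_conv2d(x, k):
    """'Valid' 2-D discrete convolution of image x with an r x r kernel k
    (equation 3-12). The kernel is flipped, which is what distinguishes
    convolution from cross-correlation."""
    r = k.shape[0]
    flipped = k[::-1, ::-1]                 # k_{r-p, r-q}
    h, w = x.shape
    out = np.zeros((h - r + 1, w - r + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + r, j:j + r] * flipped)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.array([[0., 1.], [2., 3.]])
print(discrete_conv2d(x, k))
```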
3.5 Deep Convolutional Neural Network Architectures
DCNN architectures are created by combining the layers described in section 3.1. The
most common form of DCNN stacks a number of convolutional and pooling layers
together until the input image has been spatially reduced. This intermediate output is
followed by fully-connected layers, the last of which outputs a value such as the class
score [16, 25].
3.5.1 Architecture Considerations
DCNNs are powerful neural networks, but it is important to consider some elements
during their design:
3.5.1.1 Input layer
The size of the input volume should be divisible by two several times. These size values
range from 32 to 512 [16].
3.5.1.2 Convolutional layer
Small filters are commonly used with a stride of one and zero-padding. The zero-padding
is selected using the formula P = (F − 1) / 2, so that the size of the input volume is
preserved. It is worth noting that small filters such as 3x3 and 5x5 can be used in any
layer, but large filters such as 7x7 should only be used in the first convolutional layer
[16].
It is preferable to stack several small filters rather than use one equivalent large filter,
because the small filters express more powerful features of the input by preserving the
non-linearities, and require fewer parameters [16].
3.5.1.3 Pooling layer
It is common to use 2x2 or 3x3 filters. The reason to prefer small filters is that large
filters are too lossy and aggressive which causes poor performance [16].
3.5.1.4 Strides and Zero-Padding
According to [16], a stride of one is preferred because it preserves the spatial size of the
input volume and works better in practice. Similarly, zero-padding maintains the spatial
size of the input and prevents the information at the border from disappearing too quickly.
3.5.2 Improvement Strategies
There are some strategies to improve the performance of the network:
Data augmentation: This method creates additional versions of the original
image that help the network to be resistant against small changes that do not
affect the structure of the image [21].
The transformations usually applied to increase the data set are: rotation,
translation, zooming, flipping, and random cropping. These transformations are
not applied to each image; instead, they are randomly applied to some of them in
order to prevent overfitting. The table below shows the result of applying these
transformations to an image.
Table 3-1: Data augmentation transformations applied to an original image: rotation,
horizontal flipping, translation, vertical flipping, zooming, and random cropping.
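As a rough illustration of how such transformations might be applied at random in Python (a simplified numpy sketch covering rotation and flipping only; the project itself relied on OpenCV routines), consider:

```python
import random
import numpy as np

def augment(image):
    """Randomly apply one transformation from Table 3-1 to a 2-D numpy
    image, or leave it unchanged; applying transformations only to a
    random subset of images helps prevent overfitting."""
    choice = random.choice(["rotate", "hflip", "vflip", "none"])
    if choice == "rotate":
        return np.rot90(image)      # 90-degree rotation
    if choice == "hflip":
        return image[:, ::-1]       # horizontal flip
    if choice == "vflip":
        return image[::-1, :]       # vertical flip
    return image
```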
Dropout method: This method prevents neurons from being helpful only in the
context of several other specific neurons, by randomly dropping neurons, along
with their connections, from the neural network during training. In other words, it
stops two or more neurons from repeatedly detecting the same feature and
wasting not only the network’s capacity, but also computational resources [23,
61].
Therefore, the main idea of dropout is to remove individual activations at random
during training which makes the model “more robust to the loss of individual
pieces of evidence and thus less likely to rely on particular idiosyncrasies of the
training data” [61].
The dropout neural network model is defined as follows:
r_j^{(l)} ~ Bernoulli(p)    (3-14)

ỹ^{(l)} = r^{(l)} ∗ y^{(l)}    (3-15)

z_i^{(l+1)} = w_i^{(l+1)} ỹ^{(l)} + b_i^{(l+1)}    (3-16)

y_i^{(l+1)} = f(z_i^{(l+1)})    (3-17)

Where r is a vector of independent Bernoulli random variables, each of which has
probability p of being 1; ỹ is the thinned output, calculated by multiplying r
element-wise with the output from the layer, y^{(l)}; and the operator ∗ represents
an element-wise product [61].
At test time, it is necessary to adjust the weights according to:
W_test^{(l)} = p W^{(l)}    (3-18)
It is mentioned in [23] and [61] that although dropout alone improves the
performance of the network, combining dropout with max-norm regularisation,
large decaying learning rates and high momentum improves the performance of
SGD even further.
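A minimal numpy sketch of equations 3-14 to 3-18, showing both the training-time mask and the test-time scaling (an illustration, not the implementation used in this project), is:

```python
import numpy as np

def dropout_forward(y, p, train=True):
    """Dropout on the activations y of one layer.

    p is the probability that a unit is kept. At training time, units are
    dropped at random (equations 3-14 and 3-15); at test time the output
    is scaled by p, as in equation 3-18.
    """
    if train:
        r = np.random.binomial(1, p, size=y.shape)  # r ~ Bernoulli(p)
        return r * y                                # thinned output
    return p * y                                    # expected activation

activations = np.array([0.5, 1.2, 0.0, 2.3, 0.7])
print(dropout_forward(activations, p=0.5))
print(dropout_forward(activations, p=0.5, train=False))
```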
Early stopping: It is a procedure that controls overfitting by tracking the
performance of the model on a validation set. Once this performance stops
improving, or decreases, gradient descent is terminated. This method is
widely used because it is simple to understand and implement, and has
proved better than other regularisation methods in numerous cases [53,
74].
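A generic sketch of the procedure, where train_one_epoch and evaluate are hypothetical callables standing in for the actual training and validation routines:

```python
def train_with_early_stopping(train_one_epoch, evaluate, patience=10, max_epochs=500):
    """Stop gradient descent once validation accuracy stops improving.

    train_one_epoch() runs one pass of gradient descent and returns the
    current model state; evaluate(state) returns validation accuracy.
    """
    best_acc, best_state, stale = 0.0, None, 0
    for epoch in range(max_epochs):
        state = train_one_epoch()
        acc = evaluate(state)
        if acc > best_acc:
            best_acc, best_state, stale = acc, state, 0
        else:
            stale += 1
            if stale >= patience:
                break  # validation accuracy has stopped improving
    return best_state, best_acc
```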
Ensemble method: It is a method based on the law of large numbers,
which establishes that the average of the predicted values tends to get closer to
the expected value as the number of trials increases [63]. In other words, by
creating different models and averaging their predictions, the likelihood of
predicting the correct expression increases. However, the models must be
different from each other so that the bias for specific values decreases.
In order to create different models, there are a couple of alternatives: 1) Same
model, different initializations; and 2) Top models obtained during cross-
validation. The first alternative involves generating the best model using cross-
validation, then training multiple models based on this model, but utilizing
different random initializations. The drawback of this approach is that variety is
only introduced through random initialization. The second alternative requires
determining the best hyper parameters, then selecting the top models to form the
ensemble. The disadvantage of this approach is that it can choose suboptimal
models and affect the accuracy [61, 62].
The disadvantage of this method is that it takes longer to evaluate, as it
involves more than one model. Recent work from Geoff Hinton on “dark
knowledge” suggests that it is possible to extract the essence of the ensemble
model and create a new single model by “incorporating the ensemble log
likelihoods into a modified objective" [62, 64].
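A compact numpy sketch of the averaging step, assuming hypothetical model objects that expose a predict_proba method:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average class probabilities over several independently trained
    models and return the index of the most likely facial expression."""
    probs = np.mean([m.predict_proba(x) for m in models], axis=0)
    return int(np.argmax(probs)), probs
```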
3.6 Related work
This section describes briefly some CNN architectures that have been used to classify
different types of images. The first three architectures were developed to participate in
the ImageNet Challenge, while the last architecture was applied in facial expression
recognition.
3.6.1 AlexNet
AlexNet was proposed by Alex Krizhevsky to classify images into 1000 different
classes as an entry for the ImageNet 2012 challenge. It contains eight learned layers, of
which five are convolutional layers, two are fully-connected layers, and one is a fully-
connected layer with a 1000-way softmax. Moreover, this model implements a local
normalization scheme which aids generalisation [35].
There are three important elements used in this model in order to reduce overfitting: 1)
neurons with the ReLU nonlinearity (since DCNNs with ReLU train “several times faster
than their equivalents with tanh units” [35]); 2) dropout regularisation method; and 3)
data augmentation (image translation, horizontal reflection, and PCA on the set of RGB
pixel values) [35].
This architecture is displayed in Figure 3-8. It can be noticed that some of the
convolutional layers are followed by max-pooling and normalisation layers. Moreover,
the first two fully-connected layers apply the dropout regularisation method.
3.6.2 Visual Geometry Group
The Visual Geometry Group (VGG) architecture was developed by Karen Simonyan
and Andrew Zisserman as an entry for the ImageNet 2014 challenge. These DCNNs
push the number of weight layers to 19, demonstrating that increasing the depth of the
network is beneficial for the classification accuracy. The novel approach of this
architecture was using very small 3x3 receptive fields throughout the whole network.
There are three advantages of very small over large receptive fields: 1) given that more
non-linear rectification layers are used, the decision function is more discriminative; 2)
the number of parameters is greatly reduced; and 3) as a consequence of the previous
point, the training time is reduced [79].
This VGG architecture is shown in Figure 3-8. Similarly to AlexNet, the VGG model
applied dropout regularisation (p = 0.5) in the first two fully-connected layers [79].
3.6.3 GoogLeNet
The DCNN architecture named Inception was developed by Christian Szegedy and his
team at Google, and was entered in the ImageNet 2014 challenge.
The novel idea was to stack inception modules, shown below, upon each other with
occasional max-pooling layers, which improves the utilization of computing resources as
a result of the “ubiquitous use of dimensionality reduction prior to expensive
convolutions with larger patch sizes” [80]. In other words, by halving the resolution of
the grid, it is possible to increase the number of layers without creating computational
difficulties [80].
The inception module can be observed in the next figure:
Figure 3-7: Inception module: naïve version (left) and with dimensionality
reduction (right).
The model follows the "practical intuition that visual information should be processed at
various scales and then aggregated so that the next stage can abstract features from
the different scales simultaneously" [80].
As a result, when compared with AlexNet, the model used 12 times fewer parameters
(only five million) and improved the top-5 error by almost 9.73% [24, 35, 80].
This architecture can be observed in Figure 3-8. It is worth noting that the network
eliminates the fully-connected layers completely.
3.6.4 Convolutional Neural Network for Facial Expression Recognition
This architecture is similar to AlexNet, but the authors applied regularisation heavily
after each convolutional layer and each fully-connected layer. The authors omitted the
dropout value applied in each layer, but provided information about each convolutional
filter and the number of neurons in each fully-connected layer [81].
The dataset used in their project was taken from Kaggle and consists of about 37,000
gray-scale images of centred faces with dimensions 48x48. Moreover, the purpose of
their work was to recognise the same seven facial expressions recognised in this
dissertation [81].
The structure of the deep network used in this project is also shown in Figure 3-8. It is
important to notice that this model also increases the number of convolutional filters as
the size of the images is reduced after each max-pooling layer [81].
Figure 3-8: Architectures from left to right: AlexNet, VGG, GoogLeNet, and CNN
for facial recognition [35,79,80, 81].
Chapter 4. Research methodology
This chapter focuses on the research methodology used during this project and
describes its different phases. The chapter also details the components that were
selected during the development of this system and the justification for those decisions,
including an overview of the datasets used to train the DCNN model.
4.1 Project phases
This section elaborates on the phases the project went through. These phases were as
follows: data collection; pre-processing; system design; development; training and
testing; and evaluation phase.
4.1.1 Data collection
Considering that all machine learning algorithms are data driven, this stage was critical
for the system development given that selecting the incorrect dataset would have
increased the difficulty of the project. For instance, datasets with noisy backgrounds
would have unnecessarily imposed an additional challenge to the face detector
algorithm since the main objective of this project is not to detect faces.
There are different facial expression datasets available with varying image quality;
however, most high-quality datasets require permission for usage to be granted. This
permission is fairly easy to obtain, since most dataset owners only require the user to fill
in a form as a mere formality.
Therefore, after careful consideration, the KDEF and JAFFE databases were selected
to train the DCNN for this project, because they offered labelled images that contain the
required facial expressions against noise-free backgrounds. These databases are
described below:
4.1.1.1 KDEF database
The Karolinska Directed Emotional Faces (KDEF) database is a set of 4900
photographs displaying seven different human facial expressions. These images were
obtained by photographing 70 amateur actors (35 females and 35 males) twice from
five different angles [47], and contain people without beards, moustaches, earrings,
eyeglasses, or visible make-up.
Figure 4-1: Sample images from the KDEF dataset.
4.1.1.2 JAFFE database
The Japanese Female Facial Expression (JAFFE) database is a set of 213 photographs
of 10 Japanese female models displaying seven facial expressions (six basic facial
expressions and one neutral) [48].
These databases contain posed expression images, in which the subject is instructed to
show a specific emotion, and do not display any occlusion or varying levels of
illumination. It is important to notice that, although the datasets do not show occlusion, it
was possible to modify the images to include this element and, as a result, to increase
the size of the training set.
Figure 4-2: Sample images from the JAFFE dataset.
4.1.2 Pre-processing
In the second phase the data was cleaned and transformed. This phase was divided into
two parts: face detection and data modification.
4.1.2.1 Face detection
Face detection was employed to extract faces from the datasets using a boosted
cascade of simple features, as described by Paul Viola and Michael Jones in [52].
In this process, the method proposed by Viola and Jones was used. The approach trains
a machine learning model using Haar-like features in order to extract important
characteristics from an image. The effect of these features was to reduce by “over one
half the number of locations where the final detector must be evaluated” [52]. For face
detection, the authors utilised three types of features: the two-rectangle feature, which
calculates the difference between the sums of the pixels within two rectangular regions;
the three-rectangle feature, which calculates the difference between the sum of the
pixels within the central rectangular region and the sum of the pixels within the two
outside rectangles; and the four-rectangle feature, which calculates the difference
between the pixels within the diagonal pairs of rectangles [52]. These Haar-like features
can be observed in Figure 4-3.
Figure 4-3: Haar features [42].
To improve the evaluation time, they used an image representation, called integral
image, to obtain the sum of values in a rectangular area. The integral image is defined
as:
I(x, y) = ∑_{x′ ≤ x, y′ ≤ y} i(x′, y′)    (4-1)

Where I(x, y) is the integral image and i(x′, y′) is the original image. This represents
the sum of the pixels from the origin to the point (x, y). Therefore, as shown in
Figure 4-4, in order to calculate the sum of values in a rectangular area delimited by
(x_0, y_0) and (x_1, y_1), where Area_0 < Area_1, the following operation is performed:

sum = I(x_1, y_1) − I(x_1, y_0) − I(x_0, y_1) + I(x_0, y_0)    (4-2)
Figure 4-4: Area calculation using Integral Image.
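A short numpy sketch of equations 4-1 and 4-2 (an illustration only; OpenCV ships its own integral-image routine):

```python
import numpy as np

def integral_image(i):
    """Integral image of i (equation 4-1): each entry holds the sum of
    all pixels above and to the left, inclusive."""
    return i.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, x0, y0, x1, y1):
    """Sum of pixels in the rectangle delimited by (x0, y0) and (x1, y1),
    exclusive of row x0 and column y0, using equation 4-2."""
    return I[x1, y1] - I[x1, y0] - I[x0, y1] + I[x0, y0]

img = np.ones((6, 6))
I = integral_image(img)
print(rect_sum(I, 1, 1, 4, 4))  # 3x3 block of ones -> 9.0
```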
Despite simplifying the rectangular area calculation, performing this operation over the
whole image would still require a significant amount of processing. Ideally, it would be
preferable to focus on objects of interest and ignore parts of the image that do not
contain valuable information. The solution to this problem was a machine learning
model that combined “increasingly more complex classifiers in a ‘cascade’” [52], which
substantially decreased the face detection time. Given that the system processes
images in real time, it was reasonable to select this approach.
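In practice, OpenCV ships a pre-trained Viola-Jones cascade; a minimal usage sketch (the file and image names are illustrative and depend on the local installation) looks like this:

```python
import cv2

# Load OpenCV's pre-trained frontal-face Haar cascade.
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

image = cv2.imread("subject.jpg")              # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]              # crop each detected face
```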
4.1.2.2 Data modification
After the previous step, the images were resized to either 48x48 or 96x96 (depending
on the experiment performed) and converted to grayscale images. Then, the data was
modified by applying one or a combination of the following methods: PCA, whitening,
and normalisation.
4.1.2.2.1 Principal Component Analysis
PCA is a pre-processing method that decorrelates the input data by projecting it into a
new coordinate system. PCA uses the eigenvectors of the covariance matrix to project
the data onto the new basis (which is assumed to be orthonormal). Then, by selecting
the components with the largest variance, the dimensionality of the input data can be
reduced.
In [73], the steps to calculate PCA are described:
Centralise data
The first step is to subtract the mean vector from all the instances in a given d × N
dataset X. The centred value is denoted by x̄ and defined as:

x̄ = x − μ    (4-3)

Where μ is the mean vector of x.
Calculate covariance matrix
This matrix is calculated with x̄ in the following form:

S = X̄ X̄^T / N    (4-4)
Perform Eigen analysis
Then, the eigenvalues are calculated using this matrix. Afterwards, the eigenvalues and
eigenvectors are ordered so that λ_1 ≥ … ≥ λ_d.
Find principal components
The eigenvectors of S corresponding to the M largest eigenvalues, λ_1 ≥ … ≥ λ_M, are
selected to form the projection matrix U_M = [u_1, …, u_M].
Encode data
In order to encode the data, the following formula is applied:
z = U_M^T (x − μ)    (4-5)

Where z is the encoded M-dimensional representation of x.
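The steps above translate directly into numpy; the following sketch (illustrative only, with instances stored as columns) follows equations 4-3 to 4-5:

```python
import numpy as np

def pca_encode(X, M):
    """PCA encoding of a d x N data matrix X, keeping M components."""
    mu = X.mean(axis=1, keepdims=True)
    X_bar = X - mu                              # centralise (4-3)
    S = X_bar @ X_bar.T / X.shape[1]            # covariance matrix (4-4)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]           # sort so l_1 >= ... >= l_d
    U_M = eigvecs[:, order[:M]]                 # projection matrix
    Z = U_M.T @ (X - mu)                        # encode (4-5)
    return Z, U_M, mu
```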
4.1.2.2.2 Whitening
Whitening is another process to standardise input data. According to [57], the geometric
interpretation of the whitening transformation is that "if the input data is a multivariable
Gaussian, then the whitened data will be a Gaussian with zero mean and identity
covariance matrix."
The transformation is defined as follows:
X_white = X_rot / S    (4-6)

Where X_rot is the decorrelated data (the dot product between the eigenvectors and
the input data), and S is a one-dimensional array that contains the singular values
(which, for the covariance matrix, correspond to its eigenvalues).
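A numpy sketch of this transformation, following the SVD-based formulation in [57], which divides by the square roots of the singular values and adds a small constant to avoid division by zero:

```python
import numpy as np

def whiten(X_bar):
    """Whiten centred d x N data X_bar via SVD of its covariance matrix."""
    cov = X_bar @ X_bar.T / X_bar.shape[1]
    U, S, _ = np.linalg.svd(cov)
    X_rot = U.T @ X_bar                       # decorrelate the data
    return X_rot / np.sqrt(S[:, None] + 1e-5)
```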
4.1.2.2.3 Normalisation
Normalisation is a process that involves zero-centring the input data, and dividing this
value by the standard deviation. This method is defined by the following operation:
x̄ = (x − μ_x) / σ_x    (4-7)

Where x ∈ R and x̄ ∈ R are the original and normalised feature vectors respectively, μ_x
is the mean of x, and σ_x represents the standard deviation of x.
After the pre-processing stage, the values were split into training and test sets in a
proportion of four to one.
4.1.3 System design
The decisions taken during this phase were crucial, given the repercussions that bad
design choices have in the software industry. For instance, the Obamacare website had
glitches that prevented people from signing up and increased the total development cost
to amounts ranging from $840 million to over $2 billion [55, 56].
With that in mind, a careful analysis of different elements was made in order to select
the most suitable components.
4.1.3.1 Elements involved
Sections 4.1.3.1.1 - 4.1.3.1.3 describe the elements that were used to train the DCNN
model, and create the GUI application.
4.1.3.1.1 Amazon Web Services and Graphics Processing Unit
Amazon Web Services (AWS) is a cloud services platform that offers database
solutions, storage, application hosting, compute power, etc. [28]. As the name suggests,
these services are provided by Amazon.com and are used by popular companies such
as Netflix, Spotify, Unilever, Yelp, to name a few.
The system was trained in an Elastic Compute Cloud (EC2) instance on AWS running
the Ubuntu operating system. There were some reasons for this: the first reason was
that the Ubuntu instance on AWS provides access to NVIDIA GPUs (which train
machine learning models faster than CPUs) with up to 1,536 CUDA cores and 4GB of
video memory per GPU [29]; the second reason was that, according to [35], training
machine learning models, particularly deep learning models, is a time-consuming
process ranging from a couple of hours to several days depending on the model
complexity and the processing power available; finally, another reason was the higher
cost of buying a GPU compared with the cost of renting an instance in AWS.
The GPU installed on the server is an NVIDIA GRID K520, based on the Kepler
architecture, which was designed to address "the most daunting challenges in High
Performance Computing (HPC)" [30], improving the performance of previous
architectures by up to three times. This card contains two GPUs; in other words, it
provides 3,072 CUDA cores and 8GB of video memory in total [31, 32].
4.1.3.1.2 Programming language
Selecting the correct programming language to develop the system was a daunting task
since, nowadays, there are more than 2000 high level programming languages [33].
Fortunately, this problem was simplified by the project requirements.
During the training phase, the programming language needed to be simple to learn and
apply, and, since the project involves neural networks, it needed to be compatible with
the GPU libraries. Moreover, it was important that it had enough documentation and
community support so that problems during the development cycle were solved within a
reasonable time.
During the GUI design and neural network implementation, the programming language
required to be also simple and accessible. Moreover, since the application classifies
images in real time, this language had to be fast enough to deal with this type of input
and the classification process.
After considering these factors, the list was reduced substantially. Among the remaining
languages, R, C++, Python, and Matlab are some of the most popular for machine
learning. R is regarded as a powerful language, but it has a steep learning curve. C++
is considered the best option when performance is critical, although it is difficult to learn
and even more difficult to optimise; moreover, its development cycle is longer than in
other programming languages. Python is a user-friendly programming language with a
variety of libraries that allow transparent use of GPUs. Matlab is a balanced alternative,
being user-friendly, fast, and acceptably performant, but the purchase of a licence is
required in order to use the main product and some of its libraries [34, 36].
After a reasonable analysis, Python was selected to train the DCNN, and to create the
GUI given that it has good performance, offers a gentle learning curve, and does not
require a licence to be bought.
4.1.3.1.3 Libraries and development environments
According to the programming languages selected, the following section describes the
elements used to develop the system.
Anaconda: Anaconda is a Python distribution that includes 400 packages for
science, math, engineering, and data analysis. It is easy to install and provides
flexibility when interacting with other languages [37].
CUDA and CuDNN: CUDA is “a parallel computing platform and programming
model invented by NVIDIA” that allows to take advantage of the power of the
GPU [38]. Similarly, CUDA Deep Neural Network (CuDNN) library is a collection
of primitives for DNN that provides “highly tuned implementations for standard
routines such as forward and backward convolution, pooling, normalization, and
activation layers” [39].
Theano: It is a Python library that helps to “define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently” [40]. It
features, among other things, efficient symbolic differentiation, transparent use
of a GPU, speed and stability optimizations, and dynamic C code generation [40].
Lasagne: Lasagne is a lightweight library that is used to build and train neural
networks using Theano [41].
OpenCV: It is an open source library with C++, Python, and Java interfaces that
offers more than 2500 optimized computer vision and machine learning
algorithms. These algorithms facilitate tasks such as: face detection, object
recognition, object tracking, etc. [43].
PyQt: It is a set of python bindings which are application programming interfaces
that provide code to use the Qt application framework. The Qt application
framework is a toolkit that contains abstractions not only of GUI components, but
also of network sockets, threads, etc. [78].
4.1.3.1.4 Justification
There were some technical factors that were considered in order to select these
elements, but special attention was given to three attributes: simplicity, familiarity, and
flexibility.
In the case of Anaconda, it was selected to simplify the installation process, and to
avoid disrupting other libraries and dependencies.
Regarding CUDA and CuDNN, these elements were selected to take advantage of the
NVIDIA GPUs on the servers.
Concerning Theano and Lasagne, they were selected because they offer a simpler and
more efficient implementation of deep learning algorithms. There were other alternatives
to Theano and Lasagne such as TensorFlow, Torch and Caffe; however, the
alternatives presented some disadvantages: TensorFlow is up to four times slower than
other deep learning tools because it does not support inline matrix operations [45];
Torch is a powerful computing framework widely regarded as one of the best options,
but due to some constraints in the available time, learning LuaJIT was not a viable
alternative [46]; finally, Caffe has good performance, but it is not as simple to use as
Lasagne [45].
In terms of image processing, OpenCV was selected because it contains the code
necessary to read webcam images. Moreover, it includes not only the functions required
for data augmentation, but also the Viola and Jones model for face detection which
significantly reduces the pre-processing phase.
Finally, PyQt was selected because it simplifies the development of the GUI and
provides a way to create appealing interfaces.
4.1.4 Development
The development phase was another important part of this process, in which
underestimating the development time or producing poor, complex code were
potential risks. Therefore, in order to mitigate these risks, a methodology based on
modular design was used. Modular design simplifies the development of software by
subdividing a system into smaller parts. The key idea of this methodology is that
each part performs a specific task, so that modifying a particular module affects the
other modules minimally or not at all.
This phase was divided in two parts: the training part and the application part.
In the training part the following modules were developed:
● The input module: It is in charge of extracting the features and labels.
● The DCNN module: It receives the features and labels, and creates the DCNN
model.
In the application part, the following modules were developed:
The GUI module: It is the main component. This module displays the input data,
the classification of the facial expression, and the likelihood of each expression.
The input processing module: it reads the input data depending on the source
(image or webcam) and processes the image following the same steps as the
input module in the training part.
The DCNN application module: it receives the pre-process vector and outputs the
classification and probabilities for all the expressions by using the previously
trained DCNNs.
4.1.5 Training and Testing
This stage focuses on training the models by varying the hyper parameters, applying
data augmentation techniques, and utilising different datasets; and studying the DCNN
by performing different experiments with them.
4.1.5.1 Model selection
Model selection requires choosing the DCNN architecture and the hyper parameters.
Selecting the best elements is not as simple as following a recipe given that there are
several variables involved in this process. For instance, using different datasets means
that hyper parameters need to be adjusted to achieve similar performance.
Subsections 4.1.5.1.1 and 4.1.5.1.2 list the elements that were considered during the
training and testing stage.
4.1.5.1.1 Hyper parameter tuning
Training an accurate model involved an important process named hyper parameter
tuning. This process is concerned with adjusting the hyper parameters, which are the
values that define the model, in order to achieve good performance. During this phase,
the following hyper parameters were tuned:
Learning rate: It conditions the number of iterations the model takes to converge
to a satisfactory solution by controlling the step size when moving towards the
global minimum or local minima. According to [74], this is often the “single most
important hyper-parameter and one should always make sure that it has been
tuned.” Values for a neural network can vary greatly, ranging from more than
10^{-6} to less than 1. However, a default value that works for standard neural
networks is 0.01 [62, 74].
Gradient descent optimization algorithm: This hyper parameter affects the
convergence speed as it specifies the algorithm used during gradient descent. As
it was mentioned in Chapter 2, there are six algorithms commonly utilised:
Momentum, Nesterov Accelerated Gradient, Adagrad, Adadelta, RMSprop, and
Adam.
Number of epochs: It controls the number of iterations the training process
will last. On the one hand, a small number will prevent convergence because the
training process will stop prematurely; on the other hand, a large number will
waste processing time without achieving better performance.
It is important to notice that when early stopping is used, the number of epochs
does not need to be set, given that, as described in section 2.3.1, the model will
stop as soon as its performance ceases to improve or decreases [74, 75].
Number of neurons in the FC layer: It specifies the number of neurons in the
fully-connected layer. A small number of neurons will prevent the network from
modelling complex data, while an excessive number of neurons will increase the
training time and overfit the data.
Number of FC layers: It specifies the number of layers which affects the type of
decision boundary that the architecture is able to model. However, the greater
the number of layers, the more complex the model and the more difficult it is to
select the hyper parameters so that the architecture can converge.
Size of max-pooling filters: Varying this value affects the spatial dimensions of
the input; however, the most common and recommended setting for this filter is a
2x2 filter. Bigger filters, as it was mentioned in section 3.5.1.3, are uncommon
because they affect the performance [16].
Size of convolutional filters: It controls how the DCNN responds to certain types
of features. When the filter is big, meaning 7x7 or more, the network can
overlook essential details in the input. When the filter is small, more details are
considered by the network, which is not always desirable either. 3x3 and 5x5
filters are the recommended sizes in any layer; however, 7x7 filters should only
be used in the first layer [16].
Number of convolutional filters: It controls the number of activation maps and
the features to which the network responds. This corresponds to the output
volume described in section 3.1.1.
Weight initialization: It is suggested that the weights be initialized randomly to
ensure symmetry breaking. If all the weights were initialized to zero, the
gradients during backpropagation would be identical, resulting, incorrectly, in the
same parameter updates [57].
Activation function: It controls the firing of the neurons. This is an important
hyper parameter when the architecture is formed by more than one layer. As
mentioned in section 2.3.2, vanishing and exploding gradients are two
consequences of backpropagation that are closely linked to the weight
initialization and the activation function. According to [58], the ReLU and maxout
activation functions are preferred as they mitigate these problems.
4.1.5.1.2 Architecture
As mentioned in section 3.6, different architectures have been proposed by
academics and companies. The experiments in this phase involved training
three models, namely AlexNet, VGG, and GoogLeNet, in order to compare their results.
4.1.6 Evaluation
This phase focuses on analysing and contrasting the results from the previous stage.
Different tools such as figures, tables and plots were used in order to do this and
answer the research questions described previously.
Chapter 5. Implementation
This chapter details the implementation process of both the training and application
modules. Section 5.1 describes the main purpose of the modules used to train the
DCNNs, while section 5.2 provides information about the AFER modules used in the
main system.
5.1 Training modules
The section outlines the main purpose of each module in order to create the DCNN
model. The developed modules in this section were: input module and DCNN module.
5.1.1 Input module
The input module is in charge of extracting features and labels from the dataset, and of
pre-processing the images.
First, the images are converted to grayscale. Then, the face is located in each
image and the values are transformed into numpy arrays. This process is shown in
Figure 5-1.
Figure 5-1.
Figure 5-1: Face detection and numpy array creation.
After that, each column in this array is normalised by removing the mean and dividing
by the standard deviation. Additionally, as it will be described in Chapter 6, some
experiments involved decorrelating and whitening the data by computing the SVD
factorization of the covariance matrix, and dividing the decorrelated data by the
eigenvalues. Finally, this data is sent to the DCNN module where it is divided into train,
validation, and test sets, and used to train the DCNN model.
5.1.2 DCNN module
The DCNN module creates the model using the features and labels created by the
previous module. As part of the creation of the DCNN architecture, the input data is
separated into training, validation and testing sets. The validation set is used to obtain
the best performing model by tuning the hyper parameters. Without this set, the model
risks overfitting, given that the hyper parameters would be tuned specifically for
the test data and would generalise poorly [27]. The test set is used to assess the
performance of the fully-trained DCNN.
This module combines Lasagne and nolearn in order to abstract the creation of the
DCNN and train the network. While Lasagne provides various classes that represent the
layers of a neural network and simplify the use of theano variables, nolearn abstracts
the creation of neural networks by combining different Lasagne layers. So, instead of
creating various Lasagne layer objects, and combining them, the only object that is
created is a nolearn neural network [41, 44].
The result of the DCNN module is a trained model that is stored as a pickle file so that
the AFER modules can use it to predict the facial expression.
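As an illustration of this abstraction, a network declaration in the nolearn/Lasagne style might look as follows; the layer sizes here are illustrative and are not the exact architecture used in this project:

```python
from lasagne import layers
from lasagne.nonlinearities import softmax
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('conv1', layers.Conv2DLayer),
        ('pool1', layers.MaxPool2DLayer),
        ('hidden1', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    input_shape=(None, 1, 48, 48),          # grayscale 48x48 face crops
    conv1_num_filters=32, conv1_filter_size=(3, 3),
    pool1_pool_size=(2, 2),
    hidden1_num_units=128,
    output_num_units=7,                      # seven facial expressions
    output_nonlinearity=softmax,
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=100,
)
# net.fit(X_train, y_train) trains the model, which can then be stored
# with pickle as described above.
```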
5.2 AFER modules
After the DCNN model was created, it was used to identify facial expressions in images.
There are three components of the AFER system: Graphical User Interface module,
input processing module, and DCNN application module.
5.2.1 Graphical user interface module
This module is designed with simplicity in mind so that any user can quickly interact with
the system. As shown in the figure below, the interface is austere. The main window
has an image area, a button to load the data, an output section and a lateral part that
contains seven meters representing all the emotions the system is capable of
recognizing.
Figure 5-2: AFER Graphical User Interface.
When the “Select input Data” button is pressed, the user is prompted with a dialog box
to decide whether the input should come from a file or webcam. Clicking on the
“Images” button displays a file selector that returns the image filename, and adds two
buttons to the interface: “Detect face” and “Show emotion”. Similarly, clicking on the
“Webcam” button creates a webcam object that is returned to the GUI.
In the case of using files, the new buttons control the rest of the process. Pressing
“Detect face” calls the face detector function from the input processing module which
locates the face and outputs the coordinates of the face in order to enclose it in a
rectangle. Then, clicking “Show emotion” calls the DCNN application module.
In the case of utilising the frames from the webcam, the system calls the input
processing module for each frame collected at equally spaced intervals in order to
transform this frame into a numpy array. Then, it continues to the next module.
After clicking “Show emotion” or calling the input processing module, the DCNN
application module is called to identify the facial expression. The output values of this
function are: a vector containing the likelihood of the expression and an index that
indicates the position of the highest ranking expression. After that, the system displays
the result of this classification (i.e. the name and emoji of the emotion) in the lower
section of the GUI. Additionally, the meters and labels are modified according to the
likelihood of each expression. Each path described above is shown in figures 5-3 and 5-
4; and the whole process is summarised in Figure 5-5.
Figure 5-3: Result from input image.

Figure 5-4: Result from input video.

Figure 5-5: AFER process.
5.2.2 Input processing module
This module is necessary to process the input data in the same fashion as the data
used to train the model. Unlike the pre-processing module described before, this
component is part of the GUI module and is only called at specific intervals when the
webcam is used as input, or after pressing “Detect face” when images are used as input.
The module begins by receiving the image or a frame from a webcam. If the input is an
image, the process follows the same steps as described in section 5.1.1; if the
input is obtained from a webcam, it is necessary to convert the frame from BGR
(Blue-Green-Red) to 24-bit RGB (Red-Green-Blue) before proceeding with the same
steps. This conversion reorders the colour channels without modifying the original
values. The OpenCV library works with the BGR colour format because it used to be
popular “among camera manufacturers and software providers” [59, 60], which forced
the developers to adopt it.
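A minimal OpenCV sketch of this step:

```python
import cv2

capture = cv2.VideoCapture(0)                      # default webcam
ret, frame = capture.read()                        # OpenCV delivers frames in BGR order
if ret:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # reorder channels; values unchanged
capture.release()
```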
Once the image is transformed into a numpy array, the same steps described in the
input module are followed. After that, the DCNN application module is automatically
called if the webcam is used or it is called after the user presses “Show emotion” if input
files are used.
5.2.3 DCNN application module
This module receives the numpy array containing the face and identifies the facial
expression. The module returns two values: the index of the most likely expression, and
the likelihood of each expression.
There were two approaches taken to identify the facial expression. The first approach
consisted of using the single most accurate DCNN, while the second consisted of using
an ensemble of the most accurate models and taking the most voted facial expression.
It is important to notice that the accuracy increased as more models were added.
However, the models used in the ensemble were required to differ from each other, by
varying the hyper parameters and the architecture, in order to avoid overfitting.
Chapter 6. Experiments
This chapter details the experiments performed during the training stage to evaluate the
best hyper parameters and DCNN model; and the experiments carried out during the
test stage in order to assess the accuracy of the models.
6.1 Training stage
In order to evaluate the DCNN model and hyper parameters during this phase, the
accuracy of the model was tested against popular benchmark facial expression
datasets, namely the KDEF and JAFFE databases described in section 4.1.1. This
section is divided into two parts. In section 6.1.1, the objective of the experiments was to
test different combinations of hyper parameters. In section 6.1.2, the purpose of the
experiments was to test additional steps to improve the performance of the DCNN.
6.1.1 Hyper parameters
In this section, each set of experiments was performed five times using the same
architecture and, depending on the test, the same dataset. The results were obtained by
averaging the outputs. The experiments are described next.
6.1.1.1 Learning rate
The tests in this part consisted of training a model using a specific Gradient Descent
Optimization (GDO) algorithm while varying the learning rate. These tests were
performed to find the ideal combination of GDO algorithm and learning rate.
The experiments were performed using a model containing two convolutional layers
with 32 and 64 filters, respectively; and two fully-connected layers with 512 neurons
each and dropout with 𝑝 = 0.5 after the first FC layer. The learning rate values were
varied from 0.001 to 0.5 according to the recommended values [16, 49].
The results of these tests are displayed in the following image:
Figure 6-1: Accuracy of applying diverse GDO algorithms after varying the
learning rate.
In Figure 6-1, three things can be noticed: 1) RMSprop and Adam perform better when
the learning rate is very small; 2) SGD, Adagrad, and Adadelta show ordinary results
throughout the experiments; and 3) Nesterov momentum and Momentum display more
stable accuracy for medium learning rates, but they show poor performance when the
learning rates are either very small or very big.
Moreover, these results can be explained by two factors: the first is that a
training set can be biased and contain images that favour specific weight values; the
second is that very small or very big learning rates affect the weight
convergence.
Finally, it can be observed that the learning rate should be selected carefully since it
controls the convergence speed of the model and can also affect the performance of the
GDO algorithm.
6.1.1.1.1 KDEF dataset learning rate
The tests in this subsection were performed to determine whether the same learning
rate applied to the JAFFE dataset was also useful in the KDEF dataset or not. For these
experiments, a network was formed using seven layers: two convolutional layers
followed by max-pooling layers, two fully-connected layers with 128 neurons with
dropout in the first layer, and one fully-connected layer with 7-way softmax in the output
layer. Moreover, the Nesterov momentum GDO algorithm and the ReLU activation
function were used in these networks.
The results of these experiments are shown in the following figure:
Figure 6-2: Validation and test accuracy for different learning rate values.
It can be seen that a learning rate of 0.001 achieves the highest validation accuracy and
the third highest test accuracy. This suggests that the model finds it difficult to generalise
when tested on the JAFFE dataset. However, as in the previous section, Nesterov
momentum with a medium learning rate works better than the same GDO algorithm with
a higher or lower learning rate.
6.1.1.2 Gradient descent optimization algorithm
As mentioned in Chapter 2, GDO algorithms are used to avoid suboptimal points
and find values that achieve high performance. In this section, the highest accuracy
that each GDO algorithm achieved in section 6.1.1.1 is selected and compared.
The results are shown in the following plot:
Figure 6-3: Highest accuracy result of each GDO algorithms.
It can be observed in Figure 6-3 that Momentum, RMSprop, Adam and Nesterov
momentum achieved the highest accuracy. While Adam and RMSprop reached that
performance using a learning rate of 0.001, Nesterov momentum and Momentum
obtained the same performance with a learning rate of 0.015.
6.1.1.3 Size of max-pooling filters
In order to identify the best max-pooling-filter size, the experiments in this section were
performed by varying the size of the filter. They involved three sizes: small (2x2),
medium (4x4), and big (10x10). The outcome of this experiment can be observed in the
following figure:
Figure 6-4: Accuracy after varying the size of the max-pooling filter.
It can be noticed that the best size for the max-pooling filter is 2x2 as it achieved the
highest accuracy. This value is confirmed in [16] where it is noted that "the most
common setting is to use max-pooling with 2x2 receptive fields and with a stride of 2"
given that bigger filters are aggressive which, as it can also be observed with the 10x10
filter (red line), "leads to worse performance."
6.1.1.4 Activation function
The experiments in this section involved applying the activation functions described in
section 2.2.2.1.2. These tests were performed using the same pre-processed dataset,
but some hyper parameters were modified to speed up the process. The modified
values were: the number of iterations which was reduced to 50 epochs; the number of
neurons in the fully-connected layers which was reduced to 128, and the dropout layer
which was removed.
The results of these experiments are displayed in the next figure:
Figure 6-5: Accuracy for different activation functions.
Figure 6-5 shows the behaviour of three activation functions, namely tanh, sigmoid and
ReLU. The ReLU function achieved the highest accuracy among the activation
functions; the sigmoid function failed to converge; and the tanh function displayed an
erratic behaviour. This information is in line with the results in [35] and [58]. Particularly,
in the second reference, it is suggested never to use the sigmoid or tanh function to
train a DCNN; instead, ReLU or maxout should be preferred in order to obtain better
performance.
6.1.1.5 Number of neurons in the FC layer
The experiments performed in this section involved building models with different
number of neurons in the fully-connected layers. The network consisted of two fully-
connected layers with dropout in the first layer and one fully-connected layer with 7-way
softmax in the output layer. Additionally, the activation function used was ReLU, and the
max-pooling-filter size was 2x2.
The results from these tests are presented in the next table:
Figure 6-6: Validation and test accuracy for FC layers with varying number of
neurons.
As Figure 6-6 shows, using more than 128 neurons achieves similar validation
accuracies. The fact that the validation accuracy is greater than the test accuracy could
be explained by the model having overfitted the data, which makes generalisation
harder to obtain.
An important point to observe is that a small number of neurons achieves lower accuracy,
given that the function to model is more complex than what such networks are capable
of modelling.
6.1.1.6 Number of fully-connected layers
This experiment involved adding convolutional neural networks and evaluating their
accuracy. The network was formed by two convolutional layers followed by max-pooling
layers; a varying number of 128-fully-connected layers with dropout regularization
method; and one fully-connected layer with 7-way softmax in the output layer.
The results from these tests are shown in the following table:
Figure 6-7: Validation and test accuracy for different number of FC layers.
As can be noticed in this figure, the accuracy decreases when the network is formed
by 10 layers or more. This is an expected result, as a network becomes more difficult to
train the more layers it has. In order to solve this problem, care should be taken
when selecting other hyper parameters such as the learning rate or the size of the max-
pooling filters [54, 62, 74, 75].
Figure 6-7 shows that five FC layers or fewer are capable of achieving high accuracy.
In particular, three FC layers obtained a good validation accuracy and the highest test
accuracy.
6.1.2 Other alternatives to improve the DCNN
In addition to modifying the hyper parameters listed in the previous section, other
methods were used to improve the performance of the DCNN, namely data pre-
processing, data augmentation and the dropout technique.
6.1.2.1 Data Pre-processing
The first alternative was to pre-process the input data. [57] describes three methods
(detailed in Section 4.1.2.2) that were used to perform this task, namely normalisation,
PCA and whitening.
In this set of experiments, normalisation, PCA, and whitening were used to modify the
input data. The first experiment normalised the data; the second, applied PCA to the
data; the third, whitened the data; and the last applied a combination of all of them.
The results can be observed in the following figure:
Figure 6-8: Accuracy of different pre-processing methods.
Based on these results, it is worth noting that PCA and whitening performed slightly
better than the scenario in which no pre-processing was applied. However, they
performed worse than normalising the input data alone. In [57], it is mentioned that PCA
and whitening are not used with CNNs, and the previous result confirms this assertion.
Additionally, combining the different techniques was expected to increase the performance,
since “scaling the data has a substantial effect on the result obtained” [69]. However, it
did not perform better than normalisation alone. Instead, applying all the
techniques seems to average the accuracies.
6.1.2.2 Data augmentation
This process was applied to the original datasets in order to increase the number of images. The transformations used on the input data were rotation, translation, zooming, flipping, random cropping, gamma correction, and occlusion.
Rotation was applied by turning the image 90 degrees clockwise and counter-clockwise; translation consisted of randomly shifting images between -4 and 4 pixels; zooming was the result of randomly extracting sections of images and enlarging them; flipping was applied both vertically and horizontally; random cropping involved arbitrarily trimming the images; gamma correction was applied by modifying the luminance of the image; and, finally, occlusion was used to obstruct parts of the image. A code sketch of some of these transformations is given below.
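As an illustration, the following sketch shows how some of these transformations can be generated with OpenCV and NumPy (the shift range mirrors the ±4 pixels mentioned above; the crop size is an assumption):

    import cv2
    import numpy as np

    # img is assumed to be a greyscale face image stored as a NumPy array.

    def rotate90(img, clockwise=True):
        return np.rot90(img, k=3 if clockwise else 1)   # np.rot90 is counter-clockwise

    def translate(img, max_shift=4):
        tx, ty = np.random.randint(-max_shift, max_shift + 1, size=2)
        M = np.float32([[1, 0, tx], [0, 1, ty]])        # affine shift matrix
        return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

    def flip(img, horizontal=True):
        return cv2.flip(img, 1 if horizontal else 0)    # 1: horizontal, 0: vertical

    def zoom(img, crop=40):
        # extract a random section and scale it back up to the original size
        h, w = img.shape[:2]
        y, x = np.random.randint(0, h - crop), np.random.randint(0, w - crop)
        return cv2.resize(img[y:y + crop, x:x + crop], (w, h))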
These transformations can be observed in the following figure:
(Panels: Original; Zoom and Translation; Flip and Rotation; 30% Occlusion; Gamma variation; Random crop)
Figure 6-9: Transformations applied to one image.
The transformations were added to the original dataset in order to create six datasets, namely the zoom, gamma variation, flip, medium occlusion, and random crop datasets; and a dataset with all the modifications combined. Given that these transformations were applied randomly to the original dataset in order to prevent overfitting, the size of each dataset is different. For these tests, the number of images in each dataset is described in the following table:
Table 6-1: Number of images in each dataset.

Transformation      Original   Flip   Gamma   Occlusion   Random Crop   Zoom   All
Number of images    213        864    973     639         1059          333    6727
Given that the dataset was increased, more layers were added to the DCNN. The
architecture was formed by nine learned layers: three convolutional layers followed by
max-pooling layers, two fully-connected layers with 512 neurons and dropout, and one
fully-connected layer with 7-way softmax.
The results of these experiments are shown below:
Figure 6-10: Accuracy of using different data augmentation techniques.
In this figure, it can be seen that different data augmentation techniques affect the accuracy of the model in different ways. While some techniques, such as gamma correction or occlusion, improved the accuracy, other techniques, such as flipping or zooming, achieved lower accuracy levels.
On the one hand, the higher accuracy in the case of occlusion and gamma correction might be explained by the fact that the modified images are more similar to the original than those produced by the other transformations. While the other methods change the position or the size of features, which in turn creates a different activation map, occlusion and gamma correction modify the features but preserve their position and size, which creates a similar activation map and reinforces what the neurons in the same positions have learned.
On the other hand, the lower accuracy obtained with the other transformations can be explained by the fact that they decrease overfitting and improve generalisation: creating different versions of the input image forces the model not to favour any particular characteristic of the data.
Moreover, an interesting outcome was observed when all the data augmentation methods were combined: although the accuracy was lower, the convergence rate in these experiments was faster than in the other experiments.
6.1.2.3 Dropout
Dropout is a regularisation technique used to decrease the tendency to overfit. The basic idea of this layer is that it randomly removes individual activations during the training process. The experiments in this stage consisted in comparing the results of models trained using normalisation alone against models trained using normalisation and dropout combined. The dropout layer retained each activation with probability 𝑝 = 0.5, drawn from a Bernoulli distribution.
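As a minimal illustration of the idea (inverted dropout in NumPy, following the description in [57]; this is a sketch, not the lasagne implementation actually used in the experiments):

    import numpy as np

    def dropout(activations, p=0.5, training=True):
        if not training:
            return activations                            # no change at test time
        mask = (np.random.rand(*activations.shape) < p) / p  # keep with probability p
        return activations * mask                         # rescale so expectations match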
The results from these experiments are shown in the next figure:
Figure 6-11: Accuracy of using normalisation and dropout.
Figure 6-11 shows that normalisation alone achieves a higher accuracy than normalisation and dropout combined. This is expected: dropout breaks co-adaptation by ignoring certain neurons at a given time, which means that a different subset of neurons is active in every iteration, breaking the dependency on learning specific features in the presence of other neurons [23, 61]. As a consequence, a dropout network needs more training iterations to reach the same accuracy. This result is in line with the observations made in [61]: "One of the drawbacks of dropout is that it increases training time. A dropout network typically takes 2-3 times longer to train than a standard neural network of the same architecture."
6.2 Test stage
In this section, the objective was to test the trained networks built in the previous part. Section 6.2.1 describes the occlusion and illumination tests. Section 6.2.2 presents the tests performed using other architectures. Finally, section 6.2.3 details additional tests performed in order to improve the accuracy of the models created in section 6.1.1.
6.2.1 Difficult scenarios
The accuracy a model obtains under perfect conditions is not as important as the accuracy it achieves under difficult conditions. This section tests the networks against difficult scenarios in order to study their effect on the accuracy of the models.
6.2.1.1 Occlusion and gamma variation tests
In these experiments, three models were tested against three sets of facial expression images with varying levels of occlusion and illumination. As mentioned before, the datasets did not contain images showing different levels of occlusion and illumination. Therefore, it was necessary to modify the images to artificially include these characteristics, by inserting random black rectangles covering 10, 30 and 50% of the image, and by varying the level of illumination from low to high.
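A sketch of how these two modifications can be generated is shown below (the rectangle placement and the gamma values are illustrative assumptions):

    import numpy as np

    def occlude(img, fraction=0.3):
        # insert a random black rectangle covering roughly `fraction` of the image
        h, w = img.shape[:2]
        rh, rw = int(h * np.sqrt(fraction)), int(w * np.sqrt(fraction))
        y, x = np.random.randint(0, h - rh + 1), np.random.randint(0, w - rw + 1)
        out = img.copy()
        out[y:y + rh, x:x + rw] = 0
        return out

    def gamma_variation(img, gamma=0.5):
        # gamma < 1 brightens the image, gamma > 1 darkens it
        norm = img.astype(np.float32) / 255.0
        return np.uint8(np.clip(norm ** gamma, 0.0, 1.0) * 255)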
The next figure shows the different types of occlusion applied to the dataset.
Figure 6-12: Images with 10% (top), 30% (middle) and 50% (bottom) occlusion.
Similarly, the figure below contains a sample of the gamma variation applied to the
images in the dataset. The first row from top to bottom has the original images; the
second row contains low levels of illumination; the third row corresponds to medium
levels; and the last row is formed by images with high levels of illumination.
Figure 6-13: Original images (first row) and images with low (second row),
medium (third row), and high (fourth row) levels of gamma variation.
The models contained 13 layers, of which five were convolutional layers, each followed by a max-pooling layer; two were fully-connected layers with dropout; and one was a fully-connected layer with a 7-way softmax. They were trained with datasets formed by KDEF images occluded by 10, 30 or 50%; by images with low, medium and high levels of illumination; and by the original image dataset. Then, each model was tested against datasets formed by JAFFE images occluded by 10, 30 or 50%, and one set without any modifications, which was used as a baseline.
The results of the occlusion tests are shown in the following table:
Table 6-2: Occlusion test results.

Model \ Dataset   No Occlusion   Occlusion 10%   Occlusion 30%   Occlusion 50%
Occlusion 10%     34.74          33.57           28.64           24.18
Occlusion 30%     39.44          38.26           37.32           24.65
Occlusion 50%     38.97          39.44           39.91           29.58
This table shows that, in most cases, the models performed better when the dataset did not contain occlusion and that, as would be expected, the accuracy decreased as the level of occlusion increased. This decrease was especially severe in the case of images occluded by 50%, where the accuracy was reduced by an average of 9% with respect to the accuracy on images occluded by 30%.
Additionally, it can be noticed that the third model, the one trained with images containing 50% occlusion, obtained the best accuracy of all the models on all the datasets containing occlusion, particularly the dataset with 30% occlusion. This can be explained by the fact that an occlusion of 50% is a drastic reduction of information that forces the DCNN to learn more robust features and reduces overfitting. However, the disadvantage of images occluded by 50% is that recognising facial expressions becomes more difficult. For instance, discerning between anger and disgust, or between surprise and fear, is harder when the mouth is covered.
These results can also be visualised in the figure below:
Figure 6-14: Accuracy for each DCNN model.
Similarly, the results of the illumination tests are displayed in the next table:
Table 6-3: Illumination test results.

Dataset   Original   Low illumination   Medium illumination   High illumination
Model     38.97      35.65              38.41                 40.69
From the previous table, it can be noticed that varying the level of illumination does not have a strong impact on the accuracy of the model. This can be explained by the fact that the modified datasets contain a similar amount of information to the original dataset. When the illumination is low, some features are hidden and the accuracy decreases, since this is similar to occluding the image. However, when the illumination is high, most features are preserved and the accuracy increases, in a way similar to augmenting the data.
6.2.2 Innovative architectures
This section tests the architectures described in section 3.6, namely AlexNet, VGG, and GoogLeNet.
The architectures used in each experiment are shown in the following figure:
Figure 6-15: Architectures: AlexNet (Left), VGG (Centre), and GoogLeNet (Right).
These models were configured using the values described in their respective papers, with the exception of the output layer, which was modified to match the seven facial expressions. The architectures were trained using the augmented KDEF dataset for 40 iterations and tested using the JAFFE dataset. In the case of AlexNet, three models were trained and tested, both separately and collectively (as an ensemble); in the case of GoogLeNet and VGG, only one model of each was created.
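The exact combination rule of the ensemble is not detailed here; a common approach, shown as an assumption in the sketch below, is to average the per-class softmax probabilities of the individual models and pick the most likely expression:

    import numpy as np

    def ensemble_predict(models, X):
        # models: trained classifiers exposing predict_proba (as nolearn's NeuralNet does)
        probs = np.mean([m.predict_proba(X) for m in models], axis=0)
        return np.argmax(probs, axis=1)   # most likely of the seven expressions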
The results from these experiments are shown in the following table:
Table 6-4: AlexNet, GoogLeNet, and VGG accuracy tests.

Model              Accuracy
AlexNet 1          14.91
AlexNet 2          14.05
AlexNet 3          13.63
AlexNet Ensemble   14.26
VGG                32.96
GoogLeNet          29.75
As this table shows, after 40 iterations the accuracies of the AlexNet models and of the ensemble are lower than those of VGG and GoogLeNet. Moreover, the ensemble obtained a lower accuracy than AlexNet 1 because AlexNet 3 had a lower accuracy and, once the models were combined, the errors of the third model affected the overall performance. It is expected that the more models with different hyperparameters are added to the ensemble, the better the accuracy.
6.2.3 Additional tests
This section was added in an attempt to improve accuracy, given that in the previous sections the validation accuracy was high but, once the models were assessed using the test datasets, the accuracy diminished notably.
The first experiment was to test the models against different datasets, using both versions (original and augmented), in order to determine whether the models were overfitted or underfitted.
The following figure shows the results of these experiments:
Figure 6-16: Test accuracy for different models using different datasets.
It can be noticed that models 1 and 2 have high accuracy when tested with the augmented KDEF dataset, as this was the dataset used to train them. However, the other values show low accuracy, which might mean that the models overfitted the data. This seems to be the case: the accuracy on the original KDEF dataset is lower than on the augmented KDEF dataset, which means that the models did not generalise correctly to images outside of the training data. They learned the modified dataset better than the original dataset because there were numerous modified images for each original image. Therefore, in order to improve this result, it was necessary to use techniques to reduce overfitting.
One alternative to reduce overfitting is to enlarge the training set. In order to do this, it is possible to combine the JAFFE and KDEF datasets, while leaving some images out of the training set so that the accuracy of the model can still be tested on them.
The new model was formed by 14 layers, of which five were convolutional layers, each followed by a max-pooling layer; three were fully-connected layers; and one was a fully-connected layer with a 7-way softmax. Each layer used the ReLU activation function and the dropout regularisation technique. Additionally, a learning rate of 0.008 and Nesterov momentum were used.
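A sketch of how such a network can be declared with the lasagne and nolearn libraries used in this project is given below; the input size, filter counts, neuron counts and momentum value are assumptions, and not all of the convolutional and fully-connected blocks are written out:

    from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                                DenseLayer, DropoutLayer)
    from lasagne.nonlinearities import rectify, softmax
    from lasagne.updates import nesterov_momentum
    from nolearn.lasagne import NeuralNet

    net = NeuralNet(
        layers=[
            (InputLayer,     {'shape': (None, 1, 96, 96)}),   # assumed input size
            (Conv2DLayer,    {'num_filters': 32, 'filter_size': (3, 3),
                              'nonlinearity': rectify}),
            (MaxPool2DLayer, {'pool_size': (2, 2)}),
            (Conv2DLayer,    {'num_filters': 64, 'filter_size': (3, 3),
                              'nonlinearity': rectify}),
            (MaxPool2DLayer, {'pool_size': (2, 2)}),
            # ... three further convolution/max-pooling blocks ...
            (DenseLayer,     {'num_units': 512, 'nonlinearity': rectify}),
            (DropoutLayer,   {'p': 0.5}),
            (DenseLayer,     {'num_units': 512, 'nonlinearity': rectify}),
            (DropoutLayer,   {'p': 0.5}),
            (DenseLayer,     {'num_units': 7, 'nonlinearity': softmax}),  # 7-way softmax
        ],
        update=nesterov_momentum,      # Nesterov momentum, as in the experiments
        update_learning_rate=0.008,    # learning rate used in this experiment
        update_momentum=0.9,           # assumed momentum value
        max_epochs=40,
    )
    # net.fit(X_train, y_train) trains the model; net.predict(X_test) classifies new faces.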
The accuracies for this experiment are shown in the following table:
Table 6-5: Validation and test accuracy for improved model and ensemble.

Models                   Validation (Combined)   Test (JAFFE)   Test (KDEF)
Combined dataset model   84.33                   75.07          64.89
All models combined      n/a                     44.74          78.95
In Table 6-5, it can be noticed that the test accuracy improved substantially. One reason for this is that care was taken when selecting the hyperparameters, particularly the learning rate, the number of convolutional and fully-connected layers, and the value of the dropout probability after each layer. As a result, a deeper network was created.
Another reason is that the datasets used to train the model share similar characteristics with the images utilised to test the network, such as the quality of the pictures or the race of the people in the dataset. This caused the DCNN to learn these features as part of the model, instead of learning more relevant facial expression features.
In addition, the accuracy on the KDEF test set is higher for the combined ensemble of all models than for the new model, given that all of those models were trained using this dataset.
Another way to analyse this is to calculate performance metrics for both models. Two of these metrics are precision and recall. Precision is the proportion of predictions of a facial expression that are correct, i.e. TP / (TP + FP); recall is the proportion of actual instances of a facial expression that are correctly identified, i.e. TP / (TP + FN), where TP, FP and FN denote true positives, false positives and false negatives.
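These per-emotion metrics can be computed with scikit-learn; a minimal sketch, assuming y_true and y_pred are hypothetical arrays holding the true and predicted labels (0: Fear, ..., 6: Surprise):

    import numpy as np
    from sklearn.metrics import classification_report, confusion_matrix

    emotions = ['Fear', 'Anger', 'Disgust', 'Happiness',
                'Neutral', 'Sadness', 'Surprise']

    # hypothetical labels; in practice y_pred would come from net.predict(X_test)
    y_true = np.array([0, 1, 2, 3, 4, 5, 6, 3, 4, 0])
    y_pred = np.array([0, 1, 2, 3, 4, 5, 0, 3, 4, 6])

    print(classification_report(y_true, y_pred, target_names=emotions))
    print(confusion_matrix(y_true, y_pred))   # rows: true labels, columns: predictions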
The following tables show the precision and recall per emotion obtained by each model when tested against both datasets:
Table 6-6: Precision and recall for both models tested against the KDEF dataset.

              Combined dataset model     All models combined
Emotion       precision    recall        precision    recall
Fear          0.71         0.54          0.79         0.81
Anger         0.52         0.55          0.72         0.81
Disgust       0.68         0.58          0.92         0.78
Happiness     0.81         0.83          0.87         0.89
Neutral       0.63         0.71          0.96         0.77
Sadness       0.48         0.66          0.54         0.87
Surprise      0.83         0.69          0.98         0.61
Avg/total     0.67         0.65          0.83         0.79

Table 6-7: Precision and recall for both models tested against the JAFFE dataset.

              Combined dataset model     All models combined
Emotion       precision    recall        precision    recall
Fear          0.95         0.66          0.54         0.59
Anger         0.68         0.81          0.40         0.29
Disgust       0.91         0.68          0.81         0.19
Happiness     0.79         0.79          0.66         0.49
Neutral       0.71         0.83          0.50         0.61
Sadness       0.53         0.83          0.23         0.67
Surprise      0.86         0.70          0.81         0.33
Avg/total     0.79         0.75          0.57         0.45

It can be noticed that the new model achieved higher results than all the models combined on the JAFFE dataset. Conversely, the new model obtained lower values
when the KDEF dataset was tested. As mentioned above, both results are explained by the datasets used to train these models.
For the new model, the confusion matrix is shown in the following figure:
Figure 6-17: Confusion matrix for the new model trained with both KDEF and
JAFFE datasets (0: Fear, 1: Anger, 2: Disgust, 3: Happiness, 4: Neutral, 5:
Sadness, and 6: Surprise).
Figure 6-17 shows that the new model is capable of predicting happiness, sadness, surprise and the neutral state with high accuracy; however, the network has some problems when it tries to classify anger, disgust and fear. The first two, anger and disgust, are sometimes predicted in place of each other, while fear is confused mostly with surprise. Overall, the new model obtains reasonable test accuracy on both datasets.
Chapter 7. Discussion
This chapter describes the strengths and weaknesses of the project and the knowledge acquired through its development. Additionally, it discusses directions for future work.
7.1 Creating a Deep Convolutional Neural Network
Building a machine learning model based on Deep Convolutional Neural Networks was a challenging but interesting journey. Throughout this project, the ideas developed by academics such as Geoffrey Hinton, Alex Krizhevsky, Andrew Ng, Yoshua Bengio and Yann LeCun were instrumental not only in improving the system, but also in making complex concepts approachable.
During the development phase, designing, training and testing the DCNN proved to be hard: the literature was abundant, and building a good model was not as simple as replacing the input data with the new datasets used in this project. In fact, applying the same settings to two different datasets will not achieve the same performance.
In the design phase, the literature offered a copious set of alternatives, including the work of academics such as Alex Krizhevsky, Yann LeCun and Karen Simonyan. The main limitation to implementing the different types of architectures was the training time, which increased with factors such as the number of images or the number of convolutional filters. In some cases, these architectures required days of training in order to achieve good performance.
In fact, there is a trade-off between the complexity of the network and the processing
time. The greater the number of layers, number of neurons, or number of filters the
network has, the higher the processing time. Therefore, it is important to evaluate
whether increasing the number of elements is worth the processing time, but it is also
important to consider that adding more elements does not guarantee that the
performance of the network will be better.
During the training stage, the selection of the hyperparameters was based on trial and error within a range of recommended values and methods [16, 49, 57]. However, these recommendations were derived from different types of datasets than the ones used in this project, which meant that, in some cases, the results obtained from the experiments diverged from what the literature reported.
As far as possible, all the experiments were performed by varying only one element at a time, namely the hyperparameters, the input data, or the architecture; however, in some tests one of these elements had to be modified from one experiment to the next. For instance, during the data augmentation and occlusion tests, the input data was re-created every time a new transformation was applied, which added some uncertainty to the process since, as mentioned in chapter 6, the input data had to be transformed randomly.
The next stage, testing, was performed using a separate dataset created from images that were excluded from the original dataset. A model was then trained using the best hyperparameters discovered during the experiments, and the test dataset was used to evaluate its performance. This whole process proved to be time-consuming, given that a mistake at any point could affect other experiments. For example, if the experiment to select the gradient descent optimisation algorithm had been performed incorrectly, the experiment to select the learning rate of the model would have had to be repeated.
Moreover, based on the results, it seems that some models overfitted, given that the validation accuracy was much higher than the test accuracy, which was below 50%. The models failed to generalise because they learned the noise in the training data as patterns, and when new data was processed this negatively impacted their ability to recognise facial expressions.
In an attempt to fix the discrepancy between these values, a new model was trained
combining KDEF and JAFFE datasets. This created a DCNN that generalised better
than previous models and obtained reasonable accuracy when tested against unseen
images from these datasets.
7.2 Limitations
In both the training and test stages, one important drawback that affected the accuracy was the pre-processing stage, particularly face detection. As can be noticed in the figure below, the face detection algorithm did not always perform correctly, failing to identify the face and instead confusing it with other parts of the body, such as ears or skin with moles. This error was most noticeable when using the KDEF dataset, since it contained facial images from different angles, and it increased the number of junk images that provided no information and affected the convergence point.
Figure 7-1: False positive face detection.
The problem was solved during training and testing by manually removing the problematic images, but this alternative was not possible during the evaluation of the system, as can be seen in Figure 7-2, where, due to the noisy background, a false positive was found in the column:
Figure 7-2: Incorrect face detection in noisy background.
Moreover, the algorithm was incapable of recognising the face in profile pictures, which
decreased the number of images in the dataset. However, this problem was mitigated
by performing data augmentation.
7.3 Building a Graphical User Interface
The creation of the GUI was divided into two phases: design and implementation. As mentioned, during the design phase the idea behind the GUI was simplicity. The interface was designed so that the user could quickly test the application using the webcam or images from a file, without the need for complex menus. This was achieved by putting oneself in the place of the user and thinking about how to create an intuitive interface.
Designing the GUI was not as complicated as implementing it, given that thinking of a new design is faster than actually transforming that idea into a real application. The first problem found was that Python offered a limited number of alternatives for creating an appealing interface. Even after finding a reasonable alternative, PyQt, this Python binding did not implement all the capabilities of the Qt library, which meant that the appearance of the GUI had to be adapted to these limitations.
7.4 Future Work
Deep Convolutional Neural Networks are powerful models once their hyperparameters have been tuned correctly. Therefore, one approach to further enhance the results of this project is to improve the tuning process. As suggested in [74], an alternative for hyperparameter selection is to wrap a "pure" learning algorithm around the DCNN model to be trained. This would require either modifying the lasagne and nolearn implementation or building the implementation from the ground up.
Additionally, modifying the facial detection process could improve the accuracy of the
system given that, during the training stage, junk images would be reduced, and during
the AFER system test stage, faces in images containing noisy backgrounds would be
recognised.
The application needs to run on a computer with OpenCV, CUDA, lasagne, nolearn, and Python installed; its portability is therefore limited. One option to avoid installing all these components is to create a website that replicates the functionality of the application.
Finally, another direction in which to improve this project is to experiment with different architectures. In this dissertation, some of them were trained and tested; however, training these complex architectures was time-consuming and, due to this constraint, it was only possible to test a limited number of them.
Chapter 8. Conclusions
The purpose of this dissertation was to automate the facial expression recognition process. This was achieved by extensively training and testing several Deep Convolutional Neural Networks and finding the most suitable hyperparameters.
The dissertation focused on seven stages required to build the Automatic Facial
Expression Recognition (AFER) system, namely data collection, pre-processing, system
design, development, training, testing, and evaluation.
One of the most crucial parts of this dissertation was the background, which defined and delimited the project. Moreover, it provided the ideas from experts in the field of machine learning, and in particular neural networks, that were implemented during the training and testing stages.
The experimental section was performed using two benchmark datasets: the JAFFE and KDEF datasets. The results obtained from these tests demonstrated that it is critical to apply different techniques in order to avoid overfitting and increase accuracy. Moreover, it was shown that the greater the number of layers a DCNN has, the more care has to be taken when selecting the hyperparameters to prevent a decrease in accuracy. This holds particularly true when selecting the learning rate and the gradient descent optimisation algorithm, which, as was shown, are closely interdependent. Additionally, the experiments indicated that the ReLU activation function should be the preferred option over sigmoid or tanh. All these experiments allowed for a deeper understanding of the advantages and limitations of the models.
The deliverables of this project were described and documented in this dissertation,
including the research methodology, system implementation, experimental results, and
analysis of these results in order to grasp a deeper understanding of DCNN.
Finally, the document presented some alternatives to expand this project based on the
experimental results and the experience acquired during the development of this
system.
References
[1] Ekman, P. and Rosenberg, E. (1997). What the face reveals. New York: Oxford
University Press.
[2] Jabr, Ferris. (2010) "The Evolution Of Emotion: Charles Darwin's Little-Known
Psychology Experiment". Scientific American Blog Network, 2 Mar. 2016.
[3] Russell, James A, and Jose Miguel Fernandez Dols. (1997) The Psychology Of
Facial Expression. Cambridge: Cambridge University Press.
[4] Bettadapura, Vinay. (n.d.) "Face Expression Recognition And Analysis: The State Of
The Art". arXiv:1203.6722 [cs.CV]
[5] Siegman, Aron Wolfe, and Stanley Feldstein. Nonverbal Behavior And
Communication. Chapter 4. Facial Expression. Hillsdale, N.J.: L. Erlbaum Associates,
1978.
[6] Liu, Ping et al. (n.d.) "Facial Expression Recognition Via A Boosted Deep Belief
Network".
[7] Zhao, Xiaoming, Xugan Shi, and Shiqing Zhang. (2015). "Facial Expression
Recognition Via Deep Learning". IETE Technical Review 32.5: 347-355.
[8] Fasel, B. and Luettin, J. (2003). Automatic facial expression analysis: a survey.
Pattern Recognition, 36(1), pp.259-275.
[9] Y. Lv, Z. Feng and C. Xu, "Facial expression recognition via deep learning," Smart
Computing (SMARTCOMP), 2014 International Conference on, Hong Kong, 2014, pp.
303-308. doi: 10.1109/SMARTCOMP.2014.7043872
[10] Chibelushi, C. and Bourel, F. (2016). Facial Expression Recognition: A Brief
Tutorial Overview.
[11] P. K. Manglik, U. Misra, Prashant and H. B. Maringanti, "Facial expression
recognition," Systems, Man and Cybernetics, 2004 IEEE International Conference on,
2004, pp. 2220-2224 vol.3. doi: 10.1109/ICSMC.2004.1400658
[12] LeCun, Y. and Huang, F. (n.d.). Large-scale Learning with SVM and Convolutional
Nets for Generic Object Categorization.
[13] Hinton, G. (2012). Neural Networks for Machine Learning. [online] Coursera.
Available at: https://www.coursera.org/course/neuralnets [Accessed 8 Mar. 2016].
[14] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap5.html [Accessed
24 Apr. 2016].
[15] Ufldl.stanford.edu. (2016). Unsupervised Feature Learning and Deep Learning
Tutorial. [online] Available at: http://ufldl.stanford.edu/tutorial/supervised/Convolutional
NeuralNetwork/ [Accessed 15 Apr. 2016].
[16] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/convolutional-networks/
[Accessed 15 Apr. 2016].
[17] Poczos, B. and Singh, A. (2016). Introduction to Machine Learning. Deep Learning.
CMU.
[18] Six Novel Machine Learning Applications (2016). Forbes. [online] Available at:
http://www.forbes.com/sites/85broads/2014/01/06/six-novel-machine-learning-applicati
ons/#1d3c954367bf [Accessed 17 Apr. 2016].
[19] Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer,
pp.227 - 249 and 256 - 272.
[20] MacKay, D. (2003). Information theory, inference, and learning algorithms.
Cambridge, UK: Cambridge University Press, pp. 467 - 492.
[21] Murphy, K. (2012). Machine learning. Cambridge, Mass.: MIT Press, pp. 563 - 579.
[22] LeCun, Y. et al. (1998). Object recognition with gradient-based learning. doi:
10.1007/3-540-46805-6_19
[23] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, pp.1929-1958.
[24] YouTube. (2016). CS231n Winter 2016: Lecture 7: Convolutional Neural Networks.
[online] Available at: https://www.youtube.com/watch?v=LxfUGhug-iQ&feature=youtu.be
[Accessed 18 Apr. 2016].
[25] YouTube. (2016). Convolutional Neural networks. [online] Available at:
https://www.youtube.com/watch?v=rxKrCa4bg1I [Accessed 15 Apr. 2016].
[26] Rodrigues, C. (2015). Exploring the transfer learning aspect of deep neural
networks in facial information processing. University of Manchester., p.23.
[27] Win-Vector Blog. (2014). Estimating Generalization Error with the PRESS statistic. [online] Available at: http://www.win-vector.com/blog/2014/09/estimating-generalization-error-with-the-press-statistic/ [Accessed 30 Apr. 2016].
[28] Amazon Web Services. (2016) Amazon Web Services. [online] Available at:
https://aws.amazon.com [Accessed 25 Apr. 2016].
[29] Docs.aws.amazon.com. (2016). Linux GPU Instances - Amazon Elastic Compute
Cloud. [online] Available at: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
using_cluster_computing.html [Accessed 25 Apr. 2016].
[30] Nvidia.com. (2016). NVIDIA KEPLER GK110 Next-Generation CUDA Compute
Architecture. [online] Available at: http://www.nvidia.com/content/PDF/kepler
/NV_DS_Tesla_KCompute_Arch_May_2012_LR.pdf [Accessed 25 Apr. 2016].
[31] Nvidia.com. (2016). Specs & Features of GRID Cloud Gaming GPUs | NVIDIA.
[online] Available at: http://www.nvidia.com/object/cloud-gaming-gpu-boards.html
[Accessed 25 Apr. 2016].
[32] Nvidia.com. (2016). NVIDIA Kepler Compute Architecture | High Performance
Computing | NVIDIA. [online] Available at: http://www.nvidia.com/object/nvidia-
kepler.html [Accessed 25 Apr. 2016].
[33] School of Computer Science (2016). Computer Languages. [online] Available at:
http://www.cs.man.ac.uk/~pjj/cs1001/software/node3.html [Accessed 27 Apr. 2016].
[34] Brownlee, J. (2014). Best Programming Language for Machine Learning - Machine
Learning Mastery. [online] Machine Learning Mastery. Available at: http://
machinelearningmastery.com/best-programming-language-for-machine-learning/
[Accessed 25 Apr. 2016].
[35] Krizhevsky, A., Sutskever, I. and Hinton, G. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
[36] Fernández Villaverde, J. and Aruoba, B. (2014). An empirical comparison of seven
programming languages. P.15.
[37] Docs.continuum.io. (2016). Anaconda | Continuum Analytics: Documentation.
[online] Available at: https://docs.continuum.io/anaconda/index [Accessed 26 Apr.
2016].
[38] Nvidia.com. (2016). Parallel Programming and Computing Platform | CUDA |
NVIDIA | NVIDIA. [online] Available at: http://www.nvidia.com/object/cuda_home
_new.html [Accessed 26 Apr. 2016].
[39] NVIDIA Developer. (2014). NVIDIA cuDNN. [online] Available at: https://developer
.nvidia.com/cudnn [Accessed 26 Apr. 2016].
[40] Deeplearning.net. (2016). Welcome — Theano 0.8.0 documentation. [online]
Available at: http://deeplearning.net/software/theano/ [Accessed 26 Apr. 2016].
[41] Lasagne.readthedocs.io. (2016). Welcome to Lasagne — Lasagne 0.2.dev1
documentation. [online] Available at: http://lasagne.readthedocs.io/en/latest/ [Accessed
26 Apr. 2016].
[42] Docs.opencv.org. (2016). OpenCV: Face Detection using Haar Cascades. [online]
Available at: http://docs.opencv.org/master/d7/d8b/tutorial_py_face_detection.html#
gsc.tab=0 [Accessed 18 Jul. 2016].
[43] Opencv.org. (2016). OpenCV. [online] Available at: http://opencv.org/ [Accessed 24
Apr. 2016].
[44] Pythonhosted.org. (2016). nolearn.lasagne — nolearn 0.6 documentation. [online]
Available at: https://pythonhosted.org/nolearn/lasagne.html [Accessed 26 Apr. 2016].
[45] Chris Nicholson, A. (2016). Deep Learning Comp Sheet: Deeplearning4j vs. Torch
vs. Theano vs. Caffe vs. TensorFlow - Deeplearning4j: Open-source, distributed deep
learning for the JVM. [online] Deeplearning4j.org. Available at:
http://deeplearning4j.org/compare-dl4j-torch7-pylearn.html [Accessed 27 Apr. 2016].
[46] Torch.ch. (2016). Torch | Scientific computing for LuaJIT. [online] Available at:
http://torch.ch/ [Accessed 27 Apr. 2016].
[47] The Karolinska Directed Emotional Faces. Lundqvist, D., Flykt, A., & Öhman, A.
(1998). The Karolinska Directed Emotional Faces - KDEF, CD ROM from Department of
Clinical Neuroscience, Psychology section, Karolinska Institutet, ISBN 91-630-7164-9.
[48] Michael J. Lyons, Shigeru Akamatsu, Miyuki Kamachi, Jiro Gyoba. Coding Facial
Expressions with Gabor Wavelets, 3rd IEEE International Conference on Automatic
Face and Gesture Recognition, pp. 200-205 (1998). doi: 10.1109/AFGR.1998.670949
[49] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap3.html [Accessed
30 Apr. 2016].
[50] Deeplearning.net. (2016). Convolutional Neural Networks (LeNet) — DeepLearning
0.1 documentation. [online] Available at: http://deeplearning.net/tutorial/lenet.html
[Accessed 31 Apr. 2016].
[51] Simard, P., Steinkraus, D. and Platt, J. (2003). Best Practices for Convolutional
Neural Networks Applied to Visual Document Analysis. Proceedings of the Seventh
International Conference on Document Analysis and Recognition. IEEE Computer
Society. ISBN: 0-7695-1960-1
[52] Viola, P. and Jones, M. (2001). Rapid Object Detection using a Boosted Cascade
of Simple Features. doi: 10.1109/CVPR.2001.990517
[53] Deeplearning.net. (2016). Getting Started — DeepLearning 0.1 documentation.
[online] Available at: http://deeplearning.net/tutorial/gettingstarted.html [Accessed 9 May
2016].
[54] Ufldl.stanford.edu. (2016). Data Preprocessing - Ufldl. [online] Available at:
http://ufldl.stanford.edu/wiki/index.php/Data_Preprocessing [Accessed 18 Jul. 2016].
[55] Baker, S. (2014). Obamacare Website Has Cost $840 Million. [online] The Atlantic.
Available at: http://www.theatlantic.com/politics/archive/2014/07/obamacare-website-
has-cost-840-million/440478/ [Accessed 24 Jul. 2016].
[56] Bloomberg.com. (2016). Obamacare Website Costs Exceed $2 Billion, Study Finds.
[online] Available at: http://www.bloomberg.com/news/articles/2014-09-24/obamacare-
website-costs-exceed-2-billion-study-finds [Accessed 24 Jul. 2016].
[57] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-2/ [Accessed
26 Jul. 2016].
[58] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-1/ [Accessed
26 Jul. 2016].
[59] kaurdavinder, V. (2015). Difference between RGB and BGR. [online] Davinder
Kaur. Available at: https://lifearoundkaur.wordpress.com/2015/08/04/difference-
between-rgb-and-bgr/ [Accessed 29 Jul. 2016].
[60] RGB, W. (2016). Why OpenCV Using BGR Colour Space Instead of RGB. [online]
Stackoverflow.com. Available at: http://stackoverflow.com/questions/14556545/why-
opencv-using-bgr-colour-space-instead-of-rgb [Accessed 30 Jul. 2016].
[61] Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press.
[online] Available at: http://neuralnetworksanddeeplearning.com/chap6.html [Accessed
31 Jul. 2016].
[62] Cs231n.github.io. (2016). CS231n Convolutional Neural Networks for Visual
Recognition. [online] Available at: http://cs231n.github.io/neural-networks-3/ [Accessed
31 Jul. 2016].
[63] Wolframalpha.com. (2016). Wolfram|Alpha: Computational Knowledge Engine.
[online] Available at: http://www.wolframalpha.com/input/?i=law+of+large+numbers
[Accessed 31 Jul. 2016].
[64] YouTube. (2016). TTIC Distinguished Lecture Series - Geoffrey Hinton. [online]
Available at: https://www.youtube.com/watch?v=EK61htlw8hY [Accessed 31 Jul. 2016].
[65] Sebastian Ruder. (2016). An overview of gradient descent optimization algorithms.
[online] Available at: http://sebastianruder.com/optimizing-gradient-descent/index.html
[Accessed 31 Jul. 2016].
[66] Zeiler, M. (2012). Adadelta: An adaptive learning rate method. Google Inc., USA -
New York University, USA. arXiv:1212.5701 [cs.LG]
[67] Hinton, G. (2016). Lecture 6e. rmsprop: Divide the gradient by a running average of
its recent magnitude. [online] Available at: http://www.cs.toronto.edu/~tijmen
/csc321/slides/lecture_slides_lec6.pdf [Accessed 02 Aug. 2016].
[68] Kingma, D. and Ba, J. (2015). Adam: A method for Stochastic Optimization.
arXiv:1412.6980 [cs.LG]
[69] James, G. (2014). An introduction to statistical learning. New York, NY: Springer,
pp.374-385.
[70] Alpaydin, E. (2010). Introduction to machine learning. Cambridge, Mass.: MIT
Press, pp. 113-120.
[71] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A. and Bengio, Y. (2013).
Maxout Networks. JMLR WCP 28, pp.1319-1327. arXiv:1302.4389 [stat.ML]
[72] Shlens, J. (2003). A tutorial on Principal Component Analysis: Derivation,
discussion and Singular Value Decomposition. [online] princeton.edu. Available at:
https://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf [Accessed 8
Aug. 2016].
[73] Chen, K. (2015). Principal Component Analysis (PCA).
[74] Yoshua Bengio. (2012) Early Stopping – But When? In Neural Networks: Tricks of
the Trade. Springer, pp. 53 – 69.
[75] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The elements of statistical
learning. New York: Springer, pp.219-228.
[76] Duncan, B. Bias and variance. [online] Fliptop blog. Available at:
http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-
overview/ [Accessed 11 Aug. 2016].
[77] Le Cun, Y., Denker, J. and Solla, S. (1990). Optimal Brain Damage. AT&T Bell
Laboratories. ISBN:1-55860-100-7
[78] Riverbankcomputing.com. (2016). Riverbank | Software | PyQt | What is PyQt?.
[online] Available at: https://riverbankcomputing.com/software/pyqt/intro [Accessed 12
Jun. 2016].
[79] Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for
large-scale image recognition. arXiv:1409.1556 [cs.CV].
[80] Szegedy, C. et al. (2014). Going Deeper with Convolutions. arXiv:1409.4842
[cs.CV].
[81] Alizadeh, S. and Fazel, A. (2015). Convolutional Neural Networks for Facial
Expression Recognition.
[82] Whitehead, N. and Fit-Florea, A. (n.d.). Precision & performance: Floating point and
IEEE 754 Compliance for NVIDIA GPUs.
Appendix
Setting the environment on Ubuntu 14.04.3 LTS with GPU
support
The following steps will install theano, lasagne, and nolearn with GPU support:
1. Update the default packages: sudo apt-get update
2. Install Ubuntu updates: sudo apt-get -y dist-upgrade
3. Create a Download folder: mkdir Downloads
4. Change directory: cd Downloads
5. Download Anaconda:
o wget https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
6. Change file permissions: chmod 770 Anaconda2-4.1.1-Linux-x86_64.sh
7. Install Anaconda: ./Anaconda2-4.1.1-Linux-x86_64.sh
8. Reload the profile to update the PATH variable changed by the Anaconda
installation process: source ~/.bashrc
9. Install dependencies:
o sudo apt-get install -y gcc g++ gfortran build-essential git wget linux-image-generic libopenblas-dev python-dev python-nose
10. Install LAPACK: sudo apt-get install -y liblapack-dev
11. Download CUDA Toolkit:
o wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
12. Disable nouveau drivers:
o Create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following content:
blacklist nouveau
options nouveau modeset=0
o Regenerate the kernel initramfs: sudo update-initramfs -u
13. Reboot the system: sudo reboot
14. Install repository meta-data: sudo dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
15. Update the Apt repository cache: sudo apt-get update
16. Install CUDA: sudo apt-get install cuda
17. Add the environment variables to the .bashrc file:
o export PATH="/usr/local/cuda-7.5/bin:$PATH"
o export LD_LIBRARY_PATH="/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH"
18. Reload the profile: source ~/.bashrc
19. Install theano:
o conda install theano
o pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git
20. Install nolearn: pip install nolearn
21. Install lasagne: pip install https://github.com/Lasagne/Lasagne/archive/master.zip
To test if the installation was successful:
22. Type: python
23. Import theano: import theano
o The next message should be displayed:
“Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN not available)”
Additionally, in order to install cuDNN libraries, it is necessary to:
24. Create an NVIDIA developer account.
25. Download the cuDNN tar file.
26. Decompress the file: tar -xvf cudnn-7.5-linux-x64-v5.1.tgz
27. Change directory to the extracted directory: cd cuda
28. Copy the files to the corresponding directories:
o sudo cp lib64/* /usr/local/cuda-7.5/lib64/
o sudo cp include/* /usr/local/cuda-7.5/include/
To test if the installation was successful:
29. Type: python
30. Import theano: import theano
o The next message should be displayed:
“Using gpu device 0: GRID K520 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5103)”
Installing OpenCV on Ubuntu 14.04.3 LTS
The following steps will install OpenCV on Ubuntu:
1. Install cmake and git: sudo apt-get install cmake git
2. Install the image IO package: sudo apt-get install libjpeg8-dev
libtiff4-dev libjasper-dev libpng12-dev
3. Install the library to display the image on the screen: sudo apt-get install
libgtk2.0-dev
4. Install libraries to handle video stream: sudo apt-get install libavcodec-
dev libavformat-dev libswscale-dev libv4l-dev
5. Install libraries to optimise some openCV routines: sudo apt-get install
libatlas-base-dev gfortran
6. Pull down openCV from GitHub: git clone https://github.com/Itseez/opencv.git
7. Checkout the 3.0.0 version: git checkout 3.0.0
8. Pull down openCV contrib from GitHub: git clone
https://github.com/Itseez/opencv_contrib.git
9. Checkout the 3.0.0 version: git checkout 3.0.0
10. Create a build directory inside the opencv directory and change to it:
o cd ~/opencv
o mkdir build
o cd build
11. Setup the build with the following command:
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/home/<user>/opencv \
      -D PYTHON_INCLUDE_DIR=/home/<user>/anaconda2/include/python2.7/ \
      -D PYTHON_INCLUDE_DIR2=/home/<user>/anaconda2/include/python2.7 \
      -D PYTHON_LIBRARY=/home/<user>/anaconda2/lib/libpython2.7.so \
      -D PYTHON_PACKAGES_PATH=/home/<user>/anaconda2/lib/python2.7/site-packages/ \
      -D BUILD_EXAMPLES=ON \
      -D BUILD_NEW_PYTHON_SUPPORT=ON \
      -D PYTHON2_LIBRARY=/home/<user>/anaconda2/lib/libpython2.7.so \
      -D BUILD_opencv_python3=OFF \
      -D BUILD_opencv_python2=ON ..
12. Make the file: make -j4 (4 refers to the number of cores, so it can be replaced
with another number)
13. Install: sudo make install
14. Configure the necessary links and cache: sudo ldconfig
15. Move the cv2.so file to the site-packages directory: cp build/lib/cv2.so
~/anaconda2/lib/python2.7/site-packages/
To test if the installation was successful:
1. Type: python
2. Import the library: import cv2
o If no error message is displayed, the installation was correct.
Running experiments using theano, lasagne, and nolearn
In order to test the code, it is necessary to follow these steps:
1. Download the datasets JAFFE and KDEF.
2. Install unzip: sudo apt-get install unzip
3. Extract the content:
o unzip jaffe.zip
o unzip KDEF.zip
4. Download the Output.tar folder.
5. Extract the folders: tar -xvf Output.tar
6. Change directory to the newly extracted pyCodeP: cd pyCodeP
7. Modify CNN.py according to the desired network.
8. Run CNN.py: python CNN.py
9. Remove inputDataX and inputDataY: rm inputData*
10. Test the network saved in convNetPickles: python calculateAccuracy.py
convNetPickles directory_with_test_dataset