Deep Learning for Non Verbal Sentiment Analysis: Facial ...

EasyChair Preprint№ 3014

Deep Learning for Non Verbal SentimentAnalysis: Facial Emotional Expressions

Nour Meeki, Abdelmalek Amine, Mohamed Amine Boudia andReda Mohamed Hamou

EasyChair preprints are intended for rapiddissemination of research results and areintegrated with the rest of EasyChair.

March 22, 2020

Deep Learning for non verbal sentiment analysis:

facial emotional expressions

Nour MEEKI ∗ 1, Abdelmalek AMINE †2, Mohamed Amine BOUDIA ‡3, Reda

Mohamed HAMOU §4

1,2,3,4GeCoDe Laboratory, Department of Computer Science, Tahar Moulay University of

Saıda, Algeria

Abstract

People are of a lazy nature and always look for the easiest ways to express them-

selves and share their experiences and opinions. Due to the popularity of social net-

works, and to the images expressivity, people have the ability to express themselves

throught their use. Our work is about non verbal sentiment analysis using on of the

Deep Learning models: CNN (Convolutional Neural Networks). Specifically, we are

interested in analyzing the sentiment expressed in facial expressions according to Kag-

gle’s Dataset fer2013 for facial emotion recognition based on the emotions defined by

the famous psychologist Ekman namely joy, anger, fear, disgust, sadness and surprise,

neutrality is added to the six emotions. Thus, different proposed architectures are

used and compared to determine the parameters that affect the results.

The best evaluation resulted in details of around 0,88 showing that the number of

convolution layers, the batch size, the dropout and the epoch number have an impact

on the results. However using a CPU cost us a lot which proves that the use of a GPU

when using huge amount of data is better and guarantee good results .

Key words: Deep Learning (DL), Sentiment Analysis (SA), Emotionnal Facial

Expression(EFE), image classification, Convolutional Neural Network (CNN).

1 Introduction

Data on the web are very large in quantity and of different qualities. With the explosion of

the internet and social networks has emerged the need to analyze millions of posts, tweets

or opinions in order to know what Internet users think and try to identify the emotions

in their messages. The recognition of human emotions has been studied for decades but

still remains one of the most complex areas (Liu et al., 2003; Li et al., 2010). There

are thre types of emotion recognition: 7% verbal (text), 55% non verbal (gesture, facial

expressions) and 38% vocal (voice, intention) (Mehrabian and Wiener, 1967). Our study is

focused on non verbal sentiment analysis, specifically facial expressions as the human face

is the most expressive part, using one of the Deep Learning models: CNN (convolutional

∗[email protected]†abd [email protected]‡[email protected]§[email protected]

1

neural networks) which is widely used in the field of image processing and has given very

good results. The dataset used is Kaggle’s Dataset ”fer2013” for facial emotion recognition

based on the emotions defined by the famous psychologist Ekman namely joy, anger,

fear, disgust, sadness and surprise. Neutrality is added to the six emotions. In ordre

to determine the factors that affect the results different architectures are proposed and

compared with other works done within the same framework.

Figure 1: Images in fer2013 dataset

2 Related work

Convolutional Neural Networks (CNNs) are widely used in most of image processing ap-

plications, i.e. classifying images(Krizhevsky et al., 2012), grouping them by similarity,

performing object recognition in scenes(Grangier et al., 2009). CNNs are typically consti-

tute of input layer, convolution layers, fully connected layers and an output layer. Between

the convolutions layers and fully connected layers, there may also be other layers such as

pooling, dropout and normalization layers.

• The convolution layer allows to extract the features of an input image by applying

a set of filters. It is defined by the following equation:

G[m,n] = (f ∗ h)[m,n] =∑j

∑k

h[j, k]f [m− j, n− j] (1)

• The pooling consists in reducing the dimensions of the images. Its goal is to keep

as much relevant information as possible and reduce the number of parameters and

calculations in the network.

• The ReLU (Rectified Linear Units) is a non-linear activation function, it is defined

by:

Relu(x) =

{0 if x < 0

x if x > 0(2)

• The dropout is proposed as a regularization method in order to avoid the overfitting

problem.

2.1 Non verbal sentiment analysis through CNNs

The task of recognizing facial emotions is a required task for humans, but the transmis-

sion of this knowledge by machine is a challenge. Decades of time have been spent by

2

engineering researchers writing computer programs that accurately recognize functional-

ity. Thanks to Deep Learning methods, instead of programming a machine, we can teach

it to recognize emotions with great precision. The following works proves this:

(Fathallah et al., 2017) is a study focused only on facial expression recognition that uses

a CNN architecture with four convolution layers (with three maximum grouping

layers) to extract entities hierarchically, followed by a fully connected layer and a

softmax output layer indicating 6 expression classes to predict facial expression. The

results and recognition rates have shown that the method used in the work surpasses

the methods of the state of the art.

(Jindal and Singh, 2015) studied the sentiment analysis in images that exploits its aver-

age level attributes in addition to facial expression recognition. Experiments were

conducted on a set of manually labelled Flickr image data, with a rich repertoire

of images and associated tags reflecting the user’s emotions. A CNN architecture

composed of seven internal layers and a softmax layer. The hidden layers consist

of five successive convolution layers, followed by two fully connected layers, to de-

termine whether there are advantages when applying CNN to the visual sentiment

analysis. Their results prove that CNNs can give good results for the problem of

visual sentiment analysis.

(You et al., 2015) is another study where the same dataset as (Jindal and Singh, 2015)

was used with half a million Flickr images(from SentiBank). They proposed a PCNN

(Progressive Convolution Neural Network) network, as well as its training strategies,

which have made it possible to further generalize the trained model and increase the

accuracy. The results obtained in (Jindal and Singh, 2015) remain better among the

other ones even after a new study made by (Gajarla and Gupta, 2015) on collected

data from Flickr.

(Moran, 2019) : used the CNN architecture initially proposed by researcher Amogh

Gudi (Gudi et al., 2015), they trained the network for 100 epochs on a set of 14,524

images. Validation was performed using 9000 of the remaining images from the

FER2013 dataset. The network achieves optimal prediction accuracy (0,66), but

strives to distinguish between the emotions of fear and sadness.

Figure 2: Architecture (Moran, 2019)

(Giannopoulos et al., 2018) : examined the performance of two known deep learning

approaches (GoogLeNet and AlexNet) on facial expression recognition. The training

process was designed to go through 5000 iterations on the GoogLeNet and AlexNet

3

experiences. The average loss was presented every 10 iterations, while the accuracy

of each network was presented every 500 iterations. The performance of GoogLeNet

and AlexNet was studied on the FER-2013 dataset under three aspects, where each

evaluates a specific functionality of the methods. In the first part of the study, the

performance of networks was studied by recognizing or not the existence of emo-

tional content in a facial expression and then, in the second part of the study, their

performance was studied in specifying the exact emotional content of facial expres-

sions. Finally, in the third part, the two methods of deep learning were trained on

the emotional and neutral data studied. The results of the three parts are illustrated

in figure 3.

Figure 3: Accuracy of (Giannopoulos et al., 2018)

(Nishchal et al., 2018) : two different models were evaluated in their article according

to their precision and the number of parameters. The initial architecture proposed is

a standard CNN which includes 9 convolution layers, ReLU, batch normalization and

Global Average Pooling, it contains on average 600,000 parameters. The evaluation

of the proposed model made it possible to obtain details of approximately 0,66. The

first model was given the name fully-CNN sequential. The second model is driven by

the Xception (Chollet, 2017) architecture, this architecture has reached an accuracy

of 0,95.

Figure 4: Xception’s architecture (Chollet, 2017)

4

(Serengil, 2018): is another work carried out in the same framework, the constitution

of their architecture is as follows: 5 layers of convolutions followed by 3 layers of

pooling and 3 layers fully connected. The evaluation of their model made it possible

to obtain an accuracy of 0,92 and a loss of 0,22 with a number of epochs equal to

100.

Table1 represents accuracy of the related works using other dataset than ours and table2

is about the related works using the same dataset as ours (fer2013).

Work Method Dataset Accuracy

(Fathallah et al., 2017) 4 Conv + FC + Soft-Max

CK+/KDEF/RaFD/MUG 0.969

(Jindal and Singh,2015)

5 Conv + 2FC Flicker 0.535

(You et al., 2015) PCNN(ProgressiveCNN)

Flicker 0.781

(Gajarla and Gupta,2015)

ResNet-50 Flicker 0.733

Table 1: Related works accuracy 1

Architecture Accuracy

(Serengil, 2018) 0.92

(Moran, 2019) 0.65

AlexNet 0.82(Giannopoulos et al., 2018)

GoogleNet 0.87

sequential fully-CNN 0.66(Nishchal et al., 2018)

Xception 0.95

Table 2: Related works accuracy 2

3 Our experimentation and results

The puropse of our study is to determine wich parameters affect the results of a model.

Figure 5: Emotional expression recognition system

3.1 Model evaluation metrics

To evaluate our models we used the following metrics:

5

• Accuracy: Calculates the average accuracy rate of all forecasts.

Accuracy =TruePositif + TrueNegatif

TruePositif + TrueNegatif + FalsePositif + FalseNegatif(3)

• Loss: used to measure the inconsistency between the predicted value (p) and the

actual wording (t).

Loss = −∑j

ti,j log(Pi,j) (4)

• Confusion Matrix: used to describe the performance of a classification model on a

set of test data whose actual values are known.

3.2 Proposed architectures

We have proposed 3 different architectures described bellow.

1. Architecture 01: Consists of 5 convolution layers, 3 pooling layers and 3 fully con-

nected layers. The input image is (48*48), the image first goes to the first convolution

layer. This layer is composed of 64 filters of size (5*5), each of our convolutional

layers is followed by a function of activation ReLU this function forces the neurons

to return positive values, after this convolution 64 features maps of size (44*44) will

be created, then a Maxpooling with cells of size (5*5) is applied. Its function is

to reduce the spatial size of the incoming entities and thus contributes to reducing

the number of parameters and calculations in the network, thus helping to reduce

over-learning. At the end of the Maxpooling layer, we will have 64 feature maps

of size (20*20). The 64 feature maps obtained are input to the second convolu-

tion layer which is also composed of 64 (3*3) size filters. The 64 feature maps will

serve as input for the third convolutional layer that is similar to the second layer in

its constitution and are followed by an AveragePooling layer with(3*3) cells. The

fourth and fifth convolutional layers have 128 filters of size (3*3) and the activation

function chosen is the same as that of the other layers. They are followed by an

AveragePooling layer. At the exit of the AveragePooling layer, we will have 128

feature maps of size (1*1). The feature vector resulting from the convolutions has

a dimension of 128. To finish the construction of the architecture, we use 3 fully

connected layers. The first 2 fully-connected layers calculate a vector of size 1024,

and are each followed by a ReLU layer and a dropout equal to 0.2. The last layer

returns the vector of probabilities of size 7 (the number of classes) by applying the

softmax function. The given summary includes informations about : the layers and

their order in the model, the output shape of each layer, the number of parameters

(weight) in each layer, and the total number of parameters (weight) in the model.

Parameters in convolutional layers are calculated by the given formular:

Parameter = (filtreSize× inputFeatureMaps + 1) × outputFeatureMaps (5)

Parametrs in fully connected layers are calculated by the given formular:

Parameter = (input + 1) × output (6)

6

Figure 6: First model architecture and summary

From the confusion matrix presented in table 3, we deduce that among a set of images

consisting of 3589 a number of 1997 images have been well classified. The model

made a good learning of the classes of emotions expressing: happiness, surprise and

neutral.

Anger Disgust Fear Happiness Sad Surprise Neutral

Anger 161 4 75 65 76 10 76

Disgust 9 19 6 5 10 0 7

Fear 41 1 187 49 91 47 80

Happiness 26 0 33 709 36 16 75

Sad 46 2 81 79 274 21 150

Surprise 16 1 46 37 7 292 16

Neutral 37 0 43 83 71 18 355

Table 3: First model confusion’s matrix

2. Architecture 02: Consists of 2 convolution layers, 2 pooling layers and 2 fully con-

nected layers. The input image has a size of (48 * 48), the two convolution layers

have a set of 32 (3*3) size filters and a Relu activation function. Each layer is fol-

lowed by a MaxPooling layer with (2*2) size windows. At the end of the MaxPooling

layer, we will have 32 feature maps of size (10*10). The feature vector resulting from

the convolutions therefore has a dimension of 3200. The fully connected first layer

calculates a vector of size 128 and followed by a layer Relu and a dropout which is

equal to 0.2. The fully connected second layer returns a vector of probabilities of

size 7 (the number of classes) and the softmax function is applied to it.

7

Figure 7: Second model architecture and summary

For this model two representations were used. In the first respresentation a batch size

of size 16 was used while a batch size of 96 was used in the second one.

Architecture 02 batch size Accuracy Loss

First representation 16 0.718 0.757

second representation 96 0.76 0.63

Table 4: A comparaison between the two representations of the second architecture

3. Architecture 03: Consists of 6 convolution layers, 3 pooling layers and one fully

connected layer. Same as the two previous architectures, the input image is 48 * 48.

The first two convolution layers have a set of 32 (3*3) size filters and a Relu activation

function, followed by a MaxPooling with 2 * 2 size windows. The convolution layers

3 and 4 have a set of 64 (3*3) size filters and the Relu activation function is used in

both layers. They are also followed by a MaxPooling layer with (2*2) size windows.

The only difference with layers 5 and 6 is that the applied filter set is 128. 128

feature maps of size (2*2) are output from the third layer of MaxPooling. A vector

of dimension 512 is obtained. For the last layer, the fully connected layer, the softmax

function is applied, and the returned probability vector is of size 7 (the number of

classes).

Figure 8: Third model architecture and summary

8

The table 5 summarizes the results obtained for the three proposed models. From these

results, we notice that the performance of a model depend on the number of convolution

layers used (the higher the number, the more accurate the model is) and the size of the

batch size used (the larger it is, the better we have a good training model). The execution

time depends on the complexity of the model (the more fully connected layers we have,

the more time it takes to execute).

Despite the complexity of the first model, its results are not satisfactory because we have

a considerable loss of information (0,70). So, we can see that the third model is the most

reliable of the three models proposed since we have an accuracy of 0,889 and a loss of 0,29.

Architecture Accuracy Loss Execution time

Architecture 01 0.77 0.70 120 h

Architecture 02 R1 0.718 0.757 72 h

Architecture 02 R2 0.76 0.63 48 h

Architecture 03 0.889 0.29 72 h

Table 5: Table comparision between the proposed models

The following figures represent predictions made on a few images from FER2013’s

private test devoted to the evaluation. The first model predictions were not that succeful

as the predictions made by the two other models.

Figure 9: Predictions of the first model

Figure 10: Predictions of the second model

9

Figure 11: Predictions of the third model

From the comparison of our results Table 5 with the related works using the same

dataset ”fer2013” Table 2, we can conclude that the obtained results are acceptable since

the accuracy values are included between the two architectures proposed in (Nishchal

et al., 2018), the Xception architecture that gave very good results (accuracy= 0,95) and

the sequential architecture fully-CNN which has an accuracy value of 0,66.

4 Conclusion

A picture is worth a thousand words. This work have been interested in analyzing non

verbal sentiment, especially emotional facial expressions. To solve our problem, we used

Convolutional Neural Networks (CNNs) as a deep learning architecture. We presented

three different models in order to determine the factors that give good results in terms

of accuracy and loss. Thus, we found that the more we have: convolution layers, a good

dropout, a large size of batch size, and a large number of epochs, the more the results

are satisfactory. The third model showed the best results. The number of convolution

layers and the size of the batch size reflect these good results, but the execution time

was expensive. The use of a CPU during the training phase has cost precious time. This

amounts to the large size of the dataset, which requires the use of a GPU instead.

For our next work, we plan to do a general non verbal sentiment analysis (images and

videos), in order to be able to determine the feeling released in. We also plan to do a hybrid

sentiment analysis between the verbal, non verbal and introduce the vocal eventually.

References

Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In

Proceedings of the IEEE conference on computer vision and pattern recognition, pages

1251–1258.

Fathallah, A., Abdi, L., and Douik, A. (2017). Facial expression recognition via deep

learning. In 2017 IEEE/ACS 14th International Conference on Computer Systems

and Applications (AICCSA), pages 745–750. IEEE.

Gajarla, V. and Gupta, A. (2015). Emotion detection and sentiment analysis of images.

Georgia Institute of Technology.

Giannopoulos, P., Perikos, I., and Hatzilygeroudis, I. (2018). Deep learning approaches for

10

facial emotion recognition: A case study on fer-2013. In Advances in Hybridization

of Intelligent Methods, pages 1–16. Springer.

Grangier, D., Bottou, L., and Collobert, R. (2009). Deep convolutional networks for scene

parsing. In ICML 2009 Deep Learning Workshop, volume 3, page 109. Citeseer.

Gudi, A., Tasli, H. E., den Uyl, T. M., and Maroulis, A. (2015). Deep learning based

facs action unit occurrence and intensity estimation. 2015 11th IEEE International

Conference and Workshops on Automatic Face and Gesture Recognition (FG).

Jindal, S. and Singh, S. (2015). Image sentiment analysis using deep convolutional neu-

ral networks with domain specific fine tuning. In 2015 International Conference on

Information Processing (ICIP), pages 447–451. IEEE.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep

convolutional neural networks. In Advances in neural information processing systems,

pages 1097–1105.

Li, G., Hoi, S. C., Chang, K., and Jain, R. (2010). Micro-blogging sentiment detection

by collaborative online learning. In 2010 IEEE International Conference on Data

Mining, pages 893–898. IEEE.

Liu, B., Dai, Y., Li, X., Lee, W. S., and Philip, S. Y. (2003). Building text classifiers using

positive and unlabeled examples. In ICDM, volume 3, pages 197–188. Citeseer.

Mehrabian, A. and Wiener, M. (1967). Decoding of inconsistent communications. Journal

of personality and social psychology, 6(1):109.

Moran, J. L. (2019). Classifying human emotion using convolutional neural networks.

UC Merced Undergraduate Research Journal. https://escholarship.org/uc/item/

1t89f7xk.

Nishchal, P. C., Chengappa, P., Raman, T., Pandey, S., and Shyam, G. K. (2018). Emotion

identification and classification using convolutional neural networks. International

Journal of Advanced Research in Computer Science, 9(Special Issue 3):357.

Serengil, S. I. (2018). Facial expression recognition with keras. https:

//sefiks.com/2018/01/01/facial-expression-recognition-with-keras/

?fbclid=IwAR2sJwE1ydZmpErqyBEzFaZNY9crXz-g-oDuYZJ_YmRf7RoSUjxLzmbbhCo

Consulte le 20/04/2019.

You, Q., Luo, J., Jin, H., and Yang, J. (2015). Robust image sentiment analysis using

progressively trained and domain transferred deep networks. In Twenty-Ninth AAAI

Conference on Artificial Intelligence.

11

https://escholarship.org/uc/item/1t89f7xk

https://escholarship.org/uc/item/1t89f7xk

https://sefiks.com/2018/01/01/facial-expression-recognition-with-keras/?fbclid=IwAR2sJwE1ydZmpErqyBEzFaZNY9crXz-g-oDuYZJ_YmRf7RoSUjxLzmbbhCo



Deep Learning for Non Verbal Sentiment Analysis: Facial ...

Documents

Transcript of Deep Learning for Non Verbal Sentiment Analysis: Facial ...