Audio and Text based Multimodal Sentiment Analysis using Features
Extracted from Selective Regions and Deep Neural Networks
Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science by Research
in
Computer Science and Engineering
by
Harika Abburi
201450880
International Institute of Information Technology
Hyderabad - 500 032, INDIA
June 2017
Copyright © Harika Abburi, 2017
All Rights Reserved
To My Family Members
Acknowledgments
I owe deep gratitude to everyone who has contributed to the completion of this thesis.
Foremost, I would like to express my deep gratitude to my advisors Dr. Suryakanth V Gangashetty
and Dr. Manish Shrivastava for their guidance and encouragement throughout my research work. I am
very fortunate to have advisors who gave me the freedom to explore on my own, while their guidance
helped me recover whenever my steps faltered.
I owe special thanks to Prof. B. Yegnanarayana for sharing his immense knowledge on speech. I am
grateful to Dr. Radhika Mamidi for her constant support and encouragement. I thank Dr. Kishore S
Prahallad and Dr. Vinay Kumar Mital for guiding me in the beginning of my research work.
I am grateful to KNRK Raju and Akhil Akkireddy for their valuable feedback and comments. I
thank VVV Raju, Ramakrishna sir, Harikrishna, Ravi Kumar sir, and Mounika for their valuable feedback,
discussions, and help in every situation. I thank all my lab members Gangamohan,
Sudarshan, Sivanand, Nivedita, Vishala, Ravi, Aneeja, Bhanu Teja, Harsha, Anandswaroop, Sirisha,
and Bhavya, Ayushi for helping me complete my thesis work.
Most importantly, none of this would have been possible without the love and patience of my family
members. I thank them for their support and strength throughout. I would like to personally thank
my husband, who guided me through the difficult situations in my life and never let me fail. I thank him
immensely for that.
- Harika Abburi
Abstract
Sentiment analysis has emerged as a field that has attracted a significant amount of attention over
the last decade. It is the study of people's reviews, songs, and attitudes across different
types of data, classifying them as positive, negative, or neutral. The recent advancement of social
media, an enormous and ever-growing source, has led people to share their views through various
modalities such as audio, text, and video. This source of information is important for automatically
identifying the sentiment embedded in different types of data such as reviews and songs.
In this thesis, an improved multimodal approach is proposed to detect the sentiment of product reviews
and songs based on their multimodal nature (audio and text). The basic goal is to classify
the input data as either positive or negative sentiment. The databases used in this study are Spanish product
reviews, Hindi product reviews, and Telugu songs. Most existing systems for audio- or speech-based
sentiment analysis use conventional audio features extracted from the entire signal, but
these are not domain-specific features for extracting sentiment. In this work, instead of extracting
features from the entire signal, specific regions of the audio signal are identified and experiments
are performed on these regions by extracting the relevant features.
For the songs data, experiments are performed over each entire song and over its beginning and ending
regions. For all these cases, Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers are
built using prosody, temporal, spectral, tempo, and chroma features. Experimental results show that
the rate of detecting the sentiment of a song is higher at the beginning of the song than at its ending region
or over its entire duration. This is because the instruments and vocals that convey the
sentiment in the beginning part of a song may or may not be sustained throughout the song.
For the reviews data, these experiments could not be performed because the sentiment
may not be present in the beginning or ending regions of an utterance. So for the reviews data, the stressed
and normal regions are identified using the strength of excitation. From the stressed regions, Mel
Frequency Cepstral Coefficient (MFCC) features are extracted and a GMM classifier is built. Further,
experiments are performed by extracting prosody (energy, pitch, and duration) and relative prosody
features from both regions and from the entire audio signal, and a GMM classifier is built. From the
results, it is observed that the performance on the specific regions is better than on the entire signal.
It is also observed that relative prosody features extracted from both regions give higher accuracy
in detecting the sentiment than the prosody and MFCC features. This is because the natural
variations present in the prosody features are reduced by the relative prosody features.
Recently, neural networks have achieved good success in sentiment classification. In this work,
different deep learning architectures such as the Deep Neural Network (DNN) and the Deep Neural Network
with Attention Mechanism (DNNAM) are also explored. Here the stressed-regions approach fails because of
limited training data, since DNN performance depends on the amount of training data: the more training
data, the more accurate it is. So the experiments are performed on combinations of frames, which results in
better performance, because each individual frame does not necessarily carry the sentiment. The MFCC features
considered are 13-dimensional, 65-dimensional, and 130-dimensional feature vectors. From the studies, it is observed
that the DNNAM classifier gives better results than the DNN, because the DNN approach is frame-based
whereas the DNNAM approach performs utterance-level classification, thereby efficiently making
use of the context.
For text-based sentiment analysis, transcriptions are carried out manually from the audio signal.
For song classification, SVM and Naive-Bayes classifiers are built using textual features
computed as Doc2Vec vectors. As with the audio, experiments are performed on the beginning,
the ending, and the entire song. The studies show that the beginning of a song gives higher accuracy in
detecting the sentiment than the ending region or the entire song. As similar experiments
could not be carried out with the reviews data, the entire document is taken as input to extract the
sentiment. Support Vector Machine (SVM) and Long Short-Term Memory Recurrent Neural Network
(LSTM-RNN) classifiers are used to develop a sentiment model with textual features computed by
Doc2Vec and Word2Vec. From the experimental studies, it is observed that the LSTM-RNN outperforms
the SVM because the LSTM-RNN is able to memorize long temporal context.
Finally, both modalities, audio and text, are combined to extract the sentiment. The final
hypothesis is based on the highest average probability across the classifiers of the two modalities. It is observed
from the studies that the simultaneous use of these two modalities helps create a better sentiment
analysis model for detecting whether the given input carries positive or negative sentiment.
Keywords: Sentiment Analysis, Multimodal Classification, Text features, Audio features, Lyric features,
Stressed regions, Normal regions, Relative prosody features, Mel frequency cepstral coefficients,
Doc2Vec, Word2Vec, Gaussian Mixture Models (GMM), Support Vector Machine (SVM), Naive-Bayes
(NB), Deep Neural Networks (DNN), Deep Neural Network Attention Mechanism (DNNAM), Long Short-Term
Memory Recurrent Neural Network (LSTM-RNN).
Contents
Chapter Page
1 INTRODUCTION TO SENTIMENT ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Issues Addressed in this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Applications of Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 REVIEW OF APPROACHES TO SENTIMENT ANALYSIS . . . . . . . . . . . . . . . . . 6
2.1 Audio-based Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Text-based Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Multimodal Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 MULTIMODAL SENTIMENT ANALYSIS USING SELECTIVE REGIONS . . . . . . . . . 12
3.1 Databases used in this Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.1.1 Spanish database of opinion utterances . . . . . . . . . . . . . . . . . . . . . 12
3.1.2 Hindi database of reviews utterances . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.3 Telugu songs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Detection of Sentiment in Selective Regions of Songs . . . . . . . . . . . . . . . . . . 13
3.2.1 Sentiment analysis using audio features . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Sentiment analysis using lyric features . . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Multimodal sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Detection of Sentiment from Selective Regions of Reviews . . . . . . . . . . . . . . . 18
3.3.1 Sentiment analysis using speech features . . . . . . . . . . . . . . . . . . . . 18
3.3.1.1 Detecting stressed and normal regions . . . . . . . . . . . . . . . . 18
3.3.1.2 Computation of strength of excitation using ZFF method . . . . . . 18
3.3.1.3 Detecting stressed and normal regions of an audio signal . . . . . . . 19
3.3.1.4 Features extracted at stressed and normal regions . . . . . . . . . . . 20
3.3.1.5 Sentiment analysis experimental setup and results . . . . . . . . . . 23
3.3.2 Sentiment analysis using text features . . . . . . . . . . . . . . . . . . . . . . 24
3.3.3 Multimodal sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4 MULTIMODAL SENTIMENT ANALYSIS USING DEEP NEURAL NETWORKS . . . . . 27
4.1 Sentiment Analysis using Audio Features . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.1 Deep neural network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.2 Deep neural network attention mechanism . . . . . . . . . . . . . . . . . . . . 29
4.1.3 Gaussian mixture models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Sentiment Analysis using Text Features . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Support vector machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 LSTM-RNN model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Multimodal Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5 SUMMARY AND CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
List of Tables
Table Page
3.1 Sentiment classification performance (in %) of songs using a 37-dimensional feature vector. 15
3.2 Sentiment classification performance (in %) of songs using lyric features. . . . . . . . 16
3.3 Multimodal sentiment classification performance (in %) with lyric and audio features. . 17
3.4 Sentiment classification performance (in %) with different features. . . . . . . . . . . 24
3.5 Sentiment classification performance (in %) using text features. . . . . . . . . . . . . . 25
3.6 Sentiment classification performance (in %) for different models on Spanish and Hindi
datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1 Performance (in %) of sentiment analysis using deep neural network. . . . . . . . . . . 29
4.2 Performance (in %) of sentiment analysis using deep neural network attention mecha-
nism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Performance (in %) of sentiment analysis using different classifiers. . . . . . . . . . . 32
4.4 Performance (in %) of sentiment analysis using text features. . . . . . . . . . . . . . . 35
4.5 Performance (in %) of multimodal sentiment analysis. . . . . . . . . . . . . . . . . . . 36
List of Figures
Figure Page
3.1 Block diagram of multimodal sentiment analysis of songs. . . . . . . . . . . . . . . . 17
3.2 An approach to detect stressed significant regions of the Spanish speech utterance “pero
igual con las lavadas”. (a) Input speech signal, (b) ZFF signal, (c) strength of the
excitation at each epoch, (d) strength of the excitation mean-smoothed using a frame
size of 20 ms, and (e) detected stressed significant regions of the utterance. . . . . . . . 20
3.3 Block diagram to extract relative prosody features. . . . . . . . . . . . . . . . . . . . 21
3.4 Histogram of (a) maximum pitch, (b) relative maximum pitch for positive and negative
sentiment audio files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Scatter plot of maximum pitch vs maximum energy for positive and negative sentiment
audio files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1 Block diagram of sentiment analysis using deep neural network. . . . . . . . . . . . . 28
4.2 Deep neural network attention model [48]. . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 LSTM Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Block diagram of multimodal sentiment analysis. . . . . . . . . . . . . . . . . . . . . 36
List of Abbreviations
ASR Automatic Speech Recognition
DNN Deep Neural Network
DNNAM Deep Neural Network Attention Mechanism
GMM Gaussian Mixture Models
HMM Hidden Markov Model
LPCC Linear Predictive Cepstral Coefficients
LSTM-RNN Long Short-Term Memory Recurrent Neural Network
MFCC Mel Frequency Cepstral Coefficients
NB Naive-Bayes
RNN Recurrent Neural Network
SoE Strength of Excitation
SVM Support Vector Machine
ZFF Zero Frequency Filter
VSM Vector Space Model
Chapter 1
INTRODUCTION TO SENTIMENT ANALYSIS
Sentiment analysis is the systematic study of identifying and extracting information present in source
materials using natural language processing, computational linguistics, and text analytics. It focuses on
the automatic identification of opinions, evaluations, and judgments, capturing the behavior, thoughts,
views, mood, attitude, likes, and dislikes of an individual from user-generated web content. Sentiments are
considered the manifestation of human feelings and emotions. Computer science helps in
analyzing and predicting the hidden information stored in different modalities like audio, video, and
text. This hidden information provides valuable insights into users' intentions, tastes, and preferences.
User-generated content is an important source of information for mining the sentiment/opinion of people
on different products and services. To obtain factual and subjective information on companies
and products, analysts are turning to the web. The world wide web plays an
important role in gathering public opinion, which is useful in making business-related decisions.
Extracting the sentiment from these public opinions is a major task. Companies spend a large chunk
of their revenue on business intelligence to read the minds of the general public and interpret what
people think about their products. Sentiment analysis tries to mine information from various kinds of data
such as reviews, songs, news, and blogs, and classify them as positive, negative, or neutral.
Much of the work to date on sentiment analysis has focused on textual data. Available datasets
and resources are restricted to text-based sentiment analysis. With the advent of social media,
people now extensively use videos, images, and audio on platforms such as YouTube to express their
opinions. Audio reviews represent a growing source of consumer information that has gained
increasing interest from companies, researchers, and consumers. Compared to traditional text reviews,
audio reviews provide a more natural experience, as they allow the viewer to better sense the reviewer's
emotions, beliefs, and intentions through richer channels such as intonation. Thus, it is highly important
to mine opinions and identify sentiments from these diverse modalities.
Modalities other than text can often provide clues for the expression of sentiment. The joint
use of modalities brings significant advantages over text alone, including linguistic disambiguation (audio
features can help disambiguate linguistic meaning), relief from the linguistic sparsity problem (audio
features bring additional sentiment information), and grounding (the audio modality enhances the connection to
real-world environments). Here, we address the task of multimodal sentiment analysis. We experiment
with several linguistic and audio features extracted from specific regions and show that the joint
use of these two modalities significantly improves the classification accuracy compared to using one
modality at a time.
1.1 Motivation
Research in sentiment analysis is rapidly growing and attracting the attention of academia and
industry alike. Human beings have a natural ability to understand emotions, analyze situations, and grasp
the sentiments associated with a context. How efficiently we can train a machine to exhibit the same
capability is an important question to be explored and answered. Sentiment analysis
provides an effective mechanism for understanding an individual's attitude, behavior, likes, and dislikes.
Both signal processing and AI have led to the development of advanced intelligent systems
that intend to detect and process affective information contained in multimodal sources. The majority of
such state-of-the-art frameworks, however, rely on processing a single modality, i.e., text, audio, or video.
Further, all of these systems are known to exhibit limitations in terms of robustness, accuracy,
and overall performance, which in turn greatly restricts the usefulness of such systems in
real-world applications. The aim of multimodal data fusion is to increase the accuracy and reliability of
estimates. The textual, audio, and visual modalities are exploited to enable effective extraction of the
semantic and affective information conveyed during communication.
With the significant increase in the popularity of social media like Facebook and YouTube, many users
tend to upload their opinions on products in video format. So, mining opinions on various
products becomes a useful resource to guide and help people in making choices and decisions. Mining
sentiments and subjective information helps to provide products and services in a personalized fashion,
as per individual tastes and preferences. The study of sentiment analysis also provides information
about how human beings perceive and express their feelings and emotions in the form of text and audio.
Hence, when a person expresses opinions with more vocal modulation, the audio data may contain
most of the clues for opinion mining, rather than only the text. So, a generic model needs to be developed
that can adapt itself to any user and give consistent results. These multi-dimensional aspects
motivated taking up multimodal sentiment analysis as the current research work.
1.2 Issues Addressed in this Thesis
In this research, multimodal sentiment analysis is performed on selective regions of an input, and deep
neural networks are explored for sentiment analysis. Previous research on multimodal sentiment
analysis extracted conventional features from the entire signal, but these are not
problem-specific features for extracting sentiment. In this work, instead of extracting the features from
the entire signal, specific regions of the audio signal are identified and experiments are performed
on these regions using relevant features. To extract sentiment from the songs data, specific regions
such as the beginning and ending of a song are identified. For the reviews, stressed significant and
normal regions of the audio input are identified using the Zero Frequency Filtered (ZFF) signal, and
sentiment analysis is performed on these specific regions.
Sentiment analysis is performed using new features, namely relative prosody features, which are
extracted from the stressed and normal regions of the audio signal. This representation outperforms the
prosody and Mel Frequency Cepstral Coefficient (MFCC) features extracted from both regions.
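As an illustration of what such relative prosody features might look like, the sketch below computes prosody statistics (maximum pitch and energy) in the stressed region and expresses them relative to the normal region. The ratio-based definition, the function name `relative_prosody`, and the toy pitch/energy tracks are assumptions for illustration, not the thesis's exact formulation.

```python
import numpy as np

def relative_prosody(pitch, energy, stressed_mask):
    """Illustrative relative-prosody features: prosody statistics from the
    stressed frames expressed relative to the normal frames. The ratio
    definition here is an assumed simplification."""
    stressed = stressed_mask.astype(bool)
    normal = ~stressed
    feats = {}
    for name, track in (("pitch", pitch), ("energy", energy)):
        s_max = track[stressed].max()          # absolute prosody feature
        n_mean = track[normal].mean()
        feats[f"max_{name}_stressed"] = s_max
        feats[f"rel_max_{name}"] = s_max / n_mean  # relative prosody feature
    return feats

# Toy frame-level pitch/energy tracks; frames 40-59 marked as stressed
pitch = np.concatenate([np.full(40, 120.0), np.full(20, 180.0), np.full(40, 120.0)])
energy = np.concatenate([np.full(40, 0.2), np.full(20, 0.6), np.full(40, 0.2)])
mask = np.zeros(100)
mask[40:60] = 1
print(relative_prosody(pitch, energy, mask))
```

Dividing by the normal-region statistics is one plausible way to suppress a speaker's natural prosodic baseline, which is the motivation the abstract gives for relative prosody features.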
A song sentiment classification model is developed for different regions of a song in the Telugu language.
Deep learning architectures (a deep neural network and a deep neural network with attention
mechanism) are explored using Mel Frequency Cepstral Coefficient feature vectors of different
dimensions (13-dimensional, 65-dimensional (5 frames combined), and 130-dimensional (10 frames
combined)) to extract the sentiment present in the audio input.
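The 65- and 130-dimensional vectors above are concatenations of consecutive 13-dimensional MFCC frames. A minimal sketch of such frame stacking follows; the function name and the drop-the-remainder convention for leftover frames are assumptions, since the thesis does not state how incomplete groups are handled.

```python
import numpy as np

def stack_frames(mfcc, n):
    """Concatenate n consecutive MFCC frames into one (13*n)-dim vector.
    mfcc: array of shape (num_frames, 13). Trailing frames that do not
    fill a complete group of n are dropped (assumed convention)."""
    usable = (len(mfcc) // n) * n
    return mfcc[:usable].reshape(-1, n * mfcc.shape[1])

mfcc = np.random.randn(103, 13)      # 103 frames of 13-dim MFCCs
print(stack_frames(mfcc, 5).shape)   # 5 frames combined -> 65-dim vectors
print(stack_frames(mfcc, 10).shape)  # 10 frames combined -> 130-dim vectors
```

Stacking widens the temporal context each training example carries, which matches the abstract's observation that single frames do not reliably carry the sentiment.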
A Hindi product reviews database and a Telugu songs database are collected for this work. A total of
100 files are collected, of which 50 carry positive and 50 negative sentiment. The audio files are
downloaded from YouTube, and transcriptions of the reviews are done manually.
1.3 Applications of Sentiment Analysis
Sentiment Analysis has been widely used for understanding the nature of a context. Few areas where
Sentiment Analysis can be applied are:
(a) Businesses and organizations: Many business strategies are guided by the response from
customers. Companies aim to satisfy the needs and demands of users, so the
strategic moves of companies are driven by public opinions and views. With the world connected
through technology, events have a global impact: an issue or failure in one part of the world affects
the other parts of the globe. So it becomes quite important to drive products and services according to the
public viewpoint. Businesses invest a huge amount of money in finding out the sentiment.
(b) Individual product analysis and decision making: With the help of sentiment analysis it has
become easier to analyze different products and make choices accordingly. This kind of analysis
also helps in selecting a product based on its feature specifications, and makes comparing two products
much easier. Decision making is an integral part of our life. It ranges from which
products to buy and which restaurant to go to, to which bank or insurance policy to choose and which
investments to make. Sentiment analysis can be used to decide and select from the available options based on
the general opinions expressed by other users.
(c) Ad placement: Placing ads in user-generated content: place an ad when someone praises a product,
and place an ad from a competitor if someone criticizes a product. Whenever someone expresses an opinion
about a product, analyzing that review tells us whether it is positive or negative.
(d) Recommendation systems: Most of the websites we visit have a recommendation system built in
to assist us, ranging from sites related to books, online media, entertainment, music, and film to
other forms of art. These systems use our personal information, previous history, likes and dislikes, and
our friends' information to make suggestions.
(e) Opinion retrieval/search: Providing search over the opinions given by users.
(f) Designing and building innovative products: With products exposed to tough competition and open to
criticism through public reviews and opinions, sentiment analysis leads to better analysis of products
in terms of usability and human-friendliness. It creates an environment for better and more innovative
products.
(g) Computing customer satisfaction metrics: You can get an idea of how happy customers
are with your products from the ratio of positive to negative tweets about them.
(h) Identifying detractors and promoters: Sentiment analysis can be used for customer service, by spotting
dissatisfaction or problems with the products. It can also be used to find people who are happy with your
products or services, whose experiences can then be used to promote your products.
There are more applications like question and answering, text summarization and mining opinion
from product reviews.
1.4 Organization of the Thesis
Organization of the thesis is as follows:
• In chapter 2, we review various approaches to multimodal sentiment analysis. This chapter
discusses related work on sentiment analysis over different modalities such as audio and text.
• Chapter 3 discusses the importance of specific regions of the audio input. Further, we
present the different features used to detect the sentiment in these specific regions.
• Chapter 4 describes the different deep neural network architectures developed to build a
sentiment model.
• Finally we summarize the contributions of this research work and discuss some directions for
further work in chapter 5.
Chapter 2
REVIEW OF APPROACHES TO SENTIMENT ANALYSIS
Sentiment analysis plays a key role in the classification of many kinds of data, such as online songs
[1] [43], online reviews [12] [54], and Twitter data [2] [26]. In this chapter we provide a brief overview
of related work on multimodal sentiment analysis. We first focus on audio-based sentiment analysis,
and then on text-based sentiment analysis. Each modality carries some special information that helps
to detect the sentiment; hence, all kinds of data are useful for better sentiment classification. For
any modality, audio or text, sentiment can be extracted using sentiment classification
techniques based on machine learning approaches [32].
2.1 Audio-based Sentiment Analysis
The study of the relationship between emotional content and audio signals is a very mature field.
Researchers have extended the success of Mel Frequency Cepstral Coefficient (MFCC)
features in the speech recognition community to explore their use in music modeling [29]. MFCCs are
currently a staple in audio processing and are commonly used in MIR applications such as genre
classification [55], since they provide a quantifiable method for comparing the timbral texture of songs. Timbre
has been used with some success to classify the emotional content of songs [28]; however, class
inconsistencies have proven to be a difficult challenge, causing substantial misclassification between edge cases.
Timbre and chroma have also been used to generate songs that evoke particular emotions [6]. These
vectors have commonly been classified using support vector machines (SVM) and naive Bayes classifiers.
Instead of using MFCC and chroma features separately, research has shown that combining both
features gives better performance. The chroma features are less informative for classes such as
artist, but contain information that is independent of the spectral features [10]. For this reason, in
our work we combine both features, along with some other features, for the song database.
To detect sentiment from natural audio streams, a sentiment detection system was developed based
on Maximum Entropy modeling and Part-of-Speech tagging, with transcripts obtained from the audio
streams using Automatic Speech Recognition (ASR) [24]. This approach shows that it is possible to
automatically detect sentiment in natural spontaneous audio with good accuracy. Another method was
presented for audio sentiment detection based on Keyword Spotting rather than ASR [23]; experiments
show that it outperforms the traditional ASR approach with a 12 percent increase
in classification accuracy. Audio features like pitch, intensity, and loudness have been extracted using the
OpenEAR software and a Support Vector Machine (SVM) classifier built to detect the sentiment [45]. In [37],
audio features are automatically extracted from the audio track of each video clip using the OpenEAR
software and a Hidden Markov Model (HMM) classifier is built to detect the sentiment. In our study, instead
of extracting all the features from the entire input using tools like OpenEAR/OpenSMILE, only specific
relevant features such as MFCC, prosody, and relative prosody, extracted from the stressed and normal
regions of the input, are used.
As prosody has been used before for the task of emotion recognition in speech, it has also been
tried for the task of sentiment analysis [31]. Speech features are generally derived from the
characteristics of the vocal tract, the excitation, and prosody. Prosody parameters extracted at segmental,
sub-segmental, and supra-segmental levels are used for emotion recognition in [49]. To develop an emotion
recognition system, the relation between various prosody parameters has been explored [25]. In the
literature, MFCC and Linear Predictive Cepstral Coefficients (LPCC) are the major spectral features used for
emotion recognition [56]. Various acoustic cues such as Energy of Excitation (EoE), loudness,
Strength of Excitation (SoE), instantaneous F0, and their combinations have been explored to study the
emotion-discriminating capabilities of the excitation signal [16]. These acoustic cues, F0 and SoE, are
also used for detecting paralinguistic sounds such as cries, shouts, and laughter [34] [35] [36]. The strength
of excitation of an audio input is found using the Zero-Frequency Filtering (ZFF) method. Regions where the
strength of excitation fluctuates more than 30% above or below the mean strength of excitation are referred
to as emotionally significant regions [57]. From the emotionally significant regions, MFCC features are
extracted and tested with a Gaussian Mixture Model (GMM) classifier, and a significant improvement in the
performance of the system is observed.
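The 30% criterion above can be sketched as follows. The smoothing window length and the synthetic SoE contour are placeholders (the cited work smooths over 20 ms frames), so this is an illustrative approximation rather than the exact ZFF pipeline.

```python
import numpy as np

def mark_stressed(soe, smooth_len=5, dev=0.30):
    """Mark epochs whose smoothed strength of excitation (SoE) deviates more
    than `dev` (30%) from the global mean SoE, per the criterion in [57].
    smooth_len is a stand-in for the 20 ms mean-smoothing frame."""
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(soe, kernel, mode="same")  # mean smoothing
    mean = smoothed.mean()
    return np.abs(smoothed - mean) > dev * mean       # True = significant

# Synthetic per-epoch SoE: a raised block in the middle of the utterance
soe = np.concatenate([np.full(50, 1.0), np.full(20, 2.0), np.full(50, 1.0)])
mask = mark_stressed(soe)
print(mask[60], mask[10])  # inside vs outside the raised block
```

The returned boolean mask selects the epochs from which region-specific features such as MFCCs would then be extracted.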
To detect the sentiment of an input, current research is moving from SVM, HMM and GMM classifiers
toward deep architectures, which have the ability to discover multiple levels of features from data.
Deep Neural Networks (DNN) have been applied to many tasks such as speech recognition, language
recognition and sentiment analysis. A significant improvement in performance is observed when deep
neural networks replace traditional GMMs in acoustic modeling for speech recognition [17]. In [30] [51],
a single DNN acoustic model is trained for both the language recognition and speaker recognition tasks,
again with a significant improvement in performance. The drawback of the DNN system is that a decision
is taken at every frame with a fixed context window, the frame-level decisions are then assigned to the
entire utterance, and the network cannot memorize long temporal context. To overcome this problem, a
feed-forward deep neural network with an attention mechanism [48] was proposed to solve long-range
dependency problems. This attention mechanism is used for the language recognition task in [38], where
it gives better performance than the DNN. In this work, we use this architecture for sentiment analysis
and refer to it as the Deep Neural Network with Attention Mechanism (DNNAM). The attention mechanism
can be parallelized because it is a feed-forward network with no recurrent connections, yet it is able to
memorize long temporal context. It classifies the whole utterance directly, whereas a DNN takes a
frame-level decision and then combines all the frame decisions to extract the sentiment.
2.2 Text-based Sentiment Analysis
A sentiment-vector space model is used for song sentiment classification [60]. Experiments are carried
out with two approaches: knowledge-based and machine learning. In the knowledge-based approach, HowNet [8]
is used to detect sentiment words and to locate sentiment units within the song lyric. In the machine
learning approach, the SVM algorithm is implemented on top of the Vector Space Model (VSM) and the
sentiment Vector Space Model (s-VSM). Experiments show that s-VSM gives better results than both VSM and
the knowledge-based approach. Previous work also includes sentiment analysis for mining topics from songs
based on their moods [52]: the input lyric files are represented as WordNet graphs, and the sentiment of
each song is mined using a Hidden Markov Model (HMM). Based on single adjective words available for the
audio dataset USPOP (hit pop songs), a new dataset is derived from last.fm tags [19]. On this dataset,
K-means clustering is applied to create a meaningful cluster-based set of high-level mood categories for
music mood classification. This set was not widely adopted, because the resulting mood categories were
seen as an oversimplification of the domain. The authors in [21] presented the usefulness of text features
in music mood classification on 18 mood categories derived from user tags, and showed that text features
outperform audio features in categories where samples are sparser. An unsupervised method to classify
music by mood is proposed in [42], where a fuzzy c-means classifier is used for automatic mood
classification.
Movie review mining using machine learning and semantic orientation is implemented in [5]. The
semantic orientation approach classifies an input review as positive or negative based on document
features, while machine learning techniques are used to investigate how effectively documents can be
classified by overall sentiment. Another work on movie reviews uses SVM, Naive Bayes and Maximum
Entropy classifiers [41]. Eight different types of features are extracted, such as unigrams, bigrams
and combinations of both. These features are tested on the three classifiers, among which the SVM with
binary unigram features achieves the highest sentiment detection rate. Unigram and bigram features can,
however, be misled by text that carries no subjective information. In [40], each sentence of a review
is labeled as either objective or subjective, and the objective sentences are discarded by finding
minimum cuts in graphs, which prevents the classifier from considering misleading text. SVM and Naive
Bayes classifiers are then trained on this data, and the results show a small improvement with this
method. A sentiment analyzer is developed to find all references to a subject, and the sentiment
polarity of each reference, in online documents [63]. A sentiment lexicon and a sentiment pattern
database are used for extraction and association; expressions about specific items are classified
using manually developed patterns to determine polarity.
Sentiment is extracted from tweets using opinion words, i.e., adjectives in combination with verbs
and adverbs [26]. Adjectives and negation words are taken into account to calculate the polarity of
the whole phrase. A corpus-based method is used to find the semantic orientation of adjectives, and a
dictionary-based method to find the semantic orientation of verbs and adverbs. Two different Naive
Bayes classifiers that make use of a polarity lexicon are used to classify English tweets as positive
or negative [15]. These classifiers are treated as the baseline, with features such as lemmas,
multiwords, the polarity lexicon and valence shifters. The training set of tweets is obtained from
SemEval 2014, with additional annotated tweets from external sources. Experiments show that performance
is best when the binary strategy is used with multiwords and valence shifters. An approach to analyze
the sentiment of short Chinese texts is presented in [61]. Using the word2vec tool, sentiment
dictionaries from NTU and HowNet are extended, and the feature weights of words are enhanced, including
words that appear in the sentiment dictionary and words adjacent to the sentiment words; the model is
implemented with an SVM classifier. A work on sentiment analysis of online news articles is presented
in [13]. Using the Machine Learning for Language Toolkit (MALLET), six text-classification algorithms
are compared: Naive Bayes, Maximum Entropy, a decision tree rule base, a decision tree with the C4.5
algorithm, the Winnow algorithm and the Balanced Winnow algorithm. Experimental results show that the
Naive Bayes classifier performs best.
The most commonly used classification techniques, such as SVM, Maximum Entropy and Naive Bayes, are
based on the bag-of-words model, in which the sequence of words is ignored. This can make sentiment
extraction ineffective, because the order of words affects the sentiment expressed. To overcome this
problem, many studies have employed deep learning for sentiment analysis. A deep neural network
architecture that jointly uses word-level, character-level and sentence-level representations to
perform sentiment analysis is proposed in [9], with improved performance in the reported experiments.
A DNN is also applied to language modeling [3], where it outperforms the n-gram model. A DNN can have
any number of hidden layers and any number of nodes per layer, with weighted connections between
layers, and a deeper architecture can learn a more complex model. However, a simple feed-forward
neural network does not become more accurate merely by adding layers, because the training process
becomes ineffective as depth grows [7], and it cannot capture temporal context accurately. To better
capture temporal context, the Recurrent Neural Network (RNN) has been proposed [33] and applied to
language modeling for speech recognition, where it outperforms the n-gram technique. The advantage of
an RNN is that it uses its previous state to compute the current state, which is similar to how
context works in most natural languages. However, a simple RNN has difficulty passing information
across long sequences. A solution to this issue is the Long Short-Term Memory (LSTM) network, an RNN
with additional long-term memory [18]. In this work, a Long Short-Term Memory Recurrent Neural
Network (LSTM-RNN) classifier is proposed to extract the sentiment from the input data.
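The gating that gives an LSTM its long-term memory can be sketched as a single numpy cell step. The weights here are random placeholders; in the LSTM-RNN classifier they are learned from data:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. x: input (d,), h: previous hidden state (n,),
    c: previous cell state (n,). W: (4n, d), U: (4n, n), b: (4n,).
    The cell state c is the additional long-term memory a simple RNN lacks."""
    z = W @ x + U @ h + b
    n = h.size
    i = sigmoid(z[0*n:1*n])          # input gate: how much new content enters
    f = sigmoid(z[1*n:2*n])          # forget gate: how much old memory survives
    o = sigmoid(z[2*n:3*n])          # output gate: how much memory is exposed
    g = np.tanh(z[3*n:4*n])          # candidate cell update
    c_new = f * c + i * g            # long-term memory update
    h_new = o * np.tanh(c_new)       # short-term (hidden) state
    return h_new, c_new

rng = np.random.default_rng(1)
d, n = 300, 64                       # e.g. word-vector input dim, hidden size
W = rng.normal(scale=0.1, size=(4*n, d))
U = rng.normal(scale=0.1, size=(4*n, n))
b = np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for t in range(10):                  # run over a toy 10-word sequence
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```

For sentiment classification, the final hidden state h (or a pooled sequence of states) is fed to a softmax layer over the sentiment classes.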
2.3 Multimodal Sentiment Analysis
Instead of using only lyrics or only audio, research has also been done on combining both modalities.
Surveys of multimodal sentiment analysis and its methods are given in [14][53].
In [20], work is done on mood classification in music digital libraries by combining lyric and audio
features. It is concluded that complementing audio with lyrics can reduce the number of training samples
required to achieve the same or better performance than single-source systems. Music sentiment
classification using both lyrics and audio is presented in [64]. For the lyric sentiment classification
task, the CHI approach (a feature selection method in text categorization) and an improved
difference-based CHI approach were developed to extract discriminative affective words from the lyric
text; the difference-based CHI approach gives better results than the plain CHI approach. For the audio
sentiment classification task, features such as chroma and spectral features are used to build an SVM
classifier. Experiments show that fusing the two data sources helps to improve music sentiment
classification. In [22], [58], music is retrieved based on both lyrics and melody information: keyword
spotting is used for lyrics, while MFCC and pitch features are extracted for melody. Experiments show
that combining both modalities increases performance.
The joint use of multiple modalities such as video, audio and text features is explored for classifying
the polarity of opinions in online videos [37] [47]. Both feature-level and decision-level fusion
methods are used to merge the affective information extracted from the modalities, and an improvement
in classification is reported when modalities are combined rather than used individually. A multimodal
sentiment analysis approach is an intelligent opinion mining system for identifying and understanding
the sentiment present in reviews; by using audio and video signals to extract the sentiment, it
overcomes the drawbacks of traditional text-only sentiment analysis [62]. In [59], the authors introduce
the Institute for Creative Technologies Multimodal Movie Opinion (ICT-MMMO) database of personal movie
reviews collected from YouTube and ExpoTV. It consists of English clips with sentiment annotations from
one or two coders, and the feature basis is formed from audio, video and textual features. Based on the
textual movie review corpus, different levels of domain dependence are considered: in-domain analysis
and cross-domain analysis, which shows that cross-corpus training works sufficiently well. The authors
of [44] introduce the MOUD database, which consists of Spanish videos. They explore the effect of
different combinations of text, speech and video features on classification, analyze the correlation
between visual and acoustic features, and further confirm their findings on a set of English videos.
The results show that the joint use of the three modalities brings a significant improvement. To
determine the sentiment polarity of an input, [46] extracts features from three modalities; a
convolutional neural network used to extract the text features shows a significant improvement in
detecting the sentiment of a review.
In this work, a method that combines text and audio (speech) features is explored to detect the
sentiment of online product reviews and online songs. So far, little research has been done on
multimodal sentiment analysis of online reviews and songs in Indian languages. Our proposed system is
developed for Telugu, Hindi and Spanish databases. In the literature, several features are typically
extracted from the whole input using the openEAR/openSMILE tool and a sentiment model is built using
different classifiers; in our work, by contrast, experiments are performed on selective regions.
Experiments on the whole song do not show good accuracy in detecting the sentiment, because the whole
song may or may not carry a single attribute such as happy (positive) or sad (negative). The beginning
and ending parts of a song are more likely to carry its main attribute, so experiments are carried out
on different parts of the song. For reviews, instead of extracting all features with a toolkit, MFCC,
prosody and relative prosody features are extracted from the stressed significant and normal regions of
the audio input, and sentiment analysis systems are built using SVM and GMM classifiers. Experiments
are also performed with deep architectures such as DNN and DNNAM. To detect the sentiment of a text
input, textual features computed as Doc2Vec and word2vec vectors are used to build SVM and LSTM-RNN
classifiers, respectively. The final sentiment is extracted by combining the audio and text modalities.
The next chapter describes how multimodal sentiment analysis is performed on selective regions of an
input.
Chapter 3
MULTIMODAL SENTIMENT ANALYSIS USING SELECTIVE
REGIONS
Most work on sentiment analysis so far operates on the entire signal, extracting several features from
it. In this chapter, instead of using the entire signal, specific regions of the audio signal are
identified and studies are performed on these regions to detect the sentiment associated with them. We
present a method to detect these specific regions of audio signals, show how the combination of audio
and text modalities improves sentiment detection, and report multimodal sentiment analysis results on
the Spanish reviews, Hindi reviews and Telugu songs databases.
3.1 Databases used in this Study
This section describes the Spanish, Hindi and Telugu databases used in this study. The Spanish database
is publicly available, while the Hindi and Telugu databases were collected to carry out this study.
3.1.1 Spanish database of opinion utterances
This database, named MOUD (Multimodal Opinion Utterances Dataset), is obtained from [44]. It consists
of a variety of product reviews on movies, perfumes and books, collected from the publicly available
source YouTube. Because a variety of product reviews is used, the database has some degree of
generality within the broad domain of product reviews. In total there are 100 videos, of which 42 are
positive, 36 are negative and 22 are neutral. For our study, only the positive and negative sentiments
are considered; 80% of the data is used for training and 20% for testing. Only 30 seconds of opinion
segments are taken from each video, after removing titles and advertisements. For text-based sentiment
classification, transcription and sentiment annotation were performed manually. The average number of
words in each input is around 50. Annotators were provided with both modalities, i.e., audio and
transcribed text, to correctly determine the opinion of each review.
3.1.2 Hindi database of review utterances
The database used in our studies is collected from YouTube, a publicly available source. As Hindi is a
resource-scarce language, not much work has been done on sentiment analysis for it. The dataset
includes reviews of phones, lotions and shampoos, and has some degree of generality because a variety
of product reviews is used within the broad domain of product reviews. The two basic sentiments present
in the database are positive and negative. A total of 110 product reviews are collected, from which 100
inputs are retained based on inter-annotator agreement: annotators were provided with both modalities,
audio and text, to determine the exact opinion of each input, and 50 positive and 50 negative inputs
were then chosen. Of these, 80% are used for training and the remaining 20% for testing. Each audio
input is in .wav format with 16-bit resolution, 16000 Hz sampling frequency and a mono channel.
Transcription and sentiment annotation were performed manually for text-based sentiment classification.
The average length of each input is 30 seconds and the average number of words per input is about 40.
3.1.3 Telugu songs
The database used in this study is collected from YouTube, a publicly available source. A total of 150
Telugu movie songs, along with the lyrics of each song, are taken. The two basic sentiments present in
the database are happy and sad: joyful, thrilled, powerful, etc., are treated as the happy sentiment,
and ignored, depressed, worried, etc., as the sad sentiment. As our native language is Telugu, the work
is implemented on Telugu songs, which have no special characteristics compared to songs in other
languages. Telugu songs are one of the popular categories of Indian songs and appear in Tollywood
movies; most people in the southern part of India listen to them. The songs include a variety of
instruments along with the vocals, and the main challenge is this diversity of instruments and vocals.
The average length of each song is 3.5 minutes and the average number of words in the lyrics of each
song is around 300. The database is annotated for the happy and sad sentiments by three people, who
were provided with both modalities, text and audio, to correctly determine the sentiment of each song.
Based on inter-annotator agreement, 50 happy and 50 sad songs are selected, because some songs seemed
happy or sad to one annotator and neutral to another; thus only 100 songs out of 150 are kept.
Inter-annotator agreement is a measure of how well two or more annotators make the same annotation
decision for a given category. Among the selected songs, 40% are used for training and 60% for testing.
3.2 Detection of Sentiment in Selective Regions of Songs
The entire song may or may not carry a single sentiment such as happy (positive) or sad (negative). So,
in this work, experiments are performed not only on the entire song but also on different regions,
namely the beginning and the end of the song.
3.2.1 Sentiment analysis using audio features
This section describes the process of developing sentiment models for different regions by extracting
the audio features of a song. These features are used to build classifiers for the positive or negative
sentiment of a song. Each song undergoes a preprocessing step that converts the mp3 file into a wave
file (.wav format) with 16-bit resolution, 16000 Hz sampling frequency and a mono channel.
As the entire song may or may not carry a single sentiment, three studies are performed: on the first
30 seconds of a song, on the last 30 seconds, and on the whole song. The openEAR/openSMILE toolkit [11]
is used to extract a set of audio features, such as Mel-frequency cepstral coefficients (MFCC), chroma,
prosody, temporal, spectral, harmonic and tempo features, from the wave file. As prosody has been used
before for emotion recognition in speech, it has also been applied successfully to sentiment analysis
using the openEAR toolkit [31].
Brief details of the audio features used in this work are given below:
1. Prosodic features are those aspects of speech that go beyond phonemes and deal with the auditory
qualities of sound. In spoken communication, we use and interpret these features without really
thinking about them. They include intensity, loudness and pitch, which describe the speech signal.
2. Temporal features, also called time-domain features, include the energy of the signal and the
zero-crossing rate.
3. Spectral features, also called frequency-domain features, are extracted by converting the
time-domain signal into the frequency domain using the Fourier transform. They include the fundamental
frequency, spectral centroid, spectral flux, spectral roll-off, spectral kurtosis and spectral
skewness, and can be used to identify notes, pitch, rhythm and melody.
4. Mel-frequency cepstral coefficients (a 13-dimensional feature vector) are calculated based on the
short-time Fourier transform (STFT). First, the log-amplitude of the magnitude spectrum is taken,
followed by grouping and smoothing the fast Fourier transform (FFT) bins according to the perceptually
motivated Mel-frequency scaling. The frequency bands are equally spaced on the mel scale, which
approximates the response of the human auditory system more closely.
5. Chroma features (a 12-dimensional feature vector) are among the most popular features in music and
are closely related to pitch classes. A pitch class is a set of pitches that share the same chroma.
The entire spectrum is projected onto 12 bins representing the 12 semitones (chroma) of the musical
octave. These features are extensively used for chord recognition, key recognition and segmentation.
6. Harmonic tempo is the rate at which the chords change in a musical composition in relation to the
rate of notes.
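The MFCC computation described in item 4 can be sketched in numpy. This is a minimal pipeline with assumed parameter values (25 ms frames, 10 ms shift, 26 mel bands, 13 coefficients), not the exact openSMILE configuration:

```python
import numpy as np

def mel(f):      # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):  # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=16000, frame_ms=25, shift_ms=10, n_mels=26, n_ceps=13, n_fft=512):
    flen = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    # slice the signal into overlapping Hamming-windowed frames
    starts = range(0, len(signal) - flen + 1, shift)
    frames = np.array([signal[s:s + flen] * np.hamming(flen) for s in starts])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2           # power spectrum
    # triangular mel filterbank, bands equally spaced on the mel scale
    edges = mel_inv(np.linspace(mel(0), mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, m, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:m] = (np.arange(l, m) - l) / max(m - l, 1)
        fbank[i, m:r] = (r - np.arange(m, r)) / max(r - m, 1)
    logmel = np.log(power @ fbank.T + 1e-10)                  # log mel energies
    # DCT-II over the mel bands, keeping the first n_ceps coefficients
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * k + 1)) / (2 * n_mels))
    return logmel @ dct.T                                     # (n_frames, n_ceps)

sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)      # 1 s of 440 Hz tone
feats = mfcc(sig)                                             # 13 MFCCs per frame
```

In practice a toolkit such as openSMILE performs a tuned version of these same steps and appends the other feature groups listed above.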
By combining all these features, a 37-dimensional feature vector is extracted at the frame level, using
a frame size of 25 ms and a frame shift of 10 ms.
Table 3.1 Sentiment classification performance (in %) of songs using the 37-dimensional feature vector.
Region                  SVM    GMM    SVM+GMM
Entire song             52.8   54.9   69.7
Beginning of the song   55.8   73.5   88.3
Ending of the song      64.7   61.7   82.35
The classifiers, namely the Support Vector Machine (SVM), the Gaussian Mixture Model (GMM) and their
combination, are built on these features. GMMs are known for capturing the distribution of the
features, while SVMs are known for capturing discriminative information; hence the two models are
combined to improve the performance in detecting the sentiment of a song. In this work, 64 mixtures for
the GMM models and a Gaussian kernel for the SVM models are determined empirically.
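The SVM+GMM combination can be sketched with scikit-learn on toy two-dimensional data (the real system uses the 37-dimensional song features and 64 mixtures). Class-wise GMM log-likelihoods are converted to posteriors, assuming equal priors, and averaged with the SVM posteriors:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# toy stand-in for frame-level features of two sentiment classes
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

# one GMM per class models the feature distribution (4 mixtures for toy data)
gmms = [GaussianMixture(n_components=4, random_state=0).fit(X[y == k]) for k in (0, 1)]
# SVM with a Gaussian (RBF) kernel captures discriminative information
svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

def combined_posterior(x):
    loglik = np.array([g.score_samples(x) for g in gmms]).T     # (n, 2)
    gmm_post = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    gmm_post /= gmm_post.sum(axis=1, keepdims=True)             # equal-prior posteriors
    return 0.5 * (gmm_post + svm.predict_proba(x))              # average the evidence

pred = combined_posterior(X).argmax(axis=1)
acc = (pred == y).mean()
```

Averaging the two posteriors is one simple way to combine the distributional evidence of the GMM with the discriminative evidence of the SVM; weighted combinations are an obvious variant.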
The 37-dimensional feature vector is given as input to the classifiers to detect the sentiment. From
Table 3.1 it is observed that the entire song gives lower performance than the beginning or ending of
the song. This is because the whole song can mix different attributes (happy and sad), making the
sentiment unclear, so using only a part of the song increases performance. The beginning of a song
performs better than the ending, because the vocals and instruments present at the beginning may or may
not be sustained throughout the song. It is also observed that combining the evidence from both
classifiers gives the best performance for the beginning of the song. The average sentiment analysis
performance for the beginning, ending and whole song is 88.3%, 82.3% and 69.7%, respectively.
3.2.2 Sentiment analysis using lyric features
This section describes the process of extracting features from the textual lyrics of a song in
different regions. These features are then used to build a classifier for the positive or negative
sentiment of a song. All the lyrics are collected from the Internet. In the preprocessing step, stanza
names such as "pallavi" and "charanam" are removed from the lyrics, because these headings are common
to every song and therefore carry no sentiment information. When a line is to be repeated, it is marked
"x2" in the original lyrics; the "x2" is removed and the corresponding line is duplicated. For each
song in the database, a 300-dimensional feature vector is generated using the Doc2vec model [27]. As
there are 100 files, 100 feature vectors are generated, one per song. For evaluating accuracy, each
song is manually annotated with a tag such as happy or sad.
The Doc2Vec model is used to associate arbitrary documents with labels. Doc2vec extends the word2vec
algorithm to unsupervised learning of continuous representations for larger blocks of text such as
sentences, paragraphs or entire documents. This means Doc2vec learns to correlate labels with words,
rather than words with other words. The document vector tries to grasp the semantic meaning of all the
words in the context by placing itself in each and every context; thus, finally, the document vector
contains the semantic meaning of all the words in the trained context. In the word2vec architecture,
the two algorithms used are continuous bag of words and skip-gram: word2vec takes a text corpus as
input and outputs a set of feature vectors, one per word in the corpus. In the doc2vec architecture,
the two algorithms are distributed memory and distributed bag of words. In distributed memory, a
document vector is maintained in addition to the word vectors and keeps track of the whole document;
in distributed bag of words there are no word vectors, only a document vector trained to predict the
context.
Table 3.2 Sentiment classification performance (in %) of songs using lyric features.
Region                  SVM    NB     SVM+NB
Entire song             60.6   52.3   70.2
Beginning of a song     67.5   57.3   75.7
Ending of a song        64.4   55.8   72.4
All songs are given as input to doc2vec, which generates a single vector representing the meaning of
each document. These vectors are then used as input to supervised machine learning algorithms such as
SVM and Naive Bayes (NB) to associate documents with labels. SVM, NB and their combination are trained
on the vectors generated by doc2vec for positive or negative sentiment classification; given a test
song, the trained models classify it as either happy or sad. Three experiments are done on each song:
the first 30 seconds, the last 30 seconds, and the whole song.
GMMs require more training data than Naive Bayes and SVM, but for the textual part we have few
features (only one doc2vec feature vector per song), whereas the audio part yields many feature
vectors, since frame-level features are extracted from each song with a frame size of 20 ms. Therefore
GMM and SVM are used for the acoustic models, while Naive Bayes and SVM are used for the linguistic
features.
Features are extracted with the Doc2Vec model, which produces a fixed-length feature vector for each
input. From Table 3.2 it is observed that the combination of both classifiers gives better performance
for the beginning of the song than for the ending or the whole song. Features extracted from the whole
song give lower accuracy in detecting the sentiment, because of the ambiguity present in such a large
number of words. The beginning of a song performs better than the ending, because the words occurring
at the beginning tend to carry a strong sentiment. Using lyric features, the average sentiment analysis
performance for the beginning, ending and whole song is 75.7%, 72.4% and 70.2%, respectively.
3.2.3 Multimodal sentiment analysis
The main advantage of audio over textual data is that audio carries voice modulation. In textual data,
the only available source of information is the words and their dependencies.
Figure 3.1 Block diagram of multimodal sentiment analysis of songs.
Table 3.3 Multimodal sentiment classification performance (in %) with lyric and audio features.
Region                  Lyric   Audio   Lyric+Audio
Entire song             70.2    69.7    75.8
Beginning of a song     75.7    88.3    91.2
Ending of a song        72.4    82.3    85.6
This may sometimes be insufficient to convey the exact sentiment of a song. Audio data, by contrast,
contains multiple streams, both acoustic and linguistic, and the simultaneous use of these two
modalities helps build a better sentiment analysis system to decide whether a song is happy or sad.
The sequence of steps in the proposed approach is presented in Figure 3.1, and Table 3.3 reports the
sentiment accuracy obtained by combining lyric and audio features. To handle the similarity of the
sentiment classes, the decisions of classification models trained on the different modalities are
combined, which improves performance by 3 to 5%.
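The decision-level combination in Figure 3.1 amounts to averaging the class posteriors of the per-modality classifiers and hypothesizing the class with the highest average. A minimal sketch, with made-up probability values for three songs:

```python
import numpy as np

# posterior estimates [P(happy), P(sad)] from each modality's classifiers
p_lyric = np.array([[0.60, 0.40],
                    [0.30, 0.70],
                    [0.55, 0.45]])
p_audio = np.array([[0.90, 0.10],
                    [0.40, 0.60],
                    [0.20, 0.80]])

p_fused = (p_lyric + p_audio) / 2.0          # average the classifier evidence
labels = np.array(["happy", "sad"])[p_fused.argmax(axis=1)]
```

Note how the third song, which the lyric classifier alone would call happy, is pulled to sad by the more confident audio classifier.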
From these studies it is observed that performance is better at the beginning of a song than at the
ending or over the whole song. These studies cannot be transferred directly to reviews, however,
because when we express an opinion the sentiment is not necessarily confined to the beginning or
ending of the utterance. So, to detect the sentiment of reviews, stressed and normal regions are
identified in the input and the experiments are performed on these regions.
3.3 Detection of Sentiment from Selective Regions of Reviews
This section describes multimodal sentiment analysis performance on the reviews datasets.
3.3.1 Sentiment analysis using speech features
The whole audio input may not carry a single positive or negative sentiment, because human beings do
not sustain the same sentiment over an entire utterance; the utterance may contain positive, neutral
and negative portions. So stressed and normal regions are detected within the voiced segments to
extract the sentiment of the audio. MFCC, prosody and relative prosody features are extracted from the
stressed and normal regions of the audio input, and these features are used to build a classifier for
positive or negative sentiment. Each audio input is in .wav format with 16-bit resolution, 16000 Hz
sampling frequency and a mono channel. The process of finding stressed and normal regions in an audio
input is described in the following subsections.
3.3.1.1 Detecting stressed and normal regions
Stressed regions are detected based on the strength of excitation of the audio signal, computed using
the Zero-Frequency Filtering (ZFF) method [39]. The strength of excitation is defined as the slope of
the ZFF signal at each epoch. Zero-frequency filtering of audio signals provides useful information
about the excitation source, such as epoch locations, fundamental frequency (F0) and strength of
excitation. This analysis is useful for detecting voiced/unvoiced regions as well as stressed/normal
regions. The computation of the strength of excitation using the ZFF method [39] is described in the
following subsection.
3.3.1.2 Computation of strength of excitation using ZFF method
• The time-varying low-frequency bias is removed from the signal by pre-emphasizing the audio signal
using a difference operation:
x(k) = s(k) - s(k-1).     (3.1)
• The pre-emphasized audio signal is then passed through a cascade of two ideal digital resonators at
0 Hz, i.e.,
y(k) = \sum_{p=1}^{4} a_p y(k-p) + x(k),     (3.2)
where a_1 = +4, a_2 = -6, a_3 = +4 and a_4 = -1. The above operation can be realized by passing the
signal x(k) through the digital filter
H(z) = \frac{1}{(1 - z^{-1})^4}.     (3.3)
• The above step introduces an exponential trend in y(k), which is removed as follows:
zff(k) = y(k) - \bar{y}(k),     (3.4)
where
\bar{y}(k) = \frac{1}{2N+1} \sum_{m=-N}^{N} y(k+m).     (3.5)
Here a window of size 2N+1 samples is used to compute the local mean; typically this is the average
pitch period computed over a longer segment of speech.
• The trend-removed signal zff(k) is referred to as the zero-frequency filtered signal (ZFF signal).
The positive zero crossings of the ZFF signal correspond to the epochs, i.e., the instants of
significant excitation.
• The strength of excitation at the p-th epoch is given by the slope of the ZFF signal at that epoch:
SoE_p = |zff((l_e)_p + 1) - zff((l_e)_p - 1)|,     (3.6)
where p = 1, 2, ..., M, M is the total number of epochs and (l_e)_p is the location of the p-th epoch.
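The steps above can be sketched in numpy. The half-window N is assumed to be about half the average pitch period, and the local-mean subtraction is applied twice, a common practical refinement to suppress the residual trend:

```python
import numpy as np

def zff(s, N):
    """Zero-frequency filtering. s: audio samples, N: half window length
    (about half the average pitch period in samples)."""
    x = np.diff(s, prepend=s[0])                 # pre-emphasis, eq. (3.1)
    y = np.zeros_like(x)
    for k in range(len(x)):                      # cascaded 0-Hz resonators, eq. (3.2)
        y[k] = x[k]
        for p, a in enumerate((4.0, -6.0, 4.0, -1.0), start=1):
            if k - p >= 0:
                y[k] += a * y[k - p]
    w = 2 * N + 1
    for _ in range(2):                           # local-mean removal, eqs. (3.4)-(3.5)
        y = y - np.convolve(y, np.ones(w) / w, mode="same")
    return y

fs = 16000
t = np.arange(int(0.5 * fs)) / fs
s = np.sin(2 * np.pi * 100 * t)                  # synthetic 100 Hz "voicing"
z = zff(s, N=80)                                 # window of about one pitch period
# positive zero crossings of the ZFF signal are the epochs
epochs = np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1
epochs = epochs[(epochs > 0) & (epochs < len(z) - 1)]
soe = np.abs(z[epochs + 1] - z[epochs - 1])      # slope at each epoch, eq. (3.6)
```

For this 100 Hz input the epochs fall roughly one pitch period (160 samples at 16 kHz) apart, which is what the voiced-region detection in the next subsection relies on.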
3.3.1.3 Detecting stressed and normal regions of an audio signal
For each audio signal, voiced and unvoiced regions are identified using the ZFF signal. The strength of
excitation is computed at each epoch over the entire signal. Regions where the epochs are equally
spaced and the strength of excitation is high are treated as voiced regions; in unvoiced regions the
epochs are irregularly spaced and the strength of excitation is low. A simple threshold on the strength
of excitation therefore separates the voiced regions from the unvoiced regions.
Within the voiced regions, normal and stressed regions are identified based on the strength of
excitation. The strength of excitation computed at an epoch is held constant until the next epoch,
yielding a step signal, which is then subjected to 20 ms mean smoothing to generate a smooth contour
representing the strength of excitation. From this contour the average strength of excitation, referred
to as SoE_avg, is computed. Regions where the strength of excitation fluctuates more than 30% above or
below SoE_avg are considered stressed significant regions, and the remaining part of the audio is
identified as normal regions.
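Given the epoch locations and SoE values, the stressed/normal split can be sketched as follows; the 20 ms smoothing and the 30% deviation threshold follow the text, while the toy SoE values are made up:

```python
import numpy as np

def stressed_mask(soe, epochs, n_samples, fs=16000, dev=0.30):
    """Return a boolean mask over samples: True = stressed significant
    region, False = normal. soe[p] is the strength of excitation at
    epoch location epochs[p]."""
    step = np.zeros(n_samples)
    for p in range(len(epochs)):             # hold each SoE value until the next epoch
        end = epochs[p + 1] if p + 1 < len(epochs) else n_samples
        step[epochs[p]:end] = soe[p]
    win = int(0.020 * fs)                    # 20 ms mean smoothing
    smooth = np.convolve(step, np.ones(win) / win, mode="same")
    avg = smooth.mean()                      # SoE_avg
    return np.abs(smooth - avg) > dev * avg  # >30% above or below the average

# toy example: ten epochs, with a burst of raised SoE in the middle
epochs = np.arange(0, 1600, 160)
soe = np.array([1, 1, 1, 1, 2, 2, 1, 1, 1, 1], dtype=float)
mask = stressed_mask(soe, epochs, 1600)
```

The mask marks the central burst as a stressed significant region; the feature extraction that follows then operates separately on the masked and unmasked samples.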
An approach for detecting the stressed significant and normal regions of an utterance is shown in
Fig. 3.2, for the Spanish speech utterance “pero igual con las lavadas” shown in Fig. 3.2(a).
Fig. 3.2(b) shows the zero-frequency filtered signal. The strength of excitation computed at every
epoch using the algorithm described in Section 3.3.1.2 is shown in Fig. 3.2(c), and its mean-smoothed
version, using a frame size of 20 ms, is shown in Fig. 3.2(d).
Figure 3.2 An approach to detect stressed significant regions of Spanish speech utterance “pero igual
con las lavadas”. (a) Input speech signal, (b) ZFF signal (c) Strength of the excitation at each epoch,
(d) Strength of the excitation which is mean smoothed using a frame size of 20 ms and (e) Stressed
significant regions of utterance are detected.
The stressed significant and normal regions detected for the utterance, with the stressed regions
marked in red, are shown in Fig. 3.2(e).
3.3.1.4 Features extracted at stressed and normal regions
After detecting the stressed and normal regions, a sentiment model is built by extracting MFCC,
prosody, and relative prosody features from these regions. Features such as energy, pitch, and
duration, which are extracted at the suprasegmental level, are called prosody features. In this
study, the prosody features considered are mean pitch, maximum pitch, minimum pitch, mean
energy, maximum energy, minimum energy, and the duration ratio. Relative prosody features are
extracted from both regions. The MFCC features form a 13-dimensional feature vector.
Figure 3.3 shows the block diagram for extracting the relative prosody features from an audio
signal. After identifying the normal and stressed regions, prosody features are extracted from the
two regions separately. Pitch is a perceptual quantity of sound; in general, the fundamental
frequency (F0) is treated as pitch. In the present study, pitch is computed from the ZFF signal.
The positive zero crossings of the ZFF signal are hypothesized as epochs, and the time difference
between two consecutive epochs is the fundamental period (T0); the inverse of T0 gives F0. The
pitch and energy parameters are computed from the voiced regions, and the duration ratio is the
ratio of voiced duration to total duration. After extracting the prosody features from both
regions, the relative prosody features are calculated as the difference between the stressed and
normal prosody features. The three variations (mean, maximum, and minimum) of pitch and
energy, together with the duration ratio, form a 7-dimensional relative prosody feature vector.
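The 7-dimensional prosody vector and its stressed-minus-normal difference can be sketched as follows; the function names and the toy pitch/energy values are hypothetical:

```python
import numpy as np

def prosody_vector(pitch, energy, voiced_dur, total_dur):
    """7-dim prosody vector: [mean, max, min] pitch, [mean, max, min] energy,
    and the voiced-to-total duration ratio, as described in the text."""
    p, e = np.asarray(pitch, float), np.asarray(energy, float)
    return np.array([p.mean(), p.max(), p.min(),
                     e.mean(), e.max(), e.min(),
                     voiced_dur / total_dur])

def relative_prosody(stressed_vec, normal_vec):
    """Relative prosody features: stressed minus normal (7-dim)."""
    return stressed_vec - normal_vec

# toy per-region measurements (pitch in Hz, energy in arbitrary units)
stressed = prosody_vector([220, 260, 200], [70, 80, 60], 0.6, 1.0)
normal   = prosody_vector([180, 210, 170], [60, 72, 50], 0.5, 1.0)
rel = relative_prosody(stressed, normal)
```

The resulting `rel` vector is what the SVM in Section 3.3.1.5 consumes as the 7-dimensional relative prosody feature.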
Figure 3.3 Block diagram to extract relative prosody features.
The importance of relative prosody features over prosody features is shown in Figures 3.4 and 3.5.
To plot these figures, 30 positive and 30 negative sentiment audio signals are used. Figure 3.4
shows the histogram of the maximum pitch value and the relative maximum pitch value for
positive and negative sentiment audio signals. In Figure 3.4(a), it can be observed that the
maximum pitch values for positive (blue) and negative (red) sentiment audio signals overlap. In
contrast, the relative maximum pitch values of positive and negative sentiment audio signals are
concentrated in different bins, as shown in Figure 3.4(b).
Figure 3.4 Histogram of (a) maximum pitch, (b) relative maximum pitch for positive and negative
sentiment audio files.
Figure 3.5 shows the scatter plot of maximum pitch vs. maximum energy. In the figure, red
indicates negative and green indicates positive sentiment. Values plotted with ‘o’ and ‘⋄’ are
prosody features, and values plotted with ‘+’ and ‘□’ are relative prosody features. In Figure 3.5,
it can be observed that the overlap is higher for the prosody features (‘o’, ‘⋄’) than for the
relative prosody features (‘+’, ‘□’).
From Figures 3.4 and 3.5, it can be observed that relative prosody features have more
sentiment-specific discrimination than prosody features. Prosody features play a key role in how
human beings understand sentiment, but their natural variations are a limiting factor in using
them. In this work, the natural variations in the prosody features are minimized by using relative
prosody features.
Figure 3.5 Scatter plot of maximum pitch vs maximum energy for positive and negative sentiment audio
files.
3.3.1.5 Sentiment analysis experimental setup and results
Experiments are performed on Spanish and Hindi datasets. For each experiment, a 5-fold cross
validation is run on the entire dataset. In this work, two experimental setups are considered.
In the first experimental setup, a sentiment model is built using MFCC features (13-dimensional
feature vector) and a GMM classifier. GMMs are built with the 13-dimensional MFCC feature
vectors extracted from the whole input and with the MFCC features extracted from the stressed
regions of the audio signal. Different numbers of mixture components (8, 16, 32, and 64) are
considered for building the GMMs. The results obtained from the GMM with 32 mixtures are
given in Table 3.4.
In the second experimental setup, a sentiment model is built using prosody features and an SVM
classifier. In this work, linear-kernel SVMs, which are binary classifiers, are used. A linear-kernel
SVM separates a set of negative examples from a set of positive examples by finding the optimal
hyperplane with maximum margin. The SVM classifier is built using the following three feature
sets: (1) the prosody features (7-dimensional feature vector) of the entire audio signal; (2) the
prosody features from the stressed (7-dimensional) and normal (7-dimensional) regions, giving a
14-dimensional feature vector; and (3) the relative prosody features (stressed minus normal),
which are 7-dimensional. Sentiment models are built using these three different feature sets.
Table 3.4 Sentiment classification performance (in %) with different features.
Features Spanish Hindi
Pitch and intensity [44] 46.7 –
MFCC (Entire signal-13 dimensional) 58.3 53.6
MFCC (Stressed regions-13 dimensional) 67.8 65.2
Prosody (Entire signal-7 dimensional) 41.7 41.7
Prosody (Stressed and normal regions-14 dimensional) 75.0 67.8
Relative prosody (stressed minus normal-7 dimensional) 83.3 75.0
These prosody features are not tested on the GMM classifier because the data may not be
sufficient, as only one feature vector is extracted per utterance.
From the literature, it is observed that many audio features are extracted using toolkits and
classifiers are built using all of those features, which might create confusion in identifying the
sentiment of a signal [44]. Instead of using all the features, in our work only suitable features such
as prosody and MFCC are considered. Prosody features are extracted at the suprasegmental level
and MFCC features at the segmental level. As prosody features are extracted at the
suprasegmental level, they may detect the sentiment of a signal more accurately than MFCC
features, which are computed for each frame, because each frame may not carry the sentiment.
Human beings depend on suprasegmental information such as prosody to detect sentiment, but
the problem with prosody features is that they are not very dependable due to the natural
variations present in them. In the proposed approach for sentiment detection, the natural
variations in the prosody features are reduced by considering the relative prosody features, as can
be observed from Table 3.4.
Relative prosody features are extracted by taking the difference of the prosody features in the
stressed and normal regions. MFCC features are tested with the GMM classifier, while prosody
and relative prosody features are tested with the SVM classifier; the latter are not tested with
the GMM classifier because of the limited data. Table 3.4 shows the sentiment classification
performance with the different features. From Table 3.4, it is observed that features extracted
from the whole signal perform worse than features extracted from the stressed and normal
regions. It is also observed that, using relative prosody features, the rate of detecting the
sentiment for the Spanish and Hindi databases is 83.3% and 75.0%, which outperforms the
MFCC features extracted at stressed regions by 16.6% and 9.8%, and the prosody features
extracted from both regions by 8.3% and 7.2%, respectively. Our proposed method on the
Spanish database also outperforms [44] by 36.6%.
3.3.2 Sentiment analysis using text features
This section describes the process of extracting the text features of an audio signal. These
features are then used to build a classifier that identifies positive or negative sentiment. In a
preprocessing step, each audio input is transcribed manually and sentiment annotations are
assigned. For each audio input, a 300-dimensional feature vector is extracted, which was found to
give better results.
Many machine learning algorithms require the input to be represented as a fixed-length feature
vector. The Doc2Vec model, an extension of the existing Word2Vec model, is used to associate
documents with labels. Doc2Vec modifies Word2Vec into an unsupervised method that learns
continuous representations for larger blocks of text such as sentences, paragraphs, or whole
documents; that is, Doc2Vec learns to correlate labels with words rather than words with other
words. In the Word2Vec architecture, the two algorithms used are skip-gram and continuous bag
of words; in the Doc2Vec architecture, the corresponding algorithms are distributed bag of words
and distributed memory. All transcribed inputs are given to Doc2Vec, which generates a single
vector representing the meaning of each document. This vector is then used as input to a
supervised machine learning algorithm to associate documents with labels.
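The shape of this pipeline, one fixed-length vector per transcript paired with its sentiment label, can be sketched without the learned Doc2Vec embedding. Here a toy feature-hashing vectorizer stands in for Doc2Vec (the 300-dimensional size matches the text; the hashing scheme and documents are illustrative assumptions):

```python
import hashlib
import numpy as np

DIM = 300  # the text uses 300-dimensional document vectors

def doc_vector(text, dim=DIM):
    """Toy fixed-length document vector via feature hashing of words.
    A stand-in for Doc2Vec's learned embedding that keeps the same
    interface: one dim-length vector per transcript."""
    v = np.zeros(dim)
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        v[h % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = ["the battery life is great", "the screen broke in a week"]
labels = ["positive", "negative"]
X = np.stack([doc_vector(d) for d in docs])  # inputs to a supervised classifier
```

In the actual system these vectors come from a trained Doc2Vec model; the point of the sketch is only that each document becomes one fixed-length vector that is paired with a label for supervised training.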
Sentiment analysis based on text can be viewed as a text classification task, which is well handled
by SVMs given their strong classification performance. The SVM classifier is trained with the
vectors generated by Doc2Vec and the corresponding sentiment tags (positive/negative). Given a
test input, the trained model classifies it as either positive or negative.
Table 3.5 Sentiment classification performance (in %) using text features.
Classifier Spanish Hindi
SVM 65.5 63.4
From Table 3.5, it is observed that the rate of detecting the sentiment from the Spanish and
Hindi data using text features is 65.5% and 63.4%, respectively. Compared with [44], our system
performs better at detecting sentiment using text features.
3.3.3 Multimodal sentiment analysis
The advantage of analyzing audio, compared to textual data alone, is the availability of vocal
modulation. Textual data only carry information about the words and their dependencies, which
may be insufficient to convey the exact sentiment of the input. Audio data, in contrast, contain
multiple modalities, namely linguistic and acoustic streams. The decisions from the two
modalities are combined based on the highest average probability of the classifiers. The
simultaneous use of these two modalities can help create a better sentiment analysis model to
detect whether a given test input carries positive or negative sentiment.
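The fusion rule, averaging the two classifiers' class probabilities and taking the class with the highest average, can be sketched as follows; the function name and the probability values are hypothetical:

```python
def fuse(audio_probs, text_probs):
    """Decision-level fusion: average the per-class probabilities of the
    audio and text classifiers and pick the class with the highest mean."""
    avg = {c: (audio_probs[c] + text_probs[c]) / 2 for c in audio_probs}
    return max(avg, key=avg.get), avg

# e.g. audio model is confident positive, text model leans negative
label, avg = fuse({"positive": 0.8, "negative": 0.2},
                  {"positive": 0.4, "negative": 0.6})
```

With these toy scores the averaged positive probability (0.6) wins, so the fused decision is positive even though the text model alone disagreed.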
Table 3.6 presents the sentiment accuracy obtained by combining text and audio features using
the proposed method. For the audio model, the performance of detecting sentiment on the
Spanish reviews is improved by 36.5%, and for the text model the performance is improved by
1%. By combining the audio and text models, the rate of detecting the sentiment of a Spanish
audio input is 87.2%, an improvement of about 20% over the model of [44]. The multimodal
accuracy on the Hindi database is 81.4%.
Table 3.6 Sentiment classification performance (in %) for different models on Spanish and Hindi
datasets.
Modality Spanish [44] Spanish (Proposed) Hindi (Proposed)
Audio 46.75 83.3 75.0
Text 64.94 65.5 63.4
Audio+Text 67.42 87.2 81.4
3.4 Summary and Conclusions
In this chapter, we have proposed an approach to extract the sentiment of an input using both
the audio and text modalities. Instead of focusing on the entire input, experiments are performed
on selective regions, and good accuracy is observed, because the entire input may contain
positive, negative, and neutral content. For songs, experiments are performed on the beginning,
end, and whole song; for reviews, stressed significant and normal regions are identified based on
the strength of excitation.
Audio features such as chroma and spectral features, extracted at the beginning, at the end, and
from the whole song, are used to build song sentiment classifiers such as SVM and GMM. The
beginning of a song outperforms the end of the song and the whole song in detecting sentiment,
because the instruments and vocals that convey the sentiment in the beginning part of the song
may or may not be sustained throughout the song. Features such as MFCC, prosody (pitch,
energy, and duration ratio), and relative prosody features extracted at the stressed and normal
regions are used to detect the sentiment of an audio signal. Stressed regions of an input are
detected based on the strength of excitation using the ZFF-based method. These features are
then used to build sentiment models using GMM and SVM classifiers. From our experiments, it
is observed that features extracted from the selective regions of an input perform better than
those from the entire input. The relative prosody features extracted from both regions achieve a
higher rate of detecting the sentiment than MFCC and prosody features, because the natural
variations present in the prosody features are reduced by the relative prosody features. The text
features generated using Doc2Vec are used to build the SVM classifier. From the experiments, it
is also observed that combining the audio and text modalities improves the rate of detecting the
sentiment of an input.
In the next chapter we describe how multimodal sentiment analysis is developed using deep neural
networks.
Chapter 4
MULTIMODAL SENTIMENT ANALYSIS USING DEEP NEURAL
NETWORKS
Recently, neural networks have achieved good success in sentiment classification. In this chapter,
a neural network approach is presented to detect the sentiment of an online product review based
on its multimodal nature (audio and text). We also present how different deep neural network
classifiers are developed to detect the sentiment of reviews. For each audio input, Mel Frequency
Cepstral Coefficient (MFCC) features are extracted. These features are used to develop
sentiment models using Deep Neural Network (DNN) and Deep Neural Network Attention
Mechanism (DNNAM) classifiers. From the results, it is observed that the DNNAM classifier
gives better results than the DNN, because the DNN approach is frame-based whereas the
DNNAM approach performs utterance-level classification, thereby efficiently making use of
context. Further, textual features are extracted from the transcript of the audio input using the
Doc2Vec and Word2Vec models. Support Vector Machine (SVM) and Long Short-Term Memory
Recurrent Neural Network (LSTM-RNN) classifiers are used to develop sentiment models with
the textual features. From the results, it is observed that the LSTM-RNN outperforms the SVM
because the LSTM-RNN is able to memorize long temporal context. From the experimental
results, it is also observed that combining the audio and text features yields a significant
improvement in the performance of detecting the sentiment of online product reviews.
4.1 Sentiment Analysis using Audio Features
The process of extracting audio features from the input and developing the sentiment model is
described in this section. The extracted features are used to build a classifier that labels the input
as positive or negative sentiment. MFCC features are extracted from each input, and sentiment
models are developed using Gaussian Mixture Model (GMM), DNN, and DNNAM classifiers.
Figure 4.1 Block diagram of sentiment analysis using deep neural network.
4.1.1 Deep neural network
A deep neural network (DNN) is a neural network with multiple hidden layers of nodes between
the input and output layers. These hidden layers perform feature identification and processing in
a series of stages, with successive layers learning higher-level features. DNN performance depends
on the amount of training data: the more training data, the more accurate the model. Each DNN
is trained for 30 epochs with different numbers of layers and different numbers of nodes per layer.
In our work, up to four deep layers are explored. All the DNNs are trained with the Adam
optimizer, an adaptive learning-rate algorithm. The MFCC features considered in this study are
13-dimensional, 65-dimensional, and 130-dimensional: five 13-dimensional MFCC frames are
concatenated to obtain a 65-dimensional vector, and ten frames are concatenated to obtain a
130-dimensional vector. A single frame may not carry the sentiment, so experiments are
performed on combinations of frames, which results in better performance. Based on the input
dimension, the input layer has 13, 65, or 130 linear nodes. The output layer is a softmax layer
with 2 nodes, since there are 2 classes. During testing, the node that gives the maximum value
determines the claimed class. For testing, only 5 seconds of data are considered for each input.
The block diagram of sentiment analysis using a deep neural network is shown in Figure 4.1.
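The frame-concatenation step can be sketched as follows; the function name and the random MFCC matrix are illustrative, not from the thesis:

```python
import numpy as np

def stack_frames(mfcc, n):
    """Concatenate n consecutive 13-dim MFCC frames into one (13*n)-dim
    vector, as done for the 65-dim (n=5) and 130-dim (n=10) DNN inputs.
    Trailing frames that do not fill a full group are dropped."""
    t, d = mfcc.shape
    usable = t - t % n
    return mfcc[:usable].reshape(usable // n, d * n)

mfcc = np.random.randn(103, 13)   # toy utterance: 103 frames of 13-dim MFCCs
x65 = stack_frames(mfcc, 5)       # 65-dim inputs
x130 = stack_frames(mfcc, 10)     # 130-dim inputs
```

Note the trade-off visible in the shapes: the wider the concatenated vector, the fewer training vectors remain per utterance, which is the data-scarcity effect discussed below for the 130-dimensional case.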
Table 4.1 Performance (in %) of sentiment analysis using deep neural network.
Features Hindi (2 layers) Hindi (4 layers) Spanish (2 layers) Spanish (4 layers)
13-D MFCC 41.6 58.3 50.0 61.2
65-D MFCC 58.3 75.0 66.7 72.0
130-D MFCC 58.3 66.7 61.2 66.7
From Table 4.1, it is observed that the DNN with the 65-dimensional feature vector performs
better than with the 130-dimensional one, because with 130-dimensional vectors there are too few
feature vectors to train the DNN sufficiently. It is also observed that performance is better with
four layers. The rate of detecting the sentiment using the DNN classifier on the Hindi and
Spanish reviews is 75.0% and 72.0%, respectively.
A DNN can have any number of hidden layers and any number of nodes in each layer. However,
a simple feedforward neural network or DNN cannot achieve good performance just by adding
more layers, because training becomes ineffective with many layers. Even though we combined
frames, experimented with different numbers of layers, and observed good performance, the
drawback of the DNN system is that a decision is taken at each frame with a fixed context, while
the label is usually assigned to an entire utterance; the DNN cannot memorize long temporal
context. To overcome these problems, a DNNAM architecture is proposed to better capture the
temporal context and to perform utterance-wise sentiment classification.
4.1.2 Deep neural network attention mechanism
A DNNAM is a simple DNN augmented with an attention mechanism. The advantages of the
DNNAM are that it is able to memorize and that it can be parallelized, because it remains a
strictly feedforward network. It is used to classify the entire utterance rather than taking
frame-level decisions as in the DNN. In addition, the attention mechanism focuses on the input
feature frames that are most important for extracting the sentiment. In the DNNAM
architecture, the attention is computed from the input feature vectors. The block diagram of the
deep neural network attention model is shown in Figure 4.2.
Given an input sequence Y = {Y1, Y2, . . . , YT }, a hidden layer sequence H = {h1, h2, . . . , hT } is
computed by the regular DNN, and attention is computed on these hidden features. Using a
single-layer perceptron, the attention mechanism a(ht) shown in Figure 4.2 is computed, and a
softmax operation is then performed to normalize the values between zero and one.
H = [h1 h2 · · · hT ] (4.1)
γ = tanh(WaH + ba) (4.2)
α = softmax(γ) (4.3)
Figure 4.2 Deep neural network attention model [48].
In the above equations, γ is the output of a hyperbolic tangent (tanh) layer, and Wa, ba are the
parameters of the attention mechanism. These parameters are optimized along with the other
parameters of the network using the backpropagation algorithm, whereas α is referred to as the
attention vector. The vectors in the hidden state sequence ht are fed into the attention
mechanism a(ht) to produce the probability vector α.
The context vector c which is shown in Figure 4.2 is computed using the weighted average of H with
weight α.
c = Hα (4.4)
Then, the output vector is computed by transforming the c using output layer weights U and then
softmax operation is performed.
Z = softmax(Uc+ bo) (4.5)
In the above equation, bo is the output layer bias. From this equation, it can be inferred that a
single decision vector Z is predicted from the entire input utterance Y.
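Equations (4.1) through (4.5) can be traced end to end with a small NumPy sketch; the parameter values and layer sizes are random illustrative choices, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, T = 8, 6                      # hidden size, number of frames (illustrative)
H = rng.standard_normal((d, T))  # hidden sequence [h1 ... hT], Eq. (4.1)
Wa, ba = rng.standard_normal((1, d)), 0.0
U, bo = rng.standard_normal((2, d)), np.zeros(2)

gamma = np.tanh(Wa @ H + ba)     # Eq. (4.2): one attention score per frame
alpha = softmax(gamma.ravel())   # Eq. (4.3): normalized attention weights
c = H @ alpha                    # Eq. (4.4): context = weighted average of H
Z = softmax(U @ c + bo)          # Eq. (4.5): one decision for the utterance
```

The sketch makes the key property concrete: however many frames T the utterance has, the attention weights collapse the hidden sequence into one context vector c, so a single 2-class decision Z is emitted per utterance.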
The architecture can have any number of layers before or after the attention model. The number
of layers in this architecture is counted as the hidden layers before the context vector plus one
additional output layer. The architecture is trained for 30 epochs with different numbers of
layers and different numbers of nodes per layer. In our work, up to four deep layers are explored.
All the networks are trained with the Adam optimizer, an adaptive learning-rate algorithm.
Hyperparameters
Table 4.2 Performance (in %) of sentiment analysis using deep neural network attention mechanism.
Features Hindi (2 layers) Hindi (4 layers) Spanish (2 layers) Spanish (4 layers)
13-D MFCC 50.0 62.5 56.2 68.7
65-D MFCC 68.7 81.2 72.0 77.4
130-D MFCC 62.5 68.7 61.2 75.0
are adjusted using a validation set. As for the DNN, 13-dimensional, 65-dimensional, and
130-dimensional MFCC features are considered. Since a single frame may not carry the
sentiment, frames are concatenated, and combining frames is observed to improve performance.
The input layer has 13, 65, or 130 linear nodes, and the output layer is a softmax layer with
2 nodes, since the databases contain 2 classes. During testing, the node that gives the maximum
score determines the claimed class. For testing, only 5 seconds of data are used for each input.
From Table 4.2, it is observed that the DNNAM with the 65-dimensional feature vector performs
better than with the 130-dimensional one, because when more frames are combined there are
fewer feature vectors, which are not sufficient to train the network. It is also observed that
performance is higher with four layers. The rate of detecting the sentiment using the DNNAM
classifier on the Hindi and Spanish reviews is 81.2% and 77.4%, respectively. From Tables 4.1
and 4.2, it is observed that the DNNAM performs better with four layers than the DNN, because
the hidden layers before the attention mechanism capture context and integrate information
before the final decision is taken.
4.1.3 Gaussian mixture models
GMMs are well known to capture the distribution of data in the feature space. A Gaussian mixture
density is a sum of M weighted component densities [50] given by:
p(x_k|λ) = Σ_{r=1}^{M} w_r K_r(x_k), (4.6)
where x_k is an N-dimensional input vector, K_r(x_k), r = 1, . . . , M, are the component
densities, and w_r, r = 1, . . . , M, are the weights of the mixtures.
The product of a component Gaussian with its mixture weight, i.e., K_r(x_k)w_r, is termed the
weighted component density, and the Gaussian mixture density is the sum of these weighted
component densities. The accuracy in capturing the true distribution of the data depends on
various parameters such as the dimension of the feature vectors, the number of feature vectors,
and the number of mixture components. In this work, the Expectation Maximization (EM)
algorithm is used to train the GMM models using the audio features.
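Equation (4.6) can be evaluated directly for a toy mixture; diagonal covariances and the two-component, two-dimensional example are illustrative assumptions:

```python
import numpy as np

def gaussian(x, mean, var):
    """Diagonal-covariance Gaussian density K_r(x)."""
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / var)) / \
           np.sqrt(np.prod(2 * np.pi * var))

def gmm_density(x, weights, means, variances):
    """Eq. (4.6): p(x|lambda) = sum_r w_r * K_r(x)."""
    return sum(w * gaussian(x, m, v)
               for w, m, v in zip(weights, means, variances))

# toy 2-mixture model in 2-D: one component at the origin, one at (3, 3)
weights = [0.5, 0.5]
means = [np.zeros(2), np.full(2, 3.0)]
variances = [np.ones(2), np.ones(2)]
p = gmm_density(np.zeros(2), weights, means, variances)
```

EM training adjusts the weights, means, and variances to maximize this density over the training feature vectors; the sketch only shows how the trained model scores a point.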
Table 4.3 Performance (in %) of sentiment analysis using different classifiers.
Classifiers Hindi Spanish
GMM (64 Mixtures) 58.3 66.4
DNN (Four Layers) 75.0 72.0
DNNAM (Four Layers) 81.2 77.4
The MFCC features extracted from each input audio signal are given as input to the GMM. The
GMM is tested with different numbers of test cases and different numbers of mixtures (16, 32,
and 64). Here also, 5 seconds of data are considered for testing. From Table 4.3, it is observed
that the DNNAM with four hidden layers outperforms the DNN with four hidden layers and the
GMM with 64 mixtures.
4.2 Sentiment Analysis using Text Features
The process of developing the sentiment model using the text features of the input is described
in this section. These features are used to build a classifier of positive or negative sentiment. In
a preprocessing step, each audio input is manually transcribed and sentiment annotations are
also assigned manually.
Many machine learning algorithms require the input to be represented as a fixed-length feature
vector. Word2Vec and Doc2Vec are two models that represent the input as fixed-length feature
vectors. Word2Vec takes data from a corpus and produces a vector for each word. These vectors
represent words rather than entire documents, and they are interesting because similar words are
placed near each other in the vector space. The length of a word vector is adjustable and
independent of the size of the dictionary. In the Word2Vec architecture, the two algorithms used
are continuous bag of words and skip-gram: continuous bag of words predicts the current word
from a window of surrounding context words, whereas the continuous skip-gram model uses the
current word to predict the surrounding window of context words. The Doc2Vec model, on the
other hand, does not only represent words; it is an unsupervised method for learning continuous
representations of larger blocks of text such as sentences, paragraphs, and whole documents. In
the Doc2Vec architecture, the algorithms used are distributed memory and distributed bag of
words. Distributed memory randomly initializes a paragraph vector for each document and
predicts the next instance using the context words and the paragraph vector; the context window
slides across the document, but the paragraph vector is fixed. Distributed bag of words, in
contrast, uses only paragraph vectors and no word vectors: it takes a window of words in a
paragraph and randomly samples which one to predict using the paragraph vector (ignoring
word ordering).
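The difference between the two Word2Vec objectives can be made concrete by enumerating the training pairs each one derives from a sentence. This sketch only lists pairs, it does not train any vectors, and the sentence and window size are illustrative:

```python
def cbow_pairs(tokens, window=2):
    """(context words -> target word) pairs: CBOW predicts the current
    word from its surrounding window."""
    pairs = []
    for i, target in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((ctx, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(target word -> context word) pairs: skip-gram predicts each
    context word from the current word."""
    return [(t, c) for (ctx, t) in cbow_pairs(tokens, window) for c in ctx]

sent = "the camera quality is poor".split()
cbow = cbow_pairs(sent)
skip = skipgram_pairs(sent)
```

For the middle word, CBOW forms one pair (["the", "camera", "is", "poor"] -> "quality"), whereas skip-gram expands the same window into four separate ("quality" -> context word) pairs, which is exactly the inversion of prediction direction described above.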
Two types of sentiment models are built for text classification: an SVM model and an
LSTM-RNN model.
4.2.1 Support vector machine
The support vector machine classifier is intended to solve two-class classification problems. The
main aim is to find linear separators in the search space that best separate the different classes.
The basic principle of a support vector machine is that input vectors which are not linearly
separable are transformed into a higher-dimensional feature space, where an optimal linear
hyperplane is constructed to separate the classes. SVM classification is well suited to input of a
sparse nature. An SVM [4] is a two-class classifier constructed from sums of kernel functions.
The SVM classifier is trained with the vectors generated by Doc2Vec and the corresponding
sentiment tags: positive or negative. Doc2Vec generates a single vector for each manually
transcribed input, representing the meaning of the document. A 300-dimensional feature vector
is generated from each text input, which was found to give better results. To associate
documents with labels, this vector is used as input to a supervised machine learning algorithm
such as SVM. The problem with the SVM is that it does not model sequence information; it uses
the given features directly. To address this, experiments are also performed with an LSTM-RNN
model.
4.2.2 LSTM-RNN model
A recurrent neural network (RNN) is a network of neurons with feedback connections. It can
learn sequence processing tasks that are not learnable by traditional machine learning methods
such as feedforward networks and SVMs, which have no internal state at all. In spite of its
advantages, the RNN suffers from the vanishing gradient problem, which is overcome by the
LSTM-RNN. The LSTM-RNN prevents the backpropagated errors from vanishing; the errors can
flow backwards through any number of virtual layers unfolded in space. The LSTM-RNN model
automatically learns a flexible history length and builds an abstracted feature representation.
Compared to the SVM model, the LSTM-RNN model has more parameters that need to be
learned during training. The block diagram of the LSTM is shown in Figure 4.3.
The LSTM-RNN model [18] is a recurrent neural network with multiple hidden layers and a
special memory unit. In an LSTM-RNN, information is stored in two ways: long-term memory as
weights, which are modified by backpropagation, and short-term memory as neuron activations,
which capture the recent history. The LSTM-RNN can capture long dependencies in a sequence
by introducing a memory unit and a gate mechanism that decides how to utilize and update the
information kept in the memory cell. From Figure 4.3, we can see that each memory cell ct has
three gate units: the input gate it, the output gate ot, and the forget gate ft. The input gate
controls the flow of input activations into the memory cell. The output gate controls the flow of
cell activations into the rest of the network. The forget gate scales the internal state of the cell
before adding it back as input through the self-recurrent connection of the cell, thereby
adaptively forgetting or resetting the cell memory. In addition, the LSTM-RNN architecture
contains peephole connections from its internal cells to the gates
Figure 4.3 LSTM Network.
in the same cell to learn precise timing of the outputs. An LSTM-RNN network computes a mapping
from an input sequence to an output sequence by calculating the network unit activations.
i_t = σ(W_ix x_t + W_im m_{t−1} + W_ic c_{t−1} + b_i) (4.7)
f_t = σ(W_fx x_t + W_fm m_{t−1} + W_fc c_{t−1} + b_f) (4.8)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g(W_cx x_t + W_cm m_{t−1} + b_c) (4.9)
o_t = σ(W_ox x_t + W_om m_{t−1} + W_oc c_t + b_o) (4.10)
m_t = o_t ⊙ h(c_t) (4.11)
y_t = Φ(W_ym m_t + b_y) (4.12)
where the W terms denote weight matrices (e.g., W_ix is the matrix of weights from the input to
the input gate), the b terms denote bias vectors (b_i is the input gate bias vector), σ is the
logistic sigmoid function, and i, f, o, and c are respectively the input gate, forget gate, output
gate, and cell activation vectors, all of which are the same size as the cell output activation
vector m; ⊙ denotes the element-wise product of vectors; g and h are the cell input and cell
output activation functions, generally tanh;
Table 4.4 Performance (in %) of sentiment analysis using text features.
Classifier Hindi Spanish
SVM 65.5 63.6
LSTM-RNN 72.4 68.3
and Φ is the network output activation function, which is softmax.
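Equations (4.7) through (4.12) can be checked with a small NumPy sketch of one LSTM cell step; the randomly initialized parameters and layer sizes are illustrative assumptions, not trained weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_in, n_cell, n_out = 4, 3, 2                      # illustrative sizes
p = lambda *shape: 0.1 * rng.standard_normal(shape)  # small random parameters

Wix, Wim, Wic, bi = p(n_cell, n_in), p(n_cell, n_cell), p(n_cell, n_cell), p(n_cell)
Wfx, Wfm, Wfc, bf = p(n_cell, n_in), p(n_cell, n_cell), p(n_cell, n_cell), p(n_cell)
Wcx, Wcm, bc      = p(n_cell, n_in), p(n_cell, n_cell), p(n_cell)
Wox, Wom, Woc, bo = p(n_cell, n_in), p(n_cell, n_cell), p(n_cell, n_cell), p(n_cell)
Wym, by           = p(n_out, n_cell), p(n_out)

def lstm_step(x, m_prev, c_prev):
    i = sigmoid(Wix @ x + Wim @ m_prev + Wic @ c_prev + bi)    # Eq. (4.7)
    f = sigmoid(Wfx @ x + Wfm @ m_prev + Wfc @ c_prev + bf)    # Eq. (4.8)
    c = f * c_prev + i * np.tanh(Wcx @ x + Wcm @ m_prev + bc)  # Eq. (4.9)
    o = sigmoid(Wox @ x + Wom @ m_prev + Woc @ c + bo)         # Eq. (4.10)
    m = o * np.tanh(c)                                         # Eq. (4.11)
    return m, c, softmax(Wym @ m + by)                         # Eq. (4.12)

m, c = np.zeros(n_cell), np.zeros(n_cell)
for x in rng.standard_normal((5, n_in)):   # a toy 5-step input sequence
    m, c, y = lstm_step(x, m, c)
```

Note the peephole connections: the input and forget gates read the previous cell state c_{t−1}, while the output gate reads the already-updated c_t, exactly as in Eqs. (4.7), (4.8), and (4.10).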
In this work, an LSTM network with 32 units and a single output neuron with a sigmoid activation function is used to make 0/1 predictions for the two classes, positive and negative. Since this is a binary classification problem, a log loss function is used and the network is optimized with the ADAM optimizer. The model is fit over 100 epochs with a batch size of 256. Word2Vec, rather than Doc2Vec, is used to obtain the initial word vectors for the LSTM-RNN: Doc2Vec already produces a single summary vector for a whole sentence or document, whereas an RNN is designed to model sequences, so training an RNN on Doc2Vec vectors offers no additional advantage.
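The log loss (binary cross-entropy) objective mentioned above has a simple closed form; a minimal NumPy version, shown for illustration only (not the exact training code):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy for 0/1 labels y_true and predicted
    positive-class probabilities y_pred; eps clips probabilities
    away from 0 and 1 to keep the logarithms finite."""
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

For example, an uninformative classifier that outputs 0.5 for every review incurs a loss of ln 2 ≈ 0.693, while confident correct predictions drive the loss toward zero; minimizing this quantity is what ADAM does during the 100 training epochs.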
From Table 4.4, it is observed that the sentiment detection rate on both databases is higher with LSTM-RNN than with SVM.
4.3 Multimodal Sentiment Analysis
Since sentiment classification with textual features alone and with audio features alone each has limitations, in this work we carried out multimodal sentiment analysis. Social media are a huge untapped source of user opinion on various products and services. Multimodality entails the use of multiple media, such as audio and video in addition to text, to enhance the accuracy of sentiment analyzers.
The main advantage of analyzing audio opinions, compared to their textual counterparts, is the availability of speech cues. Speech plays a nontrivial role in emotion recognition: it conveys emotion information through both the explicit (linguistic) message and the implicit (paralinguistic) message. In textual opinions, the only sources of information are the words and their dependencies, which may sometimes prove insufficient to convey the exact sentiment of the user. Audio opinions, in contrast, naturally contain both acoustic and linguistic data streams, and the vocal modulations in a recorded response help determine the tone of the speaker. Our experiments likewise show that the simultaneous use of these two modalities helps create a better sentiment analysis model for deciding whether a given test input carries positive or negative sentiment. The block diagram for multimodal sentiment analysis is shown in Figure 4.4.
Figure 4.4 Block diagram of multimodal sentiment analysis.
Table 4.5 Performance (in %) of multimodal sentiment analysis.
Modality Hindi Spanish
Audio 81.2 77.4
Text 72.4 68.3
Audio + Text 85.6 82.3
From Table 4.5, it is observed that combining the two modalities, text and audio, significantly improves the rate of detecting the sentiment of product reviews on both databases compared to either single modality.
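The chapter does not spell out the fusion rule at this point, so the following is only one plausible reading of the "Audio + Text" row of Table 4.5: a weighted score-level (late) fusion of the two classifiers' positive-class probabilities, where the weight w is a hypothetical tuning parameter favouring the stronger (audio) modality.

```python
import numpy as np

def late_fusion(p_audio, p_text, w=0.6):
    """Weighted score-level fusion of two classifiers' positive-class
    probabilities (w and the 0.5 threshold are illustrative choices).
    Returns the fused probabilities and the hard 0/1 decisions."""
    p = w * np.asarray(p_audio, dtype=float) + (1.0 - w) * np.asarray(p_text, dtype=float)
    return p, (p >= 0.5).astype(int)
```

Feature-level (early) fusion, where the audio and text feature vectors are concatenated before a single classifier is trained, is the usual alternative; the choice between the two mainly trades modelling of cross-modal interactions against robustness when one modality is noisy.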
4.4 Summary and Conclusions
In this chapter, we proposed an approach to detect the sentiment of a given input using both audio and text information. MFCC features are extracted from audio, and sentiment models are built using DNNAM, DNN and GMM classifiers. DNNAM and DNN are tested with different numbers of layers and nodes, whereas GMM is tested with different numbers of mixture components and different test cases. From our experiments, it is observed that the DNNAM classifier with 65-dimensional MFCC features detects the sentiment of an input more accurately than the other two feature dimensions, and that the DNNAM classifier outperforms the DNN and GMM classifiers because its attention mechanism captures the context present in the review. For text, features generated using Doc2Vec and Word2Vec are used to build models with the SVM and LSTM-RNN classifiers. From the experimental studies, it is observed that LSTM-RNN outperforms the SVM classifier because SVM does not model sequence information. It is also observed that combining the two modalities, audio and text, significantly improves the sentiment detection rate.
Chapter 5
SUMMARY AND CONCLUSIONS
Sentiment analysis is a challenging field with many obstacles, as it involves natural language processing. It has become quite popular and has led to the building of better products, the understanding of users' opinions, and the execution and management of business decisions. It has a wide variety of applications that can benefit from its results, such as news analytics, marketing and question answering. The challenge of this field is to develop a machine's ability to understand text as human readers do. Extracting important insights from opinions expressed on the Internet, especially on social media blogs, is vital for many companies and institutions, whether in terms of product feedback, public mood, or investors' opinions. People also rely on reviews and opinions when deciding whether to buy a product.
The scarcity of resources is one of the biggest challenges in multimodal sentiment analysis for Indian languages. In this work, we built a Telugu songs dataset and a Hindi product reviews dataset. Based on inter-annotator agreement, the Telugu songs dataset has 110 songs (55 positive and 55 negative) and the Hindi product reviews dataset has 100 reviews (50 positive and 50 negative). For the songs, the lyrics are collected from the Internet; for the reviews, the transcription and the positive/negative sentiment annotation are performed manually. We also work on a Spanish product review database which is available online.
While developing multimodal sentiment analysis, instead of taking the entire input and extracting several features with a toolkit, we identify selective regions of the input and perform experiments on those regions by extracting specific features.
For both the lyrics and the audio of the Telugu songs, experiments are performed on the beginning, the ending, and the whole song. Experimental results show that the rate of detecting the sentiment of a song is higher at the beginning and the ending than over the whole song, because the whole song carries mixed attributes (happy and sad) that blur the sentiment; using only part of the song therefore increases performance. Performance at the beginning is higher than at the ending because the instruments and vocals that convey the sentiment in the beginning part of the song may or may not be sustained throughout the song. It is also observed that combining the two modalities, audio and text, significantly improves the sentiment detection rate.
The beginning/ending experiments are not applied to the reviews, because in a review the opinion is not confined to the beginning or the ending. Instead, to extract the sentiment from the audio, an algorithm is developed to detect stressed and normal regions based on the strength of excitation, using a ZFF-based method. MFCC, prosody and relative prosody features extracted from these regions are used to build sentiment models with GMM and SVM classifiers. From our experiments, it is observed that performance on the selective regions is better than on the whole input. The relative prosody features extracted from the stressed and normal regions perform better than MFCC and prosody features, because the natural variations present in the prosody features are reduced by the relative prosody features. For text, features generated using Doc2Vec are used to build sentiment models with an SVM classifier. It is also observed that combining the two modalities further improves the sentiment detection rate.
As current research moves toward deep architectures, we also build sentiment models using deep neural network classifiers. The selective-regions approach does not work well with deep neural networks because of the limited data, so in this work we experiment with MFCC feature vectors of different dimensions: 13, 65 and 130. MFCC features are extracted from the audio, and sentiment models are built using DNNAM, DNN and GMM classifiers. DNNAM and DNN are tested with different numbers of layers and nodes, whereas GMM is tested with different numbers of mixture components and different test cases. From our experiments, it is observed that the DNNAM classifier with 65-dimensional MFCC features detects the sentiment of an input more accurately than the other dimensions, since a single frame cannot carry the sentiment, and that the DNNAM classifier outperforms the DNN and GMM classifiers because its attention mechanism captures the context present in the review. For text, features generated using Doc2Vec and Word2Vec are used to build models with the SVM and LSTM-RNN classifiers. From the experimental results, it is observed that LSTM-RNN outperforms the SVM classifier because SVM does not model sequence information. It is also observed that combining the two modalities, audio and text, significantly improves the sentiment detection rate.
The following is a list of some directions for future work:
1. The databases need to be enlarged. In our work, the selective-regions approach is not applied to the deep neural network classifiers because of the limited data; the performance of deep neural networks depends on the amount of training data, and more training data yields better accuracy.
2. This work focused on only two modalities, text and audio. The performance can be improved by combining them with the video modality.
3. The accuracy of text sentiment analysis can be improved by using rule-based and linguistic approaches.
RELATED PUBLICATIONS
PUBLISHED:
• Harika Abburi, Eswar Sai Akhil Akkireddy, Suryakanth V Gangashetty, and Radhika Mamidi.
“Multimodal Sentiment Analysis of Telugu Songs”. In Proc. 4th Workshop on Sentiment Analysis
where AI meets Psychology (SAAIP) co-located with 25th International Joint Conference on
Artificial Intelligence (IJCAI), Hilton, New York City, USA, pages 48-52, 2016.
• Harika Abburi, Manish Shrivastava, and Suryakanth V Gangashetty. “Improved Multimodal Sentiment Detection using Stressed Regions of Audio”. In Proc. International Conference TENCON, Marina Bay Sands, Singapore, pages 2834-2837, 2016.
• Harika Abburi, Manish Shrivastava, and Suryakanth V Gangashetty. “Multimodal Sentiment
Analysis using Deep Neural Network”. In Proc. Fourth International Conference on Mining
Intelligence and Knowledge Exploration (MIKE), Mexico, pages 58-65, 2016.
COMMUNICATED:
• Harika Abburi, K N R K Raju Alluri, Anil Kumar Vuppala, Manish Shrivastava, and Suryakanth
V Gangashetty. “Sentiment Analysis using Relative Prosody Features”. Submitted to Tenth In-
ternational Conference on Contemporary Computing (IC3), Noida, India, 2017.