Continuous Sentiment Intensity Prediction based on Deep Learning
Yunchao He (何云超)
2015.9.15 @ Yuan Ze University
“unbelievably disappointing”
“Full of zany characters and richly applied satire, and some great plot twists”
“this is the greatest screwball comedy ever filmed”
“It was pathetic. The worst part about it was the boxing scenes.”
Sentiment Analysis: using NLP, statistics, or machine learning methods to extract, identify, or otherwise characterize the sentiment content of a text unit.
Sometimes called opinion mining, although in that case the emphasis is on extraction.
Other names: opinion extraction, sentiment mining, subjectivity analysis.
Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence? Is despair increasing?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Short text classification based on semantic clustering
Sentiment intensity prediction using CNN
Transfer learning*
* Future work
People express opinions in complex ways:
In opinion texts, lexical content alone can be misleading.
Intra-textual and sub-sentential reversals, negation, and topic changes are common.
Rhetorical devices such as sarcasm, irony, and implication are frequent.
Tokenization
Feature Extraction: n-grams, semantic, syntactic features, etc.
Classification using different classifiers: Naïve Bayes, MaxEnt, SVM
Drawback: feature sparsity
S1: I really like this movie
[... 0 0 1 1 1 1 1 0 0 ...]
S1: This phone has a good keypad
S2: He will move and leave her for good
A clustering algorithm aggregates short texts into larger clusters, each sharing the same topic and the same sentiment polarity. This reduces the sparsity of the short-text representation while keeping it interpretable.
S1: it works perfectly! Love this product
S2: very pleased! Super easy to, I love it
S3: I recommend it
Vocabulary: it, works, perfectly, love, this, product, very, pleased, super, easy, to, I, recommend
S1: [1 1 1 1 1 1 0 0 0 0 0 0 0]
S2: [0 0 0 1 0 0 1 1 1 1 1 1 0]
S3: [1 0 0 0 0 0 0 0 0 0 0 1 1]
S1+S2+S3: [...0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0...]
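Merging the three sentence vectors is a logical OR, which shows how clustering densifies the representation; a minimal sketch:

```python
# Element-wise OR (max) of the three bag-of-words vectors above:
# merging a cluster's texts yields a much denser combined vector.
s1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s2 = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
s3 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
merged = [max(a, b, c) for a, b, c in zip(s1, s2, s3)]
# every vocabulary entry of the cluster is now covered
```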
Training data labeled with positive and negative polarity.
A clustering algorithm (K-means here; K-means, KNN, LDA, … are options) is used to cluster the positive and the negative texts separately.
Unclustered training texts:
works perfectly! Love this product
completely useless, return policy
very pleased! Super easy to, I am pleased
was very poor, it has failed
highly recommend it, high recommended!
it totally unacceptable, is so bad

Topical clusters after clustering:
Positive cluster: works perfectly! Love this product / very pleased! Super easy to, I am pleased / highly recommend it, high recommended!
Negative cluster: completely useless, return policy / was very poor, it has failed / it totally unacceptable, is so bad
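A tiny K-means sketch over dense text vectors, run once on the positive texts and once on the negative texts; the `kmeans` helper and toy vectors are illustrative, not the authors' implementation:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-means: assign each vector to the nearest centroid
    (squared Euclidean distance), then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[c])))
            clusters[j].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return clusters, centroids

# Two well-separated toy groups should end up in two clusters.
clusters, _ = kmeans([[0.0, 0.0], [0.0, 1.0],
                      [10.0, 10.0], [10.0, 11.0]], k=2)
```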
Classifier: Multinomial Naïve Bayes
Probabilistic classifier: get the probability of a label given a clustered text.

\hat{s} = \arg\max_{s \in S} P(s \mid C_i) = \arg\max_{s \in S} P(s) \prod_{j=1}^{N} P(C_{i,j} \mid s)
(Bayes' theorem plus the independence assumption)

\hat{P}(s) = N_s / N

\hat{P}(C_{i,j} \mid s) = \frac{N(C_{i,j}, s) + 1}{\sum_{x \in V} N(x, s) + |V|}
(add-one smoothing over vocabulary V)
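These formulas admit a short reference implementation; the toy data and the `train_mnb`/`predict` names are illustrative:

```python
import math
from collections import Counter

# Multinomial Naive Bayes with add-one smoothing, mirroring the
# formulas above (log-space to avoid underflow).
def train_mnb(docs, labels):
    vocab = {w for d in docs for w in d.split()}
    prior, word_counts, totals = {}, {}, {}
    for s in set(labels):
        idx = [i for i, l in enumerate(labels) if l == s]
        prior[s] = len(idx) / len(labels)          # P(s) = N_s / N
        counts = Counter(w for i in idx for w in docs[i].split())
        word_counts[s], totals[s] = counts, sum(counts.values())
    return vocab, prior, word_counts, totals

def predict(doc, vocab, prior, word_counts, totals):
    def log_post(s):
        lp = math.log(prior[s])
        for w in doc.split():                      # smoothed P(w | s)
            lp += math.log((word_counts[s][w] + 1)
                           / (totals[s] + len(vocab)))
        return lp
    return max(prior, key=log_post)                # argmax over labels

model = train_mnb(["love this product", "works perfectly love",
                   "totally unacceptable bad", "very poor failed"],
                  ["pos", "pos", "neg", "neg"])
label = predict("love perfectly", *model)
```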
Given an unlabeled text x_j, we use Euclidean distance to find the most similar positive cluster C_m and the most similar negative cluster C_n.
The sentiment of x_j is estimated from how the class probabilities of the two clusters change when x_j is merged into them (contrast with KNN).
This is called the two-stage-merging method, as each unlabeled text is merged twice.

f(x_j) = \begin{cases} 0, & |P(NC_m) - P(C_m)| < |P(NC_n) - P(C_n)| \\ 1, & \text{otherwise} \end{cases}

where NC_m, NC_n denote C_m, C_n after merging x_j.
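A sketch of the decision rule, under the assumption that the smaller probability change marks the matching polarity; `prob_change` is a hypothetical callback standing in for the Naïve Bayes probability computation:

```python
def nearest_cluster(x, centroids):
    """Index of the centroid closest to x (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(x, centroids[i])))

def two_stage_label(x, pos_centroids, neg_centroids, prob_change):
    """Two-stage merging: find the nearest positive and negative
    clusters, merge x into each, and label x by whichever cluster's
    class probability changes least. prob_change(cluster_id, x) is a
    hypothetical |P(NC) - P(C)| callback."""
    m = nearest_cluster(x, pos_centroids)
    n = nearest_cluster(x, neg_centroids)
    d_pos = prob_change(("pos", m), x)
    d_neg = prob_change(("neg", n), x)
    return "positive" if d_pos < d_neg else "negative"

# Toy callback for illustration only.
toy_change = lambda cluster, x: 0.1 if cluster[0] == "pos" else 0.4
label = two_stage_label([0.9, 1.0], [[1.0, 1.0]], [[-1.0, -1.0]],
                        toy_change)
```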
Dataset: Stanford Twitter Sentiment Corpus (STS)
Baseline: bag-of-unigrams and bigrams without clustering
Evaluation Metrics: accuracy, precision, recall
The average precision and accuracy are 1.7% and 1.3% higher than those of the baseline method.

Method      Accuracy  Precision  Recall
Our Method  0.816     0.820      0.813
Bigrams     0.805     0.807      0.802
Continuous sentiment intensity provides a fine-grained representation of sentiment.
Representing sentiment as valence-arousal (VA) values can easily be converted to discrete categories.
Example: “unbelievably disappointing” → Model → V: −0.5, A: 0.3
Lexicon-based methods: find the relationship between word-level and sentence-level sentiment values. Word-level information comes from a sentiment lexicon, e.g. ANEW.
Paltoglou 2013: weighted arithmetic mean, weighted geometric mean
Malandrakis 2013: linear regression

Paltoglou, G., Theunis, M., Kappas, A., & Thelwall, M. (2013). Predicting emotional responses to long informal text. IEEE Transactions on Affective Computing, 4(1), 106-115.
Malandrakis, N., Potamianos, A., Iosif, E., & Narayanan, S. (2013). Distributional semantic models for affective text analysis. IEEE Transactions on Audio, Speech, and Language Processing, 21(11), 2379-2392.
To find the relationship between words and sentence-level sentiment.
                CNN Method      Lexicon-based Methods
Word            Dense vector    VA value
Relationship    Auto-learned    Manually specified
Training data   Many            Few or none
Word order      Considered*     Not considered
Interpretation  Black box       Easy
Sentence Matrix -> Convolution Operator -> Max Pooling -> Regression
Word Representation: dense vector, distributed representation
[Figure: example Chinese sentence, one word per row of the sentence matrix — 我们的 / 心 / 不像 / 明镜 / 不可以 / 美丑 / 善恶 / 全部 / 包容 (“our hearts are not like mirrors; they cannot embrace all beauty and ugliness, good and evil”)]
[Figure: word-embedding space where similar words cluster together — boat / ship / vessel; good / happy / glad; Beijing / Shanghai]
Semantic information of a word is encoded in the dense vector.
Sentence Matrix -> Convolution Operator -> Max Pooling -> Regression
[Figure: the example sentence matrix, one word vector per row, with a convolution filter sliding over it]

c_i = f(w \cdot S_{i:i+m-1,:} + b)

Dimensionality is reduced, and parameter sharing reduces the number of model parameters.
f: activation function — ReLU, tanh, sigmoid, …
ReLU: f(x) = max(0, x)
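The convolution step can be sketched in miniature (toy 2×2 filter, 2-dimensional embeddings, ReLU activation; not the authors' code):

```python
# 1-D convolution over a sentence matrix S: filter w spans m rows
# (words) and the full embedding width; ReLU is applied per window.
def convolve(S, w, bias):
    m = len(w)                               # filter height in words
    flat_w = [x for row in w for x in row]
    out = []
    for i in range(len(S) - m + 1):
        window = [x for row in S[i:i + m] for x in row]
        z = sum(wv * xv for wv, xv in zip(flat_w, window)) + bias
        out.append(max(0.0, z))              # ReLU: f(x) = max(0, x)
    return out

S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 words, 2-dim vectors
w = [[1.0, -1.0], [1.0, -1.0]]              # 2x2 filter (m = 2)
feature_map = convolve(S, w, 0.5)
```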
Sentence Matrix -> Convolution Operator -> Max Pooling -> Regression
Aggregate the information and capture the most important features
[Figure: the sentence matrix with its pooled feature maps]
Example: max pooling over the feature map [3, 6, 79, 7, 54] with a 5×1 filter and stride 1 keeps only the maximum value, 79.
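A minimal max-pooling helper matching the example above:

```python
# Max pooling over a 1-D feature map: slide a window of size k and
# keep the maximum in each window, capturing the strongest feature.
def max_pool(values, k, stride=1):
    return [max(values[i:i + k])
            for i in range(0, len(values) - k + 1, stride)]

pooled = max_pool([3, 6, 79, 7, 54], k=5)
```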
Sentence Matrix -> Convolution Operator -> Max Pooling -> Regression
[Figure: pooled features x_1, x_2, …, x_n feeding a linear output layer]
Linear regression: h(x_i, w) = w^T x_i = \hat{y}_i
Objective function: mean squared error (MSE)
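The linear output layer and its MSE objective in miniature (toy weights, not trained values):

```python
# Linear regression head on pooled features, scored with MSE.
def predict(x, w):
    return sum(wi * xi for wi, xi in zip(w, x))   # h(x, w) = w^T x

def mse(y_true, y_pred):
    return sum((t - p) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)

w = [0.5, -0.25]
y_pred = [predict(x, w) for x in [[2.0, 4.0], [1.0, 0.0]]]
error = mse([0.1, 0.5], y_pred)
```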
Learning Algorithm: stochastic gradient descent (SGD)
Parameters learned from labeled data: word vectors, convolution filter weights, linear regression weights
Labeled data — Chinese: the CVAT dataset; English: the VADER dataset
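The SGD loop can be sketched for the linear head alone (in the full model, word vectors and filter weights are updated the same way via backpropagation; this toy data is an assumption for illustration):

```python
import random

# Plain SGD minimizing MSE for a linear model: pick a random
# sample, compute the gradient of (w^T x - y)^2, step against it.
def sgd(samples, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        x, y = rng.choice(samples)
        pred = sum(wi * xi for wi, xi in zip(w, x))
        grad = 2.0 * (pred - y)          # d(w^T x - y)^2 / d(pred)
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
    return w

# Noiseless samples from y = 2*x; SGD should recover w ≈ [2.0].
w = sgd([([1.0], 2.0), ([2.0], 4.0), ([3.0], 6.0)])
```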
Dataset  Size   #Words  L (avg. length)  Dims
CVAT     720    21094   192.10           V+A
Tweets   4000   15284   13.62            V
Movie    10605  29864   18.86            V
Amazon   3708   8555    17.30            V
NYT      5190   20941   17.48            V
Each dataset is split into a training set, a validation set, and a test set, used for model training, hyper-parameter selection, and model evaluation, respectively.
Evaluation Metrics:
MSE, mean squared error
MAE, mean absolute error
Pearson's correlation coefficient r

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

r = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2} \, \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
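Reference implementations of the three metrics:

```python
import math

def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def pearson_r(y, y_hat):
    """Pearson's correlation coefficient."""
    n = len(y)
    my, mh = sum(y) / n, sum(y_hat) / n
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat)) / n
    sy = math.sqrt(sum((a - my) ** 2 for a in y) / n)
    sh = math.sqrt(sum((b - mh) ** 2 for b in y_hat) / n)
    return cov / (sy * sh)
```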
Methods:  CNN              wGW              RMAR             LCEL             RMV
Metrics:  MSE  MAE  r      MSE  MAE  r     MSE  MAE  r     MSE  MAE  r     MSE  MAE  r

Valence ratings prediction
CVAT      1.17 0.88 0.73   2.30 1.23 0.62  1.89 1.14 0.63  1.81 0.95 0.66  1.49 0.98 0.72
Tweets    1.00 0.76 0.79   2.54 1.25 0.65  1.30 0.89 0.69  1.25 0.85 0.75  1.18 0.86 0.74
Movie     2.14 1.18 0.67   6.46 2.02 0.17  3.54 1.73 0.16  2.54 1.36 0.42  2.25 1.26 0.62
Amazon    1.50 0.95 0.67   3.75 1.51 0.35  2.66 1.38 0.27  1.45 1.14 0.45  2.20 1.19 0.56
NYT       0.84 0.72 0.36   3.47 1.54 0.28  0.79 0.71 0.26  0.83 0.75 0.37  0.61 0.63 0.60

Arousal ratings prediction
CVAT      0.98 0.81 0.64   1.34 0.94 0.31  1.20 0.89 0.35  1.07 0.91 0.62  0.98 0.79 0.53

Baseline methods:
wGW — weighted geometric mean
RMAR — regression on mean affective ratings
LCEL — linear combination using expanded lexicon
RMV — regression on mean vectors

The CNN method improved VA prediction performance compared with the lexicon-based methods and RMV.
Using transfer learning techniques to improve VA prediction performance.
Motivation: there are numerous datasets for sentiment classification but only a few for VA prediction, and sentiment polarity may be useful for VA prediction.
Method: pre-train a classification CNN, then use the pre-trained network's parameters as the initial values of the VA-prediction CNN, and continue training on a VA corpus.
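The parameter hand-off can be sketched as follows; the dict layout and key names are hypothetical, purely to illustrate which parts are reused and which are reinitialized:

```python
# Transfer-learning setup sketch: copy the shared parameters (word
# vectors, convolution filters) from a pre-trained classification
# model into a fresh VA-regression model; the task-specific output
# layer is trained from scratch on the VA corpus.
def init_from_pretrained(clf_params):
    return {
        "word_vectors": dict(clf_params["word_vectors"]),  # reused
        "conv_filters": list(clf_params["conv_filters"]),  # reused
        "output_w": [0.0, 0.0],   # new regression head (V, A)
    }

clf_params = {"word_vectors": {"good": [0.2, 0.7]},
              "conv_filters": [[0.1, -0.3]],
              "softmax_w": [1.0, -1.0]}   # classification head, dropped
va_params = init_from_pretrained(clf_params)
```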