1 1
Sentiment Analysis for
Arabic Social Media
Noha Adly
Sara El Shobaky
Bibliotheca Alexandrina
2 2
What is sentiment Analysis?
Discovering people's opinions, emotions, and feelings about a topic, such as a product or a service.
The movie is great
The movie is horrible
The movie is 90 minutes
3 3
Measure Public Opinion about Elections
4 4
Brand Monitoring
Source: https://www.crowdsource.com/solutions/content-moderation/sentiment-analysis/
5 5
Where can we find public opinion?
6 6
Why Social Media?
https://www.smartinsights.com/internet-marketing-statistics/happens-online-60-seconds/
7 7
Social Media in the Arab Region
Salem, F. (2017). The Arab Social Media Report 2017. Arab Social Media Report Series (Vol. 7). Dubai: Governance and Innovation Program, MBR School of Government.
Growth of Facebook Users in the Arab Region 2010-17
Distribution of Facebook Users in Arab Region (2017)
8 8
Why Arabic Social Media?
• Lack of research on sentiment analysis in Arabic
• Colloquial Arabic is challenging
9 9
Colloquial Arabic is challenging
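Modern Standard Arabic vs Egyptian Colloquial Arabic
• "This" (masculine singular noun): ھذا (Modern Standard) vs ده (Egyptian colloquial)
• "This" (feminine singular noun): ھذه (Modern Standard) vs دي (Egyptian colloquial)
• "These" (plural noun): ھؤلاء (Modern Standard) vs دول (Egyptian colloquial)
• جمییییل: "Loooovely"
• واااو: "Wow" (transliteration)
• خطیر: "Dangerous"
On Social Media
• @, #, up: tag, hashtag, follow
• XOXO, OMG: hugs and kisses, Oh my God
• Lol, ھھھھھھه: laugh out loud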
10 10
Objective
Build a platform for applying sentiment analysis to Arabic social media
11 11
Case Study: Products Made in Egypt
Measure opinion toward products made in Egypt
12 12
Example: a group on Facebook
[Figure: number of feeds/comments and number of contributing users per month, 2016-01 to 2017-12]
614,197 Members, 676,151 Feeds/comments, 155,926 Shares
13 13
Study the effect of special events
• Media: e.g. TV shows
• Policies: e.g.
  • Floating the Egyptian pound
  • Increase in fuel prices
  • Import ban
15 15
Study sentiment towards different aspects
16 16
Sentiment analysis approaches
• Lexicon-based
• Machine learning
Each has advantages and disadvantages…
17 17
Lexicon Based Approach
++ Fast.
++ No training data necessary.
++ Good initial accuracy.
-- Requires a polarity lexicon.
-- Underperforms well-trained machine learning.
[Diagram: "This is a good product." → sentiment algorithm using a lexicon (Good ☺, Bad ☹, Sad ☹, …)]
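The lexicon-based idea above can be sketched in a few lines. This is only an illustrative sketch: the tiny English lexicon, its weights, and the function name `lexicon_sentiment` are assumptions made for the example, not the lexicon or algorithm used in this work.

```python
# Toy polarity lexicon: entries and weights are illustrative assumptions.
LEXICON = {"good": 1, "great": 1, "bad": -1, "horrible": -1, "sad": -1}


def lexicon_sentiment(text):
    """Sum the polarity of every known word and map the total to a label."""
    tokens = text.lower().replace(".", " ").split()
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ("This is a good product.",
                 "The movie is horrible",
                 "The movie is 90 minutes"):
    print(sentence, "->", lexicon_sentiment(sentence))
```

No training data is needed, which matches the advantages listed above, but every word missing from the lexicon is silently ignored.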
18 18
Machine Learning Approach
Input: "This is a good product."
Training examples:
- This is a bad product.
- The price is very good
- ...
++ Good accuracy based on:
• Collected training data
• Feature extraction
19 19
Feature Extraction
Transforming raw data into features that can be used as input for a learning algorithm
20 20
Hand-crafted features
• Unigram, bigram, …
• Stemming
  • وبیمشي → يمش
  • بیوكلینا → وكل
  • تابلیت → تبل
• POS tagging (nouns, verbs, adjectives, …)
انا بحبه وبجیبه ع طول�� جداااااااااعلیه طعمه حلو بیسئلواللي � الفاضىالزمة وغالى على اىوال لیه
حلو
وحلو
حلوه
حلوووه
حلواوى
حلوجدا
حاو
جمیل رائع
ممتازروعه ممتع
حلووو
تحفة
حلوأوى
حلوة
ظريف
متمیز
حلو
يجنن
خطیرشابوه
وھمىمدمناه
خرافى
لذيذ
Domain Specific
Spellings
21 21
Word Embedding for Feature Extraction
• Use a neural word embedding to represent each word with a low-dimensional vector
• Similarity in meaning = similarity in vectors
• Two words have close meanings if their local neighborhoods are similar
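"Similarity in vectors" is commonly measured with cosine similarity. The sketch below assumes that measure and uses made-up 3-dimensional vectors (`TOY_VECTORS` is hypothetical); real embeddings would have hundreds of dimensions learned from a corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand for illustration only.
TOY_VECTORS = {
    "lovely": [0.9, 0.8, 0.1],
    "nice": [0.8, 0.9, 0.2],
    "expensive": [0.1, 0.2, 0.9],
}

close = cosine_similarity(TOY_VECTORS["lovely"], TOY_VECTORS["nice"])
far = cosine_similarity(TOY_VECTORS["lovely"], TOY_VECTORS["expensive"])
print(f"lovely~nice: {close:.2f}, lovely~expensive: {far:.2f}")
```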
22 22
Word Embedding for Feature Extraction
• Trains a neural network model from a large corpus
[Diagram: large corpus → learning → vocabulary, where each word is mapped to a D-dimensional word vector]
23 23
Using the embedding in feature extraction
[Diagram: word embedding applied to the spelling variants and near-synonyms listed on slide 20 (حلو, وحلو, حلوه, حلوووه, جمیل, رائع, ممتاز, تحفة, لذيذ, …)]
24 24
How to generate a Word Embedding?
1. Collect a large corpus
2. Preprocess the corpus
3. Learn the embedding
25 25
1. Collect a large Corpus
• We collected a large Arabic corpus from social media
• Source: Facebook and Twitter
• Size: more than 50 million feeds/comments/tweets
• Domain: products
• Number of tokens: 476 million
26 26
2. Preprocessing the corpus
Filtration
• Foreign languages: English, French, …
• Repeated characters: طويییییییییییل
• Punctuation: “ : ; , ‘
• Diacritics: ◌َ ◌ً ◌ٌ ◌ُ
• Tatweel: طويــــــــــــل
Unification
• ؟ ? → __QUES
• http://www..... → __URL
• @PersonName → __TAG
• م f up → __FOllOW
• 1234 567890 → __NUM
• ☺ :-D LOL, … → __EMOT_POS
• :( … → __EMOT_NEG
• :| … → __EMOT_NEU
• Decreasing the vocabulary size of the corpus decreases the computational complexity of learning the embedding
Characters Standardization
أ إ ا اى ي ية ه ه
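A few of the filtration and unification steps above can be sketched with regular expressions. This is a minimal sketch, not the pipeline used in this work: the placeholder tokens `__URL`, `__TAG`, `__NUM` and `__QUES` come from the slide, while the function name `preprocess`, the exact patterns, and the choice to collapse runs of three or more identical characters to a single one are assumptions. Foreign-language filtering, emoticon unification, and character standardization are left out.

```python
import re

TATWEEL = "\u0640"                          # Arabic elongation character
DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan .. sukun
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
QUESTION = re.compile("[?\u061F]")          # Latin and Arabic question marks
NUMBER = re.compile(r"\d+")
REPEATS = re.compile(r"(.)\1{2,}")          # same character 3+ times in a row


def preprocess(text):
    """Apply a subset of the filtration/unification steps (assumed order)."""
    text = URL.sub(" __URL ", text)          # before '?' handling: URLs may contain '?'
    text = MENTION.sub(" __TAG ", text)
    text = QUESTION.sub(" __QUES ", text)
    text = NUMBER.sub(" __NUM ", text)
    text = text.replace(TATWEEL, "")
    text = DIACRITICS.sub("", text)
    text = REPEATS.sub(r"\1", text)          # "Loooovely" -> "Lovely"
    return " ".join(text.split())            # normalize whitespace


print(preprocess("Loooovely http://example.com @Sara 1234?"))
```

Collapsing to a single character also shortens legitimately long runs, so a real pipeline might keep two characters instead; the slide does not say which was used.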
27 27
3. Learning the Word Embedding
• Learning models
  • Word2Vec (Mikolov, 2013)
    • CBOW (Continuous Bag of Words): predict the word given its context.
    • Skip-Gram: predict the context given a word.
  • GloVe (Global Vectors; Pennington, 2014): tries to capture the counts of overall statistics.
• Learning parameters
  • Vector dimension
  • Window size to be included as the context of a target word
  • Number of training iterations over the corpus
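The difference between CBOW and Skip-Gram, and the role of the window size, can be seen in how training examples are built from a sentence. The sketch below only generates the examples; it is a simplification with hypothetical function names and does not reproduce the actual Word2Vec training procedure.

```python
def skipgram_pairs(tokens, window):
    """Skip-Gram: one (target, context_word) pair per word inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


sentence = ["the", "movie", "is", "great"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

A larger window produces more pairs per word, which is one reason the window size affects both training time and the kind of similarity the embedding captures.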
28 28
How to choose the best model?
• Learning word embeddings on the BA (Bibliotheca Alexandrina) HPC
• More than 45 different embeddings were generated with different models and parameters
• On BA HPC (92 nodes, 128G RAM, 24 threads)
• Preprocessing the corpus
• Learning one embedding took up to 13 hours.
29 29
Analogies tests
Accuracy based on Arabic Analogies test
Example question: Man is to woman as king is to ___ ?

Dim.  Win.  Iter.  CBOW Accu. (%)  SG Accu. (%)  GloVe Accu. (%)
100   10    15     59.98           47.27         49.93
200   10    15     65.77           57.80         57.10
300   10    15     67.70           59.49         58.58
400   10    15     67.77           60.56         58.10
500   10    15     64.35           61.15         0.04?
300   5     15     66.49           62.84         56.50
300   10    15     67.70           59.49         58.58
300   15    15     67.46           56.83         59.16
300   20    15     67.19           54.58         60.37
300   10    5      64.78           57.40         0.00?
300   10    10     66.61           59.16         58.10
300   10    15     67.70           59.49         58.58
300   10    20     67.60           60.10         58.31
300   10    25     63.50           59.28         58.60
300   10    30     63.65           59.98         58.29

- The GloVe model has the worst accuracy.
- Increasing the dimension increases the accuracy.
- Window size 10 generates the best results.
- 15 to 20 iterations give the best results.
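An analogy question such as the one above is usually answered with vector arithmetic: find the word whose vector is closest to woman − man + king. The sketch below assumes that method and uses hand-made 3-dimensional vectors (`TOY`, hypothetical) instead of a trained Arabic embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Hypothetical toy vectors; the dimensions loosely mean [male, female, royal].
TOY = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "movie": [0.2, 0.2, 0.0],
}

print(solve_analogy("man", "woman", "king", TOY))  # queen
```

Under this reading, the accuracy of an embedding on an analogy test is the share of questions for which the returned word matches the expected answer.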
30 30
Sentiment Classification test
Features: word embedding
Sentiment classification in 2 steps:
1. Subjectivity classification (Subjective/Objective)
2. Polarity classification (Positive/Negative)
Split of the data for evaluation:
• 90% training data
• 10% test data
Classification algorithm:
• Support Vector Machine
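The two-step scheme above can be read as a cascade: polarity is only decided for text that the first step marks as subjective. The sketch below shows that control flow only. The two keyword-based functions are stand-ins for the trained Support Vector Machine classifiers, and mapping objective text to the label "neutral" is an assumption.

```python
def classify_sentiment(text, is_subjective, is_positive):
    """Step 1: subjectivity. Step 2: polarity, only for subjective text."""
    if not is_subjective(text):
        return "neutral"
    return "positive" if is_positive(text) else "negative"


# Stand-in classifiers for illustration; a real system would call trained models.
POSITIVE_WORDS = {"great", "good"}
NEGATIVE_WORDS = {"horrible", "bad"}


def toy_is_subjective(text):
    words = set(text.lower().split())
    return bool(words & (POSITIVE_WORDS | NEGATIVE_WORDS))


def toy_is_positive(text):
    return bool(set(text.lower().split()) & POSITIVE_WORDS)


for sentence in ("The movie is great",
                 "The movie is horrible",
                 "The movie is 90 minutes"):
    print(sentence, "->",
          classify_sentiment(sentence, toy_is_subjective, toy_is_positive))
```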
31 31
Overall 5,127 3,453 5,304
Price 469 744 536
Quality 4,238 2,018 151
Availability 1,082 476 183
Awareness 93 39 3
Training data
15,000
15,000
15,000
32 32
Best embedding for Sentiment Classification task
Parameters | CBOW Polarity | CBOW Subjectivity | SG Polarity | SG Subjectivity
Dim Win Iter. F Pr Rec F Pr Rec F Pr Rec F Pr Rec
100 10 15 0.912 0.911 0.913 0.863 0.853 0.873 0.9142 0.912 0.917 0.859 0.849 0.869
200 10 15 0.921 0.924 0.918 0.868 0.858 0.878 0.925 0.926 0.923 0.867 0.856 0.879
300 10 15 0.927 0.929 0.924 0.877 0.871 0.885 0.9223 0.925 0.92 0.874 0.868 0.88
400 10 15 0.932 0.935 0.928 0.874 0.865 0.883 0.935 0.938 0.933 0.875 0.862 0.888
500 10 15 0.929 0.93 0.928 0.875 0.866 0.884 0.932 0.936 0.928 0.876 0.871 0.881
300 5 15 0.928 0.93 0.927 0.875 0.868 0.882 0.936 0.937 0.935 0.875 0.867 0.883
300 10 15 0.927 0.931 0.923 0.876 0.87 0.884 0.925 0.926 0.924 0.877 0.866 0.889
300 15 15 0.929 0.929 0.93 0.874 0.871 0.876 0.929 0.928 0.929 0.87 0.866 0.874
300 20 15 0.927 0.928 0.927 0.878 0.865 0.89 0.935 0.938 0.932 0.873 0.865 0.883
300 10 5 0.924 0.929 0.919 0.876 0.868 0.885 0.931 0.934 0.929 0.875 0.866 0.884
300 10 10 0.93 0.933 0.926 0.876 0.869 0.884 0.929 0.933 0.924 0.869 0.864 0.875
300 10 15 0.929 0.935 0.923 0.874 0.868 0.88 0.93 0.931 0.928 0.873 0.865 0.88
300 10 20 0.928 0.936 0.921 0.875 0.873 0.878 0.925 0.93 0.92 0.877 0.869 0.885
300 10 25 0.928 0.928 0.928 0.876 0.872 0.879 0.927 0.931 0.923 0.872 0.866 0.878
300 10 30 0.923 0.926 0.92 0.873 0.865 0.882 0.93 0.932 0.928 0.871 0.866 0.876
Best embedding: Skip-Gram model, 400 dimensions, window size 10, 15 iterations
• 87.5% subjectivity classification
• 93.5% polarity classification
33 33
Category classification Approach
The same machine learning approach using the word embedding is used to classify the dataset.

Category      F      Precision  Recall
Price         0.722  0.832      0.638
Quality       0.849  0.867      0.835
Availability  0.768  0.821      0.722
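The F column combines precision and recall. Assuming it is the balanced F-measure (F1, the harmonic mean of the two), the Price and Availability rows can be reproduced from their precision and recall values:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.832, 0.638), 3))  # Price: 0.722
print(round(f_measure(0.821, 0.722), 3))  # Availability: 0.768
```

Because the harmonic mean is pulled toward the lower value, the low recall for Price (0.638) keeps its F well below its precision.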
34 34
Experiments setup
• Dataset:
  • 645 Facebook pages/groups of products made in Egypt
  • Period from 1-1-2016 to 31-12-2017
  • 4,104,538 Arabic feeds and comments out of 5,161,247
    • 3,409,461 Arabic text
    • 695,077 tags, follows, …
• Events

Date      Event
7/2016    Increase in fuel prices
9/2016    TV show ‘Made in Egypt’ episode 1
11/2016   Floating the Egyptian pound
1/2017    TV show ‘Made in Egypt’ episode 2
1-3/2017  Egyptian decision to stop importing
7/2017    Increase in fuel prices
35 35
Overall Sentiments
• 79% neutral posts, 14% positive posts, 7% negative posts
[Figure: number of feeds/comments by sentiment (Neutral, Pos, Neg); pie chart labels: Neutral 79%, Pos 15%, Neg 6%]
36 36
Overall Sentiments
• Positive and negative sentiments follow the same pattern across time, with positivity almost double negativity
[Figure: number of feeds/comments over time (Pos, Neg)]
37 37
Aspects of Interest
• People are about equally concerned with the 3 aspects.
• Quality was a hot topic after the 1st media show.
• Availability was of interest when the pound was floated.
• Prices became the main interest during the import ban and the increase in fuel prices.
[Figure: number of feeds/comments per aspect (Price, Quality, Availability); pie chart: Price 34%, Quality 30%, Availability 36%]
38 38
Price Sentiments
• 80% of the public is talking about price without expressing an opinion
[Figure: number of feeds/comments per month, 2016-01 to 2017-12 (Neutral, Pos, Neg); pie chart: Neutral 80%, Pos 9%, Neg 11%]
39 39
Price Sentiments
• People are more negative about price than positive
• Negativity is observed more in
[Figure: number of feeds/comments (Pos, Neg)]
40 40
Quality Sentiments
• People tend to express positive sentiments toward Egyptian products (77%)
• The TV show boosted positive feeling toward quality.
[Figure: number of feeds/comments (Neutral, Pos, Neg); pie chart: Neutral 7%, Pos 77%, Neg 16%]
41 41
Availability Sentiments
• More than 50% of the people feel positively about availability
• 36% of the public discuss availability but do not express an opinion
• Negativity is only 12% and is mostly not affected by events
[Figure: number of feeds/comments (Neutral, Pos, Neg); pie chart: Neutral 36%, Pos 52%, Neg 12%]
42 42
Corona Chocolate
[Figure: Corona, number of feeds/comments per aspect (Price, Quality, Availability) by sentiment (Neutral, Pos, Neg)]
[Figure: Corona Overall Sentiments, pie chart: Neu 67%, Pos 20%, Neg 13%]
43 43
Mffco Helwan Furniture
[Figure: Mffco Sentiments per Aspects, number of feeds/comments per aspect (Price, Quality, Availability) by sentiment (Neutral, Pos, Neg)]
[Figure: Mffco Overall Sentiments, pie chart: Neu 82%, Pos 14%, Neg 4%]
44 44
Al-Ahram Aluminum
[Figure: Ahram Sentiments per Aspects, number of feeds/comments per aspect (Price, Quality, Availability) by sentiment (Neutral, Pos, Neg)]
[Figure: Ahram Overall Sentiments, pie chart: Neu 89%, Pos 8%, Neg 3%]
45 45
Conclusions and Future Work
• Sentiment analysis in Arabic is still a fertile field for research, especially for social media.
• A platform has been established for sentiment analysis of Arabic social media.
• A case study has been carried out, analyzing sentiments toward products made in Egypt.
• The results need to be examined more carefully to draw insights.
• Future work will include:
  • Further tuning of parameters.
  • Enlarging the corpus and the data to include Twitter, Instagram and others.
  • Applying the platform to other case studies of interest.
46 46
Thank you