Cyworld Jeju 2009 Conference(10 Aug2009)No2(2)
Automatic discovery of emotion-based communication on social networking sites of South Korean politicians: Cyworld comments
Steven Sams
WCU WeboMatrix, Yeungnam University,
214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749
Wojciech Gryc
Computing Laboratory, Oxford University, Wolfson Building,
Parks Road, Oxford OX1 3QD, [email protected]
Han Woo Park (Corresponding author)
Dept of Media & Communication, Yeungnam University,
214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749
[email protected]
Paper submitted for possible presentation to the 2nd International Conference on u- and e- Service, Science and Technology (UNESST), December 10~12, 2009, International Convention Center Jeju, Jeju Island, Korea. http://www.sersc.org/UNESST2009/
Abstract
This article examines users' textual comments on the social networking sites of politicians.
South Korean politicians were identified, and all comments posted to them by other users within a specified time frame were gathered.
The techniques employed in this research were particularly useful for automatically classifying the sentiment expressed by citizens.
Key words
Cyworld
South Korea
political communication
social networking sites
sentiment analysis
e-research
Contents
1 Introduction
2 Related Studies
3 Data Collection
4 Machine Learning Methods
5 Feature Analysis
6 Categorization Results
7 Further Work
8 Conclusion
1 Introduction
Social media offer a new way of communicating and give politicians an efficient way to communicate with citizens.
Therefore, we aim to develop software that can capture the large volume of political communication data generated on politicians' social networking sites and citizens' social media sites, specifically on Cyworld.
[Figure: annotated screenshot of a Cyworld Mini-Homepage. Labelled elements: title of the Mini-Homepage; visitor counts; basic information about the host; condition of the host; mini room (editable by the host); main menu; favorite menu; status of the Mini-Homepage (① how active, ② how famous, ③ how friendly).]
Link: Cyworld Mini-Homepage (Geun-Hye Park)
2 Related Studies
1. Automatic tracking of online political communication (a representative exercise has been conducted by Canada’s Infoscape Lab).
2. Sentiment analysis of social networking sites.
These tools apply only to English, and gender differences in political user-generated feedback have not been fully explored.
This presents an opportunity to examine South Korean politicians’ social networking sites and observe citizens’ behavior.
3 Data collection
The URLs of South Korean politicians from the ruling and opposition parties who owned social networking sites on Cyworld, as of 21st May 2009, were located.
The comments posted on their Cyworld pages between 1st April 2008 (the month of the National Assembly election) and 14th June 2009 were automatically collected.
Only the gender of each poster and the text of each comment were collected, in response to criticism concerning the lack of anonymity and the absence of clear consent to participate.
Other data relating to these users was removed from the study. Of the 90 political profiles, nine had restrictive privacy settings or no posts; the remaining 81 were successfully scraped.
Politician Male Female Unknown Total
나경원 (Kyeong-Won Na) 10547 6611 2288 19446
박근혜 (Geun-Hye Park) 10086 7199 1651 18936
이회창 (Hoi-Chang Lee) 8970 6284 2380 17634
조경태 (Kyeong-Tae Cho) 2889 2412 11101 16402
정동영 (Dong-Yong Chung) 4872 4430 981 10283
문국현 (Kook-Hyn Moon) 3104 4229 711 8044
강기갑 (Gi-Gap Kang) 1405 1065 3997 6467
손숙미 (Sook-Mi Son) 1634 771 586 2991
정몽준 (Mong-Jun Chung) 1146 409 842 2397
홍정욱 (Jeong-Wook Hong) 913 753 126 1792
Table 1. Summary of comments posted on ten political profile pages between April 2008 and June 2009.
One politician was selected at random from the eighty-one successfully scraped political profiles, and the comments posted by male and female users were taken as the dataset.
From this data, two hundred random comments were taken and each was categorized into one of three groups.
All posts were labelled by one individual, so reliability metrics are not available (this is discussed in Further work).
The number of categorized posts in each category is included in Table 1.
•Irrelevant: The post has nothing to do with the National Assembly Member or his or her policy issues. It is a general comment on politics, or may be spam.
•Favourable: The post shows respect, support, or rapport with the National Assembly Member. It may suggest policy issues in gentle or polite words.
•Unfavourable: The post is hostile, adversarial, or critical of the National Assembly Member. It may be trying to slander the National Assembly Member, or may include curse words.
4 Machine learning method
Naïve Bayes multinomial models and support vector machines were used to build the classifiers, combined with voting and stacking frameworks.
A “bag of words” approach was used.
The first approach used was a naïve
Bayes multinomial model.
A support vector machine using a
polynomial kernel was also used in
the classification task.
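The naïve Bayes multinomial model with bag-of-words features can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the add-one smoothing value, and the English toy comments are all assumptions made here for illustration, and the support vector machine is not shown.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: plain whitespace tokenization; the slides do not describe preprocessing.
    return text.lower().split()

def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model on bag-of-words counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"log_prior": {}, "log_lik": {}, "vocab": vocab}
    for c, n_c in class_counts.items():
        model["log_prior"][c] = math.log(n_c / len(labels))
        # alpha is the smoothing constant (assumed value, not from the source).
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log-probability."""
    counts = Counter(t for t in tokenize(doc) if t in model["vocab"])
    scores = {
        c: model["log_prior"][c]
        + sum(n * model["log_lik"][c][w] for w, n in counts.items())
        for c in model["log_prior"]
    }
    return max(scores, key=scores.get)

# Invented toy comments, not taken from the Cyworld data.
docs = [
    "thank you we support you", "great work we support you",
    "you are a liar resign", "shame on you resign now",
    "cheap loans click here", "click here for cheap phones",
]
labels = ["Favourable", "Favourable", "Unfavourable",
          "Unfavourable", "Irrelevant", "Irrelevant"]
model = train_nb(docs, labels)
print(predict_nb(model, "we support your great work"))
```

Words that never occurred in training are ignored at prediction time, which is one common convention rather than something the slides specify.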
The NB model and support vector machine were also combined using a voting system. Each model received one vote, with a third vote being automatically assigned to the class with the largest population.
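The voting rule described above can be sketched in a few lines. The function name is hypothetical, and the handling of a three-way tie (all three votes different) is an assumption, since the slides do not say how ties are resolved.

```python
from collections import Counter

def vote(nb_label, svm_label, largest_class):
    """Combine two model predictions with a fixed third vote for the largest class."""
    votes = Counter([nb_label, svm_label, largest_class])
    label, count = votes.most_common(1)[0]
    if count >= 2:
        return label
    # Assumption: when all three votes differ, fall back to the largest class.
    return largest_class

print(vote("Unfavourable", "Unfavourable", "Favourable"))
print(vote("Irrelevant", "Favourable", "Favourable"))
```

Under this rule the largest class wins whenever the two models disagree, so a vote for any other class requires both models to agree on it.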
A final approach was the use of a stacking framework. In this case, the outputs of the models are used as input variables to a logistic regression model.
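A minimal sketch of stacking follows, assuming each base model emits one probability per class and that the meta-model is a multinomial (softmax) logistic regression fitted by stochastic gradient descent. The probability values, learning rate, and epoch count are invented for illustration; the slides state only that a logistic regression model was used.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_meta(X, y, n_classes, lr=0.5, epochs=500):
    """Fit softmax regression on base-model outputs (last weight is the bias)."""
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, target in zip(X, y):
            xb = x + [1.0]
            probs = softmax([sum(w * v for w, v in zip(row, xb)) for row in W])
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == target else 0.0)
                for j in range(n_features + 1):
                    W[k][j] -= lr * grad * xb[j]
    return W

def predict_meta(W, x):
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(row, xb)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)

# Each row: [NB probabilities for classes 0..2] + [SVM probabilities for classes 0..2].
# Hypothetical values; 0 = Irrelevant, 1 = Favourable, 2 = Unfavourable.
X = [
    [0.7, 0.2, 0.1, 0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1, 0.5, 0.3, 0.2],
    [0.2, 0.7, 0.1, 0.3, 0.6, 0.1],
    [0.1, 0.8, 0.1, 0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7, 0.2, 0.2, 0.6],
    [0.2, 0.1, 0.7, 0.1, 0.3, 0.6],
]
y = [0, 0, 1, 1, 2, 2]
W = train_meta(X, y, n_classes=3)
print([predict_meta(W, x) for x in X])
```

In practice the base-model outputs fed to the meta-model would come from held-out predictions, so that the meta-model does not simply learn the base models' training-set fit.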
5 Feature analysis
The features used as inputs into the machine learning algorithms consisted of word counts of all the words that appear 2 or more times in either the male or female set of posts. This resulted in a feature list of 484 words. Table 2 shows the words that appear to be the most significant features, as determined by a χ2 test for significance of features.
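The feature construction above can be sketched as follows. The vocabulary filter follows the description (words appearing 2 or more times in either set). The χ2 score uses a 2×2 contingency table of word presence against class membership, which is an assumption about how the test was set up; function names and toy posts are hypothetical.

```python
from collections import Counter

def vocabulary(male_posts, female_posts, min_count=2):
    """Keep words that occur at least min_count times in either set of posts."""
    male = Counter(w for p in male_posts for w in p.lower().split())
    female = Counter(w for p in female_posts for w in p.lower().split())
    return sorted(
        w for w in set(male) | set(female)
        if male[w] >= min_count or female[w] >= min_count
    )

def chi_square(posts, labels, word, target):
    """Chi-square statistic for a 2x2 table: word present/absent vs. class/not class."""
    a = b = c = d = 0
    for post, label in zip(posts, labels):
        has_word = word in post.lower().split()
        in_class = label == target
        if has_word and in_class:
            a += 1
        elif has_word:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

# Invented toy posts, not taken from the Cyworld data.
print(vocabulary(["support support you", "resign now"],
                 ["thank you", "thank you again"]))

posts = ["we support you", "support forever", "resign now", "please resign"]
labels = ["Favourable", "Favourable", "Unfavourable", "Unfavourable"]
print(chi_square(posts, labels, "support", "Favourable"))
```

A higher statistic indicates a word whose presence is more strongly associated with the target class, which is how a ranking of the most significant features could be produced.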
6 Categorization Results
The four algorithms were trained on the data set, and their results are all very similar. (Note: the “Largest Class” row represents the accuracy a model would achieve if all comments were labelled as positive.)
All four algorithms outperform this baseline, and do so at a statistically significant level (p = 0.05).
Interestingly, all four algorithms have fairly similar accuracies.
However, they have different strengths.
While the accuracies of the voting classifier (Table 4) and the stacked classifier (Table 5) are similar, the voting classifier appears to be biased in favour of positive classifications, while the stacked classifier tends to label more posts as negative.
Actual \ Predicted Irrelevant Favourable Unfavourable
Irrelevant 1 28 8
Favourable 1 101 4
Unfavourable 5 22 30
Table 4. Confusion matrix for the voting classifier.
Actual \ Predicted Irrelevant Favourable Unfavourable
Irrelevant 5 12 20
Favourable 4 89 13
Unfavourable 5 13 39
Table 5. Confusion matrix for the stacked classifier.
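The figures in Tables 4 and 5 can be checked with a short calculation. The sketch below assumes rows are actual classes and columns are predicted classes; this reading is supported by the row totals, which are identical in both tables (37, 106, 57), as actual class counts must be.

```python
LABELS = ["Irrelevant", "Favourable", "Unfavourable"]

# Values copied from Tables 4 and 5 (rows: actual, columns: predicted).
VOTING = [[1, 28, 8], [1, 101, 4], [5, 22, 30]]
STACKED = [[5, 12, 20], [4, 89, 13], [5, 13, 39]]

def accuracy(matrix):
    """Share of comments on the diagonal (predicted class equals actual class)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def actual_totals(matrix):
    return [sum(row) for row in matrix]

def predicted_totals(matrix):
    return [sum(row[j] for row in matrix) for j in range(len(matrix))]

for name, matrix in [("voting", VOTING), ("stacked", STACKED)]:
    print(name, accuracy(matrix), dict(zip(LABELS, predicted_totals(matrix))))

# Accuracy if every comment were labelled with the largest class (Favourable).
baseline = max(actual_totals(VOTING)) / sum(actual_totals(VOTING))
print("largest class", baseline)
```

The voting classifier is correct on 132 of 200 comments (0.66) and the stacked classifier on 133 of 200 (0.665), against 106 of 200 (0.53) for labelling everything as favourable. The predicted totals show the difference in bias: the voting classifier predicts Favourable 151 times and Unfavourable 42 times, while the stacked classifier predicts Favourable 114 times and Unfavourable 72 times.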
There are three curves, one per class, because each curve represents the model's ability to label a comment as belonging to a specific category or not.
The class with the highest estimated probability is the one assigned to the comment.
This means the classification algorithm creates three sub-classifiers (favourable or not, unfavourable or not, irrelevant or not).
The ROC curves show how accurate the estimated probabilities are. Each point on a ROC curve shows the false positive rate and true positive rate obtained at a specific probability threshold.
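How a one-versus-rest ROC curve is traced can be sketched as follows. The probabilities and labels are invented, and the function names are hypothetical; the point is only to show that each threshold yields one (false positive rate, true positive rate) pair, and that the final label is the class with the highest estimated probability.

```python
def roc_points(scores, is_positive):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, pos in zip(scores, is_positive) if s >= threshold and pos)
        fp = sum(1 for s, pos in zip(scores, is_positive) if s >= threshold and not pos)
        points.append((fp / n_neg, tp / n_pos))
    return points

def assign(class_probs):
    """Pick the class whose sub-classifier gives the highest probability."""
    return max(class_probs, key=class_probs.get)

# Hypothetical "favourable or not" probabilities for five comments.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
is_favourable = [True, True, False, True, False]
points = roc_points(scores, is_favourable)
print(points)

print(assign({"Irrelevant": 0.1, "Favourable": 0.3, "Unfavourable": 0.6}))
```

Repeating `roc_points` with the scores of each sub-classifier gives one curve per class.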
Figure 1 shows the ROC curves for the three classes. The classifier tends to do well on all three.
Overall, it is useful to see that the algorithms are able to discern between irrelevant, positive and negative posts. While accuracies can still be improved, the results are very encouraging.
Fig. 1 Receiver Operating Characteristic (ROC) curves for the stacked classifier, for each class in the dataset.
7 Further work
It would be interesting to apply categorization methods that use natural language processing (NLP) techniques, to see how knowledge of the grammatical structure of a post could help with the labelling process. Furthermore, information on the participants of Cyworld, such as location and political affiliation, was not used in categorizing posts.
The authors are also planning to expand the study to multiple labellers to help understand how difficult and reliable human labelling actually is.
However, the results above illustrate that labelling posts by sentiment is not an intractable problem and useful machine learning approaches exist.
8 Conclusion
The development of these tools provides an efficient means to study emotion-based political communication and addresses previous criticism of the lack of anonymity. One of the many advantages of this technique is that it enables sentiment analysis of the feedback users generate when interacting with political social networks.
These tools could be applied to social network analysis in nations where political social networks are established, or extended beyond the analysis of political subjects to explore the habits of the general online public.
Acknowledgments. The corresponding author acknowledges that this research is supported by the WCU project (Investigating an internet-based politics using e-research tools) granted by the South Korean Government.