Cyworld Jeju 2009 Conference(10 Aug2009)No2(2)

Automatic discovery of emotion-based communication on social networking sites of South Korean politicians: Cyworld comments

Steven Sams, WCU WeboMatrix, Yeungnam University,

214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749

[email protected]

Wojciech Gryc, Computing Laboratory, Oxford University, Wolfson Building,

Parks Road, Oxford OX1 3QD, [email protected]

Han Woo Park (Corresponding author), Dept of Media & Communication, Yeungnam University,

214-1, Dae-dong, Gyeongsan-si, Gyeongsangbuk-do, South Korea, 712-749

[email protected]

A paper submitted for possible presentation to the 2nd International Conference on u- and e- Service, Science and Technology (UNESST), December 10~12, 2009, International Convention Center Jeju, Jeju Island, Korea. http://www.sersc.org/UNESST2009/


Abstract

This article examines users’ textual comments on the social networking sites of politicians.

South Korean politicians were identified, and all comments left for them by other users within a specified time frame were gathered.

The techniques employed in this research were particularly useful for automatically classifying the sentiments expressed by citizens.

Key words: Cyworld, South Korea, political communication, social networking sites, sentiment analysis, e-research

Contents

1 Introduction

2 Related Studies

3 Data Collection

4 Machine Learning Methods

5 Feature Analysis

6 Categorization Results

7 Further Work

8 Conclusion

1 Introduction

Social media offer a new way of communicating and give politicians an efficient way to communicate with citizens.

We therefore aim to develop software that can capture the large volume of political communication data generated on social networking sites and citizens’ social media sites, specifically on Cyworld.

Figure: annotated screenshot of a Cyworld Mini-Homepage (Geun-Hye Park), http://www.cyworld.com/ghism. Labelled elements: title of the Mini-Homepage; visitor counts; basic information of the host; condition of the host; Mini room (editable by the host); main menu; favorite menu; status of the Mini-Homepage (① how active, ② how famous, ③ how friendly).






2 Related Studies

1. Automatic tracking of online political communication (a representative exercise has been conducted by Canada’s Infoscape Lab).

2. Sentiment analysis of social networking sites.

These tools apply only to English, and gender differences in political user-generated feedback have not been fully explored.

This presents an opportunity to examine South Korean politicians’ social networking sites and observe citizens’ behavior.

http://www.infoscapelab.ca/



3 Data collection

The URLs of South Korean politicians, from both ruling and opposition parties, who had social networking sites on Cyworld as of 21st May 2009 were located.

The comments posted on their Cyworld pages between 1st April 2008 (the month of the National Assembly Member election) and 14th June 2009 were automatically collected.
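The collection window can be expressed as a simple date filter. This is a minimal sketch only: the record layout and the `in_window` helper are hypothetical, and treating both end dates as inclusive is an assumption, since the text does not say how boundary days were handled.

```python
from datetime import date

# Collection window stated in the text.
WINDOW_START = date(2008, 4, 1)
WINDOW_END = date(2009, 6, 14)


def in_window(posted_on: date) -> bool:
    """Return True if a comment date falls inside the window (inclusive, assumed)."""
    return WINDOW_START <= posted_on <= WINDOW_END


# Hypothetical scraped records: (date posted, gender, comment text).
comments = [
    (date(2008, 3, 31), "M", "too early"),
    (date(2008, 4, 1), "F", "first day"),
    (date(2009, 6, 14), "unknown", "last day"),
    (date(2009, 6, 15), "M", "too late"),
]

# Keep only the comments posted inside the window.
kept = [c for c in comments if in_window(c[0])]
```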


Only the gender of each commenter and the text of the comment were collected, in response to criticism about the lack of anonymity and the absence of clear consent to participate.

All other data relating to these users was removed from the study. Of the 90 political profiles, nine had restrictive privacy settings or no posts; the remaining 81 were successfully scraped.

Politician Male Female Unknown Total

나경원 (Kyeong-Won Na) 10547 6611 2288 19446

박근혜 (Geun-Hye Park) 10086 7199 1651 18936

이회창 (Hoi-Chang Lee) 8970 6284 2380 17634

조경태 (Kyeong-Tae Cho) 2889 2412 11101 16402

정동영 (Dong-Yong Chung) 4872 4430 981 10283

문국현 (Kook-Hyn Moon) 3104 4229 711 8044

강기갑 (Gi-Gap Kang) 1405 1065 3997 6467

손숙미 (Sook-Mi Son) 1634 771 586 2991

정몽준 (Mong-Jun Chung) 1146 409 842 2397

홍정욱 (Jeong-Wook Hong) 913 753 126 1792

Table 1. Summary of comments posted on ten political profile pages between April 2008 and June 2009.

One politician was selected at random from the eighty-one successfully scraped political profiles, and the comments posted by male and female users were taken as the dataset.

From this data, two hundred random comments were taken and each was assigned to one of three possible categories.

All posts were labelled by one individual, so inter-rater reliability metrics are not available (this is discussed in Further Work).

The number of posts in each category is included in Table 1.

• Irrelevant: The post has nothing to do with the National Assembly Member or his or her policy issues. It is a general comment on politics, or may be spam.

• Favourable: The post shows respect, support, or rapport with the National Assembly Member. It may suggest policy issues in gentle or polite words.

• Unfavourable: The post is hostile, adversarial, or critical of the National Assembly Member. It may be trying to slander the National Assembly Member, or include curse words.
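The sampling and labelling scheme above can be sketched as follows. The `Label` enum, the `sample_for_labelling` helper, the fixed seed and the placeholder comment pool are all illustrative assumptions; the text only states that two hundred comments were drawn at random and placed in one of three categories.

```python
import random
from enum import Enum


class Label(Enum):
    """The three categories used when hand-labelling comments."""
    IRRELEVANT = "irrelevant"
    FAVOURABLE = "favourable"
    UNFAVOURABLE = "unfavourable"


def sample_for_labelling(comments, k=200, seed=0):
    """Draw k distinct comments at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(comments, k)


# Hypothetical pool standing in for one politician's male and female comments.
pool = [f"comment {i}" for i in range(1000)]
to_label = sample_for_labelling(pool)
```

Sampling without replacement guarantees that no comment is labelled twice.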

4 Machine learning method

Multinomial naïve Bayes models and support vector machines were used to build the classifiers, and were combined using voting and stacking frameworks. A “bag of words” approach was used.
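In a bag-of-words representation each post becomes a vector of word counts, with word order discarded. The sketch below assumes whitespace tokenization and a tiny hypothetical vocabulary; real Korean text would likely need a proper tokenizer, which the text does not describe.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a post to a list of counts, one entry per vocabulary word."""
    counts = Counter(text.split())  # whitespace tokenization (assumption)
    return [counts[word] for word in vocabulary]


# Hypothetical three-word vocabulary and example post.
vocabulary = ["support", "policy", "liar"]
vector = bag_of_words("support support the policy", vocabulary)
```

Words outside the vocabulary (here "the") are simply ignored.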


The first approach used was a multinomial naïve Bayes model.
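For readers unfamiliar with the technique, a multinomial naïve Bayes classifier can be sketched in a few lines. This is not the authors' implementation: the add-one (Laplace) smoothing, the function names and the toy English training posts are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and smoothed per-class word likelihoods.

    docs is a list of token lists; alpha is the smoothing constant (assumed 1).
    """
    vocab = sorted({w for d in docs for w in d})
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_like = {}
    for c in class_sizes:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
              for c in log_prior}
    return max(scores, key=scores.get)


# Toy training data (hypothetical).
docs = [["great", "support"], ["support", "you"],
        ["liar", "resign"], ["resign", "now"]]
labels = ["favourable", "favourable", "unfavourable", "unfavourable"]
model = train_nb(docs, labels)
```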

A support vector machine using a polynomial kernel was also used in the classification task.
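A polynomial kernel compares two word-count vectors through a power of their dot product. The sketch below shows only the kernel function, not SVM training; the degree and constant term are assumptions, because the text does not state which values were used.

```python
def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Compute (x . y + coef0) ** degree for two equal-length vectors.

    degree and coef0 are illustrative defaults, not values from the study.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Two hypothetical bag-of-words vectors.
similarity = polynomial_kernel([2, 1, 0], [1, 1, 1])
```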


The NB model and support vector machine were also combined using a voting system. Each model received one vote, with a third vote being automatically assigned to the class with the largest population.
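The voting rule can be sketched as a three-ballot majority. The `vote` function is hypothetical, and the handling of a three-way tie (falling back to the largest class) is an assumption, since the text does not say how ties were broken.

```python
from collections import Counter


def vote(nb_pred, svm_pred, largest_class):
    """Combine two model predictions with a fixed third vote for the largest class."""
    ballots = Counter([nb_pred, svm_pred, largest_class])
    winner, count = ballots.most_common(1)[0]
    # Three different ballots give a three-way tie; falling back to the
    # largest class here is an assumption, not something the text specifies.
    return winner if count >= 2 else largest_class


agreed = vote("unfavourable", "unfavourable", "favourable")
split = vote("unfavourable", "favourable", "favourable")
tied = vote("irrelevant", "unfavourable", "favourable")
```

Under this tie-break the rule reduces to: if the two models agree, their class wins; otherwise the largest class wins. That may help explain a lean toward the largest class in the voting results.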


A final approach was the use of a stacking framework. In this case, the outputs of the models are used as input variables to a logistic regression model.
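A stacked classifier of this kind can be sketched with a small multinomial logistic regression trained on the base models' outputs. Everything here is illustrative: feeding one-hot encoded predicted labels (rather than probabilities), the learning rate, the epoch count and the toy examples are assumptions, not details from the study.

```python
import math

CLASSES = ["irrelevant", "favourable", "unfavourable"]


def one_hot(label):
    return [1.0 if label == c else 0.0 for c in CLASSES]


def meta_features(nb_pred, svm_pred):
    """Base-model outputs become the input variables, plus a bias term."""
    return one_hot(nb_pred) + one_hot(svm_pred) + [1.0]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_meta(rows, targets, epochs=500, lr=0.5):
    """Fit one weight vector per class by stochastic gradient descent."""
    weights = [[0.0] * len(rows[0]) for _ in CLASSES]
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            probs = softmax([sum(w * v for w, v in zip(ws, x)) for ws in weights])
            for k, ws in enumerate(weights):
                error = probs[k] - (1.0 if CLASSES[k] == y else 0.0)
                for j, value in enumerate(x):
                    ws[j] -= lr * error * value
    return weights


def predict_meta(weights, x):
    scores = [sum(w * v for w, v in zip(ws, x)) for ws in weights]
    return CLASSES[scores.index(max(scores))]


# Toy examples (hypothetical): (NB prediction, SVM prediction, true label).
examples = [
    ("favourable", "favourable", "favourable"),
    ("unfavourable", "unfavourable", "unfavourable"),
    ("irrelevant", "irrelevant", "irrelevant"),
    ("favourable", "unfavourable", "unfavourable"),
    ("favourable", "irrelevant", "irrelevant"),
]
rows = [meta_features(nb, svm) for nb, svm, _ in examples]
targets = [truth for _, _, truth in examples]
weights = train_meta(rows, targets)
```

Unlike fixed voting, the meta-model learns from data how much to trust each base model; in this toy set it learns to follow the SVM when the two disagree.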

5 Feature analysis

The features used as inputs to the machine learning algorithms consisted of word counts of all the words that appear two or more times in either the male or female set of posts. This resulted in a feature list of 484 words. Table 2 shows the words that appear to be the most significant features, as determined by a χ2 test for significance of features.
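The vocabulary rule and a χ2 score for a single word can be sketched as follows. Whitespace tokenization and the 2×2 presence/absence form of the χ2 statistic are assumptions; the text does not give the exact formulation used to rank features.

```python
from collections import Counter


def build_vocabulary(male_posts, female_posts, min_count=2):
    """Keep words that occur at least min_count times in either set of posts."""
    male = Counter(w for post in male_posts for w in post.split())
    female = Counter(w for post in female_posts for w in post.split())
    return sorted(w for w in set(male) | set(female)
                  if male[w] >= min_count or female[w] >= min_count)


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table of post counts.

    a: posts in the class containing the word, b: other posts containing it,
    c: posts in the class without it, d: other posts without it.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical posts: "good" appears twice for males, "bad" twice for females.
vocab = build_vocabulary(["good good luck"], ["good bad bad"])
strong = chi_square_2x2(10, 0, 0, 10)   # word perfectly separates the class
weak = chi_square_2x2(5, 5, 5, 5)       # word carries no information
```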

6 Categorization Results

The four algorithms were trained on the data set, and their results are all very similar. (Note: the “Largest Class” row represents the accuracy of a model that labels all comments as positive, i.e. favourable.)

All four algorithms outperform this baseline, and do so at a statistically significant level (at p = 0.05).
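One way to check a classifier against the largest-class baseline is a one-sided exact binomial test. This is a sketch under assumptions: the text does not name the test that was used, and the counts below are read from the rows and diagonal of Table 4 (106 favourable posts out of 200, and 132 correct predictions for the voting classifier).

```python
from math import comb


def binomial_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p ** k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))


n_posts = 200
n_favourable = 106                    # row total of the Favourable class in Table 4
baseline = n_favourable / n_posts     # accuracy of always answering "favourable"
n_correct_voting = 1 + 101 + 30       # diagonal of Table 4

# Probability of doing at least this well if true accuracy equalled the baseline.
p_value = binomial_tail(n_correct_voting, n_posts, baseline)
```

A p-value below 0.05 here would be consistent with the claim in the text, though the authors may have used a different procedure.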

Interestingly, all four algorithms have fairly similar accuracies.

However, they have different strengths.

While the accuracies of the voting classifier (Table 4) and the stacked classifier (Table 5) are similar, the voting classifier appears to be biased in favour of positive classifications, while the stacked classifier tends to label more posts as negative.

Actual / Predicted   Irrelevant   Favourable   Unfavourable

Irrelevant           1            28           8

Favourable           1            101          4

Unfavourable         5            22           30

Table 4. Confusion matrix for the voting classifier (rows: actual class; columns: predicted class).

Actual / Predicted   Irrelevant   Favourable   Unfavourable

Irrelevant           5            12           20

Favourable           4            89           13

Unfavourable         5            13           39

Table 5. Confusion matrix for the stacked classifier (rows: actual class; columns: predicted class).
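The accuracy and prediction bias of the two classifiers can be recomputed directly from Tables 4 and 5. The sketch assumes rows are actual classes and columns are predicted classes, which is consistent with the row totals (37, 106, 57) being identical in both tables.

```python
CLASSES = ["Irrelevant", "Favourable", "Unfavourable"]

# Rows: actual class; columns: predicted class (values from Tables 4 and 5).
voting = [[1, 28, 8], [1, 101, 4], [5, 22, 30]]
stacked = [[5, 12, 20], [4, 89, 13], [5, 13, 39]]


def accuracy(matrix):
    """Share of posts on the diagonal, i.e. predicted class equals actual class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))


def predicted_totals(matrix):
    """Number of posts assigned to each class (column sums)."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix))]


voting_accuracy = accuracy(voting)        # 132 / 200
stacked_accuracy = accuracy(stacked)      # 133 / 200
voting_predicted = predicted_totals(voting)
stacked_predicted = predicted_totals(stacked)
```

The column sums show the bias described in the text: the voting classifier labels 151 of 200 posts as favourable, while the stacked classifier labels 114 as favourable and 72 as unfavourable.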


There are three curves, one for each class, because each curve represents the model's ability to label a comment as inside or outside a specific category.

In other words, the classification algorithm creates three sub-classifiers (favourable or not, unfavourable or not, irrelevant or not).

The class with the highest estimated probability is the one assigned to the comment.
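The assignment rule across the three sub-classifiers is an argmax over their estimated probabilities. The `assign_class` helper and the probability values are hypothetical; note that one-vs-rest estimates need not sum to one.

```python
def assign_class(probabilities):
    """Pick the class whose sub-classifier reports the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical estimates from the three "class or not" sub-classifiers.
estimates = {"favourable": 0.55, "unfavourable": 0.30, "irrelevant": 0.40}
assigned = assign_class(estimates)
```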

The ROC curves show how accurate the estimated probabilities are. Each point on a ROC curve shows how many false positives and true positives occur for a specific probability threshold.
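The points of a ROC curve for one sub-classifier can be computed by sweeping the probability threshold. This is a minimal sketch with hypothetical scores; it assumes the data contain at least one positive and one negative example, and it treats a score at or above the threshold as a positive prediction.

```python
def roc_points(scores, is_positive):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(is_positive)
    negatives = len(is_positive) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, is_positive) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, is_positive) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical "favourable or not" scores for four comments.
scores = [0.9, 0.8, 0.4, 0.2]
is_positive = [True, False, True, False]
points = roc_points(scores, is_positive)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1); a curve that rises quickly toward the top-left corner indicates well-ordered probability estimates.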


Figure 1 shows the ROC curves for the three classes. The classifiers tend to do well on all three.

Overall, it is useful to see that the algorithms are able to discern between irrelevant, positive and negative posts. While accuracies can still be improved, the results are very encouraging.

Fig. 1 Receiver Operating Characteristic (ROC) curves for the stacked classifier, for each class in the dataset.

7 Further work

It would be interesting to apply categorization methods that use natural language processing (NLP) techniques, to see how knowledge of the grammatical structure of a post could help with the labelling process. Furthermore, information on Cyworld participants, such as location and political affiliation, was not used in categorizing posts.


The authors are also planning to expand the study to multiple labellers to help understand how difficult and reliable human labelling actually is.

However, the results above illustrate that labelling posts by sentiment is not an intractable problem and that useful machine learning approaches exist.

8 Conclusion

The development of these tools provides an efficient means to study emotion-based political communication and addresses previous criticism about the lack of anonymity. One of the many advantages of this technique is that it enables sentiment analysis of user-generated feedback posted on political social networking profiles.

These tools could be applied to social network analysis in other nations where political social networks are established, or extended beyond the analysis of political subjects to explore the habits of the general online public.


Acknowledgments. The corresponding author acknowledges that this research is supported by the WCU project (Investigating an internet-based politics using e-research tools) granted by the South Korean Government.
