
Detecting Influential Users and Communities in Censored Tweets Using Data-Flow Graphs

Rima S. Tanash1, Abdullah Aydogan2, Zhouhan Chen1, Dan Wallach1, Melissa Marschall2, Devika Subramanian1, Chris Bronk3.

1Rice University, Computer Science

2Rice University, Political Science

3University of Houston, Computer and Information Systems

Abstract

Current literature on social media censorship has examined various aspects of censorship. However, the relationship among censored social media users has received much less attention. We address this gap in the literature by constructing a complete dynamic data-flow graph that models the communication between users, identifies influential users, and utilizes a wide variety of metadata embedded in tweets to follow data paths of censored tweets. Using a dataset that includes over 25 million tweets from Turkey, our analysis is based on 712,218 unique, censored tweets, associated with 13,056 distinct users. By applying a modularity metric to find user communities, our graph identified 5 large communities with influential users who are supporters of the Kurdish separatist movement, and famous accounts that had active social media involvement during the graft scandal of December 2013. In addition, using machine-learning topic clustering algorithms to extract popular censored topics and keywords within these communities, we found that the largest community frequently mentioned topics regarding the leaked court documents of the corruption case, while the next four largest communities frequently referenced topics criticizing the military strikes against the separatist groups.


1. Introduction

Some social media outlets, like Twitter, provide mechanisms for governments to censor social

media posts in their geographical boundaries. This empirical reality triggered the development of

a new line of research that focuses on the nature of social media censorship (King et al. 2013,

2014; Zhu et al. 2013; Gunitsky 2015; Rød and Weidmann 2015). This literature has examined

various aspects of government censorship including the velocity of censorship, the content of

censored texts, and the geographic origins of posts. While current literature provides useful

techniques to examine the relationship among users (e.g. see Yamaguchi et al. 2010, Barbera

2015), existing studies use only following/follower (i.e. friendship) links to map the users’

relationships, and retweeted posts. We believe this approach provides a limited understanding of

the level of interaction between users (Huberman and Romero 2009). In this study we examine

another aspect of social relationships that involves studying the data flow between users to

investigate influential users in networks of user communities within censored data.

Building on these earlier works, we present a method for detecting user communities by

employing a novel approach of constructing complete data-flow graphs using a wide variety of

metadata embedded in tweets to follow paths of censored tweets. Our method captures tweets

and users (such as users being mentioned, retweeted, and replied to) that the traditional approach,

which focuses exclusively on static friendship-based ties, misses. After detecting user

communities, we use machine learning statistical measures and topic clustering algorithms to

detect censored hot topics from each community.

Our analysis is based on a dataset of social media posts censored primarily by the Turkish

government. Turkey is by far the leading country in Twitter censorship over the last few years.1

1 See Appendix A for the legal process of Internet censorship in Turkey.

In fact, 90 percent of the censored tweets during the second half of 2016 were from Turkey.2

Applying our methods to 712,218 censored Turkish tweets between October 2014 and

September 2015, we detected interesting user communities and popular topics that shed light on

the aim of Turkish censors. Further examination of these communities and topics revealed that

the Turkish government targets mainly two types of users: those promoting corruption-related

discussions and those promoting Kurdish separatist movements. In addition, our analysis

provides important findings that indicate that there is a third, much smaller group that is also

censored. This group is more eclectic, with roughly a half representing tweets that are pro-

government in nature. The users in this group turn out to be mostly individuals who insult key

political figures by using vulgar language.

The paper is structured as follows. We first explain the construction of data flow graphs

for detecting influential users. This is followed by a presentation of how we detect and

visualize social media user communities. We then present our analysis on topic clustering for

each of the largest communities. Finally we present some preliminary conclusions and discuss

some of our ongoing and future work.

2. Finding Influential Users

Graph metrics have been previously used by researchers to study social network

characteristics (Bonneau et al. 2009). In political science, more specifically, network analysis has

grown widely popular (Ward et al. 2011). In our research, we focus on finding influential users

and popular censored topics using a novel method to generate a dynamic data flow graph that

represents the data flow between the users.

2 For more information, see Twitter transparency reports: https://transparency.twitter.com

2.1. Related Work


Yamaguchi et al. (2010) presented a method for identifying authoritative (i.e. influential) users

on Twitter that examines information flow using data link analysis and assigns

an authoritative score to each user in the social graph. They argue that their approach is unlike

the existing methods that only deal with static following and follower relationships between users

(Java et al. 2007; Weng et al. 2010). Twitter users can be categorized as information

generators who post important tweets and tend to have many followers, and information seekers

who are users that follow a large number of users and are interested in reading more than posting

tweets (Yamaguchi et al. 2010). The following/follower relationships do not tell us much about

the level of interaction between these users because many users follow back other users as a

courtesy. Huberman and Romero (2009) found that the vast majority of users in the

following/follower relationship do not interact with each other, so examining the link between

thousands of users (in many cases millions of users) is not useful. Additionally, over the years

Twitter attracted a large number of automated accounts known as bots that tend to randomly

follow a large number of users and retweet hoping to get followed back (Chu et al. 2010). For

these reasons, using the static followers/following relationship to identify influential users is not

very useful.

The Yamaguchi et al. rating algorithm considered the relationship from a user u to other

users that u follows, which is a static relationship that can be obtained from Twitter’s API. Next,

they considered the retweet dynamic relationship. The social graph they built consists of nodes

and edges, where nodes are user-nodes and tweet-nodes. The edges can be one or more of the

following relationships: a link from user u to a tweet that u posted, a link from user u to a user

that u is following, a link from tweet t to a tweet retweeted by t. After they obtain the graph, they

get an authoritative score based on the calculated weight of the edges. Twitter allows users to

mention, reply to, and retweet other users that are not directly connected (not following each other).

This could occur for example, when the tweet is found in the public feed, or posted by some

intermediary users.

We argue that although their approach can be better than the existing methods, it only

considers users that are directly connected via following/follower relationships, and ignores users

that received the tweet from their timeline but are not directly connected to the user who

generated the tweet. This means that their approach missed many users that could also be

influential. Additionally, their approach only considers the retweet relationship, and ignores the

links between users that are mentioned or replied to in the tweets.

2.2. Twitter Data Structure

Twitter offers many free APIs for collecting public tweets, including the Streaming API3,

which captures continuous live tweets up to 1% of total Twitter feed, and the REST API4 for

retrieving historical tweets, such as tweets of specific users, or tweets by tweet ID.

In this section we provide a brief description of some of the important fields in the tweet data

block returned by the Twitter API. A typical tweet data block looks similar to the example shown in

Figure 1, which we truncated for brevity.

3 Streaming API: https://dev.twitter.com/streaming/overview

4 REST API: https://dev.twitter.com/rest/public


{"_id" : ObjectId("562c277d19a39eb2d8624f1a"),
 "contributors" : null,
 "truncated" : false,
 "text" : "RT @Jiyanizm: Bu defa hesabımı kapattırmaya kararlılar bu vitaminsizler\n\nCok yogun spamlaniyorum arkadaslar ...",
 "is_quote_status" : false,
 "in_reply_to_status_id" : null,
 "id" : NumberLong("625374103214653442"),
 "favorite_count" : 0,
 "source" : "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
 "retweeted" : false,
 "coordinates" : null,
 "entities" : {"symbols" : [ ], "user_mentions" : [ ], "hashtags" : [ ], "urls" : [ ]},
 "in_reply_to_screen_name" : null,
 "in_reply_to_user_id" : null,
 "retweet_count" : 23,
 "id_str" : "625374103214653442",
 "favorited" : false,
 "retweeted_status" : {
   "contributors" : null,
   "truncated" : false,
   "text" : "Bu defa hesabımı kapattırmaya kararlılar bu vitaminsizler\n\nCok yogun spamlaniyorum arkadaslar ...",…}

Figure 1: Sample tweet returned by Twitter API

Some of the fields we present below could be absent or NULL. These fields help us determine when tweets are: retweets, replies, mentions, original, and censored5:

1. "user": this field contains attributes of the tweeting user, including user-name, screen-name, user-id, profile description, and many other attributes.

2. "retweeted_status": this field, when present, means that the tweet is a retweet, and it contains the representation of the original tweet including attributes of the original user and the original tweet. Note that a retweeted-retweet does not contain attributes of the intermediary retweet, but only the original tweet. For this reason, we consider the relationship between the retweet and the original tweet logical, because the users are not directly connected in the social graph.

5 Tweets’ fields guide: https://dev.twitter.com/overview/api/tweets

3. "user_mentions": is a sub-field that contains a list of users being mentioned in the actual

text.

4. "in_reply_to_screen_name": this field indicates that the tweet is a reply. When present, it

contains the screen-name of the original user.

5. "withheld_in_countries": this field is only present if the tweet is censored. When

present, it contains a list of ISO-country codes in which the tweet is censored. For example,

"withheld_in_countries": [“TR”], means that the tweet is censored in Turkey only6.

Previous work by Tanash et al. (2015) validated that when this field is present, the tweet is

greyed-out when viewed from inside of the censoring country.
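The field checks listed above can be sketched in a few lines of Python. This is only an illustration, assuming each tweet has already been parsed from JSON into a dictionary; the function name `classify_tweet` and the label strings are hypothetical and are not taken from the authors' code.

```python
def classify_tweet(tweet):
    """Return a set of labels for one tweet dict parsed from the API's JSON.

    The labels mirror the fields described above. Absent or null fields are
    treated as "not set".
    """
    labels = set()
    if tweet.get("retweeted_status"):
        labels.add("retweet")
    if tweet.get("in_reply_to_screen_name"):
        labels.add("reply")
    if not labels:
        # Neither a retweet nor a reply: treat the tweet as original.
        labels.add("original")
    entities = tweet.get("entities") or {}
    if entities.get("user_mentions"):
        labels.add("mention")
    if tweet.get("withheld_in_countries"):
        labels.add("censored")
    return labels


# A tweet shaped like Figure 1: a retweet with an empty mention list.
figure_like = {
    "text": "RT @Jiyanizm: ...",
    "entities": {"user_mentions": []},
    "in_reply_to_screen_name": None,
    "retweeted_status": {"text": "..."},
}
print(classify_tweet(figure_like))
```

Keeping the censorship label independent of the other labels reflects the text above: "withheld_in_countries" can appear on any kind of tweet.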

It is important to note that tweets may have a slightly different data block structure

depending on the API being used. Specifically, we discovered that a retweet collected using the

Streaming API does not include the "retweeted_status" field, appearing only as an original

tweet. When the same tweet is recollected using the REST API, it includes the

"retweeted_status" field, identifying the original user attributes. This finding is important as it

allows us to find additional users that may be of interest to the Turkish censors.

2.3. Data Collection

We used the same crawling methodologies and some data described in Tanash et al. (2015),

to get over 25 million tweets from Turkey between October 2014 and September 2015. We ran a

streamer from October 2014 until January 2015, and during the month of June 2015 to specifically capture tweets related to the Turkish election event. We also used the REST API to collect tweets from users’ timelines for those found in our streamed dataset. Overall, we identified 712,218 unique7 censored tweets in Turkey, each with a unique tweet id, of which 96% are retweets (contained the “retweeted_status” field), associated with 13,056 distinct users.

6 Twitter uses “to withhold” instead of “to censor”. Throughout the paper we use these two words interchangeably.

Tweets may include additional strings (prefixes) that get automatically added by Twitter

depending on how the user chooses to post tweets, for example, the prefix “RT@username”

precedes the actual text when the post is a retweet. Additionally, if the tweet has a URL, Twitter

automatically converts the URL to a mini URL to reduce the number of characters, because the

size of the tweet cannot exceed 140 characters8. To find the number of identical messages, we

removed the prefixes and the mini URLs from all posts, and found 535,759 unique strings9. The

numbers we present above do not include tweets posted by 258 users that are withheld-accounts

in Turkey, because withheld-accounts are users for whom all posts are censored in specific

countries regardless of the tweet content.

2.4. Finding Censored Tweets

After collecting the tweets, we waited for a few weeks and then recollected the tweets repeatedly for over a month, because Twitter censorship occurs days or months after a tweet is posted. We used the REST API endpoint statuses/show/:id, which returns historical tweets given their tweet ids as input parameters. We then checked whether the “withheld_in_countries” field had been added by Twitter to the tweet data block; if so, we marked the tweet as censored (Tanash et al. 2015).
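The re-collection check can be sketched as follows. This is a minimal illustration, not the authors' crawler: `fetch_tweet` is a hypothetical caller-supplied function standing in for a request to statuses/show/:id, and it is assumed to return the tweet as a dictionary, or None when the tweet can no longer be retrieved.

```python
def find_censored(tweet_ids, fetch_tweet, country="TR"):
    """Return the ids whose re-fetched data block is withheld in `country`.

    `fetch_tweet` is a hypothetical stand-in for the REST lookup: it takes a
    tweet id and returns a dict, or None if the tweet is unavailable.
    """
    censored = []
    for tweet_id in tweet_ids:
        tweet = fetch_tweet(tweet_id)
        if tweet is None:
            continue  # deleted or otherwise unavailable; nothing to check
        withheld = tweet.get("withheld_in_countries") or []
        if country in withheld:
            censored.append(tweet_id)
    return censored


# Usage with an in-memory stand-in for the API.
fake_store = {
    1: {"id": 1, "withheld_in_countries": ["TR"]},
    2: {"id": 2},
    3: {"id": 3, "withheld_in_countries": ["DE"]},
}
print(find_censored([1, 2, 3, 4], fake_store.get))
```

Because censorship may be applied long after posting, such a check would be rerun over the same ids at intervals, as described above.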

7 Each tweet has a unique tweet id; we removed duplicate ids.

8 Twitter recently changed the 140-character limit on posts, but our data was collected prior to the change: https://blog.twitter.com/express-even-more-in-140-characters.

9 This process is described in detail in the “Data Preprocessing” section of this paper.


In the following section, we introduce a new simple algorithm to extract influential users by

converting tweets into a dynamic data flow graph that represents users’ communication.

2.5. Constructing Data Flow Graph

We introduce a simple algorithm for constructing a complete data-flow graph using a

wide variety of metadata embedded in tweets to follow data paths of censored tweets, instead of

the traditional approach of constructing such graphs using friendship relationships and retweets

only. To model the data flow between users in our Twitter dataset, we converted the censored tweets to a directed graph 𝐺(𝐸, 𝑈), where 𝑈 is the set of all nodes in our data (each user is a node) and 𝐸 is the set of directed edges representing the data-flow links between users. Edges are added to the graph by examining every tweet’s JSON data structure and applying the appropriate rule defined in Table 1. The rules are as follows. (1) When a tweet is original (𝑂), the node is a singleton, and no edge is added. (2) When the tweet is original but contains a "user_mentions" subfield, we add an edge from the original user to each of the mentioned users in the list, because a mentioned user receives the tweet, so the data flows from the author to those who are mentioned. (3) When the tweet contains a "retweeted_status" field, we add an edge from the original user to the retweeting user, because the source of the data is the original user. (4) When the tweet contains a "retweeted_status" field and the embedded original tweet (source) contains a "user_mentions" subfield, we first apply rule (3) and then add an edge from the original user to each of the mentioned users. (5) When the tweet contains an "in_reply_to_screen_name" field, we add an edge from the replying user to the original user, because a reply generates new data that flows from the replying user to the original user.
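The five rules can be sketched in Python as shown below. This is a minimal illustration under stated assumptions, not the authors' implementation: users are identified by screen name, edges are stored as a set of (source, target) pairs so repeated interactions collapse into one edge, and the function names are hypothetical.

```python
def add_edges(tweet, edges):
    """Apply data-flow rules (1)-(5) to one tweet dict, adding (source, target) pairs."""
    author = tweet["user"]["screen_name"]
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        # Rules (3) and (4): data flows from the original author to the
        # retweeter and to everyone mentioned in the original tweet.
        source = retweeted["user"]["screen_name"]
        edges.add((source, author))
        mentions = (retweeted.get("entities") or {}).get("user_mentions") or []
    else:
        # Rules (1) and (2): an original tweet adds edges only to mentioned users.
        source = author
        mentions = (tweet.get("entities") or {}).get("user_mentions") or []
    for mention in mentions:
        edges.add((source, mention["screen_name"]))
    # Rule (5): a reply sends new data from the replying user to the original user.
    reply_to = tweet.get("in_reply_to_screen_name")
    if reply_to:
        edges.add((author, reply_to))


def build_graph(tweets):
    """Return (nodes, edges) for a list of tweet dicts."""
    nodes, edges = set(), set()
    for tweet in tweets:
        nodes.add(tweet["user"]["screen_name"])  # singletons stay in the node set
        add_edges(tweet, edges)
    for source, target in edges:
        nodes.update((source, target))
    return nodes, edges


sample = [
    {"user": {"screen_name": "a"}},  # rule (1)
    {"user": {"screen_name": "a"},
     "entities": {"user_mentions": [{"screen_name": "b"}]}},  # rule (2)
    {"user": {"screen_name": "c"},
     "retweeted_status": {"user": {"screen_name": "a"},
                          "entities": {"user_mentions": [{"screen_name": "d"}]}}},  # rules (3), (4)
    {"user": {"screen_name": "e"}, "in_reply_to_screen_name": "a"},  # rule (5)
]
nodes, edges = build_graph(sample)
print(sorted(edges))
```

Users such as "d" enter the graph only through metadata, which is how the method picks up accounts that were never crawled directly.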


Table (1): Data flow rules for adding directed edges

Note that because Twitter allows users to mention, reply to, and retweet other users that are not directly connected (not following each other), our algorithm allows us to capture additional users from the metadata that we were not specifically crawling, while also identifying new users who can also be influential. Users with the highest out-degree (users who generated the most data) are the users whose tweets reached the most users in our graph, and we consider them influential. We discuss the analysis of influential users from communities in the following sections.
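Ranking users by out-degree can be sketched as follows, assuming the graph is stored as a set of (source, target) pairs of screen names. The function name `rank_by_out_degree` and the toy edge set are hypothetical.

```python
from collections import Counter


def rank_by_out_degree(edges, top=5):
    """Return the `top` users with the most outgoing edges as (user, out_degree) pairs."""
    out_degree = Counter(source for source, _target in edges)
    return out_degree.most_common(top)


toy_edges = {("a", "b"), ("a", "c"), ("a", "d"), ("e", "a")}
print(rank_by_out_degree(toy_edges, top=2))
```

In this toy graph, user "a" reaches three other users and would be ranked as the most influential node.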

3. Community Detection

Now that we have constructed our social data-flow graph, we can apply graph clustering to

identify communities by maximizing modularity, and analyze influential users from each

community. Modularity is a widely used metric for extracting modules in network structures

(Newman 2006) and is used for studying communities in social networks (Bonneau et al. 2009).

Each module or community consists of densely connected nodes with scarce connections to other

modules. In social networks, friends tend to communicate with each other more often than they

do with other nodes that are members of other communities. Studying communities and their

members is practically useful for conveying important information about common topics and

users’ characteristics.


Modularity is defined by Newman (2006) and given in formula (1), where $G$ is a directed graph. Modularity $Q$ is defined as the number of edges that exist within communities in $G$, minus the number of such edges expected if the edges were distributed randomly. $Q$ is calculated as follows:

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \qquad (1)$$

where $m$ is the total number of edges in the directed graph $G$, $v$ and $w$ are two nodes, and $k_v$ and $k_w$ are the out-degrees of $v$ and $w$ respectively. When there is an edge between $v$ and $w$, $A_{vw} = 1$; otherwise it is 0 (Bonneau et al. 2009). For $Q$ to measure within-community edges, the sum runs over pairs of nodes $v$, $w$ assigned to the same community. Dividing $G$ into two communities, the algorithm is then repeated recursively until $Q$ is maximized. The Louvain modularity method is

known to run faster by using greedy optimization (Blondel et al. 2008), which is appropriate for

processing very large networks. We decided to use a Python library10 that uses the Louvain

modularity method. Applying modularity to our graph, we identified a total of 25 communities,

and found that there are five large communities, with the number of members per community

ranging from 7,791 to 4,879. In order to plot the top communities, we used Gephi, an open-source

visualization tool that uses a 3D engine to display graphs in real time and detects underlying

patterns that can tell a story about the data. We found Gephi to work fast when dealing with large

dynamic networks. It also comes with many useful functionalities such as filtering, clustering,

analysis, and exporting11. Figure 2 shows the top five communities created using Gephi with the ForceAtlas layout12. Each cluster is shown in a different color: red, green, blue, purple, and yellow. Since the remaining communities had relatively few members, ranging from 386 to 2 users, we omitted them from Figure 2.

10 Community detection for NetworkX documentation: http://perso.crans.org/aynaud/communities/

11 For more info visit www.gephi.org
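To make formula (1) concrete, the sketch below computes $Q$ for a given assignment of nodes to communities. It does not reproduce the Louvain library used in the paper; it only evaluates the modularity score that such a method tries to maximize. It assumes the graph is treated as undirected, with each edge stored once as a (v, w) pair and no self-loops, and it uses the equivalent per-community form of the sum: the fraction of edges inside each community minus the squared fraction of edge endpoints attached to it. The function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected graph.

    `edges` is a non-empty set of (v, w) pairs, each edge listed once.
    `community_of` maps every node to a community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    inside = defaultdict(int)  # edges with both endpoints in the same community
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
        if community_of[v] == community_of[w]:
            inside[community_of[v]] += 1
    degree_sum = defaultdict(int)  # total degree of the nodes in each community
    for node, k in degree.items():
        degree_sum[community_of[node]] += k
    return sum(inside[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single edge.
toy = {(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)}
by_triangle = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(modularity(toy, by_triangle))  # 5/14, about 0.357
```

Placing all six nodes in one community gives $Q = 0$, so splitting the toy graph by triangle is the better partition under this score.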

Figure (2): Communities of censored tweets in Turkey, October 2014 through September 2015

Since the graph is constructed with the data-flow rules defined in Table 1, we applied the vertex out-degree metric: users with the highest out-degree are the users whose tweets reached the most users in our graph. These are the influential users. Using our visualization tool, we adjusted the size of each node to correspond to its out-degree so that it is easier to visually spot the most influential users (see Figure 3). Notice that the purple community is set further away from the remaining communities in this layout. This is because the nodes in the purple community are less connected to the other communities. On the other hand, the red, green, yellow, and blue communities appear tightly connected and closer to each other, which suggests that users from these communities share common attributes, such as social contacts and topics of interest.

12 A layout algorithm sets the graph shape so that it is more aesthetically pleasing. The ForceAtlas layout places connected nodes closer together and pushes unconnected nodes apart. https://gephi.org/tutorials/gephi-tutorial-layouts.pdf


Figure (3): Communities of censored tweets in Turkey, October 2014 through September 2015, with out-degree ranking for nodes.

3.1. Analysis of Influential Users in the Top 5 Communities

To confirm that these users are influential, we manually examined their profiles. The

community in purple is the largest community. The most influential users are fuatavnifuat,


HARAMZADELER333, BASCALAN, csagir2015, mehtabyuceel, and TheRedHack. One

important common feature of all these accounts is their active social media involvement during

the corruption scandal of December 2013. For example, Fuat Avni (@fuatavnifuat) has been

acting like a government insider leaking Erdogan’s strategies since the scandal was uncovered.

HARAMZADELER333 is the account that leaked corruption evidence such as legal tape

recordings between politicians and businessmen through YouTube and Twitter (Sozeri 2015).

The communities with red, green, and blue colors are the next three largest communities.

The influential users in these communities are overwhelmingly the advocates of the Kurdish

movement. These include AjansaKurdi1, AjansaKurdi2, Kurd24M, Diyarbakir7, curdistani, and

ROJOVA. AjansaKurdi1 and AjansaKurdi2 are the Twitter accounts of the Kurdish news site

www.ajansakurdi.com. Kurd24M is also the Twitter account of a news source. Curdistani is another

pro-Kurdish user who regularly tweets promoting PKK and other Kurdish groups.

In addition to the top five communities, we identified one specific community with 387

nodes, shown in Figure 4. The graph representing this community takes the shape of a star

network, in which all of the nodes in this community are directly connected to the focal node

orgidee79 which acts as a hub, and information source. We categorized this as a porn community

of censored tweets because the content and the screen names contained sexual language. This

suggests that Turkish censors also target pornographic topics and users.


Figure (4): Porn Community

4. Topic Clustering

4.1. Data Preprocessing

Before feeding tweet texts into our topic classification model, we applied a sequence of

text processing techniques to ensure validity, accuracy, and relevance of our data. The first step

is extracting original tweets from retweets. Twitter is a popular social media platform whose feed

consists of a large number of retweets. Indeed, in our dataset 96% of all posts are retweets. If a

tweet is a retweet, its JSON structure will have a non-empty retweeted_status field. In our top

five communities, the average share of retweets is almost 99%. Since retweets are endorsements for

original tweets and add little extra information, we decided to transform retweets into original

tweets by extracting the original content of the tweet. This approach also makes data de-duplication

easier as many RT@username prefixes are removed automatically. To determine

whether a tweet is a retweet or not, we checked its JSON structure. If retweeted_status field is

not empty, we extract the original tweet embedded in that field.

The second step is data de-duplication. Even though all tweets are original in the sense

that their retweeted_status field is empty, we still found a number of tweets that are almost

identical to each other. A further investigation revealed an RT@username + text structure. This

structure is usually generated when a user retweets another user but the resulting tweet is original,

i.e., has an empty retweeted_status field. We removed all RT@username structures that

appear at the beginning of a tweet and hashed every tweet to ensure uniqueness. At the end of

step two, between 3 and 25 percent of tweets, depending on the community, had been eliminated

as duplicates.
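The first two steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the prefix pattern is a guess at the RT@username structure, SHA-1 is an arbitrary choice because the paper does not name the hash function, and the function names are hypothetical.

```python
import hashlib
import re

# Assumed shape of the retweet prefix, e.g. "RT @user: " or "RT@user ".
RT_PREFIX = re.compile(r"^\s*RT\s*@\w+:?\s*", re.IGNORECASE)


def original_text(tweet):
    """Step one: use the embedded original tweet when retweeted_status is set."""
    retweeted = tweet.get("retweeted_status")
    return retweeted["text"] if retweeted else tweet["text"]


def deduplicate(tweets):
    """Step two: strip leading RT prefixes and keep one copy of each text."""
    seen, unique = set(), []
    for tweet in tweets:
        text = RT_PREFIX.sub("", original_text(tweet)).strip()
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


toy_tweets = [
    {"text": "RT @a: hello world", "retweeted_status": {"text": "hello world"}},
    {"text": "RT @a: hello world"},  # looks like a retweet, but has no retweeted_status
    {"text": "hello world"},
    {"text": "something else"},
]
print(deduplicate(toy_tweets))
```

The second toy tweet is the case described above: its retweeted_status field is empty, so only the prefix removal reveals it as a duplicate.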

At this point, our tweet texts are still not ready to be analyzed due to high variability between similar words. For example, a human who sees the three words university, University and university’s knows that they are all related to university. For a computer, however, those three words are coded differently and are therefore distinct from each other. A common text processing technique is to make all letters lowercase. Besides doing this, we removed common Turkish stop words13 as well as all occurrences of &gt; and &lt;, which are merely encodings for the > and < symbols (see Example One below). We removed any syllables that came immediately after an apostrophe because they are used only for grammatical completeness. We also removed URLs appended at the end of each tweet. Since Twitter automatically shortens and truncates those URLs, it would be very difficult for us to decode the content behind each URL. As an illustration of this process, Example Two shows a raw tweet text and its transformed version after data preprocessing.

13 For the Google Code list of stop words, see https://code.google.com/p/stop-words/.


Example One:

Tweet text shown on Twitter webpage:

>>>>AKPTERÖRÖRGÜTÜ<<<<

Tweet text encoded in our database:

&gt;&gt;&gt;&gt;AKPTERÖRÖRGÜTÜ&lt;&lt;&lt;&lt;

Example Two:

Raw tweet text:

RT @PartizanChe: Ambulans ve sağlık ekiplerinden önce Toma geldi. Yazıklar olsun böyle

ülkeye! . #SuruçtaKatliamVar http://t.co/n9P5oGPHQ2

Tweet text after preprocessing:

ambulans ve sağlık ekiplerinden önce toma geldi yazıklar olsun böyle ülkeye .suruçtakatliamvar
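The cleaning steps described in this section can be sketched as a single function. This is a rough illustration, not the authors' pipeline: the regular expressions are assumptions about what the prefixes, URLs, and encodings look like, the stop-word set is supplied by the caller, and the function does not attempt to reproduce the punctuation handling visible in Example Two.

```python
import re

# Assumed patterns for the cleaning steps described above.
RT_PREFIX = re.compile(r"^\s*RT\s*@\w+:?\s*")
URL = re.compile(r"https?://\S+")
ENTITY = re.compile(r"&(?:gt|lt);?")
APOSTROPHE_SUFFIX = re.compile(r"['’]\w+")


def preprocess(text, stop_words=frozenset()):
    """Strip the RT prefix, URLs, &gt;/&lt; encodings and apostrophe suffixes,
    lowercase the text, and drop stop words."""
    text = RT_PREFIX.sub("", text)
    text = URL.sub("", text)
    text = ENTITY.sub("", text)
    text = APOSTROPHE_SUFFIX.sub("", text)
    # Note: str.lower() is not Turkish-aware ("I" becomes "i", not "ı").
    text = text.lower()
    return " ".join(word for word in text.split() if word not in stop_words)


print(preprocess("RT @user: University's &gt;&gt; News http://t.co/abc",
                 stop_words={"news"}))
```

A production version for Turkish text would need locale-aware lowercasing, which the standard `str.lower()` call does not provide.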

4.2. Stemming

We also applied a stemming algorithm as a part of the preprocessing. Stemming is the

process of removing prefixes and suffixes (also called morphemes) so as to reduce derivational

differences among words with the same root meaning. For instance, stemming collapses words

university and universities into one word regardless of their grammatical tenses or pluralities. A

variety of open-source Turkish stemmers are available on the Internet, and each has pros and cons. Packages such as the Python snowballstemmer14 are lightweight and easily customized, but do not process the Turkish language. We decided to use the zemberek-nlp package because it is both widely used and well researched (Akın and Akın 2007). One limitation of this package, however, is that the program returns an empty string if there is no match for the given input word. In our Twitter texts there are names (Erdoğan) and hashtags (AjansaKurdî) that are neither stemmable nor in the package’s default dictionary. As a workaround, we modified the package to return the original word if no match is found. Example Three shows a raw tweet text and its stemmed version.

14 Python package: https://pypi.python.org/pypi/snowballstemmer

Example Three:

Raw tweet text:

gelen, 6 milyon liralik kasa açiği ile 10 milyon liralik usulsüz yani naylon fatura ile

yapilan soygunu çözmek için harekete geçen.

Stemmed tweet text:

gelen, 6 milyon lira kas açığ ile 10 milyon lira usul yani naylon fatura ile yapılan soygun

çöz için hareket geçen

In this example, the algorithm performed very well apart from stemming “kasa” as “kas”. It failed on this word because the suffix “a” (meaning “to” in Turkish) makes the form ambiguous. In Turkish, “kasa” means “money case”. However, it can also be read as “to the muscle”, where “muscle” is “kas”.
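The fallback behaviour described above can be sketched as a thin wrapper. Note that the paper modified the stemming package itself, whereas this sketch wraps an arbitrary stemming function from the outside; `stem` is a hypothetical callable that returns a stem or an empty string, and the toy dictionary only stands in for a real stemmer.

```python
def stem_with_fallback(word, stem):
    """Return stem(word), or the original word when the stemmer finds no match.

    `stem` is any callable that returns an empty string for unknown words
    (a hypothetical stand-in for the Turkish stemmer).
    """
    result = stem(word)
    return result if result else word


# Toy stemmer for illustration only: known words map to a stem, others to "".
toy_stems = {"liralik": "lira", "soygunu": "soygun"}


def toy_stem(word):
    return toy_stems.get(word, "")


print([stem_with_fallback(w, toy_stem) for w in ["liralik", "Erdoğan"]])
```

Names and hashtags therefore pass through unchanged instead of being dropped as empty strings.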

4.3. Using tf-idf and NMF for Topic Clustering

Tf-idf is a popular measurement used in text mining and topic clustering. Tf stands for term frequency. According to Sparck Jones (1972), the frequency of a term is positively related to the weight of that term in a document. However, some common words such as the, a, and an have high term frequency in most documents. For this reason, tf by itself tends to be a biased measurement. Idf, which stands for inverse document frequency, is used to reduce the weight of common words. The more common a word is, the smaller its inverse document frequency. The tf-idf value of a word is the product of its term frequency and inverse document frequency.
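As a small worked example, the sketch below computes tf-idf weights for tokenized documents. The paper does not state which tf and idf variants were used, so this assumes one common textbook form: tf is the term count divided by the document length, and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. Libraries often apply smoothed variants instead. The function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    `documents` is a list of token lists. Assumes tf = count / document length
    and idf = ln(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["thief", "steal"], ["thief", "police"]]
for row in tf_idf(docs):
    print(row)
```

Because "thief" occurs in both toy documents, its idf is ln(1) = 0 and its weight vanishes, which is the down-weighting of common words described above.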

After data preprocessing and stemming, each tweet is transformed into a vector of words. The length of the vector is equal to the total number of unique words that appear across the tweets. In practice, it is usually recommended to eliminate highly frequent and infrequent words so as to reduce information redundancy and facilitate subsequent computation. Therefore, we ignored words that appear in more than ninety-five percent of tweets and words that appear fewer than twice in total. As a result, our tweet vectors become more compact without losing much information.

To extract meaningful topics from the vectors of tweets, we used an unsupervised machine learning algorithm because tweets are unlabeled, and we do not have prior knowledge of the topics or the number of topics. Non-negative matrix factorization (NMF) is an unsupervised topic classification algorithm widely used in data mining and text retrieval. It has been shown to be successful in clustering Wikipedia articles (Finn 2008) and tweet topics (Tanash et al. 2015). In our dataset each tweet vector is treated as one row in a term-document matrix generated from tf-idf. The NMF algorithm factors the term-document matrix into a term-topic matrix and a document-topic matrix. The term-topic matrix has a dimension of 𝑛 × 𝑚, where 𝑛 is the number of topics, a parameter that requires tuning, and 𝑚 is the total number of words. We set 𝑛 equal to 10, and generated groups of topics for each community that we obtained in the previous section15. The following section discusses the findings of this analysis.
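The factorization can be written out as follows. The symbols $V$, $W$, $H$, and $d$ (the number of tweets) are introduced here for illustration and do not appear in the original text, and the squared Frobenius norm is one common choice of objective, assumed here because the paper does not state which objective was used.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{d \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{d \times n}, \quad
H \in \mathbb{R}_{\ge 0}^{n \times m}

\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^2
```

Here $V$ is the tf-idf matrix with one row per tweet, $W$ gives each tweet's weight on the $n$ topics, and each row of $H$ gives one topic's weights over the $m$ words. The highest-weighted words in a row of $H$ are the keywords reported for that topic.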

15 n can be set to any number depending on how many topics we wish to extract. We also tried setting n equal to 5 and found that the results did not change substantively. We do not report those results due to space limitations.


5. Topic Analysis per Communities

Table 2 lists the topics and usernames for each community. For simplicity of presentation,

we translated and present only two topics per community. Full lists can be found in

Appendix B. The largest community has 7,791 users. The topics from this community reference

the corruption scandal of December 2013 where Tayyip Erdoğan’s sons (Bilal and Burak) were

frequently mentioned in indictments. Particularly, the voice recordings between Bilal Erdoğan

and Recep Tayyip Erdoğan quickly became famous after the tapes were leaked and published

online. Words such as thief, Tayyoş (a rude way of saying Tayyip), steal, Erdoğan, Bilal, voice,

Tayyip, and Burak are referring to this issue. Also the words, Turkey and new, refer to the “new

Turkey” concept created by Turkey’s ruling AKP party. The supporters of AKP use this term to

describe their role since 2002, which, as they argue, was to replace the old secular Turkish

republic founded in 1923. The top usernames belonging to this community also refer to the

corruption scandal. For example BASCALAN can be translated as PRIMETHIEF, which was

intended to refer to the Prime Minister (Basbakan) Erdoğan. We also checked the other

usernames, and found that all are anti-government users who have actively tweeted regarding the

corruption scandal.

The topics from the remaining top communities are overwhelmingly about the Kurdish issue. We frequently see words like soldier, police, aircraft, and AKP together with words such as bomb, murder, kill, massacre, arrest, gas, helicopter, war, shoot, and guerilla. The topics also include the names of locations where the Turkish Armed Forces and the PKK/HPG fought, including Silopi, Amed (the Kurdish name for Diyarbakır), Bismil, and Cizre. During the military strikes, these distinct pro-Kurdish accounts regularly claimed that the Turkish armed forces were killing civilians, including children and women. As a result, the following words occur very often among the hot topics: civilian, age, shoot, child, murder, woman, massacre, kill, police, and soldier. The usernames also have pro-Kurdish connotations, such as AjansaKurdi1, AjansaKurdi2, Kurd24M, and curdistani. The remaining user accounts are also extremely outspoken, have criticized the government on Kurdish issues, and support the anti-government pro-Kurdish movement. This is why there is more overlap between the blue, yellow, green, and red communities in Figure 3.

Table 2: Topic Analysis for Large Communities

Community with length 7791
Topics: (1) thief, Tayyoş, police, leave, protect, steal, medium; (2) Turkey, new, Erdoğan, Bilal, Tayyip, country, Burak, take, voice
Usernames: Mehtapyucell, csagir2014, Fili_Z_, fuatavnifuat, BASCALAN

Community with length 5276
Topics: (1) police, soldier, die, kill, HPG, civilian, age, shoot, child, murder; (2) Kurd, do, make, AKP, Turk, massacre, child, war, speak, peace
Usernames: AjansaKurdi2, Hevalizmir, sisivas, xerzan4, KemaIPir

Community with length 5241
Topics: (1) do, police, people, soldier, PKK, guerilla, make, HPG, Amed, die; (2) Cizre, Ajansakurdi, child, walk, murder, voice, woman, police, leave, Cizreunderattack
Usernames: AjansaKurdi1, mednuce, YilmazGedik, Kurd24M, Rojava

Community with length 5079
Topics: (1) do, follow, continue, account, protest, make, resist, lynch, new, friend; (2) police, strike, age, striking, arrest, gas, remember, worker, weapon, throw
Usernames: Denizhuseyinulas, birlesik, Daglaradogru, dewedersim, Ortak__Platform

Community with length 4879
Topics: (1) minute, last, police, Silopi, soldier, murder, helicopter, Bismil, Amed, start; (2) PKK, do, action, announce, guerilla, decision, take, absent, aircraft, bomb
Usernames: curdistani, RebellionKurde, Nisebiin, SeyithanE, Kullikwebun

5.1. Topics from Singleton Users

Our analysis also revealed singleton nodes, which we define as users with no links (e.g., retweets or replies) to other users in our dataset. Although these users are not influential in our graph, we found that Turkish censorship authorities still target them. We analyzed the content of the 547 censored singleton tweets to understand why the authorities may have withheld them. As depicted in Figure 5, these users have no links to other users in the graph. Instead of addressing specific topics (such as corruption or the Kurdish issue), these tweets are mostly about particular individuals (pro- or anti-government), and the overwhelming majority contain insulting language. It is likely that the people who were insulted initiated the court process for censorship. The largest group in this category consists of tweets targeting the businessman Aydın Doğan, who also owns a large media group; he is insulted in a total of 233 tweets. Although his media group recently toned down its criticism of the government, it was one of the most outspoken critics of the government during and after the Gezi Park protests. This suggests that many of these tweets were created by users who supported the government.16 We also identified 116 tweets that insult members of the AKP government, such as Recep Tayyip Erdoğan (current president), Ahmet Davutoğlu (former Prime Minister), and Lütfi Elvan (former Minister of Transport, Maritime, and Communication). The remaining 198 tweets target less prominent individuals but still use insulting language.

16 For example, the tweet “Sürüngen Aydın doğan ve yandaşları, tek tabanca RECEP TAYYIP ERDOĞAN”, can be translated to “Creeper Aydın Doğan and his team versus lonely pistol (warrior) Recep Tayyip Erdoğan”.
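The singleton definition used in this section can be expressed directly on an edge list. The sketch below is illustrative only: the function name and the toy data are hypothetical, and it assumes that a self-loop (for example, a user replying to their own tweet) does not count as a link to another user.

```python
def find_singletons(users, edges):
    """Return the users that share no edge with any other user.

    users: iterable of user identifiers.
    edges: iterable of (source, target) pairs, e.g. retweets or replies.
    """
    linked = set()
    for source, target in edges:
        # Assumption: a self-loop does not connect a user to anyone else.
        if source != target:
            linked.update((source, target))
    return sorted(set(users) - linked)


if __name__ == "__main__":
    # Hypothetical toy graph: "c" only interacts with itself, "d" has no edges.
    users = ["a", "b", "c", "d"]
    edges = [("a", "b"), ("c", "c")]
    print(find_singletons(users, edges))
```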


Figure 5: Users without edges (singletons)

6. Future Work

There are several directions for future work related to this research. One direction would be to expand our work to include other social media platforms as data sources. However, because the vast majority of posts on platforms such as Facebook tend to be private, unlike those on Twitter, we believe it will be difficult to collect enough public data to repeat the same experiment.

Another direction would be to use a machine learning classifier to detect, immediately after they are posted, tweets that may be censored in the future. An obvious choice would be a Naïve Bayes classifier. An accurate classifier would enable researchers to collect live tweets from the Twitter Streaming API and estimate the proportion of withheld tweets on the fly, without having to check withheld statuses every several weeks or months. Additionally, Twitter’s Transparency Report has been found to underreport the true population of withheld tweets (Tanash et al. 2015), so the estimates from a tweet classifier could be used to cross-validate the actual scope of censorship.

Our preliminary test on 20,000 sampled tweets, 16,000 of which were used for training and 4,000 for testing, returned an average accuracy of 86% using the tf-idf transformation. In terms of predicting future tweets, we are still tuning our classifier and have yet to reach a definitive solution. At present we face two major challenges. The first is the lack of sufficiently recent censored data against which to verify our classifier, because there is always a delay between the time a tweet is created and the time it is withheld. The second is the evolving nature of Internet censorship: as new social events happen and new political leaders emerge, the censored keywords continue to change. More sophisticated classifiers are required to make better predictions. This is ongoing work, and we hope to report more results in the near future.
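To make the classification setup concrete, the following is a minimal sketch of a Naïve Bayes classifier trained on tf-idf weights with an 80/20 train/test split. It is not the classifier used in our preliminary test: the multinomial variant, the Laplace smoothing, the whitespace tokenizer, the class and function names, and the toy tweets are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


class TfidfNaiveBayes:
    """Multinomial Naive Bayes over tf-idf weights with Laplace smoothing."""

    def fit(self, docs, labels):
        tokenized = [doc.lower().split() for doc in docs]
        n_docs = len(tokenized)
        doc_freq = Counter(term for toks in tokenized for term in set(toks))
        # Assumption: smoothed idf, so every weight is positive.
        self.idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
                    for t, df in doc_freq.items()}
        self.class_counts = Counter(labels)
        self.term_weight = defaultdict(Counter)  # label -> term -> summed tf-idf
        for toks, label in zip(tokenized, labels):
            for term, tf in Counter(toks).items():
                self.term_weight[label][term] += tf * self.idf[term]
        self.total_weight = {c: sum(self.term_weight[c].values())
                             for c in self.class_counts}
        return self

    def _log_score(self, toks, label):
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        denom = self.total_weight[label] + len(self.idf)
        for term, tf in Counter(toks).items():
            if term in self.idf:  # terms unseen in training carry no evidence
                likelihood = (self.term_weight[label][term] + 1.0) / denom
                score += tf * self.idf[term] * math.log(likelihood)
        return score

    def predict(self, doc):
        toks = doc.lower().split()
        return max(self.class_counts, key=lambda c: self._log_score(toks, c))


def train_test_split(docs, labels, train_fraction=0.8, seed=0):
    """Shuffle indices and split them, e.g. 16,000 / 4,000 for 20,000 tweets."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    cut = int(train_fraction * len(order))
    train, test = order[:cut], order[cut:]
    return ([docs[i] for i in train], [labels[i] for i in train],
            [docs[i] for i in test], [labels[i] for i in test])


def accuracy(model, docs, labels):
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


if __name__ == "__main__":
    # Hypothetical toy tweets; real input would be preprocessed Turkish text.
    docs = ["massacre kill soldier", "kill police attack",
            "good morning friends", "nice weather today"]
    labels = ["withheld", "withheld", "available", "available"]
    model = TfidfNaiveBayes().fit(docs, labels)
    print(model.predict("soldier attack"))
```

Because censored vocabulary drifts over time, a model of this kind would need periodic retraining on newly withheld tweets.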

A different direction would be to investigate what social media content increases the likelihood of censorship. In a recent study, King et al. (2013) found that Chinese censorship authorities mostly tolerate social media posts that criticize the government, but not those that refer to a collective action event such as a street protest. Although their theoretical expectations were confirmed in a dictatorial regime, it is not clear that these findings hold for semi-democracies like Turkey. In particular, due to domestic as well as international concerns, semi-democracies mostly allow political protests as long as they are peaceful. Similarly, such governments also allow ordinary criticism of the government. However, since they have electoral concerns, they have an incentive to silence criticism that has the potential to harm the ruling party’s reelection prospects. To test the collective-action hypothesis, we conducted a preliminary analysis by identifying the major street protests and the Turkish keywords associated with these events.17 Using these keywords, we searched our database for tweets posted on these dates and extracted both censored and uncensored tweets.18 We found that the government censored only a very small fraction (on average 5.2 percent) of the tweets that referred to collective action.
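The censored fraction for a set of protest keywords can be computed as follows. This is a minimal sketch under stated assumptions: each tweet is represented as a dict with the hypothetical keys "text", "created", and "withheld"; keyword matching is a plain case-insensitive substring test; and the toy tweets are invented for illustration. Only the date window (June 11 to June 23, 2015) comes from our preliminary analysis.

```python
from datetime import date


def censored_fraction(tweets, keywords, start, end):
    """Share of keyword-matching tweets posted in [start, end] that were withheld.

    Returns None when no tweet matches, so an empty sample is not read as 0%.
    """
    keywords = [k.lower() for k in keywords]
    matches = [t for t in tweets
               if start <= t["created"] <= end
               and any(k in t["text"].lower() for k in keywords)]
    if not matches:
        return None
    return sum(t["withheld"] for t in matches) / len(matches)


if __name__ == "__main__":
    # Hypothetical records; the keyword "gezi" is only an example.
    tweets = [
        {"text": "Gezi anniversary march", "created": date(2015, 6, 21), "withheld": True},
        {"text": "gezi park today", "created": date(2015, 6, 21), "withheld": False},
        {"text": "good morning", "created": date(2015, 6, 21), "withheld": False},
        {"text": "gezi", "created": date(2015, 7, 1), "withheld": True},
    ]
    print(censored_fraction(tweets, ["gezi"], date(2015, 6, 11), date(2015, 6, 23)))
```

Note that Python's default `str.lower()` does not follow Turkish casing rules for dotted and dotless i, so real Turkish keywords would need locale-aware normalization before matching.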

We further examined our database by focusing on users for whom only some tweets were censored, and conducted two separate topic analyses on the censored and uncensored tweets of these users.19 The results show that the most salient topics in the uncensored tweets consist mostly of everyday language that is neither offensive nor sharp-tongued. The topics from withheld posts show the opposite picture: the words overwhelmingly refer to the Kurdish/terrorism issue. Most importantly, words associated with the government leadership (such as AKP, Erdoğan, government, president, soldier, and police) frequently appear together with terms related to violence (such as massacre, kill, attack, and war).

17 For the purpose of this preliminary analysis, our time frame is June 11, 2015 to June 23, 2015. There were a total of four major protests: Renault and EGO Workers (June 11), Women's Movement (June 12), Coal Miners (June 15), and Gezi Park Anniversary (June 21).

18 King et al.’s (2013) methodology begins with identifying keywords for the most salient topic areas of recent Chinese politics and crawling posts using these keywords in different topic areas. Since our data collection method differs substantively from that of King et al. (2013), we are unable to replicate their data analysis step by step.

19 By comparing the posts from similar types of users, we aimed to control for many unobservable user-specific attributes that can influence the censorship outcome (Fu et al. 2013).

7. Conclusion

Recent developments in communication technology have created significant shifts in repressive governments’ attitudes toward censorship (Deibert 2009; Chadwick and Howard 2009). Although it was once widely accepted that the Internet was immune to control, governments today implement highly sophisticated technological mechanisms to regulate the flow of information within their boundaries. As one of the leading information sources, social media, and Twitter in particular, has been greatly affected by this shift in censorship policies. Twitter’s policy change in 2012 enabled governments to withhold certain posts if they violate local laws. In this study we analyzed censored tweets from Turkey, which has been by far the leading country in censorship in recent years.

In this study we aimed to contribute to the political methodology literature by developing a novel approach that creates complete dynamic data-flow graphs to model the relationships between users and to identify influential users in user communities. Combining our approach with a machine-learning topic clustering algorithm to analyze censored tweets in Turkey, we investigated the hot topics in each community and found that the users behind censored tweets in Turkey mentioned topics related to the graft probe of 2013 and the Kurdish separatist movements. We believe that our framework can be applied to any Twitter dataset to identify influential users and hot topics, in applications such as targeted marketing and political campaigns.


Appendix A: Legal Process of Censorship in Turkey

Unlike the censorship processes in dictatorial regimes, where censors block sites without much accountability, Turkey’s Internet censorship policy has an institutionalized mechanism, outlined in Law No. 5651. The process starts when individuals or agencies initiate a court case claiming that certain content violates the law because it attacks someone’s personal rights, is against the national interest, or for other reasons defined in the law. Individual citizens can apply to a court through their lawyers or directly themselves.20 For government agencies, the government recently created the Prime Ministry Security Affairs General Management (BGİGM, Başbakanlık Güvenlik İşleri Genel Müdürlüğü), which is responsible for coordinating the litigation processes initiated by government agencies. The designated court for Internet crimes is the Penal Judgeship of Peace (Sulh Ceza Hakimliği).21 According to the law, the court must issue its decision within 24 hours.

If the court accepts the claims, the decision is sent to the Telecommunication and Communication Agency (TIB; Telekomünikasyon ve İletişim Başkanlığı), which is known as the Internet watchdog of Turkey (Sozeri 2015). TIB is responsible for faxing the court decision to the headquarters of the social media company (such as Twitter Inc.). If the company declines to withhold the post, TIB may choose to block the entire site. This has happened a few times in Turkey; for example, Twitter was temporarily blocked on July 22, 2015, after the Suruç bombing, which killed 32 people. After a tweet or account is deleted or censored, Twitter occasionally shares the court decision on the Chilling Effects website, now known as the Lumen Database.

20 For more information, see http://www.aljazeera.com.tr/haber/sosyal-medyada-kendinizi-nasil-korursunuz

21 BGİGM has mostly been using the Gölbaşı Penal Judgeship of Peace, which is located in Ankara. Altiparmak and Akdeniz (2015) state that so far this court has been very receptive to government agencies' requests for censorship.


Figure 6 presents a screenshot of a court document stating the censorship decision. The shaded regions cover the names of the judge and the person (agency) who requested the censorship.

Figure 6: A sample court decision asking to censor a list of tweets

This censorship mechanism reflects a semi-democratic state structure. While the government has benefited greatly from this censorship process, individual citizens and opposition groups also take advantage of the system. That is to say, the government dominates, but does not monopolize, the censorship process. For example, after the violent attacks against local branches of the pro-Kurdish party HDP, the party's lawyers went to court in September 2015 to withhold a total of 132 social media accounts. The lawyers argued that these accounts encouraged violence against party members and hence threatened public safety and peace.22 The court accepted the party's allegations and confirmed the withholding of these accounts.23

22 These accounts are known as AK trolls, many of whom are paid workers whose responsibility is to promote and advocate the policies of the ruling AK Party (Saka 2014, Akser 2014).

23 For more information, see http://www.meydangazetesi.com.tr/gundem/mahkeme-trol-hesaplarini-hdp-nin-sikayeti-uzerine-kapatti-h33182.html (last accessed June 2011, 2016)


Appendix B

Table 3: Topic Modeling per Communities

Each block lists one community, identified by its length (number of users), followed by its ten topics in the form "Topic No: Topics".

Community Length: 7791
0: ol allah gel kapak hayır cennet bilgi sahip biraz gerek
1: hırsız tayyoş be polis çık koru fink çal orta di
2: yap şimdi su gün fatih takip koy iste dahil pkk
3: ver hesab el kapat paylaş kal twitter destek sayfa youtube
4: son dakika et skandal durum kenan ışık kesin avukat nun
5: günaydın hâlâ off yaş bak internet sabah lokma helva müjde
6: se yaş katil ye allah asker zengin padişah yaz fakir
7: türki yeni erdoğan bilal tayyip ülke burak do al ses
8: ed be devam ülke diren atatürk hakk çocuk hakaret büyük
9: akp san mhp gör gid katil önce din üye pkk

Community Length: 5276
0: polis asker öl öldür hpg sivil yaş vur çocuk katled
1: son dakika çatış nusaybin ilçe saldırı şiddet mahalle kızıltepe patla
2: cizre ajansakurdi ed nur cizreunderattack hdpgenelmerkezi yürü kaybet çıkma hayat
3: halk savaş diren iste karşı barış silvan demirtaş gör cenaze
4: hesab yeni takip ff izmirliheval kapat türki destek lütfen dost
5: pkk yok yap eylem et al ilan biji in karar
6: ver ses oy destek ölüm el insan haydi se akp
7: ol insan devlet ülke tc büyük katil çocuk şeref güzel
8: kürt yap ed akp türk katliam çocuk savaş konuş barış
9: hdp saldırı bina barış oy yay kal demirtaş faşist in

Community Length: 5241
0: son dakika patla mahalle büyük silopi gel dakıka çete yakın
1: kürt akp türk çocuk öl karşı yok vur konuş faşist
2: cizre ajansakurdi çocuk yürü katled ses kadın polis çık cizreunderattack
3: ypg ypj twitterkurds kobanê kobane li in ışid biji el
4: hdp akp oy mhp bina parti saldırı aday seçi başkan
5: ver se oy ses be el kardeş barış destek haydi
6: yap demirtaş pkk başkan savaş bak güc karşı saldırı çağrı
7: ed polis halk asker pkk gerilla et hpg amed öl
8: destek ff hesab takip yeni arkadaş kapat türki dost hevalnooo
9: ol insan tt müşahit seçi çocuk san anne sonra emin

Community Length: 5079
0: ed takip devam hesab protesto et diren linç yeni arkadaş
1: ol insan sela yok san se iste büyük herkes tek
2: konser yorum grup yer izmir 00 büyük yıl 20 30
3: son dakika unut adalet durum al dakıka silvan kuzey yoldaş
4: gel bak gün güzel insan yoldaş san tweet çocuklar izmir
5: halk karşı cephe adalet sokak diren tv düşman açıkla hakk
6: cizre devlet çık ajansakurdi katil çocuk yaş yol yok terör
7: redhack yay via hackledi dost mesaj baki aile unut oku
8: polis saldırı yaş saldır gözaltı gaz an işçi silah at
9: ver akp hdp yap saldırı bina oy seçi silah katil

Community Length: 4879
0: turkish turkey to the in of kurds police and amp
1: dakika son polis silopi asker katlet helikopter bismil amed başla
2: ol kürt çocuk hesab hevalno gör lütfen yay iyi şimdi
3: pkk et eylem ilan gerilla karar al yok uçak bombala
4: hdp li istanbul kadın oy çık milletvekil hedef 00 saat
5: ajansakurdi cizre ta polis hükümet şırnak ye özel genc çıkma
6: ver destek oy se hesab insan ff gece tc cnnturk
7: twitterkurds kurdistan in of kurdish ısıs ypg syria terroristturkey rojava
8: an itibari şırnak gev diren silopi meydan nusaybin merkez cenaze
9: yap ed akp gel saldırı asker acil polis katliam katled


Works Cited

Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10, 1-5.

Akser, M. (2014). Turkish Film Festivals: Political Populism, Rival Programming and Imploding Activities. Film Festival Yearbook, 6, 141-155.

Altiparmak, K., & Akdeniz, Y. (2015). 5651, Madde 8/A Buyuk Sansur donanmasinin Amiral Gemisi. Guncel Hukuk, 11-143.

Bamman, D., O'Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3).

Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.

Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

Bonneau, J., Anderson, J., Anderson, R., & Stajano, F. (2009, March). Eight friends are enough: social graph approximation via public listings. In Proceedings of the Second ACM EuroSys Workshop on Social Network Systems (pp. 13-18). ACM.

Chadwick, A., & Howard, P. N. (2009). Introduction: New directions in Internet politics research. Routledge handbook of Internet politics, 1-9.

Chu, Z., Gianvecchio, S., Wang, H., & Jajodia, S. (2010, December). Who is tweeting on Twitter: human, bot, or cyborg? In Proceedings of the 26th Annual Computer Security Applications Conference (pp. 21-30). ACM.

Deibert, R. (2009). The geopolitics of internet control: Censorship, sovereignty, and cyberspace. The Routledge handbook of internet politics, 323-336.

Fu, K. W., Chan, C. H., & Chau, M. (2013). Assessing censorship on microblogs in China: Discriminatory keyword analysis and the real-name registration policy. IEEE Internet Computing, 17(3), 42-50.

Huberman, B. A., Romero, D. M., & Wu, F. (2008). Social networks that matter: Twitter under the microscope. Available at SSRN 1313405.

Java, A., Song, X., Finin, T., & Tseng, B. (2007, August). Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis (pp. 56-65). ACM.

King, G., Pan, J., & Roberts, M. E. (2013). How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(02), 326-343.

Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter's Streaming API with Twitter's Firehose. arXiv preprint arXiv:1306.5204.

Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577-8582.

Nielsen, F. Å. (2008). Clustering of scientific citations in Wikipedia. arXiv preprint arXiv:0805.1154.

Saka, E. (2014). The AK Party's social media strategy: controlling the uncontrollable. Turkish Review, 4(4), 418.

Sozeri, E. K. (2015). The Two Faces of Twitter. Published at bianet.org.

Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.

Tanash, R. S., Chen, Z., Thakur, T., Wallach, D. S., & Subramanian, D. (2015, October). Known Unknowns: An Analysis of Twitter Censorship in Turkey. In Proceedings of the 14th ACM Workshop on Privacy in the Electronic Society (pp. 11-20). ACM.

Ward, M. D., Stovel, K., & Sacks, A. (2011). Network analysis and political science. Annual Review of Political Science, 14, 245-264.

Weng, J., Lim, E. P., Jiang, J., & He, Q. (2010, February). Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (pp. 261-270). ACM.

Yamaguchi, Y., Takahashi, T., Amagasa, T., & Kitagawa, H. (2010). Turank: Twitter user ranking based on user-tweet graph analysis. In International Conference on Web Information Systems Engineering (pp. 240-253). Springer Berlin Heidelberg.

Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2013). The velocity of censorship: High-fidelity detection of microblog post deletions. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13) (pp. 227-240).
