Detecting Influential Users and Communities in Censored Tweets Using Data-Flow Graphs
Rima S. Tanash1, Abdullah Aydogan2, Zhouhan Chen1, Dan Wallach1, Melissa Marschall2, Devika Subramanian1, Chris Bronk3.
1Rice University, Computer Science
2Rice University, Political Science
3University of Houston, Computer and Information Systems
Abstract
Current literature on social media censorship has examined various aspects of censorship. However, the relationship among censored social media users has received much less attention. We address this gap in the literature by constructing a complete dynamic data-flow graph that models the communication between users, identifies influential users, and utilizes a wide variety of metadata embedded in tweets to follow data paths of censored tweets. Starting from a dataset of over 25 million tweets from Turkey, our analysis is based on 712,218 unique censored tweets associated with 13,056 distinct users. By applying a modularity metric to find user communities, we identified 5 large communities whose influential users are supporters of the Kurdish separatist movement or famous accounts that were actively involved on social media during the graft scandal of December 2013. In addition, using machine-learning topic clustering algorithms to extract popular censored topics and keywords within these communities, we found that the largest community frequently mentioned topics regarding the leaked court documents of the corruption case, while the next four largest communities frequently referenced topics criticizing the military strikes against the separatist groups.
1. Introduction
Some social media outlets, like Twitter, provide mechanisms for governments to censor social
media posts in their geographical boundaries. This empirical reality triggered the development of
a new line of research that focuses on the nature of social media censorship (King et al. 2013,
2014; Zhu et al. 2013; Gunitsky 2015; Rød and Weidmann 2015). This literature has examined
various aspects of government censorship including the velocity of censorship, the content of
censored texts, and the geographic origins of posts. While current literature provides useful
techniques to examine the relationship among users (e.g. see Yamaguchi et al. 2010, Barbera
2015), existing studies use only following/follower (i.e. friendship) links and retweeted posts to
map users’ relationships. We believe this approach provides a limited understanding of
the level of interaction between users (Huberman and Romero 2009). In this study we examine
another aspect of social relationships that involves studying the data flow between users to
investigate influential users in networks of user communities within censored data.
Building on these earlier works, we present a method for detecting user communities by
employing a novel approach of constructing complete data-flow graphs using a wide variety of
metadata embedded in tweets to follow paths of censored tweets. Our method captures tweets
and users (such as users being mentioned, retweeted, and replied to) that the traditional approach,
which focuses exclusively on static friendship-based ties, misses. After detecting user
communities, we use machine learning statistical measures and topic clustering algorithms to
detect censored hot topics from each community.
Our analysis is based on a dataset of social media posts censored primarily by the Turkish
government. Turkey has been by far the leading country in Twitter censorship over the last few years.1
In fact, 90 percent of the tweets censored during the second half of 2016 were from Turkey.2
Applying our methods to 712,218 censored Turkish tweets posted between October 2014 and
September 2015, we detected interesting user communities and popular topics that shed light on
the aims of Turkish censors. Further examination of these communities and topics revealed that
the Turkish government mainly targets two types of users: those promoting corruption-related
discussions and those promoting Kurdish separatist movements. In addition, our analysis
indicates that there is a third, much smaller group that is also censored. This group is more
eclectic, with roughly half of its tweets being pro-government in nature. The users in this
group turn out to be mostly individuals who insult key political figures by using vulgar language.
1 See Appendix A for the legal process of Internet censorship in Turkey.
The paper is structured as follows. We first explain the construction of data-flow graphs
for detecting influential users. This is followed by a presentation of how we detect and
visualize social media user communities. We then present our analysis on topic clustering for
each of the largest communities. Finally we present some preliminary conclusions and discuss
some of our ongoing and future work.
2. Finding Influential Users
Graph metrics have been previously used by researchers to study social network
characteristics (Bonneau et al. 2009). In political science, more specifically, network analysis has
grown widely popular (Ward et al. 2011). In our research, we focus on finding influential users
and popular censored topics using a novel method to generate a dynamic data flow graph that
represents the data flow between the users.
2 For more information, see Twitter transparency reports: https://transparency.twitter.com
2.1. Related Work
Yamaguchi et al. (2010) showed methods for identifying authoritative users (i.e. influential
users) in Twitter by examining information flow in Twitter using data link analysis by assigning
an authoritative score to each user in the social graph. They argue that their approach is unlike
the existing methods that only deal with following and follower static relationships between users
(Java et al. 2007; Weng et al. 2010). Twitter users can be categorized as information
generators who post important tweets and tend to have many followers, and information seekers
who are users that follow a large number of users and are interested in reading more than posting
tweets (Yamaguchi et al. 2010). The following/follower relationships do not tell us much about
the level of interaction between these users because many users follow back other users as a
courtesy. Huberman and Romero (2009) found that the vast majority of users in the
following/follower relationship do not interact with each other, so examining the link between
thousands of users (in many cases millions of users) is not useful. Additionally, over the years
Twitter attracted a large number of automated accounts known as bots that tend to randomly
follow a large number of users and retweet hoping to get followed back (Chu et al. 2010). For
these reasons, using the static followers/following relationship to identify influential users is not
very useful.
The Yamaguchi et al. rating algorithm considered the relationship from a user u to other
users that u follows, which is a static relationship that can be obtained from Twitter’s API. Next,
they considered the retweet dynamic relationship. The social graph they built consists of nodes
and edges, where nodes are user-nodes and tweet-nodes. The edges can be one or more of the
following relationships: a link from user u to a tweet that u posted, a link from user u to a user
that u is following, a link from tweet t to a tweet retweeted by t. After they obtain the graph, they
get an authoritative score based on the calculated weight of the edges. Twitter allows users to
mention, reply to, and retweet other users that are not directly connected (not following each other).
This could occur for example, when the tweet is found in the public feed, or posted by some
intermediary users.
We argue that although their approach can be better than the existing methods, it only
considers users that are directly connected via following/follower relationships, and ignores users
that received the tweet from their timeline but are not directly connected to the user who
generated the tweet. This means that their approach misses many users that could also be
influential. Additionally, their approach only considers the retweet relationship, and ignores the
links between users that are mentioned or replied to in the tweets.
2.2. Twitter Data Structure
Twitter offers many free APIs for collecting public tweets, including the Streaming API3,
which captures continuous live tweets up to 1% of the total Twitter feed, and the REST API4 for
retrieving historical tweets, such as the tweets of specific users, or tweets by tweet ID.
In this section we provide a brief description of some of the important fields in the tweet data
block returned by the Twitter API. A typical tweet data block looks similar to the example shown in
Figure 1, which we truncated for brevity.
3 Streaming API: https://dev.twitter.com/streaming/overview
4 REST API: https://dev.twitter.com/rest/public
Figure 1: Sample tweet returned by Twitter API
{"_id" : ObjectId("562c277d19a39eb2d8624f1a"),
 "contributors" : null,
 "truncated" : false,
 "text" : "RT @Jiyanizm: Bu defa hesabımı kapattırmaya kararlılar bu vitaminsizler\n\nCok yogun spamlaniyorum arkadaslar ...",
 "is_quote_status" : false,
 "in_reply_to_status_id" : null,
 "id" : NumberLong("625374103214653442"),
 "favorite_count" : 0,
 "source" : "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>",
 "retweeted" : false,
 "coordinates" : null,
 "entities" : {"symbols" : [ ],"user_mentions" : [ ],"hashtags" : [ ],"urls" : [ ]},
 "in_reply_to_screen_name" : null,
 "in_reply_to_user_id" : null,
 "retweet_count" : 23,
 "id_str" : "625374103214653442",
 "favorited" : false,
 "retweeted_status" : {
   "contributors" : null,
   "truncated" : false,
   "text" : "Bu defa hesabımı kapattırmaya kararlılar bu vitaminsizler\n\nCok yogun spamlaniyorum arkadaslar ...",…}
Some of the fields we present below could be absent or NULL. These fields help us
determine when tweets are retweets, replies, mentions, original, or censored5:
1. "user": this field contains attributes of the tweeting user, including user-name, screen-
name, user-id, profile description, and many other attributes.
2. "retweeted_status": this field, when present, means that the tweet is a retweet, and it
contains the representation of the original tweet including attributes of the original user and
the original tweet. Note that a retweeted retweet does not contain attributes of the
intermediary retweet, but only the original tweet. For this reason, we consider the
relationship between the retweet and the original tweet logical, because the users are not
directly connected in the social graph.
3. "user_mentions": this sub-field contains a list of the users being mentioned in the actual
text.
4. "in_reply_to_screen_name": this field indicates that the tweet is a reply. When present, it
contains the screen-name of the original user.
5. "withheld_in_countries": this field is only present if the tweet is censored. When
present, it contains a list of ISO country codes in which the tweet is censored. For example,
"withheld_in_countries": ["TR"] means that the tweet is censored in Turkey only6.
Previous work by Tanash et al. (2015) validated that when this field is present, the tweet is
greyed-out when viewed from inside of the censoring country.
5 Tweets’ fields guide: https://dev.twitter.com/overview/api/tweets
It is important to note that tweets may have a slightly different data block structure
depending on the API being used. Specifically, we discovered that a retweet collected using the
Streaming API does not include the "retweeted_status" field, appearing only as an original
tweet. When the same tweet is recollected using the REST API, it includes the
"retweeted_status" field, identifying the original user attributes. This finding is important as it
allows us to find additional users that may be of interest to the Turkish censors.
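The field checks described in this section can be sketched in Python as follows. This is a minimal illustration, assuming each tweet has already been parsed from its JSON data block into a dict; the function name tweet_kinds and the label strings are hypothetical, and treating every tweet that is neither a retweet nor a reply as original is our simplifying assumption.

```python
def tweet_kinds(tweet):
    """Return the set of labels that apply to one parsed tweet (a dict)."""
    kinds = set()
    # A present, non-null "retweeted_status" marks a retweet.
    if tweet.get("retweeted_status"):
        kinds.add("retweet")
    # A non-null "in_reply_to_screen_name" marks a reply.
    if tweet.get("in_reply_to_screen_name"):
        kinds.add("reply")
    # "user_mentions" sits under "entities"; a non-empty list marks a mention.
    if (tweet.get("entities") or {}).get("user_mentions"):
        kinds.add("mention")
    # "withheld_in_countries" is only present when the tweet is censored.
    if tweet.get("withheld_in_countries"):
        kinds.add("censored")
    # Assumption: anything that is neither a retweet nor a reply is original.
    if not kinds & {"retweet", "reply"}:
        kinds.add("original")
    return kinds
```

Because fields can be absent or NULL, the sketch uses dict.get throughout instead of direct indexing.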
2.3. Data Collection
We used the same crawling methodologies and some of the data described in Tanash et al. (2015)
to collect over 25 million tweets from Turkey between October 2014 and September 2015. We ran a
streamer from October 2014 until January 2015, and again during June 2015 to
specifically capture tweets related to the Turkish election. We also used the REST API to
collect tweets from the timelines of users found in our streamed dataset. Overall, we
identified 712,218 unique7 censored tweets in Turkey, each with a unique tweet id, of which 96%
are retweets (i.e. contain the "retweeted_status" field), associated with 13,056 distinct users.
6 Twitter uses “to withhold” instead of “to censor”. Throughout the paper we will use these two words interchangeably.
Tweets may include additional strings (prefixes) that get automatically added by Twitter
depending on how the user chooses to post tweets, for example, the prefix “RT@username”
precedes the actual text when the post is a retweet. Additionally, if the tweet has a URL, Twitter
automatically converts the URL to a mini URL to reduce the number of characters, because the
size of the tweet cannot exceed 140 characters8. To find the number of identical messages, we
removed the prefixes and the mini URLs from all posts, and found 535,759 unique strings9. The
numbers we present above do not include tweets posted by 258 users that are withheld-accounts
in Turkey, because withheld-accounts are users for whom all posts are censored in specific
countries regardless of the tweet content.
2.4. Finding Censored Tweets
After collecting the tweets, we waited for a few weeks and then recollected the tweets
repeatedly for over a month, because Twitter censorship occurs days or months after a
tweet is posted. We used the REST API endpoint statuses/show/:id, which returns historical tweets
given their tweet ids as input parameters. We then checked whether Twitter had added the
"withheld_in_countries" field to the tweet data block; if so, we marked the tweet as censored
(Tanash et al. 2015).
7 Each tweet has a unique tweet id; we removed duplicate ids.
8 Twitter recently changed the limit on expressing a post with over 140 characters, but our data was collected prior to the change: https://blog.twitter.com/express-even-more-in-140-characters.
9 This process is described in detail in the "Data Preprocessing" section of this paper.
In the following section, we introduce a new simple algorithm to extract influential users by
converting tweets into a dynamic data flow graph that represents users’ communication.
2.5. Constructing Data Flow Graph
We introduce a simple algorithm for constructing a complete data-flow graph using a
wide variety of metadata embedded in tweets to follow data paths of censored tweets, instead of
the traditional approach of constructing such graphs using friendship relationships and retweets
only. To model the data flow between users in our Twitter dataset, we converted the censored
tweets to a directed graph 𝐺(𝐸, 𝑈), where 𝑈 is the set of all nodes in our data, with each user being a
node, and 𝐸 is the set of directed edges representing the data-flow links between users. The edges
are added to the graph by examining every tweet’s JSON data structure and applying the
appropriate rule defined in Table 1. The rules are as follows:
(1) When a tweet is original (𝑂), the node is a singleton and no edge is added.
(2) When the tweet is original but contains a "user_mentions" subfield, we add an edge from the
original user to each of the mentioned users in the list. This is because a mentioned user receives
the tweet, so the data flows from the tweeting user to those who are mentioned.
(3) When the tweet contains a "retweeted_status" field, we add an edge from the original user to
the retweeting user, because the source of the data is the original user.
(4) When the tweet contains a "retweeted_status" field and the embedded original tweet (source)
contains a "user_mentions" subfield, we first apply rule (3) and then add an edge from the original
user to each of the mentioned users.
(5) When the tweet contains an "in_reply_to_screen_name" field, we add an edge from the replying
user to the original user, because a reply tweet generates new data that flows from the replying
user to the original user.
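The five rules can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the study: it assumes tweets are parsed dicts shaped like Figure 1, uses screen names as node identifiers, and collapses repeated links into a single edge by storing edges in a set. The function name data_flow_edges is hypothetical.

```python
def data_flow_edges(tweets):
    """Build (nodes, edges) from parsed tweets using the five data-flow rules."""
    nodes, edges = set(), set()

    def mentioned(t):
        # Screen names listed in the "user_mentions" sub-field, if any.
        mentions = (t.get("entities") or {}).get("user_mentions") or []
        return [m["screen_name"] for m in mentions]

    def link(src, dst):
        # Directed edge src -> dst: data flows from src to dst.
        nodes.update((src, dst))
        edges.add((src, dst))

    for tweet in tweets:
        author = tweet["user"]["screen_name"]
        nodes.add(author)                    # rule 1: node exists even with no edge
        source = tweet.get("retweeted_status")
        if source:
            origin = source["user"]["screen_name"]
            link(origin, author)             # rule 3: original user -> retweeter
            for name in mentioned(source):   # rule 4: original user -> mentioned users
                link(origin, name)
        else:
            for name in mentioned(tweet):    # rule 2: author -> mentioned users
                link(author, name)
        reply_to = tweet.get("in_reply_to_screen_name")
        if reply_to:
            link(author, reply_to)           # rule 5: replying user -> original user
    return nodes, edges
```

Storing edges in a set is a design choice of this sketch; a weighted or multi-edge graph would need a counter instead.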
Table (1): Data flow rules for adding directed edges
Note that because Twitter allows users to mention, reply to, and retweet other users that are
not directly connected (not following each other), our algorithm allows us to capture additional
users from the metadata that we were not specifically crawling, while also identifying new users
who can also be influential. Users with the highest out-degree (users who generated the
most data) are the users whose tweets reached the most users in our graph and are influential. We
discuss the analysis of influential users from communities in the following sections.
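As a small illustration of the out-degree criterion, the sketch below ranks users by the number of outgoing edges. It assumes the graph is available as a collection of (source, target) pairs; the function name rank_by_out_degree and the sample edges are hypothetical.

```python
from collections import Counter

def rank_by_out_degree(edges, top=5):
    """Return the `top` users with the most outgoing data-flow edges."""
    out_degree = Counter(src for src, _dst in edges)
    return out_degree.most_common(top)

# Hypothetical edges: user "e" reaches two users, user "b" reaches one.
example_edges = {("e", "d"), ("e", "f"), ("b", "c")}
ranking = rank_by_out_degree(example_edges, top=1)
```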
3. Community Detection
Now that we have constructed our social data-flow graph, we can apply modularity-based
clustering to identify communities and analyze the influential users in each
community. Modularity is a widely used metric for extracting modules in network structures
(Newman 2006) and is used for studying communities in social networks (Bonneau et al. 2009).
Each module or community consists of densely connected nodes with scarce connections to other
modules. In social networks, friends tend to communicate with each other more often than they
do with other nodes that are members of other communities. Studying communities and their
members is practically useful for conveying important information about common topics and
users’ characteristics.
Modularity is defined by Newman (2006) and given in formula (1), where $G$ is a directed graph.
Modularity $Q$ is defined as the number of edges that fall within communities in $G$ minus the
number of such edges expected if the edges were distributed randomly, normalized by $2m$. $Q$ is calculated as follows:
$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \qquad (1)$$
where $m$ is the total number of edges in the directed graph $G$, $v$ and $w$ are any two nodes, and
$k_v$ and $k_w$ are the out-degrees of $v$ and $w$ respectively. When there is an edge between $v$ and $w$,
$A_{vw} = 1$; otherwise it is 0 (Bonneau et al. 2009). The sum runs over pairs of nodes that belong
to the same community. The algorithm first divides $G$ into two communities and
is then repeated recursively until $Q$ is maximized. The Louvain modularity method is
known to run faster by using greedy optimization (Blondel et al. 2008), which is appropriate for
processing very large networks. We decided to use a python library10 that uses the Louvain
modularity method. Applying modularity to our graph, we identified a total of 25 communities,
and found that there are five large communities, with the number of members per community
ranging from 7,791 to 4,879. To plot the top communities, we used Gephi, an open-source
visualization tool that uses a 3D engine to display graphs in real time and helps detect underlying
patterns in the data. We found Gephi to work fast when dealing with large
dynamic networks. It also comes with many useful functionalities such as filtering, clustering,
analysis, and exporting11. Figure 2 shows the top five communities created using Gephi with the
ForceAtlas layout12. Each cluster is shown in a different color: red, green, blue, purple, and
yellow. Since the remaining communities were much smaller, ranging from 386
to 2 users, we omitted them from Figure 2.
10 Community detection for NetworkX’s documentation: http://perso.crans.org/aynaud/communities/
11 For more info visit www.gephi.org
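To make formula (1) concrete, the sketch below evaluates the modularity of a given partition, summing the bracketed term over pairs of nodes in the same community. It is a simplified illustration, not the Louvain library used in the study: it assumes an undirected graph without self-loops or duplicate edges and uses plain degrees for the k terms. The function name modularity and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import product

def modularity(edges, community):
    """Modularity Q of a partition; `community` maps node -> community label."""
    m = len(edges)
    adjacent = set()
    degree = defaultdict(int)
    for v, w in edges:
        adjacent.add((v, w))
        adjacent.add((w, v))
        degree[v] += 1
        degree[w] += 1
    members = defaultdict(list)
    for node, label in community.items():
        members[label].append(node)
    total = 0.0
    for nodes in members.values():
        # Ordered pairs (v, w) inside one community, including v == w.
        for v, w in product(nodes, repeat=2):
            a_vw = 1 if (v, w) in adjacent else 0
            total += a_vw - degree[v] * degree[w] / (2 * m)
    return total / (2 * m)

# Toy graph: two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}
```

For this toy graph, splitting the nodes into the two triangles gives Q = 5/14, while placing every node in one community gives Q = 0, which is why a modularity-maximizing method prefers the split.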
Figure (2): Communities of censored tweets in Turkey- October 2014 through September 2015
Since the graph is constructed with the data-flow rules that we defined in Table 1, we applied
the vertex out-degree metric: users with the highest out-degree are the users whose tweets
reached the most users in our graph. These are the influential users. Using our visualization
tool, we scaled the size of each node to its out-degree so that it is easier to
visually spot the most influential users (see Figure 3). Notice that the purple community is set
further away from the remaining communities in this layout. This is because the nodes in the
purple community are less connected to the other communities. On the other hand, the red,
green, yellow, and blue communities appear tightly connected and closer to each other, which
suggests that users from these communities share common attributes, such as social contacts and
topics of interest.
12 A layout algorithm sets the shape of the graph so that it is more aesthetically pleasing. The ForceAtlas layout places connected nodes closer together and pushes unconnected nodes apart. https://gephi.org/tutorials/gephi-tutorial-layouts.pdf
Figure (3): Communities of censored tweets in Turkey October 2014- September 2015, with out-degree ranking for nodes.
3.1. Analysis of Influential Users in the Top 5 Communities
To confirm that these users are influential, we manually examined their profiles. The
community in purple is the largest community. The most influential users are fuatavnifuat,
HARAMZADELER333, BASCALAN, csagir2015, mehtabyuceel, and TheRedHack. One
important common feature of all these accounts is their active social media involvement during
the corruption scandal of December 2013. For example, Fuat Avni (@fuatavnifuat) has been
acting like a government insider leaking Erdogan’s strategies since the scandal was uncovered.
HARAMZADELER333 is the account that leaked corruption evidence such as legal tape
recordings between politicians and businessmen through Youtube and Twitter (Sozeri 2015).
The communities with red, green, and blue colors are the next three largest communities.
The influential users in these communities are overwhelmingly the advocates of the Kurdish
movement. These include AjansaKurdi1, AjansaKurdi2, Kurd24M, Diyarbakir7, curdistani, and
ROJOVA. AjansaKurdi1 and AjansaKurdi2 are the Twitter accounts of the Kurdish news site
www.ajansakurdi.com. Kurd24M is also the Twitter account of a news source. Curdistani is another
pro-Kurdish user who regularly tweets promoting PKK and other Kurdish groups.
In addition to the top five communities, we identified one specific community with 387
nodes, shown in Figure 4. The graph representing this community takes the shape of a star
network, in which all of the nodes in this community are directly connected to the focal node
orgidee79 which acts as a hub, and information source. We categorized this as a porn community
of censored tweets because the content and the screen names contained sexual language. This
suggests that Turkish censors also target pornographic topics and users.
Figure (4): Porn Community
4. Topic Clustering
4.1. Data Preprocessing
Before feeding tweet texts into our topic classification model, we applied a sequence of
text processing techniques to ensure validity, accuracy, and relevance of our data. The first step
is extracting original tweets from retweets. As on any popular social media platform, the Twitter feed
consists of a large number of retweets. Indeed, in our dataset 96% of all posts are retweets, and in
our top five communities the average share of retweets is almost 99%. If a tweet is a retweet, its
JSON structure has a non-empty retweeted_status field, so we checked this field to determine
whether a tweet is a retweet. Since retweets are endorsements of original tweets and add little
extra information, we transformed retweets into original tweets by extracting the original tweet
embedded in that field. This approach also makes data de-duplication easier, as many
RT@username prefixes are removed automatically.
The second step is data de-duplication. Even though all tweets are now original in the sense
that their retweeted_status field is empty, we still found a number of tweets that are almost
identical to each other. Further investigation revealed an RT@username + text structure, which
is usually generated when a user retweets another user but the resulting tweet is original,
i.e., has an empty retweeted_status field. We removed all RT@username structures that
appear at the beginning of a tweet and hashed every tweet to ensure uniqueness. At the end of
step two, between 3% and 25% of tweets, depending on the community, had been eliminated as
duplicates.
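The de-duplication step can be sketched as follows. The paper does not state which hash function was used, so SHA-1 here is an assumption, as are the regular expression for the RT@username prefix and the function name deduplicate.

```python
import hashlib
import re

# Assumed pattern for a leading "RT @username:" or "RT@username" prefix.
RT_PREFIX = re.compile(r"^RT\s*@\w+:?\s*")

def deduplicate(texts):
    """Strip leading RT prefixes and keep the first copy of each distinct text."""
    seen, unique = set(), []
    for text in texts:
        body = RT_PREFIX.sub("", text).strip()
        digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(body)
    return unique
```

Hashing keeps the set of seen items small and fixed-width even when tweet texts are long; comparing the stripped strings directly would give the same result on small inputs.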
At this point, our tweet texts are still not ready to be analyzed due to high variability
between similar words. For example, when a human sees the three words university, University and
university’s, they know that all of them are related to university. For a computer, however, those
three words are coded differently and therefore are distinct from each other. A common text
processing technique is to make all letters lowercase. Besides doing this, we removed common
Turkish stop words13 as well as all occurrences of &gt; and &lt;, which are merely encodings for the >
and < symbols (see Example One below). We removed any syllables that came immediately after
an apostrophe because they are used only for grammatical completeness. We also removed URLs
appended at the end of each tweet. Since Twitter automatically shortens and truncates those
URLs, it would be very difficult for us to decode the content behind each URL. As an illustration
of this process, Example Two shows a raw tweet text and its transformed version after data
preprocessing.
13 For Google Code of the Stop Words, see https://code.google.com/p/stop-words/.
Example One:
Tweet text shown on Twitter webpage:
>>>>AKPTERÖRÖRGÜTÜ<<<<
Tweet text encoded in our database:
&gt;&gt;&gt;&gt;AKPTERÖRÖRGÜTÜ&lt;&lt;&lt;&lt;
Example Two:
Raw tweet text:
RT @PartizanChe: Ambulans ve sağlık ekiplerinden önce Toma geldi. Yazıklar olsun böyle
ülkeye! . #SuruçtaKatliamVar http://t.co/n9P5oGPHQ2
Tweet text after preprocessing:
ambulans ve sağlık ekiplerinden önce toma geldi yazıklar olsun böyle ülkeye .suruçtakatliamvar
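The preprocessing steps can be sketched as a single function. This is a hedged illustration and not the pipeline used in the study: the stop-word set is a tiny hypothetical placeholder, stripping all punctuation is our assumption, and Python's built-in lower() does not apply Turkish-specific casing rules for dotted and dotless i.

```python
import re

# Hypothetical placeholder; the study used a published Turkish stop-word list.
STOP_WORDS = {"bir", "bu"}

def preprocess(text, stop_words=STOP_WORDS):
    """Normalize one tweet text following the steps described above."""
    text = re.sub(r"^RT\s*@\w+:?\s*", "", text)           # leading retweet prefix
    text = re.sub(r"https?://\S+", " ", text)              # shortened URLs
    text = text.replace("&gt;", " ").replace("&lt;", " ")  # encoded > and <
    text = re.sub(r"['’]\w+", "", text)                    # syllables after an apostrophe
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)                   # assumption: drop punctuation
    return " ".join(t for t in text.split() if t not in stop_words)
```

The order matters: entities and URLs are removed before punctuation is stripped, otherwise fragments such as "gt" or "http" would survive as tokens.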
4.2. Stemming
We also applied a stemming algorithm as a part of the preprocessing. Stemming is the
process of removing prefixes and suffixes (also called morphemes) so as to reduce derivational
differences among words with the same root meaning. For instance, stemming collapses words
university and universities into one word regardless of their grammatical tenses or pluralities. A
variety of open source Turkish stemmers are available on the Internet, and each has pros and
cons. Packages such as python snowballstemmer14 are too lightweight, though easily customized,
and do not process the Turkish language. We decided to use the zemberek-nlp package because it is both
widely used and well researched (Akın and Akın 2007). One limitation of this package, however,
is that the program returns an empty string if there is no match for the given input word. In our
Twitter texts there are names (Erdoğan) and hashtags (AjansaKurdî) that are neither stemmable
nor in the package's default dictionary. As a workaround, we modified the package to
return the original word if no match is found. Example Three shows a raw tweet text and its
stemmed version.
14 Python Package. https://pypi.python.org/pypi/snowballstemmer
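The workaround can be illustrated with a small wrapper. The sketch assumes a stemmer exposed as a Python callable that returns an empty string when it finds no match; the names stem_with_fallback and toy_stem are hypothetical and stand in for the modified zemberek-nlp call.

```python
def stem_with_fallback(word, stem):
    """Return the stem of `word`, or `word` itself when the stemmer finds no match."""
    root = stem(word)
    return root if root else word

# Hypothetical stand-in stemmer backed by a tiny dictionary.
TOY_ROOTS = {"universities": "university"}

def toy_stem(word):
    return TOY_ROOTS.get(word, "")
```

With this wrapper, names and hashtags that the dictionary does not know pass through unchanged instead of disappearing from the tweet.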
Example Three:
Raw tweet text:
gelen, 6 milyon liralik kasa açiği ile 10 milyon liralik usulsüz yani naylon fatura ile
yapilan soygunu çözmek için harekete geçen.
Stemmed tweet text:
gelen, 6 milyon lira kas açığ ile 10 milyon lira usul yani naylon fatura ile yapılan soygun
çöz için hareket geçen
In this example, the algorithm performed very well apart from stemming “kasa” as “kas”.
The algorithm failed on this word because “kasa” can also be read as the root “kas” (“muscle”)
followed by the suffix “a” (meaning “to” in Turkish). In Turkish “kasa” means “money case”,
but the same string also means “to the muscle”.
4.3. Using tf-idf and NMF for Topic Clustering
Tf-idf is a popular measure used in text mining and topic clustering. Tf stands for
term frequency. According to Sparck Jones (1972), the frequency of a term is positively related to
the weight of that term in a document. However, some common words, such as the, a, and an, have
a high term frequency in most documents. For this reason, tf by itself tends to be a biased
measure. Idf, which stands for inverse document frequency, is used to reduce the weight of
common words. The more common a word is, the smaller its inverse document frequency. The
tf-idf value of a word is the product of its term frequency and inverse document frequency.
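A worked sketch of the tf-idf product follows. Several tf and idf variants exist; this one assumes tf is the relative frequency of a term in a document and idf is the natural log of the number of documents divided by the number of documents containing the term. The function name tf_idf and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# "polis" appears in every document, so its idf and tf-idf weight are zero.
sample_docs = [["polis", "asker", "polis"], ["polis", "bomba"], ["polis", "barış"]]
sample_weights = tf_idf(sample_docs)
```

In the first sample document, "polis" has the highest term frequency but a weight of zero, while the rarer "asker" receives a positive weight of log(3)/3.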
After data preprocessing and stemming, each tweet is transformed into a vector of words.
The length of the vector is equal to the total number of unique words that appear across the
tweets being analyzed. In practice, it is usually recommended to eliminate highly frequent and
infrequent words so as to reduce information redundancy and facilitate subsequent computation.
Therefore, we ignored words that appear in more than 95% of tweets and words that appear fewer
than twice in total. As a result, our tweet vectors become more compact without losing much
information.
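The vocabulary filter can be sketched as follows, assuming tokenized tweets as lists of words. The function name prune_vocabulary and its parameter names are hypothetical; the thresholds mirror the 95% and "fewer than twice" cut-offs described above.

```python
from collections import Counter

def prune_vocabulary(docs, max_doc_share=0.95, min_total=2):
    """Keep terms that are neither near-universal nor seen fewer than `min_total` times."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    total = Counter(term for doc in docs for term in doc)
    keep = {
        term for term in total
        if doc_freq[term] / n_docs <= max_doc_share and total[term] >= min_total
    }
    return sorted(keep)
```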
To extract meaningful topics from the tweet vectors, we used an unsupervised machine
learning algorithm because the tweets are unlabeled and we have no prior knowledge of the
topics or the number of topics. Non-negative matrix factorization (NMF) is an unsupervised topic
clustering algorithm widely used in data mining and text retrieval. It has been shown to be
successful in clustering Wikipedia articles (Finn 2008) and tweet topics (Tanash et al. 2015). In
our dataset each tweet vector is treated as one row in a term-document matrix generated from
tf-idf. The NMF algorithm factors the term-document matrix into a term-topic matrix and a
document-topic matrix. The term-topic matrix has a dimension of 𝑛 × 𝑚, where 𝑛 is the number of
topics, a parameter that requires tuning, and 𝑚 is the total number of words. We set 𝑛 equal to
10 and generated groups of topics for each community that we obtained in the previous section15.
The following section discusses the findings of this analysis.
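The factorization can be sketched in pure Python using the classic multiplicative update rules for the squared-error objective. This is a didactic sketch under our own assumptions (random initialization, a fixed number of iterations, a small eps to avoid division by zero) and not the library implementation used in the study; all function names are hypothetical.

```python
import random

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor v (documents x terms) into w (documents x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    d, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(d)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(d)]
    return w, h

def top_terms(h, vocabulary, k=10):
    """For each topic row of h, list the k terms with the largest weights."""
    return [[vocabulary[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:k]]
            for row in h]
```

Each row of h holds the term weights of one topic, so top_terms gives the kind of keyword list reported per community; the number of topics n_topics plays the role of 𝑛 in the text.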
15 n can be set to any number depending on how many topics we desire to extract. We have also tried setting n equal to 5, and found that the results did not change substantively. We did not report these results due to space limitations.
5. Topic Analysis per Community
Table 2 lists the topics and usernames for each community. For simplicity of presentation,
we translated and present only two topics per community. Full lists can be found in
Appendix B. The largest community has 7,791 users. The topics from this community reference
the corruption scandal of December 2013 where Tayyip Erdoğan’s sons (Bilal and Burak) were
frequently mentioned in indictments. Particularly, the voice recordings between Bilal Erdoğan
and Recep Tayyip Erdoğan quickly became famous after the tapes were leaked and published
online. Words such as thief, Tayyoş (a rude way of saying Tayyip), steal, Erdoğan, Bilal, voice,
Tayyip, and Burak are referring to this issue. Also the words, Turkey and new, refer to the “new
Turkey” concept created by Turkey’s ruling AKP party. The supporters of AKP use this term to
describe their role since 2002, which, as they argue, was to replace the old secular Turkish
republic founded in 1923. The top usernames belonging to this community also refer to the
corruption scandal. For example BASCALAN can be translated as PRIMETHIEF, which was
intended to refer to the Prime Minister (Basbakan) Erdoğan. We also checked the other
usernames, and found that all are anti-government users who have actively tweeted regarding the
corruption scandal.
Table 2: Topic Analysis for Large Communities
The topics from the remaining top communities are overwhelmingly on the Kurdish
issue. We frequently see words like soldier, police, aircraft, and AKP together with words such
as bomb, murder, kill, massacre, arrest, gas, helicopter, war, shoot, and guerilla. The topics also
include location names of the fight between Turkish Armed Forces and the PKK/HPG. These
include Silopi, Amed (Diyarbakir in Kurdish), Bismil, and Cizre. During the military strikes,
these distinct pro-Kurdish accounts have regularly claimed that the Turkish armed forces are
killing civilians, including children and women. As a result the following words occur very often
among the hot topics: civilian, age, shoot, child, murder, woman, massacre, kill, police, and
soldier. The usernames also have pro-Kurdish connotations, such as AjansaKurdi1,
AjansaKurdi2, Kurd24M, and curdistani. The remaining user accounts are also extremely
outspoken, and have been criticizing the government on the Kurdish issues, and are supporters of
Community with length 7791
Topics thief Tayyoş police leave protect steal medium Turkey new Erdoğan Bilal Tayyip country Burak take voice
Usernames Mehtapyucell , csagir2014 , Fili_Z_ , fuatavnifuat , BASCALAN
Community with length 5276
Topics police soldier die kill HPG civilian age shoot child murder Kurd do make AKP Turk massacre child war speak peace
Usernames AjansaKurdi2 ,Hevalizmir , sisivas ,xerzan4 ,KemaIPir
Community with length 5241
Topics do police people soldier PKK guerilla make HPG Amed die Cizre Ajanskurdi child walk murder voice woman police leave Cizreunderattack
Usernames AjansaKurdi1 , mednuce, YilmazGedik , Kurd24M , Rojava
Community with length 5079
Topics do follow continue account protest make resist lynch new friend police strike age striking arrest gas remember worker weapon throw
Usernames Denizhuseyinulas, birlesik , Daglaradogru , dewedersim, Ortak__Platform
Community with length 4879
Topics minute last police Silopi soldier murder helicopter Bismil Amed start PKK do action announce guerilla decision take absent aircraft bomb
Usernames curdistani , RebellionKurde , Nisebiin , SeyithanE, Kullikwebun
23
the anti-government pro-Kurdish movement. This why there is more overlap between the blue,
yellow, green and red communities in Figure 3.
5.1. Topics from Singleton Users
Our analysis also revealed singleton nodes, which we define as users with no links to
other users in our dataset (e.g., retweets or replies). Although these users are not influential in our
graph, we found that Turkish censorship authorities still target their accounts. We
analyzed the content of the 547 censored singleton tweets to understand why
the Turkish censorship authorities might have withheld them. As depicted in Figure 5, these users have no
links to other users in the graph. Instead of discussing specific topics (such as corruption or
the Kurdish issue), these tweets are mostly about certain individuals (pro- or anti-government),
and the overwhelming majority contain insulting language. It is likely that
the people who were insulted initiated the court process for censorship. The
largest group of tweets in this category targets the businessman Aydın
Doğan, who also owns a large media group. A total of 233 tweets
insult him. Although his media group recently lowered the tone of its government criticism, during
and after the Gezi Park protests it was one of the most outspoken groups
criticizing the government. This suggests that many of these tweets were created by users who
supported the government.16 We also identified 116 tweets that insult members of the AKP
government, such as Recep T. Erdoğan (current president), Ahmet Davutoğlu (former Prime
Minister), and Lütfi Elvan (former Minister of Transport, Maritime, and Communication). The
remaining 198 tweets target less prominent individuals, but still use insulting language.
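The singleton definition above can be sketched as a degree check over the interaction edges. This is a minimal illustration with hypothetical usernames and a hypothetical find_singletons helper. It also assumes that a self-reply does not count as a link to another user, which is a choice of this sketch.

```python
from collections import defaultdict

def find_singletons(users, interactions):
    """Return users that appear in no retweet or reply edge with another user."""
    degree = defaultdict(int)
    for source, target in interactions:
        if source == target:
            continue  # assumption: a self-reply does not link the user to anyone else
        degree[source] += 1
        degree[target] += 1
    return sorted(u for u in users if degree[u] == 0)

# Hypothetical users and (source, target) interaction pairs.
users = ["alice", "bob", "carol", "dave"]
interactions = [("alice", "bob"), ("bob", "alice"), ("dave", "dave")]
singletons = find_singletons(users, interactions)
```

In this toy input, carol has no interactions at all and dave only replies to himself, so both are treated as singletons.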
16 For example, the tweet “Sürüngen Aydın doğan ve yandaşları, tek tabanca RECEP TAYYIP ERDOĞAN”, can be translated to “Creeper Aydın Doğan and his team versus lonely pistol (warrior) Recep Tayyip Erdoğan”.
Figure 5: Users without edges (singletons)
6. Future Work
There are several directions for future work related to this research. One direction would be
to expand our work to include other social media platforms as data sources. However, because the vast
majority of posts on platforms such as Facebook tend to be private, unlike those on Twitter, we believe it
will be difficult to collect enough public data to repeat the same experiment.
Another direction would be to use a machine learning classifier to detect, immediately after
they are posted, tweets that may be censored in the future. An obvious choice would be a
Naïve Bayes classifier. An accurate classifier would enable researchers to collect live tweets from
the Twitter Streaming API and estimate the proportion of withheld tweets on the fly, without having
to recheck withheld statuses every few weeks or months. Additionally, Twitter’s Transparency
Report has been found to underreport the true population of withheld tweets (Tanash et al. 2015), so the
estimates from a tweet classifier could be used to cross-validate the actual scope of
censorship.
In our preliminary test on 20,000 sampled tweets, of which 16,000 were used for training
and 4,000 for testing, the classifier reached an average accuracy of 86% using a tf-idf
transformation. In terms of predicting future tweets, we are still tuning our classifier and have yet
to reach a definitive solution. We currently face two major challenges. One is the
lack of sufficient recent censored data against which to verify the classifier’s predictions, because
there is always a delay between the time a tweet is created and the time it is withheld. The other
is the evolving nature of Internet censorship. With new social events happening and
new political leaders emerging, the censored keywords continue to change.
More sophisticated classifiers are required to make better predictions. This is ongoing work,
and we hope to report more results in the near future.
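The setup described above, a tf-idf transformation followed by a Naïve Bayes classifier with an 80/20 train/test split, can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not our actual classifier: the tokenized toy tweets, the labels "withheld" and "public", the smoothed idf formula, and the class and function names are all hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def tfidf_vectors(docs, idf=None):
    """Turn tokenized documents into sparse tf-idf dicts; fit idf if none is given."""
    if idf is None:
        df = Counter(term for doc in docs for term in set(doc))
        idf = {t: math.log((1 + len(docs)) / (1 + n)) + 1 for t, n in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items() if t in idf}
               for doc in docs]
    return vectors, idf

class MultinomialNB:
    """Multinomial Naive Bayes over non-negative feature weights."""

    def fit(self, vectors, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        vocab = {t for v in vectors for t in v}
        totals = {c: defaultdict(float) for c in self.classes}
        for v, y in zip(vectors, labels):
            for t, w in v.items():
                totals[y][t] += w
        counts = Counter(labels)
        self.log_prior = {c: math.log(counts[c] / len(labels)) for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            # Laplace smoothing so unseen terms keep a non-zero probability.
            denom = sum(totals[c].values()) + alpha * len(vocab)
            self.log_like[c] = {t: math.log((totals[c][t] + alpha) / denom)
                                for t in vocab}
        return self

    def predict(self, vectors):
        def score(v, c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in v.items() if t in self.log_like[c])
        return [max(self.classes, key=lambda c: score(v, c)) for v in vectors]

# Hypothetical toy corpus: 10 withheld and 10 public tokenized tweets.
withheld_terms = ["massacre", "bomb", "guerilla", "thief"]
everyday_terms = ["morning", "coffee", "football", "weather"]
data = []
for i in range(10):
    data.append(([withheld_terms[i % 4], withheld_terms[(i + 1) % 4], "turkey"], "withheld"))
    data.append(([everyday_terms[i % 4], everyday_terms[(i + 1) % 4], "turkey"], "public"))
random.Random(0).shuffle(data)

split = len(data) * 4 // 5  # 80% training, 20% testing
train, test = data[:split], data[split:]
train_x, idf = tfidf_vectors([d for d, _ in train])
test_x, _ = tfidf_vectors([d for d, _ in test], idf)  # reuse idf fitted on training data
model = MultinomialNB().fit(train_x, [y for _, y in train])
predicted = model.predict(test_x)
accuracy = sum(p == y for p, (_, y) in zip(predicted, test)) / len(test)
```

The idf weights are fitted on the training tweets only and reused for the test tweets, so terms that first appear after training are simply ignored. This mirrors the second challenge noted above: when censored keywords change over time, a classifier trained on older data has no evidence for the new terms.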
A different direction would be to investigate what social media content increases the
likelihood of censorship. In a recent study, King et al. (2013) found that Chinese censorship
authorities mostly tolerate social media posts that criticize the government, but not those that
refer to a collective action event such as a street protest. Although their theoretical
expectations were confirmed in a dictatorial regime, it is not clear that these findings hold for
semi-democracies like Turkey. In particular, due to domestic as well as international concerns,
semi-democracies mostly allow political protests as long as they are peaceful. For similar
reasons, such governments also allow ordinary criticism of the government. However, because they
have electoral concerns, they have an incentive to silence criticism that has the
potential to harm the ruling party’s reelection prospects. To test the collective-action hypothesis,
we conducted a preliminary analysis by identifying the major street protests and the Turkish
keywords associated with these events.17 Using these keywords, we searched our
database for tweets posted on these dates and extracted both censored and uncensored tweets.18
We found that the government censored only a very small fraction (on average 5.2 percent) of
the tweets that refer to collective action.
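The keyword search described above can be sketched as a filter by date and keyword, followed by the share of withheld tweets per protest and the average across protests. The record fields (date, text, withheld), the English keywords, and the toy tweets below are hypothetical stand-ins. The actual analysis used Turkish keywords and our own database schema.

```python
from datetime import date

def censored_share(tweets, keywords, day):
    """Fraction of withheld tweets among those posted on `day` that contain a keyword."""
    matched = [t for t in tweets
               if t["date"] == day and any(k in t["text"].lower() for k in keywords)]
    if not matched:
        return None  # no tweet matched this event
    return sum(t["withheld"] for t in matched) / len(matched)

# Hypothetical keywords for two of the protest dates in the study window.
events = {
    date(2015, 6, 12): ["march"],
    date(2015, 6, 21): ["gezi"],
}
tweets = [
    {"date": date(2015, 6, 12), "text": "Join the march today", "withheld": False},
    {"date": date(2015, 6, 12), "text": "March for rights", "withheld": True},
    {"date": date(2015, 6, 21), "text": "Gezi anniversary", "withheld": False},
    {"date": date(2015, 6, 21), "text": "gezi park again", "withheld": False},
    {"date": date(2015, 6, 21), "text": "nice weather", "withheld": True},
]
shares = {day: censored_share(tweets, kws, day) for day, kws in events.items()}
average_share = sum(shares.values()) / len(shares)
```

Averaging the per-event shares, as done here, weights each protest equally regardless of how many tweets it generated. Pooling all matched tweets before dividing would instead weight each tweet equally.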
We further examined our database by focusing on the posts of users
whose tweets were only partially censored, and conducted two separate topic analyses on the
censored and uncensored tweets of these users.19 The results show that the most salient topics in
the non-censored tweets consist mostly of everyday language that is neither offensive nor
sharp-tongued. The topics from withheld posts show the opposite picture.
The words overwhelmingly refer to the Kurdish/terrorism issue. Most importantly, the
words associated with the government leadership (such as AKP, Erdoğan, government,
president, soldier, police etc.) frequently appear together with terms related to violence (such
as massacre, kill, attack, war etc.).
17 For the purpose of this preliminary analysis, our time frame is June 11, 2015 to June 23, 2015. There were a total of four major protests: Renault and EGO Workers (June 11), Women's Movement (June 12), Coal Miners (June 15), Gezi Park Anniversary (June 21).
18 King et al.’s (2013) methodology begins with identifying keywords for the most salient topic areas of recent Chinese politics and crawling posts using these keywords in different topic areas. Since our data collection method is substantively different from that of King et al. (2013), we are unable to implement their data analysis step by step.
19 By comparing the posts from similar types of users, we aimed to control for many unobservable user-specific attributes that can influence the censorship outcome (Fu et al. 2013).
7. Conclusion
Recent developments in communication technology have created significant shifts in
repressive governments’ attitudes toward censorship (Deibert 2009; Chadwick and Howard
2009). Although it was once widely accepted that the Internet was immune to control,
governments today implement highly sophisticated technological
mechanisms to regulate the information flow within their borders. As one of the leading
information sources, social media, particularly Twitter, has been greatly affected by this shift in
censorship policies. Twitter’s policy change in 2012 enabled governments to withhold
certain posts if they violate local laws. In this study we analyzed censored tweets from
Turkey, which has been by far the leading country in censorship in recent years.
We aimed to contribute to the political methodology literature by developing a
novel approach that creates complete dynamic data-flow graphs to model the relationships
between users and to identify influential users in user communities. Combining our approach
with a machine-learning topic clustering algorithm to analyze censored tweets in Turkey, we
investigated the hot topics in each community and found that the users of censored tweets in
Turkey mentioned topics related to the graft probe of 2013 and the Kurdish separatist
movement. We believe that our framework can be applied to any Twitter dataset to identify
influential users and hot topics, for applications such as targeted marketing and political campaigns.
Appendix A Legal Process of Censorship in Turkey
Unlike the censorship processes in dictatorial regimes, where censors block sites without much
accountability, Turkey’s Internet censorship policy has an institutionalized mechanism, which
is outlined in Law No. 5651. The process starts when individuals or
agencies initiate a court case claiming that certain content violates the law because it attacks
someone’s personal rights, is against the national interest, or for other reasons defined in the law. Individual
citizens can apply to a court through their lawyers or directly themselves.20 For government
agencies, the government recently created the Prime Ministry Security Affairs General Management
(BGİGM, Başbakanlık Güvenlik İşleri Genel Müdürlüğü), which is responsible for coordinating the
litigation processes initiated by government agencies. The designated court for Internet crimes is
the Penal Judgeship of Peace (Sulh Ceza Hakimliği).21 According to the law, the court must issue
its decision within 24 hours.
If the court accepts the claims, the decision is sent to the Telecommunication and
Communication Agency (TIB; Telekomünikasyon ve İletişim Başkanlığı), which is known as the
Internet watchdog of Turkey (Sozeri 2015). TIB is responsible for faxing the court decision to the
headquarters of the social media company (such as Twitter Inc.). If the company declines to withhold
the post, TIB may choose to block the entire site. This has happened a few times in Turkey; for
example, Twitter was temporarily blocked on July 22, 2015, after the Suruç bombing, which
killed 32 people. After a tweet or account is deleted or censored, Twitter occasionally
shares the court decision on the Chilling Effects website, now known as the Lumen Database.
20 For more information, see http://www.aljazeera.com.tr/haber/sosyal-medyada-kendinizi-nasil-korursunuz
21 BGİGM has mostly been using the Gölbaşı Penal Judgeship of Peace, which is located in Ankara. Altiparmak and Akdeniz (2015) state that this court has so far been very receptive to government agencies’ requests for censorship.
Figure 6 presents a screenshot of a court document stating the censorship decision. The shaded
regions cover the names of the judge and the person (or agency) who requested the censorship.
Figure 6: A sample court decision asking to censor a list of tweets
This censorship mechanism reflects a semi-democratic state structure. While the
government has benefited greatly from this censorship process, individual citizens and
opposition groups also take advantage of the system. That is to say, the government
dominates, but does not monopolize, the censorship process. For example, after the violent attacks
against local branches of the pro-Kurdish party HDP, the party’s lawyers went to court in
September 2015 to have a total of 132 social media accounts withheld. The lawyers argued that these
accounts encouraged violence against party members and hence threatened public
safety and peace.22 The court accepted the party’s allegations and ordered that
these accounts be withheld.23
22 These accounts are known as ak trolls, many of whom are paid workers whose responsibility is to promote and advocate the policies of the ruling AK Party (Saka 2014, Akser 2014).
23 For more information, see http://www.meydangazetesi.com.tr/gundem/mahkeme-trol-hesaplarini-hdp-nin-sikayeti-uzerine-kapatti-h33182.html (last accessed June 2011, 2016)
Appendix B Table 3: Topic Modeling per Community
Community Length
Topic No Topics
7791
0 ol allah gel kapak hayır cennet bilgi sahip biraz gerek
1 hırsız tayyoş be polis çık koru fink çal orta di
2 yap şimdi su gün fatih takip koy iste dahil pkk
3 ver hesab el kapat paylaş kal twitter destek sayfa youtube
4 son dakika et skandal durum kenan ışık kesin avukat nun
5 günaydın hâlâ off yaş bak internet sabah lokma helva müjde
6 se yaş katil ye allah asker zengin padişah yaz fakir
7 türki yeni erdoğan bilal tayyip ülke burak do al ses
8 ed be devam ülke diren atatürk hakk çocuk hakaret büyük
9 akp san mhp gör gid katil önce din üye pkk
5276
0 polis asker öl öldür hpg sivil yaş vur çocuk katled
1 son dakika çatış nusaybin ilçe saldırı şiddet mahalle kızıltepe patla
2 cizre ajansakurdi ed nur cizreunderattack hdpgenelmerkezi yürü kaybet çıkma hayat
3 halk savaş diren iste karşı barış silvan demirtaş gör cenaze
4 hesab yeni takip ff izmirliheval kapat türki destek lütfen dost
5 pkk yok yap eylem et al ilan biji in karar
6 ver ses oy destek ölüm el insan haydi se akp
7 ol insan devlet ülke tc büyük katil çocuk şeref güzel
8 kürt yap ed akp türk katliam çocuk savaş konuş barış
9 hdp saldırı bina barış oy yay kal demirtaş faşist in
5241
0 son dakika patla mahalle büyük silopi gel dakıka çete yakın
1 kürt akp türk çocuk öl karşı yok vur konuş faşist
2 cizre ajansakurdi çocuk yürü katled ses kadın polis çık cizreunderattack
3 ypg ypj twitterkurds kobanê kobane li in ışid biji el
4 hdp akp oy mhp bina parti saldırı aday seçi başkan
5 ver se oy ses be el kardeş barış destek haydi
6 yap demirtaş pkk başkan savaş bak güc karşı saldırı çağrı
7 ed polis halk asker pkk gerilla et hpg amed öl
8 destek ff hesab takip yeni arkadaş kapat türki dost hevalnooo
9 ol insan tt müşahit seçi çocuk san anne sonra emin
5079
0 ed takip devam hesab protesto et diren linç yeni arkadaş
1 ol insan sela yok san se iste büyük herkes tek
2 konser yorum grup yer izmir 00 büyük yıl 20 30
3 son dakika unut adalet durum al dakıka silvan kuzey yoldaş
4 gel bak gün güzel insan yoldaş san tweet çocuklar izmir
5 halk karşı cephe adalet sokak diren tv düşman açıkla hakk
6 cizre devlet çık ajansakurdi katil çocuk yaş yol yok terör
7 redhack yay via hackledi dost mesaj baki aile unut oku
8 polis saldırı yaş saldır gözaltı gaz an işçi silah at
9 ver akp hdp yap saldırı bina oy seçi silah katil
4879
0 turkish turkey to the in of kurds police and amp
1 dakika son polis silopi asker katlet helikopter bismil amed başla
2 ol kürt çocuk hesab hevalno gör lütfen yay iyi şimdi
3 pkk et eylem ilan gerilla karar al yok uçak bombala
4 hdp li istanbul kadın oy çık milletvekil hedef 00 saat
5 ajansakurdi cizre ta polis hükümet şırnak ye özel genc çıkma
6 ver destek oy se hesab insan ff gece tc cnnturk
7 twitterkurds kurdistan in of kurdish ısıs ypg syria terroristturkey rojava
8 an itibari şırnak gev diren silopi meydan nusaybin merkez cenaze
9 yap ed akp gel saldırı asker acil polis katliam katled
Works Cited
Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10, 1-5.
Akser, M. (2014). Turkish Film Festivals: Political Populism, Rival Programming and Imploding Activities. Film Festival Yearbook, 6, 141-155.
Altiparmak, K., & Akdeniz, Y. (2015). 5651, Madde 8/A Buyuk Sansur donanmasinin Amiral Gemisi. Guncel Hukuk, 11-143.
Bamman, D., O'Connor, B., & Smith, N. (2012). Censorship and deletion practices in Chinese social media. First Monday, 17(3).
Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.
Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Bonneau, J., Anderson, J., Anderson, R., & Stajano, F. (2009, March). Eight friends are enough: social graph approximation via public listings. In Proceedings of the Second ACM EuroSys Workshop on Social Network Systems (pp. 13-18). ACM.
Chadwick, A., & Howard, P. N. (2009). Introduction: New directions in Internet politics research. Routledge handbook of Internet politics, 1-9.
Chu, Z., Gianvecchio, S., Wang, H., & Jajodia, S. (2010, December). Who is tweeting on Twitter: human, bot, or cyborg? In Proceedings of the 26th Annual Computer Security Applications Conference (pp. 21-30). ACM.
Deibert, R. (2009). The geopolitics of internet control: Censorship, sovereignty, and cyberspace. The Routledge handbook of internet politics, 323-336.
Fu, K. W., Chan, C. H., & Chau, M. (2013). Assessing censorship on microblogs in China: Discriminatory keyword analysis and the real-name registration policy. IEEE Internet Computing, 17(3), 42-50.
Huberman, B. A., Romero, D. M., & Wu, F. (2008). Social networks that matter: Twitter under the microscope. Available at SSRN 1313405.
Java, A., Song, X., Finin, T., & Tseng, B. (2007, August). Why we twitter: understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis (pp. 56-65). ACM.
King, G., Pan, J., & Roberts, M. E. (2013). How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(02), 326-343.
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's firehose. arXiv preprint arXiv:1306.5204.
Newman, M. E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577-8582.
Nielsen, F. Å. (2008). Clustering of scientific citations in Wikipedia. arXiv preprint arXiv:0805.1154.
Saka, E. (2014). The AK Party's social media strategy: controlling the uncontrollable. Turkish Review, 4(4), 418.
Sozeri, E. K. (2015). The Two Faces of Twitter. Published at bianet.org.
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
Tanash, R. S., Chen, Z., Thakur, T., Wallach, D. S., & Subramanian, D. (2015, October). Known Unknowns: An Analysis of Twitter Censorship in Turkey. In Proceedings of the 14th ACM Workshop on Privacy in the Electronic Society (pp. 11-20). ACM.
Ward, M. D., Stovel, K., & Sacks, A. (2011). Network analysis and political science. Annual Review of Political Science, 14, 245-264.
Weng, J., Lim, E. P., Jiang, J., & He, Q. (2010, February). Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining (pp. 261-270). ACM.
Yamaguchi, Y., Takahashi, T., Amagasa, T., & Kitagawa, H. (2010). Turank: Twitter user ranking based on user-tweet graph analysis. In International Conference on Web Information Systems Engineering (pp. 240-253). Springer Berlin Heidelberg.
Zhu, T., Phipps, D., Pridgen, A., Crandall, J. R., & Wallach, D. S. (2013). The velocity of censorship: High-fidelity detection of microblog post deletions. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13) (pp. 227-240).