
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Troll Detection
A comparative study in detecting troll farms on Twitter using cluster analysis

FELIX DE SILVA
MARTIN ENGELIN

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Troll Detection
A comparative study in detecting troll farms on Twitter using clustering algorithms
(Swedish title: Trolldetektion - En jämförande studie i att upptäcka trollfarmar på Twitter med hjälp av klusteralgoritmer)

FELIX DE SILVA
MARTIN ENGELIN

Degree project in Computer Science, DD143X
Supervisor: Dilian Gurov
Examiner: Örjan Ekeberg

CSC, KTH 2016-05-11

Abstract

The purpose of this research is to test whether clustering algorithms can be used to detect troll farms in social networks. Troll farms are professional organizations that spread disinformation online via fake personas. The research involves a comparative study of two different clustering algorithms and a dataset of Twitter users and posts that includes a fabricated troll farm. By comparing the results and the implementations of the K-means as well as the DBSCAN algorithm we have concluded that cluster analysis can be used to detect troll farms and that DBSCAN is better suited for this particular problem compared to K-means.

Sammanfattning (Swedish abstract)

The goal of this report is to test whether clustering algorithms can be used to identify troll farms on social media. Troll farms are professional organizations that spread disinformation online using fake identities. This report is a comparative study of two different clustering algorithms and a dataset of Twitter users and tweets that includes a fabricated troll farm. By comparing the results and implementations of the K-means and DBSCAN algorithms, we conclude that clustering algorithms can be used to identify troll farms and that DBSCAN is better suited to this problem than K-means.

Contents

1 Introduction
   1.1 Problem definition
   1.2 Scope and constraints
2 Background
   2.1 Twitter
      2.1.1 Twitter REST API
   2.2 IFTTT
   2.3 Troll
      2.3.1 History
      2.3.2 Characteristics
   2.4 Cluster Analysis
      2.4.1 What is a cluster?
      2.4.2 Similarity between data points
      2.4.3 Hierarchical Clustering
      2.4.4 Partitive Clustering
      2.4.5 Model-Based Clustering
      2.4.6 Density-based Clustering
3 Method
   3.1 Generate trolls
   3.2 Construct the cluster algorithms
   3.3 Collect Twitter data
   3.4 Generate results
      3.4.1 DBSCAN
      3.4.2 K-means
   3.5 Method reasoning
4 Results
   4.1 Twitter Data
   4.2 Algorithms
      4.2.1 K-means
      4.2.2 DBSCAN
5 Discussion
   5.1 Future research
   5.2 Method discussion
6 Conclusion
7 Appendix
   7.1 Twitter data
   7.2 K-means results - multiple run
   7.3 DBSCAN results

1 Introduction

For millions of people around the world, social media sites are an integrated part of their daily life. There are hundreds of different social media sites supporting a wide range of practices and interests [5]. Social networks such as Facebook and Twitter have become a source of news and a platform for political and moral debate for many users. Stories with varying degrees of truthfulness are spread, and little source criticism is applied by regular people as well as journalists. [10] The act of spreading disinformation on social media has developed from being caused by bored youths to being commercialized by organisations and political blocs in the form of troll farms.

A troll farm is an organization whose sole purpose is to affect public opinion by means of social media. A practical implementation of a system or software that can identify troll farms could be used to stop them and thereby avoid the spread of disinformation. Such an implementation would be of interest to the politicians, media, social networks or organizations that are targeted, since it could be used to clear their names.

1.1 Problem definition

The aim of this project is to investigate ways to detect troll farms on Twitter with clustering algorithms and the Twitter API. The approach will be to study clustering algorithms, apply them to a database of tweets and analyze the results. Clustering algorithms are very dependent on the cluster structure, and there is therefore no single algorithm that works on every problem instance. The goal is to research the Twitter REST API and to find out what kind of clustering algorithm is the most appropriate for clustering Twitter users. The research will also cover different kinds of clustering models and appropriate algorithms for them, in order to find out whether there are any comparative advantages or disadvantages between different clustering models.

Therefore, the problem statement is:

• Which clustering model is the most appropriate for clustering Twitter data in search of troll farms?

1.2 Scope and constraints

For this project we will research what types of clustering models there are and, based on that research, choose two models to use and analyze. Our search will be based on the activity of users rather than the content of their statuses. The main parameters to be analyzed are:

• Time of day activity

• Rate of tweets


Future research on this topic can analyze more models as well as other searchparameters. A more detailed description of the search parameters can be foundin section 3.


2 Background

2.1 Twitter

Twitter is a social network platform based on one-to-many communication which enables users to post messages using 140 or fewer characters. These posts, called Tweets, can include plain text, links or other media hosted on different web servers. This simple design enables several different uses of Twitter, making it a sort of mashup of text, email, IM, news forum, microblog and social network. [13]

Twttr, which was the original name for Twitter, was created in 2006. In the beginning it was a text message-based communication tool for groups. Text messages (SMS) are for historical reasons limited to 160 characters, which resulted in Twitter's character limit of 140: 20 characters for the username, 140 characters for the message. The first real success for the platform came at SXSW Interactive in 2007, where it won the SXSW Web Award in the Blog category. At this point smartphones had not yet become widespread, and the user base of phones that could only send texts was large. This was an important reason for Twitter's success: anyone could engage in social media without a computer. Today Twitter has evolved into a web-based product with simple but smart APIs. [13]

2.1.1 Twitter REST API

API stands for Application Programming Interface: a set of routines, protocols, and tools for building software applications, and it is what allows an application to share its data with the rest of the world. Like a website, an API is accessed through URL requests, but instead of returning web pages it returns structured data. The Twitter API was originally divided into two REST APIs and a Streaming API. REpresentational State Transfer (REST) is an architectural style that ensures that data access is stateless, layered and well defined. This increases scalability and flexibility as well as ease of development. [9]

The existence of two REST APIs was due to historical reasons. A company called Summize, Inc. provided search capability for Twitter data. When Summize was later acquired by Twitter, it proved difficult to fully integrate Twitter Search and its API into the Twitter codebase. It took several years, but today they are both integrated into a single REST API. [13] The REST API uses OAuth to identify Twitter users and applications. [8]

2.2 IFTTT

IFTTT, or "If This Then That", is a web service that can connect and aggregate many other web apps into one platform and then perform a specific action when certain criteria are met.

IFTTT gives users creative control over apps and products by letting them create recipes that perform these actions. A recipe simply connects apps and web services together to create an action that is performed under the condition that some criterion has been met. There are two types of recipes, IF recipes and DO recipes, which trigger their actions in different manners. [7]


• IF recipes run automatically in the background and perform their action when the recipe's IF condition has been fulfilled.

• DO recipes run their action only when manually executed.

2.3 Troll

An Internet troll does not in any way resemble the original mythical creature from old Scandinavian folklore. An Internet troll (henceforth referred to as troll) is a person who interrupts, harasses or tries to impose his or her own opinions on others. [16] In the early days trolls were mostly considered a small nuisance on online forums. Since then the Internet has grown, and the problem of trolling with it. From being an activity primarily performed by bored individuals, it has evolved to be industrialized by states and terrorist organizations. These professional groups are sometimes called troll farms. [2]

Trolls and their activities, trolling, can exist in any kind of social media.Twitter has become a popular platform for trolling activity due to its role as anews forum as well as the fact that anyone can create multiple accounts.

2.3.1 History

Troll farms and their activities are often criminal, or at least morally questionable, and because of this their history is not completely clear. Professional troll farms have been known to exist in Russia since at least 2008, but possibly even earlier, and probably all around the world. The troll business grows more popular each year, and although it is hard to estimate exactly how many people work in it, claims have been made that there are thousands.

One known example of a troll farm is the so-called 'Internet Research Agency' in St. Petersburg, which was first exposed in 2013 by a Russian journalist working undercover at the agency. [2] At that point the agency had 40 rooms housing around 400 employees, had a budget of around 400,000 US dollars a month, and was active on every big social media platform, including Twitter.

2.3.2 Characteristics

The work at the Internet Research Agency was very organized. Each day the management set up quotas for the number of posts on different topics that each employee had to meet on different platforms. These platforms were blogs as well as every big social media platform, including Twitter. Employees had to work 12 hours a day, from 09:00 to 21:00.

Every employee had a number of fake personas that posted new updates daily. These fake personas often did not have a large number of friends or followers, and the ones they did have were mostly other fake accounts, often created in close proximity to each other. [2]


2.4 Cluster Analysis

Cluster analysis is a generic term for a wide range of numerical methods whose goal is to discover groups of homogeneous data points within a set of data points.

Clustering techniques and algorithms try to formalize and mimic the human ability to observe and discover patterns in a set of points. [1] These algorithms assess whether or not a set of points can be summarized in any meaningful way by a relatively small number of groups or clusters, where the objects within a cluster resemble each other and differ from those in other clusters. [14]

Clustering techniques and algorithms can be divided into a number of categories. These categories describe the characteristics of how the data points will be clustered together.

Examples of categories are:

• Hierarchical Clustering

• Centroid-Based Clustering

• Model-Based Clustering

• Density-Based Clustering

Fig 2.1: 2D clustering of data [12]

2.4.1 What is a cluster?

The terms cluster, group and class have long been used in many intuitive ways without a real formal definition. The term cluster is often used to describe some form of group or collection of objects or points whose properties have similar values or attributes. [14]


2.4.2 Similarity between data points

Clusters consist of a number of objects or data points that have some similarity between them. This measure of similarity can be computed in different ways depending on the data set. The measure is generally scaled to the range [0,1], although it is sometimes expressed as a percentage in the range 0-100%.

Two points x and y have a similarity coefficient Sxy of 1 (or 100%) if the points are identical, meaning that all the properties of the points are equal. Conversely, a similarity value of zero indicates that the two points are completely different from each other.

Different measures of similarity can be calculated from the same set of data points or objects. This will often lead to different conclusions when used as the basis of the cluster analysis. [14]

Mathematical formulas that can be used to calculate the distance between two points in an n-dimensional space include:

• Euclidean distance - Calculates the distance between two points by measuring the length of the line segment between them. In two dimensions this is given by the Pythagorean theorem; the generalization to n dimensions is called the Euclidean distance.

• Manhattan distance - Calculates the distance between two points by measuring the lengths of the horizontal and vertical lines from point A to point B.

Both formulas give a good measure of the distance between points in an n-dimensional space, but depending on the context one formula can be better than the other.
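As a concrete sketch, the two measures can be computed for arbitrary n-dimensional points as follows (the class and method names are our own illustration, not part of the thesis implementation):

```java
// Two common distance measures between n-dimensional points.
public class Distance {

    // Euclidean distance: length of the straight line segment between a and b.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of the absolute per-axis differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {0, 0};
        double[] b = {3, 4};
        System.out.println(euclidean(a, b)); // 5.0
        System.out.println(manhattan(a, b)); // 7.0
    }
}
```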

2.4.3 Hierarchical Clustering

Hierarchical Clustering is a class of clustering algorithms that produces a hierarchical classification of the data. The data is not partitioned into clusters or groups in a single step. Instead the classification consists of a number of recursive partitions that may range from the whole data set in one cluster to n clusters each containing a single data point. [1]

The algorithms construct clusters by recursive partitioning in a top-down or bottom-up fashion, and can in turn be divided into two subcategories.

Agglomerative hierarchical clustering algorithms produce clusters by a series of successive fusions, where n individual data points are fused into larger clusters. With these algorithms a fusion is irreversible: once two points have been placed in the same cluster, they cannot subsequently appear in different clusters.

Divisive hierarchical clustering algorithms produce clusters in the reverse order. Here all data points initially belong to a single cluster, which is recursively divided into smaller sub-clusters.


The result of a hierarchical clustering can be represented as a scatter plot or a dendrogram showing the clusters of data points. [6]

Examples of hierarchical cluster algorithms are:

• Single-link Clustering

• Average-link Clustering

• Complete-link Clustering

Fig 2.2: Dendrogram of clustered data, where the clusters emerges when a cutis made across the horizontal axis [15]
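The agglomerative (bottom-up) approach described above can be sketched in code. The following is a minimal single-link example over one-dimensional points; all names are our own and the implementation is illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal agglomerative single-link clustering of 1-D points:
// start with one cluster per point, then repeatedly fuse the two
// closest clusters (closest = smallest gap between any two of their
// members) until only k clusters remain. Fusions are irreversible.
public class SingleLink {

    static List<List<Double>> cluster(double[] points, int k) {
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) {
            List<Double> c = new ArrayList<>();
            c.add(p);
            clusters.add(c);
        }
        while (clusters.size() > k) {
            int bestA = 0, bestB = 1;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = minGap(clusters.get(i), clusters.get(j));
                    if (d < bestDist) {
                        bestDist = d;
                        bestA = i;
                        bestB = j;
                    }
                }
            }
            // Fuse the closest pair of clusters.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }

    // Single-link distance: smallest pairwise gap between two clusters.
    static double minGap(List<Double> a, List<Double> b) {
        double min = Double.MAX_VALUE;
        for (double x : a)
            for (double y : b)
                min = Math.min(min, Math.abs(x - y));
        return min;
    }

    public static void main(String[] args) {
        // {1, 2} and {10, 11, 12} should emerge as the two clusters.
        System.out.println(cluster(new double[]{1, 2, 10, 11, 12}, 2));
    }
}
```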

2.4.4 Partitive Clustering

The most commonly used and simplest centroid-based clustering algorithm is Lloyd's algorithm, also known as the K-means algorithm. This non-deterministic algorithm employs a squared-error criterion, partitioning the data set into K separate clusters, each represented by its center. The center of each cluster is computed as the mean of all the data points belonging to that cluster.

The K-means algorithm can be viewed as a gradient-descent algorithm, meaning that it uses gradient descent to find a local minimum of the error function. It does this by starting with an initial set of K cluster centers and iteratively updating them so that the error function decreases.

A problem with the K-means algorithm is the initial selection of the clusters. The algorithm is very sensitive to this selection, which means that different initial selections can lead to different minima, both local and global. In addition to the sensitivity to the initial selection, the algorithm is also sensitive to noisy data and outliers. Noisy data and outliers will negatively affect the result; a single outlier can increase the squared error dramatically. [6] However, because K-means is non-deterministic, the result can be better approximated by running the algorithm several times and compiling the runs into a single result.

Fig 2.3: K-means clustering represented in Voronoi-cells [4]
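The iteration described above can be sketched as follows. This is a one-dimensional illustration in which the initial centers are supplied by the caller; in practice they are usually chosen at random, which is the source of the non-determinism discussed above. The names are our own:

```java
import java.util.Arrays;

// One-dimensional K-means (Lloyd's algorithm): assign each point to its
// nearest center, recompute each center as the mean of its points, and
// repeat until the assignments stop changing.
public class KMeans {

    static double[] run(double[] points, double[] initialCenters) {
        double[] centers = initialCenters.clone();
        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: each point goes to its nearest center.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best]))
                        best = c;
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // Update step: each center becomes the mean of its cluster.
            for (int c = 0; c < centers.length; c++) {
                double sum = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++)
                    if (assignment[i] == c) { sum += points[i]; n++; }
                if (n > 0) centers[c] = sum / n; // empty clusters keep their center
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] centers = run(new double[]{1, 2, 10, 11}, new double[]{0, 5});
        System.out.println(Arrays.toString(centers)); // [1.5, 10.5]
    }
}
```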

2.4.5 Model-Based Clustering

The previously described hierarchical and centroid-based clustering approaches are based on heuristic but intuitively reasonable methods rather than on formal models of how the data is clustered. This makes it difficult to decide which method to use, what number of initial clusters to choose, and so on.

In practice this is not a problem, since cluster analysis is most often used as a tool for data analysis. However, if an acceptable formal model of the cluster structure could be used, then results from a cluster analysis based on that model would give more reliable solutions. [1]

Model-based methods attempt to optimize the fit between the data points and some mathematical model. Unlike heuristic clustering, model-based clustering also tries to find characteristics that describe each group, so that each group represents a concept or class. [6] The most common models are:

• Decision Trees - Data is represented as a hierarchical tree.

• Neural Network - Data is represented as neurons and constructed likea graph, often used in machine learning.

2.4.6 Density-based Clustering

Density-based clustering differs somewhat from heuristic and model-based clustering. These methods need neither the number of clusters as input nor an assumption about the underlying density. [11]

Density-based methods assume that the data points belonging to a cluster are drawn from a probability distribution, and that the data as a whole comes from a mixture of several such distributions. The aim of these algorithms is to identify the clusters and their distribution parameters. [6]

Density-based clusters are sometimes also known as "natural clusters", since the method is well suited to naturally shaped clusters. In the case of spatial data, clusters of points in a 3D space may form along natural structures such as mountains, rivers and caverns.

The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) can discover clusters of arbitrary shape. The algorithm is efficient for large spatial databases and uses search methods to find the neighborhood of each data point in the database. [11]

Fig 2.4: Density-based Clustering [3]
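A minimal sketch of the DBSCAN idea over one-dimensional points, using a neighborhood radius (eps) and a minimum neighborhood size (minPts); this is an illustration with our own names, not the implementation used in the thesis:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal DBSCAN over 1-D points. A point is a "core" point if at least
// minPts points (itself included) lie within distance eps of it. Clusters
// grow outward from core points; points reachable from no core point are noise.
public class Dbscan {

    static final int NOISE = -1;
    static final int UNVISITED = 0;

    // Returns a cluster label per point: 1, 2, ... for clusters, -1 for noise.
    static int[] run(double[] pts, double eps, int minPts) {
        int[] label = new int[pts.length]; // 0 = unvisited
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (label[i] != UNVISITED) continue;
            if (countNeighbors(pts, i, eps) < minPts) {
                label[i] = NOISE; // may later be absorbed as a border point
                continue;
            }
            cluster++;
            Deque<Integer> frontier = new ArrayDeque<>();
            frontier.push(i);
            label[i] = cluster;
            while (!frontier.isEmpty()) {
                int p = frontier.pop();
                if (countNeighbors(pts, p, eps) < minPts) continue; // border point
                for (int q = 0; q < pts.length; q++) {
                    if (Math.abs(pts[p] - pts[q]) <= eps
                            && (label[q] == UNVISITED || label[q] == NOISE)) {
                        boolean wasUnvisited = (label[q] == UNVISITED);
                        label[q] = cluster;
                        if (wasUnvisited) frontier.push(q);
                    }
                }
            }
        }
        return label;
    }

    static int countNeighbors(double[] pts, int i, double eps) {
        int n = 0;
        for (double p : pts)
            if (Math.abs(pts[i] - p) <= eps) n++;
        return n;
    }

    public static void main(String[] args) {
        // Two dense groups and one far-away outlier.
        double[] pts = {1, 1.5, 2, 10, 10.5, 11, 50};
        System.out.println(java.util.Arrays.toString(run(pts, 1.0, 3)));
    }
}
```

Note how, unlike K-means, the outlier is never forced into a cluster: it simply stays labeled as noise.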


3 Method

Detecting troll networks is a difficult task, and many factors come into play when trying to determine whether a user is part of such activity. To make the task more manageable we have restricted it to only take the activity of the users into consideration, not the content of their tweets.

In order to implement and analyze our approach, a number of things have to be done. The first step is to create a small troll network by creating Twitter accounts and letting them act in a predefined, uniform way. This is done because testing cluster algorithms on random Twitter data would not yield results that could easily be analyzed, since it would be nearly impossible to determine whether they are true or false. Having a known troll network in the Twitter data gives the cluster algorithms a clear goal: to find the troll network.

The next step is to implement and construct the cluster algorithms, as wellas getting the relevant data from Twitter in a format that is suitable for thealgorithms.

3.1 Generate trolls

The key to generating fake trolls is to mimic the behaviour of real trolls. In this case the trolls are part of a network that is most likely run in an organized manner, like a normal company. Because of limited resources, the generated troll network used in this thesis consists of no more than 4 trolls. Each troll must obey the following behaviours:

• First tweet of the day between 08:00 and 09:30

• Last tweet of the day between 14:00 and 16:00

• 5-8 tweets every day

The trolls are only active during workdays (Monday - Friday), and in addition to the behaviours listed above a special behaviour is used where the trolls at specific times burst out many statuses in a short time span.

In order to maintain consistency in the trolls' behaviour and simplify their management, special services can be used. With IFTTT a recipe can be set up to tweet for every event started in an online calendar. Connecting the trolls' Twitter accounts to IFTTT and a calendar reduces the work to creating events in the calendar, which also gives a clear overview of the trolls' behaviour.
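The behaviour constraints listed above can be sketched as a small schedule generator. This is purely illustrative (the thesis drives the real accounts through IFTTT and a calendar); all names and the choice of minutes-after-midnight units are our own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Generates one workday's tweet times (minutes after midnight) for a fake
// troll: first tweet in [08:00, 09:30], last tweet in [14:00, 16:00],
// 5-8 tweets per day, remaining tweets placed somewhere in between.
public class TrollSchedule {

    static List<Integer> day(Random rnd) {
        int first = 8 * 60 + rnd.nextInt(91);   // 08:00-09:30
        int last = 14 * 60 + rnd.nextInt(121);  // 14:00-16:00
        int count = 5 + rnd.nextInt(4);         // 5-8 tweets
        List<Integer> times = new ArrayList<>();
        times.add(first);
        for (int i = 0; i < count - 2; i++)
            times.add(first + 1 + rnd.nextInt(last - first - 1));
        times.add(last);
        times.sort(null);
        return times;
    }

    public static void main(String[] args) {
        System.out.println(day(new Random()));
    }
}
```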

3.2 Construct the cluster algorithms

For many clustering algorithms there is plenty of documentation and there are many examples online, so implementing them is generally not a big problem. What varies between different implementations of an algorithm is the number of parameters used. For this thesis each data point consists of 6 parameters:

• Average first tweet per day


• Average last tweet per day

• Average length of day (time between first tweet and last tweet)

• Deviation of length of day

• Average rate of tweets per day

• Deviation of rate of tweets per day
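As an illustration of how the six parameters could be derived from raw tweet timestamps (assuming, for the sketch, that each user's tweets are given as seconds since midnight grouped per day; names are our own, not from the thesis implementation):

```java
// Computes the six per-user activity parameters from tweet times,
// given as seconds-since-midnight grouped by day: days[d][i] is the
// time of the i-th tweet on day d (each day's times sorted ascending).
public class Features {

    static double[] compute(int[][] days) {
        int n = days.length;
        double[] first = new double[n], last = new double[n],
                 length = new double[n], count = new double[n];
        for (int d = 0; d < n; d++) {
            first[d] = days[d][0];
            last[d] = days[d][days[d].length - 1];
            length[d] = last[d] - first[d];
            count[d] = days[d].length;
        }
        return new double[] {
            mean(first),   // average first tweet of the day
            mean(last),    // average last tweet of the day
            mean(length),  // average length of day
            stdDev(length),// deviation of length of day
            mean(count),   // average rate of tweets per day
            stdDev(count)  // deviation of rate of tweets per day
        };
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    static double stdDev(double[] xs) {
        double m = mean(xs), sum = 0;
        for (double x : xs) sum += (x - m) * (x - m);
        return Math.sqrt(sum / xs.length);
    }

    public static void main(String[] args) {
        // Two days: 3 tweets from 09:00 to 15:00, then 5 tweets from 08:00 to 16:00.
        int[][] days = {
            {9 * 3600, 12 * 3600, 15 * 3600},
            {8 * 3600, 10 * 3600, 12 * 3600, 14 * 3600, 16 * 3600}
        };
        System.out.println(java.util.Arrays.toString(compute(days)));
    }
}
```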

This research compares two algorithms: DBSCAN and K-means. DBSCANis a density-based clustering algorithm that connects all data that are withina specified distance from each other into clusters. K-means is a centroid-basedclustering algorithm that clusters the data based on a specified number of clus-ters. The fact that DBSCAN is deterministic and K-means is non-deterministicaffects the approach to each algorithm and how the results are presented.

3.3 Collect Twitter data

The Twitter API allows programs to perform normal Twitter searches, among other things. For our purposes, instead of printing the results of a search they are stored in a simple database, shown in figure 3.1. A search with the Twitter API yields large amounts of information about each tweet or user, such as geodata, language and hashtags. Only the data that are relevant for our algorithms are stored in the database. The implementation in this thesis is programmed in Java, using a library called Twitter4J which integrates the Twitter API with Java.

Fig 3.1: Database tables

Apart from the users in our troll network, the Twitter profiles are randomly selected by first searching for popular hashtags and storing the users in the database. Every day for the 14 consecutive days, these users' statuses are downloaded and stored in the database.

The different parameters vary greatly in magnitude. The length of the day, for example, is measured in seconds, giving values from a few thousand to over 50 thousand, while the number of tweets is often somewhere between 1 and 30. Using such values directly as parameters in a clustering algorithm is not ideal, since the algorithm calculates distances between points: if one parameter has a much bigger magnitude than another, it will affect the algorithm more than a parameter with smaller magnitude. Because of this, the values are rescaled based on their magnitude and deviation, in an attempt to make each parameter equally important.
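The thesis does not spell out the exact rescaling formula. One common way to achieve this kind of equal weighting is z-score standardization, sketched here as an assumption rather than as the thesis's actual method:

```java
// Z-score standardization of one feature column: subtract the mean and
// divide by the standard deviation, so the feature has mean 0 and
// deviation 1 and contributes comparably to distance calculations.
public class Standardize {

    static double[] zScores(double[] xs) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / xs.length);
        double[] out = new double[xs.length];
        for (int i = 0; i < xs.length; i++)
            out[i] = (xs[i] - mean) / sd;
        return out;
    }

    public static void main(String[] args) {
        // Day lengths in seconds: large raw magnitudes...
        double[] z = zScores(new double[]{20000, 30000, 40000});
        // ...become comparable, unit-free values.
        System.out.println(java.util.Arrays.toString(z));
    }
}
```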

In the end a total of over 7000 tweets by 130 users were downloaded andstored into the database to be used by the clustering algorithms.

3.4 Generate results

When generating the results from the two algorithms, some adjustments have to be made so that the results can be presented in a neat and uniform way.

3.4.1 DBSCAN

The DBSCAN algorithm has variables that control the minimal acceptable size of a cluster and the threshold distance. These variables have to be specified, and after some experimentation with different values they were set as follows:

• Minimal acceptable size of a cluster = 4

• Threshold distance = 4900

The minimal acceptable size of a cluster is set to 4 simply because it seems reasonable to only define a set of points as a cluster if it contains at least 4 points.

3.4.2 K-means

K-means is a non-deterministic algorithm: it does not yield the same results every time it is run on the same input. Because of this, the algorithm is run 10000 times and a calculation is made of how often each user ends up in the same cluster as every other user. A threshold of 0.9 (90%) is then used: if two users end up in the same cluster 90 percent of the time, they are considered similar.

Another value that has to be set in the K-means algorithm is the number of clusters (the K-value). With our data of about 130 users we found that a K-value of 7-12 gave the most consistent results.
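The repeated-run bookkeeping described above can be sketched as follows: given the cluster assignment of every user in every run, compute how often a pair of users shared a cluster (the names are our own illustration):

```java
// Given cluster assignments from several K-means runs
// (assignments[r][u] = cluster of user u in run r), computes the
// fraction of runs in which users a and b landed in the same cluster.
public class CoOccurrence {

    static double sameClusterFraction(int[][] assignments, int a, int b) {
        int together = 0;
        for (int[] run : assignments)
            if (run[a] == run[b]) together++;
        return (double) together / assignments.length;
    }

    // Users are "connected" if they co-occur in at least 90% of runs.
    static boolean connected(int[][] assignments, int a, int b) {
        return sameClusterFraction(assignments, a, b) >= 0.9;
    }

    public static void main(String[] args) {
        // 4 toy runs over 3 users: users 0 and 1 always share a cluster,
        // user 2 only sometimes does.
        int[][] runs = {
            {0, 0, 1},
            {2, 2, 2},
            {1, 1, 0},
            {0, 0, 1}
        };
        System.out.println(sameClusterFraction(runs, 0, 1)); // 1.0
        System.out.println(connected(runs, 0, 2));           // false
    }
}
```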

3.5 Method reasoning

One obvious problem with our method is the fact that we created the troll network we are searching for ourselves. Even though we have tried not to let that affect our implementations of the algorithms, it is hardly a very scientific approach. The reason we have done it this way is that we need something known to search for. If we just ran the algorithms on random Twitter data, analyzing the output and generating any sensible results would require a workload outside the scope of this report.


4 Results

4.1 Twitter Data

A total of 7100 tweets from 130 users, including our 4 troll accounts, was retrieved. This data was analyzed, and for each user the 6 parameters mentioned in section 3.2 were calculated. The results can be found in section 7.1.

4.2 Algorithms

The two algorithms were run with the Twitter data as input. Because of the different natures of the algorithms, the results are not presented in the same way for both.

4.2.1 K-means

The results of the K-means algorithm consist of two parts: the results of a single run of the algorithm and the aggregated results of many runs.

Figure 4.1 shows examples of a single run in the form of graphs. Since there are 6 parameters, the graphs have been divided into three 2-dimensional graphs. The colored squares represent users and the colored circles represent the centroids of each cluster. The trolls have a black frame.

Fig 4.1a: First/Last


Fig 4.1b: Length/Amount

Fig 4.1c: Deviations

In figures 4.1a-c, all of the trolls except one have been placed in the same cluster. Another thing to note is that not all clusters are filled. In this case K is chosen to be 7, but one of the clusters is empty. This is a property of the K-means algorithm: if K is increased, the theoretical number of clusters increases as well, but in practice the number of non-empty clusters is often about 5-7 (with this specific data). Increasing K is in other words not a way to increase the accuracy of the results.

The aggregated result of multiple runs, with calculations of which users are connected, is available as a list in section 7.2. This list contains each user that ends up in the same cluster as some other user at least 90% of the time, calculated over 10000 runs of the algorithm. Note that the trolls are all connected with each other, but also that there are many other connections between different users, some with a higher percentage than our trolls.

4.2.2 DBSCAN

The result of the DBSCAN algorithm consists of a single run, where the output is the clusters that satisfy the conditions we have set.

Figures 4.2a-c show the result of the DBSCAN algorithm on the Twitter data including the trolls. The cluster in red contains the trolls, and the other clusters are Twitter profiles that match the same DBSCAN criteria.

Fig 4.2a: First/Last


Fig 4.2b: Amount/Length

Fig 4.2c: Deviations


In these figures you can see that the troll cluster (in red) is tightly bundled together, while the other clusters lie either far away from or close to the trolls.

A thing to note is that in figures 4.2a and 4.2c the cluster in green has many similarities with the trolls in both time-of-day activity and deviation, whilst the clusters in blue and black, which were also caught by the DBSCAN algorithm, are quite far away from our trolls and share no similarities with them except the deviation.

The result output can be found in section 7.3.


5 Discussion

The results show that clustering based on user activity can be used to detect our troll network. It has also been shown that DBSCAN is a better approach than K-means. A big difference between the two is that DBSCAN only clusters the users that are in fact close to each other, while K-means always puts every user in some cluster.

K-means is a good choice when the data is divided into clear clusters, since you have to specify the number of clusters as input. In this case, however, most data is random, and the goal is to find a small percentage of users that are very alike. Choosing a small K-value results in big clusters where the users in a cluster are not necessarily close to each other, but the probability of clear differences between the clusters is high. Choosing a big K-value results in smaller clusters, but also more empty clusters and a higher risk of two users that are close to each other ending up in different clusters.

Figures 4.1a-c demonstrate the problems with K-means. It is obvious from the graphs that all trolls are close to each other in every parameter, but despite this, one of them ended up in a different cluster. Other users that are nowhere near the trolls have ended up in the same cluster as them.

DBSCAN is a more suitable algorithm for our purpose, since its aim is to find only the clusters of data that are very similar. In a spatial perspective, all the points are plotted in a 6-dimensional space and DBSCAN finds the clusters using a threshold distance, i.e. how far away a point may be and still be part of a cluster. In this manner all points that are close to one another become part of a cluster, and the outliers are excluded from the result.

Figures 4.2a-c show that DBSCAN clusters the points that are spatially close to each other and excludes the points that are outside the threshold distance. In the figures you can see four distinct clusters, where in some dimensions the clusters lie on top of each other according to some parameter. However, keep in mind that the figures are projections of the 6-dimensional parameter space onto a 2-dimensional plane.

As stated in section 3.3, we have scaled the values relative to each other in an attempt to make each parameter equally important in the clustering algorithms. Based on the figures in section 4 we believe that this has been successful. All figures show a clear division of colors, which indicates that the parameters in each figure have mattered in the algorithms. For all figures except 4.2b it is also clear that the parameters on the x- and y-axes are not the only ones affecting the outcome of the algorithms, which is good, since there are four other parameters affecting the results in each figure.

5.1 Future research

Even though the results have shown that clustering based on activity can be used for detecting troll networks, it should be used more as a tool in the search rather than as a complete solution. User activity is one attribute that users in a particular troll network can share, but analyzing the patterns of the tweets


and the contents of the tweets of these users is an important part of the search. Future research could go into more depth and analyze other clustering models, as well as find ways to determine the accuracy of this approach.

One thing to keep in mind is the actions taken by troll farms in response to research like this. It would not be difficult for a troll farm to change its ways, resulting in more random activity and rendering this approach useless. This is something to consider when trying to find ways to detect troll farms.

5.2 Method discussion

Overall the method has worked well, but there are always ways to improve it. As stated in the previous section, searching for troll networks based only on activity, without taking the contents of the tweets into account, is not a perfect solution. Other improvements would be to use a bigger dataset, as well as to implement and analyze more than two different algorithms.

The two algorithms that we chose are very different in nature and are therefore a good starting point when researching this area. The reason the above-stated improvements have not been implemented is a lack of time and resources. In spite of this restriction, the generated results have been more or less what we were expecting and hoping for.


6 Conclusion

In conclusion, this project has been more or less a success. In the search for trolls in the Twitter network we managed to find our own fabricated troll farm using only the daily activity and habits of the Twitter users. The results showed us that both algorithms work and could be used in the search for trolls in a social network; however, the DBSCAN algorithm worked better in this case compared to the K-means algorithm. DBSCAN worked better in terms of both execution and result: it needs only one run and gives a clean result without any outliers compromising the results. K-means, whilst it gives a promising result, needs to be run several times and is sensitive to outliers, which can give different results on every run.


References

[1] Brian Everitt and Torsten Hothorn. An Introduction to Applied Multivariate Analysis with R. Springer New York, 2011.

[2] Adrian Chen. The agency. The New York Times Magazine, 2015.

[3] Chire. Density-based clustering, 2011.

[4] Chire. K-means clustering, 2011.

[5] Danah M. Boyd and Nicole B. Ellison. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 2007.

[6] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. Data Mining and Knowledge Discovery Handbook. Springer New York, 2005.

[7] IFTTT Inc. IFTTT.

[8] Twitter Inc. Twitter REST APIs.

[9] Kevin Makice. Twitter API: Up and Running. O’Reilly Media, Inc, 2009.

[10] Metro. The viral eye - let us check before you share. Metro, 2014.

[11] Lior Rokach and Oded Maimon. Density-based clustering. John Wiley & Sons, Ltd, issue 3 edition, 2011.

[12] Arkanath Pathak. Machine learning, clustering, k-means.

[13] Christopher Peri. The Twitter API in 24 Hours. Pearson Education, Inc,2011.

[14] Brian Everitt, Sabine Landau, Morven Leese, and Daniel Stahl. Cluster Analysis, 5th Edition. John Wiley & Sons, Ltd, 2011.

[15] Saed Sayad. Hierarchical clustering.

[16] Jiwon Shin. Morality and internet behavior: A study of the internet troll and its relation with morality on the internet. Technical report, Teachers College, Columbia University, 2008.


7 Appendix

7.1 Twitter data


User  Average first of the day  Average last of the day  Average length of day in seconds  Deviation of length  Average amount of tweets per day  Deviation of amount
10000_KEKS 85395,3 119835 12666,3 11778,3 33,3 12,8

?_????_?_#BDS 93378,4 143997,6 18379,2 27938,4 2,2 1,2

?JoxuaLuxor? 103271 169900,5 24389,5 17832,5 49,5 0,5

@F0O0 96189,3 197053,7 36531 33163,1 30,3 21

aam 125558 189656 23258 19808,8 16,7 10,1

Ajete 48585 148233,5 35788,5 1261,5 19 2

Alex 68709 197963,6 46356,8 28794 14,1 23,9

Amanda_Rose 169232,6 200377,4 11480,8 12584,5 2,2 1

André_Kapsas 117987,5 131835 4707,5 3751,5 27 21

Andreas_B 123792,8 195298,9 25839,4 17603,2 5,6 4,4

anna_garrs 109407,7 120682 4354,3 2211,3 33,3 13,5

Austin_Cowan 91000,2 143560,8 19536,6 24261 4 2,2

Barb_Juarez 69275 123521,5 19346,5 18847,5 24,5 5,5

Bill_Kent 38084,5 163190,5 45066 27074 48,5 0,5

Bite 128868 155925,7 9831 8122,7 19,3 6,6

Bob_Critchley 49591,3 73162 8544 9127,4 33 14,8

CableAmerica 100841,9 108183,7 2725,8 8173,7 1,5 0,5

Campaign_USA_2016 12101 168601,5 56700,5 18900,5 25 3

Carly_Says 93442,5 144066 18363,5 748,5 47 1

Cathy_B 47332 142320 34188 28915 20,5 10,5

CFJ_---???---_#OiP 80907,3 119087,3 13300 16148,4 32,7 12,7

Chef 100159,7 215084,8 41489,6 28452 7,6 3,7

chole 107386 108622,5 736,5 736,5 4 3

ctmommy 66423,3 179790 40820 29367,3 16,7 12,3

CurrentResident 63078,5 166954 37115,5 2145,5 25 8

Daniel_Eberhardt 197171,7 200126,3 968 1236,4 2 0,8

Darrell_Palmer 67319,7 112099,7 15673,3 11066,9 32 12,7

David_Engelin 98420 141434,2 15407,5 23552,1 2,7 2,1

David_Konecny 116073,2 168189,8 18734,3 13690,4 3,2 2,1

DBI 83361 188404,4 37959 33185 5,1 3,4

dexter_künkel 40088,1 152460,8 40180,7 28082,6 7 5,5

different?dissident 64874,3 198602 48241 18313,8 8,2 4,1

DowHeater 117960 170867 19747 19747 25 24

Dr,_Magued_Refaat 100421 125312,5 9541,5 5800,3 3,8 1,1

ekopolitan 81446 180084,2 35718,2 22094,8 16,6 8,1

Eliana 82164 167029 30525 28992 8 6

FirstToMarkets 96544 124432 10128 7721,1 28,8 21,9

FotosFred 77968,5 133971 20402,5 8391,5 71 24

Frank_Knopers 140466,5 198397 20960,5 18588,2 6,8 4,8

freiden 92597,9 194530,8 36831,1 28342,2 8,6 5,3

gclicque 95672,7 179645,3 30372,7 14765,8 8,7 3,7

Gooner_1907 82842 178755,3 34340 29192,1 12 7,9

Ha?im10 194935,7 205676 4047 5723,3 1,7 0,9

Hannah_Donovan 121340,8 165540,2 15852,8 24626,5 6,4 9,9



Henrik_Nyh 101828,3 167483,2 23794,8 15441,3 4 2,4

HRZone 114038,7 184788,3 25652,4 7434,5 14,4 9,4

Igor 95724 98415,7 811,7 715,6 6 6,4

indya 161907,6 218180,8 20263,1 12726,3 5,6 4,1

J? 101245,3 180823,8 28545,2 24105,8 8,3 3,6

Jesse_Camacho 54239,5 158613,5 37414 8997 48,5 0,5

Johan_Nilsson 116398,8 184498,5 24859,7 20976,5 3,3 1,5

johan_åberg 111887,1 159751 17018,9 17960,7 2,5 1,6

John 94325 141954,2 17779,2 10146,1 23,5 16,5

josacar 134236 148691,8 5319,8 10232,6 2,8 1,7

Josephine 108113,8 184810,5 27610 22670,7 4,5 1,6

Josi_Bility 131976,3 193449,8 22486,8 13678,6 4,5 3,5

justice_pr_l'Egypte 91478,6 145358,8 18975,1 23624,6 3,5 1,5

k106 137772 160424,5 8192,5 8614,4 1,9 0,6

Kaffedrevet 59594,7 195318,6 49203,9 21448,1 21,3 9

Keep_it_in_Kent 77424,3 145770,3 24212,7 16228,5 34,7 22,4

KexBot2 85684,2 143354,8 20905,9 7802,6 4,4 1,6

Kex_Bot 88553 147081,8 20724,8 6610,5 4,6 1,6

Kipid 81061 123840 15107 10845,4 4 3,6

Konzepp 135075,3 167203,9 11734,3 14250,8 3,6 3,9

LaindonTweets 85742 118380,5 12378,5 63,5 54 6

Laura_Marshall 66346,2 139514,2 26498 20611,9 18 10,1

LinuxFera 121634 136128,3 5494,3 3992,1 33 22,6

Lorraine_Pascale 85240,5 157575,2 26004,8 33245,9 12 9,2

Macflu 76150,4 187327,9 40086,6 31998,7 5,6 4,8

MADmoiselleMim 100800,7 182173,5 29219,5 22626 8 4,8

Mailplane_Support 61244 92739,2 11095,2 19217,5 1,5 0,9

Maria_Höfl-Riesch 132104,2 150055,5 6371,3 7628 1,5 0,5

Marius_S,_Reichelt 93846,3 152446,7 21573,7 22760,3 4,3 1,2

Mark_BadAss_Trumpk 88563,3 118378,7 11495,3 6868,1 31,3 9,5

MarketUpdate 151595 198397,2 17282,2 16609,5 7,2 2,5

Markus_van_de_Wey 73130 162632 32102 18845 12 3

McStyleAvenue 73284,5 109828,5 13524 3462 48,5 1,5

Meysam_Doai 128641,5 167737,5 13856 423 4 1

Mick 114841 162483 16582 16582 16 15

mojave_rattler 106064,3 118474 5049,7 2965,5 32,7 14,6

newsonus 56562 133470,5 28148,5 5497,5 62,5 13,5

OrXan_ALiyev 136625 210968,3 27010 37986,7 6,7 6,6

Pedro_Moreira 117116 162781,3 16198,7 4840,8 2,3 0,5

PERU 57230,3 136618,6 28699,3 25193,9 4,2 1,5

Peter_? 109622,2 183864,7 26949,2 15613,3 8,2 4,3

PolitiTrends 58609,7 182094,9 44605,1 23650,7 4,4 1

Pramod_Mishra 118294,8 137522 7140,5 13409,9 3,7 2,7

ProCarCredit 91031 153598,3 22663,3 23224 7,1 7



Producers_Passion 58977 59120,5 83,5 10,5 45 1

RandomInternetPerso 142246,7 179781,7 13721,7 16907 16,7 7,5

Reiðhjólaverzlunin 174784,6 174784,6 0 0 1 0

roger 39694,3 110504,3 25690 23517 32,7 12,4

Ross_Mayfield 59534,7 191778 47797,6 32926,8 3,9 2

ruby_lewis 105909 114802,7 3733,7 4144,9 25 18,5

Sam_Shahidi 137355 181304,2 15969,2 28237,5 1,7 1,1

Samuel_Ryan 71579 114246 16007 3318 25 3

SanFranciscoForTrum 16670 59741 16251 10817 50 0

Sarah_Blackburn 103910,8 172447,7 24723,5 31384,1 3,3 2,7

Sgt_Lambert 58197 59980 1063 236 41,5 7,5

ShoutEssex 98224,5 118173,5 7169 1436 59,5 12,5

Sigge_Eklund 164691,5 174442,5 3451 5903,1 1,8 0,8

silvertejp 111318,4 198590,8 31248,4 25691,3 10 9,5

SJ_AB 112755,8 190289,5 27913,8 17201,8 26,2 16,1

SOFIA_KAREMYR 189518,5 198477,5 2979 2919 4 2

Steve_Jobs_Syria 107878,7 118273 4647,7 2668,5 23,7 14,2

stone3u66ha 74483,5 203266,5 46063 8328 34 2

SWISSUKRAINE,OR 148641,5 172637,5 8396 7664 19 14

tanja_stark 102947,7 132982 10822,9 26421,6 1,6 0,7

the_guts_of 68170,5 152911,2 30440,8 30889,3 2,5 1,7

TheAwkwardRepublic 48650 64245 5755 835 60,5 10,5

tia 34023,9 189199,6 56421,4 26769,6 7 2,7

Tribün_K?z? 143992 213134 24688,7 21740,6 6,7 3,4

trollbot3 88669,3 142254,5 19433,2 6815,2 4,8 1,7

trollbot4 85395,5 143410,3 20886,8 4879,5 4,4 1,6

TSB 108151,3 162299 19694,3 14221,7 20,3 19,5

Tugendfurie 68669,2 202410,8 48451,5 29754 14,2 10,8

User_95815 49954,8 117148 24481,2 17187,3 10 6,8

Valentina 76585 121271 16419,3 12691,7 33 12

Varyagi 41703,3 91729 18012,3 12929,6 20,7 13,9

Vinyl_Digest 79907,8 113109,8 12242 22443 20,6 34,4

Voice_of_Reason 145354 172558,2 9724,2 16654,9 12,5 16

Wild_Hits 111876,8 130564 7017,2 12154,2 25,8 42,9

WishVintage 139834 151930,5 3656,5 642,5 3,5 0,5

Worldbulletin 93899,3 146100,5 18921,2 9669,4 8,2 4,3

Wreck-It_Rolfe 109586 121445,7 4713 2074,1 33,3 13,5

Yunita_Elysabeth 106177 143777 14160 12929 27 23

ZebOzzy 71766,7 148519,7 27193 17592,7 14,7 10,2

Zuhal 135427 138994 1340,3 1895,5 1,7 0,9

Zuhal_M 117063 119035,4 700,4 1112,1 2 1,3

[M]lordDVD 108712,3 171796,3 22644 16923,6 16,7 15,1


7.2 K-means results - multiple runs


[K-means connection output: the three-column layout of the original PDF could not be recovered in this transcript. For each user, the output listed the users it was connected with across the multiple runs, together with a connection percentage; all listed connections are above 90%, and the fabricated troll accounts (KexBot2, Kex_Bot, trollbot3, trollbot4, Worldbulletin) recur in the connection lists.]


7.3 DBSCAN results

 

DBSCAN RESULTS
Threshold distance: 230.0
Minimal point limit: 4
Clusters found: 5, 4, 5, 4, 4

Cluster 0 includes:
?_????_?_#BDS [1167.2, 1800.0, 525.1, 1396.9, 55.0, 46.6]
Austin_Cowan [1137.5, 1794.5, 558.2, 1213.1, 100.0, 87.6]
David_Engelin [1230.2, 1767.9, 440.2, 1177.6, 66.7, 82.2]
justice_pr_l'Egypte [1143.5, 1817.0, 542.1, 1181.2, 87.5, 60.0]
Marius_S._Reichelt [1173.1, 1905.6, 616.4, 1138.0, 108.3, 49.9]

Cluster 1 includes:
J? [1265.6, 2260.3, 815.6, 1205.3, 208.3, 143.6]
Johan_Nilsson [1455.0, 2306.2, 710.3, 1048.8, 83.3, 59.6]
Josephine [1351.4, 2310.1, 788.9, 1133.5, 112.5, 64.3]
MADmoiselleMim [1260.0, 2277.2, 834.8, 1131.3, 200.0, 193.2]

Cluster 2 includes:
KexBot2 [1071.1, 1791.9, 597.3, 390.1, 108.8, 65.7]
Kex_Bot [1106.9, 1838.5, 592.1, 330.5, 115.0, 62.5]
Trollbot3 [1108.4, 1778.2, 555.2, 340.8, 120.0, 66.5]
Trollbot4 [1067.4, 1792.6, 596.8, 244.0, 110.0, 62.5]
Worldbulletin [1173.7, 1826.3, 540.6, 483.5, 204.2, 173.5]

Cluster 3 includes:
anna_garrs [1367.6, 1508.5, 124.4, 110.6, 833.3, 539.0]
Mojave_rattler [1325.8, 1480.9, 144.3, 148.3, 816.7, 584.5]
Steve_Jobs_Syria [1348.5, 1478.4, 132.8, 133.4, 591.7, 567.9]
Wreck-It_Rolfe [1369.8, 1518.1, 134.7, 103.7, 833.3, 539.0]

Cluster 4 includes:
10,000_KEKS [1067.4, 1497.9, 361.9, 588.9, 833.3, 510.5]
CFJ_---???---_#OiP [1011.3, 1488.6, 380.0, 807.4, 816.7, 508.4]
Darrell_Palmer [841.5, 1401.2, 447.8, 553.3, 800.0, 507.0]
Valentina [957.3, 1515.9, 469.1, 634.6, 825.0, 481.1]

