FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
TweeProfiles: detection of spatio-temporal patterns on Twitter
Tiago Daniel Sá Cunha
Mestrado Integrado em Engenharia Informática e Computação
Supervisor: Carlos Soares (PhD)
Co-Supervisor: Eduarda Mendes Rodrigues (PhD)
6th February, 2013
© Tiago Daniel Sá Cunha, 2013
Resumo
Online social networks present themselves as valuable sources of information about their users and their respective interests. Such information has been the subject of several studies, conducted by Data Mining researchers from all over the world, in order to discover user behaviours and patterns. In addition, there has also been investment in creating platforms for continuous information extraction and visualization.
This dissertation aims to identify tweet profiles involving multiple types of information, namely spatial, temporal, social and content. Each dimension is processed separately and aggregated into the final result through a weighting scheme.
The goals of TweeProfiles are achieved by adapting clustering algorithms and distance functions to mine the profiles and by presenting the obtained results in a platform developed for dynamic and intuitive use, which aims to reveal the discovered patterns in an understandable way.
The case study to which this dissertation will be applied is the Portuguese Twittosphere, although it will be developed to support any georeferenced tweets from Twitter.
Abstract
Online social networks present themselves as valuable information sources about their users and their respective interests. Such information has been the subject of many studies conducted by Data Mining scholars throughout the world in order to discover users' behaviours and patterns. In addition, there has also been investment in creating platforms for continuous information extraction and data visualization.
This dissertation aims to identify tweet profiles involving multiple types of data, namely spatial, temporal, social and content. Each dimension is computed separately and aggregated into the final result, considering a weighting scheme for each dimension.
The goals of TweeProfiles are achieved by adapting clustering algorithms and distance functions to mine the profiles, and by displaying the obtained results in a visualization platform designed for dynamic and intuitive usage, aimed at revealing the discovered patterns in an understandable way.
The case study on which it will be applied is the Portuguese Twittosphere, although it will be developed to use any geo-referenced tweets extracted from Twitter itself.
Contents
1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Objectives
  1.4 Document Structure

2 State of the art
  2.1 Twitter Overview
    2.1.1 General Description
    2.1.2 Twitter API
    2.1.3 TwitterEcho platform
  2.2 Data Mining
  2.3 Clustering
    2.3.1 Partitioning
    2.3.2 Hierarchical
    2.3.3 Density based
    2.3.4 Grid based
    2.3.5 Graph based
    2.3.6 Discussion
    2.3.7 Clustering Evaluation
  2.4 Distance Measures
    2.4.1 Spatial Distance Functions
    2.4.2 Temporal Distance Functions
    2.4.3 Social Distance Functions
    2.4.4 Content Distance Functions
    2.4.5 Mixed Distance Functions
  2.5 Spatio-Temporal Visualization
    2.5.1 Clustering Visualization
    2.5.2 Georeferenced Data Visualization
    2.5.3 Timestamped Data Visualization

3 Solution Perspective
  3.1 Solution Description
  3.2 Technologies
  3.3 Experimental setup, validation and evaluation
  3.4 Workplan

References
List of Figures
2.1 TwitterEcho Physical Architecture
2.2 Dissimilarity Matrix [Han06]
2.3 Clustering Visualization [LAR07]
2.4 Clustering Visualization [CSX08]
2.5 Map with event detection on Twitter [Lee12]
2.6 Modified Google Earth rule visualization tool [CM07]
2.7 Real-time heat maps of positive and negative sentiments expressed via Twitter [Fit12]
2.8 Time Graphic for event detection on Twitter [Lee12]
2.9 Timeline [RLW12]
3.1 Steps in the development process
3.2 Proposed solution architecture
3.3 Expected final tool
3.4 Current distribution of georeferenced tweets in TwitterEcho
3.5 Timeline for the second phase of the dissertation
List of Tables
2.1 Comparison of clustering algorithms (adapted from [LMS11])
3.1 Algorithms and distance functions for the proposed solution
Abbreviations
API Application Programming Interface
HDFS Hadoop Distributed File System
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IDF Inverse Document Frequency
JSON JavaScript Object Notation
REST Representational State Transfer
TFIDF Term Frequency - Inverse Document Frequency
Chapter 1
Introduction
1.1 Context
A social network is defined in social sciences as a social structure composed of a set of actors
and the ties between them [PMMR13]. More recently, it acquired a new meaning in information
science which is "a dedicated website or other application which enables users to communicate
with each other" [Dic]. More generally, online social networks present a variety of social media
services.
In recent years, social media services have gained great importance in social life and in companies' business strategies, since they "have been regarded as a timely and cost-effective source of spatio-temporal information" [LYCW11]. The massive adoption of these services and the number of platforms providing social interaction have led to a growth in the data stored within them. This data has been used by many researchers as a source from which to extract information [RLW12, LMS11, CM07].
Twitter has proven to be a popular data source within social media, due to its large number of active users and the easy access to its public API. As such, it has fuelled a number of studies [BO12, Bru, Cor, Gol10, AFK11].
TwitterEcho [BO12] is a research platform. It collects tweet and user data from the Portuguese Twittosphere and aims to support R&D and journalistic tools. These tools use Data Mining techniques to generate knowledge. Some of the main functionalities already implemented in this platform are text mining, opinion mining and social network analysis.
1.2 Motivation
This dissertation aims to explore Twitter's spatio-temporal component in detail and to adapt clustering algorithms and distance functions better suited to Twitter data, considering all of the following dimensions: spatial, temporal, social and content.
For the platform's end user, it aims to evaluate the profiles extracted from Twitter and to explore the patterns within Twitter's spatio-temporal context. Ultimately, it aims to help journalists
and researchers answer the where, when, what and who of a given story on Twitter.
1.3 Objectives
The scientific objective set for this dissertation is to create a spatio-temporal data analysis module for TwitterEcho, applying clustering algorithms to tweet and user data and using the best combination of clustering algorithms and distance measures for clustering according to the spatial, temporal, social and content dimensions.
While the spatial and temporal dimensions are the basis of this dissertation, social distances between users and content similarities between tweets must also be used to complete the similarity comparison. The visualization platform must use spatio-temporal visualization techniques and tools to better represent the information and, at the same time, provide tools to interact with the information presented.
The visualization tool aims to be interactive: it must give the user the ability to define a priority scheme for each of the previously defined dimensions and instantly update the visualized information, based on the information pre-computed by the clustering algorithms.
1.4 Document Structure
This document is organized as follows:
Chapter 2 contains the state of the art for the fields related to this project. We explain the TwitterEcho project in more detail, alongside the clustering algorithms and distance measures studied for each dimension in spatio-temporal Data Mining.
Chapter 3 presents the planned approach, the technologies to be used, the experimental setup and validation, and the work plan for the next six months.
Chapter 2
State of the art
2.1 Twitter Overview
This section provides a description of the Twitter¹ social media service and its API, followed by
an introduction to the TwitterEcho platform.
2.1.1 General Description
Twitter is a microblogging service that enables users to publish short messages (also known as
"tweets") with a maximum size of 140 characters.
Within Twitter, a social tie is defined by whether a user is following or being followed by other
users.
Each tweet has well-defined items in its structure, although most are not mandatory. Each item serves either to enhance social interaction or to complete the information related to the message in question. These functionalities are presented below:
• Retweet (RT) Share another user’s tweet [Twi13a].
• Mention (@ + username) Identify a user in a tweet [Twi13e].
• Reply (@ + username) Answer to a previous user tweet [Twi13e].
• Hashtag (# + topic name) Association of a keyword to a tweet [Twi13d].
• Localization User’s geo-coordinates when sending a tweet [Twi13b].
2.1.2 Twitter API
Twitter provides two APIs to access its information, namely the Streaming API and the REST API [Twi13c]. The REST API requires OAuth authentication and is request-based; the Streaming API, on the other hand, requires OAuth or HTTP basic authentication and provides information through events.

¹ https://twitter.com/
The Streaming API provides real-time data (where each tweet is flagged as an event), although the only data available for querying is the data collected by the Streaming API since the beginning of the session. The REST API, in contrast, allows access to past information, where the only limits are the availability of Twitter data and the rate limits of the methods and applications.
The Twitter REST API enables access to user information, timelines, friends & followers, direct messages, general search, streaming, Places & Geo, and trends, although limits are imposed on the number of requests allowed. The REST API defines a request window of 15 minutes, during which each user is allowed either 15 or 180 requests per method invoked. However, each application invoking this API is subject to a general limit of 120 requests per hour. The Streaming API has a connection policy rather than a request policy: limits are imposed on the volume of data transmitted per client per second. Public access does not allow receiving more than 50 tweets per second or 4,320,000 tweets per day.
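The request-window policy described above can be illustrated with a small client-side budget tracker. The sketch below is our own illustration (the class and method names are hypothetical, not part of the Twitter API); the 15-request, 15-minute figures mirror the REST API limits just described:

```python
import time
from collections import deque

class RateLimitWindow:
    """Client-side budget tracker for a sliding request window
    (hypothetical helper, mirroring the 15-requests / 15-minutes policy)."""

    def __init__(self, max_requests=15, window_seconds=15 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # send times of recent requests

    def _prune(self, now):
        # Drop requests that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def try_acquire(self, now=None):
        """Return True if a request may be sent now, recording it if so."""
        now = time.monotonic() if now is None else now
        self._prune(now)
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False

window = RateLimitWindow(max_requests=15, window_seconds=900)
allowed = sum(window.try_acquire(now=0.0) for _ in range(20))
print(allowed)  # only 15 of the 20 attempts fit in one window
```

Such a tracker only mirrors the server-side policy: the authoritative counters are kept by Twitter, so a client should still honour the rate-limit information returned by the API itself.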
2.1.3 TwitterEcho platform
The TwitterEcho project [BO12] is a research platform for extracting, storing and analysing the
Portuguese Twittosphere for R&D and journalistic purposes. Its current architecture is presented
in Figure 2.1.
TwitterEcho collects data using the Twitter API. The platform accesses the Twitter Streaming API to obtain real-time tweets through the crawler clients. These tweets are sent to a message broker (i.e., a data format translator program) and processed by two components: stream processing and pre-processing. The resulting data is stored in both Apache Solr¹ and MongoDB².
In order to ease access to the information in a simple and effective manner, message and user indexes were created using Apache Solr. This allows parallel exploration of the tweets, by text-searching tweets or users in Solr and retrieving all their information from Hadoop³.
After the information is stored in Hadoop, it is subjected to batch processing in order to mine
different kinds of knowledge. This knowledge is available through analysis modules which include
text mining, opinion mining, sentiment analysis and social network analysis.
¹ http://lucene.apache.org/solr/
² http://www.mongodb.org/
³ http://hadoop.apache.org/
[Figure 2.1 shows the TwitterEcho data collector: crawling clients (Java) follow topics, users, locations and the sample stream, feeding a message broker (Kafka); stream processing (Storm) computes trend topics, most mentioned users and most mentioned URLs; pre-processing performs URL unshortening, language detection, tokenization and geo-tagged tweet extraction; batch processing (Python/MapReduce) computes user interactions, aggregation & statistics, spam detection and bot detection; data is stored in Solr/MongoDB (searchable tweets & users), a Network DB (Neo4j) and a Geo DB.]

Figure 2.1: TwitterEcho Physical Architecture.
Among TwitterEcho’s databases, we highlight the GeoDB and the NetworkDB, which will be the main data sources available to address this dissertation’s problem. Although the complete system is presented in Figure 2.1, only the most important steps have been explained.
2.2 Data Mining
Data Mining is "the process of discovering interesting patterns from large amounts of data"
[Han06]. [Cur06] claims that "Data Mining is a multi-disciplinary field at the confluence of Statis-
tics, Computer Science, Machine Learning, Artificial Intelligence (AI), Database Technology, and
Pattern Recognition". The main tasks of Data Mining are:
• Characterization and discrimination: summarization of general characteristics or features.
• Mining of frequent patterns, associations and correlations: finding patterns that occur frequently in the data.
• Classification and regression: obtain a model that represents the data.
• Clustering analysis: groups objects into subgroups, so that similar objects are in the same subgroup and dissimilar objects are in different subgroups.
• Outlier analysis: find objects that are very different from the majority of other objects.
We will now present the specialized types of data mining expected to be used in this dissertation.
Temporal data mining is concerned with the analysis of events ordered by one or more dimensions
of time [RS99] with the objective of "inferring relationships of contextual and temporal proximity"
[RS02]. Within this category, there are two main approaches: discovery of causal relationships
and discovery of similar patterns within the time sequences (also known as time series analysis).
The main techniques used by these methods are curve approximation, noise reduction, time series comparison and prediction using mathematical models.
On the other hand, "Spatial data mining can be superficially considered as the multi-dimensional
equivalent of temporal data mining" [RS99].
Examples of approaches applied to this branch of Data Mining are association rules, clustering
and characterization.
Spatio-temporal Data Mining is subdivided into two approaches: embedding the temporal dimension into Spatial Data Mining and introducing the spatial dimension into Temporal Data Mining systems [RS99].
2.3 Clustering
Clustering is defined as "the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in
other clusters" [Han06]. Similarity is assessed through distance functions, which are described in Section 2.4.
Considering the proposed problem, clustering is the logical choice for extracting patterns from unlabelled data, such as the geo-reference and timestamp within each tweet. Furthermore, clustering groups similar objects together, which directly addresses the main objectives of this dissertation.
Another consideration is the need for a technique applicable not only to the spatial and temporal dimensions, but also to social and content similarities. Clustering therefore presents itself as the best suited technique.
In this section we present the most representative clustering algorithms for each type as well
as clustering evaluation methods.
2.3.1 Partitioning
There are four types of clustering methods for raw data: partitioning, hierarchical, density-based and grid-based. Other implementations have also been applied to graphs, which require a different approach, as we explain in this section.
Partitioning algorithms are known for generating mutually exclusive clusters of spherical shape, using distance-based techniques to group objects. They generally use the mean or a medoid to represent cluster centers and have proven effective up to medium-sized data sets [Han06]. A partitioning algorithm organizes the objects into partitions according to a particular criterion.
Within this set of partitioning algorithms, the best known are k-Means and k-medoids [Han06]. COD-CLARANS [THH01] is also introduced here, due to its capability of considering obstacles when performing clustering.
k-Means addresses the NP-hard problem of partitioning objects into clusters [Han06]. It defines a centroid as the mean value of the points in a cluster and assigns each object to the most
similar cluster, comparing the distance of each object to each cluster centroid. It employs an iterative approach to reduce the within-cluster variation, recalculating the means and re-assigning objects to more similar clusters in each iteration. The within-cluster variation used to assess cluster quality in each iteration is presented in Equation 2.1:
E = \sum_{i=1}^{K} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2 \qquad (2.1)
where p is an object of cluster C_i with centroid c_i. The algorithm stops when the clusters show no difference between two consecutive iterations, or when an imposed iteration limit is reached. The pseudo-code of this algorithm is presented in Algorithm 1.
Algorithm 1 k-Means
1: procedure K-MEANS(k : #clusters, D : dataset)
2: arbitrarily choose k objects from D as the initial cluster centers;
3: repeat
4: (re)assign each object to the cluster to which the object is the most similar;
5: update the cluster means;
6: until no change in clusters
7: return set of k clusters;
8: end procedure
Although it is a relatively scalable and efficient solution (its complexity is O(nkt), where n is the number of objects, k the number of clusters and t the number of iterations), it has disadvantages: the results are sensitive to the initial cluster centers and to outliers, and the expected number of clusters must be indicated in advance.
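A minimal sketch of Algorithm 1 for two-dimensional points, assuming Euclidean distance, may help make the iteration concrete (the function below is our own illustration, not TweeProfiles code):

```python
import random

def k_means(points, k, max_iter=100):
    """Minimal k-Means sketch following Algorithm 1 (illustrative only)."""
    centers = random.sample(points, k)  # arbitrarily choose k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # (re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # update the cluster means (keep the old center for an empty cluster)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # until no change in clusters
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(pts, 2)
```

With these six points, any choice of initial centers converges to the two obvious groups of three points each, illustrating the iterative refinement described above.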
k-medoids tries to solve one of the previous disadvantages: the sensitivity to outliers. It changes the k-Means processing by taking as cluster center an actual object (also known as a representative object) instead of the mean value of all points, allowing outliers to have less influence on the cluster shape. The partition is obtained in k-medoids by minimizing the sum of dissimilarities between each object p and the representative object o_i:
E = \sum_{i=1}^{K} \sum_{p \in C_i} \mathrm{dist}(p, o_i) \qquad (2.2)
This k-medoids concept was implemented by the PAM (Partitioning Around Medoids) algorithm [KR08]. First, the algorithm selects random objects (or seeds) as representative objects and, in the same manner as k-Means, iterates by switching cluster centers while the quality of the clustering is improvable [Han06]. The pseudo-code of the PAM algorithm is presented in Algorithm 2:
Algorithm 2 PAM
1: procedure PAM(k : #clusters, D : dataset)
2: arbitrarily choose k objects from D as the initial representative objects or seeds;
3: repeat
4: assign each remaining object to the cluster with the nearest representative object;
5: randomly select a non-representative object: o_random;
6: compute the total cost, S, of swapping representative object o_j with o_random;
7: if S < 0 then swap o_j with o_random to form a new set of k representative objects;
8: until no change in clusters
9: return set of k clusters
10: end procedure
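The swap step of Algorithm 2 can be sketched as follows (our own illustration; function names are hypothetical, and a full PAM implementation would repeat this step until no improving swap exists). The cost function is the sum of dissimilarities of Equation 2.2:

```python
def total_cost(points, medoids, dist):
    """Sum of dissimilarities of each point to its nearest medoid (Eq. 2.2)."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_step(points, medoids, dist):
    """One PAM iteration: evaluate every (medoid, non-medoid) swap and keep
    the best one if it lowers the total cost; illustrative sketch only."""
    best = list(medoids)
    best_cost = total_cost(points, best, dist)
    for m in medoids:
        for o in points:
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            cost = total_cost(points, candidate, dist)
            if cost < best_cost:  # S < 0: this swap improves the clustering
                best, best_cost = candidate, cost
    return best, best_cost

# One-dimensional example with absolute difference as the dissimilarity:
best, cost = pam_swap_step([0, 1, 2, 10, 11, 12], [0, 1], lambda a, b: abs(a - b))
print(best, cost)  # a single swap already reaches medoids {1, 11} with cost 4
```

Note that the chosen centers remain actual data objects throughout, which is exactly what makes the method less sensitive to outliers than k-Means.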
Although PAM indeed reduces the impact of outliers on the shape of the clusters, enabling better results, it presents a higher complexity (O(k(n−k)²)) and is therefore indicated only for small data sets. In order to overcome this scalability problem, a new approach was introduced with the CLARA (Clustering LARge Applications) algorithm [KR08]. It resorts to a sampling technique, clustering a small sample instead of the whole data set, and applies the PAM algorithm to that sample [Han06]. It assumes that the distribution of the sample is the same as that of the set it was drawn from. However, CLARA's effectiveness depends on the sample chosen. It is therefore a simple solution for clustering large data sets, but far from a perfect one.
CLARANS (Clustering LARge Applications based upon RANdomized Search) [Ng] was created, based on CLARA, in order to improve its scalability and clustering quality [Han06]. Not only does it sample the data set, it also performs a random search among the points in the data set for a better medoid: if the absolute-error criterion is improvable, it changes the representative object and continues the process.
CLARANS guarantees a local optimum when applied to large data sets. COD-CLARANS (Clustering with Obstructed Distance) [THH01] is a variation of CLARANS that conserves its advantages but was designed for a specific purpose: clustering in the presence of obstacles. This technique may be useful in this dissertation if we consider that in spatial clustering there are spatial frontiers that need to be taken into account; for instance, regions or districts can be considered frontiers.
COD-CLARANS consists of three main parts: the main algorithm, the squared-error function E and the pruning function E′. The pruning function avoids the computation of E by pruning the search, and provides focusing information when E cannot be computed, in order to improve efficiency.
In a pre-processing stage, COD-CLARANS creates a BSP (Binary Space Partitioning) tree, which determines whether any two points are visible to each other, i.e., whether there is no obstacle between them in a straight line. This information is converted into a visibility graph to improve the efficiency of the algorithm.
After determining visibility, it invokes the CLARANS methodology for clustering. It improves efficiency through the application of a technique called micro-clustering. A micro-cluster is a "compressed representation of a group of points which are so close together that they are likely to belong to the same cluster" [THH01]. Therefore, COD-CLARANS includes the representative object of each micro-cluster in the sample that CLARANS uses for its initial computation, instead of letting the algorithm choose random objects as seeds.
Although micro-clustering has a small effect on cluster quality [THH01], this algorithm is well suited for large data sets and solves the obstacle problem effectively.
With this algorithm we conclude the most relevant partitioning algorithms, and we now introduce the hierarchical clustering algorithms.
2.3.2 Hierarchical
"A hierarchical clustering method works by grouping data objects into a hierarchy or "tree" of clusters" [Han06]. This method can be either agglomerative (it starts with small clusters and recursively merges them until a single final cluster is reached) or divisive (all objects start in a single cluster, which is iteratively divided until each final cluster contains only one object or only very similar objects).
Usually, the results of hierarchical algorithms are represented by a dendrogram (i.e., a tree diagram), which separates the similarity of objects into levels and represents the connections between clusters by drawing lines from the root to the leaves.
We start by introducing the BIRCH algorithm [?]. BIRCH introduces the concept of a Clustering Feature (CF), used to summarize a cluster. It is a three-dimensional vector defined by:
CF = \langle n, LS, SS \rangle \qquad (2.3)
where n is the number of points, LS is the linear sum of the points and SS is the square sum of the data points. The CF enables the computation of a cluster's centroid, radius and diameter for later processing.
This data structure is then used in a CF-tree, whose objective is to represent the cluster hierarchy and to use the previous formulae to ensure the tightness of each cluster. Since CFs satisfy the additive property, agglomerating two clusters amounts to summing the components of their CFs. This is the key to the space efficiency of BIRCH.
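The additive property can be demonstrated with a short one-dimensional sketch (our own illustration; the function names are hypothetical). Merging two clusters reduces to a component-wise sum of their CF vectors, from which the centroid and radius are recovered without revisiting the points:

```python
def cf(points):
    """Clustering Feature <n, LS, SS> of a set of 1-D points (Eq. 2.3)."""
    n = len(points)
    ls = sum(points)                 # linear sum of the points
    ss = sum(p * p for p in points)  # square sum of the points
    return (n, ls, ss)

def cf_merge(a, b):
    """Additive property: the CF of a merged cluster is the component-wise sum."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def cf_centroid(c):
    n, ls, _ = c
    return ls / n

def cf_radius(c):
    """Average distance of the points to the centroid, from the CF alone."""
    n, ls, ss = c
    return (ss / n - (ls / n) ** 2) ** 0.5

merged = cf_merge(cf([1, 2, 3]), cf([4, 5]))
print(merged == cf([1, 2, 3, 4, 5]))  # True: summing CFs merges the clusters
```

This is precisely why a CF-tree can agglomerate sub-clusters in constant space per node, instead of storing the underlying points.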
BIRCH builds an initial tree from the data set, where each CF is inserted into the closest leaf. These leaves are then provided to the clustering algorithm, in order to group dense clusters into larger ones.
Although BIRCH has a computational complexity of O(n), each CF-tree has a limited size, which translates into clusterings with less resemblance to what the user may consider natural clusters. Also, BIRCH does not perform well for non-spherical clusters, since it uses the previous radius and diameter formulae to organize the clustering.
Chameleon [KH] is another agglomerative hierarchical algorithm, which uses dynamic modelling to determine the similarity between two clusters. This technique is based on two similarity concepts: the relative interconnectivity (RI) and the relative closeness (RC) of clusters.
The Chameleon algorithm builds a k-nearest-neighbour graph in which each edge is weighted to measure similarity and vertices are connected if they are among the k most similar objects. The graph is then subjected to a graph partitioning algorithm that generates smaller clusters by minimizing edge cuts. Lastly, an agglomerative hierarchical algorithm merges the sub-clusters to output the final clustering.
Chameleon can adapt itself to the cluster characteristics and therefore discover arbitrarily shaped clusters. It is also applicable to all data types, demanding only a suitable similarity function. However, it presents a complexity of O(n²).
Probabilistic clustering algorithms [Han06] use probabilistic models to measure the distances between objects in a data set. Well-known probability distributions, such as the Gaussian or Bernoulli distributions, are used to represent the data set and to compute the hierarchy of clusters.
These algorithms have advantages such as the ability to handle partially observable data and a complexity similar to that of agglomerative algorithms. However, they output only one hierarchy per distribution and are therefore less faithful to the real data.
This concludes the hierarchical clustering algorithms, although there are many more, including variations of those presented above. We now look more closely at the density-based algorithms.
2.3.3 Density based
Density-based clustering algorithms follow the strategy of modelling clusters as "dense regions in the data space, separated by sparse regions" [Han06]. These algorithms are therefore very suitable for finding non-spherically shaped clusters.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] finds core objects (i.e., points with dense neighbourhoods) and iteratively connects them to their neighbours if these are in the core object's ε-neighbourhood.
The ε-neighbourhood is defined through a user-supplied parameter, the radius ε: a point is in a core object's ε-neighbourhood if it is within this radius. Therefore, for two points p and q, we say that p is directly density-reachable from q if p is in the ε-neighbourhood of q.
Another user input is MinPts, which determines whether a point is a core object: if the ε-neighbourhood contains at least MinPts points, then we are in the presence of a core object.
Algorithm 3 takes the two previous concepts into account and iteratively connects core objects to their ε-neighbourhoods until all objects are processed.
Algorithm 3 DBSCAN
1: procedure DBSCAN(MinPts : neighborhood_threshold, D : dataset, ε : radius_parameter)
2: Mark all objects as unvisited;
3: do {
4: Randomly select an unvisited object p;
5: Mark p as visited;
6: if the ε-neighborhood of p has at least MinPts objects {
7: Create a new cluster C and add p to C;
8: Let N be the set of objects in the ε-neighborhood of p;
9: for each point p’ in N {
10: if p’ is unvisited {
11: Mark p’ as visited;
12: if the ε-neighborhood of p’ has at least MinPts points;
13: Add those points to N;
14: }
15: if p’ is not yet a member of any cluster
16: add p’ to C;
17: }
18: Output C;
19: }
20: else mark p as noise;
21: } until no object is unvisited;
22: end procedure
DBSCAN has a complexity of O(n²), but effectively finds non-spherically shaped clusters.
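A compact sketch of Algorithm 3 for one-dimensional points follows (our own illustration; it scans the points in list order rather than randomly, and a production implementation would use a spatial index to avoid the O(n²) neighbourhood scans):

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch following Algorithm 3 (illustrative only).
    Returns the list of clusters and the set of noise points."""
    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]

    visited, clustered, clusters, noise = set(), set(), [], set()
    for p in points:
        if p in visited:
            continue
        visited.add(p)
        n = neighbours(p)
        if len(n) < min_pts:          # p is not a core object
            noise.add(p)
            continue
        cluster = {p}                 # create a new cluster C and add p
        clustered.add(p)
        queue = list(n)
        while queue:                  # expand through density-reachable points
            q = queue.pop()
            if q not in visited:
                visited.add(q)
                qn = neighbours(q)
                if len(qn) >= min_pts:   # q is itself a core object
                    queue.extend(qn)
            if q not in clustered:
                cluster.add(q)
                clustered.add(q)
                noise.discard(q)      # a former noise point becomes a border point
        clusters.append(cluster)
    return clusters, noise

pts = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0, 20.0]
clusters, noise = dbscan(pts, eps=1.0, min_pts=3, dist=lambda a, b: abs(a - b))
```

With these points, the two dense groups around 0 and 5.5 are found as clusters, while the isolated point 20.0 is marked as noise, illustrating how the method separates dense regions from sparse ones.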
OPTICS [AMJ99] is a variation of DBSCAN whose purpose is to remove the need for user-defined global parameters. Instead, these are updated automatically in each iteration, in order to better adapt to the data characteristics. However, OPTICS does not output a clustering, but a cluster ordering, i.e., an ordered sequence of the database items according to the computed density.
OPTICS introduces two concepts: core-distance and reachability-distance. The core-distance is the smallest ε′ such that the ε′-neighbourhood contains at least MinPts points. The reachability-distance is the minimum radius that makes two points density-reachable, i.e., such that the distance between the two points is within the radius defined by the core-distance.
OPTICS starts by ordering all objects and computing their core-distances and reachability-distances. With these values, it assigns objects to their neighbours, as DBSCAN does. However, at each iteration these values are updated automatically, so the radius can effectively differ for each core object, which allows a better fit to each sub-cluster.
Since the algorithm is based on DBSCAN, it has the same time complexity: O(n log n) if a spatial index is used, and O(n²) otherwise.
DENCLUE (DENsity-based CLUstEring) [HK] is a clustering algorithm based on a set of distribution functions. It makes use of a non-parametric density estimation approach called kernel density estimation [Han06]:

f̂h(x) = (1/(nh)) ∑_{i=1}^{n} K((x − xi)/h)  (2.4)
The kernel used in DENCLUE is generally a Gaussian kernel. This estimate enables the definition of density attractors: points located at the local maxima of the density function. These density attractors are then filtered through a threshold to find the centres of clusters.
DENCLUE defines a cluster as a set of density attractors with objects assigned to them, possibly including other density attractors to form the complete cluster.
This algorithm can find arbitrarily shaped clusters and is robust to noise in the data, since the noise is spread throughout the density distribution.
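The kernel density estimate of equation (2.4) can be sketched in a few lines of Python with a Gaussian kernel (an illustrative sketch for one-dimensional data; DENCLUE additionally performs hill climbing on this estimate to locate the density attractors):

```python
from math import exp, pi, sqrt

def gaussian_kernel(u):
    # Standard Gaussian kernel K(u)
    return exp(-u * u / 2) / sqrt(2 * pi)

def kernel_density(x, sample, h):
    """Estimate f_hat_h(x) from equation (2.4); h is the bandwidth."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)
```

The estimated density is high near concentrations of sample points and decays away from them, which is what makes the local maxima usable as cluster centres.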
2.3.4 Grid based
Grid-based algorithms use a space-driven approach instead of the data-driven approach of the previous algorithms [Han06]. They partition the space into the cells of a multi-resolution grid data structure. This ensures a fast processing time that is independent of the number of objects in the data set, although it is affected by the resolution of the grid.
STING (STatistical INformation Grid) [WM97] is a grid-based multi-resolution technique that splits the data space into rectangular cells at each level. Each higher-level cell is decomposed into smaller cells at the level below: the lower the level, the higher the resolution. At each level, statistical measures are computed and saved for future query processing.
[Han06] states that "the distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process". Clusters can thus be found by querying the hierarchical structure top-down: the query descends through each level until it reaches the lowest one and returns the cells relevant to the specified query. The pseudo-code of STING is presented in algorithm 4.
Algorithm 4 STING
1: procedure STING(layered_hierarchical_structure)
2: Determine a layer to begin with;
3: For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query;
4: From the interval calculated above, label the cell as relevant or not relevant;
5: If this layer is the bottom layer, go to Step (7); otherwise go to Step (6);
6: Go down the hierarchy by one level. Go to Step (3) for those cells that form the relevant cells of the higher-level layer;
7: If the specification of the query is met, go to Step (9); otherwise go to Step (8);
8: Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step (10);
9: Find the regions of relevant cells. Return those regions that meet the requirements of the query. Go to Step (10);
10: Stop.
11: end procedure
This algorithm presents many advantages: the grid structure is query-independent and enables parallel processing and incremental updates, which allows it to scale to larger problems. Moreover, its complexity is O(n) both for creating the hierarchical grid structure and for answering a query.
However, cluster quality depends directly on the number of levels in the structure, and STING can only produce isothetic clusterings, i.e. cluster boundaries are either vertical or horizontal, never diagonal, which directly affects the shapes of the clusters produced.
CLIQUE [GGRJ] is a simple grid-based method for finding density-based clusters in subspaces of the data.
Initially, it partitions each dimension into non-overlapping intervals (i.e. the cells of the grid) and, using a density threshold, classifies each cell as dense or sparse. In the second phase, adjacent dense cells are connected and the clusters assembled.
CLIQUE is insensitive to the order in which objects are presented and does not presume any specific distribution of the data. And although it provides good scalability, clustering quality depends on the grid size, since a lower resolution will introduce error into the final clustering result.
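The two phases just described can be illustrated for the 2-D case with the following Python sketch (the `cell_size` and `threshold` parameters are illustrative choices; CLIQUE proper additionally explores all subspaces of the data):

```python
from collections import Counter

def dense_cells(points, cell_size, threshold):
    """Phase 1: partition 2-D space into a grid and keep cells with >= threshold points."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= threshold}

def connect_cells(cells):
    """Phase 2: group axis-adjacent dense cells into clusters (flood fill)."""
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in comp:
                continue
            comp.add((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in cells and nb not in comp:
                    stack.append(nb)
        seen |= comp
        clusters.append(comp)
    return clusters
```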
2.3.5 Graph based
Since a cluster can be seen as a graph with similar objects connected to each others and typically
closer to a cluster centroid, graph clustering methods have the advantage of already having a well
defined structure (already has connected objects) for clustering, instead of raw data. According to
[Han06], there are two types of graph clustering methods: generic clustering methods for high-
dimensional data and (calculate the similarity matrix of the objects and apply a general clustering
algorithm like the ones seen before) or graph-driven approaches.
A well defined algorithm for graph clustering is SCAN [Han06], which is based on DBSCAN.
Algorithm 5 SCAN
1: procedure SCAN(G : graph, ε : similarity_threshold, µ : population_threshold)
2: Set all vertices as unlabeled;
3: forall unlabeled vertices u {
4: if (u is a core) {
5: Generate a new cluster-id c;
6: Insert all v ∈ Nε(u) into a queue Q;
7: while Q ≠ ∅ {
8: w ← the first vertex in Q;
9: R ← the set of vertices that can be directly reached from w;
10: forall s ∈ R {
11: if (s is unlabeled or labeled as nonmember) {
12: assign the current cluster-id c to s;
13: }
14: if (s is unlabeled) {
15: insert s into queue Q;
16: }}
17: remove w from Q;
18: }}
19: else
20: label u as nonmember;
21: }
22: forall vertices u labeled as nonmember {
23: if ∃x,y ∈ Γ(u) : x and y have different cluster-ids
24: label u as hub;
25: else
26: label u as outlier;
27: }
28: return set of clusters;
29: end procedure
It uses a similarity measure called structural-context similarity σ for two vertices u and v:

σ(u, v) = |Γ(u) ∩ Γ(v)| / √(|Γ(u)| |Γ(v)|)  (2.5)

where Γ(u) = {v | (u,v) ∈ E} ∪ {u} in a graph G = (V,E).
Given a similarity threshold ε, the ε-neighbourhood of a vertex u is defined by:

Nε(u) = {v ∈ Γ(u) | σ(u,v) ≥ ε}  (2.6)
which means that a vertex's ε-neighbourhood contains all neighbouring vertices whose structural-context similarity is at least the defined threshold. In SCAN, a core vertex must satisfy |Nε(u)| ≥ µ, where µ is a population threshold that defines the minimum neighbourhood size for a vertex to be a core vertex.
These core vertices are the basis of SCAN (algorithm 5): in each iteration, they are connected to the other vertices within their ε-neighbourhood until no vertices remain to be processed, or until an outlier is detected and signalled as such.
With a time complexity of O(n), SCAN is expected to provide good scalability for large
graphs.
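Equations (2.5) and (2.6) can be sketched in Python as follows, assuming the graph is given as a dictionary mapping each vertex to its set of neighbours (an illustrative sketch, not part of SCAN's published implementation):

```python
from math import sqrt

def structural_similarity(graph, u, v):
    """sigma(u, v) from equation (2.5); graph maps vertex -> set of neighbours."""
    gu = graph[u] | {u}          # Gamma(u): the neighbours of u plus u itself
    gv = graph[v] | {v}
    return len(gu & gv) / sqrt(len(gu) * len(gv))

def eps_neighbourhood(graph, u, eps):
    """N_eps(u) from equation (2.6): neighbours structurally similar enough to u."""
    return {v for v in graph[u] if structural_similarity(graph, u, v) >= eps}
```

A core vertex is then simply one whose ε-neighbourhood has at least µ members.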
Similarity and dissimilarity matrices can also be used to perform clustering. [Han06] defines a dissimilarity matrix as a "structure that stores a collection of proximities that are available for all pairs of n objects". These proximities are calculated using the formulae in the next section.
A dissimilarity matrix can be represented as an n-by-n table, n being the number of elements in the data set.
0
d(2,1)   0
d(3,1)   d(3,2)   0
  ⋮        ⋮                ⋱
d(n,1)   d(n,2)   ⋯        0

Figure 2.2: Dissimilarity Matrix [Han06].
2.3.6 Discussion
[LMS11] considered many of the previously explained algorithms and condensed the information about their characteristics in table 2.1.
Algorithm    Input number   Pair-wise distance   Mandatory       No outlier
             of clusters    computation          space-mapping   detection
k-Means      Y              N                    N               Y
CLARANS      Y              Y                    N               Y
BIRCH        N              N                    Y               Y
CURE         N              Y                    N               N
Chameleon    N              Y                    N               N
DBSCAN       N              Y                    N               N
OPTICS       N              Y                    N               N
STING        N              N                    Y               Y
CLIQUE       N              N                    Y               Y

Table 2.1: Comparison of clustering algorithms (adapted from [LMS11]).
2.3.7 Clustering Evaluation
Clustering evaluation is the assessment of three components: clustering tendency, the number of clusters in a data set and clustering quality [Han06].
Clustering tendency checks that the data set has a non-random structure, i.e., that it does not follow a uniform distribution. This is important because, on uniformly distributed data, even a completed clustering produces meaningless results. To compute clustering tendency, we can use the Hopkins statistic:
H = ∑_{i=1}^{n} yi / (∑_{i=1}^{n} xi + ∑_{i=1}^{n} yi)  (2.7)

where xi = min_{v∈D}{dist(pi, v)} and yi = min_{v∈D, v≠qi}{dist(qi, v)}, for points pi sampled uniformly from the data space and points qi sampled from the data set D. The statistic thus compares the nearest-neighbour distances of random points with those of actual data points. If the resulting coefficient is near zero, then the data set is not uniformly distributed.
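A possible Python sketch of the Hopkins statistic for 2-D points follows; the sampling scheme (uniform points pi over the bounding box, qi drawn from the data itself) is one common instantiation, and the sample size m is an illustrative choice:

```python
import random
from math import dist

def hopkins(data, m, seed=0):
    """Hopkins statistic (equation 2.7) for a list of 2-D points."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in data], [p[1] for p in data]
    x_sum = y_sum = 0.0
    for _ in range(m):
        # p_i: uniform sample over the bounding box of the data
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        x_sum += min(dist(p, v) for v in data)                    # x_i
        # q_i: sample drawn from the data set itself
        q = rng.choice(data)
        y_sum += min(dist(q, v) for v in data if v != q)          # y_i
    return y_sum / (x_sum + y_sum)
```

On strongly clustered data the yi terms are tiny compared with the xi terms, pushing H towards zero; on uniform data H tends towards 0.5.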
The number of clusters is the second evaluation step, which acknowledges that the number of clusters is not always known a priori. Three methods can be used to compute this value in a principled manner.
The first is a simplistic rule of thumb: it sets the number of clusters to √(n/2), so that each cluster contains on average about √(2n) points, where n is the number of objects in the data set.
The second method is the elbow method. First, the number of clusters is varied systematically and the within-cluster variation is computed for each execution. Then, in the analysis of the resulting curve, the "elbow" (the point where the decrease in within-cluster variation first flattens markedly) is chosen as the optimal number of clusters to provide to the final clustering algorithm. The within-cluster variation of a clustering into k clusters Ci with centroids ci is given by:

E = ∑_{i=1}^{k} ∑_{p∈Ci} dist(p, ci)²  (2.8)
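The within-cluster variation of equation (2.8) can be computed as below for 2-D points (an illustrative sketch; the elbow method would evaluate this for clusterings obtained with different numbers of clusters):

```python
from math import dist

def within_cluster_variation(clusters):
    """E from equation (2.8): sum of squared distances to each cluster's centroid.

    clusters: list of clusters, each a non-empty list of (x, y) points.
    """
    total = 0.0
    for cluster in clusters:
        cx = sum(p[0] for p in cluster) / len(cluster)   # centroid x
        cy = sum(p[1] for p in cluster) / len(cluster)   # centroid y
        total += sum(dist(p, (cx, cy)) ** 2 for p in cluster)
    return total
```

Splitting two well-separated groups into two clusters yields a much lower E than forcing them into one, which is the drop the elbow analysis looks for.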
The final method is cross-validation. It divides the data set into m parts and clusters m − 1 of them; the remaining part is used as a test set. With this test set, the distances from each point to the respective centroid are computed and aggregated into a single coefficient. By repeating this for different numbers of clusters, the coefficient can be optimized and the best value selected.
The final step in clustering evaluation is to assess cluster quality. This is subdivided into two types of methods, depending on whether the ground truth (i.e., the perfect clustering result) is known. If the ground truth cannot be known, intrinsic methods should be used; otherwise, extrinsic methods are available.
The extrinsic methods check four criteria: cluster homogeneity, cluster completeness, rag bag (for heterogeneous objects) and small cluster preservation. A well-known extrinsic method is BCubed precision, which measures how many objects in the same cluster belong to the same category as the object being tested. A companion metric, BCubed recall, measures how many objects of the same category are found in the same cluster.
With intrinsic methods, cluster compactness and separation are evaluated, mainly using the silhouette coefficient:

s(o) = (b(o) − a(o)) / max{a(o), b(o)}  (2.9)

where a(o) measures the compactness of the object's cluster (the average distance from o to the other objects in its cluster) and b(o) its degree of separation (the minimum average distance from o to the objects of the other clusters). The coefficient for a clustering is computed as the average silhouette of all objects: the closer it is to 1, the better the clusters satisfy both compactness and separation.
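For a single object, the silhouette of equation (2.9) can be sketched as follows, assuming 2-D points (an illustrative sketch; a(o) averages the distances inside the object's own cluster, b(o) takes the smallest average distance to any other cluster):

```python
from math import dist

def silhouette(point, own_cluster, other_clusters):
    """s(o) from equation (2.9) for one object.

    own_cluster must contain point plus at least one other object;
    other_clusters is a list of the remaining clusters.
    """
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)              # compactness
    b = min(sum(dist(point, p) for p in c) / len(c)                    # separation
            for c in other_clusters)
    return (b - a) / max(a, b)
```

Averaging this value over all objects gives the overall silhouette coefficient of the clustering.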
2.4 Distance Measures
Clustering algorithms, as we have seen in Section 2.3, need distance functions in order to calculate distances between objects and group them by similarity. [Han06] states that "the objective function aims for high intra-cluster similarity and low inter-cluster similarity".
2.4.1 Spatial Distance Functions
In this context, the spatial dimension is defined by the latitude and longitude values extracted from tweets. Therefore, similarity functions between numeric values must be explored. According to [Han06], the four most important distances of this type are the Euclidean, Manhattan, Minkowski and Mahalanobis distances. [AHSV03] also defines the Chebychev distance. Since weighted distances are very useful for assigning different importances to different components, the weighted Euclidean distance is given as an example, although weighting can be applied to other distance functions as well.
Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p attributes. The Euclidean distance is defined by:

d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²)  (2.10)
While the Euclidean distance is known as the straight-line distance, the Manhattan distance invokes the city-block paradigm, which defines the distance between two points as the sum of their horizontal and vertical distances. Considering the same objects i and j, the Manhattan distance is:

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|  (2.11)
The Minkowski distance is a generalization of both the Euclidean and the Manhattan distances:

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)  (2.12)
17
State of the art
It introduces a real number h, where h ≥ 1. When h = 1 we obtain the Manhattan distance and when h = 2 the Euclidean distance. For h → ∞, we obtain the Chebychev distance (also known as the supremum distance).
d(i, j) = lim_{h→∞} (∑_{f=1}^{p} |xif − xjf|^h)^(1/h) = max_f |xif − xjf|  (2.13)

This distance gives the maximum difference between the two objects i and j over all attributes f.
When the attributes have different importances, a weighting system can be applied. The weighted Euclidean distance is defined as follows:

d(i, j) = √(w1(xi1 − xj1)² + w2(xi2 − xj2)² + ... + wp(xip − xjp)²)  (2.14)
The Mahalanobis distance, although not as popular as the previous ones, has also long been used in clustering techniques. It incorporates the covariance matrix V of the distribution of the objects, which captures how the attributes vary together.

d_M(x, y) = √((x − y) V⁻¹ (x − y)ᵀ)  (2.15)
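The Minkowski family of distances (equations 2.12–2.14) can be written compactly in Python; these are illustrative one-liners over p-dimensional tuples:

```python
def minkowski(i, j, h):
    """Equation (2.12); h = 1 gives the Manhattan distance, h = 2 the Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def chebyshev(i, j):
    """Equation (2.13): the limit of the Minkowski distance as h -> infinity."""
    return max(abs(a - b) for a, b in zip(i, j))

def weighted_euclidean(i, j, w):
    """Equation (2.14): Euclidean distance with one weight per attribute."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, i, j)) ** 0.5
```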
2.4.2 Temporal Distance Functions
As far as the temporal dimension is concerned, contrary to the previous case where distances are measured in R², time is represented in R, which simplifies the calculation. The time interval can be defined by the following formula:

d(ti, tj) = |ti − tj|  (2.16)

However, any of the previous distance functions is also applicable, since in this dimension the objects are likewise represented by numerical values.
2.4.3 Social Distance Functions
Considering the connections between users stored in TwitterEcho, it is possible to assume the implicit existence of a social graph. Therefore, the social distance reduces to a distance between nodes of a graph. [Han06] defines two distance measures for graphs: geodesic distance and SimRank.
The geodesic distance is the shortest path between two vertices, i.e., simply the minimum number of edges between them.
SimRank stands for similarity based on random walk and structural context. Under this measure, two vertices are similar if they are connected to common vertices. In order to calculate the similarity, we need to introduce the concept of individual in-neighbourhood. Considering
a directed graph G = (V,E), where V defines a set of vertices and the set of edges is E ⊆ V ×V ,
the individual in-neighbourhood of a vertex v is defined as:
I(v) = {u|(u,v) ∈ E} (2.17)
The SimRank similarity is defined, for any two vertices u and v of the graph G, as:

s(u, v) = 0,  if I(u) = ∅ ∨ I(v) = ∅
s(u, v) = (C / (|I(u)| |I(v)|)) ∑_{x∈I(u)} ∑_{y∈I(v)} s(x, y),  if I(u) ≠ ∅ ∧ I(v) ≠ ∅  (2.18)

where C is a constant between 0 and 1.
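Equation (2.18) is recursive and is usually solved by fixed-point iteration. A possible Python sketch follows, where the graph maps each vertex to its in-neighbourhood I(v); the decay constant C and the iteration count are illustrative choices, not values prescribed by the definition:

```python
def simrank(graph, c=0.8, iterations=10):
    """Iterative SimRank (equation 2.18); graph maps vertex -> set of in-neighbours."""
    nodes = list(graph)
    # Start from the base case: every vertex is fully similar to itself.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not graph[u] or not graph[v]:
                    new[u][v] = 0.0        # empty in-neighbourhood case
                else:
                    total = sum(sim[x][y] for x in graph[u] for y in graph[v])
                    new[u][v] = c * total / (len(graph[u]) * len(graph[v]))
        sim = new
    return sim
```

Two vertices pointed to by the same vertex thus receive similarity C after the first iteration, matching the intuition that they share structural context.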
2.4.4 Content Distance Functions
Here we present functions to calculate the similarity between two texts, which in our problem are the tweets' messages. [Han06] defines the cosine similarity (the most commonly used) and a variation called the Tanimoto distance, while [MRS08] also introduces IDF and TF-IDF. Lastly, [RLW12] proposes a variation of the Jaccard similarity complemented with Dice's coefficient.
Before exploring content similarity functions, document representations must be explained, since they enable the use of some of the distance functions presented.
[MRS08] defines IDF (inverse document frequency) as a weighting that reflects how rare a term is across documents. Considering a collection of N documents and the document frequency df(t) of the term t, we have:

IDF(t) = log(N / df(t))  (2.19)
TF-IDF (term frequency – inverse document frequency) [LAR07] is an extension of this measure. It combines the IDF with the frequency of the term within a particular document, as opposed to the document collection:

TFIDF(t, D) = TF(t, D) × IDF(t)  (2.20)
In order to use the cosine similarity to check whether two texts are similar, they must first be converted to term-frequency vectors (also known as document vectors): numeric vectors that record the number of times each word appears in the text. After this transformation, two term-frequency vectors x and y can be compared with:

sim(x, y) = (x · y) / (||x|| ||y||)  (2.21)
The output of this similarity function is a value between 0 and 1, where 0 means the texts are unrelated and 1 means they are identical. A variation of this
similarity measure is the Tanimoto distance, defined specially for binary-valued attributes within the term-frequency vectors:

sim(x, y) = (x · y) / (x · x + y · y − x · y)  (2.22)
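The conversion to term-frequency vectors and the cosine similarity of equation (2.21) can be sketched as follows (an illustrative sketch using plain whitespace tokenization, which is a simplification for real tweets):

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Term-frequency (document) vector for a short text such as a tweet."""
    return Counter(text.lower().split())

def cosine_similarity(x, y):
    """Equation (2.21) over two term-frequency vectors (Counters)."""
    dot = sum(x[t] * y[t] for t in x if t in y)
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```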
The Jaccard and Dice similarity measure also evaluates whether two texts x and y are similar:

sim(x, y) = |x ∩ y| / min(|x|, |y|)  (2.23)
[RLW12] uses this measure, and also a combination of cosine similarity and TF-IDF for weighting the term-frequency vectors, in different data processing phases, but always with the purpose of identifying similar documents.
[RKT11] proposed variations of both the cosine and Jaccard similarities for short text clustering. These variants address the sparsity of term-frequency vectors for short texts, such as tweets. The variation of the cosine similarity is:

sim(x, y) = 1 − (∑_{k=1}^{d} xk yk) / (||x|| ||y||)  (2.24)

and the Jaccard variation is represented by:

sim(x, y) = 1 − |x ∩ y| / |x ∪ y|  (2.25)
2.4.5 Mixed Distance Functions
To the best of our knowledge, there is only one proposal of a distance measure that combines different types of data: the cosine similarity with temporal attenuation. [Lee12] states that content similarity is directly related to the temporal dimension: two texts are expected to be more similar if they occur within a short time window rather than a long one.

sim(x, y) = (∑_i xi yi / (|x| |y|)) · e^(−ζ|tx − ty|/W)  (2.26)

The temporal evaluation is made through a penalty factor which ensures that if the time interval between two texts is large, then the penalty is also large. The parameter ζ adjusts the penalty rate.
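A sketch of such a temporally attenuated similarity follows; the exponential-decay form and the values of the rate ζ and window W are illustrative assumptions in the spirit of equation (2.26), not parameters taken from [Lee12]:

```python
from math import exp

def attenuated_similarity(content_sim, t_x, t_y, zeta=1.0, window=3600.0):
    """Content similarity damped by the temporal distance between two texts.

    content_sim: a precomputed cosine similarity in [0, 1];
    t_x, t_y: timestamps (e.g. seconds); zeta and window are illustrative.
    """
    return content_sim * exp(-zeta * abs(t_x - t_y) / window)
```

Texts posted at the same instant keep their full content similarity, while the same pair of texts an hour apart is penalized exponentially.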
2.5 Spatio-Temporal Visualization
An important step in the Data Mining process is interpretation of data and patterns. As [CM07]
claims, "Visual data mining refers to methods, approaches and tools for the exploration of large
data sets by allowing users to directly interact with visual representations of data and dynamically
modify parameters to see how they affect the visualized data.". The main properties that must be
followed by visualization tools in Data Mining are: showing the appearance of data, displaying temporal behaviour, showing properties of the entire displayed scene and supporting interaction [Gah09].
[Gah09] states that the main visualization techniques are: map-based, chart-based, projection, space-filling or pixel-based, iconographic or compositional, and hierarchical or network-based.
These data visualization techniques are explored next, through practical implementations found in related work.
2.5.1 Clustering Visualization
When referring to clustering, the most usual representation is a graph-like visualization. It presents the objects of each cluster, following the clustering goal: similar objects are drawn at short distances from each other, while different clusters are kept clearly apart. [LAR07] developed a clustering visualization tool, shown in figure 2.3.
Figure 2.3: Clustering Visualization [LAR07].
Another clustering visualization, suited to large amounts of data, assigns different colours to the objects of different clusters and overlays ellipses on the most representative objects to denote groups of similar objects. [CSX08] applied it to the study of geographical lexical variation, to assist in mapping the results.
Figure 2.4: Clustering Visualization [CSX08].
2.5.2 Georeferenced Data Visualization
Georeferenced data visualization typically involves plotting the information on top of a geographic representation, the most common being the map.
The first visualization type discussed is the 2D map, currently very popular due to Google Maps 1. Using its API, a vast number of applications have been implemented, owing to its simplicity and visual appeal. [Lee12] used this tool to overlay the map with the representative points of its solution, as well as geometric figures that highlight the obtained results. [LYCW11] used it to detect events on Twitter.
Figure 2.5: Map with event detection on Twitter [Lee12].
1https://developers.google.com/maps/?hl=pt-pt
Another Google product related to mapping is Google Earth 2. It enables a 3D visualization of the Earth and, like Google Maps, it offers an easy interface and allows an intuitive representation of information. [CM07] used it to represent the association rules extracted from a data set of Hurricane Isabel.
[CM07] also included a timeline (which will be presented in the next section), since it uses a spatio-temporal approach to retrieve knowledge.
Figure 2.6: Modified Google Earth rule visualization tool [CM07].
Although only Google's visualization tools have been detailed, there are many competitors in this market niche that also provide map APIs.
Silicon Graphics International, in partnership with the University of Illinois, created a real-time visualization tool for sentiment mining on Twitter [Fit12]. The representation adopted a heat map approach, in which each colour represents a different prevalence of positive or negative comments.
2https://developers.google.com/earth/?hl=pt-pt
Figure 2.7: Real-time heat maps of positive and negative sentiments expressed via Twitter [Fit12].
2.5.3 Timestamped Data Visualization
Timestamped data invokes a linear organization of events, so the most intuitive representation is a chart with the temporal dimension on one axis and the values under analysis on the other. [Lee12] uses such a chart to plot the probability of a keyword belonging to a location over time, as shown in figure 2.8.
Figure 2.8: Time Graphic for event detection on Twitter [Lee12].
Although many charting tools provide interaction, nowadays the timeline has emerged as a very usable tool for navigating data through time. [RLW12] incorporated a timeline as a filter over the gathered information, to simplify access to the most important data collected.
Figure 2.9: Timeline [RLW12].
Chapter 3
Solution Perspective
3.1 Solution Description
This section presents the proposed solution: the planned approaches to address the problems of this thesis and achieve the established goals. The expected development process is shown in figure 3.1, with the parts already finished on the left side of the image.
Figure 3.1: Steps in the development process.
In the development process presented in figure 3.1, it is possible to see that analyses were carried out of clustering algorithms and distance functions, as well as of appropriate visualization tools. The TwitterEcho platform was also analysed.
In the next steps, the clustering algorithms and distance functions will be chosen for this particular problem. To apply them, the data must be prepared conveniently. The clustering results will be presented using a visualization design well suited both to the data types and to the pattern detection task.
After the previous steps, an evaluation of the results is required. This evaluation will be scientific (using the clustering evaluation methods already discussed) and will also address user experience and the clarity of the results.
Now that the methodology has been explained, the architecture proposed for the solution is described in more detail. Although TwitterEcho has a complex architecture, to facilitate interpretation we consider only the databases holding georeferenced tweets and user information, since these are the only data sets required for solving the problem proposed in this dissertation. They are used as input for the distance calculation module in the backend server and are accessible through a Java-based RESTful service. The clustering results will be made available via the RESTful service to the website that incorporates the visualization tool. Clustering and matrix operations will be processed in R Server 2, a specialized programming environment for mathematical and algorithmic purposes.
[Figure: the TwitterEcho databases (Geo DB and Network DB) feed the backend server (pattie.fe.up.pt), where an R Server calculates the distances and the clustering, a DBMS server keeps a local DB with the similarity matrices, and a RESTful server delivers the results to the frontend webpage (HTML + JavaScript).]

Figure 3.2: Proposed solution architecture.
Table 3.1 presents the proposed clustering algorithms and distance functions per dimension; the distances are stored in one similarity matrix per dimension:

Dimension   Clustering Algorithm   Distance Function
Spatial     COD-CLARANS            Euclidean Distance
Temporal    DBSCAN                 Time Interval
Social      SCAN                   SimRank
Content     DBSCAN                 Cosine Distance

Table 3.1: Algorithms and distance functions for the proposed solution.
We are also considering introducing Twitter's functionalities (retweets, mentions and hashtags) into the distance functions, in order to also capture the social and content importance
2http://www.rstudio.com/ide/docs/server/getting_started
implicit in these functionalities.
After each similarity matrix is computed, the overlapping of the final clustering will depend on the parameters selected by the user through the visualization tool. A weighting system must be used to apply different importances to each dimension and produce the final result. The distances of the different dimensions also need to be normalized so that they all share the same range (for instance, between 0 and 1).
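The intended combination step can be sketched as follows: each dimension's distance matrix is first normalized to [0, 1], and the matrices are then merged by a user-weighted average (an illustrative sketch of the proposal, not the final implementation):

```python
def normalize(matrix):
    """Scale all distances in a square matrix to the [0, 1] range."""
    flat = [d for row in matrix for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0          # guard against a constant matrix
    return [[(d - lo) / span for d in row] for row in matrix]

def combine(matrices, weights):
    """Weighted average of per-dimension normalized distance matrices.

    matrices: one n-by-n matrix per dimension (spatial, temporal, social, content);
    weights: one user-selected weight per dimension.
    """
    total = sum(weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]
```

The combined matrix can then be fed to any of the distance-based clustering algorithms of section 2.3, and changing the slider weights simply re-runs the combination.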
The expected final tool is presented below. It must contain a 2D map with tweet locations and a timeline to filter the number of visible tweets. The clustering parameters selected by the user are represented through a slider for each dimension, expected to follow a predefined scale in order to fight the major issue of this solution: scalability.
This feature will allow a simple way of pre-computing all combinations of values selectable by the user, reducing the overhead of the final cluster computation and enabling the fastest possible visualization.
Figure 3.3: Expected final tool.
The last parameter will be the way the clustering is presented. This is necessary in order to keep all tweets mapped while representing the clustering associations without rearranging the tweets' positions. Several representation types are being considered, namely:
• Use different color, line style, object shape and size to represent each dimension within the
final cluster.
• Make a cluster visible only when a point belonging to it is selected, and show all connecting lines.
29
Solution Perspective
• Use only different colours for objects in different clusters and overlay the information of each cluster with an ellipse of the same colour as its objects.
These will be tested to find the best technique for conveying the mined knowledge while keeping a good presentation of the collected patterns, maximizing their interpretability for a wide range of users.
3.2 Technologies
The backend server will host an R Server (responsible for matrix and clustering processing), a DBMS server (to store the calculated matrices and clustering results) and a RESTful service to organize communication throughout the system. The webpage is expected to be developed in HTML and JavaScript, since it must adhere to TwitterEcho's look & feel.
3.3 Experimental setup, validation and evaluation
The available data set of Portuguese georeferenced tweets currently consists of 3,316 tweets in continental Portugal plus 55 in the archipelagos of the Azores and Madeira. Although the data must be filtered to the Portuguese Twittosphere, the map below shows the distribution of Portuguese tweets throughout the world.
Figure 3.4: Current distribution of georeferenced tweets in TwitterEcho.
The experiment's validation will most likely involve a comparison with TwitterEcho's search platform, in order to verify that all tweets of the same type are clustered together, although this is
not enough to validate all dimensions. The ultimate validation is expected to be performed by the final users.
To meet the goals set for this dissertation, the similarities must be clear in every dimension. That is, if a clustering is based purely on the spatial dimension, it will be visible on the map, since it must connect the closest objects. If, instead, it is temporal, the resulting clustering must associate tweets that are close in time. When content is the goal, some words and/or hashtags will be shared. Validation of social-dimension clustering can be performed through Twitter's social graph, and also by assessing whether a tweet is a retweet or a reply, signalled by the "@" symbol.
The evaluation of the solution will incorporate the formulae and techniques presented in section 2.3.7. Since the ground truth is not known a priori, an intrinsic method for evaluating clustering quality will be used, namely the silhouette coefficient.
3.4 Workplan
Figure 3.5 presents a Gantt chart with the schedule for the remainder of the project (February to July), assigning to each task the time expected for its conclusion: collecting TwitterEcho's data, implementing the clustering algorithms, performance evaluation and integration, developing the visualization tool, writing the dissertation report and writing an article.
Figure 3.5: Timeline for the second phase of the dissertation.
References
[AFK11] Houben G. Abel F., Gao Q. and Tao K. Analyzing Temporal Dynamics in TwitterProfiles for Personalized Recommendations in the Social Web. 2011.
[AHSV03] McKeown G. Al-Harbi S. and Rayward-Smith V. A New Metric for Categorical Data.In Statistical Data Mining and Knowledge Discovery. Chapman and Hall/CRC, July2003.
[AMJ99] Kriegel H. Ankerst M., Breunig M. and Sander J. OPTICS: ordering points to identifythe clustering structure. ACM SIGMOD Record, pages 49–60, 1999.
[BO12] M. Bošnjak and E. Oliveira. TwitterEcho - A Distributed Focused Crawler to SupportOpen Research with Twitter Data. 2012.
[Bru] A. Bruns. Information , Communication & Society How long is a tweet? Mappingdynamic conversation networks on Twitter using GAWK and GEPHI. (November2012):37–41.
[CM07] P. Compieta and S. Di Martino. Exploratory spatio-temporal data mining and visual-ization. Journal of Visual Languages and Computing, 2007.
[Cor] M. Cordeiro. Twitter event detection : combining wavelet analysis and topic infer-ence summarization.
[CSX08] B. Connor, N. A. Smith, and E. Xing. A Latent Variable Model for Geographic Lexical Variation. 2008.
[Cur06] J. Curran. Statistical data mining and knowledge discovery. Structural Equation Modeling: A Multidisciplinary Journal, 13(4):649–652, 2006.
[Dic] Oxford Dictionaries. "Social Network". Oxford Dictionaries. April 2010.
[EKSX96] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[Fit12] B. Fitzgerald. SGI Twitter Heat Map: Supercomputer Shows Where Angriest Tweeters Live. Available at http://www.huffingtonpost.com/2012/11/19/sgi-twitter-heat-map_n_2138726.html, accessed 1 February 2013. 2012.
[Gah09] M. Gahegan. Visual Exploration and Explanation in Geography Analysis with Light. In Geographic Data Mining and Knowledge Discovery, Second Edition, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, pages 291–324. CRC Press, May 2009.
[GGRJ] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
[Gol10] S. Golder. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. pages 1–10, 2010.
[Han06] J. Han. Data Mining: Concepts and Techniques (2nd Edition), 2006.
[HK] A. Hinneburg and D. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise.
[KH] G. Karypis and E. Han. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
[KR08] L. Kaufman and P. Rousseeuw. Introduction. In Finding Groups in Data, pages 1–67. John Wiley & Sons, Inc., 2008.
[LAR07] A. Lopes, R. Pinho, F. Paulovich, and R. Minghim. Visual text mining using association rules. Computers & Graphics, 31(3):316–326, June 2007.
[Lee12] C. Lee. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 39(10):9623–9641, August 2012.
[LMS11] W. Loh, S. Mane, and J. Srivastava. Mining temporal patterns in popularity of web items. Information Sciences, 181(22):5010–5028, November 2011.
[LYCW11] C. Lee, H. Yang, T. Chien, and W. Wen. A Novel Approach for Event Detection by Mining Spatio-temporal Information on Microblogs. 2011 International Conference on Advances in Social Networks Analysis and Mining, pages 254–259, July 2011.
[MRS08] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[Ng] R. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. pages 144–155.
[PMMR13] B. Pescosolido, J. Martin, J. McLeod, and A. Rogers. Handbook of the Sociology of Health, Illness, and Healing: A Blueprint for the 21st Century. Handbooks of Sociology and Social Research. Springer, 2013.
[RKT11] A. Rangrej, S. Kulkarni, and A. Tendulkar. Comparative study of clustering techniques for short text documents. Proceedings of the 20th International Conference Companion on World Wide Web - WWW '11, page 111, 2011.
[RLW12] H. Ryu, M. Lease, and N. Woodward. Finding and exploring memes in social media. Proceedings of the 23rd ACM Conference on Hypertext and Social Media - HT '12, page 295, 2012.
[RS99] J. Roddick and M. Spiliopoulou. A bibliography of temporal, spatial and spatio-temporal data mining research. ACM SIGKDD Explorations Newsletter, 1(1):34–38, June 1999.
[RS02] J. Roddick and M. Spiliopoulou. A Survey of Temporal Knowledge Discovery Paradigms and Methods. IEEE Transactions on Knowledge and Data Engineering, 14(4):750–767, 2002.
[THH01] A. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. In Data Engineering, 2001. Proceedings. 17th International Conference on, pages 359–367, 2001.
[Twi13a] Twitter. FAQs About Retweets (RT). Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/77606-faqs-about-retweets-rt, accessed 31 January 2013. 2013.
[Twi13b] Twitter. FAQs about Tweet Location. Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/78525-faqs-about-tweet-location, accessed 31 January 2013. 2013.
[Twi13c] Twitter. Twitter API - Documentation. Available at https://dev.twitter.com/docs, accessed 31 January 2013. 2013.
[Twi13d] Twitter. What Are Hashtags ("#" Symbols)? Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/49309-what-are-hashtags-symbols, accessed 31 January 2013. 2013.
[Twi13e] Twitter. What Are @Replies and Mentions? Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/14023-what-are-replies-and-mentions, accessed 31 January 2013. 2013.
[WM97] W. Wang and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. 1997.