FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO
TweeProfiles: detection of spatio-temporal patterns on Twitter
Tiago Daniel Sá Cunha
Mestrado Integrado em Engenharia Informática e Computação
Supervisor: Carlos Soares (PhD)
Co-Supervisor: Eduarda Mendes Rodrigues (PhD)
6th February, 2013
© Tiago Daniel Sá Cunha, 2013
Resumo
Online social networks present themselves as valuable sources of information about their users and their respective interests. Such information has been the subject of several studies, conducted by Data Mining researchers from all over the world, in order to discover user behaviours and patterns. In addition, there has also been investment in creating platforms for continuous information extraction and visualization.
This dissertation aims to identify tweet profiles involving multiple types of information, namely spatial, temporal, social and content. Each dimension is processed separately and aggregated into the final result through a weighting scheme.
The goals of TweeProfiles are achieved by adapting clustering algorithms and distance functions to mine the profiles and by presenting the obtained results in a platform developed for dynamic and intuitive use, which aims to reveal the discovered patterns in an understandable way.
The case study to which this dissertation will be applied is the Portuguese Twittosphere, although it will be developed to support any georeferenced tweets from Twitter.
Abstract
Online social networks present themselves as valuable information sources about their users and their respective interests. Such information has been the subject of many studies conducted by Data Mining scholars throughout the world in order to discover users' behaviours and patterns. In addition, there has also been investment in creating platforms for continuous information extraction and data visualization.
This dissertation aims to identify tweet profiles involving multiple types of data, namely spatial, temporal, social and content. Each dimension is computed separately and aggregated into the final result, considering a weighting scheme for each dimension.
The goals of TweeProfiles are achieved by adapting clustering algorithms and distance functions to mine the profiles, and by displaying the obtained results in a visualization platform designed for dynamic and intuitive usage, aimed at revealing the discovered patterns in an understandable way.
The case study on which it will be applied is the Portuguese Twittosphere, although it will be developed to use any geo-referenced tweets extracted from Twitter itself.
Contents
1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Objectives
  1.4 Document Structure

2 State of the art
  2.1 Twitter Overview
    2.1.1 General Description
    2.1.2 Twitter API
    2.1.3 TwitterEcho platform
  2.2 Data Mining
  2.3 Clustering
    2.3.1 Partitioning
    2.3.2 Hierarchical
    2.3.3 Density based
    2.3.4 Grid based
    2.3.5 Graph based
    2.3.6 Discussion
    2.3.7 Clustering Evaluation
  2.4 Distance Measures
    2.4.1 Spatial Distance Functions
    2.4.2 Temporal Distance Functions
    2.4.3 Social Distance Functions
    2.4.4 Content Distance Functions
    2.4.5 Mixed Distance Functions
  2.5 Spatio-Temporal Visualization
    2.5.1 Clustering Visualization
    2.5.2 Georeferenced Data Visualization
    2.5.3 Timestamped Data Visualization

3 Solution Perspective
  3.1 Solution Description
  3.2 Technologies
  3.3 Experimental setup, validation and evaluation
  3.4 Workplan

References
List of Figures
2.1 TwitterEcho Physical Architecture
2.2 Dissimilarity Matrix [Han06]
2.3 Clustering Visualization [LAR07]
2.4 Clustering Visualization [CSX08]
2.5 Map with event detection on Twitter [Lee12]
2.6 Modified Google Earth rule visualization tool [CM07]
2.7 Real-time heat maps of positive and negative sentiments expressed via Twitter [Fit12]
2.8 Time Graphic for event detection on Twitter [Lee12]
2.9 Timeline [RLW12]
3.1 Steps in the development process
3.2 Proposed solution architecture
3.3 Expected final tool
3.4 Current distribution of georeferenced tweets in TwitterEcho
3.5 Timeline for the second phase of the dissertation
List of Tables
2.1 Comparison of clustering algorithms (adapted from [LMS11])
3.1 Algorithms and distance functions for the proposed solution
Abbreviations
API Application Programming Interface
HDFS Hadoop Distributed File System
HTML HyperText Markup Language
HTTP Hypertext Transfer Protocol
IDF Inverse Document Frequency
JSON JavaScript Object Notation
REST Representational State Transfer
TFIDF Term Frequency - Inverse Document Frequency
Chapter 1
Introduction
1.1 Context
A social network is defined in social sciences as a social structure composed of a set of actors
and the ties between them [PMMR13]. More recently, it acquired a new meaning in information
science which is "a dedicated website or other application which enables users to communicate
with each other" [Dic]. More generally, online social networks present a variety of social media
services.
In recent years, social media services have gained great importance in social life and in companies' business strategies, since they "have been regarded as a timely and cost-effective source of spatio-temporal information" [LYCW11]. The massive adoption of these services and the number of platforms providing social interaction have led to a growth in the data stored within them. This data has been used by many researchers as a source from which to extract information [RLW12, LMS11, CM07].
Twitter has proven to be a popular data source within social media, due to its large number of active users and the easy access to its public API. As such, it has fuelled a number of studies [BO12, Bru, Cor, Gol10, AFK11].
TwitterEcho [BO12] is a research platform. It collects tweet and user data from the Portuguese Twittosphere and aims to support R&D and journalistic tools. These tools use Data Mining techniques to generate knowledge. Some of the main functionalities already implemented in this platform are text mining, opinion mining and social network analysis.
1.2 Motivation
This dissertation aims to explore Twitter's spatio-temporal component in detail and to adapt clustering algorithms and distance functions better suited to Twitter data, considering all of the following dimensions: spatial, temporal, social and content.
For the platform's end user, it aims to evaluate the profiles extracted from Twitter and to explore the patterns within Twitter's spatio-temporal context. Ultimately, it aims to help journalists
and researchers answer the where, when, what and who of a given story on Twitter.
1.3 Objectives
The scientific objective set for this dissertation is to create a spatio-temporal data analysis module for TwitterEcho, applying clustering algorithms to tweet and user data and using the best combination of clustering algorithms and distance measures for clustering according to the spatial, temporal, social and content dimensions.
While the spatial and temporal dimensions are the basis of this dissertation, social distances between users and content similarities between tweets must also be used to complete the similarity comparison. The visualization platform must use spatio-temporal visualization techniques and tools to better represent the information and, at the same time, provide tools to interact with the information presented.
The visualization tool aims to be interactive: it must give the user the ability to define a priority scheme for each of the previously defined dimensions and instantly update the visualized information, based on the information pre-computed by the clustering algorithms.
1.4 Document Structure
This document is organized as follows:
Chapter 2 contains the state of the art for the fields related to this project. We explain the TwitterEcho project in more detail, alongside the clustering algorithms and distance measures studied for each dimension in spatio-temporal Data Mining.
Chapter 3 presents the planned approach, the technologies to be used, the experimental setup and validation, and the work plan for the next six months.
Chapter 2
State of the art
2.1 Twitter Overview
This section provides a description of the Twitter¹ social media service and its API, followed by
an introduction to the TwitterEcho platform.
2.1.1 General Description
Twitter is a microblogging service that enables users to publish short messages (also known as
"tweets") with a maximum size of 140 characters.
Within Twitter, a social tie is defined by whether a user is following or being followed by other
users.
Each tweet has well-defined items in its structure, although most are not mandatory. Each item serves either to enhance social interaction or to complete the information related to the message in question. These functionalities are presented below:
• Retweet (RT) Share another user’s tweet [Twi13a].
• Mention (@ + username) Identify a user in a tweet [Twi13e].
• Reply (@ + username) Answer to a previous user tweet [Twi13e].
• Hashtag (# + topic name) Association of a keyword to a tweet [Twi13d].
• Localization User’s geo-coordinates when sending a tweet [Twi13b].
2.1.2 Twitter API
Twitter provides two APIs to access its information, namely the Streaming API and the REST API [Twi13c]. The REST API requires OAuth authentication and is request-based; the Streaming API, on the other hand, requires OAuth or HTTP basic authentication and provides information through events.

¹ https://twitter.com/
The Streaming API provides real-time data (where each tweet is flagged as an event), although the only data available for querying is the data collected by the Streaming API since the beginning of the session. The REST API, in contrast, allows access to past information, where the only limits are the availability of Twitter data and the rate limits of the methods and applications.
The Twitter REST API enables access to user information, timelines, friends & followers, direct messages, general search, streaming, Places & Geo, and trends, although limits are imposed on the number of requests allowed. The REST API defines a request window of 15 minutes, during which each user is allowed either 15 or 180 requests per method invoked. However, each application invoking this API is subject to a general limit of 120 requests per hour. The Streaming API has a connection policy rather than a request policy: limits are imposed on the volume of data transmitted per client per second. Public access does not allow receiving more than 50 tweets per second or 4,320,000 tweets per day.
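The request-window policy described above can be illustrated with a small client-side budget tracker. The sketch below is our own illustration (the class and method names are hypothetical, not part of the Twitter API); the 15-request, 15-minute figures mirror the REST API limits just described:

```python
import time
from collections import deque

class RateLimitWindow:
    """Client-side budget tracker for a sliding request window
    (hypothetical helper, mirroring the 15-requests / 15-minutes policy)."""

    def __init__(self, max_requests=15, window_seconds=15 * 60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()  # send times of recent requests

    def _prune(self, now):
        # Drop requests that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()

    def try_acquire(self, now=None):
        """Return True if a request may be sent now, recording it if so."""
        now = time.monotonic() if now is None else now
        self._prune(now)
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False

window = RateLimitWindow(max_requests=15, window_seconds=900)
allowed = sum(window.try_acquire(now=0.0) for _ in range(20))
print(allowed)  # only 15 of the 20 attempts fit in one window
```

Such a tracker only mirrors the server-side policy: the authoritative counters are kept by Twitter, so a client should still honour the rate-limit information returned by the API itself.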
2.1.3 TwitterEcho platform
The TwitterEcho project [BO12] is a research platform for extracting, storing and analysing the
Portuguese Twittosphere for R&D and journalistic purposes. Its current architecture is presented
in Figure 2.1.
TwitterEcho collects data using the Twitter API. The platform accesses the Twitter Streaming API to obtain real-time tweets through the crawler clients. These tweets are sent to a message broker (i.e., a data format translator program) and processed by two components: stream processing and pre-processing. The resulting data is stored in both Apache Solr¹ and MongoDB².
In order to ease access to the information in a simple and effective manner, message and user indexes were created using Apache Solr. This allows parallel exploration of the tweets, by text-searching tweets or users in Solr and retrieving all their information from Hadoop³.
After the information is stored in Hadoop, it is subjected to batch processing in order to mine
different kinds of knowledge. This knowledge is available through analysis modules which include
text mining, opinion mining, sentiment analysis and social network analysis.
¹ http://lucene.apache.org/solr/
² http://www.mongodb.org/
³ http://hadoop.apache.org/
[Figure 2.1 shows the TwitterEcho data collector: crawling clients (Java) follow topics, users, locations and the sample stream, feeding a message broker (Kafka); stream processing (Storm) computes trend topics, most mentioned users and most mentioned URLs; pre-processing performs URL unshortening, language detection, tokenization and geo-tagged tweet extraction; batch processing (Python/MapReduce) computes user interactions, aggregation & statistics, spam detection and bot detection; data is stored in Solr/MongoDB (searchable tweets & users), a Network DB (Neo4j) and a Geo DB.]

Figure 2.1: TwitterEcho Physical Architecture.
Among TwitterEcho’s databases, we highlight the GeoDB and the NetworkDB, which will be the main data sources available to address this dissertation’s problem. Although the complete system is presented in Figure 2.1, only the most important steps have been explained.
2.2 Data Mining
Data Mining is "the process of discovering interesting patterns from large amounts of data"
[Han06]. [Cur06] claims that "Data Mining is a multi-disciplinary field at the confluence of Statis-
tics, Computer Science, Machine Learning, Artificial Intelligence (AI), Database Technology, and
Pattern Recognition". The main tasks of Data Mining are:
• Characterization and discrimination: summarization of general characteristics or features.
• Mining of frequent patterns, associations and correlations: finding patterns that occur frequently in the data.
• Classification and regression: obtain a model that represents the data.
• Clustering analysis: groups objects into subgroups, so that similar objects are in the same subgroup and dissimilar objects are in different subgroups.
• Outlier analysis: find objects that are very different from the majority of other objects.
We will now present the specialized types of data mining expected to be used in this dissertation.
Temporal data mining is concerned with the analysis of events ordered by one or more dimensions
of time [RS99] with the objective of "inferring relationships of contextual and temporal proximity"
[RS02]. Within this category, there are two main approaches: discovery of causal relationships
and discovery of similar patterns within the time sequences (also known as time series analysis).
The main techniques used by these methods are curve approximation, noise reduction, time series comparison and prediction using mathematical models.
On the other hand, "Spatial data mining can be superficially considered as the multi-dimensional
equivalent of temporal data mining" [RS99].
Examples of approaches applied to this branch of Data Mining are association rules, clustering
and characterization.
Spatio-temporal Data Mining is subdivided into two approaches: embedding the temporal dimension into Spatial Data Mining and introducing the spatial dimension into Temporal Data Mining systems [RS99].
2.3 Clustering
Clustering is defined as "the process of grouping a set of data objects into multiple groups or
clusters so that objects within a cluster have high similarity, but are very dissimilar to objects in
other clusters" [Han06]. Similarity is assessed through distance functions, which are described in Section 2.4.
Considering the proposed problem, clustering is the logical choice for extracting patterns from unlabelled data, such as the geo-reference and timestamp within each tweet. Furthermore, clustering groups similar objects together, which directly addresses the main objectives of this dissertation.
Another consideration is the need for a technique applicable not only to the spatial and temporal dimensions, but also to social and content similarities. Clustering therefore presents itself as the best suited technique.
In this section we present the most representative clustering algorithms for each type as well
as clustering evaluation methods.
2.3.1 Partitioning
There are four types of clustering methods for raw data: partitioning, hierarchical, density-based and grid-based. Other implementations have also been applied to graphs, which require a different approach, as we explain in this section.
Partitioning algorithms are known for generating mutually exclusive clusters of spherical shape, using distance-based techniques to group objects. They generally use the mean or a medoid to represent cluster centers and have proven effective up to medium-sized data sets [Han06]. A partitioning algorithm organizes the objects into partitions according to a particular criterion.
Within this set of partitioning algorithms, the best known are k-Means and k-medoids [Han06]. COD-CLARANS [THH01] is also introduced here, due to its capability of considering obstacles when performing clustering.
k-Means addresses the NP-hard problem of partitioning objects into clusters [Han06]. It defines a centroid as the mean value of the points in a cluster and assigns each object to the most
similar cluster, comparing the distance of each object to each cluster centroid. It employs an iterative approach to reduce the within-cluster variation, recalculating the means and re-assigning objects to more similar clusters in each iteration. The within-cluster variation used to assess cluster quality in each iteration is presented in Equation 2.1:
E = \sum_{i=1}^{K} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2 \qquad (2.1)
where p is an object of cluster C_i with centroid c_i. The algorithm stops when the clusters show no difference between two consecutive iterations, or when an imposed iteration limit is reached. The pseudo-code of this algorithm is presented in Algorithm 1.
Algorithm 1 k-Means
1: procedure K-MEANS(k : #clusters, D : dataset)
2: arbitrarily choose k objects from D as the initial cluster centers;
3: repeat
4: (re)assign each object to the cluster to which the object is the most similar;
5: update the cluster means;
6: until no change in clusters
7: return set of k clusters;
8: end procedure
Although it is a relatively scalable and efficient solution (its complexity is O(nkt), where n is the number of objects, k the number of clusters and t the number of iterations), it has disadvantages: the results are sensitive to the initial cluster centers and to outliers, and the expected number of clusters must be indicated in advance.
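A minimal sketch of Algorithm 1 for two-dimensional points, assuming Euclidean distance, may help make the iteration concrete (the function below is our own illustration, not TweeProfiles code):

```python
import random

def k_means(points, k, max_iter=100):
    """Minimal k-Means sketch following Algorithm 1 (illustrative only)."""
    centers = random.sample(points, k)  # arbitrarily choose k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # (re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # update the cluster means (keep the old center for an empty cluster)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # until no change in clusters
            break
        centers = new_centers
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(pts, 2)
```

With these six points, any choice of initial centers converges to the two obvious groups of three points each, illustrating the iterative refinement described above.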
k-medoids tries to solve one of the previous disadvantages: the sensitivity to outliers. It changes the k-Means processing by taking as cluster center an actual object (also known as a representative object) instead of the mean value of all points, allowing outliers to have less influence on the cluster shape. The partition is obtained in k-medoids by minimizing the sum of dissimilarities between each object p and the representative object o_i:
E = \sum_{i=1}^{K} \sum_{p \in C_i} \mathrm{dist}(p, o_i) \qquad (2.2)
This k-medoids concept was implemented by the PAM (Partitioning Around Medoids) algorithm [KR08]. First, the algorithm selects random objects (or seeds) as representative objects and, in the same manner as k-Means, iterates by switching cluster centers while the quality of the clustering is improvable [Han06]. The pseudo-code of the PAM algorithm is presented in Algorithm 2:
Algorithm 2 PAM
1: procedure PAM(k : #clusters, D : dataset)
2: arbitrarily choose k objects from D as the initial representative objects or seeds;
3: repeat
4: assign each remaining object to the cluster with the nearest representative object;
5: randomly select a non-representative object: o_random;
6: compute the total cost, S, of swapping representative object o_j with o_random;
7: if S < 0 then swap o_j with o_random to form a new set of k representative objects;
8: until no change in clusters
9: return set of k clusters
10: end procedure
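The swap step of Algorithm 2 can be sketched as follows (our own illustration; function names are hypothetical, and a full PAM implementation would repeat this step until no improving swap exists). The cost function is the sum of dissimilarities of Equation 2.2:

```python
def total_cost(points, medoids, dist):
    """Sum of dissimilarities of each point to its nearest medoid (Eq. 2.2)."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam_swap_step(points, medoids, dist):
    """One PAM iteration: evaluate every (medoid, non-medoid) swap and keep
    the best one if it lowers the total cost; illustrative sketch only."""
    best = list(medoids)
    best_cost = total_cost(points, best, dist)
    for m in medoids:
        for o in points:
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            cost = total_cost(points, candidate, dist)
            if cost < best_cost:  # S < 0: this swap improves the clustering
                best, best_cost = candidate, cost
    return best, best_cost

# One-dimensional example with absolute difference as the dissimilarity:
best, cost = pam_swap_step([0, 1, 2, 10, 11, 12], [0, 1], lambda a, b: abs(a - b))
print(best, cost)  # a single swap already reaches medoids {1, 11} with cost 4
```

Note that the chosen centers remain actual data objects throughout, which is exactly what makes the method less sensitive to outliers than k-Means.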
Although PAM indeed reduces the impact of outliers on the shape of the clusters, enabling better results, it presents a higher complexity (O(k(n−k)²)) and is therefore indicated only for small data sets. In order to overcome this scalability problem, a new approach was introduced with the CLARA (Clustering LARge Applications) algorithm [KR08]. It resorts to a sampling technique, clustering a small sample instead of the whole data set, and applies the PAM algorithm to that sample [Han06]. It assumes that the distribution of the sample is the same as that of the set it was drawn from. However, CLARA's effectiveness depends on the sample chosen. It is therefore a simple solution for clustering large data sets, but far from a perfect one.
CLARANS (Clustering LARge Applications based upon RANdomized Search) [Ng] was created, based on CLARA, in order to improve its scalability and clustering quality [Han06]. Not only does it sample the data set, it also performs a random search among the points in the data set for a better medoid: if the absolute-error criterion is improvable, it changes the representative object and continues the process.
CLARANS guarantees a local optimum when applied to large data sets. COD-CLARANS (Clustering with Obstructed Distance) [THH01] is a variation of CLARANS that conserves its advantages but was designed for a specific purpose: clustering in the presence of obstacles. This technique may be useful in this dissertation if we consider that in spatial clustering there are spatial frontiers that need to be taken into account; for instance, regions or districts can be considered frontiers.
COD-CLARANS consists of three main parts: the main algorithm, the squared-error function E and the pruning function E′. The pruning function avoids the computation of E by pruning the search, and provides focusing information when E cannot be computed, in order to improve efficiency.
In a pre-processing stage, COD-CLARANS creates a BSP (Binary Space Partitioning) tree, which determines whether any two points are visible to each other, i.e., whether there is no obstacle between them in a straight line. This information is converted into a visibility graph to improve the efficiency of the algorithm.
After determining visibility, it invokes the CLARANS methodology for clustering. It improves efficiency through the application of a technique called micro-clustering. A micro-cluster is a "compressed representation of a group of points which are so close together that they are likely to belong to the same cluster" [THH01]. Therefore, COD-CLARANS includes the representative object of each micro-cluster in the sample that CLARANS uses for its initial computation, instead of letting the algorithm choose random objects as seeds.
Although micro-clustering has a small effect on cluster quality [THH01], this algorithm is well suited for large data sets and solves the obstacle problem effectively.
With this algorithm we conclude the most relevant partitioning algorithms, and we now introduce the hierarchical clustering algorithms.
2.3.2 Hierarchical
"A hierarchical clustering method works by grouping data objects into a hierarchy or "tree" of clusters" [Han06]. This method can be either agglomerative (it starts with small clusters and recursively merges them until a single final cluster is reached) or divisive (all objects start in a single cluster, which is iteratively divided until each final cluster contains only one object or only very similar objects).
Usually, the results of hierarchical algorithms are represented by a dendrogram (i.e., a tree diagram), which separates the similarity of objects into levels and represents the connections between clusters by drawing lines from the root to the leaves.
We start by introducing the BIRCH algorithm [?]. BIRCH introduces the concept of a Clustering Feature (CF), used to summarize a cluster. It is a three-dimensional vector defined by:
CF = \langle n, LS, SS \rangle \qquad (2.3)
where n is the number of points, LS is the linear sum of the points and SS is the square sum of the data points. The CF enables the computation of a cluster's centroid, radius and diameter for later processing.
This data structure is then used in a CF-tree, whose objective is to represent the cluster hierarchy and to use the previous formulae to ensure the tightness of each cluster. Since CFs satisfy the additive property, agglomerating two clusters amounts to summing the components of their CFs. This is the key to the space efficiency of BIRCH.
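The additive property can be demonstrated with a short one-dimensional sketch (our own illustration; the function names are hypothetical). Merging two clusters reduces to a component-wise sum of their CF vectors, from which the centroid and radius are recovered without revisiting the points:

```python
def cf(points):
    """Clustering Feature <n, LS, SS> of a set of 1-D points (Eq. 2.3)."""
    n = len(points)
    ls = sum(points)                 # linear sum of the points
    ss = sum(p * p for p in points)  # square sum of the points
    return (n, ls, ss)

def cf_merge(a, b):
    """Additive property: the CF of a merged cluster is the component-wise sum."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def cf_centroid(c):
    n, ls, _ = c
    return ls / n

def cf_radius(c):
    """Average distance of the points to the centroid, from the CF alone."""
    n, ls, ss = c
    return (ss / n - (ls / n) ** 2) ** 0.5

merged = cf_merge(cf([1, 2, 3]), cf([4, 5]))
print(merged == cf([1, 2, 3, 4, 5]))  # True: summing CFs merges the clusters
```

This is precisely why a CF-tree can agglomerate sub-clusters in constant space per node, instead of storing the underlying points.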
BIRCH builds an initial tree from the data set, where each CF is inserted into the closest leaf. These leaves are then provided to the clustering algorithm, in order to group dense clusters into larger ones.
Although BIRCH has a computational complexity of O(n), each CF-tree has a limited size, which translates into clusterings with less resemblance to what the user may consider natural clusters. Also, BIRCH does not perform well for non-spherical clusters, since it uses the previous radius and diameter formulae to organize the clustering.
Chameleon [KH] is another agglomerative hierarchical algorithm, which uses dynamic modelling to determine the similarity between two clusters. This technique is based on two similarity concepts: the relative interconnectivity (RI) and the relative closeness (RC) of clusters.
The Chameleon algorithm builds a k-nearest-neighbour graph in which each edge is weighted to measure similarity and vertices are connected if they are among the k most similar objects. The graph is then subjected to a graph partitioning algorithm that generates smaller clusters by minimizing edge cuts. Lastly, an agglomerative hierarchical algorithm merges the sub-clusters to output the final clustering.
Chameleon can adapt itself to the cluster characteristics and therefore discover arbitrarily shaped clusters. It is also applicable to all data types, demanding only a suitable similarity function. However, it presents a complexity of O(n²).
Probabilistic clustering algorithms [Han06] use probabilistic models to measure the distances between objects in a data set. Well-known probability distributions, such as the Gaussian or Bernoulli distributions, are used to represent the data set and to compute the hierarchy of clusters.
These algorithms have advantages such as the ability to handle partially observable data and a complexity similar to that of agglomerative algorithms. However, they output only one hierarchy per distribution and are therefore less faithful to the real data.
This concludes the hierarchical clustering algorithms, although there are many more, including variations of those presented above. We now look more closely at the density-based algorithms.
2.3.3 Density based
Density-based clustering algorithms follow the strategy of modelling clusters as "dense regions in the data space, separated by sparse regions" [Han06]. These algorithms are therefore very suitable for finding non-spherically shaped clusters.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] finds core objects (i.e., points with dense neighbourhoods) and iteratively connects them to their neighbours if these are in the core object's ε-neighbourhood.
The ε-neighbourhood is defined through a user-supplied parameter, the radius ε: a point is in a core object's ε-neighbourhood if it is within this radius. Therefore, for two points p and q, we say that p is directly density-reachable from q if p is in the ε-neighbourhood of q.
Another user input is MinPts, which determines whether a point is a core object: if the ε-neighbourhood contains at least MinPts points, then we are in the presence of a core object.
Algorithm 3 takes the two previous concepts into account and iteratively connects core objects to their ε-neighbourhoods until all objects are processed.
Algorithm 3 DBSCAN
1: procedure DBSCAN(MinPts : neighborhood_threshold, D : dataset, ε : radius_parameter)
2: Mark all objects as unvisited;
3: do {
4: Randomly select an unvisited object p;
5: Mark p as visited;
6: if the ε-neighborhood of p has at least MinPts objects {
7: Create a new cluster C and add p to C;
8: Let N be the set of objects in the ε-neighborhood of p;
9: for each point p’ in N {
10: if p’ is unvisited {
11: Mark p’ as visited;
12: if the ε-neighborhood of p’ has at least MinPts points;
13: Add those points to N;
14: }
15: if p’ is not yet a member of any cluster
16: add p’ to C;
17: }
18: Output C;
19: }
20: else mark p as noise;
21: } until no object is unvisited;
22: end procedure
DBSCAN has a complexity of O(n²), but effectively finds non-spherically shaped clusters.
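A compact sketch of Algorithm 3 for one-dimensional points follows (our own illustration; it scans the points in list order rather than randomly, and a production implementation would use a spatial index to avoid the O(n²) neighbourhood scans):

```python
def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN sketch following Algorithm 3 (illustrative only).
    Returns the list of clusters and the set of noise points."""
    def neighbours(p):
        return [q for q in points if dist(p, q) <= eps]

    visited, clustered, clusters, noise = set(), set(), [], set()
    for p in points:
        if p in visited:
            continue
        visited.add(p)
        n = neighbours(p)
        if len(n) < min_pts:          # p is not a core object
            noise.add(p)
            continue
        cluster = {p}                 # create a new cluster C and add p
        clustered.add(p)
        queue = list(n)
        while queue:                  # expand through density-reachable points
            q = queue.pop()
            if q not in visited:
                visited.add(q)
                qn = neighbours(q)
                if len(qn) >= min_pts:   # q is itself a core object
                    queue.extend(qn)
            if q not in clustered:
                cluster.add(q)
                clustered.add(q)
                noise.discard(q)      # a former noise point becomes a border point
        clusters.append(cluster)
    return clusters, noise

pts = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0, 20.0]
clusters, noise = dbscan(pts, eps=1.0, min_pts=3, dist=lambda a, b: abs(a - b))
```

With these points, the two dense groups around 0 and 5.5 are found as clusters, while the isolated point 20.0 is marked as noise, illustrating how the method separates dense regions from sparse ones.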
OPTICS [AMJ99] is a variation of DBSCAN whose purpose is to remove the need for user-defined global parameters. Instead, these are updated automatically in each iteration, in order to better adapt to the data characteristics. However, OPTICS does not output a clustering, but a cluster ordering, i.e., an ordered sequence of the database items according to the computed density.
OPTICS introduces two concepts: core-distance and reachability-distance. The core-distance is the smallest ε′ such that the ε′-neighbourhood contains at least MinPts points. The reachability-distance is the minimum radius that makes two points density-reachable, i.e., such that the distance between the two points is within the radius defined by the core-distance.
OPTICS starts by ordering all objects and computing their core-distances and reachability-distances. With these values, it assigns objects to their neighbours, as DBSCAN does. However, at each iteration these values are updated automatically, so the radius can effectively differ for each core object, which allows a better fit to each sub-cluster.
Since the algorithm is based on DBSCAN, it has the same time complexity: O(n log n) if a spatial index is used, and O(n²) otherwise.
DENCLUE (DENsity-based CLUstEring) [HK] is a clustering algorithm based on a set of distribution functions. It makes use of a non-parametric density estimation approach called kernel density estimation [Han06]:

f̂h(x) = (1/(nh)) ∑_{i=1}^{n} K((x − xi)/h)  (2.4)
The kernel used in DENCLUE is generally a Gaussian kernel. This estimate enables the definition of density attractors: points located at the local maxima of the density function. These density attractors are then filtered through a threshold to find the centres of clusters.
DENCLUE defines a cluster as a set of density attractors with objects assigned to them, possibly including other density attractors to form the complete cluster.
This algorithm can find arbitrarily shaped clusters and is robust to noise in the data, since the noise is spread throughout the density distribution.
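The kernel density estimate of equation (2.4) can be sketched in a few lines of Python with a Gaussian kernel (an illustrative sketch for one-dimensional data; DENCLUE additionally performs hill climbing on this estimate to locate the density attractors):

```python
from math import exp, pi, sqrt

def gaussian_kernel(u):
    # Standard Gaussian kernel K(u)
    return exp(-u * u / 2) / sqrt(2 * pi)

def kernel_density(x, sample, h):
    """Estimate f_hat_h(x) from equation (2.4); h is the bandwidth."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)
```

The estimated density is high near concentrations of sample points and decays away from them, which is what makes the local maxima usable as cluster centres.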
2.3.4 Grid based
Grid-based algorithms use a space-driven approach instead of the data-driven approach of the previous algorithms [Han06]. They partition the space into the cells of a multi-resolution grid data structure. This ensures a fast processing time that is independent of the number of objects in the data set, although it is affected by the resolution of the grid.
STING (STatistical INformation Grid) [WM97] is a grid-based multi-resolution technique that splits the data space into rectangular cells at each level. Each higher-level cell is decomposed into smaller cells at the level below: the lower the level, the higher the resolution. At each level, statistical measures are computed and saved for future query processing.
[Han06] states that "the distribution of a higher-level cell can be computed based on the majority of distribution types of its corresponding lower-level cells in conjunction with a threshold filtering process". Clusters can thus be found by querying the hierarchical structure top-down: the query descends through each level until it reaches the lowest one and returns the cells relevant to the specified query. The pseudo-code of STING is presented in algorithm 4.
Algorithm 4 STING
1: procedure STING(layered_hierarchical_structure)
2: Determine a layer to begin with;
3: For each cell of this layer, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query;
4: From the interval calculated above, label the cell as relevant or not relevant;
5: If this layer is the bottom layer, go to Step (7); otherwise go to Step (6);
6: Go down the hierarchy by one level. Go to Step (3) for those cells that form the relevant cells of the higher-level layer;
7: If the specification of the query is met, go to Step (9); otherwise go to Step (8);
8: Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step (10);
9: Find the regions of relevant cells. Return those regions that meet the requirements of the query. Go to Step (10);
10: Stop.
11: end procedure
This algorithm presents many advantages: the grid structure is query-independent and enables parallel processing and incremental updates, which allows it to scale to larger problems. Moreover, its complexity is O(n) both for creating the hierarchical grid structure and for answering a query.
However, cluster quality depends directly on the number of levels in the structure, and STING can only produce isothetic clusterings, i.e. cluster boundaries are either vertical or horizontal, never diagonal, which directly affects the shapes of the clusters produced.
CLIQUE [GGRJ] is a simple grid-based method for finding density-based clusters in subspaces of the data.
Initially, it partitions each dimension into non-overlapping intervals (i.e. the cells of the grid) and, using a density threshold, classifies each cell as dense or sparse. In the second phase, adjacent dense cells are connected and the clusters assembled.
CLIQUE is insensitive to the order in which objects are presented and does not presume any specific distribution of the data. And although it provides good scalability, clustering quality depends on the grid size, since a lower resolution will introduce error into the final clustering result.
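The two phases just described can be illustrated for the 2-D case with the following Python sketch (the `cell_size` and `threshold` parameters are illustrative choices; CLIQUE proper additionally explores all subspaces of the data):

```python
from collections import Counter

def dense_cells(points, cell_size, threshold):
    """Phase 1: partition 2-D space into a grid and keep cells with >= threshold points."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= threshold}

def connect_cells(cells):
    """Phase 2: group axis-adjacent dense cells into clusters (flood fill)."""
    clusters, seen = [], set()
    for start in cells:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            cx, cy = stack.pop()
            if (cx, cy) in comp:
                continue
            comp.add((cx, cy))
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in cells and nb not in comp:
                    stack.append(nb)
        seen |= comp
        clusters.append(comp)
    return clusters
```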
2.3.5 Graph based
Since a cluster can be seen as a graph with similar objects connected to each others and typically
closer to a cluster centroid, graph clustering methods have the advantage of already having a well
defined structure (already has connected objects) for clustering, instead of raw data. According to
[Han06], there are two types of graph clustering methods: generic clustering methods for high-
dimensional data and (calculate the similarity matrix of the objects and apply a general clustering
algorithm like the ones seen before) or graph-driven approaches.
A well defined algorithm for graph clustering is SCAN [Han06], which is based on DBSCAN.
Algorithm 5 SCAN
1: procedure SCAN(G : graph, ε : similarity_threshold, µ : population_threshold)
2: Set all vertices as unlabeled;
3: forall unlabeled vertices u {
4: if (u is a core) {
5: Generate a new cluster-id c;
6: Insert all v ∈ Nε(u) into a queue Q;
7: while Q ≠ ∅ {
8: w ← the first vertex in Q;
9: R ← the set of vertices that can be directly reached from w;
10: forall s ∈ R {
11: if (s is unlabeled or labeled as nonmember) {
12: assign the current cluster-id c to s;
13: }
14: if (s is unlabeled) {
15: insert s into queue Q;
16: }}
17: remove w from Q;
18: }}
19: else
20: label u as nonmember;
21: }
22: forall vertices u labeled as nonmember {
23: if ∃x,y ∈ Γ(u) : x and y have different cluster-ids
24: label u as hub;
25: else
26: label u as outlier;
27: }
28: return set of clusters;
29: end procedure
It uses a similarity measure called structural-context similarity σ for two vertices u and v:

σ(u, v) = |Γ(u) ∩ Γ(v)| / √(|Γ(u)| |Γ(v)|)  (2.5)

where Γ(u) = {v | (u,v) ∈ E} ∪ {u} in a graph G = (V,E).
Given a similarity threshold ε, the ε-neighbourhood of a vertex u is defined by:

Nε(u) = {v ∈ Γ(u) | σ(u,v) ≥ ε}  (2.6)
which means that a vertex's ε-neighbourhood contains all neighbouring vertices whose structural-context similarity is at least the defined threshold. In SCAN, a core vertex must satisfy |Nε(u)| ≥ µ, where µ is a population threshold that defines the minimum neighbourhood size for a vertex to be a core vertex.
These core vertices are the basis of SCAN (algorithm 5): in each iteration, they are connected to the other vertices within their ε-neighbourhood until no vertices remain to be processed, or until an outlier is detected and signalled as such.
With a time complexity of O(n), SCAN is expected to provide good scalability for large
graphs.
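Equations (2.5) and (2.6) can be sketched in Python as follows, assuming the graph is given as a dictionary mapping each vertex to its set of neighbours (an illustrative sketch, not part of SCAN's published implementation):

```python
from math import sqrt

def structural_similarity(graph, u, v):
    """sigma(u, v) from equation (2.5); graph maps vertex -> set of neighbours."""
    gu = graph[u] | {u}          # Gamma(u): the neighbours of u plus u itself
    gv = graph[v] | {v}
    return len(gu & gv) / sqrt(len(gu) * len(gv))

def eps_neighbourhood(graph, u, eps):
    """N_eps(u) from equation (2.6): neighbours structurally similar enough to u."""
    return {v for v in graph[u] if structural_similarity(graph, u, v) >= eps}
```

A core vertex is then simply one whose ε-neighbourhood has at least µ members.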
Similarity and dissimilarity matrices can also be used to perform clustering. [Han06] defines a dissimilarity matrix as a "structure that stores a collection of proximities that are available for all pairs of n objects". These proximities are calculated using the formulae in the next section.
A dissimilarity matrix can be represented as an n-by-n table, n being the number of elements in the data set.
0
d(2,1)   0
d(3,1)   d(3,2)   0
  ⋮        ⋮                ⋱
d(n,1)   d(n,2)   ⋯        0

Figure 2.2: Dissimilarity Matrix [Han06].
2.3.6 Discussion
[LMS11] considered many of the previously explained algorithms and condensed the information about their characteristics in table 2.1.
Algorithm    Input number   Pair-wise distance   Mandatory       No outlier
             of clusters    computation          space-mapping   detection
k-Means      Y              N                    N               Y
CLARANS      Y              Y                    N               Y
BIRCH        N              N                    Y               Y
CURE         N              Y                    N               N
Chameleon    N              Y                    N               N
DBSCAN       N              Y                    N               N
OPTICS       N              Y                    N               N
STING        N              N                    Y               Y
CLIQUE       N              N                    Y               Y

Table 2.1: Comparison of clustering algorithms (adapted from [LMS11]).
2.3.7 Clustering Evaluation
Clustering evaluation is the assessment of three components: clustering tendency, the number of clusters in a data set and clustering quality [Han06].
Clustering tendency checks that the data set has a non-random structure, i.e., that it does not follow a uniform distribution. This is important because, on uniformly distributed data, even a completed clustering produces meaningless results. To compute clustering tendency, we can use the Hopkins statistic:
H = ∑_{i=1}^{n} yi / (∑_{i=1}^{n} xi + ∑_{i=1}^{n} yi)  (2.7)

where xi = min_{v∈D}{dist(pi, v)} and yi = min_{v∈D, v≠qi}{dist(qi, v)}, for points pi sampled uniformly from the data space and points qi sampled from the data set D. The statistic thus compares the nearest-neighbour distances of random points with those of actual data points. If the resulting coefficient is near zero, then the data set is not uniformly distributed.
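A possible Python sketch of the Hopkins statistic for 2-D points follows; the sampling scheme (uniform points pi over the bounding box, qi drawn from the data itself) is one common instantiation, and the sample size m is an illustrative choice:

```python
import random
from math import dist

def hopkins(data, m, seed=0):
    """Hopkins statistic (equation 2.7) for a list of 2-D points."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in data], [p[1] for p in data]
    x_sum = y_sum = 0.0
    for _ in range(m):
        # p_i: uniform sample over the bounding box of the data
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        x_sum += min(dist(p, v) for v in data)                    # x_i
        # q_i: sample drawn from the data set itself
        q = rng.choice(data)
        y_sum += min(dist(q, v) for v in data if v != q)          # y_i
    return y_sum / (x_sum + y_sum)
```

On strongly clustered data the yi terms are tiny compared with the xi terms, pushing H towards zero; on uniform data H tends towards 0.5.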
The number of clusters is the second evaluation step, which acknowledges that the number of clusters is not always known a priori. Three methods can be used to compute this value in a principled manner.
The first is a simplistic rule of thumb: it sets the number of clusters to √(n/2), so that each cluster contains on average about √(2n) points, where n is the number of objects in the data set.
The second method is the elbow method. First, the number of clusters is varied systematically and the within-cluster variation is computed for each execution. Then, in the analysis of the resulting curve, the "elbow" (the point where the decrease in within-cluster variation first flattens markedly) is chosen as the optimal number of clusters to provide to the final clustering algorithm. The within-cluster variation of a clustering into k clusters Ci with centroids ci is given by:

E = ∑_{i=1}^{k} ∑_{p∈Ci} dist(p, ci)²  (2.8)
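The within-cluster variation of equation (2.8) can be computed as below for 2-D points (an illustrative sketch; the elbow method would evaluate this for clusterings obtained with different numbers of clusters):

```python
from math import dist

def within_cluster_variation(clusters):
    """E from equation (2.8): sum of squared distances to each cluster's centroid.

    clusters: list of clusters, each a non-empty list of (x, y) points.
    """
    total = 0.0
    for cluster in clusters:
        cx = sum(p[0] for p in cluster) / len(cluster)   # centroid x
        cy = sum(p[1] for p in cluster) / len(cluster)   # centroid y
        total += sum(dist(p, (cx, cy)) ** 2 for p in cluster)
    return total
```

Splitting two well-separated groups into two clusters yields a much lower E than forcing them into one, which is the drop the elbow analysis looks for.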
The final method is cross-validation. It divides the data set into m parts and clusters m − 1 of them; the remaining part is used as a test set. With this test set, the distances from each point to the respective centroid are computed and aggregated into a single coefficient. By repeating this for different numbers of clusters, the coefficient can be optimized and the best value selected.
The final step in clustering evaluation is to assess cluster quality. This is subdivided into two types of methods, depending on whether the ground truth (i.e., the perfect clustering result) is known. If the ground truth cannot be known, intrinsic methods should be used; otherwise, extrinsic methods are available.
The extrinsic methods check four criteria: cluster homogeneity, cluster completeness, rag bag (for heterogeneous objects) and small cluster preservation. A well-known extrinsic method is BCubed precision, which measures how many objects in the same cluster belong to the same category as the object being tested. A companion metric, BCubed recall, measures how many objects of the same category are found in the same cluster.
With intrinsic methods, cluster compactness and separation are evaluated, mainly using the silhouette coefficient:

s(o) = (b(o) − a(o)) / max{a(o), b(o)}  (2.9)

where a(o) measures the compactness of the object's cluster (the average distance from o to the other objects in its cluster) and b(o) its degree of separation (the minimum average distance from o to the objects of the other clusters). The coefficient for a clustering is computed as the average silhouette of all objects: the closer it is to 1, the better the clusters satisfy both compactness and separation.
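For a single object, the silhouette of equation (2.9) can be sketched as follows, assuming 2-D points (an illustrative sketch; a(o) averages the distances inside the object's own cluster, b(o) takes the smallest average distance to any other cluster):

```python
from math import dist

def silhouette(point, own_cluster, other_clusters):
    """s(o) from equation (2.9) for one object.

    own_cluster must contain point plus at least one other object;
    other_clusters is a list of the remaining clusters.
    """
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)              # compactness
    b = min(sum(dist(point, p) for p in c) / len(c)                    # separation
            for c in other_clusters)
    return (b - a) / max(a, b)
```

Averaging this value over all objects gives the overall silhouette coefficient of the clustering.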
2.4 Distance Measures
Clustering algorithms, as we have seen in Section 2.3, need distance functions in order to calculate distances between objects and group them by similarity. [Han06] states that "the objective function aims for high intra-cluster similarity and low inter-cluster similarity".
2.4.1 Spatial Distance Functions
In this context, the spatial dimension is defined by the latitude and longitude values extracted from tweets. Therefore, similarity functions between numeric values must be explored. According to [Han06], the four most important distances of this type are the Euclidean, Manhattan, Minkowski and Mahalanobis distances. [AHSV03] also defines the Chebychev distance. Since weighted distances are very useful for assigning different importances to different components, the weighted Euclidean distance is given as an example, although weighting can be applied to other distance functions as well.
Let i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) be two objects described by p attributes. The Euclidean distance is defined by:

d(i, j) = √((xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²)  (2.10)
While the Euclidean distance is known as the straight-line distance, the Manhattan distance invokes the city-block paradigm, which defines the distance between two points as the sum of their horizontal and vertical distances. Considering the same objects i and j, the Manhattan distance is:

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|  (2.11)
The Minkowski distance is a generalization of both the Euclidean and the Manhattan distances:

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)  (2.12)
17
State of the art
It introduces a real number h, where h ≥ 1. When h = 1 we obtain the Manhattan distance and when h = 2 the Euclidean distance. For h → ∞, we obtain the Chebychev distance (also known as the supremum distance).
d(i, j) = lim_{h→∞} (∑_{f=1}^{p} |xif − xjf|^h)^(1/h) = max_f |xif − xjf|  (2.13)

This distance gives the maximum difference between the two objects i and j over all attributes f.
When the attributes have different importances, a weighting system can be applied. The weighted Euclidean distance is defined as follows:

d(i, j) = √(w1(xi1 − xj1)² + w2(xi2 − xj2)² + ... + wp(xip − xjp)²)  (2.14)
The Mahalanobis distance, although not as popular as the previous ones, has also long been used in clustering techniques. It incorporates the covariance matrix V of the distribution of the objects, which captures how the attributes vary together.

d_M(x, y) = √((x − y) V⁻¹ (x − y)ᵀ)  (2.15)
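The Minkowski family of distances (equations 2.12–2.14) can be written compactly in Python; these are illustrative one-liners over p-dimensional tuples:

```python
def minkowski(i, j, h):
    """Equation (2.12); h = 1 gives the Manhattan distance, h = 2 the Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

def chebyshev(i, j):
    """Equation (2.13): the limit of the Minkowski distance as h -> infinity."""
    return max(abs(a - b) for a, b in zip(i, j))

def weighted_euclidean(i, j, w):
    """Equation (2.14): Euclidean distance with one weight per attribute."""
    return sum(wk * (a - b) ** 2 for wk, a, b in zip(w, i, j)) ** 0.5
```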
2.4.2 Temporal Distance Functions
As far as the temporal dimension is concerned, contrary to the previous case where distances are measured in R², time is represented in R, which simplifies the calculation. The time interval can be defined by the following formula:

d(ti, tj) = |ti − tj|  (2.16)

However, any of the previous distance functions is also applicable, since in this dimension the objects are likewise represented by numerical values.
2.4.3 Social Distance Functions
Considering the connections between users stored in TwitterEcho, it is possible to assume the implicit existence of a social graph. Therefore, the social distance reduces to a distance between nodes of a graph. [Han06] defines two distance measures for graphs: geodesic distance and SimRank.
The geodesic distance is the shortest path between two vertices, i.e., simply the minimum number of edges between them.
SimRank stands for similarity based on random walk and structural context. Under this measure, two vertices are similar if they are connected to common vertices. In order to calculate the similarity, we need to introduce the concept of individual in-neighbourhood. Considering
a directed graph G = (V,E), where V defines a set of vertices and the set of edges is E ⊆ V ×V ,
the individual in-neighbourhood of a vertex v is defined as:
I(v) = {u|(u,v) ∈ E} (2.17)
The SimRank similarity is defined, for any two vertices u and v of the graph G, as:

s(u, v) = 0,  if I(u) = ∅ ∨ I(v) = ∅
s(u, v) = (C / (|I(u)| |I(v)|)) ∑_{x∈I(u)} ∑_{y∈I(v)} s(x, y),  if I(u) ≠ ∅ ∧ I(v) ≠ ∅  (2.18)

where C is a constant between 0 and 1.
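Equation (2.18) is recursive and is usually solved by fixed-point iteration. A possible Python sketch follows, where the graph maps each vertex to its in-neighbourhood I(v); the decay constant C and the iteration count are illustrative choices, not values prescribed by the definition:

```python
def simrank(graph, c=0.8, iterations=10):
    """Iterative SimRank (equation 2.18); graph maps vertex -> set of in-neighbours."""
    nodes = list(graph)
    # Start from the base case: every vertex is fully similar to itself.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not graph[u] or not graph[v]:
                    new[u][v] = 0.0        # empty in-neighbourhood case
                else:
                    total = sum(sim[x][y] for x in graph[u] for y in graph[v])
                    new[u][v] = c * total / (len(graph[u]) * len(graph[v]))
        sim = new
    return sim
```

Two vertices pointed to by the same vertex thus receive similarity C after the first iteration, matching the intuition that they share structural context.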
2.4.4 Content Distance Functions
Here we present functions to calculate the similarity between two texts, which in our problem are the tweets' messages. [Han06] defines the cosine similarity (the most commonly used) and a variation called the Tanimoto distance, while [MRS08] also introduces IDF and TF-IDF. Lastly, [RLW12] proposes a variation of the Jaccard similarity complemented with Dice's coefficient.
Before exploring content similarity functions, document representations must be explained, since they enable the use of some of the distance functions presented.
[MRS08] defines IDF (inverse document frequency) as a weighting that reflects how rare a term is across documents. Considering a collection of N documents and the document frequency df(t) of the term t, we have:

IDF(t) = log(N / df(t))  (2.19)
TF-IDF (term frequency – inverse document frequency) [LAR07] is an extension of this measure. It combines the IDF with the frequency of the term within a particular document, as opposed to the document collection:

TFIDF(t, D) = TF(t, D) × IDF(t)  (2.20)
In order to use the cosine similarity to check whether two texts are similar, they must first be converted to term-frequency vectors (also known as document vectors): numeric vectors that record the number of times each word appears in the text. After this transformation, two term-frequency vectors x and y can be compared with:

sim(x, y) = (x · y) / (||x|| ||y||)  (2.21)
The output of this similarity function is a value between 0 and 1, where 0 means the texts are unrelated and 1 means they are identical. A variation of this
similarity measure is the Tanimoto distance, defined specially for binary-valued attributes within the term-frequency vectors:

sim(x, y) = (x · y) / (x · x + y · y − x · y)  (2.22)
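The conversion to term-frequency vectors and the cosine similarity of equation (2.21) can be sketched as follows (an illustrative sketch using plain whitespace tokenization, which is a simplification for real tweets):

```python
from collections import Counter
from math import sqrt

def tf_vector(text):
    """Term-frequency (document) vector for a short text such as a tweet."""
    return Counter(text.lower().split())

def cosine_similarity(x, y):
    """Equation (2.21) over two term-frequency vectors (Counters)."""
    dot = sum(x[t] * y[t] for t in x if t in y)
    nx = sqrt(sum(v * v for v in x.values()))
    ny = sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```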
The Jaccard and Dice similarity measure also evaluates whether two texts x and y are similar:

sim(x, y) = |x ∩ y| / min(|x|, |y|)  (2.23)
[RLW12] uses this measure, and also a combination of cosine similarity and TF-IDF for weighting the term-frequency vectors, in different data processing phases, but always with the purpose of identifying similar documents.
[RKT11] proposed variations of both the cosine and Jaccard similarities for short text clustering. These variants address the sparsity of term-frequency vectors for short texts, such as tweets. The variation of the cosine similarity is:

sim(x, y) = 1 − (∑_{k=1}^{d} xk yk) / (||x|| ||y||)  (2.24)

and the Jaccard variation is represented by:

sim(x, y) = 1 − |x ∩ y| / |x ∪ y|  (2.25)
2.4.5 Mixed Distance Functions
To the best of our knowledge, there is only one proposal of a distance measure that combines different types of data: the cosine similarity with temporal attenuation. [Lee12] states that content similarity is directly related to the temporal dimension: two texts are expected to be more similar if they occur within a short time window rather than a long one.

sim(x, y) = (∑_i xi yi / (|x| |y|)) · e^(−ζ|tx − ty|/W)  (2.26)

The temporal evaluation is made through a penalty factor which ensures that if the time interval between two texts is large, then the penalty is also large. The parameter ζ adjusts the penalty rate.
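A sketch of such a temporally attenuated similarity follows; the exponential-decay form and the values of the rate ζ and window W are illustrative assumptions in the spirit of equation (2.26), not parameters taken from [Lee12]:

```python
from math import exp

def attenuated_similarity(content_sim, t_x, t_y, zeta=1.0, window=3600.0):
    """Content similarity damped by the temporal distance between two texts.

    content_sim: a precomputed cosine similarity in [0, 1];
    t_x, t_y: timestamps (e.g. seconds); zeta and window are illustrative.
    """
    return content_sim * exp(-zeta * abs(t_x - t_y) / window)
```

Texts posted at the same instant keep their full content similarity, while the same pair of texts an hour apart is penalized exponentially.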
2.5 Spatio-Temporal Visualization
An important step in the Data Mining process is interpretation of data and patterns. As [CM07]
claims, "Visual data mining refers to methods, approaches and tools for the exploration of large
data sets by allowing users to directly interact with visual representations of data and dynamically
modify parameters to see how they affect the visualized data.". The main properties that must be
followed by visualization tools in Data Mining are: showing the appearance of data, displaying temporal behaviour, showing properties of the entire displayed scene and supporting interaction [Gah09].
[Gah09] states that the main visualization techniques are: map-based, chart-based, projection, space-filling or pixel-based, iconographic or compositional, and hierarchical or network-based.
These data visualization techniques are explored next, through practical implementations found in related work.
2.5.1 Clustering Visualization
When referring to clustering, the most usual representation is a graph-like visualization. It presents the objects of each cluster, following the clustering goal: similar objects are drawn at short distances from each other, while different clusters are kept clearly apart. [LAR07] developed a clustering visualization tool, shown in figure 2.3.
Figure 2.3: Clustering Visualization [LAR07].
Another clustering visualization, suited to large amounts of data, assigns different colours to the objects of different clusters and overlays ellipses on the most representative objects to denote groups of similar objects. [CSX08] applied it to the study of geographical lexical variation, to assist in mapping the results.
Figure 2.4: Clustering Visualization [CSX08].
2.5.2 Georeferenced Data Visualization
Georeferenced data visualization typically involves plotting the information on top of a geographic representation, the most common being the map.
The first visualization type discussed is the 2D map, currently very popular due to Google Maps 1. Using its API, a vast number of applications have been implemented, owing to its simplicity and visual appeal. [Lee12] used this tool to overlay the map with the representative points of its solution, as well as geometric figures that highlight the obtained results. [LYCW11] used it to detect events on Twitter.
Figure 2.5: Map with event detection on Twitter [Lee12].
1https://developers.google.com/maps/?hl=pt-pt
Another Google product related to mapping is Google Earth 2. It enables a 3D visualization of the Earth and, like Google Maps, it offers an easy interface and allows an intuitive representation of information. [CM07] used it to represent the association rules extracted from a data set of Hurricane Isabel.
[CM07] also included a timeline (which will be presented in the next section), since it uses a spatio-temporal approach to retrieve knowledge.
Figure 2.6: Modified Google Earth rule visualization tool [CM07].
Although only Google's visualization tools have been detailed, there are many competitors in this market niche that also provide map APIs.
Silicon Graphics International, in partnership with the University of Illinois, created a real-time visualization tool for sentiment mining on Twitter [Fit12]. The representation adopted a heat map approach, in which each colour represents a different prevalence of positive or negative comments.
2https://developers.google.com/earth/?hl=pt-pt
Figure 2.7: Real-time heat maps of positive and negative sentiments expressed via Twitter [Fit12].
2.5.3 Timestamped Data Visualization
Timestamped data invokes a linear organization of events, so the most intuitive representation is a chart with the temporal dimension on one axis and the values under analysis on the other. [Lee12] uses such a chart to plot the probability of a keyword belonging to a location over time, as shown in figure 2.8.
Figure 2.8: Time Graphic for event detection on Twitter [Lee12].
Although many charting tools provide interaction, nowadays the timeline has emerged as a very usable tool for navigating data through time. [RLW12] incorporated a timeline as a filter over the gathered information, to simplify access to the most important data collected.
Figure 2.9: Timeline [RLW12].
Chapter 3
Solution Perspective
3.1 Solution Description
This section presents the proposed solution: the planned approaches to address the problems of this thesis and achieve the established goals. The expected development process is shown in figure 3.1, with the parts already finished on the left side of the image.
Figure 3.1: Steps in the development process.
In the development process presented in figure 3.1, it is possible to see that analyses were carried out of clustering algorithms and distance functions, as well as of appropriate visualization tools. The TwitterEcho platform was also analysed.
In the next steps, the clustering algorithms and distance functions will be chosen for this particular problem. To apply them, the data must be prepared conveniently. The clustering results will be presented using a visualization design well suited both to the data types and to the pattern detection task.
After the previous steps, an evaluation of the results is required. This evaluation will be scientific (using the clustering evaluation methods already discussed) and will also address user experience and the clarity of the results.
Now that the methodology has been explained, the architecture proposed for the solution is described in more detail. Although TwitterEcho has a complex architecture, to facilitate interpretation we consider only the databases holding georeferenced tweets and user information, since these are the only data sets required for solving the problem proposed in this dissertation. They are used as input for the distance calculation module in the backend server and are accessible through a Java-based RESTful service. The clustering results will be made available via the RESTful service to the website that incorporates the visualization tool. Clustering and matrix operations will be processed in R Server 2, a specialized programming environment for mathematical and algorithmic purposes.
[Figure: the TwitterEcho databases (Geo DB and Network DB) feed the backend server (pattie.fe.up.pt), where an R Server calculates the distances and the clustering, a DBMS server keeps a local DB with the similarity matrices, and a RESTful server delivers the results to the frontend webpage (HTML + JavaScript).]

Figure 3.2: Proposed solution architecture.
Table 3.1 presents the proposed clustering algorithms and distance functions per dimension; the distances are stored in one similarity matrix per dimension:

Dimension   Clustering Algorithm   Distance Function
Spatial     COD-CLARANS            Euclidean Distance
Temporal    DBSCAN                 Time Interval
Social      SCAN                   SimRank
Content     DBSCAN                 Cosine Distance

Table 3.1: Algorithms and distance functions for the proposed solution.
We are also considering introducing Twitter's functionalities (retweets, mentions and hashtags) into the distance functions, in order to also capture the social and content importance
2http://www.rstudio.com/ide/docs/server/getting_started
implicit in these functionalities.
After each similarity matrix is computed, the overlapping of the final clustering will depend on the parameters selected by the user through the visualization tool. A weighting system must be used to apply different importances to each dimension and produce the final result. The distances of the different dimensions also need to be normalized so that they all share the same range (for instance, between 0 and 1).
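The intended combination step can be sketched as follows: each dimension's distance matrix is first normalized to [0, 1], and the matrices are then merged by a user-weighted average (an illustrative sketch of the proposal, not the final implementation):

```python
def normalize(matrix):
    """Scale all distances in a square matrix to the [0, 1] range."""
    flat = [d for row in matrix for d in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0          # guard against a constant matrix
    return [[(d - lo) / span for d in row] for row in matrix]

def combine(matrices, weights):
    """Weighted average of per-dimension normalized distance matrices.

    matrices: one n-by-n matrix per dimension (spatial, temporal, social, content);
    weights: one user-selected weight per dimension.
    """
    total = sum(weights)
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]
```

The combined matrix can then be fed to any of the distance-based clustering algorithms of section 2.3, and changing the slider weights simply re-runs the combination.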
The expected final tool is presented below. It must contain a 2D map with tweet locations and a timeline to filter the number of visible tweets. The clustering parameters selected by the user are represented through a slider for each dimension, expected to follow a predefined scale in order to fight the major issue of this solution: scalability.
This feature will allow a simple way of pre-computing all combinations of values selectable by the user, reducing the overhead of the final cluster computation and enabling the fastest possible visualization.
Figure 3.3: Expected final tool.
The last parameter will be the way the clustering is presented. This is necessary in order to keep all tweets mapped while representing the clustering associations without rearranging the tweets' positions. Several representation types are being considered, namely:
• Use different color, line style, object shape and size to represent each dimension within the
final cluster.
• Make a cluster visible only when a point belonging to it is selected, and show all connecting lines.
29
Solution Perspective
• Use only different colours for objects in different clusters and overlay the information of each cluster with an ellipse of the same colour as its objects.
These will be tested to find the best technique for conveying the mined knowledge while keeping a good presentation of the collected patterns, maximizing their interpretability for a wide range of users.
3.2 Technologies
The backend server will host an R Server (responsible for matrix and clustering processing), a DBMS server (to store the calculated matrices and clustering results) and a RESTful service to organize communication throughout the system. The webpage is expected to be developed in HTML and JavaScript, since it must adhere to TwitterEcho's look & feel.
3.3 Experimental setup, validation and evaluation
The available data set of Portuguese georeferenced tweets currently consists of 3,316 tweets in continental Portugal plus 55 in the archipelagos of the Azores and Madeira. Although the data must be filtered to the Portuguese Twittosphere, the map below shows the distribution of Portuguese tweets throughout the world.
Figure 3.4: Current distribution of georeferenced tweets in TwitterEcho.
The experiment's validation will most likely involve a comparison with TwitterEcho's search platform, in order to verify that all tweets of the same type are clustered together, although this is
not enough to validate all dimensions. The ultimate validation is expected to be performed by the final users.
To meet the goals set for this dissertation, the similarities must be clear in every dimension. That is, if a clustering is based purely on the spatial dimension, it will be visible on the map, since it must connect the closest objects. If, instead, it is temporal, the resulting clustering must associate tweets that are close in time. When content is the goal, some words and/or hashtags will be shared. Validation of social-dimension clustering can be performed through Twitter's social graph, and also by assessing whether a tweet is a retweet or a reply, signalled by the "@" symbol.
The evaluation of the solution will incorporate the formulae and techniques presented in section 2.3.7. Since the ground truth is not known a priori, an intrinsic method for evaluating clustering quality will be used, namely the silhouette coefficient.
3.4 Workplan
Figure 3.5 presents a Gantt chart with the schedule for the remainder of the project (February to July), assigning to each task the time expected for its conclusion: collecting TwitterEcho's data, implementing the clustering algorithms, performance evaluation and integration, developing the visualization tool, writing the dissertation report and writing an article.
Figure 3.5: Timeline for the second phase of the dissertation.
References
[AFK11] Houben G. Abel F., Gao Q. and Tao K. Analyzing Temporal Dynamics in TwitterProfiles for Personalized Recommendations in the Social Web. 2011.
[AHSV03] McKeown G. Al-Harbi S. and Rayward-Smith V. A New Metric for Categorical Data.In Statistical Data Mining and Knowledge Discovery. Chapman and Hall/CRC, July2003.
[AMJ99] Kriegel H. Ankerst M., Breunig M. and Sander J. OPTICS: ordering points to identifythe clustering structure. ACM SIGMOD Record, pages 49–60, 1999.
[BO12] M. Bošnjak and E. Oliveira. TwitterEcho - A Distributed Focused Crawler to SupportOpen Research with Twitter Data. 2012.
[Bru] A. Bruns. Information , Communication & Society How long is a tweet? Mappingdynamic conversation networks on Twitter using GAWK and GEPHI. (November2012):37–41.
[CM07] P. Compieta and S. Di Martino. Exploratory spatio-temporal data mining and visual-ization. Journal of Visual Languages and Computing, 2007.
[Cor] M. Cordeiro. Twitter event detection : combining wavelet analysis and topic infer-ence summarization.
[CSX08] B. Connor, N. A. Smith, and E. Xing. A Latent Variable Model for Geographic Lexical Variation. 2008.
[Cur06] J. Curran. Statistical data mining and knowledge discovery. Structural Equation Modeling: A Multidisciplinary Journal, 13(4):649–652, 2006.
[Dic] Oxford Dictionaries. "Social Network". Oxford Dictionaries. April 2010.
[EKSX96] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[Fit12] B. Fitzgerald. SGI Twitter Heat Map: Supercomputer Shows Where Angriest Tweeters Live. Available at http://www.huffingtonpost.com/2012/11/19/sgi-twitter-heat-map_n_2138726.html, accessed 1 February 2013. 2012.
[Gah09] M. Gahegan. Visual Exploration and Explanation in Geography Analysis with Light. In Geographic Data Mining and Knowledge Discovery, Second Edition, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, pages 291–324. CRC Press, May 2009.
[GGRJ] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications.
[Gol10] S. Golder. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. pages 1–10, 2010.
[Han06] J. Han. Data Mining: Concepts and Techniques (2nd Edition), 2006.
[HK] A. Hinneburg and D. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise.
[KH] G. Karypis and E. Han. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
[KR08] L. Kaufman and P. Rousseeuw. Introduction. In Finding Groups in Data, pages 1–67. John Wiley & Sons, Inc., 2008.
[LAR07] A. Lopes, R. Pinho, F. Paulovich, and R. Minghim. Visual text mining using association rules. Computers & Graphics, 31(3):316–326, June 2007.
[Lee12] C. Lee. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications, 39(10):9623–9641, August 2012.
[LMS11] W. Loh, S. Mane, and J. Srivastava. Mining temporal patterns in popularity of web items. Information Sciences, 181(22):5010–5028, November 2011.
[LYCW11] C. Lee, H. Yang, T. Chien, and W. Wen. A Novel Approach for Event Detection by Mining Spatio-temporal Information on Microblogs. 2011 International Conference on Advances in Social Networks Analysis and Mining, pages 254–259, July 2011.
[MRS08] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[Ng] R. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. pages 144–155.
[PMMR13] B. Pescosolido, J. Martin, J. McLeod, and A. Rogers. Handbook of the Sociology of Health, Illness, and Healing: A Blueprint for the 21st Century. Handbooks of Sociology and Social Research. Springer, 2013.
[RKT11] A. Rangrej, S. Kulkarni, and A. Tendulkar. Comparative study of clustering techniques for short text documents. Proceedings of the 20th International Conference Companion on World Wide Web - WWW '11, page 111, 2011.
[RLW12] H. Ryu, M. Lease, and N. Woodward. Finding and exploring memes in social media. Proceedings of the 23rd ACM Conference on Hypertext and Social Media - HT '12, page 295, 2012.
[RS99] J. Roddick and M. Spiliopoulou. A bibliography of temporal, spatial and spatio-temporal data mining research. ACM SIGKDD Explorations Newsletter, 1(1):34–38, June 1999.
[RS02] J. Roddick and M. Spiliopoulou. A Survey of Temporal Knowledge Discovery Paradigms and Methods. IEEE Transactions on Knowledge and Data Engineering, 14(4):750–767, 2002.
[THH01] A. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. In Data Engineering, 2001. Proceedings. 17th International Conference on, pages 359–367, 2001.
[Twi13a] Twitter. FAQs About Retweets (RT). Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/77606-faqs-about-retweets-rt, accessed 31 January 2013. 2013.
[Twi13b] Twitter. FAQs about Tweet Location. Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/78525-faqs-about-tweet-location, accessed 31 January 2013. 2013.
[Twi13c] Twitter. Twitter API - Documentation. Available at https://dev.twitter.com/docs, accessed 31 January 2013. 2013.
[Twi13d] Twitter. What Are Hashtags ("#" Symbols)? Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/49309-what-are-hashtags-symbols, accessed 31 January 2013. 2013.
[Twi13e] Twitter. What Are @Replies and Mentions? Available at https://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/14023-what-are-replies-and-mentions, accessed 31 January 2013. 2013.
[WM97] W. Wang and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. 1997.