Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering...

16
arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of online popularity from time series Mert Ozer 1[123456789012] , Anna Sapienza 2 , Andr´ es Abeliuk 2 , Goran Muric 2 , and Emilio Ferrara 2 1 School of Computing, Informatics, and Decision Science Engineering, Arizona State University, Tempe, USA [email protected] 2 Information Sciences Institute, University of Southern California, Marina del Rey, USA {annas,gmuric,aabeliuk,ferrarae}@isi.edu Abstract. How is popularity gained online? Is being successful strictly related to rapidly becoming viral in an online platform or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil popularity of online content? To answer these questions, we leverage a multi-faceted temporal analysis of the evolution of popular online contents. Here, we present dipm-SC: a multi-dimensional shape-based time-series clustering algorithm with a heuristic to find the optimal number of clusters. First, we validate the ac- curacy of our algorithm on synthetic datasets generated from benchmark time series models. Second, we show that dipm-SC can uncover mean- ingful clusters of popularity behaviors in a real-world Twitter dataset. By clustering the multidimensional time-series of popularity of contents coupled with other domain-specific dimensions, we uncover two main patterns of popularity: bursty and steady temporal behaviors. Moreover, we find that the way popularity is gained over time has no significant impact on the final cumulative popularity. Keywords: multidimensional time series, shape-based clustering, online popularity, social media 1 Introduction How do contents get popular on the Web? Is being successful, strictly constrained to a rapid and bursty virality [1] or is it possible to acquire popularity in a steady and disciplined fashion? What are other temporal characteristics that can unveil popularity of online contents? Temporal dynamics of popularity have been studied extensively in Twit- ter [41], news [24], and videos [9]. Thus far, previous work studying popularity ei- ther explored topological patterns in the networks that lead to cascades [4, 25], or focused on the short time period in which significant popularity is gained [43, 23]. However, popularity is subject to endogenous and exogenous factors [23] that

Transcript of Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering...

Page 1: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

arX

iv:1

904.

0499

4v1

[cs

.LG

] 1

0 A

pr 2

019

Discovering patterns of online popularity from

time series

Mert Ozer1[1234−5678−9012], Anna Sapienza2, Andres Abeliuk2, Goran Muric2,and Emilio Ferrara2

1 School of Computing, Informatics, and Decision Science Engineering, Arizona StateUniversity, Tempe, USA

[email protected] Information Sciences Institute, University of Southern California, Marina del Rey,

USA{annas,gmuric,aabeliuk,ferrarae}@isi.edu

Abstract. How is popularity gained online? Is being successful strictlyrelated to rapidly becoming viral in an online platform or is it possible toacquire popularity in a steady and disciplined fashion? What are othertemporal characteristics that can unveil popularity of online content? Toanswer these questions, we leverage a multi-faceted temporal analysisof the evolution of popular online contents. Here, we present dipm-SC:a multi-dimensional shape-based time-series clustering algorithm with aheuristic to find the optimal number of clusters. First, we validate the ac-curacy of our algorithm on synthetic datasets generated from benchmarktime series models. Second, we show that dipm-SC can uncover mean-ingful clusters of popularity behaviors in a real-world Twitter dataset.By clustering the multidimensional time-series of popularity of contentscoupled with other domain-specific dimensions, we uncover two mainpatterns of popularity: bursty and steady temporal behaviors. Moreover,we find that the way popularity is gained over time has no significantimpact on the final cumulative popularity.

Keywords: multidimensional time series, shape-based clustering, onlinepopularity, social media

1 Introduction

How do contents get popular on the Web? Is being successful, strictly constrainedto a rapid and bursty virality [1] or is it possible to acquire popularity in a steadyand disciplined fashion? What are other temporal characteristics that can unveilpopularity of online contents?

Temporal dynamics of popularity have been studied extensively in Twit-ter [41], news [24], and videos [9]. Thus far, previous work studying popularity ei-ther explored topological patterns in the networks that lead to cascades [4, 25], orfocused on the short time period in which significant popularity is gained [43, 23].However, popularity is subject to endogenous and exogenous factors [23] that

Page 2: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

2 M. Ozer et al.

affects its temporal dynamics. For instance, popularity of an item may dependon the amount of external promotions it receives [35]. Analyzing different tempo-ral patterns that arise when considering both endogenous and exogenous factorswill further extend our understanding on temporal collective behavior.

To take into account all these factors at a time and tackle the problem ofuncovering their relation with the dynamics of popularity, we focus on multidi-mensional clustering methods. Multidimensional clustering can indeed allow usto uncover the hidden relations between time series of events having differentevolutions in more than one dimension. For instance, the dynamics of a user con-tent can be not only described by its popularity but also by the online activityand pace of the user, number of social connections in social networks.

A wide range of studies have been focused on solving the problem of efficientlyclustering time series [2]. Some approaches have adopted methods such as sub-sequence techniques [38, 46, 47], and dimensionality reduction models [45, 7].However, we aim at uncovering different patterns of popularity on the basis ofboth their shapes and the time at which the main popularity gain is achieved.Therefore, sub-sequence or subspace solutions are not suitable for this analysisas they lose temporal granularity and order. In this paper, we take inspirationfrom unidimensional clustering approaches [43, 30, 33, 12] that use a shape-baseddistance and a time series similarity measure that is scale and shift invariant.We generalize these methods to multidimensional time series and parametrizedthe temporal shifting, thus allowing us to capture temporal patterns occurringat different times.

In the present work, we design a multi-faceted temporal analysis to studythe timeline of popular online content and provide insights on how popularityis gained over time. To this aim, we extend K-Spectral Centroid (kSC) [43],a unidimensional shape-based time series clustering algorithm into our multi-dimensional version, called dipm-SC. It partitions multidimensional time seriesdata into clusters and uses a heuristic to find the optimal number of clusters.Each cluster is characterized by a centroid, having a unique shape per dimensionidentifying the average temporal behavior of the cluster members.

We validate the accuracy of our algorithm on the datasets generated frombenchmark time series models with varying parameters of time series length,dimensions and number of underlying clusters. To the best of our knowledge,this is the first study which explores multidimensional use of k-SC.

Then, with the help of the dipm-SC, we study temporal multidimensionalbehavior in two real world large datasets containing data from GitHub andTwitter. In Github, we study the popularity of repositories considering a seriesof other temporal dimensions that keep track of the repository development, suchas pushes and pull requests a repository receives, and repository issues, such asissue entries and comments under them. Our algorithm identifies 9 main clustersamong most popular repositories on GitHub in different shapes. In Twitter, westudy how popular hashtags unfold in a window of time where a major eventhappens. We keep track of number of times a hashtag used hourly (as a measureof popularity) alongside with hourly positive and negative sentiment rates and

Page 3: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 3

tweets containing the hashtag posted by bot accounts. Our algorithm identifies6 main clusters among most popular hashtags on Twitter. We show that bothbursty and steady clusters are evident in both datasets. On Twitter, when abursty popularity gain is present, we observe increased sentiment and bot activityrates around the popularity spike. We also note that sentiment and bot activitydecay after the spike can be characterized by the fat-tail of the popularity spike.Moreover, we find that how repositories unfold over time shows no significantimpact on the cumulative popularity at the end of their lifetime. However, inthe Twitter data, our algorithm is able identify different cluster shapes leadingto different cumulative popularity gains.

Our contributions

1. We propose a novel algorithm to cluster multidimensional shape-based timeseries. Our approach detect the interplay of different dimensions that allowto perform a non-supervised classification of different temporal patterns.

2. We validate the accuracy of our algorithm on the datasets generated frombenchmark time series models, and compare its performance with other scal-able multidimensional time series methods.

3. We present two real-world case studies, where we showcase the effective-ness of our algorithm in both finding meaningful popularity dynamics andproviding clusters that are coherent and easy to interpret.

2 Methodology

In this section, we formally define the problem and discuss the need for exten-sions to time series clustering algorithms in the literature. Then, we develop ourmultidimensional extension based on a several unidimensional shape-based timeseries clustering methods.

2.1 Problem Definition

In this work, we aim at finding temporal patterns of popular online content in amulti-faceted manner. Let N be the number of entities to be studied describedby D temporal features. Each temporal feature is represented by a discrete timeseries in M time units (e.g., days or hours). This representation of our data canbe encoded in a tensor X of size N ×D ×M .

Given X , we want to partition the N multivariate time series into an optimalnumber ofK clusters, by taking the temporal dynamics of differentD dimensionsinto account at the same time. To this aim, we perform a clustering process basedon the cumulative similarity of shapes among corresponding dimensions of thetime series. Therefore, each cluster k is represented by a multi-dimensional shapeck ∈ R

D×M , which describes the overall behavior of the cluster across multipledimensions. In this work, solving the aforementioned problem gives us the abilityto uncover similar patterns of popularity gain while differing in other dimensions,and vice versa.

Page 4: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

4 M. Ozer et al.

2.2 Proposed Algorithm

The kSC algorithm proposed by Yang and Leskovec [43] stands out as a promi-nent solution for analyzing time series data constructed from online human be-haviors. Along that line, there have been significant contributions to the shape-based time series clustering literature [30, 31]. Here, we extend and compare thefollowing shape-based univariate time series clustering algorithms: kShape[30],kDBA[33], kAVG+ED[12], and kSC[43]. We only report the details about themethodology used to extend kSC[43] as it is the only algorithm specifically de-veloped for online social platforms and, as shown in the supplementary material3, is the one performing better in our synthetic scenario. However, we appliedsimilar extensions to the kShape, kDBA, and kAVG+ED algorithms.

Our extension focuses on two parts. First, we modify the existing kSC algo-rithm to be applicable to multidimensional time series data. This process consistsof the two steps of updating the distance and averaging functions of kSC. Second,we develop an iterative framework to find the optimal number of multidimen-sional shape clusters inspired by a similar work developed for kMeans [20].

kSC framework K-Spectral Centroid (kSC) is an iterative kMeans based timeseries clustering algorithm. The algorithm iteratively assigns time series data toclusters and update cluster centroids until no updates to clusters is made. Itadopts a scale and shift invariant distance function [8] to measure the proximityof two time series. Therefore, each cluster’s centroid represents the average shapeof its members. At each iteration, the method uses a spectral clustering techniqueas its averaging function to update the centroids. However, as it is originallyproposed and designed for unidimensional time series data, we need to extendits distance and averaging functions in order to apply it to a multidimensionalscenario.

Extending the Distance Function There are several ways to extend theunivariate distance function to its multivariate version. First, we can computedistances separately on each dimension and sum them. However, this option isnot optimal, as usual time series distances involve the manipulation of the signalthrough shifting and warping process. Thus, the use of separate distances wouldlead to different manipulations (shifting and warping) in each dimension. As ourtask requires to observe their interplay instead, manipulating each dimensiondifferently is not desirable. Second, we can compute distances by using an optimalshared manipulation across all dimensions. Here, we propose extending the kSCalgorithm’s distance function [8] as follows

dist(ck,x) = minαd,q

1

D

D∑

d=1

‖ck(d, :)− αdxq(d, :)‖

‖ck(d, :)‖, (1)

3 www.public.asu.edu/~mozer/dipm-SC

Page 5: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 5

where x(d, :) is the d’th dimension of time series x ∈ RD×M , xq is the shifting

of x by q steps in time, and ck is the centroid of cluster k. The above equationconstitutes the essential part of calculating the distance of multivariate timeseries x to a centroid. It indeed enforces the same shifting parameter q regardlessof dimensions. Yet, it keeps the scaling parameter αd unique for each dimension.Here, we assume that even if each dimension may experience trends in differentscales, they should have the same temporal order.

Notice that, for any fix q, the optimal αd(q) that solves Eq. (1) is

αd(q) =xq(d, :)

T ck(d, :)

‖xq(d, :)‖2 , (2)

by setting the gradient to zero while q is fixed. Thus, minimizing the distancefunction is reduced to a brute-force searching of the optimal q, where αd iscalculated for each shifting option as in Eq. (2). The algorithm can becomereally slow when full-length shifting is allowed. As we are interested in temporalbehavior of online content, we focus on moments in time when the events happen.Therefore, we aim to keep the shifting parameter to the minimum.

Extending the Averaging Function As a second step in our method, weneed to extend the averaging function of kSC. The role of the averaging functionis to find a suitable centroid for each cluster, where the distance of its membersis minimum to the centroid

ck = argminc

x∈Sk

dist(c,x)2 ,

where Sk is the set of time series belonging to cluster k.A possible way of extending the averaging function consists of finding a uni-

variate centroid which minimizes the distance to every dimension. We avoid thisoption since there may be different behaviors on different dimensions and ourtask involves observing them in parallel. Instead, we extract a unique shapefor each dimension separately. First, we replace the distance function with itsmultivariate version

ck = argminc

x∈Sk

minαd,q

1

D2

D∑

d

‖c(d, :) − αdxq(d, :)‖2

‖c(d, :)‖2 . (3)

To be able to use the spectral properties of Rayleigh quotient in accordancewith the original kSC paper[43], we first need to find the optimal αd and q. αd

has a direct solution as proposed in Eq. (2). However q is dimension invariant.Here we set q as the optimal q found during the distance’s computation, andjointly shift every dimension of the time series x accordingly during the newcentroid calculation. Finally, we invert the order of the two sums in Eq. (3) andget

Page 6: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

6 M. Ozer et al.

ck = argminc

1

D2

D∑

d

x∈Sk

‖c(d, :) − αdx(d, :)‖2

‖c(d, :)‖2 . (4)

Our minimization problem does not have dimension independent variablesanymore and we can solve each dimension’s centroid separately by followingsimilar steps to the univariate case

ck(d, :) = argminc′

x∈Sk

‖αdx(d, :) − c′‖2

‖c′‖2

= argminc′

1

‖c′‖2

x∈Sk

x(d, :)T c′x(d, :)

‖x(d, :)‖2− c′

2

= argminc′

1

‖c′‖2

x∈Sk

(

x(d, :)x(d, :)T

‖x(d, :)‖2 − I

)

c′

2

= argminc′

1

‖c′‖2 c

′T∑

x∈Sk

(

I −x(d, :)x(d, :)T

‖x(d, :)‖2

)

c′ ,

and by replacing∑

x∈Sk(I−

x(d, :)x(d, :)T

‖x(d, :)‖2 ) with M , we obtain the following

minimization problem

ck = argminc′

c′TMc′

‖c′‖2,

which achieves its minimum value with the eigenvector of M correspondingto the smallest eigenvalue. We refer further interested readers to the propertiesof the Rayleigh quotient with parameters c′ and M [16].

After successfully constructing the two building blocks of kSC for our multi-variate case, we present the pseudo code of our iterative algorithm multivariate

kSC (m-kSC), cf. Alg. 1.

Finding the optimal K As with many kMeans based approaches, m-kSC alsoneeds the number of clusters as a parameter. A common practice to find theoptimal K involves post-processing, where we run the clustering algorithm sev-eral times for different k values. In this case, a trade off between the final lossfunction value and model complexity decides the optimal K. However, to avoidre-running the algorithm for every possible k, here we adopt dip-dist, the iter-ative strategy proposed by Kalogeratos and Likas [20]. We choose the dip-disttechnique over others [32, 18] as the only assumption dip-dist makes is the uni-modality of pairwise distances of cluster members. It suggests that an optimal

Page 7: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 7

Algorithm 1 m-kSC Algorithm

Input:{X , K} where X ∈ RN×D×M is the tensor containing N multivariate time

series and K is number of clusters.Output: {C, S} where C ∈ R

K×D×M is the tensor of cluster centroids and S

contains each cluster assignments.1: Initialize cluster assignments S randomly2: while S changes on every iteration do

3: for k = 1 : K do

4: for d = 1 : D do

5: M =∑

xn∈Sk(I −

xn(d, :)xn(d, :)T

‖xn(d, :)‖2

)

6: C(k, d, :) = Smallest eigenvector of M .7: end for

8: end for

9: for n = 1 : N do

10: k = argmink=1,...,K

dist(ck,xn) using Eq. 1

11: S(n) = k

12: end for

13: end while

cluster structure should not have more than a single mode among pairwise dis-tances of its members (multimodality). The density of pairwise distance valuesshould reach to maximum around the mode and decay while moving away.

Multimodality of each cluster is checked by its members’ distance to eachother. First, we calculate the distance of each time series to the other membersin the cluster. Then, we sort the distance vector in decreasing order. If the nullhypothesis of unimodality in the distribution of the sorted distances is rejectedby Hartigan’s dip test [19] with a significance level p-value< α, it is consideredto be a splitter in the cluster. Finally, if the number of splitters is higher than agiven threshold v, we split the cluster into two disjoint sub-clusters. Our splittingstrategy involves a local search with m-kSC where k is set to 2. Heuristically,we choose the least squared error objective function centroids over 10 runs.We iteratively split clusters until there is no clusters left with ratio of splittersgreater than the given threshold. As suggested by Kalogeratos and Likas [20], ateach iteration, we only split the cluster with maximum number of splitters. Wereport our complete dip-test based multidimensional spectral centroid algorithmdipm-SC, cf. Alg. 2. We refer further interested readers to Kalogeratos and Likas[20] for kMeans extension and to Hartigan and Hartigan [19] for the Hartigan’sdip test. We make the source code and the experimental settings available atwww.public.asu.edu/~mozer/dipm-SC/dipm_source_code.zip.

The computational cost of m-kSC is dominated by the calculation of matrixM for each xn and for d dimensions (O(m2dn)) and the eigendecomposition ofM for k clusters (O(m3dk)). Multimodality test in dipm-SC for each cluster khas a complexity of O(k(bn logn+n2)). So, the total complexity of the dipm-SCbecomes the maximum of the {O(m2dn),O(m3dk),O(k(bn logn+ n2))}.

Page 8: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

8 M. Ozer et al.

Algorithm 2 dipm-SC Algorithm

Input:{X , α, s} where α is the significance level, and v is the split threshold.Output: {C∗, S∗}.

1: Assign time series to one cluster S = ones(N, 1)2: for i = 1 : N do

3: for j = 1 : N do

4: d(i, j) = dist(X (i, :, :),X (j, :, :)) Equation 15: end for

6: end for

7: while max(scorek) > v do

8: for k = 1 : K do

9: scorek = checkClusterModality(d(Sk, Sk), Sk, α)10: end for

11: if max(scorek) > v then

12: k = argmaxk

(scorek)

13: [c1, c2] =splitCluster(X (Sk, :, :))14: C(k, :, :) = c115: C(K + 1, :, :) = c216: K = K + 117: [C, S] =m-kSC(X ,K, C)18: end if

19: end while

3 Evaluation

We evaluate the results of our algorithm applied to one synthetically generatedand two real-world multivariate time series datasets: GitHub and Twitter. We re-fer interested readers to https://www.public.asu.edu/~mozer/dipm-SC/supp_material.pdffor the comparison of the dipm-SC with extension of other shape-based time se-ries clustering algorithms on synthetically generated time series dataset and theGitHub analysis. For the sake of brevity of this paper, here we present our ex-perimental results for Twitter hashtag dataset. First, we perform the qualitativeanalysis by studying different behavioral patterns of the detected multivariatecentroids. Secondly, we characterize the clusters based on their ”periodicity”,locating them on a scale from periodic to viral. We expect that periodic timeseries will behave steadily and that the viral time series will exhibit significantspikes over short time periods. To this aim, we use two previously introducedmetrics; burstiness and memory [15]. These two metrics are computed from thedistribution of the inter-event time P (τ) between consecutive events of time se-ries data. In particular, given the inter-event time distribution P (τ) having meanmτ and standard deviation στ , we calculate burstiness as

B =στ −mτ

στ +mτ

(5)

where B ∈ [−1, 1]: B = 1 corresponds to the most bursty sequence of events,and B = −1 is a completely regular (periodic) sequence.

Page 9: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 9

This measure however is not enough to characterize the correlations tak-ing place between consecutive events. Correlation of consecutive events can bethought as a memory process. A measure that can be used to compute suchmemory coefficient between consecutive events (τi, τi+lag) is defined as

M =1

nτ − 1

nτ−1∑

i=1

(τi −m1) (τi+lag −m2)

σ1σ2(6)

where, nτ is the number of intervals between events in the time series, while m1,2

and σ1,2 are respectively the sample mean and standard deviation of τi,i+lag ’s.We set the value of lag based on the empirically observed value of periodicityfor each case.

4 Datasets description

We collected tweets starting from February 14th until March 6th, 2018 relatedto the Parkland school shooting event from GNIP. To this aim, we provided140 terms (hashtags, words, phrases) related to the event and the broader gundebate in the United States.4 GNIP provides any tweet that contains at leastone of the queried terms, while not suffering from any bias implications of freeAPI streams [29]. Our data contains 23.7M tweets coming from 3.7M users.

We then identify the top 1, 000 most occurring (popular) hashtags in ourdataset and build hourly time series for each of them. Alongside with hourlycounts of each hashtag, we construct the hourly ratio of tweets sent by botaccounts to the total hourly volume. To identify the most obvious bot accounts,we used the free API provided by Botometer [10]. We tested about 5% of themost active accounts (193K approximately) and found that 19K of them arelikely bots—in line with other research on bot frequency [39].

As the third and fourth dimensions, we consider the hourly rate of positiveand negative sentiment tweets. We use an off-the-shelf short text classificationtool SentiStrength [37] to detect positive and negative sentiment of each tweetwhere hashtags occur. SentiStrength provides scores for positive and negativesentiments separately. We aggregate each sentiment’s score on an hourly basisby simply summing them. Finally, we build the hourly sentiment ratio time seriesfor each hashtag by normalizing the sentiment volumes by total tweet volumes.

Therefore, the timeline of each hashtag over time is represented by 4 separatedimensions: total count, bot tweet rate, positive sentiment rate, and negativesentiment rate.

5 Popularity Gain Patterns on Twitter

In this section we present Twitter hashtag timeline analysis with the help ofmultidimensional shape clusters identified by our dipm-SC algorithm. We use 24hours as our shifting parameter. We also apply gaussian smoothing of window

4 https://www.public.asu.edu/~mozer/dipm-SC/GNIP_query_list.txt

Page 10: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

10 M. Ozer et al.

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

139% of hashtags

#NRA,#2A,#MAGA,#Trump,#SecondAmendment,#2ADefenders,#Resist,#tcot

total countbot tweet rate(+) sent rate(-) sent rate

(a) Cluster-P

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

117% of hashtags

#Parkland,#FloridaSchoolShooting,#Florida,#FloridaShooting,#SandyHook,#EndGunViolence,#NRAKills,#ParklandSchoolShooting

(b) Cluster-B1

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

16.4% of hashtags

#EEUU,#NRAbloodmoney,#MentalHealth,#floridashooting,#PrayForParkland,#policyandchange,#thoughtsandprayers,#PrayforDouglas

(c) Cluster-B2

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

18% of hashtags

#protectourchildren,#StudentsDemandAction,#StudentsStandUp,#ParklandStudentsSpeak,#CNNTownHall,#oneless,#standwiththekids,#neveragainmovement

(d) Cluster-B3

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

13.9% of hashtags

#NeverSimpliSafe,#NeverLifeLock,#NeverBudget,#NeverAvis,#NeverEnterprise,#BoycottStarbucks,#NeverHertz,#stopNRAmazon

(e) Cluster-Boycott1

100 200 300 400Timeline (in hours)

0

0.2

0.4

0.6

0.8

13.4% of hashtags

#BoycottNRA,#BoycottNRASponsors,#AmazonPrime,#WayneLaPierre,#BoycottDelta,#SaturdayMorning,#Amazon,#StopNRAmazon

(f) Cluster-Boycott2

Fig. 1: Shape of the uncovered clusters of Twitter

size 24 to eliminate noise. In total we are able to identify 13 clusters in differenttimeline shapes. Here we present 6 clusters which spans nearly 78% percent ofpopular hashtags under study. We leave other 7 smaller clusters reachable exter-nally for brevity of the paper (http://www.public.asu.edu/~mozer/dipm-SC).We report these 6 most popular timeline shapes in Fig. 1.

First, we observe that cluster-P in Figure 1a is significantly different in shapewhen compared with others. In particular, it does not show any burst of activityin each of the four dimensions considered (popularity, sentiment ratios, and bottweet ratio). On the contrary, this cluster shows a steady pattern with dailyperiodic activities.

On the other hand, the remaining 5 clusters are characterized by a suddenspike in the popularity signal at a certain point in time and different trendsamong the other dimensions. The popularity burst of Cluster-B1 and B2 occursat the same time. Spike in popularity gain considerably differs in the tail part.Cluster-B1 shows a longer tail while Cluster-B2 dies off sooner. Cluster-B2 tem-poral pattern shows not only a decreasing trend in popularity but also a fasterdecay in both sentiment and bot ratios then the one we observe in cluster-B1.We notice an analogous discrepancy among the last two clusters (i.e., Boycott1and Boycott2).

Finally, we analyze the temporal coherence of hashtags clustered together.We note that our algorithm does not consider semantics while clustering timeline

Page 11: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 11

of hashtags. The detected clusters are indeed identified because they share sim-ilar temporal patterns among different dimensions. We report the top 8 closesthashtags to the cluster centroids in the top part of each panel in Figure 1.

Cluster-P contains broader issue related hashtags such as #NRA, #2A(secondamendment), #MAGA, #Trump, which show steady and periodic behavioramong dimensions. Cluster B1 and B2 exhibits spikes around the time whenthe tragic shooting takes place. In cluster-B1 we observe event-related long last-ing hashtags, while in cluster-B2 we observe event-related expiring hashtags suchas #prayfor variants, #MentalHealth. Cluster B3 is unique in terms of its spiketime when student survivors start the #neveragain movement and appear ina town hall organized by CNN. Lastly, we notice two boycott related hashtagsdominated clusters spiking in popularity gain around the same time. It overlapswith the time of students’ call to boycott NRA and companies that shares fi-nancial interests with it. It is easy to see these two clusters differing in their tailparts. Cluster-Boycott2 (longer popularity gain tail) shows more NRA & Ama-zon related hashtags while Cluster-Boycott1 involves other companies names.We will later present statistically significant difference in cumulative popularitygains between these two clusters.

5.1 Interplay of Dimensions

The key piece of information we acquire through this analysis is how popularityspike of a hashtag correlates with increased sentiment ratios and bot involvement.For every bursty popularity gain, we observe an increase in all other dimensions.Previous studies on bursty attention mechanisms of online content usually focuson the characteristic of the fat-tail [44, 23, 34]. Here, we observe that bothsentiment and bot tweet ratios stay more steady with a longer tail of popularity(e.g., B1, B3, and Boycott2) than dying ones (e.g., B2, Boycott1). We evidenceit by fitting a linear curve and measuring the average slope of cluster members.When compared, hashtags belonging to B1 have a higher slope than the onesin B2 in sentiment and bot involvement dimensions after the spike. We observethe same pattern between Boycott2 and Boycott1 clusters, where Boycott2 hasa higher slope in all three dimensions after the spike.

Another interesting behavior we direct attention to is the sudden increase inbot tweet ratio after the popularity spike in cluster B2, B3 and Boycott1. Theseclusters are characterized by a weaker tail in popularity gain after a burst in pop-ularity. This signals a steady bot involvement for a while although abandonmentof the overall activity takes place.

5.2 Burstiness and Memory of Clusters

First, we quantify popularity gain’s temporal behavior with the described bursti-ness and memory metrics. We present average burstiness and memory of eachcluster in Figure 2a. As expected, cluster-P has the lowest burstiness while en-joying higher values of memory. Lowest memory and highest burstiness belongsto Boycott1 cluster.

Page 12: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

12 M. Ozer et al.

0 0.05 0.1 0.15 0.2Memory

0.1

0.2

0.3

0.4

0.5

0.6

Bur

stin

ess

B1

B2Boyc1

P

Boyc2 B3

(a) (b)

Fig. 2: Burstiness and Memory of Clusters. Figure (a): Burstiness and memoryof Twitter clusters. Figure (b): Cumulative popularity of cluster members at theend of the data timeline. Each blue dot represents a hashtag.

Next, we investigate if different multi-dimensional timeline shapes lead tosignificant difference in hashtags’ cumulative popularity gain at the end of thetime window we observe. Since this analysis is based on the timeline of a hashtagrather than a lifespan, we do not align hashtags in their creation date. That iswhy, for example, comparing Cluster-B1 against Cluster-B3 is not a meaning-ful test. However, it is noteworthy that two-sample Kolmogorov-Smirtnov testrejects the null hypothesis for Cluster-Boycott1 and Cluster-Boycott2 of whichwe know the beginning of campaigns. Differing distributions for these two shapeclusters can be also observed from Figure 2b. Moreover, it is also appealing tosee no cumulative popularity distribution difference between hashtags of steadycluster P and bursty cluster B1 (p-value 0.056).

6 Related Work

Our work overlaps with the current literature among two branches. First is thetask of clustering time series, and the second is the task of uncovering temporalpatterns of online content.

Clustering time series has been a challenging task and research area thatproduced vast amount of literature [2]. Usually, when external feature learningor modeling [42, 26, 3] is not imposed, clustering of time series involves choos-ing a proper time series distance function and an algorithm. Numerous workshave introduced various distance measures to calculate proximity of time seriesdata [40, 11]. There have been partitional [30, 43, 33, 12], subsequence match-ing [6, 22], hierarchical [21] approaches using these different time series distancemeasures. Moreover, a major body of work [38, 47] exists in subsequence match-ing based time series clustering where they identify shorter most identifyingportions of time series data also known as shapelets to group them. For themultivariate time series data, same categorizations can be made as modelingbased [17, 14], and variants of generalizing univariate solutions to multivariate

Page 13: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 13

cases [36, 13]. Our approach falls into the second category where we extendexisting distance functions available for univariate time series data and updatecentroid finding (i.e. averaging function) accordingly.

As the second line of work from literature, we compare our effort with analysisof temporal patterns of online content. There is significant amount of literaturefocusing on characterizing and modeling bursts and long tail dynamics [5, 27, 23].First and foremost, our study does not focus on modeling, rather it finds clustersof shapes inherent to online temporal behavior. However, we show that ourcluster shapes can be identified distinctively by their burstiness and periodicitythrough postprocessing. Our work also differentiates itself from others [28, 27, 43]by analyzing lifetime or a longer timeline of online entities rather than focusingon a window where activity peaks occur and omitting the rest.

7 Conclusion

In the present work, we set to study the temporal patterns that lead a user’s con-tent to gain popularity in online platforms. In particular, our study tackled thefollowing problems: (i) uncovering the different temporal patterns of popularityin online platforms; (ii) studying how different dimensions’ interplay; and (iii)proposing dipm-SC, a novel algorithm that extends the state-of-the-art modelsallowing to cluster multidimensional time series of events at a time.

First, we compared our method with other extensions of univariate time seriesclustering algorithms on synthetic data, where we successfully demonstrated itshigher accuracy. We then applied our framework on the multivariate time seriesdata deriving from two major online platforms: GitHub and Twitter. Throughthese applications, we showcased the efficacy of our method in uncovering fun-damental pattern of popularity in online contexts. Our method can indeed beused not only to find the overall difference between time series shapes, such asbursty versus steady behaviors, but it can also uncover differences in these twomain trends depending on the time in which popularity is acquired. Moreover,the results provided by our approach are easily interpretable and sheds lighton analyzing the interplay of multiple dimensions. In the Twitter scenario, wefound that the uncovered clusters are temporally coherent in the hashtags usedand related to certain types of events, such as the Parkland school shooting andboycott campaigns.

In conclusion, we devised a methodology that extend the state-of-the-artliterature in the area of multivariate time series clustering and uses a similaritymetric based on the shape of the time series. Moreover, our method can be usedboth to uncover fundamental patterns based solely on the shapes of the timeseries and on the temporal occurrence of the events. Our approach also providea way of studying popularity of online content and understanding their dynamicsover time.

Acknowledgements The authors are grateful to the Defense Advanced Re-search Projects Agency (DARPA), contract W911NF-17-C-0094, for their sup-port.

Page 14: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Bibliography

[1] Adar, E., Zhang, L., Adamic, L.A., Lukose, R.M.: Implicit structure and thedynamics of blogspace. In: Workshop on the weblogging ecosystem. vol. 13,pp. 16989–16995 (2004)

[2] Aghabozorgi, S.R., Shirkhorshidi, A.S., Teh, Y.W.: Time-series clustering -a decade review. Inf. Syst. 53, 16–38 (2015)

[3] Alon, J., Sclaroff, S., Kollios, G., Pavlovic, V.: Discovering clusters in motiontime-series data. In: 2003 IEEE Computer Society Conference on ComputerVision and Pattern Recognition, 2003. Proceedings. vol. 1 (June 2003)

[4] Bakshy, E., Hofman, J.M., Mason, W.A., Watts, D.J.: Everyone’s an in-fluencer: quantifying influence on twitter. In: Fourth ACM internationalconference on Web search and data mining. pp. 65–74. ACM (2011)

[5] Barabasi, A.L.: The origin of bursts and heavy tails in human dynamics.Nature 435, 207–211 (2005)

[6] Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subse-quence algorithms. In: Proceedings of the Seventh International Symposiumon String Processing Information Retrieval (SPIRE’00). pp. 39– (2000)

[7] Chakrabarti, K., Keogh, E., Mehrotra, S., Pazzani, M.: Locally adaptivedimensionality reduction for indexing large time series databases. ACMTrans. Database Syst. 27(2), 188–228 (Jun 2002)

[8] Chu, K.K.W., Wong, M.H.: Fast time-series searching with scaling and shift-ing. In: Proc. of the 18th ACM SIGMOD-SIGACT-SIGART Symposiumon Principles of Database Systems. pp. 237–248. PODS ’99, ACM (1999)

[9] Crane, R., Sornette, D.: Robust dynamic classes revealed by measuring theresponse function of a social system. Proceedings of the National Academyof Sciences 105(41), 15649–15653 (2008)

[10] Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: Botornot:A system to evaluate social bots. In: Proceedings of the 25th InternationalConference Companion on World Wide Web. pp. 273–274 (2016)

[11] Ding, H., Trajcevski, G., Scheuermann, P., Wang, X., Keogh, E.: Queryingand mining of time series data: Experimental comparison of representationsand distance measures. Proc. VLDB Endow. 1(2), 1542–1552 (Aug 2008)

[12] Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast subsequencematching in time-series databases. In: Proc. of the 1994 ACM SIGMODInternational Conference on Management of Data. pp. 419–429 (1994)

[13] Fontes, C.H., Pereira, O.: Pattern recognition in multivariate time series acase study applied to fault detection in a gas turbine. Engineering Appli-cations of Artificial Intelligence 49, 10 – 18 (2016)

[14] Ghassempour, S., Girosi, F., Maeder, A.: Clustering multivariate time se-ries using hidden markov models. International Journal of EnvironmentalResearch and Public Health 11(3), 2741–2763 (2014)

[15] Goh, K.I., Barabasi, A.L.: Burstiness and memory in complex systems. EPL(Europhysics Letters) 81(4), 48002 (2008)

Page 15: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

Discovering patterns of online popularity from time series 15

[16] Golub, G.H., van Loan, C.F.: Matrix Computations. JHU Press, fourthedn. (2013)

[17] Hallac, D., Vare, S., Boyd, S., Leskovec, J.: Toeplitz inverse covariance-based clustering of multivariate time series data. In: Proceedings of the23rd ACM SIGKDD International Conference on Knowledge Discovery andData Mining. pp. 215–223 (2017)

[18] Hamerly, G., Elkan, C.: Learning the k in k-means. In: Proceedings of the16th International Conference on Neural Information Processing Systems.pp. 281–288. NIPS’03, MIT Press, Cambridge, MA, USA (2003)

[19] Hartigan, J.A., Hartigan, P.M.: The dip test of unimodality. The Annals ofStatistics 13(1), 70–84 (1985)

[20] Kalogeratos, A., Likas, A.: Dip-means: An incremental clustering methodfor estimating the number of clusters. In: Proceedings of the 25th Inter-national Conference on Neural Information Processing Systems. pp. 2393–2401. NIPS’12, USA (2012)

[21] Kaufman, L., Rousseeuw, P.: Finding Groups in Data: an Introduction toCluster Analysis. Wiley (2008)

[22] Keogh, E., Lin, J.: Clustering of time-series subsequences is meaningless:Implications for previous and future research. Knowl. Inf. Syst. 8(2), 154–177 (Aug 2005)

[23] Lehmann, J., Goncalves, B., Ramasco, J.J., Cattuto, C.: Dynamical classesof collective attention in twitter. In: Proceedings of the 21st internationalconference on World Wide Web. pp. 251–260. ACM (2012)

[24] Lerman, K., Ghosh, R.: Information contagion: An empirical study of thespread of news on digg and twitter social networks. Icwsm 10, 90–97 (2010)

[25] Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N., Hurst, M.: Patternsof cascading behavior in large blog graphs. In: Proceedings of the 2007SIAM international conference on data mining. pp. 551–556. SIAM (2007)

[26] Lngkvist, M., Karlsson, L., Loutfi, A.: A review of unsupervised featurelearning and deep learning for time-series modeling. Pattern RecognitionLetters pp. 11–24 (2014)

[27] Matsubara, Y., Sakurai, Y., Prakash, B.A., Li, L., Faloutsos, C.: Rise andfall patterns of information diffusion: Model and implications. In: Proceed-ings of the 18th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining. pp. 6–14. KDD ’12, ACM (2012)

[28] McGlohon, M., Leskovec, J., Faloutsos, C., Hurst, M., Glance, N.: Findingpatterns in blog shapes and blog evolution (2007)

[29] Morstatter, F., Pfeffer, J., Liu, H., Carley, K.M.: Is the sample good enough?comparing data from Twitter’s streaming API with Twitter’s Firehose. Pro-ceedings of ICWSM (2013)

[30] Paparrizos, J., Gravano, L.: k-shape: Efficient and accurate clustering oftime series. SIGMOD Rec. 45(1), 69–76 (Jun 2016)

[31] Paparrizos, J., Gravano, L.: Fast and accurate time-series clustering. ACMTrans. Database Syst. 42(2), 8:1–8:49 (Jun 2017)

[32] Pelleg, D., Moore, A.: X-means: Extending k-means with efficient estimationof the number of clusters. In: In Proceedings of the 17th International Conf.on Machine Learning. pp. 727–734. Morgan Kaufmann (2000)

Page 16: Mert Ozer arXiv:1904.04994v1 [cs.LG] 10 Apr 2019arXiv:1904.04994v1 [cs.LG] 10 Apr 2019 Discovering patterns of onlinepopularity from time series Mert Ozer1[1234−5678−9012], Anna

16 M. Ozer et al.

[33] Petitjean, F., Ketterlin, A., Gancarski, P.: A global averaging method fordynamic time warping, with applications to clustering. Pattern Recogn.44(3), 678–693 (Mar 2011)

[34] Ratkiewicz, J., Fortunato, S., Flammini, A., Menczer, F., Vespignani, A.:Characterizing and modeling the dynamics of online popularity. Physicalreview letters 105(15), 158701 (2010)

[35] Rizoiu, M.A., Xie, L., Sanner, S., Cebrian, M., Yu, H., Van Hentenryck, P.:Expecting to be hip: Hawkes intensity processes for social media popularity.In: Proceedings of the 26th International Conference on World Wide Web.pp. 735–744 (2017)

[36] Shokoohi-Yekta, M., Hu, B., Jin, H., Wang, J., Keogh, E.: Generalizing dtwto the multi-dimensional case requires an adaptive approach. Data Miningand Knowledge Discovery 31(1), 1–31 (Jan 2017)

[37] Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentimentstrength detection in short informal text. J. Am. Soc. Inf. Sci. Technol.61(12), 2544–2558 (Dec 2010)

[38] Ulanova, L., Begum, N., Keogh, E.: Scalable clustering of time series withu-shapelets. In: Proceedings of the 2015 SIAM International Conference onData Mining. pp. 900–908 (????)

[39] Varol, O., Ferrara, E., Davis, C., Menczer, F., Flammini, A.: Online human-bot interactions: Detection, estimation, and characterization (2017)

[40] Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., Keogh, E.:Experimental comparison of representation methods and distance measuresfor time series data. Data Mining and Knowledge Discovery 26(2), 275–309(Mar 2013)

[41] Wu, F., Huberman, B.A.: Novelty and collective attention. Proceedings ofthe National Academy of Sciences 104(45), 17599–17601 (2007)

[42] Xiong, Y., Yeung, D.Y.: Mixtures of arma models for model-based timeseries clustering. In: Proceedings of the 2002 IEEE International Conferenceon Data Mining. pp. 717–. ICDM ’02, IEEE Computer Society, Washington,DC, USA (2002)

[43] Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In:Proceedings of the Fourth ACM International Conference on Web Searchand Data Mining. pp. 177–186. WSDM ’11, ACM (2011)

[44] Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In:Proceedings of the fourth ACM international conference on Web search anddata mining. pp. 177–186. ACM (2011)

[45] Yang, K., Shahabi, C.: A pca-based similarity measure for multivariate timeseries. In: Proceedings of the 2Nd ACM International Workshop on Multi-media Databases. pp. 65–74. MMDB ’04, ACM, New York, NY, USA (2004)

[46] Ye, L., Keogh, E.: Time series shapelets: A new primitive for data mining.In: Proceedings of the 15th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining. pp. 947–956. KDD ’09, ACM (2009)

[47] Zhang, Q., Wu, J., Yang, H., Tian, Y., Zhang, C.: Unsupervised featurelearning from time series. In: Proceedings of the Twenty-Fifth InternationalJoint Conference on Artificial Intelligence. pp. 2322–2328. IJCAI’16, AAAIPress (2016)