Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5...

69
Mining temporal networks Aristides Gionis Department of Computer Science, Aalto University users.ics.aalto.fi/gionis Workshop on online social networks and media Network properties and dynamics (ONSED) April 24, 2018

Transcript of Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5...

Page 1: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

Mining temporal networks

Aristides Gionis

Department of Computer Science, Aalto Universityusers.ics.aalto.fi/gionis

Workshop on online social networks and mediaNetwork properties and dynamics (ONSED)

April 24, 2018

Page 2: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

interconnected world

• networks model objects and their relations

• many different network types

– social

– informational

– technological

– biological

– . . .

PY

JKPD

VK

HPK

PY

JKPD

VK

HPK

PY

JKPD

VK

HPK

PY

JKPD

VK

HPK

PY

JKPD

VK

HPK

Figure 2: Discovered strong edges of 5 ego-networks of KDD innovation award winners. The �rst 5 �gures containonly strong edges: the colored edges and vertices show 5 topics that were used as input: cluster, classif, pattern,network, distribut. The last topic consisted of 2 connected components which we used as two separated communities.The last �gure shows strong and weak edges. Some of the vertices do no belong to any of the communities. Someedges are strong despite not belonging to any of the communities because we keep edges that do not induce violations.

[17] Gueorgi Kossinets and Duncan Watts. 2006. Empirical analysis of anevolving social network. Science 311, 5757 (2006), 88–90.

[18] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. 2010. Predictingpositive and negative links in online social networks. In Proceedings of the19th international conference on World Wide Web. ACM, 641–650.

[19] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford LargeNetwork Dataset Collection. http://snap.stanford.edu/data. (June 2014).

[20] David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problemfor social networks. journal of the Association for Information Science andTechnology 58, 7 (2007), 1019–1031.

[21] Linyuan Lü and Tao Zhou. 2010. Link prediction in weighted networks:The role of weak ties. EPL (Europhysics Letters) 89, 1 (2010), 18001.

[22] Haris Memic. 2009. Testing the strength of weak ties theory in smalleducational social networking websites. In International Conference onInformation Technology Interfaces. IEEE, 273–278.

[23] James D Montgomery. 1992. Job search and network composition: Im-plications of the strength-of-weak-ties hypothesis. American SociologicalReview (1992), 586–596.

[24] T. M. Newcomb. 1961. The acquaintance process. Holt, Rinehart & Winston.

[25] J-P Onnela, Jari Saramäki, Jorkki Hyvönen, György Szabó, David Lazer,Kimmo Kaski, János Kertész, and A-L Barabási. 2007. Structure and tiestrengths in mobile communication networks. Proceedings of the NationalAcademy of Sciences 104, 18 (2007), 7332–7336.

[26] James Oxley. 2003. What is a matroid? Cubo Matemática Educacional 5, 3(2003), 179–218.

[27] Stavros Sintos and Panayiotis Tsaparas. 2014. Using strong triadic closureto characterize ties in social networks. In Proceedings of the 20th ACMSIGKDD international conference on Knowledge Discovery and Data Mining.ACM, 1466–1475.

[28] Jie Tang, Tiancheng Lou, and Jon Kleinberg. 2012. Inferring social tiesacross heterogenous networks. In Proceedings of the �fth ACM internationalconference on Web Search and Data Mining. ACM, 743–752.

[29] Rongjing Xiang, Jennifer Neville, and Monica Rogati. 2010. Modelingrelationship strength in online social networks. In Proceedings of the 19thinternational conference on World Wide Web. ACM, 981–990.

[30] Wayne W Zachary. 1977. An information �ow model for con�ict and�ssion in small groups. Journal of anthropological research 33, 4 (1977),452–473.

9

Page 3: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

impact of network science

• online communication networks andsocial media

• implications in

– knowledge creation

– information sharing

– education

– democracy

– society as a whole

O. Kostakis et al.

fifa

germany

brazil2014

url

url

fifaworldcup

userid

iphone

url

worldcup2014

5sosrt

in

messi

fifa2014

bacary sagna

cnblue

cup j

athletics

miami

url

url

world cup brazil

mtvhottest

goals

worldcup

beer

sports

everton

userid

daniellemcxs

world

championsleague

suarez

musicbankbrazil

userid

mbf

userid

croatia

userid

football

usmnt

url

url

fifa world cup app brazil

userid

url

userid

neymar

goalsend

brazil

userid

italy

fifa world cup brazil

cup

fifa14

userid

userid

userid

useridurl

panini wor· · · sticker

tbut

goal

country br

5sos

userid

userid

ola

mbf

arsenal

usa

url

us-sports · · · american

brazil

england

fifa

country gb

israel

futbol

realmadrid

1d

nigeria

liverpool

url

fifateam

userid

userid

argentina

spain

love

utd

bravscro

ronaldo

userid

chelsea

fifa world

portugal

worldcup

fifa brazil

5sos trtsend

fifa world cup

1d 5sos

userid

userid

italy

manchester united

userid

to

url

userid

world cup

userid

argentina

madrid

soccer

fb

germany

favre

userid

userid

userid

athletes

brasil

country us

barcelona

userid

juventus

userid

england

Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, for k = 38 and h = 2.The plotted graphs contain only those vertices with degree greater or equal to the 50th largest degree valueof each graph. Blue edges are unique to a summary, red edges occur in each summary, and the remainingedges are colored green (Color figure online)

notice that the hashtags in the sumamries form communities. For example, those forfootball (soccer) are separated from those of American football. Similarly, there is acluster of hashtags relating to music bands, that also happens to re-appear.

123

Page 4: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

research questions

• structure discovery

– finding communities, events, roles of individuals

• study complex dynamic phenomena

– evolution, information diffusion, opinion formation

• develop novel applications

• design efficient algorithms

Page 5: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

traditional view

• networks represented as pure graph-theory objects

– no additional vertex / edge information

• emphasis on static networks

• dynamic settings model structural changes

– vertex / edge additions / deletions

Page 6: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

temporal networks

• ability to collect and store large volumes of network data

• available data have fine granularity

• lots of additional information associated to vertices/edges

• network topology is relatively stable, whilelots of activity and interaction is taking place

• giving rise to new concepts, new problems, andnew computational challenges

Page 7: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

modeling activity in networks

1. network nodes perform actions (e.g., posting messages)

time

xy

zw

ua c

b c a

c e bd a b

c d a

2. network nodes interact with each other(e.g., a “like”, a repost, or sending a message to each other)

time

xy

zw

u

Page 8: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

many novel and interesting concepts

xy

zw

ua

a

ab

b

b

new pattern types

xy

zw

u

temporal information paths

xy

zw

ua

aaa

new types of events

xy

zw

u

network evolution

Page 9: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

temporal networks — objectives

• identify new concepts and new problems

• develop algorithmic solutions

• demonstrate revelance to real-world applications

Page 10: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

agenda

tracking important nodes

• temporal PageRank

• maintaining neighborhood profiles

reconstruction problems

• reconstructing an epidemic over time

• reconstruction of activity timelines

Page 11: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

tracking important nodes

temporal PageRank

P. Rozenshtein and A. Gionis, ECML PKDD 2016

Page 12: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

PageRank

• classic approach for measuring node importance

• listed in the top-10 most important data-mining algorithms[Wu et al., 2008]

• numerous applications

– ranking web pages– trust and distrust computation– finding experts in social networks– . . .

Page 13: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

PageRank

• PageRank defined as the stationary distribution ofa random walk in the graph

• inherently a static process

• however, many modern networks can be viewed asa sequence (stream) of edges

– temporal network : G = (V ,E), with E = {(u, v , t)}– examples : twitter, instagram, IMs, email, . . .

• what is an appropriate PageRank definition fortemporal networks?

Page 14: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

temporal networks

network nodes interact with each other(e.g., a “like”, a repost, or sending a message to each other)

time

xy

zw

u

Page 15: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

motivating example

a

b

c

g

e

f

hd

a

b

c

g

e

f

hd

1

23

4

5

6

7

8

9

10

11

12

a

b

c

g

e

f

hd

1

2

34

56

7

89

1011

12

(a) (b) (c)

Fig. 1: (a) A static graph, in which hubs a and e have the highest static PageRankscore; (b) and (c) represent two di↵erent temporal networks: in (b) the temporalPageRank score of nodes a and e are expected to be stable over time; in (c)node e becomes more important than a as the time goes by, and the temporalPageRank scores of a and e are expected to change accordingly.

and it has inspired a family of fixed-point computation algorithms, such as,TopicRank [6], TrustRank [8], SimRank [11], and more.

PageRank is defined to be the steady-state distribution of a random walk.As such, it is implied that the underlying network structure is fixed and doesnot change over time. Even though numerous works have studied the problem ofcomputing PageRank on dynamic graphs, the emphasis has been given on main-taining PageRank e�ciently under network updates [12, 19], or on computingPageRank e�ciently in streaming settings [22]. Instead there has not been muchwork on how to incorporate temporal information and network dynamicity inthe PageRank definition.

To make the previous claim more clear imagine that starting from an initialnetwork G we observe k elementary updates in the network structure e

1

, . . . , ek

(such as edge additions or deletions), resulting on a modified network G

0. Atypical question is how to compute the PageRank of G0 e�ciently, possibly bytaking into consideration the PageRank of G, and the incremental updates. Nev-ertheless, the PageRank of G0 is defined as a steady-state distribution and asthe network G

0 would “freeze” at that time instance.Our goal in this paper is to extend PageRank so as to incorporate temporal

information and network dynamics in the definition of node importance. Theproposed measure, called temporal PageRank, is designed to provide estimatesof the importance of a node u at any given time t. If the network dynamics andthe importance of nodes change over time, so does temporal PageRank, and itduly adapts to reflect these changes.

An example illustrating the concept of temporal PageRank, and presentingthe main di↵erence with classic PageRank, is shown in Figure 1. First, a static(directed) graph is shown in Figure 1(a). Vertices a and e are the hubs of thegraph, and thus, the nodes with the highest static PageRank score. Figures 1(b)

static network temporal network temporal network

Page 16: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

research questions and objectives

• extend PageRank to incorporate temporal informationand network dynamics

• adapt PageRank to reflect changes in network dynamicsand node importance

• estimate importance of a node u at any given time t

Page 17: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

dynamic PageRank vs. temporal PageRank

• extensive work on dynamic PageRank

• dynamic PageRank computation :– maintain correct PageRank during network updates– e.g., edge additions / deletions

• computation should return the static PageRank at agiven network snapshot

• for edges present in a snapshot, order does not matter

Page 18: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

static PageRank

• graph G = (V ,E)

• corresponding row-stochastic matrix P ∈ Rn×n

• personalization vector h ∈ Rn

• PageRank is the stationary distribution of a random walk,with restart probability (1− α)

π(u) =∑

v∈V

∞∑

k=0

(1− α)αk∑

z∈Z(v ,u)|z|=k

h(v)Pr[z | v ]

where, Z(v ,u) is the set of all paths from v to u

and Pr[z | v ] = ∏(i,j)∈z P(i , j)

Page 19: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

temporal PageRank

• make a random walk only on temporal paths– e.g., time-respecting paths– time-stamps increase along the path

a

b

c

g

e

f

hd

a

b

c

g

e

f

hd

1

23

4

5

6

7

8

9

10

11

12

a

b

c

g

e

f

hd

1

2

34

56

7

89

1011

12

(a) (b) (c)

Fig. 1: (a) A static graph, in which hubs a and e have the highest static PageRankscore; (b) and (c) represent two di↵erent temporal networks: in (b) the temporalPageRank score of nodes a and e are expected to be stable over time; in (c)node e becomes more important than a as the time goes by, and the temporalPageRank scores of a and e are expected to change accordingly.

and it has inspired a family of fixed-point computation algorithms, such as,TopicRank [6], TrustRank [8], SimRank [11], and more.

PageRank is defined to be the steady-state distribution of a random walk.As such, it is implied that the underlying network structure is fixed and doesnot change over time. Even though numerous works have studied the problem ofcomputing PageRank on dynamic graphs, the emphasis has been given on main-taining PageRank e�ciently under network updates [12, 19], or on computingPageRank e�ciently in streaming settings [22]. Instead there has not been muchwork on how to incorporate temporal information and network dynamicity inthe PageRank definition.

To make the previous claim more clear imagine that starting from an initialnetwork G we observe k elementary updates in the network structure e

1

, . . . , ek

(such as edge additions or deletions), resulting on a modified network G

0. Atypical question is how to compute the PageRank of G0 e�ciently, possibly bytaking into consideration the PageRank of G, and the incremental updates. Nev-ertheless, the PageRank of G0 is defined as a steady-state distribution and asthe network G

0 would “freeze” at that time instance.Our goal in this paper is to extend PageRank so as to incorporate temporal

information and network dynamics in the definition of node importance. Theproposed measure, called temporal PageRank, is designed to provide estimatesof the importance of a node u at any given time t. If the network dynamics andthe importance of nodes change over time, so does temporal PageRank, and itduly adapts to reflect these changes.

An example illustrating the concept of temporal PageRank, and presentingthe main di↵erence with classic PageRank, is shown in Figure 1. First, a static(directed) graph is shown in Figure 1(a). Vertices a and e are the hubs of thegraph, and thus, the nodes with the highest static PageRank score. Figures 1(b)

c → b → a→ c : time respecting

a→ c → b → a : not time respecting

Page 20: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

temporal PageRank

• intuition : probability of visiting node u at time tgiven a random walk on temporal paths

• need to model probability of following next temporal edge– we use an exponential distribution

• temporal PageRank definition

r(u, t) =∑

v∈V

t∑

k=0

(1− α)αk∑

z∈ZT (v ,u|t)|z|=k

Pr′[z| t ]

ZT (v ,u | t) set of temporal paths from v to u until time t

Page 21: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

computation

• simple online algorithm• r(u, t) : temporal PageRank estimate of u at time t• s(u, t) : count of active walks visiting u at time t

Algorithm 2: stream processing

input : E, transition probability �, jumping probability ↵1 r = 0, s = 0;2 foreach (u, v, t) 2 E do3 r(u) = r(u) + (1 � ↵);4 r(v) = r(v) + (s(u) + (1 � ↵))↵;5 s(v) = s(v) + (s(u) + (1 � ↵))(1 � �)↵;6 s(u) = (s(u) + (1 � ↵))�;

7 normalize r;8 return r;

3.2 Temporal vs. static PageRank

Temporal PageRank is defined to handle network dynamics and concept drifts.An intuitive property that one may expect is that if the edge distribution of thetemporal edges remains constant, then temporal PageRank approximates staticPageRank. In this section we show that indeed this is the case.

Consider a weighted directed graph Gs = (V, Es, w) and a time periodT = [1, .., T ]. Without loss of generality assume

Pe2Es

w(e) = 1 and let Nout(u)be the out-link neighbors of u. Let edges e 2 Es be associated with a samplingdistribution SE : p[e = (u, v)] = w(e). A temporal graph G = (V, E) is con-structed by sampling T edges from Gs using SE (probability to pick an edgeinto E is proportional to the weight of this edge in the static graph). We willconsider a simple case of transition probability � = 1: a random walk takes thefirst suitable interaction to continue.

In the setting described above we can prove the following statement.

Proposition 2. The expected values of temporal PageRank on graph G = (V, E)converge to the values of static PageRank on graph Gs = (V, Es, w), with per-sonalization vector h(u) =

Pv2Nout(u) w(e = (u, v)) (weighted out-degree).

Proof. At any time moment t every vertex u 2 V has PageRank score r(u, t)and active mass (number of walkers that wait to continue) equal to s(u, t).

The expected value E(r(v, T )) of the PageRank count of node v at time T isa sum over expected increments of r(v) over time:

E(r(v, T )) =TX

t=1

E(�r(v, t)).

At time t the increment of r(v) can be caused by selecting an edge e(t) =(v, q) with starting point in v and q 2 V . In this case r(v) is incremented by(1 � ↵). Another possibility to increment r(v) is to select an edge e(t) = (q, v)with u as an end point and q 2 V . In this case r(v) is incremented by ↵s(q, t),where s(q, t) is a value of active mass in node q at time t. Let p(u) be a probability

Page 22: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

static vs. temporal PageRank

• temporal PageRank is designed to capture changesin network dynamics and concept drifts

• what if the edge distribution is stable?

Page 23: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

static vs. temporal PageRank

• consider static network GS = (V ,ES,w)

• time period [1, . . . ,T ]

• construct temporal network G = (V ,E) by sampling edgesproportionally to their weight

proposition :

as T →∞, the temporal PageRank on Gconverges to the static PageRank on GS,with personalization vector equal to weighted out-degree

Page 24: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

experiment — adaptation to concept drift

(a) Facebook (b) Twitter (c) Students

Fig. 5: Adaptation for the change of sampling distribution.

(a) Facebook (b) Twitter (c) Students

Fig. 6: Convergence to static PageRank with increasing number of random scansof edges.

Measures. To evaluate the settings in which temporal PageRank is expected toconverge to the static PageRank of a corresponding graph, we compare temporaland static PageRank using three di↵erent measures: we use (i) Spearman’s ⇢ tocompare the induced rankings, we also use (ii) Pearson’s correlation coe�cient r,and (iii) Euclidean distance ✏ on the PageRank vectors.

All the reported experimental results are averaged over 100 runs. Dampingparameter is set of ↵ = 0.85. Waiting probability � for temporal PageRank isset to 0 unless specified otherwise.

4.1 Results

Convergence. In the first set of experiments we test how fast the tempo-ral PageRank algorithm converges to corresponding static PageRank. In thissetting we process datasets with m temporal edges and compare the tempo-ral PageRank ranking with the corresponding static PageRank ranking. In theplots of Figure 2 we report Pearson’s r, Spearman’s ⇢ and Euclidean error ✏.The first column corresponds to the calculation of temporal PageRank withoutany a priori knowledge of personalization vector. Thus, the resulting temporalPageRank corresponds to the static PageRank with out-degree personalization:

Page 25: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

tracking important nodes

maintaining sliding-window neighborhood profiles

R. Kumar, T. Calders, A. Gionis, and N. Tatti, ECML PKDD 2015

Page 26: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

distance distributions in graphs

• given graph G, a node u, and distance r :

how many nodes of G are in distance r from u?

• fundamental graph-mining primitive

– median distance, diameter, effective diameter

• related to small-world phenomena

• a measure of centrality for nodes of G

Page 27: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

distance distributions in graphs

• exact solution requires all-pairs shortest path computation

– Floyd-Warshall algorithm: O(n3)

– or, BFS for unweighted graphs: O(nm)

• clearly non scalable

• resort to approximations based on diffusion methods

Page 28: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

diffusion-based computation

[Palmer et al., 2002]

• let Bt(x) be the ball of radius t around x(the set of nodes at distance ≤ t from x)

• clearly B0(x) = {x}

• moreover Bt+1(x) =⋃

(x ,y) Bt(y)⋃{x}

• so computing Bt+1 from Bt just takes a single (sequential)scan of the graph

Page 29: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

diffusion-based computation

• every set requires O(n) bits, hence O(n2) bits overall

• amount of space is prohibitively large

• instead use sketching for counting distinct elements

• probabilistic counters require very small space (log log)

• HyperANF algorithm [Boldi et al., 2011]

– uses HyperLogLog counters [Flajolet et al., 2007]

– with 40 bits you can count up to 4 billion with– standard deviation 6%

Page 30: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs
Page 31: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

extension to temporal networks

• limitations of existing solutions

– consider static network

– multi-pass algorithm

• in this work

– extension to temporal networks

– streaming algorithm for sliding-window model :

– consider only the most recent interactions (edges)

Page 32: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

setting

• temporal network G = (V ,E)

• stream of edges E = 〈(u1, v1, t1), (u2, v2, t2), . . .〉with t1 ≤ t2 ≤ . . .

• sliding window length w

• snapshot network G(t ,w) at time t contains all edgeswith time-stamps in (t − w , t ]

problem :given node u, window length w , and distance r , how manynodes in G(t ,w) are within distance r from u at time t?

Page 33: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

example

a b

c d

e

1,8

2

3 4,9

5,106

7

a b

c d

e

12

3

G3

a b

c d

e

2

3 4

G4

a b

c d

e3 4

5G5

a toy example, 3 snapshot graphs with a window size of 3

Page 34: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

proposed online algorithms

1. an exact but memory-inefficient streaming algorithm

2. an approximate memory-efficient streaming algorithm

– approximate algorithm uses logic of exact algorithm,combined with hyperloglog sketches

Page 35: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

horizons

• path horizon : time-stamp of the oldest edge on the path

• h(u, v , i) : the horizon for length i between nodes u and v :the maximum horizon of any path of length at most i

Page 36: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

example

a b

c d

e

2

65

43

1

−∞,−∞, 3, 3, 3 ∞,∞,∞,∞,∞

−∞,3, 3, 3, 3 −∞,2, 2, 3, 3

−∞,−∞, 3, 3, 3

a b

c d

e

72

65

43

1

−∞,7, 7, 7, 7 ∞,∞,∞,∞,∞

−∞,3, 4, 4, 4 −∞,2, 2, 3, 4

−∞, −∞, 3, 4, 4

two snapshot graphs along with h(u,b, i) for i = 0, . . . ,4

Page 37: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

neighborhood summaries

• observation : if for a node u we know all horizons h(u, v , i),for all distances i and all nodes v , we can give completeneighborhood profile for u for any window length

• neighborhood summary : Sut = (Su

t [0], . . . ,Sut [r ])

where Sut [i] = {(v ,ht(u, v , i)) | ht(u, v , i) > −∞}

Page 38: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

updating neighborhood summaries

• edge deletion : simply delete entries from summaries

• edge addition : a change in summary at distance i fora node u will introduce a change in the summary of itsneighbors at distance i + 1

– updates propagate in a BFS fashion

Page 39: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

exact algorithm

• update time : O(rmn log n)

• space complexity : O(rn2)

– where r an upper bound on max distance

• quadratic dependence not acceptable for large graphs

– hence approximation algorithm

Page 40: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

approximate algorithm

• sliding HyperLogLog sketch : extension of HyperLogLog tomaintain a distinct set counter over sliding window

• if number of buckets in the HLL counter is k then theworst case complexity changes to

– update time :

– O(rm2k log2 n) from O(rmn log n)

– space complexity :

– O(rn2k log n) from O(rn2)

Page 41: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

empirical evaluation — quality

nodes dist total clus diam eff avg reldataset edges edges coef diam error

(k=7)

Facebook 4 039 88 234 88 234 0.60 8 4.7 0.08Cit-HepTh 27 771 352 801 352 801 0.31 13 5.3 0.10Higgs 166 840 249 030 500 000 0.19 10 4.7 0.14DBLP 192 357 400 000 800 000 0.63 21 8.0 0.09

Page 42: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

empirical evaluation — running time 0 2 4 6 8

10 12 14 16 18 20

0 10 20 30 40 50 60 70 80 90

time

(sec

)edges (in thousands)

k = 4k = 5k = 6k = 7

(a) Facebook

0

50

100

150

200

250

50 100 150 200 250 300 350

time

(sec

)

edges (in thousands)

k = 4k = 5k = 6k = 7

(b) Cit-HepTh

0

10

20

30

40

50

60

100 200 300 400 500

time

(sec

)

edges (in thousands)

k = 4k = 5k = 6k = 7

(c) Higgs

0

1

2

3

4

5

6

7

100 200 300 400 500 600 700 800

time

(sec

)

edges (in thousands)

k = 4k = 5k = 6k = 7

(d) DBLP

Fig. 4. Time needed to process 1 000 edges for di↵erent `

0

50

100

150

200

250

300

350

400

0 10 20 30 40 50 60 70 80 90

time

(sec

)

edges (in thousands)

SerialParallel

Fig. 5. Running times for DBLP with parallelized version of the algorithm.

8 Concluding remarks

We studied the problem of maintaining the neighborhood profile of the nodesof an interaction network—a graph with a sequence of interactions, in the formof a stream of time-stamped edges. The model is appropriate for many moderngraph datasets, like social networks where interaction between users is one of themost important aspects. We focused on the sliding-window data-stream model,which allows to forget past interactions and adapt to new drifts in the data.Thus, the proposed problem and approach can be applied to monitoring largenetworks with fast-evolving interactions, and used to reason how the networkstructure and the centrality of the important nodes change over time.

contrast (DBLP)– offline HyperANF : 3.6 sec / sliding window– proposed approach : 0.003 sec / sliding window

Page 43: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

reconstructing an epidemic over time

P. Rozenshtein, A. Gionis, B.A. Prakash, J. Vreeken, KDD 2016

Page 44: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

motivation

• consider a sequence of timestamped edges– an edge between people represents some interaction– phonecall, email, retweet, . . .

• infection reconstruction :– consider a unknown dynamic propagation process– virus, idea, topic, gossip, . . .– incomplete reported cases of infection

• goal :– reconstruct paths of infection,– which explains cases of reported infection, and– recovers missing infected nodes and interactions

Page 45: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

model

• interaction (temporal) network G = (V ,E)

n nodes V ; m directed interactions E = {(u, v , t)}convenient to consider timestamped nodes V = {(ui , ti)}

Page 46: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

model

• infection (activity)– infection starts externally– it may propagate only via interactions– infected nodes remain infected– no assumption about the model

• reports– reported infections R = {(u, t)}– report can be later than activation– not all infected nodes are reported

Page 47: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

problem definition

EPIDEMICRECOSTRUCTION

• input : given– interactions E = {(u, v , t)}– set of reported infections R = {(u, t)}– set of candidate seeds C ⊆ V– integer k

• find : set of temporal paths P such that– set of paths P spans R– seeds in P are in C– number of seeds in P is at most k– cost(P | R) = ∑

e∈P w(e) minimized

EPIDEMICRECOSTRUCTION is NP-hard

Page 48: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

problem definition

EPIDEMICRECOSTRUCTION

• input : given– interactions E = {(u, v , t)}– set of reported infections R = {(u, t)}– set of candidate seeds C ⊆ V– integer k

• find : set of temporal paths P such that– set of paths P spans R– seeds in P are in C– number of seeds in P is at most k– cost(P | R) = ∑

e∈P w(e) minimized

EPIDEMICRECOSTRUCTION is NP-hard

Page 49: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

related problem

MINDIRSTEINERTREE

• input : given– directed graph H = (U,F ,w) with edge weights w– root node r ∈ U– set of terminal nodes R ⊆ U

• find : directed tree T rooted at r such that– T contais paths from r to all nodes in R–

∑e∈T w(e) is minimized

EPIDEMICRECOSTRUCTION can be mapped toMINDIRSTEINERTREE

Page 50: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

related problem

MINDIRSTEINERTREE

• input : given– directed graph H = (U,F ,w) with edge weights w– root node r ∈ U– set of terminal nodes R ⊆ U

• find : directed tree T rooted at r such that– T contais paths from r to all nodes in R–

∑e∈T w(e) is minimized

EPIDEMICRECOSTRUCTION can be mapped toMINDIRSTEINERTREE

Page 51: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

transformation

add a dummy node, andconnect it with the earliest occurrence of each candidate seed,with zero cost

Page 52: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

solution idea

input

interactions E , reports R, candidates C, integer k

transformation

1. construct a static graph H = (U,F ,w), whereU = V ∪ {d} time-stamped nodes and dummy node d

2. edges from d to earliest occurrence candidate seedsset weight to α

solve MINDIRSTEINERTREE on H

– subtrees of d are temporal paths P

– number of subtrees monotonic on weight α

– binary search on α, until less than k subtrees

Page 53: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

solving MINDIRSTEINERTREE

– MINDIRSTEINERTREE is NP-hard

– recursive algorithm [Charikar et al., 1999]

– defined for recursion depth i > 1

– approximation guarantee i(i − 1)|X | 1i– running time O(|V |i |X |i) [Huang et al., 2015]

we use i = 2

Page 54: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

main result

speedup

• MINDIRSTEINERTREE pre-computes transitive closure of H– running time O(m2)

• need to calculate shortest paths for ‘only’ O(n2) pairs– a scan on E requiring O(nm) time [Huang et al., 2015]

proposition

for the EPIDEMICRECOSTRUCTION problem, we can obtain

approximation 2|n| 12 in time O(mn)

Page 55: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

experimental evaluation

– datasets : synthetic, facebook, tumblr, students, and enron

– weights : w(u, v , t) = 12(|t − tR(u)|+ |t − tR(v)|)

– setting : simulate epidemic cascades with different models– sample infections reports– compare with ground truth

– baseline : one-hop extension

– evaluation metric : Matthews correlation coefficient

MCC =TP · TN− FP · FN√

(TP + FP)(TP + FN)(TN + FP)(TN + FN)

Page 56: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

experimental evaluation — resultsFacebook Students Enron Tumblr

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

CulTreportsbaseline

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Figure 3: E↵ect of the fraction of interactions in interaction history E that are relevant to the propagation.Quality of reconstruction measured in MCC for four datasets, with the SI model used to simulate propagations.

SI Shortest path FF IC

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

CulTreportsbaseline

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

10-3 10-2 10-1 100

fract ion of relevant interact ions

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Figure 4: E↵ect of the fraction of interactions in the interaction history E that are relevant to the propagation.Reconstruction quality measured by MCC on the Facebook dataset, for di↵erent infection models.

Second, we evaluate how well we can reconstruct propaga-tions generated by di↵erent models. We simulate cascadeson the Facebook network using all four models: SI , SP, IC,and FF. We again vary the number of relevant interactions.The results are shown in Figure 4. CulT performs veryconsistently, regardless of the generating model.

Last, we check how the delay between a node activationand its reporting a↵ects performance. We simulate the SImodel on the four real datasets, and vary the delay between 0and 5000 time steps. Results are provided in Figure 5. CulTis almost not a↵ected at all by delays in reports, whereas theperformance of Baseline deteriorates quickly.

5.5 Order of infectionNext, we evaluate how well we can reconstruct the order

of infections in our temporal network. We use the fron-tier (FR) reporting scheme, while varying the fraction ofreported frontier nodes. Figure 6 shows that the precisionfor both temporal order and static order is quite robust. Inboth cases the fraction of reported frontier nodes a↵ects onlythe recall. When the number of the frontier nodes in the re-ports is low, CulT ignores many activated “leaves,” whichare neither in the reports, nor on the path to any other re-ported node. Thus, the number of false negative increases.

Naturally, the accuracy for static order is higher than fortemporal order: an interaction between the same two nodescan occur at several time moments and it is di�cult to selectthe correct one into a reconstructed propagation path.

5.6 Real cascadesNext we present our results on the Flixster dataset. The

sequence of interactions E is divided into epochs (t0, t1, . . . , T ).At the end of each epoch ti we consider a snapshot of thenetwork and compare the set of discovered activated nodes

with the set of ground-truth activated nodes in V (A[t0, ti]).Results shown in Figure 7 indicate that CulT gives highquality solution during the whole time interval, while theperformance of Baseline degrades significantly with time.

5.7 ScalabilityLast, but not least, we evaluate the scalability of CulT.

We conducted these experiments on a 3.30GHz Intel Xeonmachine with 16GB of memory.

First, we consider running time with respect to the num-ber of interactions E in the temporal network. We constructa set E of a required length by increasing � on the Facebookdataset. We show the results in the left-most plot of Fig-ure 8. As the plot shows, CulT scales well with |E|.

In the center plot of Figure 8 we show the running timeon the Facebook dataset, where we vary the number of acti-vated nodes up till all nodes in the network are included. Wesee again a graceful increase in running time, almost linear.

Third, we investigate the applicability of CulT in a stream-ing scenario: we therefore report the running time per up-date, as the number of newly-arrived interactions increases.To this end we use the Facebook data, and vary the � param-eter to create a sequence of 3 000 interactions. We run CulTon the first 1000 interactions, and use the rest to test theupdate times. The total update time consists of two com-ponents, 1) the time needed to update the global shortestpaths, and 2) the time needed for performing binary search.As can be seen in the right-most plot of Figure 8, the timefor binary search remains constant regardless of the size ofthe batch of new interactions. On the other hand, as ex-pected, the time needed for path updates increases with thebatch size. Note, however, that even for larger batch sizesthe total update time is less than a second.

Page 57: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

reconstructing activity timelines in temporal networks

P. Rozenshtein, A. Gionis, N. Tatti, ECML PKDD 2017

Page 58: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

the timeline reconstruction problem

• consider a set of entities

• entities can become active or inactive

• entities interact over time, forming a temporal network

• each interaction is attributed to an active entity

• can we reconstruct the activity timeline that explainsbest the observed temporal network?

• assumption: being active is more costly,thus we want to minimize total activity time

Page 59: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

the timeline reconstruction problem

• consider a set of entities

• entities can become active or inactive

• entities interact over time, forming a temporal network

• each interaction is attributed to an active entity

• can we reconstruct the activity timeline that explainsbest the observed temporal network?

• assumption: being active is more costly,thus we want to minimize total activity time

Page 60: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

the timeline reconstruction problem

• motivating example

• analyze a discussion in twitter about a topic (e.g., brexit)

• entities are hashtags

• two hashtags interact if they appear in the same tweet

• summarize the discussion by reconstructing a timeline

• pick a set of important hashtags and the time intervalsthey are active

Page 61: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

the timeline reconstruction problem

motivating example

time

#economy#brexit

#negotiations#hardbrexit

#tory

Page 62: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

the timeline reconstruction problem

motivating example

time

#economy#brexit

#negotiations#hardbrexit

#tory

Page 63: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

problem formalization

• given a temporal network G = (V ,E) with E = {(u, v , t)}

• find a set of intervals associated with nodes(the intervals that nodes are active)at most k per node

• that cover all edges, and

– k -SUM-SPAN : minimize the sum of interval lengths

– k -MAX-SPAN : minimize the max interval length

Page 64: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

results

1-MAX-SPAN : solvable in linear time (related to 2-SAT)

1-SUM-SPAN : NP-hard

k -MAX-SPAN, k > 1 : inapproximable

k -SUM-SPAN, k > 1 : inapproximable

efficient and practical algorithms for hard problems

Page 65: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

timeline reconstruction — case study

Page 66: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

timeline reconstruction — case study

Page 67: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

summary

• examples of mining temporal networks

– temporal PageRank

– maintaining sliding-window neighborhood profiles

– reconstructing an epidemic over time

– reconstructing activity timelines

• potential for new concepts, new problem definitions,new computational methods, and new applications

Page 68: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

references

Boldi, P., Rosa, M., and Vigna, S. (2011).HyperANF: approximating the neighborhood function of very largegraphs on a budget.In WWW.

Charikar, M., Chekuri, C., Cheung, T.-y., Dai, Z., Goel, A., Guha, S., andLi, M. (1999).Approximation algorithms for directed steiner problems.Journal of Algorithms.

Flajolet, F., Fusy, E., Gandouet, O., and Meunier, F. (2007).Hyperloglog: the analysis of a near-optimal cardinality estimationalgorithm.In Proceedings of the 13th conference on analysis of algorithm (AofA).

Flajolet, P. and Martin, N. G. (1985).Probabilistic counting algorithms for data base applications.Journal of Computer and System Sciences, 31(2):182–209.

Page 69: Mining temporal networksbarcelona userid juventus userid england Fig. 9 Linear layouts of 5 summaries discovered for the Twitter World Cup dataset, fork= 38 andh= 2. The plotted graphs

references (cont.)

Huang, S., Fu, A. W.-C., and Liu, R. (2015).Minimum spanning trees in temporal graphs.In Proceedings of the 2015 ACM SIGMOD International Conference onManagement of Data.

Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).ANF: a fast and scalable tool for data mining in massive graphs.In Proceedings of the eighth ACM SIGKDD international conference onKnowledge discovery and data mining, pages 81–90, New York, NY,USA. ACM Press.

Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H.,McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y., et al. (2008).Top 10 algorithms in data mining.KAIS.