Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

34
scalable and efficient algorithms for analysis of massive, streaming graphs E. Jason Riedy and David A. Bader MS76 Scalable Network Analysis: Tools, Algorithms, Applications SIAM PP, 15 April 2016 HPC Lab, School of Computational Science and Engineering Georgia Institute of Technology

Transcript of Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

Page 1: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

scalable and efficient algorithms foranalysis of massive, streaming graphs

E. Jason Riedy and David A. BaderMS76 Scalable Network Analysis: Tools, Algorithms, ApplicationsSIAM PP, 15 April 2016

HPC Lab, School of Computational Science and EngineeringGeorgia Institute of Technology

Page 2: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

motivation and applications

Page 3: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

(insert prefix here)-scale data analysis

Cyber-security Identify anomalies, malicious actors

Health care Finding outbreaks, population epidemiology

Social networks Advertising, searching, grouping

Intelligence Decisions at scale, regulating algorithms

Systems biology Understanding interactions, drug design

Power grid Disruptions, conservation

Simulation Discrete events, cracking meshes

• Graphs are a motif / theme in data analysis.• Changing and dynamic graphs are important! 3

Page 4: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

outline

1. Motivation and background2. Incremental PageRank3. Seed set expansion4. Community maintenance5. STINGER: Framework for streaming graph analysis

4

Page 5: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

why graphs?

Another tool, like dense and sparse linear algebra.

• Combine things with pairwiserelationships

• Smaller, more generic than raw data.• Taught (roughly) to all CS students...• Semantic attributions can captureessential relationships.

• Traversals can be faster than filteringDB joins.

• Provide clear phrasing for queriesabout relationships.

5

Page 6: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

potential applications

• Social Networks• Identify communities, influences, bridges, trends,anomalies (trends before they happen)...

• Potential to help social sciences, city planning, andothers with large-scale data.

• Cybersecurity• Determine if new connections can access a device orrepresent new threat in < 5ms...

• Is the transfer by a virus / persistent threat?• Bioinformatics, health

• Construct gene sequences, analyze proteininteractions, map brain interactions

• Credit fraud forensics⇒ detection⇒ monitoring• Integrate all the customer’s data, identify in real-time

6

Page 7: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

streaming graph data

Networks data rates:

• Gigabit ethernet: 81k – 1.5M packets per second• Over 130 000 flows per second on 10 GigE (< 7.7 µs)

Person-level data rates:

• 500M posts per day on Twitter (6k / sec)1• 3M posts per minute on Facebook (50k / sec)2

We need to analyze only changes and not entire graph.

Throughput & latency trade off and expose differentlevels of concurrency.

1www.internetlivestats.com/twitter-statistics/2www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/

7

Page 8: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

streaming graph analysis

Terminology:

• Streaming changes into a massive, evolving graph• Not CS streaming algorithm (tiny memory)• Need to handle deletions as well as insertions

Previous throughput results (not comprehensive review):

Data ingest >2M up/sec [Ediger, McColl, Poovey, Campbell, & B2014]

Clustering coefficients >100K up/sec [R, Meyerhenke, Bader,Ediger, & Mattson 2012]

Connected comp. >1M up/sec [McColl, Green, & B 2013]Community clustering >100K up/sec∗ [R & B 2013]

8

Page 9: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank

Page 10: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

pagerank

Everyone’s “favorite” metric: PageRank.

• Stationary distribution of the random surfer model.• Eigenvalue problem can be re-phrased as a linearsystem (

I− αATD−1) x = kv,with

α teleportation constant, much < 1A adjacency matrixD diagonal matrix of out degrees, withx/0 = x (self-loop)

v personalization vector, here 1/|V|k irrelevant scaling constant

• Amenable to analysis, etc. 10

Page 11: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank

• Streaming data setting, update PageRank withouttouching the entire graph.

• Existing methods maintain databases of walks, etc.• Let A∆ = A+∆A, D∆ = D+∆D for the new graph,want to solve for x+∆x.

• Simple algebra:(I− αAT∆D−1

)∆x = α

(A∆D−1

∆ − AD−1) x,and the right-hand side is sparse.

• Re-arrange for Jacobi,

∆x(k+1) = αAT∆D−1∆ ∆x(k) + α

(A∆D−1

∆ − AD−1) x,iterate, ...

11

Page 12: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank: accumulating error

• And fail. The updated solution wanders away fromthe true solution. Top rankings stay the same...

12

Page 13: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank: think instead

• The old solution x is an ok, not exact, solution to theoriginal problem, now a nearby problem.

• How close? Residual:

r′ = kv− x+ αA∆D−1∆ x

= r+ α(A∆D−1

∆ − AD−1) x.• Solve (I− αA∆D−1

∆ )∆x = r′.• Cheat by not refining all of r′, only region growingaround the changes:

(I− αA∆D−1∆ )∆x = r′|∆

• (Also cheat by updating r rather than recomputing atthe changes.)

13

Page 14: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank: works

Riedy, GABB at IPDPS 2016, to appear.14

Page 15: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

incremental pagerank: worst latency

●●

● ●

0.001

0.010

0.001

0.010

0.01

0.10

1.00

1

100

1

10

100

power

PG

Pgiantcom

pocaidaR

outerLevelbelgium

.osmcoP

apersCiteseer

10 100 1000Batch size

Upd

ate

time

(s)

Algorithm ● dpr dprheld pr_restart

Riedy, GABB at IPDPS 2016, to appear.15

Page 16: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

seed set expansion

Page 17: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

graphs: big, nasty hairballs

Yifan Hu’s (AT&T) visualization of the in-2004 data sethttp://www2.research.att.com/~yifanhu/gallery.html

17

Page 18: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

but no shortage of structure...

Protein interactions, Giot et al., “A ProteinInteraction Map of Drosophila melanogaster”,

Science 302, 1722-1736, 2003. Jason’s network via LinkedIn Labs

• Locally, there are clusters or communities.• There are methods for global community detection.• Also need local communities around seeds forqueries and targetted analysis.

18

Page 19: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

seed set expansion

• Seed set expansion finds the “best” subgraph orcommunities for a set of vertices of interest• Many quality criteria: Modularity, conductance, etc.

• Can be applied to cryptocurrency to identify andtrack groups of interacting entities

• Dynamic algorithm updates communities faster thanrecomputation, allowing us to keep up with new dataproduced

19

Page 20: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

static seed set expansion

Greedy expansion starting from S = { seed vertices }:

1. Check the fitness of every vertex v neighboring S.• fitness = f(S ∪ {v})− f(S)

2. It any fitness is positive, include most fit v in S.• Currently sequential, could include all sufficientlygood neighbors.

3. Record at which step v is included.• This list is the base for updates.

Now the dynamic version by example...

20

Page 21: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

dynamic seed set example

In preparation with Anita Zakrzewska and Eisha Nathan

21

Page 22: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

dynamic seed set example

In preparation with Anita Zakrzewska and Eisha Nathan

21

Page 23: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

dynamic seed set example

In preparation with Anita Zakrzewska and Eisha Nathan

21

Page 24: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

dynamic seed set quality

Graphs from the Koblenz Network Collection.

22

Page 25: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

dynamic seed set speed-up

Graphs from the Koblenz Network Collection.

23

Page 26: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

global community updates

Page 27: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

community detection

• Partition a graph’svertices into disjointcommunities.

• A community locallyoptimizes somemetric, NP-hard.

• Trying to capture thatvertices are moresimilar within onecommunity thanbetweencommunities. Jason’s network via LinkedIn Labs

25

Page 28: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

what about streaming?

• Simple approach based on agglomeration:1. Extract all vertices touched by an update.2. Re-start agglomeration.

• “Works” and is fast (MTAAP 2013), but nevermentioned quality.• Extracted vertices form bridges and do not re-merge.

• Some methods based on label propagation, notmetric-driven.

• Backtracking (e.g. Görke, et al., JEA 2013) preservesquality at cost of change size.• Ongoing: Can we limit backtracking?

26

Page 29: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

community quality: achievable

Data from Stanford SNAP archive: Facebook.Stream generation: Reversing the graph.

In preparation with Pushkar Godbolé 27

Page 30: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

community quality: change size

Data from Stanford SNAP archive: Facebook.Stream generation: Reversing the graph.

In preparation with Pushkar Godbolé 28

Page 31: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

closing

Page 32: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

future directions

• Of course, continuing to develop streaming /dynamic / incremental algorithms.• For massive graphs, computing small changes isalways a win.

• Improving approximations or replacing expensivemetrics like betweenness centrality would be great.

• Including more external and semantic data.• If vertices are documents or data records, manymore measures of similarity.

• Only now being exploited in concert with static graphalgorithms.

30

Page 33: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

hpc lab people

Faculty:• David A. Bader• Oded Green

Data:• Pushkar Godbolé• Anita Zakrzewska• Eisha Nathan

STINGER:

• Robert McColl,• James Fairbanks,• Adam McLaughlin,• Daniel Henderson,• David Ediger (nowGTRI),

• Jason Poovey (GTRI),• Karl Jiang, and• feedback from users inindustry, government,academia

Support: DoD, DoE, NSF, Intel, IBM, Oracle 31

Page 34: Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

stinger: where do you get it?

Home: www.cc.gatech.edu/stinger/Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/

Gateway to

• code,

• development,

• documentation,

• presentations...

Remember: Academic code, but maturingwith contributions.Users / contributors / questioners:Georgia Tech, PNNL, CMU, Berkeley, Intel,Cray, NVIDIA, IBM, Federal Government,Ionic Security, Citi, ...

32