Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

scalable and efficient algorithms foranalysis of massive, streaming graphs

E. Jason Riedy and David A. BaderMS76 Scalable Network Analysis: Tools, Algorithms, ApplicationsSIAM PP, 15 April 2016

HPC Lab, School of Computational Science and EngineeringGeorgia Institute of Technology

motivation and applications

(insert prefix here)-scale data analysis

Cyber-security Identify anomalies, malicious actors

Health care Finding outbreaks, population epidemiology

Social networks Advertising, searching, grouping

Intelligence Decisions at scale, regulating algorithms

Systems biology Understanding interactions, drug design

Power grid Disruptions, conservation

Simulation Discrete events, cracking meshes

• Graphs are a motif / theme in data analysis.• Changing and dynamic graphs are important! 3

outline

1. Motivation and background2. Incremental PageRank3. Seed set expansion4. Community maintenance5. STINGER: Framework for streaming graph analysis

4

why graphs?

Another tool, like dense and sparse linear algebra.

• Combine things with pairwiserelationships

• Smaller, more generic than raw data.• Taught (roughly) to all CS students...• Semantic attributions can captureessential relationships.

• Traversals can be faster than filteringDB joins.

• Provide clear phrasing for queriesabout relationships.

5

potential applications

• Social Networks• Identify communities, influences, bridges, trends,anomalies (trends before they happen)...

• Potential to help social sciences, city planning, andothers with large-scale data.

• Cybersecurity• Determine if new connections can access a device orrepresent new threat in < 5ms...

• Is the transfer by a virus / persistent threat?• Bioinformatics, health

• Construct gene sequences, analyze proteininteractions, map brain interactions

• Credit fraud forensics⇒ detection⇒ monitoring• Integrate all the customer’s data, identify in real-time

6

streaming graph data

Networks data rates:

• Gigabit ethernet: 81k – 1.5M packets per second• Over 130 000 flows per second on 10 GigE (< 7.7 µs)

Person-level data rates:

• 500M posts per day on Twitter (6k / sec)1• 3M posts per minute on Facebook (50k / sec)2

We need to analyze only changes and not entire graph.

Throughput & latency trade off and expose differentlevels of concurrency.

1www.internetlivestats.com/twitter-statistics/2www.jeffbullas.com/2015/04/17/21-awesome-facebook-facts-and-statistics-you-need-to-check-out/

7

streaming graph analysis

Terminology:

• Streaming changes into a massive, evolving graph• Not CS streaming algorithm (tiny memory)• Need to handle deletions as well as insertions

Previous throughput results (not comprehensive review):

Data ingest >2M up/sec [Ediger, McColl, Poovey, Campbell, & B2014]

Clustering coefficients >100K up/sec [R, Meyerhenke, Bader,Ediger, & Mattson 2012]

Connected comp. >1M up/sec [McColl, Green, & B 2013]Community clustering >100K up/sec∗ [R & B 2013]

8

incremental pagerank

pagerank

Everyone’s “favorite” metric: PageRank.

• Stationary distribution of the random surfer model.• Eigenvalue problem can be re-phrased as a linearsystem (

I− αATD−1) x = kv,with

α teleportation constant, much < 1A adjacency matrixD diagonal matrix of out degrees, withx/0 = x (self-loop)

v personalization vector, here 1/|V|k irrelevant scaling constant

• Amenable to analysis, etc. 10

incremental pagerank

• Streaming data setting, update PageRank withouttouching the entire graph.

• Existing methods maintain databases of walks, etc.• Let A∆ = A+∆A, D∆ = D+∆D for the new graph,want to solve for x+∆x.

• Simple algebra:(I− αAT∆D−1

∆

)∆x = α

(A∆D−1

∆ − AD−1) x,and the right-hand side is sparse.

• Re-arrange for Jacobi,

∆x(k+1) = αAT∆D−1∆ ∆x(k) + α

(A∆D−1

∆ − AD−1) x,iterate, ...

11

incremental pagerank: accumulating error

• And fail. The updated solution wanders away fromthe true solution. Top rankings stay the same...

12

incremental pagerank: think instead

• The old solution x is an ok, not exact, solution to theoriginal problem, now a nearby problem.

• How close? Residual:

r′ = kv− x+ αA∆D−1∆ x

= r+ α(A∆D−1

∆ − AD−1) x.• Solve (I− αA∆D−1

∆ )∆x = r′.• Cheat by not refining all of r′, only region growingaround the changes:

(I− αA∆D−1∆ )∆x = r′|∆

• (Also cheat by updating r rather than recomputing atthe changes.)

13

incremental pagerank: works

Riedy, GABB at IPDPS 2016, to appear.14

incremental pagerank: worst latency

●

●●

●

●

●

●

●

●

● ●

●

●

●

●

0.001

0.010

0.001

0.010

0.01

0.10

1.00

1

100

1

10

100

power

PG

Pgiantcom

pocaidaR

outerLevelbelgium

.osmcoP

apersCiteseer

10 100 1000Batch size

Upd

ate

time

(s)

Algorithm ● dpr dprheld pr_restart

Riedy, GABB at IPDPS 2016, to appear.15

seed set expansion

graphs: big, nasty hairballs

Yifan Hu’s (AT&T) visualization of the in-2004 data sethttp://www2.research.att.com/~yifanhu/gallery.html

17

http://www2.research.att.com/~yifanhu/GALLERY/GRAPHS/GIF_SMALL/[email protected]

http://www2.research.att.com/~yifanhu/gallery.html

but no shortage of structure...

Protein interactions, Giot et al., “A ProteinInteraction Map of Drosophila melanogaster”,

Science 302, 1722-1736, 2003. Jason’s network via LinkedIn Labs

• Locally, there are clusters or communities.• There are methods for global community detection.• Also need local communities around seeds forqueries and targetted analysis.

18

http://inmaps.linkedinlabs.com/share/Jason_Riedy/135243263311536682812471775171414573322

seed set expansion

• Seed set expansion finds the “best” subgraph orcommunities for a set of vertices of interest• Many quality criteria: Modularity, conductance, etc.

• Can be applied to cryptocurrency to identify andtrack groups of interacting entities

• Dynamic algorithm updates communities faster thanrecomputation, allowing us to keep up with new dataproduced

19

static seed set expansion

Greedy expansion starting from S = { seed vertices }:

1. Check the fitness of every vertex v neighboring S.• fitness = f(S ∪ {v})− f(S)

2. It any fitness is positive, include most fit v in S.• Currently sequential, could include all sufficientlygood neighbors.

3. Record at which step v is included.• This list is the base for updates.

Now the dynamic version by example...

20

dynamic seed set example

In preparation with Anita Zakrzewska and Eisha Nathan

21

dynamic seed set quality

Graphs from the Koblenz Network Collection.

22

dynamic seed set speed-up

Graphs from the Koblenz Network Collection.

23

global community updates

community detection

• Partition a graph’svertices into disjointcommunities.

• A community locallyoptimizes somemetric, NP-hard.

• Trying to capture thatvertices are moresimilar within onecommunity thanbetweencommunities. Jason’s network via LinkedIn Labs

25

http://inmaps.linkedinlabs.com/share/Jason_Riedy/135243263311536682812471775171414573322

what about streaming?

• Simple approach based on agglomeration:1. Extract all vertices touched by an update.2. Re-start agglomeration.

• “Works” and is fast (MTAAP 2013), but nevermentioned quality.• Extracted vertices form bridges and do not re-merge.

• Some methods based on label propagation, notmetric-driven.

• Backtracking (e.g. Görke, et al., JEA 2013) preservesquality at cost of change size.• Ongoing: Can we limit backtracking?

26

community quality: achievable

Data from Stanford SNAP archive: Facebook.Stream generation: Reversing the graph.

In preparation with Pushkar Godbolé 27

community quality: change size

Data from Stanford SNAP archive: Facebook.Stream generation: Reversing the graph.

In preparation with Pushkar Godbolé 28

closing

future directions

• Of course, continuing to develop streaming /dynamic / incremental algorithms.• For massive graphs, computing small changes isalways a win.

• Improving approximations or replacing expensivemetrics like betweenness centrality would be great.

• Including more external and semantic data.• If vertices are documents or data records, manymore measures of similarity.

• Only now being exploited in concert with static graphalgorithms.

30

hpc lab people

Faculty:• David A. Bader• Oded Green

Data:• Pushkar Godbolé• Anita Zakrzewska• Eisha Nathan

STINGER:

• Robert McColl,• James Fairbanks,• Adam McLaughlin,• Daniel Henderson,• David Ediger (nowGTRI),

• Jason Poovey (GTRI),• Karl Jiang, and• feedback from users inindustry, government,academia

Support: DoD, DoE, NSF, Intel, IBM, Oracle 31

stinger: where do you get it?

Home: www.cc.gatech.edu/stinger/Code: git.cc.gatech.edu/git/u/eriedy3/stinger.git/

Gateway to

• code,

• development,

• documentation,

• presentations...

Remember: Academic code, but maturingwith contributions.Users / contributors / questioners:Georgia Tech, PNNL, CMU, Berkeley, Intel,Cray, NVIDIA, IBM, Federal Government,Ionic Security, Citi, ...

32

http://www.cc.gatech.edu/stinger/

http://git.cc.gatech.edu/git/u/eriedy3/stinger.git/

Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs

Data & Analytics

Transcript of Scalable and Efficient Algorithms for Analysis of Massive, Streaming Graphs