Transcript of "Streaming Graph Partitioning for Large Distributed Graphs" (KDD 8/15), 21 pages

Page 1

Streaming Graph Partitioning for Large Distributed Graphs

Isabelle Stanton, UC Berkeley
Gabriel Kliot, Microsoft Research XCG

Page 2

Motivation

• Modern graph datasets are huge
  – The web graph had over a trillion links in 2011. Now?
  – Facebook has "more than 901 million users with average degree 130"
  – Protein networks

Page 3

Motivation

• We still need to perform computations, so we have to deal with large data
  – PageRank (and other matrix-multiply problems)
  – Broadcasting status updates
  – Database queries
  – And on and on and on…

The graph has to be distributed across a cluster of machines!

Page 4

Motivation

• Edges cut correspond (approximately) to the communication volume required

• Too expensive to move data on the network
  – Interprocessor communication: nanoseconds
  – Network communication: microseconds

• The data has to be loaded onto the cluster at some point…

• Can we partition while we load the data?
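
The edge-cut-as-communication idea can be made concrete with a toy sketch (illustrative Python; the graph and the partition assignment are made up for the example):

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints land on different machines.
    Each such edge implies cross-machine traffic during computation."""
    return sum(1 for u, v in edges if part[u] != part[v])

# Toy graph: a 4-cycle split across two machines.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
part = {0: 0, 1: 0, 2: 1, 3: 1}  # vertices 0,1 on machine 0; 2,3 on machine 1
print(edge_cut(edges, part))     # edges (1,2) and (3,0) cross machines -> 2
```

A good partitioner tries to keep this count small while balancing the load.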

Page 5

High-Level Background

• Graph partitioning is NP-hard on a good day
• But then we made it harder:
  – Graphs like social networks are notoriously difficult to partition (expander-like)
  – Large datasets drastically reduce the amount of computation that is feasible: O(n) or less
  – The partitioning algorithms need to be parallel and distributed

Page 6

The Streaming Model

[Figure: a graph stream feeds a partitioner, which places vertices on machines M_1, M_2, …, M_k]

Graph Stream → Partitioner

The graph is ordered:
• Random
• Breadth-First Search
• Depth-First Search

Goal: generate an approximately balanced k-partitioning

Each machine holds C = (1 + ε)n/k nodes

Possible buffer of size …
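
The setting above can be sketched as a single pass over the stream (an illustrative reconstruction, not the paper's code; the `score` callback standing in for a heuristic is an assumption):

```python
import math

def stream_partition(stream, k, n, eps, score):
    """One-pass partitioner for the streaming model: each arriving vertex
    is placed on one of k machines, each capped at C = (1 + eps) * n / k.
    `score(v, neighbors, parts, i)` is a pluggable heuristic."""
    cap = math.ceil((1 + eps) * n / k)
    parts = [set() for _ in range(k)]
    for v, neighbors in stream:
        candidates = [i for i in range(k) if len(parts[i]) < cap]
        # Highest heuristic score wins; ties go to the least-loaded machine.
        best = max(candidates,
                   key=lambda i: (score(v, neighbors, parts, i), -len(parts[i])))
        parts[best].add(v)
    return parts

# With a constant score this degenerates to the "Balanced" heuristic.
balanced = lambda v, neighbors, parts, i: 0
print(stream_partition([(0, []), (1, [0]), (2, [1]), (3, [2])], 2, 4, 0.0, balanced))
# -> [{0, 2}, {1, 3}]
```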

Page 7

Lower Bounds On Orderings

Best balanced k-partition cuts … edges

Adversarial Ordering
- Give every other vertex
- See no edges till …!
- Can't compete

DFS Ordering
- Stream is connected
- Greedy will do optimally

Random Ordering
- Birthday paradox: won't see edges until …
- Still can't compete with … edges cut

Theory says these types of algorithms can't do well

Page 8

Current Approach in Real Systems

• Totally ignore edges and hash the vertex ID
• Pro
  – Fast to locate data
  – Doesn't require a complex DHT or synchronization
• Con
  – Hashing the vertex ID cuts a (1 − 1/k) fraction of the edges in expectation, for any order
  – Great simple approximation for MAX-CUT
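
The (1 − 1/k) expected cut fraction of a uniform hash can be checked empirically; this sketch simulates a uniform hash with random assignment (a stand-in, not any real system's hash function):

```python
import random

random.seed(0)
k = 4
n = 1000
# Toy graph: a ring, so exactly n edges; structure is irrelevant to hashing.
edges = [(i, (i + 1) % n) for i in range(n)]
# Simulated uniform hash: each vertex lands on a machine uniformly at random.
part = {v: random.randrange(k) for v in range(n)}
cut = sum(1 for u, v in edges if part[u] != part[v])
print(cut / len(edges))  # close to 1 - 1/4 = 0.75, regardless of stream order
```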

Page 9

Our Approach

• Evaluate 16 natural heuristics on 21 datasets, with each of the three orderings and varying numbers of partitions
• Find out which heuristics work on each graph
• Compare these with the results of
  – Random hashing, to get the worst case
  – METIS, to get 'best' offline performance

Page 10

Caveats

• METIS is a heuristic, not a true lower bound
  – Does fine in practice
  – Available online for reproducing results
• Used publicly available datasets
  – Public graph datasets tend to be much smaller than what companies have
• Using metadata for partitioning can be good
  – Partitioning the web graph by URL
  – Using geographic location for social network users

Page 11

Heuristics

• Balanced• Chunking• Hashing• (weighted)

Deterministic Greedy• (weighted) Randomized

Greedy• Triangles• Balance Big

Uses a Buffer of size • Prefer Big• Avoid Big• Greedy EvoCut

Weight functionsUnweighted

Linear weighted

Exponentially weighted
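
The linearly weighted deterministic greedy variant can be sketched as follows. This is an illustrative reconstruction assuming the score |Pᵢ ∩ N(v)| · (1 − |Pᵢ|/C), where C is the capacity; the toy graph and names are made up, not the authors' code:

```python
def ldg_assign(neighbors, parts, cap):
    """Linearly weighted deterministic greedy: prefer the machine holding the
    most already-arrived neighbors of v, discounted by how full it is."""
    def score(i):
        return len(parts[i] & neighbors) * (1.0 - len(parts[i]) / cap)
    # Ties go to the least-loaded machine.
    return max(range(len(parts)), key=lambda i: (score(i), -len(parts[i])))

# Toy stream: two triangles {0,1,2} and {3,4,5} arriving in order.
# `arrived[v]` holds only the neighbors of v seen so far in the stream.
parts = [set(), set()]
arrived = {0: set(), 1: {0}, 2: {0, 1}, 3: set(), 4: {3}, 5: {3, 4}}
for v in range(6):
    parts[ldg_assign(arrived[v], parts, cap=3)].add(v)
print(parts)  # -> [{0, 1, 2}, {3, 4, 5}]
```

The linear discount keeps a nearly full machine from swallowing every new vertex, which is how balance is maintained without a hard reshuffle.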

Page 12

Datasets

• Includes finite element meshes, citation networks, social networks, web graphs, protein networks, and synthetically generated graphs
• Sizes: 297 vertices to 41.7 million vertices
• Synthetic graph models
  – Barabási-Albert (preferential attachment)
  – RMAT (Kronecker)
  – Watts-Strogatz
  – Power-law clustered
• Biggest graphs: LiveJournal and Twitter

Page 13

Experimental Method

• For each graph, heuristic, and ordering, partition into 2, 4, 8, and 16 pieces
• Compare with a random cut (upper bound)
• Compare with METIS (lower bound)
• Performance was measured by:

  (#edges cut by random cut − #edges cut by heuristic) / (#edges cut by random cut − #edges cut by METIS)
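
This ratio reads as a normalized score: 0 means no better than a random cut, 1 means matching METIS. A minimal sketch with made-up edge counts:

```python
def normalized_score(heuristic_cut, random_cut, metis_cut):
    """0 = no better than a random cut; 1 = matches METIS."""
    return (random_cut - heuristic_cut) / (random_cut - metis_cut)

# Hypothetical numbers: random cuts 900 edges, METIS cuts 100.
# A heuristic cutting 300 closes 3/4 of the gap to METIS.
print(normalized_score(300, 900, 100))  # -> 0.75
```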

Page 14

Heuristic Results

The best heuristic, LDG (Linear Deterministic Greedy), gets an average improvement of 76% over all datasets!

[Figure: results per dataset class (synthetic, social network, finite element mesh) for Hash and METIS under the Random, BFS, and DFS orderings]

Page 15

Scaling in the Size of Graphs: Exploiting Synthetic Graphs

[Figure: edge cut vs. graph size for LDG, Hash, and METIS]

Page 16

More Observations

• BFS is a superior ordering for all algorithms
• Avoid Big does 46% WORSE on average than a random cut
• Further experiments showed Linear Deterministic Greedy has identical performance to Deterministic Greedy with load-based tie-breaking

Page 17

Results on a Real System

• Compared the streamed partitioning with random hashing on SPARK, a distributed cluster computation system (http://www.spark-project.org/)
• Used 2 datasets
  – LiveJournal: 4.6 million users, 77 million edges
  – Twitter: 41.7 million users, 1.468 billion edges
• Computed the PageRank of each graph

Page 18

Results on SPARK

                   LJ Hash   LJ Stream   Twitter Hash   Twitter Stream
Naïve PR Mean      296.2 s   181.5 s     1199.4 s       969.3 s
Naïve PR STD         5.5 s     2.2 s       81.2 s        16.9 s
Combiner PR Mean   155.1 s   110.4 s      599.4 s       486.8 s
Combiner PR STD      1.5 s     0.8 s       14.4 s         5.9 s

LJ Improvement: Naïve – 38.7%, Combiner – 28.8%
Twitter Improvement: Naïve – 19.1%, Combiner – 18.8%

LiveJournal – 4.6 million users, 77 million edges
Twitter – 41.7 million users, 1.468 billion edges

Page 19

Streaming graph partitioning is a really nice, simple, effective preprocessing step.

Page 20

Where to now?

• Can we explain theoretically why the greedy algorithm performs so well?*
• What heuristics work better?
• What heuristics are optimal for different classes of graphs?
• Use multiple parallel streams!
• Implement in real systems!

*Work under submission: I. Stanton, Streaming Balanced Graph Partitioning Algorithms for Random Graphs

[email protected]

Page 21

Acknowledgements

At MSR:
• David B. Wecker
• Burton Smith
• Reid Andersen
• Nikhil Devanur
• Sameh Elkinety
• Sreenivas Gollapudi
• Yuxiong He
• Rina Panigrahy
• Yuval Peres

At Berkeley (CS and Statistics):
• Satish Rao
• Virginia Vassilevska Williams
• Alexandre Stauffer
• Ngoc Mai Tran
• Miklos Racz
• Matei Zaharia

Supported by NSF and NDSEG fellowships, NSF grant CCF-0830797, and an internship at Microsoft Research’s eXtreme Computing Group.

[email protected]