Distributed K-Betweenness
Complex Network Analysis
Daniel Marcous and Yotam Sandbank
dmarcous@gmail.com, yotamsandbank@gmail.com
Centrality
❖ Core concept in complex network analysis
❖ Different measures:
  ❖ Closeness
  ❖ Degree
  ❖ Betweenness
Betweenness
Betweenness computation
Expensive Computation!
Distributed Betweenness
❖ Independent computation for each node
❖ Why not run on different machines?
❖ Betweenness computation not implemented in GraphX
❖ Algorithm:
  ❖ Divide nodes between machines
  ❖ On each machine, compute the Betweenness contribution of each of its nodes to every other node in the graph
  ❖ Aggregate results from all machines
❖ Problems:
  ❖ Can’t get information about a specific node in GraphX
  ❖ Need to copy the graph to every machine (scales badly with big graphs)
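The algorithm above can be sketched in plain Python. This is a minimal single-process illustration, not the GraphX implementation: each loop iteration stands in for one machine, and the names `source_dependencies` and `distributed_betweenness` are hypothetical. It assumes an undirected, unweighted graph given as an adjacency dict, and uses Brandes' per-source accumulation for the contribution step.

```python
from collections import deque, defaultdict


def source_dependencies(adj, s):
    """Brandes' single-source step: contribution of source s to every other vertex."""
    dist = {s: 0}
    sigma = defaultdict(int)      # number of shortest paths from s
    sigma[s] = 1
    preds = defaultdict(list)     # predecessors on shortest paths from s
    order = []                    # vertices in non-decreasing distance from s
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = defaultdict(float)
    for w in reversed(order):     # accumulate from the farthest vertices back to s
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    delta.pop(s, None)            # a source does not contribute to itself
    return delta


def distributed_betweenness(adj, n_workers):
    """Split sources into n_workers chunks, compute partial sums, then aggregate."""
    nodes = sorted(adj)
    chunks = [nodes[i::n_workers] for i in range(n_workers)]
    partials = []
    for chunk in chunks:          # each iteration stands in for one machine
        local = defaultdict(float)
        for s in chunk:
            for v, d in source_dependencies(adj, s).items():
                local[v] += d
        partials.append(local)
    total = {v: 0.0 for v in adj}
    for local in partials:        # aggregation step
        for v, d in local.items():
            total[v] += d
    # Undirected graph: every pair is counted once from each endpoint.
    return {v: c / 2 for v, c in total.items()}
```

Note that every simulated machine reads the full `adj`, which mirrors the second problem listed above: the whole graph has to be available wherever a source is processed.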
❖ Solutions:
  ❖ Can’t get information about a specific node in GraphX
    ❖ GraphX Pregel API
    ❖ Run 1 iteration, with every node passing its identity to all its neighbors
  ❖ Need to copy graph to every machine (goes wrong with big graphs)
    ❖ We didn’t find a good solution for this problem
    ❖ How can we avoid copying the whole graph to every machine?
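The single Pregel iteration described above can be pictured with a small simulation. This is a sketch in plain Python rather than the GraphX Pregel API: `superstep_send_ids` is a hypothetical name, and the graph is assumed to be undirected and given as an edge list.

```python
from collections import defaultdict


def superstep_send_ids(edges):
    """Simulate one Pregel-style superstep in which every vertex sends its ID
    along each incident edge; returns the IDs each vertex received."""
    inbox = defaultdict(set)
    for src, dst in edges:
        inbox[dst].add(src)   # message from src to dst
        inbox[src].add(dst)   # undirected graph: dst also messages src
    # After the superstep, each vertex knows its neighbors locally.
    return {v: sorted(ids) for v, ids in inbox.items()}
```

After this step the neighbor list is part of each vertex's own state, so later computation can read it without querying the graph for a specific node.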
Distributed K-Betweenness
Quotes by Adriana Iamnitchi, University of South Florida
Implementation
❖ Technology:
  ❖ Spark 1.5.2
  ❖ Scala 2.10
  ❖ GraphX 1.5.2 (+ Pregel API)
❖ Steps:
  ❖ Create K-graphlets
    ❖ Pregel
  ❖ Parallel BC calculation - contribution of vertex X to other vertices’ BC
    ❖ Local for each vertex’s graphlet
    ❖ Brandes
    ❖ Also parallelized for each vertex in k-graphlet
  ❖ BC Aggregation - final kBC score for each vertex
    ❖ Reduce
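The steps above can be illustrated with a compact single-process sketch. It is not the package's Scala code, and `k_betweenness` is a hypothetical name. It assumes that k-betweenness counts only shortest paths of at most k hops, so each source explores just its k-hop neighborhood (its K-graphlet), runs a Brandes-style accumulation there, and the per-source contributions are summed at the end. The graph is assumed undirected and unweighted.

```python
from collections import deque, defaultdict


def k_betweenness(adj, k):
    """Betweenness restricted to shortest paths of at most k hops (assumed definition)."""
    score = {v: 0.0 for v in adj}
    for s in adj:                          # independent per source, so parallelizable
        dist = {s: 0}
        sigma = defaultdict(int)           # shortest-path counts from s
        sigma[s] = 1
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            if dist[v] == k:
                continue                   # stay inside the k-hop graphlet of s
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):          # Brandes accumulation, local to the graphlet
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]       # aggregation: sum contributions per vertex
    # Undirected graph: every pair is counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}
```

Because each source only touches vertices within k hops, the work and memory per source depend on graphlet size rather than on the size of the whole graph.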
Code
Usage
Tuning
Do it yourself
❖ The project can be found on GitHub:
  ❖ https://github.com/dmarcous/spark-betweenness
❖ Accessible as a Spark Package!
  ❖ http://spark-packages.org/package/dmarcous/spark-betweenness
  ❖ Spark code (Scala / Java), spark-shell, spark-submit, PySpark APIs
Experiment design
❖ Amazon EMR cluster
  ❖ 1 master
  ❖ 4 worker nodes
  ❖ r3.2xlarge
    ❖ 8 vCPU
    ❖ 61 GB RAM
    ❖ 160 GB SSD
❖ 6 Datasets
  ❖ Different sizes (|E| / |V|)
  ❖ Different diameters
❖ Implementations
  ❖ spark-betweenness
  ❖ NetworkX
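Results
Name       | Type           | Description            | |V|     | |E|     | Diameter | K | Single | Spark
HW2        | Random         | Small random generated | 3015    | 5156    | 9        | 3 | 31     | 240
Facebook   | Social         | Social circles         | 4039    | 88234   | 8        | 3 | 210    | 601
Facebook   | Social         | Social circles         | 4039    | 88234   | 8        | 4 | 349    | -1
Brightkite | Social         | Friendship network     | 58228   | 428156  | 16       | 3 | -1     | 2160
Amazon     | Social         | Customer co-purchases  | 334863  | 925872  | 44       | 3 | -1     | 489
Amazon     | Social         | Customer co-purchases  | 334863  | 925872  | 44       | 4 | -1     | 5707
Amazon     | Social         | Customer co-purchases  | 334863  | 925872  | 44       | 5 | -1     | -1
roadNet-CA | Infrastructure | Road net of California | 1965206 | 2766607 | 849      | 3 | -1     | 139
roadNet-CA | Infrastructure | Road net of California | 1965206 | 2766607 | 849      | 4 | -1     | 356
roadNet-CA | Infrastructure | Road net of California | 1965206 | 2766607 | 849      | 5 | -1     | 638
roadNet-TX | Infrastructure | Road net of Texas      | 1379917 | 3843324 | 1054     | 3 | -1     | 85
roadNet-TX | Infrastructure | Road net of Texas      | 1379917 | 3843324 | 1054     | 4 | -1     | 305
roadNet-TX | Infrastructure | Road net of Texas      | 1379917 | 3843324 | 1054     | 5 | -1     | 600
-1 means it either crashed or didn’t finish in a long time (over an hour)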
Results
❖ Performs great on graphs with large diameter
❖ Large K-graphlets are “impossible” to store in memory and send between machines
❖ Not good for graphs with small diameter (very slow, sometimes crashes)
❖ Very hard to tune (number of cores, memory for each process, and so on)
Conclusions
❖ Distributed Betweenness – good idea in theory, hard to implement
❖ Multi-threaded on a single strong machine might do the job
❖ Our implementation – great for large diameter graphs (road networks, power grids, and more)