Distributed K-Betweenness (Spark)

Complex Network Analysis

Daniel Marcous and Yotam Sandbank

dmarcous@gmail.com, yotamsandbank@gmail.com


Centrality

❖ Core concept in complex network analysis

❖ Different measures:
    ❖ Closeness
    ❖ Degree
    ❖ Betweenness

Betweenness

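For reference, the standard definition of betweenness centrality (stated here as general background, not taken from the slide) scores a vertex by the fraction of shortest paths between other vertex pairs that pass through it. The notation is an assumption of this note: \sigma_{st} is the number of shortest paths from s to t, and \sigma_{st}(v) is the number of those paths that pass through v.

```latex
C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
```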

Betweenness computation



Expensive Computation!
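One way to see where the cost comes from is Brandes' algorithm, which the Implementation slide names. It avoids enumerating all vertex pairs by computing, for each source s, a dependency \delta_s(v) with one BFS and one backward sweep. The notation is an assumption of this note: P_s(w) is the set of predecessors of w on shortest paths from s, and \sigma_{sv} counts shortest paths from s to v. Even in this form the total work on an unweighted graph is O(|V| |E|), one traversal per source.

```latex
\delta_s(v) = \sum_{w \,:\, v \in P_s(w)} \frac{\sigma_{sv}}{\sigma_{sw}} \bigl(1 + \delta_s(w)\bigr),
\qquad
C_B(v) = \sum_{s \neq v} \delta_s(v)
```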

Distributed Betweenness

❖ Independent computation for each node

❖ Why not run on different machines?

❖ Betweenness computation not implemented in GraphX


❖ Algorithm:
    ❖ Divide nodes between machines
    ❖ For each machine, compute the Betweenness contribution of each node to every other node in the graph
    ❖ Aggregate results from all machines
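The three steps above can be sketched on a single machine in plain Python. This is a minimal illustration and not the project's Spark code: the function names (`source_contributions`, `partition`, `betweenness_partitioned`) are hypothetical, each "machine" is just a loop iteration, and the graph is assumed to be undirected, unweighted, and stored as an adjacency dict.

```python
from collections import deque, defaultdict


def source_contributions(adj, s):
    """Brandes single-source step: how much s contributes to every other vertex."""
    dist = {s: 0}
    sigma = defaultdict(int)   # number of shortest paths from s
    sigma[s] = 1
    preds = defaultdict(list)  # predecessors on shortest paths from s
    order = []                 # vertices in BFS visiting order
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = defaultdict(float)
    for w in reversed(order):  # backward sweep, farthest vertices first
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta.pop(s, None)         # a source does not contribute to itself
    return dict(delta)


def partition(nodes, n_parts):
    """Step 1: divide the source nodes into n_parts groups (round robin)."""
    nodes = sorted(nodes)
    return [nodes[i::n_parts] for i in range(n_parts)]


def betweenness_partitioned(adj, n_parts=2):
    partials = []
    for part in partition(adj, n_parts):      # each part stands in for one machine
        local = defaultdict(float)
        for s in part:                        # Step 2: per-source contributions
            for v, d in source_contributions(adj, s).items():
                local[v] += d
        partials.append(local)
    total = {v: 0.0 for v in adj}
    for local in partials:                    # Step 3: aggregate partial results
        for v, d in local.items():
            total[v] += d
    # Undirected graph: every pair was counted from both endpoints.
    return {v: d / 2.0 for v, d in total.items()}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = betweenness_partitioned(path, n_parts=2)
print(scores)  # {0: 0.0, 1: 2.0, 2: 2.0, 3: 0.0}
```

Note that every partition still reads the full `adj`, which is the graph-copy problem the slides raise.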

❖ Problems:
    ❖ Can’t get information about a specific node in GraphX
    ❖ Need to copy graph to every machine (goes bad with big graphs)


❖ Solutions:
    ❖ Can’t get information about a specific node in GraphX
        ❖ GraphX Pregel API
        ❖ Run 1 iteration, with every node passing its identity to all its neighbors
    ❖ Need to copy graph to every machine (goes wrong with big graphs)
        ❖ We didn’t find a good solution for this problem
        ❖ How can we avoid copying the whole graph to every machine?
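The one-iteration identity exchange can be simulated in plain Python to show what each vertex ends up holding. This is a sketch of the idea only, not GraphX code: `exchange_ids` is a hypothetical name, and the send, merge, and vertex-program phases are collapsed into one function over an undirected edge list.

```python
def exchange_ids(edges):
    """Simulate one Pregel-style superstep on an undirected edge list."""
    inbox = {}
    for src, dst in edges:
        inbox.setdefault(src, set())
        inbox.setdefault(dst, set())
        # Send phase: each endpoint passes its own id across the edge.
        # Merge phase: ids arriving at the same vertex are unioned.
        inbox[dst].add(src)
        inbox[src].add(dst)
    # Vertex program: keep the merged messages as the vertex attribute.
    return {v: sorted(ids) for v, ids in inbox.items()}


edges = [(0, 1), (1, 2), (2, 3)]
neighbors = exchange_ids(edges)
print(neighbors[1])  # [0, 2]
```

After this step each vertex carries its own neighbor list, so later stages can read a node's neighborhood from the vertex attribute instead of querying the graph.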


Quotes by Adriana Iamnitchi, University of South Florida

Implementation

❖ Technology:
    ❖ Spark 1.5.2
    ❖ Scala 2.10
    ❖ GraphX 1.5.2 (+ Pregel API)

❖ Steps:
    ❖ Create K-graphlets
        ❖ Pregel
    ❖ Parallel BC calculation - contribution of vertex X to other vertices' BC
        ❖ Local for each vertex's graphlet
        ❖ Brandes
        ❖ Also parallelized for each vertex in k-graphlet
    ❖ BC Aggregation - final kBC score for each vertex
        ❖ Reduce
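The k-bounded variant of these steps can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: the names `k_contributions`, `merge`, and `k_betweenness` are hypothetical, the graph is undirected and unweighted, and the k-graphlet of a vertex is taken to be everything a BFS from it reaches within k hops. Only shortest paths of length at most k are counted.

```python
from collections import deque, defaultdict
from functools import reduce


def k_contributions(adj, s, k):
    """Brandes step restricted to the k-hop neighborhood (k-graphlet) of s."""
    dist = {s: 0}
    sigma = defaultdict(int)
    sigma[s] = 1
    preds = defaultdict(list)
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        if dist[v] == k:
            continue  # graphlet boundary: do not expand past k hops
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = defaultdict(float)
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta.pop(s, None)
    return dict(delta)


def merge(a, b):
    """Reduce step: add two partial score maps."""
    out = dict(a)
    for v, d in b.items():
        out[v] = out.get(v, 0.0) + d
    return out


def k_betweenness(adj, k):
    # Map: one local computation per vertex, independent of all the others.
    per_vertex = [k_contributions(adj, s, k) for s in adj]
    # Reduce: sum the contributions into one kBC score per vertex.
    total = reduce(merge, per_vertex, {v: 0.0 for v in adj})
    # Undirected graph: every pair was counted from both endpoints.
    return {v: d / 2.0 for v, d in total.items()}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
kbc2 = k_betweenness(path, 2)
kbc3 = k_betweenness(path, 3)
print(kbc2)  # {0: 0.0, 1: 1.0, 2: 1.0, 3: 0.0}
```

On this 4-vertex path, k = 3 already covers the whole graph, so `kbc3` matches ordinary betweenness. With k = 2 the path from 0 to 3 is too long to count and the inner scores drop from 2.0 to 1.0.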

Tuning

Do it yourself

❖ The project can be found on GitHub:
    ❖ https://github.com/dmarcous/spark-betweenness

❖ Accessible as a Spark Package!
    ❖ http://spark-packages.org/package/dmarcous/spark-betweenness
    ❖ Spark code (Scala / Java), spark-shell, spark-submit, PySpark APIs


Experiment design

❖ Amazon EMR cluster
    ❖ 1 master
    ❖ 4 worker nodes
    ❖ r3.2xlarge
        ❖ 8 vCPU
        ❖ 61 GB RAM
        ❖ 160 GB SSD

❖ 6 Datasets
    ❖ Different sizes (|E| / |V|)
    ❖ Different diameters

❖ Implementations
    ❖ spark-betweenness
    ❖ NetworkX

Results

Name        Type            Description             |V|      |E|      Diameter  K  Single  Spark
HW2         Random          Small random generated  3015     5156     9         3  31      240
Facebook    Social          Social circles          4039     88234    8         3  210     601
                                                                                4  349     -1
Brightkite  Social          Friendship network      58228    428156   16        3  -1      2160
Amazon      Social          Customer co-purchases   334863   925872   44        3  -1      489
                                                                                4  -1      5707
                                                                                5  -1      -1
roadNet-CA  Infrastructure  Road net of California  1965206  2766607  849       3  -1      139
                                                                                4  -1      356
                                                                                5  -1      638
roadNet-TX  Infrastructure  Road net of Texas       1379917  3843324  1054      3  -1      85
                                                                                4  -1      305
                                                                                5  -1      600

-1 means it either crashed or didn’t finish in a long time (over an hour)


Results

❖ Performs great on graphs with large diameter
    ❖ Large K-graphlets are “impossible” to store in memory and send between machines

❖ Not good for graphs with small diameter (very slow, sometimes crashes)

❖ Very hard to tune (how many cores, how much memory for each process, and so on)

Conclusions

❖ Distributed Betweenness – good idea in theory, hard to implement

❖ Multi-threaded on a single strong machine might do the job

❖ Our implementation – great for large diameter graphs (road networks, power grids, and more)