Large Scale Centrality Measures in Apache Flink and Apache ... · Programming model of Apache...

Master Thesis

Thesis Advisor: Sebastian Schelter, Research Associate

Thesis Supervisor: Prof. Dr. Volker Markl

Large Scale Centrality Measures in

Apache Flink and Apache Giraph

Submitted by

Janani Chakkaradhari

5 September 2014

Centrality measures identify the most central

nodes in a network

Targeted advertising minimizes resources and effort

required for marketing

Centrality measures can identify the head of the terrorist network behind the 9/11 attacks

Krebs, 2002

Different notions of the “most central nodes”

Freeman, 1977

Degree Closeness Betweenness

Real-world networks are very large and sparse

Barabási, 2004

Big Data platforms to analyze large networks

Related work on parallel computation of

centrality measures

Kang et al. propose two novel algorithms for computing closeness and betweenness in the paper “Centralities in Large Networks: Algorithms and Observations” and implement them in Hadoop.

• Effective Closeness algorithm

- an approximate technique for closeness

• LineRank algorithm

- random walk betweenness

Comparison of parallel computing platforms by implementing and evaluating centrality measures

• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
• Evaluating the performance of each of the two platforms

[Diagram: closeness and betweenness of large networks, computed on parallel data processing platforms (Apache Flink)]

Programming model of Apache Giraph & Apache Flink for iterative graph processing

• Apache Giraph, a vertex-centric model for iterative graph processing based …
• Apache Flink offers a special iteration operator

[Figure: superstep i, superstep i+1, superstep i+2]

Stephan Ewen, 2014; Sebastian Schelter, 2012

[Figure: Flink iteration operators. Iteration: initial dataset, iterative function, result. Delta iteration: initial solution set, initial workset, result.]

Comparison of parallel computing platforms by implementing and evaluating centrality measures

1. Computation logic
2. Implementation

Iterative computation of the Effective Closeness algorithm

• Shortest paths between nodes: at each step, the algorithm counts the nodes that are newly reached at that shortest-path length
• Sum of the shortest paths: (2 × 1) + (2 × 2) + (1 × 3) = 9

[Figure: Step 1, Step 2, Step 3]
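The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the input assumes that 2 nodes are first reached at step 1, 2 at step 2, and 1 at step 3, as in the sum above.

```python
def sum_of_shortest_paths(new_nodes_per_step):
    """new_nodes_per_step[h - 1] is the number of nodes first reached at step h."""
    return sum(step * count
               for step, count in enumerate(new_nodes_per_step, start=1))


def sum_from_cumulative(reached_within):
    """reached_within[h] is the number of nodes within h steps (index 0 = the node itself)."""
    return sum(h * (reached_within[h] - reached_within[h - 1])
               for h in range(1, len(reached_within)))


# (2 x 1) + (2 x 2) + (1 x 3) = 9
print(sum_of_shortest_paths([2, 2, 1]))
# Same result from the running totals 1, 3, 5, 6
print(sum_from_cumulative([1, 3, 5, 6]))
```

The second form is the one an iterative implementation would naturally use, because each iteration only knows how many nodes have been reached so far.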

Computation of Effective Closeness in Apache Giraph

vid  bit_string
1    1 0 0 0 0 0
2    0 1 0 0 0 0
3    0 0 1 0 0 0
4    0 0 0 1 0 0
5    0 0 0 0 1 0
6    0 0 0 0 0 1

[Table: edges (src, des)]

Illustration of Effective Closeness in Apache Flink using Delta iteration

• Vertices: initial workset and solution set
• Edges: pairs of source and destination ids

Vertices:

vid  bit_string
1    1 0 0 0 0 0
2    0 1 0 0 0 0
3    0 0 1 0 0 0
4    0 0 0 1 0 0
5    0 0 0 0 1 0
6    0 0 0 0 0 1

⋈ vid=src with Edges (src, des)

Join output (des, bit_string), bit strings as shown:

0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 0 1 0
0 0 0 1 0 0
0 0 0 0 0 1
1 0 0 0 0 0
0 1 0 0 0 0
0 0 0 1 0 0

Illustration of Effective Closeness in Apache Flink using Delta iteration (2/4)

Join output, grouped with γ_src:

des  bit_string
1    0 1 0 0 0 0
1    0 0 0 0 1 0
2    0 0 1 0 0 0
2    1 0 0 0 0 0
2    0 0 0 0 1 0
5    0 0 0 1 0 0
5    1 0 0 0 0 0
5    0 1 0 0 0 0
3    0 1 0 0 0 0
3    0 0 0 1 0 0
4    0 0 0 0 0 1
4    0 0 0 0 1 0
4    0 0 1 0 0 0
6    0 0 0 1 0 0

Bit-OR

des  bit_string
1    1 1 0 0 1 0
2    1 1 1 0 1 0
3    0 1 1 1 0 0
4    0 0 1 1 1 1
5    1 1 0 1 1 0
6    0 0 0 1 0 1
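One iteration of this dataflow (join on vid = src, group, Bit-OR) can be sketched in plain Python. This is a minimal sketch rather than the Flink implementation: the name `one_step` is hypothetical, and the edge list is an assumption inferred from the (des, bit_string) rows of the join output. The vertex's own previous bit string is also OR-ed in, which is what reproduces the Bit-OR table on this slide.

```python
from collections import defaultdict

# Vertex table: vid -> 6-bit string, with the bit for vertex i set (vid 1 = leftmost bit).
vertices = {vid: 1 << (6 - vid) for vid in range(1, 7)}

# (src, des) pairs; assumed, inferred from the join output on the slide.
edges = [(2, 1), (5, 1), (3, 2), (1, 2), (5, 2), (4, 5), (1, 5),
         (2, 5), (2, 3), (4, 3), (6, 4), (5, 4), (3, 4), (4, 6)]


def one_step(bit_strings, edges):
    # Join vertices with edges on vid = src and project to (des, bit_string).
    joined = [(des, bit_strings[src]) for src, des in edges]
    # Group by destination and combine each group with bitwise OR.
    grouped = defaultdict(int)
    for des, bits in joined:
        grouped[des] |= bits
    # OR in the destination's own previous bit string.
    return {des: bits | bit_strings[des] for des, bits in grouped.items()}


updated = one_step(vertices, edges)
for vid in sorted(updated):
    print(vid, format(updated[vid], "06b"))
```

Representing each bit string as an integer keeps the Bit-OR a single `|` operation per joined row.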

Solution set / previous iteration's result:

vid  bit_string
1    1 0 0 0 0 0
2    0 1 0 0 0 0
3    0 0 1 0 0 0
4    0 0 0 1 0 0
5    0 0 0 0 1 0
6    0 0 0 0 0 1

⋈ vid=des

Updated result in current iteration:

des  bit_string
1    1 1 0 0 1 0
2    1 1 1 0 1 0
3    0 1 1 1 0 0
4    0 0 1 1 1 1
5    1 1 0 1 1 0
6    0 0 0 1 0 1

Termination condition check

• If prev count != current count, emit the updated node into the next workset
• Otherwise emit nothing

Next workset:

1    1 1 0 0 1 0
2    1 1 1 0 1 0
3    0 1 1 1 0 0
4    0 0 1 1 1 1
5    1 1 0 1 1 0
6    0 0 0 1 0 1
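The whole delta iteration, including the termination check, can be simulated in plain Python as follows. This is a sketch under several assumptions, not the thesis implementation: the function name is hypothetical, the edge list is inferred from the join output shown earlier, "count" is taken to be the number of set bits in a node's bit string, and exact one-bit-per-vertex strings are used as drawn on the slides instead of an approximate sketch.

```python
def effective_closeness_sums(vertex_ids, edges):
    """Return, per vertex, the sum of shortest-path lengths to all reached vertices."""
    position = {v: i for i, v in enumerate(vertex_ids)}
    solution = {v: 1 << position[v] for v in vertex_ids}  # solution set
    workset = dict(solution)                              # initial workset
    path_sum = {v: 0 for v in vertex_ids}
    step = 0
    while workset:  # the iteration ends when the workset is empty
        step += 1
        # Join the workset with edges on vid = src, group by des, Bit-OR.
        candidates = {}
        for src, des in edges:
            if src in workset:
                candidates[des] = candidates.get(des, 0) | workset[src]
        # Join with the solution set and apply the termination check.
        next_workset = {}
        for des, bits in candidates.items():
            updated = solution[des] | bits
            prev_count = bin(solution[des]).count("1")
            curr_count = bin(updated).count("1")
            if prev_count != curr_count:
                # Nodes newly reached at this step lie at distance `step`.
                path_sum[des] += step * (curr_count - prev_count)
                next_workset[des] = updated
        solution.update(next_workset)
        workset = next_workset
    return path_sum


# (src, des) pairs; assumed, inferred from the join output on the slides.
edges = [(2, 1), (5, 1), (3, 2), (1, 2), (5, 2), (4, 5), (1, 5),
         (2, 5), (2, 3), (4, 3), (6, 4), (5, 4), (3, 4), (4, 6)]
sums = effective_closeness_sums([1, 2, 3, 4, 5, 6], edges)
print(sums[1])  # 9, matching (2 x 1) + (2 x 2) + (1 x 3)
```

Only vertices whose bit string changed are propagated in the next round, which is why the amount of data shrinks as the iteration converges.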

Illustration of Effective Closeness in Apache Flink using delta iteration

[Figure: dataflow with the labels REDUCE, Step Function, Update Function]

Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations
• Hence both computing models for Effective Closeness exploit the sparse nature of real-world graphs

Idea behind LineRank Algorithm

• Betweenness is computed by finding the importance scores of the edges incident to a node (Kang et al., 2011)

[Figure: PageRank iteration yielding the eigenvector / rank of nodes in L(G)]

• Problem: the line graph L(G) is larger than the original graph

$r = T^k r_0$

Challenges in implementing LineRank in Apache Giraph

• The power iteration requires a two-step matrix-vector multiplication with two sparse matrices (incoming and outgoing edges)
• The vertex state in LineRank is an edge score, which conflicts with the vertex-centric computation model
• How can the two-stage matrix-vector multiplication be achieved in Giraph?

$v \leftarrow L(G)\,v \;\leftrightarrow\; v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

Proposed solution for implementing LineRank algorithm in Apache Giraph (1/2)

• Illustration of “think like a vertex”
• Let us compute the step v2 in the first iteration for our example graph

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

Proposed solution for implementing LineRank algorithm in Apache Giraph (2/2)

[Figure: pseudo-code]

• The current state of the vertex is assigned the result of computing v2
• The messages exchanged in the iteration are treated as the edge scores v3

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$

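Illustration of proposed solution to implement LineRank in Apache Giraph

[Figure: input graph with vertices 1, 2, 3; values shown at superstep 0: 0.2, 0.2, 0.2, 0.2, 0.1; values shown at superstep 1: 0.3, 0.3, 0.1, 0.1, 0.15]

$v_2 \leftarrow S(G)^T v_1,\quad v_3 \leftarrow T(G)\,v_2$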

Implementation of LineRank in Apache Flink

Summary of LineRank implementation

• The two-step matrix-vector multiplication is hard to implement in Apache Giraph
• Remodeling the LineRank computation for Apache Giraph requires in-depth knowledge at both the platform level and the algorithmic level
• Programming computationally intensive iterative algorithms in Apache Flink is simple and flexible

Comparison of parallel computing platforms by implementing and evaluating centrality measures

1. Computation logic
2. Implementation

Evaluation – Dataset

Evaluating scalability of Effective Closeness in Apache Giraph &

Apache Flink (Runtime vs Edges)

*Fixed number of parallel tasks 30

Evaluating scalability of LineRank in Apache Giraph & Apache Flink

(Runtime vs Edges)

*Fixed graph data

LineRank in Flink: Runtime vs Number of cores

LineRank in Giraph: Runtime vs Number of cores

Evaluation – Comparing the performance of Apache

Giraph and Apache Flink

LineRank Effective Closeness

No. of cores = 15

Evaluation – Comparing the performance of

Apache Giraph and Apache Flink

• Apache Giraph incorporates hash-based aggregations
• Apache Flink uses a sort-based technique for aggregations
• Apache Giraph has an efficient mechanism for estimating memory requirements

Conclusion

• The implementation of Effective Closeness exploits the sparse nature of real-world graphs
• The programming model of Apache Giraph is not flexible for computations that involve multi-step matrix-vector multiplication, whereas Apache Flink is more flexible for these computations
• Efficient optimizations in Apache Giraph make it perform better than Apache Flink
• The implementations of these algorithms are intended as contributions to the Apache Flink open source community

Future Work

• This work can be extended to evaluate computation-intensive centrality algorithms on other parallel data processing systems such as Apache Spark and Distributed GraphLab

References

[1] Kang, U., et al. "Centralities in Large Networks: Algorithms and Observations." SDM. Vol. 2011. 2011.
[2] Schelter, Sebastian. "Introducing Apache Giraph for Large Scale Graph Processing." slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing, 2012.
[3] Krebs, Valdis E. "Mapping networks of terrorist cells." Connections 24.3 (2002):
[4] Ewen, Stephan, et al. "Spinning fast iterative data flows." Proceedings of the VLDB Endowment 5.11 (2012): 1268-1279.
[5] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[6] Freeman, Linton C. "A set of measures of centrality based on betweenness." Sociometry (1977): 35-41.
[9] Ewen, Stephan. "Stratosphere, Next-Gen Data Analytics Platform." Hadoop Summit Europe, 2014.
[8] Barabasi, Albert-Laszlo, and Zoltan N. Oltvai. "Network biology: understanding the cell's functional organization." Nature Reviews Genetics 5.2 (2004): 101-113.

Backup Slides

Summary of Effective Closeness implementation

• Both implementations reduce the amount of data to be processed in successive iterations
• Hence both computing models for Effective Closeness exploit the sparse nature of real-world graphs

[Figure: highly connected vs. less connected]

LineRank Dataflow in Apache Flink

Final step in the proposed solution to implement LineRank in Apache Giraph

• Aggregate the computed edge scores (incoming and outgoing edges)
• The computation of v2 represents the aggregation of incoming edge scores

Bet(1) = R(a)+R(b)+R(c)+R(d)

Bet(2) = R(a)+R(b)+R(e)

Bet(3) = R(c)+R(d)+R(e)
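The aggregation above can be sketched in Python. This is only an illustration: the function name is hypothetical, the edge endpoints are read off the S(G) and T(G) incidence tables of the example graph in these backup slides, and the edge scores R are made-up values, not results from the thesis.

```python
# Edge label -> (source, target), as encoded by S(G) and T(G) for the example graph.
edges = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}

# Hypothetical edge scores R (illustrative values only).
R = {"a": 0.25, "b": 0.25, "c": 0.125, "d": 0.125, "e": 0.25}


def betweenness_from_edge_scores(edges, scores):
    """Sum the scores of all edges incident to each node (incoming and outgoing)."""
    bet = {}
    for label, (src, dst) in edges.items():
        for node in {src, dst}:  # a set, so a self-loop would count once
            bet[node] = bet.get(node, 0.0) + scores[label]
    return bet


bet = betweenness_from_edge_scores(edges, R)
print(bet[1])  # R(a) + R(b) + R(c) + R(d)
print(bet[2])  # R(a) + R(b) + R(e)
print(bet[3])  # R(c) + R(d) + R(e)
```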

LineRank algorithm computes the random-walk betweenness without constructing the line graph

• L(G) is decomposed into two sparse matrices
– Source Incidence Matrix S(G) [Outgoing edges]
– Target Incidence Matrix T(G) [Incoming edges]
– $L(G) = T(G)\,S(G)^T$

edge  S(G)     T(G)
      1 2 3    1 2 3
a     1 0 0    0 1 0
b     0 1 0    1 0 0
c     1 0 0    0 0 1
d     0 0 1    1 0 0
e     0 1 0    0 0 1

[Worked example: the 5 × 5 product $T(G)\,S(G)^T$ for the example graph]
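The point of the decomposition is that one multiplication by L(G) can be replaced by two multiplications with the much smaller incidence matrices. A minimal Python sketch, using the S(G) and T(G) values of the example graph, checks that both routes give the same vector. The helper names and the input vector v1 are hypothetical, and the normalization and damping steps of the full LineRank algorithm are left out.

```python
# Incidence matrices of the example graph: rows are edges a..e, columns are nodes 1..3.
S = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]  # source incidence
T = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]  # target incidence


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def two_step_multiply(S, T, v1):
    v2 = matvec(transpose(S), v1)  # v2 <- S(G)^T v1, one entry per node
    v3 = matvec(T, v2)             # v3 <- T(G) v2, one entry per edge
    return v3


v1 = [1, 2, 3, 4, 5]                 # arbitrary edge vector for the check
L = matmul(T, transpose(S))          # explicit line graph matrix, built only for comparison
assert two_step_multiply(S, T, v1) == matvec(L, v1)
print(two_step_multiply(S, T, v1))
```

The two-step route never materializes the m × m matrix L(G); it only touches two m × n matrices, which is what makes the approach feasible for large sparse graphs.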

Power iteration in LineRank

Adapted from Kang et al. [1]
