Scalable Big Graph Processing in Map Reduce

Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, Xuemin Lin

Presented by Megan Bryant

College of William and Mary

February 11, 2015

Overview

In this presentation, we introduce methods for scalable big graph processing in MapReduce.

Specifically, we introduce a new class, SGC, which has the potential to guide the development of scalable graph processing algorithms in MapReduce.

Two new graph join operators are also introduced, which greatly enhance the capabilities of the SGC class.

Finally, we compare the performance of the three classes (MRC, MMC, and SGC) on several scalable graph algorithms.

Computational Complexity

Computational complexity theory provides a framework and a set of analysis tools for gauging the work performed by an algorithm, as measured by the elementary (i.e., basic) operations it performs.

The basic steps (operations) that an algorithm typically takes are:

Assignment (e.g. assigning some value to a variable)

Arithmetic (e.g. addition, subtraction, multiplication, and division)

Logical (e.g. comparison of two numbers)

Big-O Notation

We utilize Big-O notation to define the complexity of an algorithm.

Definition

An algorithm is said to run in $O(f(n))$ time if, for some constants $c$ and $n_0$, the time taken by the algorithm is at most $c \cdot f(n)$ for all $n \ge n_0$.

This is an example of worst-case analysis, which is independent of the computing environment, relatively easy to perform, and provides an upper bound on the number of steps (and hence the running time) an algorithm must take.

Big-O Complexity

Common Complexities

The following table lists the complexities of some common algorithms. For graphs, n is the number of nodes and m is the number of edges.

| Algorithm | Data structure | Time complexity | Space complexity |
|---|---|---|---|
| Depth First Search | Graph with n nodes and m edges | $O(n+m)$ | $O(n)$ |
| Breadth First Search | Graph with n nodes and m edges | $O(n+m)$ | $O(n)$ |
| Binary Search | Sorted array | $O(\log n)$ | $O(1)$ |
| Dijkstra's Shortest Path (unsorted array) | Graph with n nodes and m edges | $O(n^2)$ | $O(n)$ |

Algorithm Classes in Map Reduce

There are currently two main algorithm classes in the MapReduce paradigm:

The MapReduce Class (MRC).

The Minimal MapReduce Class (MMC).

These classes are defined in terms of disk usage, memory usage, communication cost, CPU cost, and number of MapReduce rounds.

There is also the popular Parallel Random-Access Machine (PRAM) model, against which performance studies were run.

Map Reduce Class

Let S be the set of objects in the problem and let t be the number of machines in the system.

Fix an ε > 0. A MapReduce algorithm in MRC should have the following properties:

| | Each machine | Total |
|---|---|---|
| Disk | $O(\lvert S\rvert^{1-\epsilon})$ | $O(\lvert S\rvert^{2-2\epsilon})$ |
| Memory | $O(\lvert S\rvert^{1-\epsilon})$ | $O(\lvert S\rvert^{2-2\epsilon})$ |
| Communication (per round) | $O(\lvert S\rvert^{1-\epsilon})$ | $O(\lvert S\rvert^{2-2\epsilon})$ |
| CPU (per round) | $O(\mathrm{poly}(\lvert S\rvert))$ | |
| Number of rounds | $O(\log^i \lvert S\rvert)$, $i \ge 0$ | |

Minimal Map Reduce Class

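Let S be the set of objects in the problem and let t be the number of machines in the system.

A MapReduce algorithm in MMC should have the following properties:

| | Each machine | Total |
|---|---|---|
| Disk | $O(\lvert S\rvert / t)$ | $O(\lvert S\rvert)$ |
| Memory | $O(\lvert S\rvert / t)$ | $O(\lvert S\rvert)$ |
| Communication (per round) | $O(\lvert S\rvert / t)$ | $O(\lvert S\rvert)$ |
| CPU | $O(T_{seq} / t)$* | |
| Number of rounds | $O(1)$ | |

* $T_{seq}$ is the time to solve the same problem on a single sequential machine.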

Parallel Random Access Machine

The Parallel Random Access Machine (PRAM) is an abstract model of parallel computation. It is an extension of the RAM model of sequential computation.

In this model, there are p processors connected to a single shared memory, and each processor has a unique index 1 ≤ i ≤ p called the processor id. A single program is executed in single-instruction stream, multiple-data stream (SIMD) fashion, meaning that each instruction is carried out by all processors simultaneously and requires unit time, regardless of the number of processors. Finally, each processor has a private flag that controls whether it is active in the execution of an instruction. Inactive processors do not participate in the execution of instructions, except for instructions to reset the flag.

We will later compare the performance of algorithms in this model to MRC, MMC, and SGC.
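
The lock-step execution with per-processor activity flags can be illustrated with a small simulation. This is only a sketch, not a real PRAM: the function name `pram_step` and the list-based shared memory are assumptions made for illustration.

```python
def pram_step(shared, active, instruction):
    """Apply one instruction on every active processor (ids 1..p).

    All processors read the same snapshot of shared memory, which mimics
    the instruction being carried out by all processors simultaneously.
    """
    snapshot = list(shared)
    for pid in range(1, len(active) + 1):
        if active[pid - 1]:  # inactive processors do not participate
            shared[pid - 1] = instruction(pid, snapshot)
    return shared


# Each active processor adds its left neighbour's value to its own cell.
memory = [1, 2, 3, 4]
flags = [False, True, True, True]  # processor 1 has no left neighbour
pram_step(memory, flags, lambda pid, mem: mem[pid - 1] + mem[pid - 2])
```

Reading from a snapshot is what makes the loop behave like one simultaneous step: processor 3 sees the old value of cell 2, not the value processor 2 has just written.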

MRC VS MMC

MRC defines the basic requirements for an algorithm to execute in MapReduce, whereas MMC requires a MapReduce algorithm to achieve optimality in several aspects simultaneously.

We will begin by analyzing the problems with MRC and MMC in graph processing.

Defining a Graph

Consider a graph G = (V, E), where V is the set of vertices (nodes) and E is the set of edges (arcs). Further, let n = |V| be the number of nodes and m = |E| be the number of edges.

A graph can be either directed or undirected, cyclic or acyclic, connected or disconnected.

We can represent a graph using either of the following:

Adjacency Matrix

Adjacency List
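
As a minimal sketch of the two representations, the following Python builds both from the same undirected edge list. The helper names and the convention that nodes are numbered 0 to n-1 are assumptions made for illustration.

```python
def to_adjacency_matrix(n, edges):
    """n x n matrix with a 1 wherever two nodes share an edge: O(n^2) space."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1  # undirected: mirror the entry
    return matrix


def to_adjacency_list(n, edges):
    """Map each node to the list of its neighbours: O(n + m) space."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


edges = [(0, 1), (0, 2), (2, 3)]
matrix = to_adjacency_matrix(4, edges)
adj_list = to_adjacency_list(4, edges)
```

The matrix answers "is there an edge between u and v?" in constant time, while the list is usually the better fit for large sparse graphs because its size grows with the number of edges rather than with n squared.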

Adjacency Matrix

Adjacency List

Scalable Graph Processing in MMC

For a graph G(V, E), a common graph operation is to exchange data among all adjacent nodes (nodes that share a common edge) in the graph.

The memory constraint in MMC requires that all edges and nodes are distributed evenly among all machines in the system.

This can be formalized as follows: let $E_{i,j}$ be the set of edges (u, v) in G such that u is in machine i and v is in machine j.

Scalable Graph Processing in MMC

The communication constraint in MMC can be formalized as follows:

$$\max_{1 \le i \le t} \Big( \sum_{1 \le j \le t,\ j \ne i} |E_{i,j}| \Big) \le O\Big(\frac{n+m}{t}\Big)$$

where, once again, $E_{i,j}$ is the set of edges (u, v) in G such that u is in machine i and v is in machine j.

In order to satisfy this inequality, we must minimize the maximum, i.e., solve

$$\min \max_{1 \le i \le t} \Big( \sum_{1 \le j \le t,\ j \ne i} |E_{i,j}| \Big).$$

However, this problem is NP-hard, meaning that it is at least as hard as the hardest problems in NP.
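
To make the quantity being bounded concrete, the sketch below evaluates the left-hand side of the constraint for one given placement of nodes on machines. The function name `max_cross_machine_edges` and the `machine_of` mapping are hypothetical, and edges are treated as directed pairs (u, v), so an undirected edge appears twice.

```python
from collections import defaultdict


def max_cross_machine_edges(edges, machine_of):
    """Return max over machines i of the sum over j != i of |E_{i,j}|."""
    crossing = defaultdict(int)
    for u, v in edges:
        i, j = machine_of[u], machine_of[v]
        if i != j:  # only edges that leave machine i cost communication
            crossing[i] += 1
    return max(crossing.values(), default=0)


# A star graph: node 0 is connected to nodes 1..4 (stored in both directions).
star = [(0, k) for k in range(1, 5)] + [(k, 0) for k in range(1, 5)]
balanced = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}
hub_alone = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1}
```

With the `hub_alone` placement every edge crosses machines, so the maximum equals the number of undirected edges. This evaluates a single placement only; finding the placement that minimizes the maximum is the NP-hard part.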

Scalable Graph Processing in MMC

In addition to the problem being NP-hard, even if the optimal solution to

$$\min \max_{1 \le i \le t} \Big( \sum_{1 \le j \le t,\ j \ne i} |E_{i,j}| \Big)$$

is successfully computed, we cannot guarantee that it satisfies the bound $O\big(\frac{n+m}{t}\big)$, since it might be as large as $O(n+m)$.

Therefore, MMC is not a suitable class for scalable graph processing.

Scalable Graph Processing in MRC

MRC has fewer constraints than MMC, as it simply defines the basic conditions that a MapReduce algorithm should satisfy, and a graph algorithm in MapReduce is no exception. As with MMC, however, we can define a class better suited to scalable graph processing.

Given a graph G(V, E) with n nodes and m edges, assume that $m \ge n^{1+c}$. A class based on MRC can then be defined for graph processing in MapReduce, in which a MapReduce algorithm has the following properties:

| | Each machine | Total |
|---|---|---|
| Disk | $O(n^{1+c/2})$ | $O(m^{1+c/2})$ |
| Memory | $O(n^{1+c/2})$ | $O(m^{1+c/2})$ |
| Communication (per round) | $O(n^{1+c/2})$ | $O(m^{1+c/2})$ |
| CPU (per round) | $O(\mathrm{poly}(m))$ | |
| Number of rounds | $O(1)$ | |

Scalable Graph Processing in MRC

This class has the good property that the algorithm runs in a constant number of rounds. However, the memory constraint can cause difficulty, as the required memory is large even for a dense graph.

(Note: Dense graphs are generally easier to solve than sparse graphs.)

Furthermore, if the memory of each machine cannot hold $O(n^{1+c/2})$ data, then the algorithm will always fail. Thus, the class is not scalable and cannot handle large n.

Scalable Graph Processing Class

We will now formulate a new algorithm class that addresses this deficiency. First, we weaken the bound on the communication cost per machine from $O\big(\frac{m+n}{t}\big)$ to $O\big(\frac{m}{t}, D(G,t)\big)$.

This is done to account for the fact that graphs, especially large graphs, can have a skewed degree distribution. This is seen in graphs such as social networks, which often have a few nodes with very high degree (many subscribers, followers, etc.), as opposed to ordinary users with only a few connections.

Skewed Degree Distribution

Scalable Graph Processing Class

Suppose the nodes are uniformly distributed among all machines. Denote by $V_i$ the set of nodes stored in machine i for 1 ≤ i ≤ t, and let $d_j$ be the degree of node $v_j$ in the input graph. Then $O\big(\frac{m}{t}, D(G,t)\big)$ is defined as:

$$O\Big(\frac{m}{t}, D(G,t)\Big) = O\Big(\max_{1 \le i \le t} \sum_{v_j \in V_i} d_j\Big)$$

$$D(G,t) = \frac{t-1}{t^2} \sum_{v_j \in V} d_j^2$$

Scalable Graph Processing Class

This leads us to the following lemma, the proof of which has been omitted.

Lemma 3.1: Let $x_i$ ($1 \le i \le t$) be the communication cost upper bound for machine i, i.e., $x_i = \sum_{v_j \in V_i} d_j$. Then the expected value of $x_i$ is $E(x_i) = \frac{2m}{t}$, and the variance of $x_i$ is $Var(x_i) = D(G,t)$.

The important thing to note here is that the variance of the degree distribution of G, denoted $Var(G)$, is

$$Var(G) = \frac{1}{n} \sum_{v_j \in V} \Big(d_j - \frac{2m}{n}\Big)^2 = \frac{n \sum_{v_j \in V} d_j^2 - 4m^2}{n^2}.$$

For fixed t, n, and m values, minimizing $D(G,t)$ is equivalent to minimizing $Var(G)$. In other words, the variance of the communication cost for each machine is minimized if all nodes in the graph have the same degree.

This addresses the problem experienced by the previous graph processing classes by reducing communication costs.
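
The relationship between the degree distribution and the per-machine communication bound can be checked numerically. The sketch below computes the expected cost 2m/t, D(G, t), and Var(G) from a degree sequence; the function name `degree_stats` and the two example graphs are assumptions chosen for illustration.

```python
def degree_stats(degrees, t):
    """Return (E(x_i), D(G, t), Var(G)) for a degree sequence and t machines."""
    n = len(degrees)
    m = sum(degrees) / 2                       # each edge contributes to two degrees
    sum_sq = sum(d * d for d in degrees)
    expected_cost = 2 * m / t                  # E(x_i) = 2m / t
    d_g_t = (t - 1) / (t * t) * sum_sq         # D(G, t) = (t - 1) / t^2 * sum d_j^2
    var_g = (n * sum_sq - 4 * m * m) / (n * n) # Var(G)
    return expected_cost, d_g_t, var_g


# Two graphs with n = 4 nodes and m = 4 edges, placed on t = 2 machines.
regular = degree_stats([2, 2, 2, 2], t=2)  # a 4-cycle: every node has degree 2
skewed = degree_stats([3, 2, 2, 1], t=2)   # a triangle with one pendant node
```

Both graphs have the same expected cost per machine, but the skewed degree sequence gives a larger D(G, t) and a larger Var(G), which matches the statement that the variance is smallest when all nodes have the same degree.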

Scalable Graph Processing Class

Thus, we define the Scalable Graph Processing Class (SGC) as follows.

| | Each machine | Total |
|---|---|---|
| Disk | $O(\frac{m+n}{t})$ | $O(m+n)$ |
| Memory | $O(1)$ | $O(t)$ |
| Communication (per round) | $O(\frac{m}{t}, D(G,t))$ | $O(m+n)$ |
| CPU (per round) | $O(\frac{m}{t}, D(G,t))$ | |
| Number of rounds | $O(\log n)$ | |

Comparison Between Classes

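We examine the upper bounds of the three classes to see how the requirements of SGC compare.

| | MRC | MMC | SGC |
|---|---|---|---|
| Disk/machine | $O(n^{1+c/2})$ | $O(\frac{n+m}{t})$ | $O(\frac{n+m}{t})$ |
| Disk/total | $O(m^{1+c/2})$ | $O(n+m)$ | $O(n+m)$ |
| Memory/machine | $O(n^{1+c/2})$ | $O(\frac{n+m}{t})$ | $O(1)$ |
| Memory/total | $O(m^{1+c/2})$ | $O(n+m)$ | $O(t)$ |
| Communication/machine | $O(n^{1+c/2})$ | $O(\frac{n+m}{t})$ | $O(\frac{m}{t}, D(G,t))$ |
| Communication/total | $O(m^{1+c/2})$ | $O(n+m)$ | $O(n+m)$ |
| CPU/machine | $O(\mathrm{poly}(m))$ | $O(T_{seq}/t)$ | $O(\frac{m}{t}, D(G,t))$ |
| CPU/total | $O(\mathrm{poly}(m))$ | $O(T_{seq})$ | $O(n+m)$ |
| Number of rounds | $O(1)$ | $O(1)$ | $O(\log n)$ |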

Comparison Between Classes

We see that SGC requires each machine to use only constant memory. This means that even if the total memory of the system is smaller than the input data, the algorithm can still be processed successfully. This is an even stronger constraint than that defined in MMC.

Given the constraints on memory, communication, and CPU, it is nearly impossible for a wide range of graph algorithms to be processed in constant rounds in MapReduce.

Thus, we relax the O(1) rounds defined in MMC to O(log(n)) rounds.

Since Ω(log(n)) is the processing time lower bound for a large number of parallel graph algorithms in the PRAM model, O(log(n)) rounds is a reasonable bound, and it is practical for the MapReduce framework, as evidenced by the experiments.

Big-O Complexity

Graph Operators in SGC

In addition to the usual set of graph operators, such as union and intersection, two graph join operators are introduced in SGC, namely NE join and EN join, with which algorithms for a wide range of graph problems can be designed.

Graph Operators in SGC

We assume that a graph G(V, E) is stored in a distributed file system as a node table V and an edge table E.

Each node in the table has a unique id and some other information, such as a label and keywords.

Each edge in the table has id1 and id2, defining the source and target node ids of the edge, and some other information, such as a weight and a label.

We use the node id to represent the node where the meaning is obvious. G can be either directed or undirected.

For an undirected graph, each edge is stored as two edges, (id1, id2) and (id2, id1).

Graph Operators in SGC

Before we go any further, let's examine the natural join operation, ⋈, acting on two sets of data.

Here we see a graphical representation of Employee ⋈ Dept.

NE Join

An NE join aims to propagate the information on nodes into edges.

For each edge $(v_i, v_j) \in E$, an NE join outputs an edge $(v_i, v_j, F(v_i))$ (or $(v_i, v_j, F(v_j))$), where $F(v_i)$ (or $F(v_j)$) is a set of functions operated on $v_i$ (or $v_j$) in the node table V.

NE Join

Given a node table Vi and an edge table Ej, an NE join of Vi and Ej is represented in SQL as:

    select id1, id2, f1(c1) as p1, f2(c2) as p2, ···
    from Vi as V NE join Ej as E on V.id = E.id′
    where cond(c)
    count cond′(c′) as cnt

with the following definitions:

c, c′, ···: subsets of fields in the two tables Vi and Ej

c1, c2, ···: subsets of fields in the two tables Vi and Ej

fk: a function operated on the fields ck

cond: a function, defined on the fields in c, that returns true or false

cond′: a function, defined on the fields in c′, that returns true or false

id′: either id1 or id2

count: counts the number of trues in cond′(c′) and assigns it to cnt
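
The behaviour of an NE join can be sketched in a few lines of in-memory Python. This is a simplified illustration, not the MapReduce implementation: the function `ne_join`, its parameters, and the dictionary-based node table are assumptions, `cond` is applied to the node row only, and the `count` clause is omitted.

```python
def ne_join(nodes, edges, on="id1", f=lambda row: row, cond=lambda row: True):
    """For each edge, attach f(node row) of the endpoint selected by `on`."""
    out = []
    for id1, id2 in edges:
        key = id1 if on == "id1" else id2   # plays the role of id' in the SQL form
        row = nodes[key]
        if cond(row):                       # where cond(c)
            out.append((id1, id2, f(row)))  # node information propagated to the edge
    return out


nodes = {1: {"label": "a"}, 2: {"label": "b"}, 3: {"label": "c"}}
edges = [(1, 2), (2, 3), (3, 1)]
by_source = ne_join(nodes, edges, on="id1", f=lambda row: row["label"])
by_target = ne_join(nodes, edges, on="id2", f=lambda row: row["label"])
```

Each output row keeps the edge endpoints and gains one value computed from the joined node, which is the "propagate node information into edges" behaviour the slide describes.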

EN Join

An EN join aims to aggregate the information on edges into nodes.

For each node $v_i \in V$, an EN join outputs a node $(v_i, G(adj(v_i)))$, where $adj(v_i) = \{(v_i, v_j) \in E\}$ and G is a set of decomposable aggregate functions on the edge set $adj(v_i)$.

An aggregate function $g_k$ is decomposable if, for any dataset s and any two subsets $s_1$ and $s_2$ of s with $s_1 \cap s_2 = \emptyset$ and $s_1 \cup s_2 = s$, $g_k(s)$ can be computed using $g_k(s_1)$ and $g_k(s_2)$.
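
A short check makes the definition concrete. In this sketch (the helper name `combines_correctly` and the sample data are assumptions), sum and min are recombined from partial results, while the median serves as a counterexample: two datasets with identical partial medians have different overall medians, so no combining function can exist for it.

```python
from statistics import median


def combines_correctly(g, combine, s1, s2):
    """Check that g(s1 + s2) can be rebuilt from g(s1) and g(s2) via combine."""
    return g(s1 + s2) == combine(g(s1), g(s2))


sum_ok = combines_correctly(sum, lambda a, b: a + b, [1, 5], [2, 9])
min_ok = combines_correctly(min, min, [1, 5], [2, 9])

# Same partial medians (3.0 and 5.5) in both cases, different overall medians.
partials_a = (median([1, 5]), median([2, 9]))
partials_b = (median([3, 3]), median([5, 6]))
overall_a = median([1, 5] + [2, 9])
overall_b = median([3, 3] + [5, 6])
```

Decomposability matters here because a reducer can fold edges into a running value one at a time, without holding the whole edge set of a node in memory.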

EN Join

An EN join can be defined in SQL form as:

    select id, g1(c1) as p1, g2(c2) as p2, ···
    from Vi as V EN join Ej as E on V.id = E.id′
    where cond(c)
    group by id
    count cond′(c′) as cnt

with the following definitions:

c, c′, ···: subsets of fields in the two tables Vi and Ej

c1, c2, ···: subsets of fields in the two tables Vi and Ej

id′: either id1 or id2

count: the number of trues in cond′(c′), assigned to cnt

gk: a decomposable aggregate function operated on the fields in ck, with the results grouped by node id
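
An EN join can be sketched in the same in-memory style. Again this is an illustration under assumptions rather than the authors' implementation: `en_join` and its parameters are hypothetical names, `g` is given as a two-argument combiner so that edges are folded one at a time, and the `where` and `count` clauses are omitted.

```python
import operator


def en_join(nodes, edges, on="id1", g=min, value=lambda e: e[2]):
    """Group edges by the endpoint selected by `on` and fold g over each group."""
    idx = 0 if on == "id1" else 1
    result = {}
    for edge in edges:
        node_id = edge[idx]
        if node_id not in nodes:
            continue
        v = value(edge)
        # Decomposability lets us keep a single running value per node.
        result[node_id] = v if node_id not in result else g(result[node_id], v)
    return result


nodes = {1, 2, 3}
edges = [(1, 2, 5), (1, 3, 2), (2, 3, 7)]          # (id1, id2, weight)
min_weight = en_join(nodes, edges, on="id1", g=min)
out_degree = en_join(nodes, edges, on="id1", g=operator.add, value=lambda e: 1)
```

Nodes with no matching edge (node 3 when grouping by `id1`) simply do not appear in the result of this sketch.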

Basic Graph Algorithms

The combination of NE join and EN join can solve a wide range of graph problems in SGC.

In this section, we introduce some basic graph algorithms:

PageRank

Breadth First Search

Graph Keyword Search

We will use MRC, MMC, and SGC versions of these algorithms for performance testing, which will be covered later.

Page Rank

PageRank is a key graph operation that computes the rank of each node based on the links (directed edges) among them.

Given a directed graph G(V, E) and a page x with inlinks $t_1, \dots, t_n$, the PageRank of x can be calculated iteratively as follows:

$$PR(x) = \alpha \Big(\frac{1}{|V|}\Big) + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$

with the following definitions:

C(t): out-degree of t

α: probability of a random jump

|V|: total number of nodes
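
The iteration can be sketched directly from the formula. This is a minimal single-machine version under stated assumptions: the function name `pagerank`, the default `alpha=0.15`, the fixed iteration count, and the requirement that every node has at least one out-link are all choices made for illustration.

```python
def pagerank(nodes, edges, alpha=0.15, iterations=50):
    """Iterate PR(x) = alpha / |V| + (1 - alpha) * sum(PR(t) / C(t)) over inlinks t."""
    out_degree = {v: 0 for v in nodes}
    for src, _ in edges:
        out_degree[src] += 1                 # C(t): out-degree of t
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        incoming = {v: 0.0 for v in nodes}
        for src, dst in edges:
            # Each edge carries its source's share PR(t) / C(t) to the target.
            incoming[dst] += rank[src] / out_degree[src]
        rank = {v: alpha / len(nodes) + (1 - alpha) * incoming[v] for v in nodes}
    return rank


cycle = pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
ranks = pagerank(["a", "b", "c"], [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])
```

The inner loop loosely mirrors the two join operators: pushing PR(t)/C(t) onto each edge resembles an NE join, and summing the values arriving at each target resembles an EN join with sum as the aggregate.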

Page Rank Algorithm

Graphical overview of the Page Rank algorithm.

Page Rank in MapReduce

Graphical overview of the Page Rank in MapReduce.

Breadth First Search

Breadth First Search (BFS) is a fundamental graph operation. Given an undirected graph G(V, E) and a source node s, a BFS computes for every node v ∈ V the shortest distance (i.e., the minimum number of hops) from s to v in G.

Define: b is reachable from a if b is on the adjacency list of a.

DistanceTo(s) = 0

For all nodes p reachable from s, DistanceTo(p) = 1

For all nodes n reachable from some other set of nodes M, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ M)
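
The distance rules above can be sketched as a standard sequential BFS. This is the single-machine version, not the MapReduce formulation, and the function name `bfs_distances` and the dictionary-based adjacency list are assumptions for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return the minimum number of hops from source to every reachable node."""
    dist = {source: 0}                   # DistanceTo(s) = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first visit is along a shortest path
                dist[v] = dist[u] + 1    # 1 + distance of the node that reached v
                queue.append(v)
    return dist


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
distances = bfs_distances(adj, 0)
```

Node 5 is not reachable from the source, so it is absent from the result rather than being assigned a distance.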

Breadth First Search

Graphical overview of the Breadth First Search algorithm.

Graph Key Word Search

We now investigate a more complex algorithm, namely keyword search in an undirected graph G(V, E). Suppose that for each v ∈ V, t(v) is the text information included in v. Given a keyword query with a set of l keywords $Q = \{k_1, k_2, \dots, k_l\}$, a keyword search returns a set of rooted trees of the form

$$(r, (p_1, d(r, p_1)), (p_2, d(r, p_2)), \dots, (p_l, d(r, p_l)))$$

with the following definitions:

r: the root node

$p_i$: a node that contains keyword $k_i$ in $t(p_i)$

$d(r, p_i)$: the shortest distance from r to $p_i$ in G, for 1 ≤ i ≤ l

Each answer is uniquely determined by its root node r, and rmax is the maximum distance allowed from the root node r to a keyword node in an answer, i.e., $d(r, p_i) \le rmax$ for 1 ≤ i ≤ l.
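
One way to picture the problem is a bounded multi-source BFS per keyword, followed by keeping the roots that reach every keyword. The sketch below is a single-machine illustration under assumptions, not the algorithm from the paper: `keyword_search` is a hypothetical name, `text` maps each node to a set of words, and ties between equally near keyword nodes are broken arbitrarily.

```python
from collections import deque


def keyword_search(adj, text, query, rmax):
    """Return {r: [(p_i, d(r, p_i)) for each keyword]} for roots within rmax."""
    nearest = {}  # nearest[k][r] = (closest node containing k, its distance)
    for k in query:
        best = {v: (v, 0) for v in adj if k in text[v]}
        queue = deque(best)               # start from all nodes containing k
        while queue:
            u = queue.popleft()
            p, d = best[u]
            if d == rmax:                 # do not expand beyond rmax hops
                continue
            for v in adj[u]:
                if v not in best:
                    best[v] = (p, d + 1)
                    queue.append(v)
        nearest[k] = best
    return {r: [nearest[k][r] for k in query]
            for r in adj if all(r in nearest[k] for k in query)}


# A path 0 - 1 - 2 - 3 - 4 with keywords at both ends.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
text = {0: {"map"}, 1: set(), 2: set(), 3: set(), 4: {"reduce"}}
answers = keyword_search(adj, text, ["map", "reduce"], rmax=2)
```

Because the graph is undirected, the distance from a keyword node to r equals the distance from r to that keyword node, which is why searching outward from the keyword nodes is enough.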

Connected Component

Given an undirected graph G(V, E) with n nodes and m edges, a Connected Component (CC) is a maximal set of nodes that can reach each other through paths in G.

Computing all CCs of G is a fundamental graph problem and can be solved efficiently on a sequential machine using O(n + m) time. However, it is non-trivial to solve the problem in MapReduce.
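For reference, the sequential O(n + m) bound can be reached with a plain BFS over an adjacency list. The sketch below is a standard textbook approach, not something taken from the talk; it assumes nodes are numbered 0..n-1, and the function name connected_components is hypothetical.

```python
from collections import deque

def connected_components(n, edges):
    """Label every node with the smallest node id in its component."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    label = [-1] * n
    for s in range(n):
        if label[s] != -1:
            continue  # already reached from a smaller start node
        label[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = s
                    queue.append(v)
    return label

labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

Each node and each edge is touched a constant number of times, which gives the linear bound. The difficulty in MapReduce is that no single machine is assumed to hold the whole adjacency list.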


Existing Algorithms

We present three existing algorithms for Connected Component computation in MapReduce, which serve as baselines for evaluating CC in SGC.

HashToMin

HashGToMin

PRAM-Simulation


HashToMin

HashToMin and HashGToMin are two MapReduce algorithms with a similar idea: use the smallest node in each CC as the representative of the CC, assuming that there is a total order among all nodes in G.

The HashToMin algorithm finishes in O(log(n)) rounds, with O(log(n)(m + n)) total communication cost in each round.

The algorithm can be optimized to use O(1) memory on each machine using secondary sort in MapReduce.
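The slide does not spell out the message rule, so the following is only a sketch under an assumption: it follows the commonly described Hash-to-Min formulation, in which every node v keeps a cluster C(v), sends C(v) to the smallest node in C(v), and sends that smallest node to every member of C(v). The rounds are simulated in memory with dictionaries rather than real map and reduce tasks, and the function name hash_to_min is hypothetical.

```python
def hash_to_min(nodes, edges):
    """Simulate Hash-to-Min rounds in memory; return (labels, rounds)."""
    # Initial cluster of v: v itself plus its neighbours.
    cluster = {v: {v} for v in nodes}
    for u, v in edges:
        cluster[u].add(v)
        cluster[v].add(u)
    rounds = 0
    while True:
        rounds += 1
        # "Map": each node emits messages keyed by the receiving node.
        received = {v: set() for v in nodes}
        for v, members in cluster.items():
            vmin = min(members)
            received[vmin] |= members      # whole cluster goes to the minimum
            for u in members:
                received[u].add(vmin)      # the minimum goes to every member
        # "Reduce": the new cluster of v is the union of what v received.
        if received == cluster:
            break                          # fixed point reached
        cluster = received
    labels = {v: min(members) for v, members in cluster.items()}
    return labels, rounds

labels, rounds = hash_to_min(
    nodes=[1, 2, 3, 4, 5, 6, 7],
    edges=[(1, 2), (2, 3), (3, 4), (5, 6)],
)
```

At the fixed point the smallest node of each component holds the whole component and every other node holds only that representative, which is why the label is read off as the minimum of the final cluster.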


HashGToMin

The HashGToMin algorithm is expected to finish in O(log(n)) rounds, with O(m + n) total communication cost in each round.

However, it needs O(n) memory for a single machine to hold a whole CC in memory.

Thus, HashGToMin is not suitable to handle a graph with large n.


PRAM Simulation

PRAM-Simulation takes an algorithm designed for the Parallel Random Access Machine (PRAM) model and simulates it in MapReduce. The PRAM model allows multiple processors to compute in parallel using a shared memory.

A theoretical result shows that a CREW PRAM algorithm that runs in O(t) time can be simulated in MapReduce in O(t) rounds. For the CC computation problem, the best result in the literature computes CCs in O(log(n)) time.

However, it needs to compute the 2-hop node pairs, which requires O(n^2) communication cost in the worst case in each round. Thus, the simulation algorithm is impractical.
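A star graph gives a quick feel for the quadratic blow-up. The sketch below assumes that a "2-hop node pair" means two non-adjacent nodes that share a common neighbour; that reading, and the helper names two_hop_pairs and star, are assumptions made for illustration.

```python
from itertools import combinations

def two_hop_pairs(adj):
    """Return all unordered pairs of non-adjacent nodes with a common neighbour."""
    pairs = set()
    for middle, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b not in adj[a]:
                pairs.add((a, b))
    return pairs

def star(n):
    """Star on n nodes: node 0 is the centre, nodes 1..n-1 are leaves."""
    adj = {0: set(range(1, n))}
    for leaf in range(1, n):
        adj[leaf] = {0}
    return adj

count_small = len(two_hop_pairs(star(5)))    # 4 leaves
count_large = len(two_hop_pairs(star(100)))  # 99 leaves
```

A star has only n - 1 edges, yet every pair of leaves is a 2-hop pair, giving (n - 1)(n - 2) / 2 pairs in total.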


Connected Component in SGC

We introduce our algorithm to compute CCs in SGC. Conceptually, the algorithm shares similar ideas with most deterministic O(log(n)) PRAM algorithms, but it is non-trivial.

Our algorithm maintains a forest using a parent pointer p(v) for each v ∈ V. Each rooted tree in the forest represents a partial CC.

A singleton is a tree with one node, and a star is a tree of height 1.

A tree is an isolated tree if there are no edges in E that connect the tree to another tree.

The forest is iteratively updated using two operations: hooking and pointer jumping. Hooking merges several trees into a larger tree, and pointer jumping changes the parent of each node to its grandparent in each tree.

When the algorithm ends, each tree becomes an isolated star that represents a CC in the graph.
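The two operations can be sketched on a plain parent array. This is a simplified single-machine illustration, not the SGC algorithm itself: the hooking rule used here (a root hooks onto the smallest neighbouring root with a smaller id) is an assumption chosen because it cannot create cycles, and pointer jumping is repeated until every tree is a star before the next hooking step. The function name cc_by_hooking is hypothetical.

```python
def cc_by_hooking(n, edges):
    """Return the parent array p; at the end p[v] is the root of v's component."""
    p = list(range(n))  # every node starts as a singleton tree
    while True:
        # Hooking. All trees are stars at this point, so p[u] and p[v] are roots.
        hook = {}
        for u, v in edges:
            ru, rv = p[u], p[v]
            if ru != rv:
                larger, smaller = max(ru, rv), min(ru, rv)
                # Assumed rule: hook the larger root onto the smallest candidate.
                hook[larger] = min(hook.get(larger, larger), smaller)
        if not hook:
            return p  # every tree is an isolated star
        for root, target in hook.items():
            p[root] = target
        # Pointer jumping: replace each parent by the grandparent until
        # every node points directly at its root again.
        while any(p[p[v]] != p[v] for v in range(n)):
            p = [p[p[v]] for v in range(n)]

parents = cc_by_hooking(6, [(0, 1), (1, 2), (3, 4)])
```

Because roots only ever hook onto smaller ids, the smallest node of each component is never hooked and ends up as the root of its star.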


Comparison

We can now compare the running times of these algorithms. We omit PRAM-Simulation since it is impractical.

Note that the CC algorithm in the SGC class has the best bounds in each category. This indicates the significant improvement that SGC represents for scalable big graph processing.


Minimum Spanning Forest

Given a weighted undirected graph G(V, E) of n nodes and m edges, with each edge (u, v) ∈ E assigned a weight w((u, v)), a Minimum Spanning Forest (MSF) is a spanning forest of G with the minimum total edge weight.

We also use (u, v, w((u, v))) to denote an edge.

Although MSF can be efficiently computed on a sequential machine using O(m + n log(n)) time, it is non-trivial to solve the problem in MapReduce.


Minimum Spanning Forest

The following is an example of a Minimum Spanning Tree. A forest is made up of many trees.


MSF Algorithm in SGC

Suppose there is a total order among all edges as follows. For any two edges e1 = (u1, v1, w1) and e2 = (u2, v2, w2), e1 < e2 iff one of the following conditions holds:

1. w1 < w2

2. w1 = w2 and min(u1, v1) < min(u2, v2)

3. w1 = w2 and min(u1, v1) = min(u2, v2), and max(u1, v1) < max(u2, v2)
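The three conditions amount to a lexicographic comparison on the tuple (w, min(u, v), max(u, v)), which maps directly onto a sort key. The sketch below pairs that key with a sequential Kruskal pass to show why the order matters: with ties broken this way the MSF is unique. Kruskal with union-find is used here only as a familiar stand-in, not as the MSF algorithm in SGC, and the names edge_key and minimum_spanning_forest are hypothetical.

```python
def edge_key(edge):
    """Sort key realising the total order: weight, then min endpoint, then max."""
    u, v, w = edge
    return (w, min(u, v), max(u, v))

def minimum_spanning_forest(n, edges):
    """Sequential Kruskal over nodes 0..n-1 using the total order above."""
    parent = list(range(n))

    def find(x):
        # Find the root of x, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v, w in sorted(edges, key=edge_key):
        ru, rv = find(u), find(v)
        if ru != rv:          # the edge joins two different trees
            parent[ru] = rv
            forest.append((u, v, w))
    return forest

# Three equal-weight edges form a triangle; the order decides which one is dropped.
edges = [(1, 2, 1), (0, 2, 1), (0, 1, 1), (3, 4, 5)]
forest = minimum_spanning_forest(5, edges)
```

In the triangle all weights tie, so conditions 2 and 3 decide: (0, 1) and (0, 2) precede (1, 2), and (1, 2) is the edge left out.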


MSF Comparisons

The comparison of two existing algorithms, OneRoundMSF and MultiRoundMSF, and our algorithm MSF is shown below in terms of memory consumption per machine, total communication cost per round, and the number of rounds.

As we will show in our performance testing, the high memory requirement of OneRoundMSF and MultiRoundMSF becomes the bottleneck for the algorithms to achieve high scalability when handling graphs with large n.


Performance Testing

We tested the performance of the aforementioned algorithms on a cluster of 17 computing nodes, including one master node and 16 slave nodes, each of which has four Intel Xeon 2.4GHz CPUs and 15GB RAM and runs 64-bit Ubuntu Linux.

We implement all algorithms using Hadoop (version 1.2.1) with Java 1.6.

We allow each node to run three mappers and three reducers concurrently.


Data Sets

We use two web-scale graphs, Twitter-2010 and Friendster, with different graph characteristics for testing.

Twitter-2010 contains 41,652,230 nodes and 1,468,365,182 edges with an average degree of 71. The maximum degree is 3,081,112 and the diameter of Twitter-2010 is around 24.

Friendster contains 65,608,366 nodes and 1,806,067,135 edges with an average degree of 55. The maximum degree is 5,214 and the diameter of Friendster is around 32.
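As a quick sanity check, the reported average degrees are consistent with the usual undirected formula 2m / n, where every edge contributes to the degree of both endpoints. That this is how the averages were obtained is an assumption; the slides do not state the formula, and the helper name average_degree is hypothetical.

```python
def average_degree(n, m):
    """Average degree of an undirected graph with n nodes and m edges."""
    return 2 * m / n

twitter = average_degree(41_652_230, 1_468_365_182)
friendster = average_degree(65_608_366, 1_806_067_135)
rounded = (round(twitter), round(friendster))
```

Rounded to the nearest integer, the two values come out at 71 and 55, matching the figures quoted for the two data sets.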


Algorithms

Besides the five algorithms PageRank (Algorithm 1), BFS (Algorithm 2), KWS (Algorithm 3), CC (Algorithm 4), and MSF (Algorithm 5), we also implement the algorithms for PageRank, BFS, and graph keyword search using the join operations supported by Pig on Hadoop, denoted PageRank-Pig, BFS-Pig, and KWS-Pig respectively.


PageRank Algorithm


BFS Algorithm


CC Algorithm


MSF Algorithm


Conclusions

In this paper, we studied scalable big graph processing in MapReduce.

We reviewed previous MapReduce classes and proposed a new class, SGC, to guide the development of scalable graph processing algorithms in MapReduce. We introduced two graph join operators with which a large range of graph algorithms can be designed in SGC.

In particular, for two fundamental graph problems, CC computation and MSF computation, we improved on the state-of-the-art algorithms both in theory and in practice. We conducted extensive performance studies using real web-scale graphs to show the high scalability achieved by our algorithms in SGC.
