Techniques for Graph Analytics on Big Data

Techniques for Graph Analytics on Big Data Usman Nisar, Arash Fard, John A. Miller Computer Science Department University of Georgia, Athens, GA


Page 1: Techniques  for Graph Analytics on Big Data

Techniques for Graph Analytics on Big Data

Usman Nisar, Arash Fard, John A. Miller

Computer Science Department, University of Georgia, Athens, GA

Page 2: Techniques  for Graph Analytics on Big Data

Outline

Introduction
Subgraph Pattern Matching
Types of Subgraph Pattern Matching
Models of Computation
Distributed Algorithms
Performance Evaluation
Graph Partitioning
Results Analysis
Conclusion

Page 3: Techniques  for Graph Analytics on Big Data

Introduction

Directed graph G(V, E, l)
  V is a set of vertices
  E ⊆ V × V is a set of directed edges
  l: V → 𝓝 is a function mapping vertices to labels

Versatility and expressivity
  Social networks, web search engines, genome sequencing, etc.

Data sizes are growing rapidly
  Facebook: 1 billion+ users, average degree of 140
  Twitter: 200 million+ users, 400 million tweets each day

Bigger datasets mean bigger graphs

Page 4: Techniques  for Graph Analytics on Big Data

Subgraph Pattern Matching

A graph G’(V’, E’, l’) is a subgraph of G(V, E, l) if:
1. V’ ⊆ V
2. E’ ⊆ E ∩ (V’ × V’), i.e. every edge of G’ is an edge of G between vertices of V’
3. ∀v ∈ V’, l’(v) = l(v)

A pattern (or query) is a graph that we want to find in a bigger graph

Subgraph pattern matching: Given a query graph, find all subgraphs of another graph (the data graph) that are similar to the query based on certain criteria
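The three subgraph conditions can be checked directly. The sketch below is only an illustration: the function name `is_subgraph`, the set-of-pairs edge representation, and the small example graph are assumptions, not part of the slides.

```python
def is_subgraph(sub_v, sub_e, sub_l, v, e, l):
    """Return True if G'(V', E', l') is a subgraph of G(V, E, l).

    Vertices are sets, edges are sets of (source, target) pairs,
    and labels are dicts from vertex to label.
    """
    return (
        sub_v <= v                                                # 1. V' is a subset of V
        and all(a in sub_v and b in sub_v for a, b in sub_e)      # 2. E' only joins vertices of V'
        and sub_e <= e                                            #    and E' is a subset of E
        and all(sub_l[x] == l[x] for x in sub_v)                  # 3. labels agree on V'
    )


# Hypothetical example graph: 1 (PM) -> 2 (SA) -> 3 (DB)
V = {1, 2, 3}
E = {(1, 2), (2, 3)}
LABELS = {1: "PM", 2: "SA", 3: "DB"}

print(is_subgraph({1, 2}, {(1, 2)}, {1: "PM", 2: "SA"}, V, E, LABELS))  # True
print(is_subgraph({1, 3}, {(1, 3)}, {1: "PM", 3: "DB"}, V, E, LABELS))  # False: (1, 3) is not in E
```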

Page 5: Techniques  for Graph Analytics on Big Data

Types of Subgraph Pattern Matching

Exact
  Subgraph Isomorphism

Heuristic
  Graph Simulation
  Dual Simulation
  Strong Simulation

Page 6: Techniques  for Graph Analytics on Big Data

Subgraph Isomorphism

Exact matching from the pattern to data graph

Labels must be the same
Ullmann’s algorithm, VF2
NP-hard problem in general case
Current solutions not practical for large graphs
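To make the cost of exact matching concrete, here is a naive backtracking search. It is neither Ullmann’s algorithm nor VF2, only a brute-force sketch. It assumes the common reading in which a match is an injective mapping that preserves labels and every query edge (extra data edges are allowed); the function and variable names are illustrative.

```python
def subgraph_isomorphisms(q_edges, q_labels, g_edges, g_labels):
    """Yield every injective, label- and edge-preserving mapping query -> data."""
    q_vertices = list(q_labels)
    g_edge_set = set(g_edges)

    def consistent(mapping, u):
        # Every query edge touching u whose endpoints are both mapped
        # must map onto an edge of the data graph.
        return all(
            (mapping[a], mapping[b]) in g_edge_set
            for a, b in q_edges
            if u in (a, b) and a in mapping and b in mapping
        )

    def extend(mapping, i):
        if i == len(q_vertices):
            yield dict(mapping)
            return
        u = q_vertices[i]
        for v in g_labels:
            # Labels must be the same, and the mapping must stay injective.
            if g_labels[v] != q_labels[u] or v in mapping.values():
                continue
            mapping[u] = v
            if consistent(mapping, u):
                yield from extend(mapping, i + 1)
            del mapping[u]

    return extend({}, 0)


# Hypothetical example: query a (PM) -> b (SA)
matches = list(subgraph_isomorphisms(
    [("a", "b")], {"a": "PM", "b": "SA"},
    [(1, 2), (3, 2)], {1: "PM", 2: "SA", 3: "PM", 4: "SA"},
))
print(matches)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 2}]
```

The search tries every label-compatible data vertex for each query vertex, so its running time can grow exponentially with the query size, which is the practical obstacle the slide refers to.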

Page 7: Techniques  for Graph Analytics on Big Data

Example: Subgraph Isomorphism

Matches: {4, 8, 6, 7}, {5, 8, 6, 7}

[Figure: the query graph (vertices labeled PM, SA, DB, AI) and the data graph (vertices 1 to 24)]

PM: Product Manager
SD: Software Developer
SA: System Analyst
DB: Database Designer
AI: AI Specialist

Page 8: Techniques  for Graph Analytics on Big Data

Graph Simulation

A vertex v in a data graph G matches a vertex u in query graph Q via graph simulation iff
1. both have the same label
2. a subset of v's children matches all the children of u, i.e. every child of u in Q is matched by at least one child of v in G

HHK algorithm [1]

[1] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke, “Computing simulations on finite and infinite graphs,” in Foundations of Computer Science, 1995. Proceedings., 36th Annual Symposium on. IEEE, 1995, pp. 453–462.
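The two conditions can be turned into a simple fixed-point computation: start from all label-compatible candidates and repeatedly discard candidates that violate the child condition. This is a naive refinement sketch, not the HHK algorithm; the adjacency-dict representation and all names are assumptions.

```python
def graph_simulation(q_adj, q_labels, g_adj, g_labels):
    """Return {query vertex: set of matching data vertices} (all empty if no match)."""
    # Condition 1: start from every data vertex that carries the same label.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u in q_labels:
            for u_child in q_adj.get(u, ()):
                # Condition 2: keep v only if some child of v matches u_child.
                keep = {v for v in sim[u] if sim[u_child] & set(g_adj.get(v, ()))}
                if keep != sim[u]:
                    sim[u] = keep
                    changed = True
    # If some query vertex has no match, the query does not match at all.
    if any(not s for s in sim.values()):
        return {u: set() for u in q_labels}
    return sim


# Hypothetical example: query a (PM) -> b (SA)
result = graph_simulation(
    {"a": ["b"]}, {"a": "PM", "b": "SA"},
    {1: [2]}, {1: "PM", 2: "SA", 3: "PM", 4: "SA"},
)
print(result)  # {'a': {1}, 'b': {2, 4}}
```

Vertex 4 still matches `b` even though it has no PM parent, because graph simulation only looks at children.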

Page 9: Techniques  for Graph Analytics on Big Data

Graph Simulation

[Figure: graph simulation example on the data graph (vertices 1 to 24) for the query with vertices labeled PM, SA, DB, AI; label abbreviations as in the legend on Page 7]

Page 10: Techniques  for Graph Analytics on Big Data

Dual Simulation

Adds a duality condition to graph simulation

A vertex in a data graph becomes a match with a vertex in the query graph iff
1. Both have the same label
2. A subset of its children matches all the children of its corresponding vertex in the query graph
3. A subset of its parents matches all the parents of its corresponding vertex in the query graph

Complexity: O((|Vq| + |Eq|) (|V| + |E|))
More restrictive than graph simulation
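A sketch of dual simulation as a fixed-point computation follows. It refines the candidate sets along every query edge in both directions (child condition and parent condition). It is a naive illustration and does not claim to reach the complexity bound stated on the slide; the edge-list representation and names are assumptions.

```python
def dual_simulation(q_edges, q_labels, g_edges, g_labels):
    """Return {query vertex: set of matching data vertices} under dual simulation."""
    g_children, g_parents = {}, {}
    for a, b in g_edges:
        g_children.setdefault(a, set()).add(b)
        g_parents.setdefault(b, set()).add(a)

    # Condition 1: same label.
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u_child in q_edges:
            # Condition 2: each match of u needs a child that matches u_child.
            keep_u = {v for v in sim[u] if g_children.get(v, set()) & sim[u_child]}
            # Condition 3 (duality): each match of u_child needs a parent that matches u.
            keep_c = {w for w in sim[u_child] if g_parents.get(w, set()) & keep_u}
            if keep_u != sim[u] or keep_c != sim[u_child]:
                sim[u], sim[u_child] = keep_u, keep_c
                changed = True
    if any(not s for s in sim.values()):
        return {u: set() for u in q_labels}
    return sim


# Hypothetical example: query a (PM) -> b (SA)
dual = dual_simulation(
    [("a", "b")], {"a": "PM", "b": "SA"},
    [(1, 2)], {1: "PM", 2: "SA", 3: "PM", 4: "SA"},
)
print(dual)  # {'a': {1}, 'b': {2}}
```

On this input plain graph simulation would also accept vertex 4 for `b`; the parent condition removes it, which illustrates why dual simulation is more restrictive.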

Page 11: Techniques  for Graph Analytics on Big Data

Example: Graph Dual Simulation

[Figure: dual simulation example showing the query graph (vertices labeled PM, SA, DB, AI) and the data graph (vertices 1 to 24); label abbreviations as in the legend on Page 7]


Page 13: Techniques  for Graph Analytics on Big Data

Strong Simulation

Extends dual simulation with a locality condition

A ball G[v, r] is a subset of a graph G containing:
  all vertices VB within an undirected distance r of the vertex v
  all the edges between the vertices in VB

[Figure: example graph with vertices X and 1 to 8, illustrating a ball]


Page 15: Techniques  for Graph Analytics on Big Data

Strong Simulation

Extends dual simulation with a locality condition

A ball G[v, r] is a subset of a graph G containing:
  all vertices VB within an undirected distance r of the vertex v
  all the edges between the vertices in VB

Complexity: O(|V| (|V| + (|Vq| + |Eq|) (|V| + |E|)))

Query graph Q(Vq, Eq, l) matches data graph G(V, E, l) via strong simulation if there exists a vertex v ∈ V s.t.
1. Q matches G[v, dq] via dual simulation with match relation Rb, where dq is the diameter of Q
2. v is contained in Rb
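The ball definition amounts to a breadth-first search that ignores edge direction and stops at distance r. The sketch below shows one way to extract G[v, r]; the function name `ball` and the edge-list representation are assumptions.

```python
from collections import deque


def ball(edges, center, radius):
    """Return (vertices, edges) of the ball G[center, radius]."""
    neighbors = {}
    for a, b in edges:
        # Distance is undirected, so record each edge in both directions.
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    dist = {center: 0}
    queue = deque([center])
    while queue:
        v = queue.popleft()
        if dist[v] == radius:
            continue  # do not expand past the boundary of the ball
        for w in neighbors.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)

    in_ball = set(dist)
    # Keep every original (directed) edge whose endpoints are both in the ball.
    return in_ball, {(a, b) for a, b in edges if a in in_ball and b in in_ball}


# Hypothetical example graph
ball_vertices, ball_edges = ball([(1, 2), (2, 3), (3, 4), (5, 1)], center=2, radius=1)
print(ball_vertices)  # {1, 2, 3}
print(ball_edges)     # {(1, 2), (2, 3)}
```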

Page 16: Techniques  for Graph Analytics on Big Data

Example: Dual Strong Simulation

[Figure: strong simulation example showing the query graph (vertices labeled PM, SA, DB, AI) and the data graph (vertices 1 to 24); label abbreviations as in the legend on Page 7]


Page 20: Techniques  for Graph Analytics on Big Data

Models of Computation

MapReduce
• Useful paradigm for large batch processing of datasets
• Not ideal for many graph algorithms

Bulk Synchronous Parallel (BSP)
• Computation performed in a series of supersteps
• Interleaved with communication/synchronization phases

Vertex-Centric (VC) BSP
• Each vertex treated as a processing unit
• Vertices can communicate with each other to obtain information

Message Passing
• Different threads communicate with each other via messages
• Programmer has more control over communication and synchronization

Page 21: Techniques  for Graph Analytics on Big Data

Distributed Graph Simulation

Superstep 1:
If there is any query vertex with the same label:
- Set match to true
- Make a match set of potential vertices in the query
- Ask children about their status
Otherwise: Vote to halt

Superstep 2:
If match is true: Reply back to parent with label
Otherwise: Vote to halt

Superstep 3:
If match is true:
- Evaluate match set based on children’s responses
- If there are removals from the match set:
  - Inform parents
  - Set match flag accordingly
- Otherwise: Vote to halt
Otherwise: Vote to halt

Superstep 4:
If there is any incoming removal message:
- Evaluate match set based on children’s responses
- If there are removals from the match set:
  - Inform parents
  - Set match flag accordingly
- Otherwise: Vote to halt
Otherwise: Vote to halt
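The superstep protocol above can be imitated in a single process by keeping per-vertex state and exchanging messages only between supersteps. The sketch below is a simulation under several assumptions: it does not use GPS or any real BSP framework, the reply in superstep 2 carries the child's label-derived candidate set rather than the bare label, superstep 4 is repeated until no messages remain, and all names are illustrative.

```python
def bsp_graph_simulation(q_adj, q_labels, g_adj, g_labels):
    """Simulate the vertex-centric supersteps; return (match sets, supersteps used)."""
    # Superstep 1: label check; matched vertices ask their children for status.
    match = {v: {u for u in q_labels if q_labels[u] == g_labels[v]} for v in g_labels}
    parents = {v: set() for v in g_labels}  # learned from the incoming requests
    for v in g_labels:
        if match[v]:
            for c in g_adj.get(v, ()):
                parents[c].add(v)

    # Superstep 2: matched children reply; each parent stores the child's match set.
    child_sets = {v: {} for v in g_labels}
    for c in g_labels:
        if match[c]:
            for p in parents[c]:
                child_sets[p][c] = set(match[c])

    def evaluate(v):
        """Drop query vertices whose children are not covered; return the removals."""
        removed = set()
        for u in match[v]:
            for u_child in q_adj.get(u, ()):
                if not any(u_child in s for s in child_sets[v].values()):
                    removed.add(u)
                    break
        match[v] -= removed
        return removed

    # Superstep 3: first evaluation; removals become messages to parents.
    inbox = {}
    for v in g_labels:
        removed = evaluate(v)
        if removed:
            for p in parents[v]:
                inbox.setdefault(p, []).append((v, removed))
    supersteps = 3

    # Superstep 4 onward: react to removal messages until none are left.
    while inbox:
        supersteps += 1
        outbox = {}
        for v, messages in inbox.items():
            for child, removed in messages:
                child_sets[v][child] -= removed
            newly_removed = evaluate(v)
            if newly_removed:
                for p in parents[v]:
                    outbox.setdefault(p, []).append((v, newly_removed))
        inbox = outbox
    return match, supersteps


# Hypothetical example: query a (PM) -> b (SA) -> c (DB)
final_match, used = bsp_graph_simulation(
    {"a": ["b"], "b": ["c"]}, {"a": "PM", "b": "SA", "c": "DB"},
    {1: [2], 2: [3], 4: [5]}, {1: "PM", 2: "SA", 3: "DB", 4: "PM", 5: "SA"},
)
print(final_match)  # {1: {'a'}, 2: {'b'}, 3: {'c'}, 4: set(), 5: set()}
print(used)         # 4
```

In the example, vertex 5 loses its match in superstep 3 because it has no DB child, and its parent 4 reacts to the removal message in superstep 4. Each vertex only touches its own state within a superstep, which is what allows the real computation to be distributed.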

Page 22: Techniques  for Graph Analytics on Big Data

Distributed Graph Simulation

Page 23: Techniques  for Graph Analytics on Big Data

Other Distributed Graph Algorithms

Dual simulation
  Similar to graph simulation
  In addition to storing the match sets of its children, a vertex also stores the match sets of its parents
  During match set evaluation, a vertex takes both of these into account

Strong simulation (optimized)
  First run dual simulation to obtain match relation R
  For each matching vertex v in R:
    Create a ball centered at v with radius dq containing only vertices in R
    Perform dual simulation on the ball
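The optimized strong simulation steps can be sketched sequentially as follows. Two points are assumptions rather than statements from the slides: distances for the ball are measured in the original data graph before the ball is restricted to vertices in R, and the query diameter is passed in as the parameter `d_q` instead of being computed. The helper `dual_sim` is a naive fixed-point version, and all names are illustrative.

```python
from collections import deque


def dual_sim(q_edges, q_labels, g_edges, g_labels):
    """Naive dual simulation; returns {query vertex: set of data vertices}."""
    children, parents = {}, {}
    for a, b in g_edges:
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)
    sim = {u: {v for v in g_labels if g_labels[v] == q_labels[u]} for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, uc in q_edges:
            keep_u = {v for v in sim[u] if children.get(v, set()) & sim[uc]}
            keep_c = {w for w in sim[uc] if parents.get(w, set()) & keep_u}
            if keep_u != sim[u] or keep_c != sim[uc]:
                sim[u], sim[uc] = keep_u, keep_c
                changed = True
    return sim if all(sim.values()) else {u: set() for u in q_labels}


def strong_simulation(q_edges, q_labels, g_edges, g_labels, d_q):
    """Return {ball center: match relation of the ball} for every matching ball."""
    # Step 1: run dual simulation on the whole graph to obtain R.
    relation = dual_sim(q_edges, q_labels, g_edges, g_labels)
    matched = set().union(*relation.values())

    neighbors = {}
    for a, b in g_edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    results = {}
    for center in matched:
        # Step 2: breadth-first search up to undirected distance d_q.
        dist, queue = {center: 0}, deque([center])
        while queue:
            v = queue.popleft()
            if dist[v] < d_q:
                for w in neighbors.get(v, ()):
                    if w not in dist:
                        dist[w] = dist[v] + 1
                        queue.append(w)
        ball = set(dist) & matched  # keep only vertices in R
        ball_edges = [(a, b) for a, b in g_edges if a in ball and b in ball]
        ball_labels = {v: g_labels[v] for v in ball}
        # Step 3: dual simulation on the ball; the center must be in the result.
        local = dual_sim(q_edges, q_labels, ball_edges, ball_labels)
        if any(center in s for s in local.values()):
            results[center] = local
    return results


# Hypothetical example: query a (PM) -> b (SA), diameter 1
balls = strong_simulation(
    [("a", "b")], {"a": "PM", "b": "SA"},
    [(1, 2), (3, 2)], {1: "PM", 2: "SA", 3: "PM", 4: "SA"}, d_q=1,
)
print(sorted(balls))  # [1, 2, 3]
print(balls[1])       # {'a': {1}, 'b': {2}}
```

Running dual simulation first shrinks every ball to vertices that can still match, so the per-ball dual simulations work on much smaller graphs than the raw balls would give.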

Page 24: Techniques  for Graph Analytics on Big Data

Performance Evaluation

Experimental setup
  Cluster with 12 machines
  Each with two 2 GHz Intel Xeon E5-2620 CPUs, each with six cores
  Ethernet: 1 Gb/s
  Set up HDFS/GPS on all of the machines

Dataset       |V|     |E|     |l|
Synthesized   100 M   4 B     200
uk-2005       39 M    940 M   200
enwiki-2013   4.2 M   101 M   200

Page 25: Techniques  for Graph Analytics on Big Data

Runtime Evaluation

[Figure: running time (secs) vs. number of workers (1 to 44) for graph simulation and dual simulation; Synthesized, |V| = 10^8]

Page 26: Techniques  for Graph Analytics on Big Data

Runtime Evaluation

[Figure: running time (secs) vs. number of workers (1 to 44) for graph simulation and dual simulation; uk-2005, |V| = 3.9 × 10^7]

Page 27: Techniques  for Graph Analytics on Big Data

Runtime Evaluation

[Figure: running time (secs) vs. number of workers (1 to 20) for graph simulation; enwiki-2013, |V| = 4.2 × 10^6]

Page 28: Techniques  for Graph Analytics on Big Data

Speedup Evaluation

[Figure: speed-up vs. number of workers (1 to 44) for graph simulation; Synthesized, |V| = 10^8]

Page 29: Techniques  for Graph Analytics on Big Data

Speedup Evaluation

[Figure: speed-up vs. number of workers (1 to 44) for graph simulation and dual simulation; uk-2005, |V| = 3.9 × 10^7]

Page 30: Techniques  for Graph Analytics on Big Data

Speedup Evaluation

[Figure: speed-up vs. number of workers (1 to 20) for graph simulation; enwiki-2013, |V| = 4.2 × 10^6]

Page 31: Techniques  for Graph Analytics on Big Data

Conclusions

The three algorithms exhibit excellent scalable behavior in terms of speedup and efficiency as we increase the number of workers

Distributed implementations (GPS):
  Graph simulation
  Dual simulation
  Strong simulation

Ongoing work:
  Distributed implementations with message passing in Akka
  Our own sequential/distributed isomorphism algorithm

Page 32: Techniques  for Graph Analytics on Big Data

Questions?

Page 33: Techniques  for Graph Analytics on Big Data

Thanks

Page 34: Techniques  for Graph Analytics on Big Data

Depth-First Ball

The algorithm works in a depth-first fashion: a message is generated at the center vertex and then propagated through the system for ballSize supersteps.

This approach generates an exponential number of messages, which slows down the whole system and makes the approach impractical.

Page 35: Techniques  for Graph Analytics on Big Data

Breadth-First Ball

Works on a simple ping-reply model.
In the first superstep, the center vertex starts off by sending a ping message to all of its adjacent nodes.
In the second superstep, all the recipient nodes reply with their label and the ids of their children and parents.
In the third superstep, the center vertex, upon receiving this information, saves the returned labels and then pings the boundary nodes.
This process is repeated until we have a ball of size dq.

Page 36: Techniques  for Graph Analytics on Big Data

Breadth-First Ball

[Figure: four snapshots (a) to (d) of breadth-first ball construction on an example graph with vertices X and 1 to 8; p marks a ping message, r marks a reply]

(a) Ball: {1,7}, Labels: { }
(b) Ball: {1,7,2,3,4}, Labels: {1,7}
(c) Ball: {1,7,2,3,4}, Labels: {1,7}
(d) Ball: {1,7,2,3,4}, Labels: {1,7,2,3,4}

Page 37: Techniques  for Graph Analytics on Big Data

Efficiency

[Figure: three panels plotting efficiency (axis 0 to 120) vs. number of workers for graph simulation: Synthesized, |V| = 10^8; uk-2005, |V| = 3.9 × 10^7; enwiki-2013, |V| = 4.2 × 10^6]

• Calculated as Efficiency_k = Speedup_k / k
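As a worked example of the efficiency formula, the sketch below computes both quantities from a table of running times. It assumes the usual definition Speedup_k = T_1 / T_k, which the slides do not spell out, and the running times are made-up numbers chosen for easy arithmetic, not measurements from the experiments.

```python
def speedup_and_efficiency(runtimes):
    """Map each worker count k to (Speedup_k, Efficiency_k).

    runtimes: {number of workers k: running time T_k}; must contain k = 1.
    Assumes Speedup_k = T_1 / T_k and Efficiency_k = Speedup_k / k.
    """
    t1 = runtimes[1]
    return {k: (t1 / t, t1 / t / k) for k, t in sorted(runtimes.items())}


# Hypothetical running times in seconds (not taken from the slides)
table = speedup_and_efficiency({1: 100.0, 2: 50.0, 4: 40.0})
for k, (speedup, efficiency) in table.items():
    print(k, speedup, efficiency)
# 1 1.0 1.0
# 2 2.0 1.0
# 4 2.5 0.625
```

Multiplying the efficiency by 100 gives a percentage, which appears to be the scale used on the charts.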

Page 38: Techniques  for Graph Analytics on Big Data

Graph Partitioning

By default, the data graph is partitioned in a round-robin fashion among workers

The goals of min-cut partitioning are two-fold:
  to create well-balanced partitions
  to reduce the number of inter-partition edges

A number of other algorithms have been shown to use min-cut graph partitioning successfully for speed-ups

METIS¹: graph partitioning tool
  Written in C
  Takes an undirected graph and outputs the partitions

¹ http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
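The effect of the partitioning strategy on inter-partition edges can be seen on a toy graph. The sketch below does not call METIS; the "clustered" assignment is written by hand to stand in for the kind of balanced, low-cut partition a min-cut tool aims for. Function names and the example graph are assumptions.

```python
def round_robin_partition(vertices, num_workers):
    """Assign the i-th vertex to worker i mod num_workers."""
    return {v: i % num_workers for i, v in enumerate(vertices)}


def edge_cut(edges, assignment):
    """Count edges whose endpoints live on different workers."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])


# Hypothetical graph: two triangles joined by the single edge (3, 4)
vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 4), (3, 4)]

round_robin = round_robin_partition(vertices, 2)
clustered = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}  # hand-made, balanced, low cut

print(edge_cut(edges, round_robin))  # 5
print(edge_cut(edges, clustered))    # 1
```

Both assignments are perfectly balanced (three vertices per worker), but round-robin cuts five of the seven edges while the clustered assignment cuts one, so far fewer messages would have to cross the network.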

Page 39: Techniques  for Graph Analytics on Big Data

Performance Evaluation

Experimental Setup
  Cluster with 5 machines
  Each with two 2 GHz Intel Xeon E5-2620 CPUs, each with six cores
  Ethernet: 1 Gb/s
  Set up HDFS/GPS on all the machines

Datasets (labels |l| = 200):

Dataset       |V|          |E|
Synthesized   10^7         251 × 10^6
uk-2002       1.8 × 10^7   298 × 10^6

Page 40: Techniques  for Graph Analytics on Big Data

Results - Runtime

[Figure: three panels (Graph Simulation, Dual Simulation, Strict Simulation) plotting running time (secs) vs. query size (5, 20, 60, 100) for min-cut and round-robin partitioning]

Synthesized Dataset, |V| = 10^7, [parameter symbol missing] = 1.2

Page 41: Techniques  for Graph Analytics on Big Data

Results - Runtime

[Figure: three panels (Graph Simulation, Dual Simulation, Strict Simulation) plotting running time (secs) vs. query size (5, 20, 60, 100) for min-cut and round-robin partitioning; uk-2002, |V| = 1.8 × 10^7]

Page 42: Techniques  for Graph Analytics on Big Data

Results – Network I/O

[Figure: three panels (Graph Simulation, Dual Simulation, Strict Simulation) plotting total network I/O (KB) vs. query size (5, 20, 60, 100) for min-cut and round-robin partitioning; Synthesized Dataset, |V| = 10^7]

Page 43: Techniques  for Graph Analytics on Big Data

Results – Network I/O

[Figure: three panels (Graph Simulation, Dual Simulation, Strict Simulation) plotting total network I/O (KB) vs. query size (5, 20, 60, 100) for min-cut and round-robin partitioning; uk-2002, |V| = 1.8 × 10^7]

Page 44: Techniques  for Graph Analytics on Big Data

Planned Pattern Matching

Implementations        Sequential   Distributed (GPS)   Distributed (Akka VC)   Distributed (Akka MP)
Graph Simulation       0.581 s      2.924 s             ----------              ----------
Dual Simulation        0.642 s      3.403 s             ----------              ----------
Strong Simulation      ----------   8.453 s             ----------              ----------
Subgraph Iso. (ours)   2.837 s      ----------          ----------              ----------
Subgraph Iso. (VF2)    ----------   ----------          ----------              ----------