Optimization of graph storage using GoFFish

17
GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics Guide: Prof. Dinkar Sitaram Department of Computer Science and Engineering PES Institute of Technology Ms. Prafullata Department of Computer Science and Engineering PES Institute of Technology Guide: Prof. Yogesh Simmhan Department of SERC Indian Institute of Science Team Members: Anushree P K 1PI11IS017 Bhavani B 1PI11IS027 Mithilesh K G 1PI11IS059

Transcript of Optimization of graph storage using GoFFish

Page 1: Optimization of graph storage using GoFFish

GoFFish: A Sub-Graph Centric Framework for

Large-Scale Graph Analytics

Guide: Prof. Dinkar Sitaram Department of Computer Science and Engineering PES Institute of Technology

Ms. Prafullata Department of Computer Science and Engineering PES Institute of Technology

Guide: Prof. Yogesh Simmhan Department of SERC Indian Institute of Science

Team Members: Anushree P K 1PI11IS017 Bhavani B 1PI11IS027 Mithilesh K G 1PI11IS059

Page 2: Optimization of graph storage using GoFFish

What is GoFFISH?

GoFFish is a scalable software framework for storing graphs, and composing and executing graph

analytics in a Cloud and commodity cluster environment

It consists of :-

1. GoFS – It is a distributed store for partitioning, storing and accessing graph datasets across hosts in a cluster.

2. Gopher - Gopher is a programming framework that offers sub-graph centric abstractions on a a Cloud or cluster in conjunction with GoFS.

GoFFish is implemented in Java.

Page 3: Optimization of graph storage using GoFFish

Existing GoFFISH Storage Architecture

Worker 2 Worker 3

Worker 1 + Head Node

Partition 2

Partition 1

Partition 3

t0 – t10

Slice 1 Slice 1

Same Storage format in worker 1

• Large graph partitioned into subgraphs and distributed across workers.

• GoFFISH default storage 10 bins in a slice. 10 instances of every subgraph.

Page 4: Optimization of graph storage using GoFFish

Using the GoFFish framework: To store the real time graphs in – temporal and spatial formats To compute the efficiencies of both storage formats based on the input algorithm (gopher job)

Intuition :-

The time taken for a gopher job over large graphs should be optimized when the graphs are stored

In the given two formats depending on the algorithm run.

For example for vertex count algorithm temporal format should take lessar computation time

Problem Definition

Page 5: Optimization of graph storage using GoFFish

GoFFISH Storage Architecture – Our Model

Worker 2 Worker 3

Worker 1 + Head Node

Partition 2

Partition 1

Partition 3

t0 – tn

Slice 1

Same Storage format in worker 1

• All instances for a subgraph in one slice.

• One bin per slice.

Slice 2

Slice 3

Slice 4

Page 6: Optimization of graph storage using GoFFish

We used slicing pointers instancegroupingsize and numsubgraphbins to manipulate the way the graphs were partitioned in order to obtain the desired storage format. These slicing pointers correspond the temporal and subgraph bin packing schemes in GoFFISH.

Page 7: Optimization of graph storage using GoFFish

Algorithms • Vertex Count Each subgraph processor calculate number of vertices within a subgraph. They send messages to all other subgraphs with their count where each subgraph calculate the total using the messages received.

• Connected Components Each subgraph finds the smallest vertex id for each subgraph and propagate that smallest value to its connected subgraphs. If incoming value to a subgraph is different from its current value it updates the current value and propagate the changes to its neighbours.

Page 8: Optimization of graph storage using GoFFish

Flow Diagram

Page 9: Optimization of graph storage using GoFFish

Dataset :- Road network graph Vehicle route tracking using traffic cams –Time-series graph of sync camera snapshots –Sensors are vertices –Edges are road connectivity w/ distance weight Graph instance is image metadata every N sec –License plate, vehicle color, direction, speed Urban

Dataset

Page 10: Optimization of graph storage using GoFFish

0

2000

4000

6000

8000

10000

12000

0 1 2 3 4

TIm

e T

ake

n

Partition Id

Partition-Wise Total App Time

S=10,t=10

S=1,T=ALL

4200

4250

4300

4350

4400

4450

0 1 2 3 4

TIm

e T

ake

n

Partition Id

Partition-Wise Total App Time

S=10,t=10

S=1,T=ALL

Vertex Count Connected Components

Performance Analysis

Page 11: Optimization of graph storage using GoFFish

Partition 1

0

500

1000

1500

2000

2500

3000

3500

4000

4500

0 1 2 3 4

Tim

e T

ake

n

Superstep

Superstep- wise Compute Task Time

VC(s=10,t=10)

VC(s=1,t=ALL)

0

50

100

150

200

250

300

350

400

0 0.5 1 1.5 2 2.5 3 3.5

Tim

e T

ake

n

Superstep

Superstep- wise Compute Task Time

CC(s=10,t=10)

CC(s=1,t=ALL)

Vertex Count Connected Components

Performance Analysis

Page 12: Optimization of graph storage using GoFFish

DEMO + Loggers

Page 13: Optimization of graph storage using GoFFish

Screenshots

Page 14: Optimization of graph storage using GoFFish

Screenshots

Page 15: Optimization of graph storage using GoFFish

GoFFish: A Sub-Graph Centric Framework for Large-Scale Graph Analytics - Indian

Institute of Science, Bangalore 560012 India, University of Southern California, Los

Angeles CA 90089 USA, November 26, 2013 Scalable Analytics over Distributed Time-series Graphs using GoFFish - Indian

Institute of Science, Bangalore 560012 India, University of Southern California, Los

Angeles CA 90089 USA, June 23, 2014 Chronos: A Graph Engine for Temporal Graph Analysis - Tsinghua University,

University of Science and Technology of China, Microsoft Research

References

Page 16: Optimization of graph storage using GoFFish

The Team

Anushree Prasanna Kumar 8th Sem ISE

Bhavani B 8th Sem ISE

Mithilesh Kumar 8th Sem ISE

Page 17: Optimization of graph storage using GoFFish

Thank you