Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

60
GRADOOP: Scalable Graph Analytics with Apache Flink Martin Junghanns @kc1s Leipzig University

Transcript of Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Page 1: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns @kc1s

Leipzig University

Page 2: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

About the speaker and the team

André, PhD Student Martin, PhD Student

Kevin, M.Sc. Student Niklas, M.Sc. Student

Prof. Dr. Erhard Rahm Database Chair

Page 3: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Motivation

Page 4: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

𝑮𝑟𝑟𝑟𝑟 = (𝑽𝑒𝑟𝑒𝑒𝑒𝑒𝑒,𝑬𝑑𝑑𝑒𝑒)

“Graphs are everywhere”

Page 5: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝐹𝐹𝐹𝐹𝐹𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

Page 6: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒)

“Graphs are everywhere”

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

Page 7: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs are heterogeneous”

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒 ∪ 𝐿𝑒𝐿𝑒𝑒)

Page 8: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒 ∪ 𝐿𝑒𝐿𝑒𝑒)

“Graphs can be analyzed”

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

Page 9: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

0.2

0.28

0.26

0.33

0.25

0.26

“Graphs can be analyzed”

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

3.6 2.82

𝐺𝑟𝑟𝑟𝑟 = (𝐔𝐔𝐔𝐔𝐔 ∪ 𝐁𝐁𝐁𝐁𝐔,𝐹𝑟𝑒𝑒𝐹𝑑𝑒𝑟𝑒𝑟𝑒 ∪ 𝐿𝑒𝐿𝑒𝑒)

Page 10: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs can be analyzed“

Assuming a social network

Page 11: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs can be analyzed“

Assuming a social network 1. Determine subgraph

Page 12: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs can be analyzed“

Assuming a social network 1. Determine subgraph 2. Find communities

Page 13: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs can be analyzed“

Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities

Page 14: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graphs can be analyzed“

Assuming a social network 1. Determine subgraph 2. Find communities 3. Filter communities 4. Find common subgraph

Page 15: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“Graph data models must be expressive“

Assuming a social network Heterogeneous data 1. Determine subgraph Apply graph transformation 2. Find communities Handle collections of graphs 3. Filter communities Aggregation, Selection 4. Find common subgraph Apply dedicated algorithms

Page 16: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

„And let‘s not forget …“

Page 17: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

“…Graphs are large”

Page 18: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

A framework and research platform for efficient, distributed and domain independent graph data

management and analytics.

Page 19: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

High Level Architecture

HDFS/YARN Cluster

HBase Distributed Graph Store

Extended Property Graph Model

Flink Operator Implementation

Graph Analytical DSL Java 25K (33K) LOC GPLv3

Page 20: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Extended Property Graph Model (EPGM)

Page 21: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Graph Representation

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

Page 22: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[2] Community | interest : Graphs | vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

EPGM – Graph Representation [0] Tag

name : Databases [1] Tag

name : Graphs [2] Tag

name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

Page 23: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 24: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 25: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 26: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Combination 1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

[2] Community | interest : Graphs| vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

Page 27: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

Page 28: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 29: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Grouping

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Page 30: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Grouping

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

[5]

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 31: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Grouping + Aggregation

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)

[5]

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 32: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

[0] Community | interest : Databases | vertexCount : 3

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

DB

Combination + Grouping + Aggregation

[4]

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

1: personGraph = db.G[0].combine(db.G[1]).combine(db.G[2]) 2: vertexGroupingKeys = [:label, “city”] 3: edgeGroupingKeys = [: label] 4: vertexAggFunc = (superVertex, vertices => superVertex[“count”] = |vertices|) 5: edgeAggFunc = (superEdge, edges => superEdge[“count”] = |edges|) 6: sumGraph = personGraph.groupBy(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc) 7: aggFunc = (g => |g.E|) 8: aggGraph = sumGraph.aggregate(“edgeCount”, aggFunc)

[5] edgeCount : 5

[11] Person

city : Leipzig count : 2

[12] Person

city : Dresden count : 3

[13] Person

city : Berlin count : 1

24

25

26

27

28

knows count : 3

knows count : 1 knows

count : 2

knows count : 2

knows count : 2

Page 33: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 34: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Selection 1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3))

[2] Community | interest : Graphs | vertexCount : 4

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

0

1

2

3

4

5

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

knows since : 2014

knows since : 2014

knows since : 2013

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

Page 35: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Selection 1: resultColl = db.G[0,1,2].select((g => g[“vertexCount”] > 3))

[1] Community | interest : Hadoop| vertexCount : 3 [0] Community | interest : Databases | vertexCount : 3

[0] Tag name : Databases

[1] Tag name : Graphs

[2] Tag name : Hadoop

[3] Forum title : Graph Databases

[4] Forum title : Graph Processing

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

[10] Person name : Frank gender : m city : Berlin age : 23 IP: 169.32.1.3

6 7 8 9

10

11 12 13 14

15

16

17

18 19 20 21

22

23

hasInterest

hasInterest hasInterest

hasInterest

hasModerator hasModerator

hasMember hasMember hasMember hasMember

hasTag hasTag hasTag hasTag

knows since : 2015

knows since : 2015

knows since : 2015

knows since : 2013

DB

[2] Community | interest : Graphs | vertexCount : 4

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

0

1

2

3

4

5

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

Page 36: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM – Operators and Algorithms Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 37: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Apache Flink

Page 38: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Apache Flink

http://www.slideshare.net/robertmetzger1/apache-flink-meetup-munich-november-2015-flink-overview-architecture-integrations-and-use-case

„Streaming Dataflow Engine that provides • data distribution, • communication, • and fault tolerance for distributed computations over data streams.“

HDFS LocalFS HBase JDBC Kafka RabbitMQ Flume (Neo4j) Embedded Tez Yarn Cluster Local

Streaming Dataflow Runtime

DataSet DataStream

Hado

op M

R

Tabl

e

Gelly

ML

Tabl

e

Zepp

elin

Casc

adin

g

MRQ

L

Data

flow

Stor

m (w

ip)

Data

flow

(wip

)

SAM

OA

Page 39: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Apache Flink – DataSet API

DataSet := Distributed Collection of Data Transformation := Operation applied on DataSet Flink Program := Composition of Transformations

DataSet

DataSet

DataSet

Transformation

Transformation

DataSet

DataSet

Transformation DataSet

Flink Program

Page 40: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Apache Flink – DataSet Transformations

aggregate coGroup cross distinct filter first-N flatMap groupBy

join leftOuterJoin rightOuterJoin fullOuterJoin map mapPartition project reduce reduceGroup union

iterate iterateDelta

Page 41: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

The „Hello World“ of Big Data – Word Count

1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); 2: 3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“) 4: „He who controls the past controls the future.“, 5: „He who controls the present controls the past.“); 6: 7: DataSet<Tuple2<String, Integer>> wordCounts = text 8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples 9: .groupBy(0) 10: .sum(1); 11: 12: wordCounts.print(); // trigger execution

flatMap

„He who controls the past controls the future.“ „He who controls the present controls the past.“

(He,1) (who,1) (controls,1) (the,1) (past,1) // ...

groupBy(0)

[(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ...

sum(1)

(He,2) (who,2) (future,1) (past,2) (present,1) // ...

Page 42: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM on Apache Flink

Page 43: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM on Apache Flink – User facing API

LogicalGraph

fromCollections(…) : LogicalGraph fromDataSets(…) : LogicalGraph fromGellyGraph(…) : LogicalGraph getGraphHead() : DataSet<EPGMGraphHead> toGellyGraph() : Graph combine(…) : LogicalGraph intersect(…) : LogicalGraph groupBy(…) : LogicalGraph match(…) : GraphCollection // ...

GraphCollection

fromCollections(…) : GraphCollection

fromDataSets(…) : GraphCollection getGraphHeads() : DataSet<EPGMGraphHead> getGraph(…) : LogicalGraph getGraphs(…) : GraphCollection select(…) : GraphCollection union(…) : GraphCollection distinct(…) : GraphCollection sortBy(…) : GraphCollection // ...

GraphBase

getVertices() : DataSet<EPGMVertex> getEdges() : DataSet<EPGMEdge> // ...

graphHeads : DataSet<EPGMGraphHead> vertices : DataSet<EPGMVertex> edges : DataSet<EPGMEdge>

EPGMDatabase

fromCollections(…) : EPGMDatabase

fromJSONFile(…) : EPGMDatabase fromHBase(…) : EPGMDatabase writeAsJSON(…) : void writeToHBase(…) : void getDatabaseGraph() : LogicalGraph // ...

Page 44: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM on Apache Flink – DataSets

Id Label Properties Graphs

Id Label Properties SourceId TargetId Graphs

EPGMGraphHead

EPGMVertex

EPGMEdge

Id Label Properties POJO

POJO

POJO

DataSet<EPGMGraphHead>

DataSet<EPGMVertex>

DataSet<EPGMEdge>

Id Label Properties Graphs

EPGMVertex

GradoopId := UUID 128-bit

String PropertyList := List<Property> Property := (String, PropertyValue) PropertyValue := byte[]

GradoopIdSet := Set<GradoopId>

Page 45: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM on Apache Flink – Exclusion

// input: firstGraph (G[0]), secondGraph (G[2]) 1: DataSet<GradoopId> graphId = secondGraph.getGraphHead() 2: .map(new Id<G>()); 3: 4: DataSet<V> newVertices = firstGraph.getVertices() 5: .filter(new NotInGraphBroadCast<V>()) 6: .withBroadcastSet(graphId, GRAPH_ID); 7: 8: DataSet<E> newEdges = firstGraph.getEdges() 9: .filter(new NotInGraphBroadCast<E>()) 10: .withBroadcastSet(graphId, GRAPH_ID) 11: .join(newVertices) 12: .where(new SourceId<E>().equalTo(new Id<V>()) 13: .with(new LeftSide<E, V>()) 14: .join(newVertices) 15: .where(new TargetId<E>().equalTo(new Id<V>()) 16: .with(new LeftSide<E, V>());

db.G[0].exclude(db.G[2])

[2] Community | interest : Graphs| vertexCount : 4

[0] Community | interest : Databases | vertexCount : 3

[5] Person name : Alice gender : f city : Leipzig age : 23

[6] Person name : Bob gender : m city : Leipzig age : 30

[7] Person name : Carol gender : f city : Dresden age : 30

[8] Person name : Dave gender : m city : Dresden age : 42

[9] Person name : Eve gender : f city : Dresden age : 35 speaks : en

0

1

2

3

4

5

6 7

knows since : 2014

knows since : 2014

knows since : 2013

knows since : 2013

knows since : 2014

knows since : 2014

knows since : 2015

knows since : 2013

Page 46: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM on Apache Flink – Exclusion

Id Label Properties

2 Community interest: Graphs vertexCount: 4

graphId = secondGraph.getGraphHead()

Id

2

newVertices = firstGraph.getVertices() Id Label Properties Graphs

5 Person name: Alice gender: f … [0, 2]

6 Person name: Bob gender: m … [0, 2]

9 Person name: Eve gender: f … [0]

Id Label Properties Graphs

9 Person name: Eve gender: f … [0]

.map(new Id<G>());

.filter(new NotInGraphBroadCast<V>())

.withBroadcastSet(graphId, GRAPH_ID);

Page 47: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

EPGM in Apache Flink – Exclusion newEdges = firstGraph.getEdges()

Id Label SourceId TargetId Properties Graphs

0 knows 5 6 since: 2014 [0, 2]

1 knows 6 5 since: 2014 [0, 2]

6 knows 9 5 since: 2013 [0]

7 knows 9 6 since: 2015 [0]

Id Label SourceId TargetId Properties Graphs

6 knows 9 5 since: 2013 [0]

7 knows 9 6 since: 2015 [0]

Id Label SourceId TargetId … Id Label …

6 knows 9 5 … 9 Person …

7 knows 9 6 … 9 Person …

Id Label SourceId TargetId …

6 knows 9 5 …

7 knows 9 6 …

Id Label SourceId TargetId … Id Label …

Id Label SourceId TargetId … .with(new LeftSide<E, V>());

.join(newVertices)

.where(new TargetId<E>().equalTo(new Id<V>())

.with(new LeftSide<E, V>())

.join(newVertices)

.where(new SourceId<E>().equalTo(new Id<V>())

.filter(new NotInGraphBroadCast<E>())

.withBroadcastSet(graphId, GRAPH_ID)

Page 48: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Social Network Example

Page 49: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

LDBC Social Network Data

http://ldbcouncil.org/

Page 50: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

LDBC Social Network Data socialNetwork .subgraph( (v => v.label == ‘Person‘), (e => e.label == ‘knows‘)) .transform( (gIn, gOut => gOut = gIn), (vIn, vOut => { vOut.label = vIn.label, vOut[‘city‘] = vIn[‘city‘] vOut[‘gender‘] = vIn[‘gender‘] vOut[‘key‘] = vIn[‘birthday‘]}), (eIn, eOut) => eOut.label = eIn.label)) .callForCollection(:LabelPropagation, [‘key‘, 4]) .apply(g => g.aggregate(‘vertexCount‘, (h => |h.V|)) .select(g => g[‘vertexCount‘] > 50000) .reduce(g, h => g.combine(h)) .groupBy( [‘city‘,‘gender‘], (superVertex, vertices => superVertex[‘count‘] = |vertices|), [], (superEdge, edges => superEdge[‘count‘] = |edges|) .aggregate(‘vCount‘, (g => |g.V|)) .aggregate(‘eCount‘, (g => |g.E|))

Page 51: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

LDBC Social Network Data socialNetwork .subgraph( (v => v.label == ‘Person‘), (e => e.label == ‘knows‘)) .transform( (gIn, gOut => gOut = gIn), (vIn, vOut => { vOut.label = vIn.label, vOut[‘city‘] = vIn[‘city‘] vOut[‘gender‘] = vIn[‘gender‘] vOut[‘key‘] = vIn[‘birthday‘]}), (eIn, eOut) => eOut.label = eIn.label)) .callForCollection(:LabelPropagation, [‘key‘, 4]) .apply(g => g.aggregate(‘vertexCount‘, (h => |h.V|)) .select(g => g[‘vertexCount‘] > 50000) .reduce(g, h => g.combine(h)) .groupBy( [‘city‘,‘gender‘], (superVertex, vertices => superVertex[‘count‘] = |vertices|), [], (superEdge, edges => superEdge[‘count‘] = |edges|) .aggregate(‘vCount‘, (g => |g.V|)) .aggregate(‘eCount‘, (g => |g.E|))

Page 52: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Benchmark Results

Dataset # Vertices # Edges

Graphalytics.1 61,613 2,026,082

Graphalytics.10 260,613 16,600,778

Graphalytics.100 1,695,613 147,437,275

Graphalytics.1000 12,775,613 1,363,747,260

Graphalytics.10000 90,025,613 10,872,109,028

16x Intel(R) Xeon(R) 2.50GHz (6 Cores) 16x 48 GB RAM 1 Gigabit Ethernet Hadoop 2.6.0 Flink 1.0-SNAPSHOT

slots (per worker) 12 jobmanager.heap.mb 2048 taskmanager.heap.mb 40960

0

200

400

600

800

1000

1200

1 2 4 8 16

Runt

ime

[s]

Number of workers

Runtime Graphalytics.100

Page 53: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Benchmark Results

Dataset # Vertices # Edges

Graphalytics.1 61,613 2,026,082

Graphalytics.10 260,613 16,600,778

Graphalytics.100 1,695,613 147,437,275

Graphalytics.1000 12,775,613 1,363,747,260

Graphalytics.10000 90,025,613 10,872,109,028

1

2

4

8

16

1 2 4 8 16

Spee

dup

Number of workers

Speedup Graphalytics.100 Linear

16x Intel(R) Xeon(R) 2.50GHz (6 Cores) 16x 48 GB RAM 1 Gigabit Ethernet Hadoop 2.6.0 Flink 1.0-SNAPSHOT

slots (per worker) 12 jobmanager.heap.mb 2048 taskmanager.heap.mb 40960

Page 54: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Benchmark Results

Dataset # Vertices # Edges

Graphalytics.1 61,613 2,026,082

Graphalytics.10 260,613 16,600,778

Graphalytics.100 1,695,613 147,437,275

Graphalytics.1000 12,775,613 1,363,747,260

Graphalytics.10000 90,025,613 10,872,109,028

1

10

100

1000

10000

Runt

ime

[s]

Datasets

16x Intel(R) Xeon(R) 2.50GHz (6 Cores) 16x 48 GB RAM 1 Gigabit Ethernet Hadoop 2.6.0 Flink 1.0-SNAPSHOT

slots (per worker) 12 jobmanager.heap.mb 2048 taskmanager.heap.mb 40960

Page 55: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Current State & Future Work

Page 56: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Current State Operators

Unary Binary

Grap

h Co

llect

ion

Logi

cal G

raph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call *

Combination

Overlap

Exclusion

Equality

Union

Intersection Difference

Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct Sort

Apply * Reduce *

Call * * auxiliary

Adaptive Partitioning

Subgraph

Page 57: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Current State

0.0.1 First Prototype (May 2015) Hadoop MapReduce and Giraph for operator implementations Too much complexity Performance loss through serialization in HDFS/HBase

0.0.2 Using Flink as execution layer (June 2015) Basic operators

0.1 December 2015 System-side identifiers (UUID) Improved property handling More operator implementations (e.g., Equality, Bool operators) Code refactoring

0.2-SNAPSHOT Graph Pattern Matching Frequent Subgraph Mining Memory optimization (96-bit ID, Dictionary Encoding, …) Tuple Implementation

Page 58: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Contributions to Flink

FLINK-2411 Add basic graph summarization algorithm FLINK-2590 DataSetUtils.zipWithUniqueID creates duplicate Ids FLINK-2905 Add intersect method to Graph class FLINK-2910 Combine tests for binary graph operators FLINK-2941 Implement a neo4j - Flink/Gelly connector FLINK-2981 Update README for building docs FLINK-3064 Missing size check in GroupReduceOperatorBase leads to NPE FLINK-3118 Check if MessageFunction implements ResultTypeQueryable FLINK-3122 Generalize value type in LabelPropagation FLINK-3272 Generalize vertex value type in ConnectedComponents

Flink Forward (October 2015) Meetup Big Data Usergroup Saxony (December 2015)

Page 59: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Contributions welcome!

Code Operator implementations Performance Tuning Extend HBase Storage

People

Bachelor / Master Thesis Open PhD positions in Leipzig, Germany

Data! and Use Cases We are researchers, we assume ...

Page 60: Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016

Thank you! www.gradoop.com https://flink.apache.org http://ldbcouncil.org/ http://dbs.uni-leipzig.de/file/GradoopTR.pdf http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf https://github.com/dbs-leipzig/gradoop https://github.com/s1ck/gdl https://github.com/s1ck/ldbc-flink-import https://github.com/s1ck/flink-neo4j