Distributed Graph Analytics with Gradoop

69
Distributed Graph Analytics with Gradoop inovex Meetup Munich Let‘s talk about Graph Databases July 2016 Martin Junghanns (@kc1s) University of Leipzig – Database Research Group

Transcript of Distributed Graph Analytics with Gradoop

Distributed Graph Analytics with Gradoop

inovex Meetup Munich

Let‘s talk about Graph Databases

July 2016

Martin Junghanns (@kc1s) University of Leipzig – Database Research Group

Motivation

Extended Property Graph Model

Operators

Benchmark

Implementation

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 3

Motivation EPGM Operators Benchmark Implementation

3

Motivation

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 4

Motivation EPGM Operators Benchmark Implementation

4

Motivation

𝑮𝑟𝑎𝑝ℎ = (𝑽𝑒𝑟𝑡𝑖𝑐𝑒𝑠, 𝑬𝑑𝑔𝑒𝑠)

„Graphs are everywhere“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 5

Motivation EPGM Operators Benchmark Implementation

5

Motivation

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑜𝑙𝑙𝑜𝑤𝑒𝑟𝑠)

„Graphs are everywhere“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 6

Motivation EPGM Operators Benchmark Implementation

6

Motivation

𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠)

Alice

Bob

Eve

Dave

Carol

Mallory

Peggy

Trent

„Graphs are everywhere“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 7

Motivation EPGM Operators Benchmark Implementation

7

Motivation

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)

„Graphs are heterogeneous“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 8

Motivation EPGM Operators Benchmark Implementation

8

Motivation

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)

„Graphs can be analyzed“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 9

Motivation EPGM Operators Benchmark Implementation

9

Motivation

0.2

0.28

0.26

0.33

0.25

0.26

Alice

Bob

AC/DC

Dave

Carol

Mallory

Peggy

Metallica

3.6

2.82

𝐺𝑟𝑎𝑝ℎ = (𝐔𝐬𝐞𝐫𝐬 ∪ 𝐁𝐚𝐧𝐝𝐬, 𝐹𝑟𝑖𝑒𝑛𝑑𝑠ℎ𝑖𝑝𝑠 ∪ 𝐿𝑖𝑘𝑒𝑠)

„Graphs can be analyzed“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 10

Motivation EPGM Operators Benchmark Implementation

10

Motivation

Assuming a social network

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 11

Motivation EPGM Operators Benchmark Implementation

11

Motivation

Assuming a social network

1. Determine subgraph

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 12

Motivation EPGM Operators Benchmark Implementation

12

Motivation

Assuming a social network

1. Determine subgraph

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 13

Motivation EPGM Operators Benchmark Implementation

13

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 14

Motivation EPGM Operators Benchmark Implementation

14

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 15

Motivation EPGM Operators Benchmark Implementation

15

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

3. Filter communities

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 16

Motivation EPGM Operators Benchmark Implementation

16

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

3. Filter communities

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 17

Motivation EPGM Operators Benchmark Implementation

17

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

3. Filter communities

4. Find common subgraph

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 18

Motivation EPGM Operators Benchmark Implementation

18

Motivation

Assuming a social network

1. Determine subgraph

2. Find communities

3. Filter communities

4. Find common subgraph

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 19

Motivation EPGM Operators Benchmark Implementation

19

Motivation

Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 20

Motivation EPGM Operators Benchmark Implementation

20

Motivation

Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 21

Motivation EPGM Operators Benchmark Implementation

21

Motivation

Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 22

Motivation EPGM Operators Benchmark Implementation

22

Motivation

Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 23

Motivation EPGM Operators Benchmark Implementation

23

Motivation

Assuming a social network • Heterogeneous data 1. Determine subgraph • Apply graph transformation 2. Find communities • Handle collections of graphs 3. Filter communities • Aggregation, Selection 4. Find common subgraph • Apply dedicated algorithm

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 24

Motivation EPGM Operators Benchmark Implementation

24

Motivation

„And let‘s not forget …“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 25

Motivation EPGM Operators Benchmark Implementation

25

Motivation

“...Graphs are large.”

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 26

Motivation EPGM Operators Benchmark Implementation

26

Motivation

„An open-source framework and research platform for efficient, distributed and domain independent

management and analytics of heterogeneous graph data.“

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 27

Motivation EPGM Operators Benchmark Implementation

27

Motivation

Data Volume and Problem Complexity

Ease

-of-

use

Graph Processing Systems

Graph Databases

Graph Dataflow Systems Gelly

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 28

Motivation EPGM Operators Benchmark Implementation

28

Motivation

Distributed Graph Store (Apache HBase)

Apache Flink Operator Implementation

Apache Flink Distributed Operator Execution

Extended Property Graph Model (EPGM)

Graph Analytical Language (GrALa)

I/O

Distributed File System (Apache HDFS)

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 29

Motivation EPGM Operators Benchmark Implementation

29

Extended Property Graph Model

(EPGM)

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 30

Motivation EPGM Operators Benchmark Implementation EPGM

• Vertices and directed Edges

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 31

Motivation EPGM Operators Benchmark Implementation EPGM

• Vertices and directed Edges

• Logical Graphs

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 32

Motivation EPGM Operators Benchmark Implementation EPGM

• Vertices and directed Edges

• Logical Graphs

• Identifiers

1 3

4

5

2 1 2

3

4

5

1

2

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 33

Motivation EPGM Operators Benchmark Implementation EPGM

• Vertices and directed Edges

• Logical Graphs

• Identifiers

• Type Labels

1 3

4

5

2 1 2

3

4

5

Person Band

Person

Person

Band

likes likes

likes

knows

likes

1|Community

2|Community

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 34

Motivation EPGM Operators Benchmark Implementation EPGM

• Vertices and directed Edges

• Logical Graphs

• Identifiers

• Type Labels

• Properties 1 3

4

5

2 1 2

3

4

5

Person name : Alice born : 1984

Band name : Metallica founded : 1981

Person name : Bob

Person name : Eve

Band name : AC/DC founded : 1973

likes since : 2014

likes since : 2013

likes since : 2015

knows

likes since : 2014

1|Community|interest:Heavy Metal

2|Community|interest:Hard Rock

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 35

Motivation EPGM Operators Benchmark Implementation

35

Operators

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 36

Motivation EPGM Operators Benchmark Implementation

36

Operators

Operators

Unary Binary

Gra

ph

Co

llect

ion

Lo

gica

l Gra

ph

Algorithms

Aggregation

Pattern Matching

Transformation

Grouping Equality

Call

Combination

Overlap

Exclusion

Equality

Union

Intersection

Difference

Flink Gelly Library

BTG Extraction

Frequent Subgraphs

Limit

Selection

Distinct

Sort

Apply

Reduce

Call

Adaptive Partitioning

Subgraph

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 37

Motivation EPGM Operators Benchmark Implementation

37

Operators

1 3

4

5 2

3

1 2

1 3

4

5 2

1 2 4

5

Combination

Overlap

Exclusion

3 Basic Binary Operators

LogicalGraph graph3 = graph1.combine(graph2); LogicalGraph graph4 = graph1.overlap(graph2); LogicalGraph graph5 = graph1.exclude(graph2);

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 38

Motivation EPGM Operators Benchmark Implementation

38

Operators

1 3

4

5 2

3

1 3

4

5 2

3 | vertexCount: 5

UDF

Aggregation

graph3 = graph3.aggregate(“vertexCount”, new AggregateFunction<Long>() { public DataSet<Long> execute(LogicalGraph g) { return Count.count(g.getVertices()); } });

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 39

Motivation EPGM Operators Benchmark Implementation

39

Operators

UDF

3 | vertexCount: 5

name:Alice

f_name:Bob 1 3

4

5 2

3 | Community| vCount: 5

f_name:Alice

f_name:Bob 1 3

4

5 2

Transformation

graph3 = graph3.transformEdges(new TransformationFunction<Edge>() { public Edge execute(Edge e) { e.setLabel(e.getLabel().equals(“orange”) ? “red” : e.getLabel()); return e; }});

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 40

Motivation EPGM Operators Benchmark Implementation

40

Operators

3

1 3

4

5 2

3

4

1 2

3

4

1 2

4

3 5

2 UDF

UDF

UDF

Subgraph

LogicalGraph graph4 = graph3.subgraph( new FilterFunction<Vertex>() { public boolean execute(Vertex v) { return v.getLabel().equals(“green”); }}, new FilterFunction<Edge>() { public boolean execute(Edge e) { return e.getLabel().equals(“orange”); }});

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 41

Motivation EPGM Operators Benchmark Implementation

41

Operators

3

1 3

4

5 2 Pattern

4 5

1 3

4

2

Graph Collection

Pattern Matching

GraphCollection collection = graph3.match(“(:Green)-[:orange]->(:Orange)”);

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 42

Motivation EPGM Operators Benchmark Implementation

42

Operators

Keys

3

1 3

4

5 2

+Aggregate

3

a:23 a:84

a:42

a:12

1 3

4

5 2

a:13

a:21

4

count:2 count:3

max(a):42

max(a):84

max(a):13 max(a):21

6 7

4

6 7

Grouping

LogicalGraph grouped = graph3.groupBy() .useVertexLabel() .useEdgeLabel() .addVertexAggregate(new CountAggregator()) .addEdgeAggregate(new MaxAggregator(“a”));

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 43

Motivation EPGM Operators Benchmark Implementation

43

Operators

Operator

1

2

0 2

3

4 1

5 7 8 6

1 | vertexCount: 5

2 | vertexCount: 4

0 2

3

4 1

5 7 8 6

Apply (e.g. Aggregation)

collection = collection.apply(new Aggregation<>(“vertexCount”, new VertexCount()));

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 44

Motivation EPGM Operators Benchmark Implementation

44

Operators

UDF

vertexCount > 4

1 | vertexCount: 5

2 | vertexCount: 4

0 2

3

4 1

5 7 8 6

1 | vertexCount: 5

0 2

3

4 1

Selection

GraphCollection filtered = collection.select(new FilterFunction<GraphHead>() { public boolean filter(GraphHead g) { return g.getPropertyValue(“vertexCount”).getLong() > 4L; } });

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 45

Motivation EPGM Operators Benchmark Implementation

45

Operators

Algorithm

1

0 2

3

4 1

5 7 8 6

2

3

0 2

3

4 1

5 7 8 6

Call (e.g. Clustering)

GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 46

Motivation EPGM Operators Benchmark Implementation

46

Operators

Algorithm

2

rank:0.11 rank:0.25

rank:0.11 rank:1.29

rank:1.29

rank:1.58 rank:0.11

rank:0.75 rank:0.11

0 2

3

4 1

5 7 8 6

1

0 2

3

4 1

5 7 8 6

Call (e.g. Page Rank)

LogicalGraph pageRankGraph = graph.callForGraph(new PageRankAlgorithm());

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 47

Motivation EPGM Operators Benchmark Implementation

47

Implementation Apache Flink Gradoop on Flink

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 48

Motivation EPGM Operators Benchmark Implementation

48

Implementation

Apache Flink Gradoop on Flink

„Streaming Dataflow Engine that provides data distribution, communication and fault tolerance for distributed computations over data streams.“

https://flink.apache.org/

Streaming Dataflow Runtime

DataSet DataStream

Hadoop M

R

Table

Gelly

Flin

kML

Table

Zep

pel

in

Cas

cad

ing

MR

QL

Dat

aflo

w

Sto

rm

Dat

aflo

w

SAM

OA

GR

AD

OO

P

Cluster (e.g. YARN) Local Cloud (e.g. EC2)

Batch Stream

Data Storage (e.g. Files, HDFS, S3, JDBC, HBase, Kafka, …)

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 49

Motivation EPGM Operators Benchmark Implementation

49

Implementation

Apache Flink Gradoop on Flink

DataSet DataSet DataSet

DataSet DataSet DataSet

DataSet DataSet DataSet

DataSet DataSet DataSet

DataSet DataSet DataSet

DataSet DataSet DataSet

• DataSet := Distributed Collection of Data Objects

• Transformation := Operation on DataSets (Higher-order function)

• Flink Programm := Composition of Transformations

DataSet

DataSet

DataSet

Transformation

Transformation

DataSet

DataSet

Transformation DataSet

Flink Program

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 50

Motivation EPGM Operators Benchmark Implementation

50

Implementation

Apache Flink Gradoop on Flink

Hadoop-like Transformations

• map

• flatMap

• mapPartition

• reduce

• reduceGroup

• coGroup

Special Flink Operations

• iterate

• iterateDelta

SQL-like Transformations

• filter

• project

• cross

• union

• distinct

• first-N (limit)

• groupBy

• aggregate

• join

• leftOuterJoin

• rightOuterJoin

• fullOuterJoin

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 51

Motivation EPGM Operators Benchmark Implementation

51

Implementation

Apache Flink Gradoop on Flink

1: ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); 2: 3: DataSet<String> text = env.fromElements( // or env.readTextFile(„hdfs://…“) 4: „He who controls the past controls the future.“, 5: „He who controls the present controls the past.“); 6: 7: DataSet<Tuple2<String, Integer>> wordCounts = text 8: .flatMap(new LineSplitter()) // splits the line and outputs (word, 1) tuples 9: .groupBy(0) 10: .sum(1); 11: 12: wordCounts.print(); // trigger execution

flatMap

„He who controls the past controls the future.“ „He who controls the present controls the past.“

(He,1) (who,1) (controls,1) (the,1) (past,1) // ...

groupBy(0)

[(He,1),(He,1)] [(who,1),(who,1)] [(future,1)] [(past,1),(past,1)] [(present,1)] // ...

sum(1)

(He,2) (who,2) (future,1) (past,2) (present,1) // ...

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 52

Motivation EPGM Operators Benchmark Implementation

52

Implementation

Apache Flink Gradoop on Flink

flatMap

(He,1) (who,1) (controls,1)

groupBy(0)

[(He,1),(He,1)] [(who,1),(who,1)]

sum(1)

(He,2) (who,2)

Source flatMap

(the,1) (past,1)

groupBy(0)

[(future,1)] [(past,1),(past,1)]

sum(1)

(future,1) (past,2)

flatMap

(future,1) (past,1)

groupBy(0)

[(present,1)]

sum(1)

(present,1)

Sink

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 53

Motivation EPGM Operators Benchmark Implementation

53

Implementation

Apache Flink Gradoop on Flink

Id Label Properties Graphs

Id Label Properties SourceId TargetId Graphs

EPGMGraphHead

EPGMVertex

EPGMEdge

Id Label Properties POJO

POJO

POJO

DataSet<EPGMGraphHead>

DataSet<EPGMVertex>

DataSet<EPGMEdge>

Id Label Properties Graphs

EPGMVertex

GradoopId := UUID 128-bit

String PropertyList := List<Property> Property := (String, PropertyValue) PropertyValue := byte[]

GradoopIdSet := Set<GradoopId>

EPGM Graph Representation

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 54

Motivation EPGM Operators Benchmark Implementation

54

Implementation

Apache Flink Gradoop on Flink

Id Label Properties

1 Community {interest:Heavy Metal}

2 Community {interest:Hard Rock}

Id Label Properties Graphs

1 Person {name:Alice, born:1984} {1}

2 Band {name:Metallica,founded:1981} {1}

3 Person {name:Bob} {1,2}

4 Band {name:AC/DC,founded:1973} {2}

5 Person {name:Eve} {2}

Id Label Source Target Properties Graphs

1 likes 1 2 {since:2014} {1}

2 likes 3 2 {since:2013} {1}

3 likes 3 4 {since:2015} {2}

4 knows 3 5 {} {2}

5 likes 5 4 {since:2014} {2}

likes since : 2014

likes since : 2013

1 3

4

5

2

1|Community|interest:Heavy Metal

2|Community|interest:Hard Rock

Person name : Alice born : 1984

Band name : Metallica founded : 1981

Person name : Bob

Person name : Eve

Band name : AC/DC founded : 1973 likes

since : 2015

knows

likes since : 2014

1 2

3

4

5

DataSet<EPGMGraphHead>

DataSet<EPGMVertex> DataSet<EPGMEdge>

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 55

Motivation EPGM Operators Benchmark Implementation

55

Implementation

Apache Flink Gradoop on Flink

LogicalGraph grouped = graph1.combine(graph2).groupBy() .useVertexLabel() .useEdgeLabel() .addVertexAggregate(new CountAggregator()) .addEdgeAggregate(new CountAggregator());

6 7 Person count : 3

Band count : 2

likes count : 4

knows count : 1

6

7

4

likes since : 2014

likes since : 2013

1 3

4

5

2

1|Community|interest:Heavy Metal

2|Community|interest:Hard Rock

Person name : Alice born : 1984

Band name : Metallica founded : 1981

Person name : Bob

Person name : Eve

Band name : AC/DC founded : 1973 likes

since : 2015

knows

likes since : 2014

1 2

3

4

5

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 56

Motivation EPGM Operators Benchmark Implementation

56

Implementation

Apache Flink Gradoop on Flink

GroupBy(1,2,3) + GC + GR* + Map Assign edges to groups Compute aggregates Build super edges

Filter + Map Extract super vertex tuples Build super vertices

GroupBy(1) + GroupReduce* Assign vertices to groups Compute aggregates Create super vertex tuples Forward updated group members

V

E

(1,[Person],[])

(2,[Band],[])

(3,[Person],[])

(4,[Band],[])

(5,[Person],[])

(-,6,[Person],[3])

(1,6,[],[])

(-,7,[Band],[2])

(2,7,[],[])

(3,6,[],[])

(4,7,[],[])

(5,6,[],[])

v6

v7

(1,6)

(2,7)

(3,6)

(4,7)

(5,6)

(1,1,2,[likes],[])

(2,3,2,[likes],[])

(3,3,4,[likes],[])

(4,3,5,[knows],[])

(5,5,4,[likes],[])

(1,6,7,[likes],[])

(2,6,7,[likes],[])

(3,6,7,[likes],[])

(4,6,6,[knows],[])

(5,6,7,[likes],[])

e6

e7

Map Extract attributes

Filter + Map Extract group members Reduce memory footprint

Join* Replace Source/TargetId with corresponding super vertex id

Map Extract attributes

*requires worker communication

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 57

Motivation EPGM Operators Benchmark Implementation

57

Implementation

Apache Flink Gradoop on Flink

class LogicalGraph<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> { fromCollections(...) : LogicalGraph<G, V, E> fromDataSets(...) : LogicalGraph<G, V, E> fromGellyGraph(...) : LogicalGraph<G, V, E> getGraphHead() : DataSet<G> getVertices() : DataSet<V> getEdges() : DataSet<E> aggregate(...) : LogicalGraph<G, V, E> match(...) : GraphCollection<G, V, E> groupBy(...) : LogicalGraph<G, V, E> subgraph(...) : LogicalGraph<G, V, E> combine(...) : LogicalGraph<G, V, E> // ... }

class GraphCollection<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge > { fromCollections(...) : GraphCollection<G, V, E> fromDataSets(...) : GraphCollection<G, V, E> getGraphHeads() : DataSet<G> getVertices() : DataSet<V> getEdges() : DataSet<E> select(...) : GraphCollection<G, V, E> distinct( ) : GraphCollection<G, V, E> sortBy(...) : GraphCollection<G, V, E> union(...) : GraphCollection<G, V, E> difference(...) : GraphCollection<G, V, E> // ... }

EPGM API (Operators)

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 58

Motivation EPGM Operators Benchmark Implementation

58

Implementation

Apache Flink Gradoop on Flink

interface DataSource<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge> { getLogicalGraph(...) : LogicalGraph<G, V, E> getGraphCollection(...) : GraphCollection<G, V, E> }

interface DataSink<G extends EPGMGraphHead, V extends EPGMVertex, E extends EPGMEdge > { write(LogicalGraph<G, V, E>) : void write(GraphCollection<G, V, E>) : void }

class GraphDataSource<...> implements DataSource<...> { } class HBaseDataSource<...> implements DataSource<...> { } class JSONDataSource<...> implements DataSource<...> { } class TLFDataSource<...> implements DataSource<...> { }

class HBaseDataSink<...> implements DataSink<...> { } class JSONDataSink<...> implements DataSink<...> { } class TLFDataSink<...> implements DataSource<...> { }

EPGM API (I/O)

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 59

Motivation EPGM Operators Benchmark Implementation

59

Benchmark

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 60

Motivation EPGM Operators Benchmark Implementation

60

Benchmark

1. Extract subgraph containing only Persons and knows relations

2. Transform Persons to necessary information

3. Find communities using Label Propagation

4. Aggregate vertex count for each community

5. Select communities with more than 50K users

6. Combine large communities to a single graph

7. Group graph by Persons location and gender

8. Aggregate vertex and edge count of grouped graph

http://ldbcouncil.org/

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 61

Motivation EPGM Operators Benchmark Implementation

61

Benchmark

1. Extract subgraph containing only Persons and knows relations

2. Transform Persons to necessary information

3. Find communities using Label Propagation

4. Aggregate vertex count for each community

5. Select communities with more than 50K users

6. Combine large communities to a single graph

7. Group graph by Persons location and gender

8. Aggregate vertex and edge count of grouped graph

https://git.io/vgozj

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 62

Motivation EPGM Operators Benchmark Implementation

62

Benchmark

Dataset # Vertices # Edges Disk size

Graphalytics.1 61,613 2,026,082 570 MB

Graphalytics.10 260,613 16,600,778 4.5 GB

Graphalytics.100 1,695,613 147,437,275 40.2 GB

Graphalytics.1000 12,775,613 1,363,747,260 372 GB

Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT

• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 63

Motivation EPGM Operators Benchmark Implementation

63

Benchmark

Dataset # Vertices # Edges Disk size

Graphalytics.1 61,613 2,026,082 570 MB

Graphalytics.10 260,613 16,600,778 4.5 GB

Graphalytics.100 1,695,613 147,437,275 40.2 GB

Graphalytics.1000 12,775,613 1,363,747,260 372 GB

Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT

• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960

0

200

400

600

800

1000

1200

1 2 4 8 16R

un

tim

e [

s]

Number of workers

Graphalytics.100

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 64

Motivation EPGM Operators Benchmark Implementation

64

Benchmark

1

2

4

8

16

1 2 4 8 16

Spe

ed

up

Number of workers

Graphalytics.100 Linear

Dataset # Vertices # Edges Disk size

Graphalytics.1 61,613 2,026,082 570 MB

Graphalytics.10 260,613 16,600,778 4.5 GB

Graphalytics.100 1,695,613 147,437,275 40.2 GB

Graphalytics.1000 12,775,613 1,363,747,260 372 GB

Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT

• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960

Distributed Graph Analytics with Gradoop – inovex Meetup Munich – July 2016 65

Motivation EPGM Operators Benchmark Implementation

65

Benchmark

1

10

100

1000

10000

Ru

nti

me

[s]

Dataset # Vertices # Edges Disk size

Graphalytics.1 61,613 2,026,082 570 MB

Graphalytics.10 260,613 16,600,778 4.5 GB

Graphalytics.100 1,695,613 147,437,275 40.2 GB

Graphalytics.1000 12,775,613 1,363,747,260 372 GB

Graphalytics.10000 90,025,613 10,872,109,028 2.9 TB

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12) • 16x 48 GB RAM • 1 Gigabit Ethernet • Hadoop 2.6.0 • Flink 1.0-SNAPSHOT

• slots (per worker) 12 • jobmanager.heap.mb 2048 • taskmanager.heap.mb 40960

Summary

• 0.0.1 First Prototype (May 2015)

– Hadoop MapReduce and Giraph for operator implementations

– Too much complexity

– Performance loss through serialization in HDFS/HBase

• 0.0.2 Using Flink as execution layer (June 2015)

– Basic operators

• 0.1 December 2015

– System-side identifiers (UUID)

– Improved property handling

– More operator implementations (e.g., Equality, Bool operators)

– Code refactoring

• 0.2-SNAPSHOT August 2016

– Graph Pattern Matching

– Frequent Subgraph Mining

– Memory optimization (96-bit ID, Dictionary Encoding, …)

– Refactoring

Release History

Summary

Contributions welcome!

• Code • I/O Formats (GraphML, DOT, …) • Operators and Algorithms • Tuning (Memory consumption, serialization, …) • API improvements

• Use cases and data • Business Intelligence • Fraud Detection • Pattern Mining • …

• Extended Property Graph Model • Schema flexible: Type Labels and Properties • Logical Graphs / Graphs Collection

• Graph and Collection Operators • Combination to analytical workflows

• Implemented on Apache Flink • Built-in scalability • Combine with other libraries

Summary

www.gradoop.com

[1] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E., „Analyzing Extended Property Graphs with Apache Flink“, Int. Workshop on Network Data Analytics (NDA), SIGMOD 2016.

[2] Petermann, A.; Junghanns, M.,

„Scalable Business Intelligence with Graph Collections“, it – Special Issue on Big Data Analytics, 2016.

[3] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E.,

„Graph-based Data Integration and Business Intelligence with BIIIG“, Proc. VLDB Conf. (Demo), 2014.