Distributed Graph Analytics with Gradoop (wiki.ldbcouncil.org/.../20170209-LDBC-Gradoop.pdf)

Distributed Graph Analytics with Gradoop 9th LDBC TUC Meeting 9-10 February 2017 Walldorf, Germany Martin Junghanns University of Leipzig – Database Research Group

Page 1

Distributed Graph Analytics with Gradoop

9th LDBC TUC Meeting

9-10 February 2017

Walldorf, Germany

Martin Junghanns University of Leipzig – Database Research Group

Page 2

Gradoop Extended Property Graphs Summary Evaluation

Gradoop

"An open-source graph dataflow system for declarative analytics of heterogeneous graph data."

Page 3

Gradoop

[Figure: Gradoop architecture. Components named on the slide:]

• Distributed Graph Storage (Apache HDFS)
• Apache Flink Operator Implementation
• Distributed Operator Execution (Apache Flink)
• Extended Property Graph Model (EPGM)
• Graph Dataflow Operators
• I/O

Page 4

Extended Property Graphs

• Vertices and directed Edges

• Logical Graphs

• Identifiers

• Type Labels

• Properties

Extended Property Graph Model (EPGM)

[Figure: example extended property graph with five vertices and two logical graphs. Elements recoverable from the slide:]

• Vertices: Person (name: Eve); Person (name: Alice, yob: 1984); Person (name: Bob); Band (name: Metallica, founded: 1981); Band (name: AC/DC, founded: 1973)
• Edges: likes (since: 2014); likes (since: 2013); likes (since: 2015); likes; knows
• Logical graphs: Community (interest: Heavy Metal); Community (interest: Hard Rock)
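The model elements listed above (identifiers, type labels, properties, logical graph membership) can be sketched with plain Java classes. This is a minimal sketch and not Gradoop's own API: the names `EpgmSketch`, `Element`, `Vertex`, `Edge` and `verticesOf` are hypothetical, and the ids and the endpoints of the `likes` edge are assumed, since the slide text does not preserve them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of an extended property graph (not Gradoop's API).
final class EpgmSketch {

    // Shared shape of vertices, edges and logical graphs: identifier, type label, properties.
    static class Element {
        final long id;
        final String label;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Element(long id, String label) {
            this.id = id;
            this.label = label;
        }
    }

    // A vertex remembers the ids of the logical graphs that contain it.
    static final class Vertex extends Element {
        final Set<Long> graphIds = new LinkedHashSet<>();

        Vertex(long id, String label) {
            super(id, label);
        }
    }

    // A directed edge from sourceId to targetId, again with graph membership.
    static final class Edge extends Element {
        final long sourceId;
        final long targetId;
        final Set<Long> graphIds = new LinkedHashSet<>();

        Edge(long id, String label, long sourceId, long targetId) {
            super(id, label);
            this.sourceId = sourceId;
            this.targetId = targetId;
        }
    }

    final List<Element> graphHeads = new ArrayList<>();
    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    // Returns the vertices that belong to the logical graph with the given id.
    List<Vertex> verticesOf(long graphId) {
        List<Vertex> result = new ArrayList<>();
        for (Vertex v : vertices) {
            if (v.graphIds.contains(graphId)) {
                result.add(v);
            }
        }
        return result;
    }

    // Builds a fragment of the slide's example; ids and edge endpoints are assumed.
    static EpgmSketch example() {
        EpgmSketch db = new EpgmSketch();

        Element heavyMetal = new Element(1, "Community");
        heavyMetal.properties.put("interest", "Heavy Metal");
        db.graphHeads.add(heavyMetal);

        Vertex alice = new Vertex(1, "Person");
        alice.properties.put("name", "Alice");
        alice.properties.put("yob", 1984);

        Vertex metallica = new Vertex(2, "Band");
        metallica.properties.put("name", "Metallica");
        metallica.properties.put("founded", 1981);

        Vertex eve = new Vertex(3, "Person");
        eve.properties.put("name", "Eve");

        Edge likes = new Edge(1, "likes", alice.id, metallica.id);
        likes.properties.put("since", 2014);

        // Alice, Metallica and the edge between them are placed in the community graph.
        alice.graphIds.add(heavyMetal.id);
        metallica.graphIds.add(heavyMetal.id);
        likes.graphIds.add(heavyMetal.id);

        db.vertices.add(alice);
        db.vertices.add(metallica);
        db.vertices.add(eve);
        db.edges.add(likes);
        return db;
    }
}
```

Storing graph membership on each vertex and edge is one possible design; it lets a vertex belong to several logical graphs without duplicating it.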

Page 5

Extended Property Graphs

EPGM Operators/Transformations

Logical Graph
• Unary: Aggregation, Pattern Matching, Transformation, Grouping, Call, Subgraph
• Binary: Equality, Combination, Overlap, Exclusion, Fusion

Graph Collection
• Unary: Limit, Selection, Pattern Matching, Distinct, Apply, Reduce, Call
• Binary: Equality, Union, Intersection, Difference

Algorithms
• Flink Gelly Library
• BTG Extraction
• Frequent Subgraph Mining
• Adaptive Partitioning

Page 6

Extended Property Graphs

Basic Binary Operators

[Figure: two logical graphs, 1 and 2, and the results of the three operators: Combination (graph 3), Overlap (graph 4) and Exclusion (graph 5).]

LogicalGraph graph3 = graph1.combine(graph2);
LogicalGraph graph4 = graph1.overlap(graph2);
LogicalGraph graph5 = graph1.exclude(graph2);
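The three calls above can be read as set operations on vertices and edges. The following is a minimal sketch under that assumption, not Gradoop's implementation: `BinaryOperatorSketch` is a hypothetical stand-in for a logical graph, and the rule that exclusion keeps only edges between the remaining vertices is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a logical graph: vertex ids plus edges as id -> {source, target}.
final class BinaryOperatorSketch {
    final Set<Integer> vertices = new HashSet<>();
    final Map<Integer, int[]> edges = new HashMap<>();

    BinaryOperatorSketch vertex(int id) {
        vertices.add(id);
        return this;
    }

    BinaryOperatorSketch edge(int id, int source, int target) {
        edges.put(id, new int[] {source, target});
        return this;
    }

    // Combination: union of both vertex sets and both edge sets.
    BinaryOperatorSketch combine(BinaryOperatorSketch other) {
        BinaryOperatorSketch result = new BinaryOperatorSketch();
        result.vertices.addAll(vertices);
        result.vertices.addAll(other.vertices);
        result.edges.putAll(edges);
        result.edges.putAll(other.edges);
        return result;
    }

    // Overlap: vertices and edges that are contained in both graphs.
    BinaryOperatorSketch overlap(BinaryOperatorSketch other) {
        BinaryOperatorSketch result = new BinaryOperatorSketch();
        result.vertices.addAll(vertices);
        result.vertices.retainAll(other.vertices);
        for (Map.Entry<Integer, int[]> e : edges.entrySet()) {
            if (other.edges.containsKey(e.getKey())) {
                result.edges.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    // Exclusion: vertices only in this graph, plus this graph's edges between them (assumed rule).
    BinaryOperatorSketch exclude(BinaryOperatorSketch other) {
        BinaryOperatorSketch result = new BinaryOperatorSketch();
        result.vertices.addAll(vertices);
        result.vertices.removeAll(other.vertices);
        for (Map.Entry<Integer, int[]> e : edges.entrySet()) {
            int[] ends = e.getValue();
            if (result.vertices.contains(ends[0]) && result.vertices.contains(ends[1])) {
                result.edges.put(e.getKey(), ends);
            }
        }
        return result;
    }
}
```

Each operator returns a new graph and leaves its inputs unchanged, which matches how the slide assigns the results to `graph3`, `graph4` and `graph5`.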

Page 7

Extended Property Graphs

Subgraph

[Figure: user-defined vertex and edge filter functions (UDFs) are applied to graph 3 and yield graph 4.]

LogicalGraph graph4 = graph3.subgraph(
    vertex -> vertex.getLabel().equals("Green"),
    edge -> edge.getLabel().equals("orange"));
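To make the role of the two filter functions concrete, here is a minimal sketch of a subgraph operator driven by predicates. `SubgraphSketch` and its nested classes are hypothetical and not Gradoop types; dropping edges whose endpoints were filtered out is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for a logical graph (not Gradoop's LogicalGraph).
final class SubgraphSketch {

    static final class Vertex {
        final int id;
        final String label;

        Vertex(int id, String label) {
            this.id = id;
            this.label = label;
        }

        String getLabel() {
            return label;
        }
    }

    static final class Edge {
        final int source;
        final int target;
        final String label;

        Edge(int source, int target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }

        String getLabel() {
            return label;
        }
    }

    final List<Vertex> vertices = new ArrayList<>();
    final List<Edge> edges = new ArrayList<>();

    // Keeps vertices and edges accepted by the user-defined filter functions.
    SubgraphSketch subgraph(Predicate<Vertex> vertexFilter, Predicate<Edge> edgeFilter) {
        SubgraphSketch result = new SubgraphSketch();
        Set<Integer> keptIds = new HashSet<>();
        for (Vertex v : vertices) {
            if (vertexFilter.test(v)) {
                result.vertices.add(v);
                keptIds.add(v.id);
            }
        }
        for (Edge e : edges) {
            // Assumption of this sketch: edges that would dangle are dropped as well.
            if (edgeFilter.test(e) && keptIds.contains(e.source) && keptIds.contains(e.target)) {
                result.edges.add(e);
            }
        }
        return result;
    }
}
```

With this sketch, `g.subgraph(v -> v.getLabel().equals("Green"), e -> e.getLabel().equals("orange"))` mirrors the call on the slide.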

Page 8

Extended Property Graphs

Aggregation

[Figure: a user-defined aggregate function (UDF) is applied to graph 3; the result is stored as a property of the graph: 3 | vertexCount: 5.]

graph3 = graph3.aggregate(new VertexCount());
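The aggregation step can be sketched as a function that is evaluated over the graph and whose result is attached to the graph as a property. This is a minimal sketch, not Gradoop's implementation: `AggregationSketch`, its fields and the explicit property key argument are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a logical graph with graph-level properties.
final class AggregationSketch {
    final List<Integer> vertexIds = new ArrayList<>();
    final Map<String, Object> graphProperties = new LinkedHashMap<>();

    // Evaluates the function on this graph and stores the value as a graph property.
    // Returns a new graph; the input graph is left unchanged.
    AggregationSketch aggregate(String propertyKey, Function<AggregationSketch, Object> function) {
        AggregationSketch result = new AggregationSketch();
        result.vertexIds.addAll(vertexIds);
        result.graphProperties.putAll(graphProperties);
        result.graphProperties.put(propertyKey, function.apply(this));
        return result;
    }

    // A vertex count function, playing the role of VertexCount on the slide.
    static Function<AggregationSketch, Object> vertexCount() {
        return graph -> (long) graph.vertexIds.size();
    }
}
```

For a graph with five vertices, `graph.aggregate("vertexCount", AggregationSketch.vertexCount())` yields a graph whose `vertexCount` property is 5, as in the figure.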

Page 9

Extended Property Graphs

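Grouping

[Figure: graph 3 is grouped by keys (vertex and edge labels) and aggregates are added, yielding graph 4 with two grouped vertices, 6 and 7. Input edge properties: a:23, a:84, a:42, a:12, a:13, a:21. Grouped vertices: count:2, count:3. Grouped edges: max(a):42, max(a):84, max(a):13, max(a):21.]

LogicalGraph grouped = graph3.groupBy()
    .useVertexLabel()
    .useEdgeLabel()
    .addVertexAggregate(new CountAggregator())
    .addEdgeAggregate(new MaxAggregator("a"));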
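The grouping call can be sketched as two aggregations: vertices are counted per label, and edges are grouped by source group, edge label and target group with a maximum over property `a`. This is a minimal sketch, not Gradoop's implementation; `GroupingSketch` and its methods are hypothetical, and encoding the edge group as a string key is a simplification.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical grouping of a labeled graph (not Gradoop's grouping operator).
final class GroupingSketch {

    static final class Edge {
        final int source;
        final int target;
        final String label;
        final int a;

        Edge(int source, int target, String label, int a) {
            this.source = source;
            this.target = target;
            this.label = label;
            this.a = a;
        }
    }

    // Vertex grouping key: the label. Aggregate: number of vertices per group.
    static Map<String, Long> countVerticesByLabel(Map<Integer, String> vertexLabels) {
        Map<String, Long> counts = new TreeMap<>();
        for (String label : vertexLabels.values()) {
            counts.merge(label, 1L, Long::sum);
        }
        return counts;
    }

    // Edge grouping key: (source group, edge label, target group). Aggregate: max of property a.
    static Map<String, Integer> maxAByEdgeGroup(Map<Integer, String> vertexLabels, List<Edge> edges) {
        Map<String, Integer> maxima = new TreeMap<>();
        for (Edge e : edges) {
            String key = "(" + vertexLabels.get(e.source) + ")-[" + e.label + "]->("
                    + vertexLabels.get(e.target) + ")";
            maxima.merge(key, e.a, Integer::max);
        }
        return maxima;
    }
}
```

Each entry of the first map corresponds to one grouped vertex and each entry of the second map to one grouped edge between two grouped vertices.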

Page 10

Extended Property Graphs

Pattern Matching (Single Graph Input)

[Figure: a pattern is matched against graph 3; the matches are returned as a graph collection (graphs 4 and 5).]

GraphCollection collection = graph3.match("(:Green)-[:orange]->(:Orange)");
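For a pattern that consists of a single labeled edge, such as the one above, matching reduces to a scan over the edges. The sketch below assumes exactly that restricted case and is not Gradoop's pattern matching engine; `SingleEdgePatternSketch` and `match` are hypothetical names, and each match is returned as a source/target id pair instead of a graph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical matcher for patterns of the form (:A)-[:r]->(:B) only.
final class SingleEdgePatternSketch {

    static final class Edge {
        final int source;
        final int target;
        final String label;

        Edge(int source, int target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }
    }

    // Returns one {sourceId, targetId} pair per edge that matches all three labels.
    static List<int[]> match(Map<Integer, String> vertexLabels, List<Edge> edges,
                             String sourceLabel, String edgeLabel, String targetLabel) {
        List<int[]> matches = new ArrayList<>();
        for (Edge e : edges) {
            if (edgeLabel.equals(e.label)
                    && sourceLabel.equals(vertexLabels.get(e.source))
                    && targetLabel.equals(vertexLabels.get(e.target))) {
                matches.add(new int[] {e.source, e.target});
            }
        }
        return matches;
    }
}
```

Every returned pair would correspond to one graph of the resulting collection. Patterns with several edges or shared variables need a real matching algorithm and are outside this sketch.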

Page 11

Extended Property Graphs

Pattern Matching (Graph-Transaction Setting)

[Figure: a pattern is matched against each graph of the input graph collection (graphs 1 and 2). In the output collection each graph carries the result as a property: 1 | contained = true, 2 | contained = false.]

GraphCollection filtered = coll.match(
    "(v0:Green)-[:blue]->(:Orange),(v0)-[:blue]->(:Orange)");

Page 12

Extended Property Graphs

Call (e.g. Clustering)

[Figure: an algorithm is called on a logical graph and returns a graph collection.]

GraphCollection clustering = graph.callForCollection(new ClusteringAlgorithm());
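The call operator plugs an external algorithm into the workflow: the algorithm receives one graph and returns a collection of graphs. The sketch below illustrates this contract with connected components as a stand-in for a clustering algorithm; that choice, and the names `CallSketch`, `CollectionAlgorithm`, `ConnectedComponents` and `undirected`, are assumptions of this sketch and not part of Gradoop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical "call" operator: one graph in, a collection of vertex sets out.
final class CallSketch {

    interface CollectionAlgorithm {
        List<Set<Integer>> execute(Map<Integer, Set<Integer>> adjacency);
    }

    // Delegates to the plugged-in algorithm, as callForCollection does on the slide.
    static List<Set<Integer>> callForCollection(Map<Integer, Set<Integer>> graph,
                                                CollectionAlgorithm algorithm) {
        return algorithm.execute(graph);
    }

    // Stand-in clustering: each connected component becomes one cluster (BFS).
    static final class ConnectedComponents implements CollectionAlgorithm {
        @Override
        public List<Set<Integer>> execute(Map<Integer, Set<Integer>> adjacency) {
            List<Set<Integer>> clusters = new ArrayList<>();
            Set<Integer> visited = new HashSet<>();
            for (Integer start : adjacency.keySet()) {
                if (!visited.add(start)) {
                    continue; // already assigned to a cluster
                }
                Set<Integer> cluster = new TreeSet<>();
                Deque<Integer> queue = new ArrayDeque<>();
                queue.add(start);
                while (!queue.isEmpty()) {
                    Integer v = queue.poll();
                    cluster.add(v);
                    for (Integer n : adjacency.getOrDefault(v, Collections.emptySet())) {
                        if (visited.add(n)) {
                            queue.add(n);
                        }
                    }
                }
                clusters.add(cluster);
            }
            return clusters;
        }
    }

    // Builds a symmetric adjacency map from {u, v} pairs.
    static Map<Integer, Set<Integer>> undirected(int[][] edges) {
        Map<Integer, Set<Integer>> adjacency = new TreeMap<>();
        for (int[] e : edges) {
            adjacency.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adjacency.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adjacency;
    }
}
```

Because the algorithm sits behind an interface, a different clustering method can be passed in without changing the calling workflow.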

Page 13

Evaluation

1. Extract subgraph containing only Persons and knows relations

2. Transform Persons to necessary information

3. Find communities using Label Propagation

4. Aggregate vertex count for each community

5. Select communities with more than 50K users

6. Combine large communities to a single graph

7. Group graph by Persons location and gender

8. Aggregate vertex and edge count of grouped graph
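The later steps of this workflow can be sketched on a small in-memory list. The sketch assumes that steps 1 to 3 have already run, so every person carries a community id; it then covers steps 4 to 8 for vertices only and leaves out edge counts. `WorkflowSketch`, `Person`, `summarize` and the `minSize` parameter are hypothetical, and none of this is the benchmark code from the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory version of steps 4 to 8 (vertices only).
final class WorkflowSketch {

    static final class Person {
        final long community; // assumed output of steps 1 to 3
        final String location;
        final String gender;

        Person(long community, String location, String gender) {
            this.community = community;
            this.location = location;
            this.gender = gender;
        }
    }

    static Map<String, Long> summarize(List<Person> persons, long minSize) {
        // Step 4: vertex count per community.
        Map<Long, Long> sizes = persons.stream()
                .collect(Collectors.groupingBy(p -> p.community, Collectors.counting()));

        // Steps 5 and 6: keep members of large communities and combine them into one set.
        List<Person> combined = persons.stream()
                .filter(p -> sizes.get(p.community) > minSize)
                .collect(Collectors.toList());

        // Steps 7 and 8: group by location and gender, then count the vertices per group.
        return combined.stream()
                .collect(Collectors.groupingBy(
                        p -> p.location + "/" + p.gender, TreeMap::new, Collectors.counting()));
    }
}
```

In the workflow on the slide the threshold is 50K users; the sketch takes it as a parameter so that it can be tried on small inputs.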

http://ldbcouncil.org/

Overall performance [3]

Page 14

Evaluation


https://git.io/vD2Ii

Overall performance [3]

Page 15

Evaluation

Dataset               # Vertices    # Edges
Graphalytics.1        61,613        2,026,082
Graphalytics.10       260,613       16,600,778
Graphalytics.100      1,695,613     147,437,275
Graphalytics.1000     12,775,613    1,363,747,260
Graphalytics.10000    90,025,613    10,872,109,028

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
• 16x 48 GB RAM
• 1 Gigabit Ethernet
• Hadoop 2.6.0
• Flink 1.0-SNAPSHOT

• slots (per worker) 12
• jobmanager.heap.mb 2048
• taskmanager.heap.mb 40960

Overall performance [3]

Page 16

Evaluation


[Figure: runtime [s] (axis 0 to 1200) over the number of workers (1, 2, 4, 8, 16) for Graphalytics.100.]

[Figure: speedup over the number of workers (1, 2, 4, 8, 16) for Graphalytics.100, plotted against linear speedup.]
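For reading the speedup plot, it may help to write down the usual definition of speedup relative to a single worker. The slide does not spell this out, so the definition is an assumption, and $T(n)$ is a symbol introduced here for the runtime with $n$ workers.

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad S_{\text{linear}}(n) = n
```

Under this definition the "Linear" reference line corresponds to $S(n) = n$, that is, 16 workers finishing 16 times faster than one.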

Overall performance [3]

Page 17

Evaluation

[Figure: runtime [s]; tick labels 1, 10, 100, 1000, 10000.]


Overall performance [3]

Page 18

Evaluation

Grouping [1]

• 16x Intel(R) Xeon(R) 2.50GHz 6 (12)
• 16x 48 GB RAM
• 1 Gigabit Ethernet
• Hadoop 2.6.0
• Flink 1.0.3

• slots (per worker) 12
• jobmanager.heap.mb 2048
• taskmanager.heap.mb 40960

Dataset              # Vertices    # Edges
Graphalytics.1       61,613        2,026,082
Graphalytics.10      260,613       16,600,778
Graphalytics.100     1,695,613     147,437,275
Graphalytics.1000    12,775,613    1,363,747,260
Pokec                1,632,803     30,622,564

Page 19

Summary

• Extended Property Graph Model
  • Schema flexible: Type Labels and Properties
  • Logical Graphs / Graph Collections

• Graph and Collection Operators
  • Combination to analytical workflows

• Implemented on Apache Flink
  • Built-in scalability
  • Combine with other libraries

Page 20

Summary

www.gradoop.com

[1] Junghanns, M.; Petermann, A.; K.; Rahm, E., "Distributed Grouping of Property Graphs with Gradoop", Proc. BTW Conf., 2017.

[2] Petermann, A.; Junghanns, M.; Kemper, S.; Gomez, K.; Teichmann, N.; Rahm, E., "Graph Mining for Complex Data Analytics", Proc. ICDM Conf. (Demo), 2016.

[3] Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E., "Analyzing Extended Property Graphs with Apache Flink", Int. Workshop on Network Data Analytics (NDA), SIGMOD, 2016.

[4] Petermann, A.; Junghanns, M., "Scalable Business Intelligence with Graph Collections", it – Special Issue on Big Data Analytics, 2016.

[5] Petermann, A.; Junghanns, M.; Müller, M.; Rahm, E., "Graph-based Data Integration and Business Intelligence with BIIIG", Proc. VLDB Conf. (Demo), 2014.