Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015

GRADOOP: Scalable Graph Analytics with Apache Flink

Martin Junghanns University of Leipzig

About the speaker and the team

2011: Bachelor of Engineering
Thesis: Partitioning of Dynamic Graphs

2014: Master of Science
Thesis: Graph Database Systems for Business Intelligence

Now: PhD Student, Database Group, University of Leipzig

Distributed Systems, Distributed Graph Data Management, Graph Theory & Algorithms

Professional Experience: sones GraphDB, SAP

André, PhD Student
Martin, PhD Student
Kevin, M.Sc. Student
Niklas, M.Sc. Student

Motivation

$\text{Graph} = (\text{Vertices}, \text{Edges})$

“Graphs are everywhere”

$\text{Graph} = (\text{Users}, \text{Friendships})$

[Figure: social network with the vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy]

$\text{Graph} = (\text{Users}, \text{Followers})$

[Figure: follower network with the vertices Alice, Bob, Eve, Dave, Carol, Mallory, Peggy, Trent]

$\text{Graph} = (\text{Cities}, \text{Connections})$

[Figure: graph of cities with their populations]

Leipzig, pop: 544K
Dresden, pop: 536K
Berlin, pop: 3.5M
Hamburg, pop: 1.7M
Munich, pop: 1.4M
Chemnitz, pop: 243K
Nuremberg, pop: 500K
Cologne, pop: 1M

“Graphs are large”

World Wide Web: ca. 1 billion websites
Facebook: ca. 1.49 billion active users, ca. 340 friends per user

End-to-End Graph Analytics

Data Integration: Integrate data from one or more sources into a dedicated graph storage with a common graph data model

Graph Analytics: Definition of analytical workflows from operator algebra

Representation: Result representation in a meaningful way

Graph Data Management

                  Graph Database Systems   Graph Processing Systems   Distributed Workflow Systems
                  (Neo4j, OrientDB)        (Pregel, Giraph)           (Flink Gelly, Spark GraphX)
Data Model        Rich Graph Models        Generic Graph Models       Generic Graph Models
Focus             Local ACID Operations    Global Graph Operations    Global Data and Graph Operations
Query Language    Yes                      No                         No
Persistency       Yes                      No                         No
Scalability       Vertical                 Horizontal                 Horizontal
Workflows         No                       No                         Yes
Data Integration  No                       No                         No
Graph Analytics   No                       Yes                        Yes
Representation    Yes                      No                         No

What’s missing?

An end-to-end framework and research platform for efficient, distributed and domain-independent graph data management and analytics.

Gradoop Architecture & Data Model

High Level Architecture

[Figure: architecture diagram. Components shown: HDFS/YARN Cluster; HBase Distributed Graph Store; Extended Property Graph Model; Flink Operator Implementations; Flink Operator Execution; Workflow Declaration (Visual, GrALa DSL); Workflow Execution. The stages Data Integration, Graph Analytics and Representation span the components. Legend: data flow, control flow.]

Extended Property Graph Model

Graph Operators

Operator            GrALa notation

Binary
Combination         graph.combine(otherGraph) : Graph
Overlap             graph.overlap(otherGraph) : Graph
Exclusion           graph.exclude(otherGraph) : Graph
Isomorphism         graph.isIsomorphicTo(otherGraph) : Boolean

Unary
Pattern Matching    graph.match(patternGraph, predicate) : Collection
Aggregation         graph.aggregate(propertyKey, aggregateFunction) : Graph
Projection          graph.project(vertexFunction, edgeFunction) : Graph
Summarization       graph.summarize(vertexGroupKeys, vertexAggregateFunction, edgeGroupKeys, edgeAggregateFunction) : Graph
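The three set-like binary operators can be illustrated with a small Python sketch. This is not Gradoop code: the `Graph` class, the `graph` helper and the sample data are hypothetical, and the semantics are an assumption (combination as union, overlap as intersection, exclusion as the vertices found only in the first graph plus the edges between them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Toy logical graph: a set of vertex ids and a set of (source, target) edges."""
    vertices: frozenset
    edges: frozenset

    def combine(self, other):
        # Combination (assumed semantics): union of vertex sets and edge sets.
        return Graph(self.vertices | other.vertices, self.edges | other.edges)

    def overlap(self, other):
        # Overlap (assumed semantics): elements contained in both graphs.
        return Graph(self.vertices & other.vertices, self.edges & other.edges)

    def exclude(self, other):
        # Exclusion (assumed semantics): vertices only in this graph,
        # plus the edges whose endpoints both survive.
        kept = self.vertices - other.vertices
        kept_edges = frozenset(
            (s, t) for (s, t) in self.edges if s in kept and t in kept
        )
        return Graph(kept, kept_edges)


def graph(vertices, edges):
    """Hypothetical helper that builds a Graph from plain sets."""
    return Graph(frozenset(vertices), frozenset(edges))


g0 = graph({"Alice", "Bob", "Eve"}, {("Alice", "Bob"), ("Bob", "Eve")})
g1 = graph({"Bob", "Eve", "Dave"}, {("Bob", "Eve"), ("Eve", "Dave")})

combined = g0.combine(g1)
shared = g0.overlap(g1)
only_g0 = g0.exclude(g1)
```

Because every operator returns a new `Graph`, calls can be chained in the same way as `db.G[0].combine(db.G[1]).combine(db.G[2])` in GrALa.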

Combination

personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])

Summarization

personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
vertexGroupingKeys = {:type, "city"}
edgeGroupingKeys = {:type}
vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)

Graph Collection Operators

Operator        GrALa notation

Collection
Selection       collection.select(predicate) : Collection
Distinct        collection.distinct() : Collection
Sort by         collection.sortBy(key, [:asc|:desc]) : Collection
Top             collection.top(limit) : Collection
Union           collection.union(otherCollection) : Collection
Intersection    collection.intersect(otherCollection) : Collection
Difference      collection.difference(otherCollection) : Collection

Auxiliary
Apply           collection.apply(unaryGraphOperator) : Collection
Reduce          collection.reduce(binaryGraphOperator) : Graph
Call            [graph|collection].callFor[Graph|Collection](algorithm, parameters) : [Graph|Collection]

Selection

collection = <db.G[0],db.G[1],db.G[2]>
predicate = (Graph g => |g.V| > 3)
result = collection.select(predicate)
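The selection example, together with the auxiliary reduce operator, can be sketched in plain Python. The dictionary-based graph representation, the helper names `select` and `combine`, and the sample graphs are assumptions made for illustration only.

```python
from functools import reduce

# Each toy graph is a dict with a vertex set "V" and an edge set "E".
g0 = {"V": {1, 2, 3, 4}, "E": {(1, 2), (2, 3), (3, 4)}}
g1 = {"V": {4, 5}, "E": {(4, 5)}}
g2 = {"V": {5, 6, 7, 8, 9}, "E": {(5, 6), (7, 8)}}
collection = [g0, g1, g2]


def select(graphs, predicate):
    # Selection: keep the graphs for which the predicate holds.
    return [g for g in graphs if predicate(g)]


def combine(a, b):
    # Binary graph operator used by reduce: union of vertices and edges.
    return {"V": a["V"] | b["V"], "E": a["E"] | b["E"]}


# Same predicate as in the slide: graphs with more than three vertices.
result = select(collection, lambda g: len(g["V"]) > 3)

# Reduce folds a binary graph operator over the collection into one graph.
merged = reduce(combine, collection)
```

Here `result` keeps `g0` and `g2`, while `g1` is dropped because it has only two vertices.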

Extended Property Graph Model in Flink

VertexData: ID, Label, Properties, Graphs
EdgeData: ID, Label, Properties, Source Vertex, Target Vertex, Graphs
GraphData: ID, Label, Properties

Pojo Representation (VertexData, EdgeData and GraphData are POJOs; vertices and edges use Gelly types)
𝒱: DataSet<Vertex<ID,VertexData>>
ℰ: DataSet<Edge<ID,EdgeData>>
𝒢: DataSet<Subgraph<ID,GraphData>>

Tuple Representation (VertexData, EdgeData and GraphData are Tuples)
𝒱: DataSet<VertexData>
ℰ: DataSet<EdgeData>
𝒢: DataSet<GraphData>
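The field layout of the three element types can be mirrored with Python dataclasses. This is only an illustrative sketch, not the Gradoop classes: the field types, the `vertices_of` helper and the sample values are assumptions, and `properties` and `graphs` are moved to the end of `EdgeData` because they carry default values.

```python
from dataclasses import dataclass, field


@dataclass
class GraphData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class VertexData:
    id: int
    label: str
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the logical graphs containing this vertex


@dataclass
class EdgeData:
    id: int
    label: str
    source_vertex: int
    target_vertex: int
    properties: dict = field(default_factory=dict)
    graphs: set = field(default_factory=set)  # ids of the logical graphs containing this edge


def vertices_of(graph_id, vertices):
    """Hypothetical helper: vertices that belong to the given logical graph."""
    return [v for v in vertices if graph_id in v.graphs]


community = GraphData(0, "Community", {"interest": "Graphs"})
alice = VertexData(1, "Person", {"name": "Alice"}, {0})
bob = VertexData(2, "Person", {"name": "Bob"}, {0, 1})
eve = VertexData(3, "Person", {"name": "Eve"}, {1})
knows = EdgeData(10, "knows", source_vertex=1, target_vertex=2, graphs={0})

members = vertices_of(community.id, [alice, bob, eve])
```

Storing the graph ids on each vertex and edge lets one element take part in several logical graphs, as Bob does in this example.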

Summarization in Flink

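Input vertices 𝒱 (VID, City):
(0, L), (1, L), (2, D), (3, D), (4, D), (5, B)

Input edges ℰ (EID, S, T):
(0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3), (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3)

Step 1: groupBy(City) on 𝒱
L [0,1]
D [2,3,4]
B [5]

Step 2: reduceGroup + filter + map
Summarized vertices 𝒱′ (VID, City, Count):
(0, L, 2), (2, D, 3), (5, B, 1)
Vertex to group representative (VID, Rep):
(0, 0), (1, 0), (2, 2), (3, 2), (4, 2), (5, 5)

Step 3: join(VID==S), the source of each edge is replaced by its representative (ID, S, T):
(0, 0, 1), (1, 0, 0), (2, 0, 2), (3, 2, 1), (4, 2, 3), (5, 2, 2), (6, 2, 0), (7, 2, 1), (8, 5, 2), (9, 5, 3)

Step 4: join(VID==T), the target of each edge is replaced by its representative (ID, S, T):
(0, 0, 0), (1, 0, 0), (2, 0, 2), (3, 2, 0), (4, 2, 2), (5, 2, 2), (6, 2, 0), (7, 2, 0), (8, 5, 2), (9, 5, 2)

Step 5: groupBy(S,T)
0,0 [0,1]
0,2 [2]
2,0 [3,6,7]
2,2 [4,5]
5,2 [8,9]

Step 6: reduceGroup + filter + map
Summarized edges ℰ′ (EID, S, T, Count):
(0, 0, 0, 2), (2, 0, 2, 1), (3, 2, 0, 3), (4, 2, 2, 2), (8, 5, 2, 2)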
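The dataflow above can be traced with a plain Python sketch that uses the same input vertices and edges. It runs on in-memory lists rather than Flink DataSets, and it assumes that the smallest id of each group acts as the group representative, which is what the example tables suggest. The function name `summarize` and its return shape are hypothetical.

```python
from collections import defaultdict

# Input data from the slide: (VID, City) and (EID, S, T).
vertices = [(0, "L"), (1, "L"), (2, "D"), (3, "D"), (4, "D"), (5, "B")]
edges = [(0, 0, 1), (1, 1, 0), (2, 1, 2), (3, 2, 1), (4, 2, 3),
         (5, 3, 2), (6, 4, 0), (7, 4, 1), (8, 5, 2), (9, 5, 3)]


def summarize(vertices, edges):
    # groupBy(City): collect the vertex ids of every city.
    vertex_groups = defaultdict(list)
    for vid, city in vertices:
        vertex_groups[city].append(vid)

    # reduceGroup: one summarized vertex per group, plus a
    # vertex -> representative mapping (assumption: smallest id wins).
    summarized_vertices = []
    representative = {}
    for city, vids in vertex_groups.items():
        rep = min(vids)
        summarized_vertices.append((rep, city, len(vids)))
        for vid in vids:
            representative[vid] = rep

    # join(VID==S) and join(VID==T): rewrite both endpoints of each edge.
    rewritten = [(eid, representative[s], representative[t])
                 for eid, s, t in edges]

    # groupBy(S,T): collect the edge ids between each pair of representatives.
    edge_groups = defaultdict(list)
    for eid, s, t in rewritten:
        edge_groups[(s, t)].append(eid)

    # reduceGroup: one summarized edge per (S, T) group with its count.
    summarized_edges = [(min(eids), s, t, len(eids))
                        for (s, t), eids in edge_groups.items()]
    return sorted(summarized_vertices), sorted(summarized_edges)


sum_vertices, sum_edges = summarize(vertices, edges)
```

In Flink the two endpoint rewrites are joins between the edge DataSet and the representative mapping; the dictionary lookup here plays the same role on a single machine.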

Use Case: Graph Business Intelligence

Business intelligence is usually based on relational data warehouses
- Enterprise data is integrated within a dimensional schema
- Analysis is limited to predefined relationships
- No support for relationship-oriented data mining

[Figure: dimensional schema with a Facts table and the dimensions Dim 1, Dim 2, Dim 3]

Graph-based approach
- Integrate data sources within an instance graph by preserving original relationships between data objects (transactional and master data)
- Determine subgraphs (business transaction graphs) related to business activities
- Analyze subgraphs or entire graphs with aggregation queries, mining relationship patterns, etc.

Prerequisites: Data Integration

Business Transaction Graphs

[Figure: integrated instance graph with data from the sources labeled CIT and ERP. Vertices: Employee (Name: Dave), Employee (Name: Alice), Employee (Name: Bob), Employee (Name: Carol), Ticket (Expense: 500), SalesQuotation, SalesOrder, two PurchaseOrder vertices, SalesRevenue (Revenue: 5,000), PurchaseInvoice (Expense: 2,000), PurchaseInvoice (Expense: 1,500). Edge labels: sentBy, createdBy, processedBy, openedFor, basedOn, serves, bills.]

(1) BTG Extraction

[Figure: collection of extracted business transaction graphs BTG 1, BTG 2, BTG 3, BTG 4, BTG 5, …, BTG n]

// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )

(2) Profit Aggregation

BTG      ∑ Revenue   ∑ Expenses   Net Profit
BTG 1    5,000       -3,000       2,000
BTG 2    9,000       -3,000       6,000
BTG 3    2,000       -1,500       500
BTG 4    5,000       -7,000       -2,000
BTG 5    10,000      -15,000      -5,000
…        …           …            …
BTG n    8,000       -4,000       4,000

// generate base collection
btgs = iig.callForCollection( :BusinessTransactionGraphs , {} )
// define profit aggregate function
aggFunc = ( Graph g => g.V.values("Revenue").sum() - g.V.values("Expense").sum() )
// apply aggregate function and store result at new property
btgs = btgs.apply( Graph g => g.aggregate( "Profit" , aggFunc ) )
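The aggregation step can be sketched in Python on toy data. The dictionary layout and the helper names `values`, `profit`, `aggregate` and `apply` are assumptions for illustration. The first graph reuses the Revenue and Expense values from the example figure; the second graph is made up.

```python
def values(graph, key):
    # Collect the values of one property over all vertices that carry it.
    return [v[key] for v in graph["V"] if key in v]


def profit(graph):
    # Profit aggregate function: sum of revenues minus sum of expenses.
    return sum(values(graph, "Revenue")) - sum(values(graph, "Expense"))


def aggregate(graph, property_key, agg_func):
    # Return a copy of the graph with the aggregate stored as a new property.
    result = dict(graph)
    result[property_key] = agg_func(graph)
    return result


def apply(collection, unary_operator):
    # Apply a unary graph operator to every graph in the collection.
    return [unary_operator(g) for g in collection]


btg_a = {"V": [
    {"label": "SalesRevenue", "Revenue": 5000},
    {"label": "Ticket", "Expense": 500},
    {"label": "PurchaseInvoice", "Expense": 2000},
    {"label": "PurchaseInvoice", "Expense": 1500},
    {"label": "SalesOrder"},
]}
btg_b = {"V": [
    {"label": "SalesRevenue", "Revenue": 2000},
    {"label": "PurchaseInvoice", "Expense": 3500},
]}

btgs = apply([btg_a, btg_b], lambda g: aggregate(g, "Profit", profit))
```

`aggregate` returns a new dictionary instead of modifying its input, so the original graphs stay unchanged after the `apply` call.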

(3) BTG Clustering

// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)

(4) Cluster Characteristic Patterns

[Figure: the BTG table with ∑ Revenue, ∑ Expenses and Net Profit per graph, annotated with example patterns over the vertices Ticket, Alice, Bob and PurchaseOrder and the edge labels processedBy and createdBy]

// select profit and loss clusters
profitBtgs = btgs.select( Graph g => g["Profit"] >= 0 )
lossBtgs = btgs.difference(profitBtgs)
// apply magic
profitFreqPats = profitBtgs.callForCollection( :FrequentSubgraphs , {"Threshold":0.7} )
lossFreqPats = lossBtgs.callForCollection( :FrequentSubgraphs , {"Threshold":0.7} )
// determine cluster characteristic patterns
trivialPats = profitFreqPats.intersect(lossFreqPats)
profitCharPatterns = profitFreqPats.difference(trivialPats)
lossCharPatterns = lossFreqPats.difference(trivialPats)
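The set logic of this workflow can be sketched in Python. The sketch does not run a frequent subgraph miner: as a stand-in, every toy graph already lists its patterns as (source label, edge label, target label) triples, and `frequent_patterns` only counts in how many graphs each triple occurs. All data, ids and helper names are made up for illustration.

```python
def select(btgs, predicate):
    return [g for g in btgs if predicate(g)]


def difference(btgs, other):
    other_ids = {g["id"] for g in other}
    return [g for g in btgs if g["id"] not in other_ids]


def frequent_patterns(btgs, threshold):
    # Stand-in for :FrequentSubgraphs: keep the patterns that occur in at
    # least `threshold` (a fraction) of the graphs in the cluster.
    counts = {}
    for g in btgs:
        for pattern in g["patterns"]:
            counts[pattern] = counts.get(pattern, 0) + 1
    return {p for p, c in counts.items() if c / len(btgs) >= threshold}


COMMON = ("SalesQuotation", "sentBy", "Employee")
IN_PROFIT = ("PurchaseOrder", "processedBy", "Employee")
IN_LOSS = ("Ticket", "processedBy", "Employee")

btgs = [
    {"id": 1, "Profit": 2000, "patterns": {COMMON, IN_PROFIT}},
    {"id": 2, "Profit": 6000, "patterns": {COMMON, IN_PROFIT}},
    {"id": 3, "Profit": 500, "patterns": {COMMON}},
    {"id": 4, "Profit": -2000, "patterns": {COMMON, IN_LOSS}},
    {"id": 5, "Profit": -5000, "patterns": {COMMON, IN_LOSS}},
]

# select profit and loss clusters
profit_btgs = select(btgs, lambda g: g["Profit"] >= 0)
loss_btgs = difference(btgs, profit_btgs)

# frequent patterns per cluster (threshold 0.7 as in the slide)
profit_freq = frequent_patterns(profit_btgs, 0.7)
loss_freq = frequent_patterns(loss_btgs, 0.7)

# patterns frequent in both clusters are trivial; the rest is characteristic
trivial = profit_freq & loss_freq
profit_char = profit_freq - trivial
loss_char = loss_freq - trivial
```

With this data `IN_PROFIT` occurs in two of three profit graphs, which is below the 0.7 threshold, so only the common pattern is frequent in the profit cluster and no profit-characteristic pattern remains.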

Current State & Future Work

Current State

0.0.1 First Prototype (May 2015)
- Hadoop MapReduce and Giraph for operator implementations
- Too much complexity
- Performance loss through serialization in HDFS/HBase

0.0.2 Using Flink as execution layer (June 2015)
- Basic operators

Currently 0.0.3-SNAPSHOT
- Performance improvements
- More operator implementations

Operator implementations (0.0.3-SNAPSHOT)

Unary: Pattern Matching, Aggregation, Projection, Summarization
Binary: Combination, Overlap, Exclusion, Isomorphism
Collection: Selection, Distinct, Sort by, Top, Union, Intersection, Difference
Auxiliary: Apply, Reduce, Call
Algorithms: LabelPropagation, BTG Extraction, FSM

Future Work

Operator integration into Gelly
- Summarization (FLINK-2411)
- Graph Sampling
- …

Graph Operations on streams (Flink)
Graph Partitioning (maybe together with the Gelly people)
Graph Versioning (Storage)
Benchmarking
GrALa Interpreter / Web UI

Benchmarks Sneak Preview

[Figure: Summarization (Vertex and Edge Labels). x-axis: # Worker (1, 2, 4, 8, 16); y-axis: Time [s] (0 to 1400)]

Cluster: 16x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz (12 Cores), 48 GB RAM
Software: Hadoop 2.5.2, Flink 0.9.0

Configuration
slots (per node)       12
jobmanager.heap.mb     2048
taskmanager.heap.mb    40960

Dataset: Foodbroker Graph (https://github.com/dbs-leipzig/foodbroker)
- Generates BI process data
- 858,624,267 Vertices, 4,406,445,007 Edges, 663GB Payload

Web UI Sneak Preview

Contributions welcome

Code
- Operator implementations
- Performance Tuning
- Storage layout

Data! and Use Cases
- We are researchers, we assume ...
- Getting real data (especially BI data) is nearly impossible

People
- Bachelor / Master / PhD Thesis

Thank you for building Flink!

www.gradoop.com

https://github.com/dbs-leipzig/gradoop

http://dbs.uni-leipzig.de/file/GradoopTR.pdf

http://dbs.uni-leipzig.de/file/biiig-vldb2014.pdf