# Apache Flink & Graph Processing

Posted: 08-Jan-2017

Category: Data & Analytics


### Transcript of Apache Flink & Graph Processing

Batch & Stream Graph Processing with Apache Flink

Vasia Kalavri

@vkalavri

Apache Flink Meetup London, October 5th, 2016

Graphs capture relationships between data items

connections, interactions, purchases, dependencies, friendships, etc.

Recommenders

Social networks

Bioinformatics

Web search

Outline

Distributed Graph Processing 101

Gelly: Batch Graph Processing with Apache Flink

BREAK!

Gelly-Stream: Continuous Graph Processing with Apache Flink

Apache Flink: an open-source, distributed data analysis framework

True streaming at its core

Streaming & Batch API

Event logs (Kafka, RabbitMQ, ...): low latency, windowing, aggregations, ...

Historic data (HDFS, JDBC, ...): ETL, graphs, machine learning, relational, ...

WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

MY GRAPH IS SO BIG, IT DOESN'T FIT IN A SINGLE MACHINE

Big Data Ninja

MISCONCEPTION #1

A SOCIAL NETWORK

NAIVE WHO(M)-TO-FOLLOW

Naive Who(m) to Follow:

compute a friends-of-friends list per user

exclude existing friends

rank by common connections

DON'T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!
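The three steps above can be sketched on a single machine as follows. This is a minimal illustration, not code from the talk: the class name `NaiveWhoToFollow`, the method `recommend`, and the sample friendships are all hypothetical. The nested loop makes the intermediate-data point visible: every (friend, friend-of-friend) pair is generated before anything is filtered or ranked.

```java
import java.util.*;

// Hypothetical single-machine sketch of the naive who(m)-to-follow steps.
final class NaiveWhoToFollow {

    /** Returns candidate -> number of common connections for one user, best candidates first. */
    static Map<Integer, Integer> recommend(Map<Integer, Set<Integer>> friends, int user) {
        Set<Integer> direct = friends.getOrDefault(user, Set.of());
        Map<Integer, Integer> common = new HashMap<>();
        // Step 1: friends-of-friends. Each visited pair is intermediate data,
        // so the work grows with the sum of the friends' degrees, not with the input size.
        for (int friend : direct) {
            for (int fof : friends.getOrDefault(friend, Set.of())) {
                // Step 2: exclude the user and existing friends.
                if (fof != user && !direct.contains(fof)) {
                    common.merge(fof, 1, Integer::sum);
                }
            }
        }
        // Step 3: rank by number of common connections (ties broken by smaller id).
        List<Map.Entry<Integer, Integer>> ranked = new ArrayList<>(common.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : Integer.compare(a.getKey(), b.getKey());
        });
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : ranked) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up undirected friendships, stored in both directions.
        Map<Integer, Set<Integer>> friends = new HashMap<>();
        friends.put(1, Set.of(2, 3));
        friends.put(2, Set.of(1, 3, 4));
        friends.put(3, Set.of(1, 2, 4, 5));
        friends.put(4, Set.of(2, 3));
        friends.put(5, Set.of(3));
        System.out.println(recommend(friends, 1)); // prints {4=2, 5=1}
    }
}
```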

DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE

Data Science Rockstar

MISCONCEPTION #2

GRAPHS DON'T APPEAR OUT OF THIN AIR

Expectation

Reality!

WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

When you do have really big graphs

When the intermediate data is big

When your data is already distributed

When you want to build end-to-end graph pipelines

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

2004: MapReduce

2009: Pegasus

2010: Pregel, Signal-Collect

2012: PowerGraph

2013: Giraph++

2014: NScale

2015: Arabesque

Tinkerpop

Workloads: iterative value propagation, graph traversals, ego-network analysis, pattern matching

PREGEL: THINK LIKE A VERTEX

[Figure: example graph with vertices 1 to 5, stored as per-vertex adjacency lists: 1 -> 3, 4; 2 -> 1, 4; 5 -> 3; ...]

PREGEL: SUPERSTEPS

(Vi+1, outbox)

PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

| VertexID | Out-degree | Transition Probability |
|---|---|---|
| 1 | 2 | 1/2 |
| 2 | 2 | 1/2 |
| 3 | 0 | - |
| 4 | 3 | 1/3 |
| 5 | 1 | 1 |


PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
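The sum for vertex 3 is one instance of a general rule: each vertex adds up the rank of every in-neighbor, weighted by that neighbor's transition probability. A sketch in LaTeX, where the notation outdeg(u) for the out-degree column and E for the edge set is an assumption made here, not notation from the slides:

```latex
\mathrm{PR}(v) \;=\; \sum_{(u,v)\in E} \frac{\mathrm{PR}(u)}{\mathrm{outdeg}(u)}
\qquad\Longrightarrow\qquad
\mathrm{PR}(3) \;=\; \tfrac{1}{2}\,\mathrm{PR}(1) + \tfrac{1}{3}\,\mathrm{PR}(4) + 1\cdot\mathrm{PR}(5)
```

Vertices 1, 4, and 5 are exactly the vertices with an edge into 3, and their coefficients are their rows in the table.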


PREGEL EXAMPLE: PAGERANK

    void compute(messages):
      sum = 0.0
      for (m
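A vertex-centric PageRank can be simulated in plain Java to show how supersteps, inboxes, and outboxes fit together. This is a hedged sketch rather than the talk's code or any framework's API: the class `PregelPageRank` and its `run` method are hypothetical, the initial rank of 1.0 is an assumption, and the update `0.15 + 0.85*sum` is borrowed from the GSA example in this deck.

```java
import java.util.*;

// Hypothetical single-machine simulation of Pregel-style supersteps.
final class PregelPageRank {

    /**
     * outEdges maps every vertex id to its (possibly empty) list of targets;
     * every target must also appear as a key. Superstep 0 only sends the
     * initial ranks, later supersteps update the rank and send again.
     */
    static Map<Integer, Double> run(Map<Integer, List<Integer>> outEdges, int supersteps) {
        Map<Integer, Double> rank = new HashMap<>();
        Map<Integer, List<Double>> inbox = new HashMap<>();
        for (int v : outEdges.keySet()) {
            rank.put(v, 1.0);
            inbox.put(v, new ArrayList<>());
        }
        for (int step = 0; step < supersteps; step++) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (int v : outEdges.keySet()) {
                outbox.put(v, new ArrayList<>());
            }
            for (int v : outEdges.keySet()) {
                // compute(): sees only this vertex's own state and its inbox.
                if (step > 0) {
                    double sum = 0.0;
                    for (double m : inbox.get(v)) {
                        sum += m;
                    }
                    rank.put(v, 0.15 + 0.85 * sum);
                }
                List<Integer> targets = outEdges.get(v);
                for (int t : targets) {
                    outbox.get(t).add(rank.get(v) / targets.size());
                }
            }
            // Synchronization barrier: messages are delivered in the next superstep.
            inbox = outbox;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> cycle = new HashMap<>();
        cycle.put(1, List.of(2));
        cycle.put(2, List.of(3));
        cycle.put(3, List.of(1));
        System.out.println(run(cycle, 10));
    }
}
```

On a directed 3-cycle every vertex receives exactly one message carrying a full rank, so all ranks stay at 1.0, which makes a convenient sanity check.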

SIGNAL-COLLECT

outbox

SIGNAL-COLLECT EXAMPLE: PAGERANK

    void signal():
      for (edge
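The split between signalling along edges and collecting at vertices can be sketched in plain Java. The names below (`SignalCollectPageRank`, `signal`, `collect`, `run`) are hypothetical and this is not the Signal-Collect framework's API; the update `0.15 + 0.85*sum` and the initial rank of 1.0 are assumptions taken from the GSA example in this deck.

```java
import java.util.*;

// Hypothetical sketch: PageRank split into a per-edge signal and a per-vertex collect.
final class SignalCollectPageRank {

    /** signal(): computed per out-edge, from the source vertex state only. */
    static double signal(double sourceRank, int sourceOutDegree) {
        return sourceRank / sourceOutDegree;
    }

    /** collect(): computes the new vertex state from the received signals only. */
    static double collect(List<Double> signals) {
        double sum = 0.0;
        for (double s : signals) {
            sum += s;
        }
        return 0.15 + 0.85 * sum;
    }

    /** outEdges maps every vertex id to its targets; every target must also be a key. */
    static Map<Integer, Double> run(Map<Integer, List<Integer>> outEdges, int iterations) {
        Map<Integer, Double> rank = new HashMap<>();
        for (int v : outEdges.keySet()) {
            rank.put(v, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            Map<Integer, List<Double>> signals = new HashMap<>();
            for (int v : outEdges.keySet()) {
                signals.put(v, new ArrayList<>());
            }
            // Signal phase: one signal per edge.
            for (Map.Entry<Integer, List<Integer>> e : outEdges.entrySet()) {
                for (int target : e.getValue()) {
                    signals.get(target).add(signal(rank.get(e.getKey()), e.getValue().size()));
                }
            }
            // Collect phase: one state update per vertex.
            for (int v : outEdges.keySet()) {
                rank.put(v, collect(signals.get(v)));
            }
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> g = new HashMap<>();
        g.put(1, List.of(2));
        g.put(2, List.of(3));
        g.put(3, List.of(2));
        System.out.println(run(g, 10));
    }
}
```

In the sample graph vertex 1 has no incoming edges, so it collects an empty signal list and settles at 0.15 after the first iteration.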

GATHER-SUM-APPLY (POWERGRAPH)

[Figure: each superstep runs three phases in order: Gather, Sum, Apply. The Gather phase of superstep i+1 follows the Apply phase of superstep i.]

GSA EXAMPLE: PAGERANK

    // compute partial rank
    double gather(source, edge, target):
      return target.value() / target.numEdges()

    // combine partial ranks
    double sum(rank1, rank2):
      return rank1 + rank2

    // update rank
    double apply(sum, currentRank):
      return 0.15 + 0.85*sum
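The same three functions can be wired into a small driver loop to see how gather, sum, and apply interact. A minimal sketch under stated assumptions: `GsaPageRank` and `run` are hypothetical names, the initial rank of 1.0 is an assumption, and the neighbor passed to `gather` is taken to be the vertex at the other end of an incoming edge. Because `sum` is associative and commutative, partial ranks are combined pairwise and no message list is ever materialized.

```java
import java.util.*;

// Hypothetical single-machine driver for the gather / sum / apply functions.
final class GsaPageRank {

    /** Gather: partial rank contributed by one neighbor over one edge. */
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    /** Sum: combines two partial ranks; must be associative and commutative. */
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    /** Apply: updates the vertex value from the combined partial ranks. */
    static double apply(double sum, double currentRank) {
        return 0.15 + 0.85 * sum;
    }

    /** outEdges maps every vertex id to its (possibly empty) list of targets. */
    static Map<Integer, Double> run(Map<Integer, List<Integer>> outEdges, int iterations) {
        Map<Integer, Double> rank = new HashMap<>();
        for (int v : outEdges.keySet()) {
            rank.put(v, 1.0);
        }
        for (int i = 0; i < iterations; i++) {
            // Gather + Sum: fold each edge's contribution into one value per target.
            Map<Integer, Double> partial = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : outEdges.entrySet()) {
                for (int target : e.getValue()) {
                    double g = gather(rank.get(e.getKey()), e.getValue().size());
                    partial.merge(target, g, GsaPageRank::sum);
                }
            }
            // Apply: vertices without incoming edges see a sum of 0.0.
            Map<Integer, Double> next = new HashMap<>();
            for (int v : outEdges.keySet()) {
                next.put(v, apply(partial.getOrDefault(v, 0.0), rank.get(v)));
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> cycle = new HashMap<>();
        cycle.put(1, List.of(2));
        cycle.put(2, List.of(3));
        cycle.put(3, List.of(1));
        System.out.println(run(cycle, 10));
    }
}
```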

PREGEL VS. SIGNAL-COLLECT VS. GSA

| | Update Function Properties | Update Function Logic | Communication Scope | Communication Logic |
|---|---|---|---|---|
| Pregel | arbitrary | arbitrary | any vertex | arbitrary |
| Signal-Collect | arbitrary | based on received messages | any vertex | based on vertex state |
| GSA | associative & commutative | based on neighbors' values | neighborhood | based on vertex state |

CAN WE HAVE IT ALL?

Data pipeline integration: built on top of an efficient distributed processing engine

Graph ETL: high-level API with abstractions and methods to transform graphs

Familiar programming model: support popular programming abstractions

Gelly: the Apache Flink Graph API

[Figure: Apache Flink stack. Libraries and compatibility layers: Gelly, Table/SQL, ML, SAMOA, Hadoop M/R, Dataflow, Cascading, CEP. APIs: DataSet (Java/Scala), DataStream (Java/Scala). Core: streaming dataflow runtime. Deployment: Local, Remote, Yarn, Embedded.]

Meet Gelly

Java & Scala Graph APIs on top of Flink's DataSet API

Flink Core

Scala API (batch and streaming)

Java API (batch and streaming)

FlinkML, Gelly, Table API, ...

Transformations and Utilities

Iterative Graph Processing

Graph Library


Gelly is NOT

a graph database

a specialized graph processor

Hello, Gelly!

Java

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet edges = getEdgesDataSet(env);
    Graph graph = Graph.fromDataSet(edges, env);
    DataSet verticesWithMinIds = graph.run(
        new ConnectedComponents(maxIterations));

Scala

    val env = ExecutionEnvironment.getExecutionEnvironment
    val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
    val graph = Graph.fromDataSet(edges, env)
    val components = graph.run(new ConnectedComponents(maxIterations))

Graph Methods

Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees

Mutations: add vertex/edge, remove vertex/edge

Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets

Generators: R-Mat (power-law), Grid, Star, Complete

Example: mapVertices

    // increment each vertex value by one
    val graph = Graph.fromDataSet(...)
    val updatedGraph = graph.mapVertices(v => v.getValue + 1)


Example: subGraph

    val graph: Graph[Long, Long, Long] = ...
    // keep only vertices with positive values
    // and only edges with negative values
    val subGraph = graph.subgraph(
      vertex => vertex.getValue > 0,
      edge => edge.getValue < 0
    )

Neighborhood Methods

Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

    graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)
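To make the semantics concrete, a neighborhood reduce over out-neighbors can be sketched without Flink. This is an illustration of the idea only, not Gelly's implementation: `NeighborhoodReduce` and `reduceOnOutNeighbors` are hypothetical names, and the reduce function is passed as a plain `BinaryOperator` (here `Long::min`, playing the role of `MinValue`).

```java
import java.util.*;
import java.util.function.BinaryOperator;

// Hypothetical sketch of reducing the values of each vertex's out-neighbors.
final class NeighborhoodReduce {

    /**
     * For each vertex with at least one out-neighbor, folds the neighbors'
     * values with the given reduce function. Vertices without out-neighbors
     * produce no result. Every neighbor must have an entry in values.
     */
    static Map<Integer, Long> reduceOnOutNeighbors(Map<Integer, List<Integer>> outEdges,
                                                   Map<Integer, Long> values,
                                                   BinaryOperator<Long> reduce) {
        Map<Integer, Long> result = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : outEdges.entrySet()) {
            for (int neighbor : e.getValue()) {
                // merge() stores the first value and reduces every later one into it.
                result.merge(e.getKey(), values.get(neighbor), reduce);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> outEdges = new HashMap<>();
        outEdges.put(1, List.of(2, 3));
        outEdges.put(2, List.of(3));
        outEdges.put(3, List.of());
        Map<Integer, Long> values = new HashMap<>();
        values.put(1, 10L);
        values.put(2, 7L);
        values.put(3, 4L);
        System.out.println(reduceOnOutNeighbors(outEdges, values, Long::min));
    }
}
```

Because the reduce function only ever sees two values at a time, it can be applied in any grouping, which is what allows the per-vertex reductions to run in parallel.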

What makes Gelly unique?

Batch graph processing on top of a streaming dataflow engine

Built for end-to-end analytics

Support for multiple iteration abstractions

Graph algorithm building blocks

A large open-source library of graph algorithms

Why streaming dataflow?

Batch engines materialize data even if they don't have to: the graph is always loaded and materialized in memory, even when that is not needed (e.g., for mapping, filtering, or transformation)

Communication and computation overlap

We can do continuous graph processing (more after the break!)

End-to-end analytics

Graphs don't appear out of thin air

We need to support pre- and post-processing

Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program

Iterative Graph Processing

Gelly offers iterative graph processing abstractions on top of Flink's delta iterations

vertex-centric

scatter-gather

gather-sum-apply

partition-centric*

Flink Iteration Operators

[Figure: bulk iteration (Input, Iterative Update Function, Result, Replace) and delta iteration (Workset, Iterative Update Function, Result, Solution Set holding the State)]

Optimization

the runtime is aware of the iterative execution

no scheduling overhead between iterations

caching and state maintenance are handled automatically

Push work out of the loop

Maintain state as index

Cache loop-invariant data
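The workset and solution set idea can be illustrated with a connected-components computation that propagates minimum vertex ids. This is a single-machine sketch of the concept, not Flink's iteration operator: `DeltaIterationComponents` and `run` are hypothetical names. The solution set is kept as an index from vertex to value, and each round only processes the vertices that changed in the previous one.

```java
import java.util.*;

// Hypothetical sketch of a delta iteration: min-id propagation for connected components.
final class DeltaIterationComponents {

    /** neighbors must be symmetric (undirected graph) and contain every vertex as a key. */
    static Map<Integer, Integer> run(Map<Integer, Set<Integer>> neighbors) {
        // Solution set: current component id per vertex, indexed by vertex id.
        Map<Integer, Integer> solution = new HashMap<>();
        for (int v : neighbors.keySet()) {
            solution.put(v, v);
        }
        // Workset: vertices whose value changed in the previous iteration.
        Set<Integer> workset = new HashSet<>(neighbors.keySet());
        while (!workset.isEmpty()) {
            // Only changed vertices propose their id to their neighbors.
            Map<Integer, Integer> candidates = new HashMap<>();
            for (int v : workset) {
                for (int n : neighbors.get(v)) {
                    candidates.merge(n, solution.get(v), Integer::min);
                }
            }
            // Update the solution set in place; improved vertices form the next workset.
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, Integer> c : candidates.entrySet()) {
                if (c.getValue() < solution.get(c.getKey())) {
                    solution.put(c.getKey(), c.getValue());
                    next.add(c.getKey());
                }
            }
            workset = next;
        }
        return solution;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> neighbors = new HashMap<>();
        neighbors.put(1, Set.of(2));
        neighbors.put(2, Set.of(1, 3));
        neighbors.put(3, Set.of(2));
        neighbors.put(4, Set.of(5));
        neighbors.put(5, Set.of(4));
        System.out.println(run(neighbors));
    }
}
```

The loop ends when the workset is empty, so vertices that converged early cost nothing in later rounds; a bulk iteration would recompute every vertex each time.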

Vertex-Centric SSSP

    final class SSSPComputeFunction extends ComputeFunction {

      override def compute(vertex: Vertex, messages: MessageIterator) = {

        var minDistance = if (vertex.getId == srcId) 0 else Double.MaxValue

        while (messages.hasNext) {
          val msg = messages.next
          if (msg < minDistance) minDistance = msg
        }

        if (vertex.getValue > minDistance) {
          setNewVertexValue(minDistance)
          for (edge: Edge
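The vertex-centric SSSP logic can be run end to end in a small plain-Java simulation. This is a sketch under stated assumptions, not Gelly code: `VertexCentricSssp` and `run` are hypothetical names, edge weights are assumed non-negative, and the source is seeded with a 0.0 message instead of the `srcId` check inside the compute function. Only vertices that received a message are active in a superstep.

```java
import java.util.*;

// Hypothetical single-machine simulation of vertex-centric single-source shortest paths.
final class VertexCentricSssp {

    /**
     * edges.get(u) maps each target of u to the edge weight; every vertex,
     * including those without out-edges, must be a key. Weights must be >= 0.
     */
    static Map<Integer, Double> run(Map<Integer, Map<Integer, Double>> edges, int srcId) {
        Map<Integer, Double> distance = new HashMap<>();
        for (int v : edges.keySet()) {
            distance.put(v, Double.POSITIVE_INFINITY);
        }
        // Messages addressed to each vertex for the current superstep.
        Map<Integer, List<Double>> inbox = new HashMap<>();
        inbox.put(srcId, List.of(0.0));
        while (!inbox.isEmpty()) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Double>> e : inbox.entrySet()) {
                int v = e.getKey();
                // compute(): take the smallest candidate distance received.
                double minDistance = Double.POSITIVE_INFINITY;
                for (double msg : e.getValue()) {
                    minDistance = Math.min(minDistance, msg);
                }
                // Only an improved distance is stored and propagated.
                if (distance.get(v) > minDistance) {
                    distance.put(v, minDistance);
                    for (Map.Entry<Integer, Double> edge : edges.get(v).entrySet()) {
                        outbox.computeIfAbsent(edge.getKey(), k -> new ArrayList<>())
                              .add(minDistance + edge.getValue());
                    }
                }
            }
            // Barrier: the computation halts once no messages are in flight.
            inbox = outbox;
        }
        return distance;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> edges = new HashMap<>();
        edges.put(1, Map.of(2, 1.0, 3, 4.0));
        edges.put(2, Map.of(3, 2.0));
        edges.put(3, Map.of());
        edges.put(4, Map.of(1, 1.0));
        System.out.println(run(edges, 1));
    }
}
```

Vertices that cannot be reached from the source never receive a message and keep an infinite distance.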

Algorithms building blocks

Allow operator re-use across graph algorithms

when processing the same input wit