Apache Flink & Graph Processing

Batch & Stream Graph Processing with Apache Flink

Vasia Kalavri vasia@apache.org

@vkalavri

Apache Flink Meetup London October 5th, 2016

Graphs capture relationships between data items

connections, interactions, purchases, dependencies, friendships, etc.

Recommenders

Social networks

Bioinformatics

Web search

Outline

• Distributed Graph Processing 101

• Gelly: Batch Graph Processing with Apache Flink

• BREAK!

• Gelly-Stream: Continuous Graph Processing with Apache Flink

Apache Flink

• An open-source, distributed data analysis framework

• True streaming at its core

• Streaming & Batch API

Event logs: Kafka, RabbitMQ, ...

Historic data: HDFS, JDBC, ...

Streaming: low latency, windowing, aggregations, ...

Batch: ETL, graphs, machine learning, relational, …

WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE

Big Data Ninja

MISCONCEPTION #1

A SOCIAL NETWORK

NAIVE WHO(M)-TO-FOLLOW

▸Naive Who(m) to Follow:

▸ compute a friends-of-friends list per user

▸ exclude existing friends

▸ rank by common connections
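The three steps above can be sketched in plain Java on a single machine. This is only an illustration: the class and method names (`NaiveWhoToFollow`, `addFriendship`, `recommend`) are hypothetical, friendships are assumed to be undirected, and ties are broken by user ID.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class for illustration; not part of Flink or Gelly.
final class NaiveWhoToFollow {

    // Records an undirected friendship between users a and b.
    static void addFriendship(Map<Long, Set<Long>> friends, long a, long b) {
        friends.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        friends.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Returns suggestions for `user`, most common connections first.
    static List<Long> recommend(Map<Long, Set<Long>> friends, long user) {
        Set<Long> direct = friends.getOrDefault(user, Collections.<Long>emptySet());
        Map<Long, Integer> common = new HashMap<>();
        // Step 1: walk the friends-of-friends list.
        for (long friend : direct) {
            for (long candidate : friends.getOrDefault(friend, Collections.<Long>emptySet())) {
                // Step 2: exclude the user and existing friends.
                if (candidate != user && !direct.contains(candidate)) {
                    common.merge(candidate, 1, Integer::sum);
                }
            }
        }
        // Step 3: rank by number of common connections (ties by user ID).
        List<Long> ranked = new ArrayList<>(common.keySet());
        ranked.sort((x, y) -> {
            int byCount = Integer.compare(common.get(y), common.get(x));
            return byCount != 0 ? byCount : Long.compare(x, y);
        });
        return ranked;
    }
}
```

Note that the inner loop visits every two-hop path, so the intermediate candidate list can be far larger than the input graph.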

DON’T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!

DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE

Data Science Rockstar

MISCONCEPTION #2

GRAPHS DON’T APPEAR OUT OF THIN AIR

Expectation…

GRAPHS DON’T APPEAR OUT OF THIN AIR

Reality!

WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

▸When you do have really big graphs

▸When the intermediate data is big

▸When your data is already distributed

▸When you want to build end-to-end graph pipelines

HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

MapReduce

Pegasus

Pregel

Signal-Collect

PowerGraph

Iterative value propagation

Giraph++

Graph Traversals

NScale

Ego-network analysis

Arabesque

Pattern Matching

Tinkerpop

PREGEL: THINK LIKE A VERTEX

PREGEL: SUPERSTEPS

(Vi+1, outbox) <- compute(Vi, inbox)

[Figure: vertex state and message boxes in Superstep i and Superstep i+1]

PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

VertexID | Out-degree | Transition Probability
1        | 2          | 1/2
2        | 2          | 1/2
4        | 3          | 1/3

PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)

PREGEL EXAMPLE: PAGERANK

void compute(messages):
    // sum up received messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)

    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for
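The compute function above can be simulated on a single machine to see how supersteps interact. The following is a minimal sketch, not Pregel or Gelly code: the class `PregelPageRankSketch` is hypothetical, vertices are assumed to be numbered 0..n-1, ranks start at 1/n, and each inbox holds the already-summed messages.

```java
import java.util.Arrays;

// Hypothetical single-machine simulation of vertex-centric PageRank.
final class PregelPageRankSketch {

    // outEdges[v] lists the targets of v's out-edges.
    static double[] run(int[][] outEdges, int supersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        // Initial superstep: every vertex only distributes its starting rank.
        double[] inbox = sendMessages(outEdges, rank);
        for (int s = 0; s < supersteps; s++) {
            // Update each vertex rank from the sum of its received messages.
            for (int v = 0; v < n; v++) {
                rank[v] = 0.15 / n + 0.85 * inbox[v];
            }
            // Messages sent now are delivered in the next superstep.
            inbox = sendMessages(outEdges, rank);
        }
        return rank;
    }

    // Each vertex sends rank / out-degree along every out-edge;
    // the inbox stores the per-target sum of those messages.
    static double[] sendMessages(int[][] outEdges, double[] rank) {
        double[] inbox = new double[rank.length];
        for (int v = 0; v < outEdges.length; v++) {
            for (int target : outEdges[v]) {
                inbox[target] += rank[v] / outEdges[v].length;
            }
        }
        return inbox;
    }
}
```

On a graph where every vertex has at least one out-edge, the ranks keep summing to 1 across supersteps; vertices without out-edges would leak rank in this sketch.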

SIGNAL-COLLECT

Signal: outbox <- signal(Vi)

Collect: Vi+1 <- collect(inbox)

[Figure: the Signal and Collect phases of Superstep i, followed by Superstep i+1]

SIGNAL-COLLECT EXAMPLE: PAGERANK

void signal():
    // distribute rank to neighbors
    for (edge <- getOutEdges()) do
        sendMessageTo(edge.target(), getValue()/numEdges)
    end for

void collect(messages):
    // sum up messages
    sum = 0.0
    for (m <- messages) do
        sum = sum + m
    end for

    // update vertex rank
    setValue(0.15/numVertices() + 0.85*sum)

GATHER-SUM-APPLY (POWERGRAPH)

[Figure: Gather and Sum phases across Superstep i and Superstep i+1]

GSA EXAMPLE: PAGERANK

// compute partial rank
double gather(source, edge, target):
    return target.value() / target.numEdges()

// combine partial ranks
double sum(rank1, rank2):
    return rank1 + rank2

// update rank
double apply(sum, currentRank):
    return 0.15 + 0.85*sum
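One GSA superstep can be sketched as three small functions plus a driver loop. This is an illustrative single-machine version under stated assumptions: the class `GsaPageRankSketch` is hypothetical, `gather` is assumed to read the rank and out-degree of an in-neighbor, and the non-normalized formula 0.15 + 0.85*sum from the slide is used.

```java
// Hypothetical single-machine sketch of one gather-sum-apply superstep.
final class GsaPageRankSketch {

    // Gather: partial rank contributed by one in-neighbor.
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    // Sum: associative and commutative, so partial results can be
    // combined in any order (for example, per partition first).
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    // Apply: compute the new rank from the combined partial ranks.
    static double apply(double sum, double currentRank) {
        return 0.15 + 0.85 * sum;
    }

    // inNeighbors[v] lists the sources of v's in-edges.
    static double[] superstep(int[][] inNeighbors, int[] outDegree, double[] rank) {
        double[] next = new double[rank.length];
        for (int v = 0; v < rank.length; v++) {
            double acc = 0.0;
            for (int u : inNeighbors[v]) {
                acc = sum(acc, gather(rank[u], outDegree[u]));
            }
            next[v] = apply(acc, rank[v]);
        }
        return next;
    }
}
```

Because only `sum` combines values, the gather results for one vertex never need to be materialized as a full message list.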

PREGEL VS. SIGNAL-COLLECT VS. GSA

Model | Update Function Properties | Update Function Logic | Communication Scope | Communication Logic
Pregel | arbitrary | arbitrary | any vertex | arbitrary
Signal-Collect | arbitrary | based on received messages | any vertex | based on vertex
GSA | associative & commutative | based on neighbors’ values | neighborhood | based on vertex

CAN WE HAVE IT ALL?

▸ Data pipeline integration: built on top of an efficient distributed processing engine

▸ Graph ETL: high-level API with abstractions and methods to transform graphs

▸ Familiar programming model: support popular programming abstractions

Gelly the Apache Flink Graph API

Apache Flink Stack

APIs: DataSet (Java/Scala), DataStream (Java/Scala)

Deployment: Local, Remote, Yarn, Embedded

Runtime: Streaming dataflow runtime

Meet Gelly

• Java & Scala Graph APIs on top of Flink’s DataSet API

Flink Core

Scala API (batch and streaming)

Java API (batch and streaming)

FlinkML, Gelly, Table API, ...

• Transformations and Utilities

• Iterative Graph Processing

• Graph Library

Gelly is NOT

• a graph database

• a specialized graph processor

Hello, Gelly!

Java:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));

Scala:

val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))

Graph Methods

Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees

Mutations: add vertex/edge, remove vertex/edge

Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets

Generators: R-Mat (power-law), Grid, Star, Complete, …

Example: mapVertices

val graph = Graph.fromDataSet(...)
// increment each vertex value by one
val updatedGraph = graph.mapVertices(v => v.getValue + 1)

Example: subGraph

val graph: Graph[Long, Long, Long] = ...
// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
  vertex => vertex.getValue > 0,
  edge => edge.getValue < 0)

Neighborhood Methods

Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

graph.reduceOnNeighbors(
  new MinValue, EdgeDirection.OUT)
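What a minimum reduction over out-neighbors computes can be shown without Flink. The sketch below is a plain-Java illustration, not the Gelly implementation: `ReduceOnNeighborsSketch` and `minOfOutNeighbors` are hypothetical names, edges are assumed to be `{source, target}` pairs, and every target is assumed to have a value.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a min-reduce over each vertex's out-neighbors.
final class ReduceOnNeighborsSketch {

    // For every vertex with at least one out-edge, returns the smallest
    // value among the vertices its out-edges point to.
    static Map<Long, Long> minOfOutNeighbors(Map<Long, Long> vertexValues, long[][] edges) {
        Map<Long, Long> result = new TreeMap<>();
        for (long[] edge : edges) {
            long source = edge[0];
            long neighborValue = vertexValues.get(edge[1]);
            // Pairwise reduce: keep the smaller of the current and new value.
            result.merge(source, neighborValue, Long::min);
        }
        return result;
    }
}
```

Vertices without out-edges do not appear in the result, since there is nothing to reduce for them.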

What makes Gelly unique?

• Batch graph processing on top of a streaming dataflow engine

• Built for end-to-end analytics

• Support for multiple iteration abstractions

• Graph algorithm building blocks

• A large open-source library of graph algorithms

Why streaming dataflow?

• Batch engines materialize data… even if they don’t have to
  • the graph is always loaded and materialized in memory, even if not needed, e.g. mapping, filtering, transformation

• Communication and computation overlap

• We can do continuous graph processing (more after the break!)

End-to-end analytics

• Graphs don’t appear out of thin air…

• We need to support pre- and post-processing

• Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program

Iterative Graph Processing

• Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations

• vertex-centric

• scatter-gather

• gather-sum-apply

• partition-centric*

Flink Iteration Operators

[Figure: the two iteration operators. First: Iterative Update Function, Replace, Result. Second: Iterative Update Function, Workset, Solution Set, Result]

Optimization

• the runtime is aware of the iterative execution
• no scheduling overhead between iterations
• caching and state maintenance are handled automatically

Push work “out of the loop”

Maintain state as index

Cache loop-invariant data
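The workset and solution-set idea can be illustrated with connected components on a single machine. This is a sketch of the general technique, not Flink's iteration runtime: the class `DeltaIterationSketch` is hypothetical, vertices are assumed to be numbered 0..n-1, and edges are treated as undirected.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a workset / solution-set iteration.
final class DeltaIterationSketch {

    // Returns, for each vertex, the smallest vertex ID in its component.
    static int[] connectedComponents(int numVertices, int[][] edges) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < numVertices; v++) {
            neighbors.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        // Solution set: current component ID per vertex.
        int[] solutionSet = new int[numVertices];
        // Workset: vertices whose ID changed in the previous iteration.
        Set<Integer> workset = new TreeSet<>();
        for (int v = 0; v < numVertices; v++) {
            solutionSet[v] = v;
            workset.add(v);
        }

        while (!workset.isEmpty()) {
            // Only vertices in the workset propagate their ID.
            Map<Integer, Integer> candidates = new HashMap<>();
            for (int v : workset) {
                for (int u : neighbors.get(v)) {
                    candidates.merge(u, solutionSet[v], Integer::min);
                }
            }
            // Apply improvements; changed vertices form the next workset.
            Set<Integer> nextWorkset = new TreeSet<>();
            for (Map.Entry<Integer, Integer> c : candidates.entrySet()) {
                if (c.getValue() < solutionSet[c.getKey()]) {
                    solutionSet[c.getKey()] = c.getValue();
                    nextWorkset.add(c.getKey());
                }
            }
            workset = nextWorkset;
        }
        return solutionSet;
    }
}
```

The loop stops as soon as the workset is empty, so parts of the graph that have already converged cost nothing in later iterations.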

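Vertex-Centric SSSP

// srcId (the source vertex ID) is assumed to be defined in the enclosing scope
final class SSSPComputeFunction extends ComputeFunction[Long, Double, Double, Double] {

  override def compute(vertex: Vertex[Long, Double], messages: MessageIterator[Double]) = {

    // the source starts at distance 0, every other vertex at "infinity"
    var minDistance = if (vertex.getId == srcId) 0.0 else Double.MaxValue

    // keep the smallest distance received
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance) minDistance = msg
    }

    // on improvement, store the new distance and propagate it along the out-edges
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      for (edge: Edge[Long, Double] <- getEdges)
        sendMessageTo(edge.getTarget, minDistance + edge.getValue)
    }
  }
}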
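To trace how distances settle over supersteps, the same logic can be simulated in plain Java. This is a hedged single-machine sketch, not Gelly code: `VertexCentricSsspSketch` is a hypothetical class, vertices are numbered 0..n-1, edges are `{source, target, weight}` triples with non-negative weights, and each inbox slot keeps only the minimum message.

```java
import java.util.Arrays;

// Hypothetical single-machine simulation of vertex-centric SSSP.
final class VertexCentricSsspSketch {

    static double[] run(int numVertices, double[][] edges, int srcId) {
        double[] dist = new double[numVertices];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        // The inbox holds the smallest message per vertex; the source
        // is seeded with distance 0.
        double[] inbox = new double[numVertices];
        Arrays.fill(inbox, Double.POSITIVE_INFINITY);
        inbox[srcId] = 0.0;

        boolean anyUpdate = true;
        while (anyUpdate) {               // one loop pass = one superstep
            anyUpdate = false;
            double[] outbox = new double[numVertices];
            Arrays.fill(outbox, Double.POSITIVE_INFINITY);
            for (int v = 0; v < numVertices; v++) {
                if (inbox[v] < dist[v]) {
                    // Improved distance: store it and notify the out-neighbors.
                    dist[v] = inbox[v];
                    anyUpdate = true;
                    for (double[] e : edges) {   // linear scan keeps the sketch short
                        if ((int) e[0] == v) {
                            int target = (int) e[1];
                            outbox[target] = Math.min(outbox[target], dist[v] + e[2]);
                        }
                    }
                }
            }
            inbox = outbox;
        }
        return dist;
    }
}
```

Unreachable vertices keep a distance of positive infinity, and the loop ends in the first superstep where no vertex improves.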

Algorithms building blocks

• Allow operator re-use across graph algorithms when processing the same input with a similar configuration

Library of Algorithms

• PageRank
• Single Source Shortest Paths
• Label Propagation
• Weakly Connected Components
• Community Detection
• Triangle Count & Enumeration
• Local and Global Clustering Coefficient
• HITS
• Jaccard & Adamic-Adar Similarity
• Graph Summarization

val ranks = inputGraph.run(new PageRank(0.85, 20))

Tracker

Ad Server

display relevant ads

cookie exchange

profiling

Web Tracking

Can’t we block them?

Tracker

Ad Server

Legitimate site

Available Solutions

Crowd-sourced “black lists” of tracker URLs:
- AdBlock, DoNotTrack, EasyPrivacy

• not frequently updated
• not sure who or based on what criteria URLs are blacklisted
• miss “hidden” trackers or dual-role nodes
• blocking requires manual matching against the list
• can you buy your way into the whitelist?

DataSet

• 6 months (Nov 2014 - April 2015) of augmented Apache logs from a web proxy

• 80m requests, 2m distinct URLs, 3k users

The Hosts-Projection Graph

[Figure: the bipartite referer-hosts graph (U: referers, V: hosts) and the hosts-projection graph derived from it. Legend: referer, non-tracker host, tracker host, unlabeled host]

Classification via Label Propagation

non-tracker tracker unlabeled

Data Pipeline

1: logs pre-processing (raw logs to cleaned logs)

2: bipartite graph creation

3: largest connected component extraction

4: hosts-projection graph creation

5: community detection

6: results

google-analytics.com: T
bscored-research.com: T
facebook.com: NT
github.com: NT
cdn.cxense.com: NT
...

APIs used along the pipeline: DataSet API, Gelly, DataSet API

Feeling Gelly?

• Gelly Guide
  https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html

• To Petascale and Beyond @Flink Forward ’16
  http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/

• Web Tracker Detection @Flink Forward ’15
  https://www.youtube.com/watch?v=ZBCXXiDr3TU
  paper: Kalavri, Vasiliki, et al. "Like a pack of wolves: Community structure of web trackers." International Conference on Passive and Active Network Measurement, 2016.

Gelly-Stream: single-pass stream graph processing with Flink

Real Graphs are dynamic

Graphs are created from events happening in real-time

How we’ve done graph processing so far

1. Load: read the graph from disk and partition it in memory

2. Compute: read and mutate the graph state

3. Store: write the final graph state back to disk

What’s wrong with this model?

• It is slow
  • wait until the computation is over before you see any result
  • pre-processing and partitioning

• It is expensive
  • lots of memory and CPU required in order to scale

• It requires re-computation for graph changes
  • no efficient way to deal with updates

Can we do graph processing on streams?

• Maintain the dynamic graph structure

• Provide up-to-date results with low latency

• Compute on fresh state only

Single-pass graph streaming

• Each event is an edge addition

• Maintain only a graph summary

• Recent events are grouped in graph windows

Graph Summaries

• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties

[Figure: the algorithm run on the graph gives R1; the algorithm run on the graph summary gives R2; R1 ~ R2]

Batch Connected Components

Stream Connected Components

Graph Summary: Disjoint Set (Union-Find)

• Only store component IDs and vertex IDs

[Figure: animation of the disjoint-set summary as edges arrive, with components labeled Cid = 1, Cid = 2, and Cid = 6]
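A disjoint-set summary that is updated once per arriving edge might look like the sketch below. It is an illustration only and not the `DisjointSet` class used by Gelly-Stream: the class `DisjointSetSketch` is hypothetical, and the assumption here is that the smallest vertex ID in a component serves as its component ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical union-find summary: stores only vertex IDs and parent links.
final class DisjointSetSketch {

    private final Map<Long, Long> parent = new HashMap<>();

    // Returns the component ID of v, registering v if it is new.
    long find(long v) {
        parent.putIfAbsent(v, v);
        long p = parent.get(v);
        if (p == v) {
            return v;
        }
        long root = find(p);
        parent.put(v, root); // path compression
        return root;
    }

    // Processes one edge addition (a, b) from the stream.
    void union(long a, long b) {
        long rootA = find(a);
        long rootB = find(b);
        if (rootA != rootB) {
            // The smaller ID becomes the component ID (an assumption of this sketch).
            parent.put(Math.max(rootA, rootB), Math.min(rootA, rootB));
        }
    }

    // Folds another (for example, per-partition) summary into this one.
    void merge(DisjointSetSketch other) {
        for (Long v : new ArrayList<>(other.parent.keySet())) {
            union(v, other.find(v));
        }
    }
}
```

Each edge is seen once and then discarded; `merge` shows how summaries built independently on different partitions could be combined into a global one.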

Distributed Stream Connected Components

Stream Connected Components with Flink

DataStream<DisjointSet> cc = edgeStream
    .keyBy(0)                                         // partition the edge stream
    .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))  // define the merging frequency
    .fold(new DisjointSet(), new UpdateCC())          // merge locally
    .flatMap(new Merger())                            // merge globally
    .setParallelism(1);

Gelly on Streams

[Figure: Gelly on top of DataSet and Gelly-Stream on top of DataStream, both over Distributed Dataflow and Deployment]

Gelly

• Static Graphs
• Multi-Pass Algorithms
• Full Computations

Gelly-Stream

• Dynamic Graphs
• Single-Pass Algorithms
• Approximate Computations

Introducing Gelly-Stream

Gelly-Stream enriches the DataStream API with two new ADTs:

• GraphStream:
  • A representation of a data stream of edges.
  • Edges can have state (e.g. weights).
  • Supports property streams, transformations and aggregations.

• GraphWindow:
  • A “time-slice” of a graph stream.
  • It enables neighborhood aggregations.

GraphStream Operations

Property Streams (GraphStream -> DataStream)

.getEdges()
.getVertices()
.numberOfVertices()
.numberOfEdges()
.getDegrees()
.inDegrees()
.outDegrees()

Transformations (GraphStream -> GraphStream)

.mapEdges();
.distinct();
.filterVertices();
.filterEdges();
.reverse();
.undirected();
.union();

Graph Stream Aggregations

[Figure: a graph stream is folded (per window) into local summaries, which are combined or reduced into a global summary; the aggregate result is a property stream]

Global aggregates can be persistent or transient.

graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))
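The fold-then-combine shape of such an aggregation can be sketched in plain Java. This is an illustration of the pattern, not the Gelly-Stream API: `GraphAggregationSketch`, `aggregate`, and `countVertices` are hypothetical names, and each partition is assumed to be an array of `{source, target}` edges.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

// Hypothetical illustration of fold (local summaries) + combine (global summary).
final class GraphAggregationSketch {

    // partitions[p] holds the edges routed to partition p.
    static <S> S aggregate(long[][][] partitions,
                           Supplier<S> initial,
                           BiFunction<S, long[], S> fold,
                           BinaryOperator<S> combine) {
        S global = initial.get();
        for (long[][] partition : partitions) {
            // Fold: build a local summary from one partition's edges.
            S local = initial.get();
            for (long[] edge : partition) {
                local = fold.apply(local, edge);
            }
            // Combine: merge the local summary into the global one.
            global = combine.apply(global, local);
        }
        return global;
    }

    // Example aggregation: number of distinct vertices seen so far.
    static int countVertices(long[][][] partitions) {
        Set<Long> vertices = GraphAggregationSketch.<Set<Long>>aggregate(
            partitions,
            HashSet::new,
            (seen, edge) -> { seen.add(edge[0]); seen.add(edge[1]); return seen; },
            (a, b) -> { a.addAll(b); return a; });
        return vertices.size();
    }
}
```

In a distributed setting the fold step would run in parallel per partition; here the partitions are processed one after another only to keep the sketch self-contained.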

Slicing Graph Streams

graphStream.slice(Time.of(1, MINUTE));

[Figure: the stream cut into one-minute slices at 11:40, 11:41, 11:42, 11:43]

Aggregating Slices

graphStream.slice(Time.of(1, MINUTE), direction)

followed by one of:

.reduceOnEdges();
.foldNeighbors();
.applyOnNeighbors();

• Slicing collocates edges by vertex information

• Neighborhood aggregations on sliced graphs

[Figure: edges grouped by source or target vertex, then passed to Aggregations]

Finding Matches Nearby

graphStream.filterVertices(GraphGeeks())
    .slice(Time.of(15, MINUTE), EdgeDirection.IN)
    .applyOnNeighbors(FindPairs())

GraphStream :: graph geek check-ins

wendy checked_in soap_bar
steve checked_in soap_bar
tom checked_in joe’s_grill
sandra checked_in soap_bar
rafa checked_in joe’s_grill

GraphWindow :: user-place

[Figure: users grouped around the places soap_bar and joe’s_grill]

FindPairs

{wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
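What a pair-finding neighborhood function does within one window can be sketched as follows. This is an illustration in plain Java, not the `FindPairs` function from the talk: `FindPairsSketch` is a hypothetical class, each check-in is assumed to be a `{user, place}` pair, and pairs are rendered as `"a,b"` strings with the names in alphabetical order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: pair up users who checked in at the same place.
final class FindPairsSketch {

    static Set<String> findPairs(String[][] checkIns) {
        // Group users by place (the in-neighborhood of each place vertex).
        Map<String, List<String>> visitorsByPlace = new LinkedHashMap<>();
        for (String[] checkIn : checkIns) {
            visitorsByPlace.computeIfAbsent(checkIn[1], k -> new ArrayList<>()).add(checkIn[0]);
        }
        // Emit every unordered pair of distinct users per place.
        Set<String> pairs = new TreeSet<>();
        for (List<String> visitors : visitorsByPlace.values()) {
            for (int i = 0; i < visitors.size(); i++) {
                for (int j = i + 1; j < visitors.size(); j++) {
                    String a = visitors.get(i);
                    String b = visitors.get(j);
                    if (a.equals(b)) {
                        continue; // same user checked in twice
                    }
                    pairs.add(a.compareTo(b) < 0 ? a + "," + b : b + "," + a);
                }
            }
        }
        return pairs;
    }
}
```

Applied to the five check-ins above, this yields the same four pairs as the slide, since three users share soap_bar and two share joe’s_grill.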

Feeling Gelly?

• Gelly Guide
  https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html

• Gelly-Stream Repository
  https://github.com/vasia/gelly-streaming

• Gelly-Stream talk @FOSDEM16
  https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/

• Related Papers
  http://www.citeulike.org/user/vasiakalavri/tag/graph-streaming

