Batch & Stream Graph Processing with Apache Flink
Vasia Kalavri vasia@apache.org
@vkalavri
Apache Flink Meetup London October 5th, 2016
Graphs capture relationships between data items
connections, interactions, purchases, dependencies, friendships, etc.
Recommenders
Social networks
Bioinformatics
Web search
Outline
• Distributed Graph Processing 101
• Gelly: Batch Graph Processing with Apache Flink
• BREAK!
• Gelly-Stream: Continuous Graph Processing with Apache Flink
Apache Flink
• An open-source, distributed data analysis framework
• True streaming at its core
• Streaming & Batch API
Event logs (Kafka, RabbitMQ, ...): low latency, windowing, aggregations, ...
Historic data (HDFS, JDBC, ...): ETL, Graphs, Machine Learning, Relational, …
WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
MISCONCEPTION #1
“MY GRAPH IS SO BIG, IT DOESN’T FIT IN A SINGLE MACHINE” (Big Data Ninja)
A SOCIAL NETWORK
NAIVE WHO(M)-TO-FOLLOW
▸ Naive Who(m) to Follow:
▸ compute a friends-of-friends list per user
▸ exclude existing friends
▸ rank by common connections
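The three steps above can be sketched on a single machine as follows. This is only an illustration: the class name, the helper methods, and the five-user toy graph are assumptions made for this example, and friendships are assumed to be undirected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-machine sketch of the naive who-to-follow steps.
final class NaiveWhoToFollow {

    // Builds an undirected friendship graph from pairs of user names.
    static Map<String, Set<String>> undirected(String[][] edges) {
        Map<String, Set<String>> graph = new HashMap<>();
        for (String[] e : edges) {
            graph.computeIfAbsent(e[0], key -> new HashSet<>()).add(e[1]);
            graph.computeIfAbsent(e[1], key -> new HashSet<>()).add(e[0]);
        }
        return graph;
    }

    // Returns up to k recommendations for `user`, ranked by common connections.
    static List<String> recommend(Map<String, Set<String>> friends, String user, int k) {
        Set<String> direct = friends.getOrDefault(user, Collections.<String>emptySet());
        Map<String, Integer> common = new HashMap<>();
        // Step 1: walk the friends-of-friends list.
        for (String friend : direct) {
            for (String candidate : friends.getOrDefault(friend, Collections.<String>emptySet())) {
                // Step 2: exclude the user and existing friends.
                if (candidate.equals(user) || direct.contains(candidate)) {
                    continue;
                }
                common.merge(candidate, 1, Integer::sum);
            }
        }
        // Step 3: rank by number of common connections (ties broken by name).
        List<String> ranked = new ArrayList<>(common.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(common.get(b), common.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = undirected(new String[][] {
            {"a", "b"}, {"a", "c"}, {"b", "d"}, {"c", "d"}, {"b", "e"}
        });
        System.out.println(recommend(graph, "a", 2)); // prints [d, e]
    }
}
```

Step 1 visits one candidate for every two-hop path in the graph, so the intermediate data grows roughly with the sum of squared vertex degrees rather than with the number of edges.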
DON’T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!
MISCONCEPTION #2
“DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE” (Data Science Rockstar)
GRAPHS DON’T APPEAR OUT OF THIN AIR
Expectation…
Reality!
WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
▸ When you do have really big graphs
▸ When the intermediate data is big
▸ When your data is already distributed
▸ When you want to build end-to-end graph pipelines
HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
Timeline: MapReduce (2004), Pegasus (2009), Pregel (2010), Signal-Collect, PowerGraph (2012), Giraph++ (2013), NScale (2014), Arabesque (2015), Tinkerpop
Categories shown on the timeline: iterative value propagation, graph traversals, ego-network analysis, pattern matching
PREGEL: THINK LIKE A VERTEX
[Figure: an example graph with vertices 1 to 5, stored as per-vertex adjacency lists: 1 → 3, 4; 2 → 1, 4; 5 → 3; ...]
PREGEL: SUPERSTEPS
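(V_{i+1}, outbox) ← compute(V_i, inbox)
[Figure: the per-vertex adjacency lists at superstep i and at superstep i+1; messages placed in the outbox during superstep i are delivered to the inbox in superstep i+1.]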
PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
VertexID   Out-degree   Transition Probability
1          2            1/2
2          2            1/2
3          0            -
4          3            1/3
5          1            1
PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
[Figure: the example graph with vertices 1 to 5; vertex 3 has incoming edges from vertices 1, 4, and 5.]
PREGEL EXAMPLE: PAGERANK
void compute(messages):
  // sum up received messages
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  // update vertex rank
  setValue(0.15/numVertices() + 0.85*sum)
  // distribute rank to neighbors
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for
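To make the superstep mechanics concrete, here is a minimal single-machine simulation of the compute function above on the five-vertex example graph. It is a sketch under stated assumptions, not Pregel or Gelly code: the class and method names are made up, every vertex starts with rank 1/numVertices, superstep 0 only sends messages, and of vertex 4's three out-edges only 4 → 3 comes from the example while the other two targets are assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical simulation of vertex-centric PageRank with synchronous supersteps.
final class PregelPageRank {

    // Out-edges of the example graph.
    static Map<Integer, int[]> exampleGraph() {
        Map<Integer, int[]> graph = new TreeMap<>();
        graph.put(1, new int[] {3, 4});
        graph.put(2, new int[] {1, 4});
        graph.put(3, new int[] {});        // out-degree 0: its rank is not redistributed
        graph.put(4, new int[] {2, 3, 5}); // ASSUMPTION: only 4 -> 3 is given in the example
        graph.put(5, new int[] {3});
        return graph;
    }

    static Map<Integer, List<Double>> emptyBoxes(Set<Integer> vertices) {
        Map<Integer, List<Double>> boxes = new HashMap<>();
        for (Integer v : vertices) {
            boxes.put(v, new ArrayList<>());
        }
        return boxes;
    }

    // Runs superstep 0 (send only) followed by `supersteps` rank updates.
    static Map<Integer, Double> run(Map<Integer, int[]> outEdges, int supersteps) {
        int n = outEdges.size();
        Map<Integer, Double> value = new TreeMap<>();
        for (int v : outEdges.keySet()) {
            value.put(v, 1.0 / n); // assumed initial rank
        }
        Map<Integer, List<Double>> inbox = emptyBoxes(outEdges.keySet());
        for (int step = 0; step <= supersteps; step++) {
            Map<Integer, List<Double>> outbox = emptyBoxes(outEdges.keySet());
            for (int v : outEdges.keySet()) {
                if (step > 0) {
                    // sum up received messages and update the vertex rank
                    double sum = 0.0;
                    for (double m : inbox.get(v)) {
                        sum += m;
                    }
                    value.put(v, 0.15 / n + 0.85 * sum);
                }
                // distribute the rank evenly over the out-edges
                int[] targets = outEdges.get(v);
                for (int t : targets) {
                    outbox.get(t).add(value.get(v) / targets.length);
                }
            }
            inbox = outbox; // messages become visible in the next superstep
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleGraph(), 20));
    }
}
```

After one update, the rank of vertex 3 equals 0.15/5 + 0.85 * (0.5*PR(1) + PR(4)/3 + PR(5)) with all previous ranks at 0.2, which is the damped form of the PR(3) expression from the transition table.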
SIGNAL-COLLECT
outbox ← signal(V_i)
V_{i+1} ← collect(inbox)
[Figure: each superstep is split into a Signal phase, which produces the messages, and a Collect phase, which consumes them to update the vertex value.]
SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal():
  // distribute rank to neighbors
  for (edge <- getOutEdges()) do
    sendMessageTo(edge.target(), getValue()/numEdges)
  end for

void collect(messages):
  // sum up messages
  sum = 0.0
  for (m <- messages) do
    sum = sum + m
  end for
  // update vertex rank
  setValue(0.15/numVertices() + 0.85*sum)
GATHER-SUM-APPLY (POWERGRAPH)
[Figure: in each superstep, the Gather phase produces a partial value per neighboring edge, the Sum phase combines the partial values, and the Apply phase updates the vertex value.]
GSA EXAMPLE: PAGERANK
// compute partial rank
double gather(source, edge, target):
  return target.value() / target.numEdges()
// combine partial ranks
double sum(rank1, rank2):
  return rank1 + rank2
// update rank
double apply(sum, currentRank):
  return 0.15/numVertices() + 0.85*sum
PREGEL VS. SIGNAL-COLLECT VS. GSA
               | Update Function Properties | Update Function Logic       | Communication Scope | Communication Logic
Pregel         | arbitrary                  | arbitrary                   | any vertex          | arbitrary
Signal-Collect | arbitrary                  | based on received messages  | any vertex          | based on vertex state
GSA            | associative & commutative  | based on neighbors’ values  | neighborhood        | based on vertex state
CAN WE HAVE IT ALL?
▸ Data pipeline integration: built on top of an efficient distributed processing engine
▸ Graph ETL: high-level API with abstractions and methods to transform graphs
▸ Familiar programming model: support popular programming abstractions
Gelly the Apache Flink Graph API
Apache Flink Stack
Libraries: Gelly, Table/SQL, ML, SAMOA, Hadoop M/R, Dataflow, Cascading, CEP
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Embedded
Meet Gelly
• Java & Scala Graph APIs on top of Flink’s DataSet API
• Gelly components: Transformations and Utilities, Iterative Graph Processing, Graph Library
[Figure: Flink Core at the bottom, the Scala and Java APIs (batch and streaming) on top of it, and the FlinkML, Gelly, and Table API libraries above them.]
Gelly is NOT
• a graph database
• a specialized graph processor
Hello, Gelly!
Java
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env);
DataSet<Vertex<Long, Long>> verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));
Scala
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
Graph Methods
• Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees
• Mutations: add vertex/edge, remove vertex/edge
• Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets
• Generators: R-Mat (power-law), Grid, Star, Complete, …
Example: mapVertices
val graph = Graph.fromDataSet(...)
// increment each vertex value by one
val updatedGraph = graph.mapVertices(v => v.getValue + 1)
[Figure: the vertex values of a small graph before and after the map.]
Example: subGraph
val graph: Graph[Long, Long, Long] = ...
// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
  vertex => vertex.getValue > 0,
  edge => edge.getValue < 0
)
Neighborhood Methods
Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
graph.reduceOnNeighbors(new MinValue, EdgeDirection.OUT)
What makes Gelly unique?
• Batch graph processing on top of a streaming dataflow engine
• Built for end-to-end analytics
• Support for multiple iteration abstractions
• Graph algorithm building blocks
• A large open-source library of graph algorithms
Why streaming dataflow?
• Batch engines materialize data… even if they don’t have to
  • the graph is always loaded and materialized in memory, even if not needed, e.g. mapping, filtering, transformation
• Communication and computation overlap
• We can do continuous graph processing (more after the break!)
End-to-end analytics
• Graphs don’t appear out of thin air…
• We need to support pre- and post-processing
• Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program
Iterative Graph Processing
• Gelly offers iterative graph processing abstractions on top of Flink’s Delta iterations
• vertex-centric
• scatter-gather
• gather-sum-apply
• partition-centric*
Flink Iteration Operators
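[Figure: two iteration operators. In the first, an iterative update function consumes the input and its result replaces the input of the next iteration. In the second, the update function consumes a workset and updates the state kept in a solution set.]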
Optimization
• the runtime is aware of the iterative execution
• no scheduling overhead between iterations
• caching and state maintenance are handled automatically
Push work “out of the loop” | Maintain state as index | Cache loop-invariant data
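Vertex-Centric SSSP
import scala.collection.JavaConverters._

// srcId (the source vertex ID) is assumed to be defined in the enclosing scope
final class SSSPComputeFunction extends ComputeFunction[Long, Double, Double, Double] {
  override def compute(vertex: Vertex[Long, Double], messages: MessageIterator[Double]) = {
    var minDistance = if (vertex.getId == srcId) 0.0 else Double.MaxValue
    // keep the smallest distance received from the neighbors
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance) minDistance = msg
    }
    // on improvement, store the new distance and propagate it
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      for (edge <- getEdges.asScala)
        sendMessageTo(edge.getTarget, minDistance + edge.getValue)
    }
  }
}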
Algorithms building blocks
• Allow operator re-use across graph algorithms when processing the same input with a similar configuration
Library of Algorithms
• PageRank
• Single Source Shortest Paths
• Label Propagation
• Weakly Connected Components
• Community Detection
• Triangle Count & Enumeration
• Local and Global Clustering Coefficient
• HITS
• Jaccard & Adamic-Adar Similarity
• Graph Summarization
val ranks = inputGraph.run(new PageRank(0.85, 20))
Web Tracking
[Figure: trackers and an ad server use cookie exchange and profiling to display relevant ads.]
Can’t we block them?
[Figure: a proxy sits between the user and the trackers, the ad server, and the legitimate site.]
Available Solutions
Crowd-sourced “black lists” of tracker URLs: AdBlock, DoNotTrack, EasyPrivacy
• not frequently updated
• not sure who or based on what criteria URLs are blacklisted
• miss “hidden” trackers or dual-role nodes
• blocking requires manual matching against the list
• can you buy your way into the whitelist?
DataSet
• 6 months (Nov 2014 - April 2015) of augmented Apache logs from a web proxy
• 80m requests, 2m distinct URLs, 3k users
The Hosts-Projection Graph
referer-hosts graph: a bipartite graph between U: referers (r1 to r7) and V: hosts (h1 to h8)
hosts-projection graph: hosts only, with each edge annotated by the referers its two hosts share; every host is labeled tracker (T), non-tracker (NT), or unlabeled (?)
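A projection of this kind can be sketched as follows: two hosts become neighbors when at least one referer points to both of them. The class name, the string encoding of an edge, the use of the shared-referer count as edge weight, and the three-referer sample data are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: project a bipartite referer -> hosts graph onto the hosts side.
final class HostsProjection {

    // Returns "hostA -- hostB" -> number of referers shared by the two hosts.
    static Map<String, Integer> project(Map<String, List<String>> hostsByReferer) {
        Map<String, Integer> edges = new TreeMap<>();
        for (List<String> hosts : hostsByReferer.values()) {
            // de-duplicate and sort so each pair has one canonical key
            List<String> sorted = new ArrayList<>(new TreeSet<>(hosts));
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    edges.merge(sorted.get(i) + " -- " + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return edges;
    }

    // Made-up sample data, not taken from the proxy logs.
    static Map<String, List<String>> example() {
        Map<String, List<String>> hostsByReferer = new LinkedHashMap<>();
        hostsByReferer.put("r1", Arrays.asList("h1", "h2"));
        hostsByReferer.put("r2", Arrays.asList("h2", "h3"));
        hostsByReferer.put("r3", Arrays.asList("h2", "h3", "h4"));
        return hostsByReferer;
    }

    public static void main(String[] args) {
        System.out.println(project(example()));
    }
}
```

A referer that points to d hosts contributes d*(d-1)/2 pairs, so the projection can be much denser than the bipartite graph it is derived from.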
Classification via Label Propagation
non-tracker tracker unlabeled
Data Pipeline
raw logs → cleaned logs
1: logs pre-processing
2: bipartite graph creation
3: largest connected component extraction
4: hosts-projection graph creation
5: community detection
6: results
   google-analytics.com: T
   bscored-research.com: T
   facebook.com: NT
   github.com: NT
   cdn.cxense.com: NT
   ...
Implemented with: DataSet API → Gelly → DataSet API
Feeling Gelly?
• Gelly Guide
  https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
• To Petascale and Beyond @Flink Forward ’16
  http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
• Web Tracker Detection @Flink Forward ’15
  https://www.youtube.com/watch?v=ZBCXXiDr3TU
  paper: Kalavri, Vasiliki, et al. "Like a pack of wolves: Community structure of web trackers." International Conference on Passive and Active Network Measurement, 2016.
Gelly-Stream: single-pass stream graph processing with Flink
Real Graphs are dynamic
Graphs are created from events happening in real-time
How we’ve done graph processing so far
1. Load: read the graph from disk and partition it in memory
2. Compute: read and mutate the graph state
3. Store: write the final graph state back to disk
What’s wrong with this model?
• It is slow
  • wait until the computation is over before you see any result
  • pre-processing and partitioning
• It is expensive
  • lots of memory and CPU required in order to scale
• It requires re-computation for graph changes
  • no efficient way to deal with updates
Can we do graph processing on streams?
• Maintain the dynamic graph structure
• Provide up-to-date results with low latency
• Compute on fresh state only
Single-pass graph streaming
• Each event is an edge addition
• Maintain only a graph summary
• Recent events are grouped in graph windows
Graph Summaries
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties
[Figure: running the algorithm on the full graph gives result R1; running it on the graph summary gives result R2, with R1 ~ R2.]
Batch Connected Components
[Figure: iterations i=0, i=1, and i=2 of batch connected components on a graph with vertices 1 to 8. Every vertex starts with its own ID as component ID and repeatedly adopts the smallest ID among its neighbors. After i=2, vertices 1 to 5 carry component ID 1 and vertices 6 to 8 carry component ID 6.]
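The batch approach can be sketched as synchronous minimum-label propagation. This is a single-machine illustration only: the class name and the edge list in main are assumptions chosen to yield the same two components, not the exact graph drawn on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of batch connected components via min-label propagation.
final class BatchConnectedComponents {

    // Returns vertex -> component ID (the smallest vertex ID in its component).
    static Map<Integer, Integer> run(int[][] edges) {
        Map<Integer, Integer> label = new TreeMap<>();
        for (int[] e : edges) {
            label.put(e[0], e[0]); // i=0: every vertex is its own component
            label.put(e[1], e[1]);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            // read labels of the previous iteration, write labels of the next one
            Map<Integer, Integer> next = new TreeMap<>(label);
            for (int[] e : edges) {
                int min = Math.min(label.get(e[0]), label.get(e[1]));
                if (min < next.get(e[0])) {
                    next.put(e[0], min);
                    changed = true;
                }
                if (min < next.get(e[1])) {
                    next.put(e[1], min);
                    changed = true;
                }
            }
            label = next;
        }
        return label;
    }

    public static void main(String[] args) {
        // ASSUMPTION: illustrative undirected edges with components {1..5} and {6, 7, 8}
        int[][] edges = {{1, 3}, {2, 3}, {2, 4}, {2, 5}, {4, 5}, {6, 7}, {6, 8}};
        System.out.println(run(edges));
    }
}
```

Every iteration reads the whole edge set again, which is the multi-pass behavior that the single-pass streaming variant avoids.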
Stream Connected Components
Graph Summary: Disjoint Set (Union-Find)
• Only store component IDs and vertex IDs
[Figure: the edge stream 5-4, 7-6, 8-6, 4-2, 4-3, 8-7, 4-1, 3-1, 5-2 is consumed one edge at a time and folded into the disjoint-set summary. The final state holds two components: Cid = 1 with vertices 1, 2, 3, 4, 5 and Cid = 6 with vertices 6, 7, 8.]
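A disjoint-set summary of this kind might look like the sketch below. It is not the DisjointSet class from the gelly-streaming repository: the class and method names are assumptions, the two-digit numbers in the figure are read as vertex pairs, and the smaller root is kept as component ID so that the result matches the Cid labels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

// Hypothetical union-find summary for single-pass connected components.
final class DisjointSetSummary {
    private final Map<Integer, Integer> parent = new HashMap<>();

    // Returns the component ID of v, registering v on first sight.
    int find(int v) {
        parent.putIfAbsent(v, v);
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // path compression: point every visited vertex directly at the root
        int current = v;
        while (current != root) {
            int next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    // Consumes one edge of the stream.
    void union(int a, int b) {
        int rootA = find(a);
        int rootB = find(b);
        if (rootA == rootB) {
            return;
        }
        if (rootA < rootB) {
            parent.put(rootB, rootA);
        } else {
            parent.put(rootA, rootB);
        }
    }

    // Merges another partial summary into this one (e.g. a global merge step).
    void merge(DisjointSetSummary other) {
        for (Map.Entry<Integer, Integer> e : other.parent.entrySet()) {
            union(e.getKey(), e.getValue());
        }
    }

    int numComponents() {
        int roots = 0;
        for (int v : new ArrayList<>(parent.keySet())) {
            if (find(v) == v) {
                roots++;
            }
        }
        return roots;
    }

    static DisjointSetSummary fromEdges(int[][] edges) {
        DisjointSetSummary summary = new DisjointSetSummary();
        for (int[] e : edges) {
            summary.union(e[0], e[1]);
        }
        return summary;
    }

    public static void main(String[] args) {
        int[][] stream = {{5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}, {3, 1}, {5, 2}};
        DisjointSetSummary summary = fromEdges(stream);
        System.out.println(summary.find(5) + " " + summary.find(8)); // prints 1 6
    }
}
```

The summary stores one entry per vertex and never keeps the edges themselves, and merge lets partial summaries built on different partitions be combined into a global one.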
Distributed Stream Connected Components
Stream Connected Components with Flink
DataStream<DisjointSet> cc = edgeStream
    .keyBy(0)                                         // partition the edge stream
    .timeWindow(Time.of(100, TimeUnit.MILLISECONDS))  // define the merging frequency
    .fold(new DisjointSet(), new UpdateCC())          // merge locally
    .flatMap(new Merger())                            // merge globally
    .setParallelism(1);
Gelly on Streams
Gelly (on DataSet)
• Static Graphs
• Multi-Pass Algorithms
• Full Computations
Gelly-Stream (on DataStream)
• Dynamic Graphs
• Single-Pass Algorithms
• Approximate Computations
[Figure: both libraries sit on top of the distributed dataflow runtime and the deployment layer.]
Introducing Gelly-Stream
Gelly-Stream enriches the DataStream API with two new ADTs:
• GraphStream:
  • A representation of a data stream of edges.
  • Edges can have state (e.g. weights).
  • Supports property streams, transformations and aggregations.
• GraphWindow:
  • A “time-slice” of a graph stream.
  • It enables neighborhood aggregations.
GraphStream Operations
Property Streams (GraphStream -> DataStream)
.getEdges()
.getVertices()
.numberOfVertices()
.numberOfEdges()
.getDegrees()
.inDegrees()
.outDegrees()
Transformations (GraphStream -> GraphStream)
.mapEdges();
.distinct();
.filterVertices();
.filterEdges();
.reverse();
.undirected();
.union();
Graph Stream Aggregations
[Figure: within each window, the edges of the graph stream are folded into local summaries; the local summaries are combined (reduced) into a global summary, and the aggregate result is emitted as a property stream.]
global aggregates can be persistent or transient
graphStream.aggregate( new MyGraphAggregation(window, fold, combine, transform))
Slicing Graph Streams
graphStream.slice(Time.of(1, MINUTE));
[Figure: the stream is cut into one-minute slices at 11:40, 11:41, 11:42, 11:43.]
Aggregating Slices
graphStream.slice(Time.of(1, MINUTE), direction)
followed by one of:
    .reduceOnEdges();
    .foldNeighbors();
    .applyOnNeighbors();
• Slicing collocates edges by vertex information
• Neighborhood aggregations on sliced graphs
[Figure: edges collocated by source or target vertex feed the aggregations.]
Finding Matches Nearby
graphStream.filterVertices(GraphGeeks())
    .slice(Time.of(15, MINUTE), EdgeDirection.IN)
    .applyOnNeighbors(FindPairs())
GraphStream :: graph geek check-ins
wendy checked_in soap_bar
steve checked_in soap_bar
tom checked_in joe’s_grill
sandra checked_in soap_bar
rafa checked_in joe’s_grill
GraphWindow :: user-place (after slice)
soap_bar: wendy, steve, sandra
joe’s_grill: tom, rafa
FindPairs
{wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
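The neighborhood aggregation in this example can be mimicked on a single machine as shown below. It is a sketch, not the Gelly-Stream API: the class and method names are assumptions, and all five check-ins are assumed to fall into the same 15-minute slice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the FindPairs aggregation over one slice of check-ins.
final class FindPairsSketch {

    // Each check-in is {user, place}; returns all user pairs that share a place.
    static List<String> findPairs(String[][] checkIns) {
        // group the users by place, i.e. collect the in-neighborhood of each place
        Map<String, List<String>> usersByPlace = new LinkedHashMap<>();
        for (String[] checkIn : checkIns) {
            usersByPlace.computeIfAbsent(checkIn[1], place -> new ArrayList<>()).add(checkIn[0]);
        }
        // emit every pair of users within the same neighborhood
        List<String> pairs = new ArrayList<>();
        for (List<String> users : usersByPlace.values()) {
            for (int i = 0; i < users.size(); i++) {
                for (int j = i + 1; j < users.size(); j++) {
                    pairs.add("{" + users.get(i) + ", " + users.get(j) + "}");
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String[][] slice = {
            {"wendy", "soap_bar"}, {"steve", "soap_bar"}, {"tom", "joe's_grill"},
            {"sandra", "soap_bar"}, {"rafa", "joe's_grill"}
        };
        System.out.println(findPairs(slice));
    }
}
```

Grouping by place corresponds to slicing with EdgeDirection.IN, where every place vertex sees the users that checked in during the window.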
Feeling Gelly?
• Gelly Guide
  https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html
• Gelly-Stream Repository
  https://github.com/vasia/gelly-streaming
• Gelly-Stream talk @FOSDEM16
  https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/
• Related Papers
  http://www.citeulike.org/user/vasiakalavri/tag/graph-streaming