Apache Flink & Graph Processing
Posted: 08-Jan-2017
Category: Data & Analytics
Batch & Stream Graph Processing with Apache Flink
Vasia Kalavri
@vkalavri
Apache Flink Meetup London October 5th, 2016
Graphs capture relationships between data items
connections, interactions, purchases, dependencies, friendships, etc.
Recommenders
Social networks
Bioinformatics
Web search
Outline
Distributed Graph Processing 101
Gelly: Batch Graph Processing with Apache Flink
BREAK!
Gelly-Stream: Continuous Graph Processing with Apache Flink
Apache Flink: an open-source, distributed data analysis framework
True streaming at its core
Streaming & Batch API
Event logs (Kafka, RabbitMQ, ...): low latency, windowing, aggregations, ...
Historic data (HDFS, JDBC, ...): ETL, graphs, machine learning, relational, ...
WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
MY GRAPH IS SO BIG, IT DOESN'T FIT IN A SINGLE MACHINE
Big Data Ninja
MISCONCEPTION #1
A SOCIAL NETWORK
NAIVE WHO(M)-TO-FOLLOW
Naive Who(m) to Follow:
compute a friends-of-friends list per user
exclude existing friends
rank by common connections
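The three steps above can be sketched in plain Java on a single machine. This is only an illustration, not code from the talk: the class name `NaiveWhoToFollow`, the adjacency-map input, and the helper `intermediatePairs` are assumptions made for the example. The helper counts how many friend-of-friend pairs are generated before filtering, which is the intermediate data that can dwarf the input graph.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-machine sketch of the naive who(m)-to-follow steps.
final class NaiveWhoToFollow {

    // Returns candidate ids for one user, most common connections first
    // (ties are kept in ascending id order).
    static List<Integer> recommend(Map<Integer, Set<Integer>> friends, int user) {
        Set<Integer> direct = friends.getOrDefault(user, Collections.emptySet());
        Map<Integer, Integer> common = new TreeMap<>();
        for (int friend : direct) {
            // Step 1: walk the friends-of-friends list.
            for (int candidate : friends.getOrDefault(friend, Collections.emptySet())) {
                // Step 2: exclude the user and existing friends.
                if (candidate != user && !direct.contains(candidate)) {
                    common.merge(candidate, 1, Integer::sum);
                }
            }
        }
        // Step 3: rank by number of common connections (stable sort).
        List<Integer> ranked = new ArrayList<>(common.keySet());
        ranked.sort((a, b) -> Integer.compare(common.get(b), common.get(a)));
        return ranked;
    }

    // Counts the friend-of-friend pairs produced before any filtering,
    // i.e. the sum of squared degrees for a symmetric friendship graph.
    static long intermediatePairs(Map<Integer, Set<Integer>> friends) {
        long pairs = 0;
        for (Set<Integer> direct : friends.values()) {
            for (int friend : direct) {
                pairs += friends.getOrDefault(friend, Collections.emptySet()).size();
            }
        }
        return pairs;
    }
}
```

Even in this tiny form, `intermediatePairs` grows with the square of vertex degrees, so a few very popular users are enough to make the intermediate result much larger than the edge list.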
DON'T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!
DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE
Data Science Rockstar
MISCONCEPTION #2
GRAPHS DON'T APPEAR OUT OF THIN AIR
Expectation
GRAPHS DON'T APPEAR OUT OF THIN AIR
Reality!
WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
When you do have really big graphs
When the intermediate data is big
When your data is already distributed
When you want to build end-to-end graph pipelines
HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
RECENT DISTRIBUTED GRAPH PROCESSING HISTORY
2004: MapReduce
2009: Pegasus
2010: Pregel, Signal-Collect
2012: PowerGraph
2013: Giraph++
2014: NScale
2015: Arabesque
Tinkerpop
Paradigms on the timeline: Iterative value propagation, Graph Traversals, Ego-network analysis, Pattern Matching
PREGEL: THINK LIKE A VERTEX
[Figure: example graph with vertices 1 to 5, stored as per-vertex adjacency lists]
1: 3, 4
2: 1, 4
5: 3
. . .
PREGEL: SUPERSTEPS
(V_i+1, outbox) <- compute(V_i, inbox)
PAGERANK: THE WORD COUNT OF GRAPH PROCESSING
VertexID | Out-degree | Transition Probability
1 | 2 | 1/2
2 | 2 | 1/2
3 | 0 | -
4 | 3 | 1/3
5 | 1 | 1
[Figure: the example graph with vertices 1 to 5]
PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)
PREGEL EXAMPLE: PAGERANK
void compute(messages):
  sum = 0.0
  for (m
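Since the pseudocode above is cut off in this transcript, here is a minimal single-machine sketch of how a Pregel-style PageRank could look. It is an assumption-laden illustration rather than the slide's code: the class `PregelPageRank`, the initial rank of 1.0, and the fixed number of supersteps are all choices made for this example, and the update rule `0.15 + 0.85 * sum` is borrowed from the GSA slide. Each superstep, a vertex sums its inbox, updates its rank, and sends `rank / outDegree` to every out-neighbor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of synchronous Pregel supersteps for PageRank.
final class PregelPageRank {

    // outEdges.get(v) lists the targets of vertex v (vertices are 0..n-1).
    static double[] run(List<List<Integer>> outEdges, int supersteps) {
        int n = outEdges.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0);
        List<List<Double>> inbox = emptyBoxes(n);
        for (int step = 0; step < supersteps; step++) {
            List<List<Double>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {
                // compute(messages): sum the received partial ranks.
                if (step > 0) {
                    double sum = 0.0;
                    for (double m : inbox.get(v)) {
                        sum += m;
                    }
                    rank[v] = 0.15 + 0.85 * sum;
                }
                // Send this vertex's share to each out-neighbor.
                List<Integer> targets = outEdges.get(v);
                for (int t : targets) {
                    outbox.get(t).add(rank[v] / targets.size());
                }
            }
            // Messages become visible only in the next superstep.
            inbox = outbox;
        }
        return rank;
    }

    private static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

Swapping `inbox` and `outbox` at the end of each loop is what models the superstep barrier: no vertex sees a message sent in the same superstep.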
SIGNAL-COLLECT
outbox
SIGNAL-COLLECT EXAMPLE: PAGERANK
void signal(): for (edge
GATHER-SUM-APPLY (POWERGRAPH)
[Figure: Gather, Sum, and Apply phases on an example neighborhood, shown across Superstep i and Superstep i+1]
GSA EXAMPLE: PAGERANK
double gather(source, edge, target):   // compute partial rank
  return target.value() / target.numEdges()
double sum(rank1, rank2):              // combine partial ranks
  return rank1 + rank2
double apply(sum, currentRank):        // update rank
  return 0.15 + 0.85*sum
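The three functions can be wired together in a small single-machine loop to see how the phases interact. The sketch below is an assumption, not the PowerGraph or Gelly implementation: the class `GsaPageRank`, the edge-array input, and the initial rank of 1.0 are made up for the example, and `apply` is run for every vertex, including those that gathered nothing.

```java
import java.util.Arrays;

// Hypothetical gather-sum-apply loop for PageRank on an edge list.
final class GsaPageRank {

    // Gather: partial rank contributed by one in-neighbor.
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    // Sum: associative and commutative combination of partial ranks.
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    // Apply: compute the new rank from the combined value.
    static double apply(double summed, double currentRank) {
        return 0.15 + 0.85 * summed;
    }

    // edges[i] = {source, target}; vertices are numbered 0..n-1.
    static double[] run(int n, int[][] edges, int iterations) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0);
        for (int i = 0; i < iterations; i++) {
            double[] acc = new double[n];
            for (int[] e : edges) {
                // Gather along each edge, then fold into the target's sum.
                acc[e[1]] = sum(acc[e[1]], gather(rank[e[0]], outDegree[e[0]]));
            }
            double[] next = new double[n];
            for (int v = 0; v < n; v++) {
                next[v] = apply(acc[v], rank[v]);
            }
            rank = next;
        }
        return rank;
    }
}
```

Because `sum` is associative and commutative, the per-edge gather results can be combined in any order, which is what lets a distributed engine pre-aggregate them per partition.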
PREGEL VS. SIGNAL-COLLECT VS. GSA
Model | Update Function Properties | Update Function Logic | Communication Scope | Communication Logic
Pregel | arbitrary | arbitrary | any vertex | arbitrary
Signal-Collect | arbitrary | based on received messages | any vertex | based on vertex state
GSA | associative & commutative | based on neighbors' values | neighborhood | based on vertex state
CAN WE HAVE IT ALL?
Data pipeline integration: built on top of an efficient distributed processing engine
Graph ETL: high-level API with abstractions and methods to transform graphs
Familiar programming model: support popular programming abstractions
Gelly: the Apache Flink Graph API
Apache Flink Stack
[Figure: the Apache Flink stack]
Libraries and compatibility layers: Gelly, Table/SQL, ML, SAMOA, Hadoop M/R, Dataflow, Cascading, CEP
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Remote, Yarn, Embedded
Meet Gelly: Java & Scala Graph APIs on top of Flink's DataSet API
Flink Core
Scala API (batch and streaming)
Java API (batch and streaming)
FlinkML, Gelly, Table API, ...
Transformations and Utilities
Iterative Graph Processing
Graph Library
Gelly is NOT
a graph database
a specialized graph processor
Hello, Gelly!
Java
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet edges = getEdgesDataSet(env);
Graph graph = Graph.fromDataSet(edges, env);
DataSet verticesWithMinIds = graph.run(
    new ConnectedComponents(maxIterations));
Scala
val env = ExecutionEnvironment.getExecutionEnvironment
val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
val graph = Graph.fromDataSet(edges, env)
val components = graph.run(new ConnectedComponents(maxIterations))
Graph Methods
Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees
Mutations: add vertex/edge, remove vertex/edge
Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets
Generators: R-Mat (power-law), Grid, Star, Complete
Example: mapVertices
val graph = Graph.fromDataSet(...)
// increment each vertex value by one
val updatedGraph = graph.mapVertices(v => v.getValue + 1)
[Figure: example graph showing the vertex values]
Example: subGraph
val graph: Graph[Long, Long, Long] = ...
// keep only vertices with positive values
// and only edges with negative values
val subGraph = graph.subgraph(
  vertex => vertex.getValue > 0,
  edge => edge.getValue < 0
)
Neighborhood Methods
Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel
graph.reduceOnNeighbors(
  new MinValue, EdgeDirection.OUT)
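To make the semantics of a neighborhood reduce concrete, the following plain-Java sketch folds the values of each vertex's out-neighbors with a user-supplied function. It does not use the Gelly API: the class `NeighborReduce`, the method name, and the choice to omit vertices without out-neighbors from the result are assumptions for this illustration only.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical single-machine model of a reduce over out-neighbors.
final class NeighborReduce {

    // values: vertex id -> vertex value (assumed present for every edge target).
    // edges[i] = {source, target}.
    static Map<Integer, Long> reduceOnOutNeighbors(
            Map<Integer, Long> values, int[][] edges, BinaryOperator<Long> reducer) {
        Map<Integer, Long> result = new TreeMap<>();
        for (int[] e : edges) {
            Long neighborValue = values.get(e[1]);
            // Fold the neighbor's value into the running result of the source.
            result.merge(e[0], neighborValue, reducer);
        }
        return result;
    }
}
```

Passing `Long::min` as the reducer mirrors the minimum-value example on the slide; any associative and commutative function works the same way.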
What makes Gelly unique?
Batch graph processing on top of a streaming dataflow engine
Built for end-to-end analytics
Support for multiple iteration abstractions
Graph algorithm building blocks
A large open-source library of graph algorithms
Why streaming dataflow?
Batch engines materialize data even if they don't have to: the graph is always loaded and materialized in memory, even if not needed, e.g. for mapping, filtering, transformation
Communication and computation overlap
We can do continuous graph processing (more after the break!)
End-to-end analytics
Graphs don't appear out of thin air
We need to support pre- and post-processing
Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program
Iterative Graph Processing
Gelly offers iterative graph processing abstractions on top of Flink's Delta iterations
vertex-centric
scatter-gather
gather-sum-apply
partition-centric*
Flink Iteration Operators
[Figure: bulk iteration (Input, Iterative Update Function, Replace, Result) and delta iteration (Workset, Solution Set, State, Iterative Update Function, Result)]
Optimization
the runtime is aware of the iterative execution
no scheduling overhead between iterations
caching and state maintenance are handled automatically
Push work out of the loop
Maintain state as index
Cache loop-invariant data
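The workset and solution-set idea behind a delta iteration can be illustrated on one machine with connected components by minimum-id propagation. The sketch below is only a model of the concept and does not use Flink's iteration operators: the class `DeltaIterationComponents`, the array-based solution set, and the undirected edge list are assumptions for the example. Only vertices whose value changed in the previous round stay in the workset, so work shrinks as the computation converges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of a delta iteration: connected components by min-id propagation.
final class DeltaIterationComponents {

    // edges[i] = {u, v} is treated as undirected; vertices are 0..n-1.
    static long[] run(int n, int[][] edges, int maxIterations) {
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            neighbors.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }

        long[] solution = new long[n];          // solution set: component id per vertex
        Set<Integer> workset = new TreeSet<>(); // workset: vertices that changed last round
        for (int v = 0; v < n; v++) {
            solution[v] = v;
            workset.add(v);
        }

        for (int i = 0; i < maxIterations && !workset.isEmpty(); i++) {
            // Only changed vertices propose their id to their neighbors.
            Map<Integer, Long> candidates = new HashMap<>();
            for (int v : workset) {
                for (int nb : neighbors.get(v)) {
                    candidates.merge(nb, solution[v], Long::min);
                }
            }
            // Update the solution set and build the next workset.
            Set<Integer> next = new TreeSet<>();
            for (Map.Entry<Integer, Long> c : candidates.entrySet()) {
                if (c.getValue() < solution[c.getKey()]) {
                    solution[c.getKey()] = c.getValue();
                    next.add(c.getKey());
                }
            }
            workset = next;
        }
        return solution;
    }
}
```

The loop stops as soon as the workset is empty, which corresponds to the convergence condition of a delta iteration; `maxIterations` is just a safety bound.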
Vertex-Centric SSSP
final class SSSPComputeFunction extends ComputeFunction {
  override def compute(vertex: Vertex, messages: MessageIterator) = {
    var minDistance = if (vertex.getId == srcId) 0 else Double.MaxValue
    while (messages.hasNext) {
      val msg = messages.next
      if (msg < minDistance) minDistance = msg
    }
    if (vertex.getValue > minDistance) {
      setNewVertexValue(minDistance)
      for (edge: Edge
Algorithm building blocks
Allow operator re-use across graph algorithms
when processing the same input wit