Apache Flink & Graph Processing

  • Batch & Stream Graph Processing with Apache Flink

    Vasia Kalavri

    @vkalavri

    Apache Flink Meetup London, October 5th, 2016

  • Graphs capture relationships between data items

    connections, interactions, purchases, dependencies, friendships, etc.

    Recommenders

    Social networks

    Bioinformatics

    Web search

  • Outline

    Distributed Graph Processing 101

    Gelly: Batch Graph Processing with Apache Flink

    BREAK!

    Gelly-Stream: Continuous Graph Processing with Apache Flink

  • Apache Flink: an open-source, distributed data analysis framework

    True streaming at its core

    Streaming & Batch API

    Event logs: Kafka, RabbitMQ, ...

    Historic data: HDFS, JDBC, ...

    Batch: ETL, graphs, machine learning, relational, ...

    Streaming: low latency, windowing, aggregations, ...

  • WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

  • MY GRAPH IS SO BIG, IT DOESN'T FIT IN A SINGLE MACHINE

    Big Data Ninja

    MISCONCEPTION #1

  • A SOCIAL NETWORK

  • NAIVE WHO(M)-TO-FOLLOW

    Naive Who(m) to Follow:

    compute a friends-of-friends list per user

    exclude existing friends

    rank by common connections
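
    A minimal sketch of the three steps above in plain Java (not Flink or Gelly code). The class name `NaiveWhoToFollow`, the method `recommend`, and the sample friendship map are hypothetical and chosen only for illustration; the friendship graph is assumed to be undirected and small enough to fit in memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NaiveWhoToFollow {
    // For one user: count common connections with each friend-of-friend,
    // skip the user and existing friends, then rank by that count (ties by id).
    static List<Integer> recommend(Map<Integer, Set<Integer>> friends, int user) {
        Set<Integer> direct = friends.getOrDefault(user, Set.of());
        Map<Integer, Integer> common = new HashMap<>();
        for (int friend : direct) {
            for (int candidate : friends.getOrDefault(friend, Set.of())) {
                if (candidate != user && !direct.contains(candidate)) {
                    common.merge(candidate, 1, Integer::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(common.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(common.get(b), common.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical sample data
        Map<Integer, Set<Integer>> friends = Map.of(
                1, Set.of(2, 3),
                2, Set.of(1, 4, 5),
                3, Set.of(1, 4),
                4, Set.of(2, 3),
                5, Set.of(2));
        System.out.println(recommend(friends, 1)); // prints [4, 5]
    }
}
```

    The nested loop visits every (friend, friend-of-friend) pair, so the intermediate candidate list can be much larger than the input graph itself.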

  • DON'T JUST CONSIDER YOUR INPUT GRAPH SIZE. INTERMEDIATE DATA MATTERS TOO!

  • DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE

    Data Science Rockstar

    MISCONCEPTION #2

  • GRAPHS DON'T APPEAR OUT OF THIN AIR

    Expectation

  • GRAPHS DON'T APPEAR OUT OF THIN AIR

    Reality!

  • WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?

    When you do have really big graphs

    When the intermediate data is big

    When your data is already distributed

    When you want to build end-to-end graph pipelines

  • HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?

  • RECENT DISTRIBUTED GRAPH PROCESSING HISTORY

    2004: MapReduce

    2009: Pegasus

    2010: Pregel, Signal-Collect

    2012: PowerGraph

    2013: Giraph++

    2014: NScale

    2015: Arabesque, Tinkerpop

    Computation styles on the timeline: iterative value propagation, graph traversals, ego-network analysis, pattern matching

  • PREGEL: THINK LIKE A VERTEX

    [Figure: example graph with vertices 1-5, shown next to per-vertex adjacency lists: 1 → 3, 4; 2 → 1, 4; 5 → 3; ...]

  • PREGEL: SUPERSTEPS

    (V_{i+1}, outbox) ← compute(V_i, inbox)

  • PAGERANK: THE WORD COUNT OF GRAPH PROCESSING

    VertexID | Out-degree | Transition Probability
    1 | 2 | 1/2
    2 | 2 | 1/2
    3 | 0 | -
    4 | 3 | 1/3
    5 | 1 | 1

    PR(3) = 0.5*PR(1) + 0.33*PR(4) + PR(5)

    [Figure: example graph with vertices 1-5]
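
    As a worked instance of the PR(3) formula, the sketch below plugs in the out-degrees from the table. The starting ranks of 0.2 per vertex are an assumption made only for this example (the slides do not state initial values), and the class and method names are hypothetical.

```java
import java.util.Locale;

class PageRankContribution {
    // Out-degrees from the table; index 0 is unused, vertex 3 has no outgoing edges.
    static final int[] OUT_DEGREE = {0, 2, 2, 0, 3, 1};

    // Rank arriving at vertex 3 from its in-neighbors 1, 4 and 5:
    // each neighbor passes on its rank divided by its out-degree.
    static double rankOfVertex3(double[] pr) {
        return pr[1] / OUT_DEGREE[1] + pr[4] / OUT_DEGREE[4] + pr[5] / OUT_DEGREE[5];
    }

    public static void main(String[] args) {
        // Assumed uniform starting ranks (index 0 unused)
        double[] pr = {0.0, 0.2, 0.2, 0.2, 0.2, 0.2};
        // 0.5*0.2 + (1/3)*0.2 + 1*0.2
        System.out.println(String.format(Locale.ROOT, "%.4f", rankOfVertex3(pr))); // prints 0.3667
    }
}
```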

  • PREGEL EXAMPLE: PAGERANK

    void compute(messages):
      sum = 0.0
      for (m
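
    The pseudocode on this slide is cut off in the transcript. The following is a sketch of how a vertex-centric PageRank with synchronous supersteps could look in plain Java; it is not the slide's code and not the Gelly API. It assumes the update rule 0.15 + 0.85*sum (the same constants as the GSA example), an initial rank of 1.0, and that every edge target also appears as a key of the adjacency map. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PregelStylePageRank {
    // out.get(v) lists the out-neighbors of v; runs a fixed number of supersteps.
    static Map<Integer, Double> run(Map<Integer, List<Integer>> out, int supersteps) {
        Map<Integer, Double> value = new HashMap<>();
        Map<Integer, List<Double>> inbox = new HashMap<>();
        for (int v : out.keySet()) {
            value.put(v, 1.0);                 // assumed initial rank
            inbox.put(v, new ArrayList<>());
        }
        for (int step = 0; step < supersteps; step++) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (int v : out.keySet()) {
                outbox.put(v, new ArrayList<>());
            }
            for (int v : out.keySet()) {
                // compute(): runs once per vertex in every superstep
                if (step > 0) {
                    double sum = 0.0;
                    for (double m : inbox.get(v)) {
                        sum += m;
                    }
                    value.put(v, 0.15 + 0.85 * sum);
                }
                // send rank / out-degree along every outgoing edge
                List<Integer> targets = out.get(v);
                for (int t : targets) {
                    outbox.get(t).add(value.get(v) / targets.size());
                }
            }
            inbox = outbox; // messages become visible in the next superstep
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical 3-cycle: 1 -> 2 -> 3 -> 1
        Map<Integer, List<Integer>> cycle =
                Map.of(1, List.of(2), 2, List.of(3), 3, List.of(1));
        System.out.println(run(cycle, 10).get(1));
    }
}
```

    Each vertex only reads its own inbox and writes to the outboxes of its out-neighbors, which is what lets the model run one `compute` call per vertex in parallel.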

  • SIGNAL-COLLECT

    outbox

  • SIGNAL-COLLECT EXAMPLE: PAGERANK

    void signal():
      for (edge

  • GATHER-SUM-APPLY (POWERGRAPH)

    [Figure: the Gather, Sum, and Apply phases on an example neighborhood, shown for Superstep i and Superstep i+1]

  • GSA EXAMPLE: PAGERANK

    // compute partial rank
    double gather(source, edge, target):
      return target.value() / target.numEdges()

    // combine partial ranks
    double sum(rank1, rank2):
      return rank1 + rank2

    // update rank
    double apply(sum, currentRank):
      return 0.15 + 0.85*sum
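
    The three functions can be tried out with a small plain-Java driver. This is only a sketch, not PowerGraph or Gelly code: the `superstep` method, the in-neighbor map, and the sample ranks are hypothetical, and the neighbor passed to `gather` is assumed to be the in-neighbor whose rank is being spread over its outgoing edges.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GsaPageRank {
    // gather: partial rank contributed by one neighbor
    static double gather(double neighborRank, int neighborOutDegree) {
        return neighborRank / neighborOutDegree;
    }

    // sum: associative and commutative combination of two partial ranks
    static double sum(double rank1, double rank2) {
        return rank1 + rank2;
    }

    // apply: new rank from the combined partial ranks (currentRank is unused here)
    static double apply(double sum, double currentRank) {
        return 0.15 + 0.85 * sum;
    }

    // One superstep; in.get(v) lists the in-neighbors of v.
    static Map<Integer, Double> superstep(Map<Integer, List<Integer>> in,
                                          Map<Integer, Integer> outDegree,
                                          Map<Integer, Double> rank) {
        Map<Integer, Double> next = new HashMap<>();
        for (int v : in.keySet()) {
            double total = 0.0;
            for (int u : in.get(v)) {
                total = sum(total, gather(rank.get(u), outDegree.get(u)));
            }
            next.put(v, apply(total, rank.get(v)));
        }
        return next;
    }

    public static void main(String[] args) {
        // Vertex 3 of the PageRank example: in-neighbors 1, 4, 5 with out-degrees 2, 3, 1
        Map<Integer, List<Integer>> in = Map.of(3, List.of(1, 4, 5));
        Map<Integer, Integer> outDegree = Map.of(1, 2, 4, 3, 5, 1);
        Map<Integer, Double> rank = Map.of(1, 1.0, 3, 1.0, 4, 1.0, 5, 1.0); // assumed ranks
        System.out.println(superstep(in, outDegree, rank).get(3));
    }
}
```

    Because `sum` is associative and commutative, the partial ranks can be combined in any order, which is what allows the gather results to be pre-aggregated per partition.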

  • PREGEL VS. SIGNAL-COLLECT VS. GSA

    Model | Update Function Properties | Update Function Logic | Communication Scope | Communication Logic
    Pregel | arbitrary | arbitrary | any vertex | arbitrary
    Signal-Collect | arbitrary | based on received messages | any vertex | based on vertex state
    GSA | associative & commutative | based on neighbors' values | neighborhood | based on vertex state

  • CAN WE HAVE IT ALL?

    Data pipeline integration: built on top of an efficient distributed processing engine

    Graph ETL: high-level API with abstractions and methods to transform graphs

    Familiar programming model: support popular programming abstractions

  • Gelly: the Apache Flink Graph API

  • Apache Flink Stack

    [Figure: the Apache Flink stack. Libraries and compatibility layers: Gelly, Table/SQL, ML, SAMOA, Hadoop M/R, Dataflow, Cascading, CEP. Core APIs: DataSet (Java/Scala), DataStream (Java/Scala). Runtime: streaming dataflow runtime. Deployment: Local, Remote, Yarn, Embedded]

  • Meet Gelly: Java & Scala Graph APIs on top of Flink's DataSet API

    Flink Core

    Scala API (batch and streaming)

    Java API (batch and streaming)

    FlinkML, Gelly, Table API, ...

    Transformations and Utilities

    Iterative Graph Processing

    Graph Library

  • Gelly is NOT

    a graph database

    a specialized graph processor

  • Hello, Gelly!

    Java

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Edge<Long, NullValue>> edges = getEdgesDataSet(env);
    Graph graph = Graph.fromDataSet(edges, env);
    DataSet verticesWithMinIds = graph.run(
        new ConnectedComponents(maxIterations));

    Scala

    val env = ExecutionEnvironment.getExecutionEnvironment
    val edges: DataSet[Edge[Long, NullValue]] = getEdgesDataSet(env)
    val graph = Graph.fromDataSet(edges, env)
    val components = graph.run(new ConnectedComponents(maxIterations))

  • Graph Methods

    Graph Properties: getVertexIds, getEdgeIds, numberOfVertices, numberOfEdges, getDegrees

    Mutations: add vertex/edge, remove vertex/edge

    Transformations: map, filter, join, subgraph, union, difference, reverse, undirected, getTriplets

    Generators: R-Mat (power-law), Grid, Star, Complete

  • Example: mapVertices

    val graph = Graph.fromDataSet(...)
    // increment each vertex value by one
    val updatedGraph = graph.mapVertices(v => v.getValue + 1)

    [Figure: example graph with vertex values]

  • Example: subGraph

    val graph: Graph[Long, Long, Long] = ...
    // keep only vertices with positive values
    // and only edges with negative values
    val subGraph = graph.subgraph(
      vertex => vertex.getValue > 0,
      edge => edge.getValue < 0
    )

  • Neighborhood Methods

    Apply a reduce function to the 1st-hop neighborhood of each vertex in parallel

    graph.reduceOnNeighbors(
      new MinValue, EdgeDirection.OUT)
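
    To make the semantics concrete, here is a plain-Java sketch of what a minimum reduction over out-neighbors computes. It does not use Gelly; `reduceOnOutNeighbors` and the sample data are hypothetical, and in this sketch a vertex without out-neighbors simply produces no result.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

class NeighborhoodReduce {
    // For every vertex with at least one out-neighbor, combine the
    // neighbors' values pairwise with the given reduce function.
    static Map<Integer, Long> reduceOnOutNeighbors(Map<Integer, List<Integer>> out,
                                                   Map<Integer, Long> value,
                                                   BinaryOperator<Long> reducer) {
        Map<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : out.entrySet()) {
            e.getValue().stream()
                    .map(value::get)
                    .reduce(reducer)
                    .ifPresent(r -> result.put(e.getKey(), r));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3
        Map<Integer, List<Integer>> out =
                Map.of(1, List.of(2, 3), 2, List.of(3), 3, List.<Integer>of());
        Map<Integer, Long> value = Map.of(1, 10L, 2, 7L, 3, 9L);
        System.out.println(reduceOnOutNeighbors(out, value, Long::min)); // prints {1=7, 2=9}
    }
}
```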

  • What makes Gelly unique?

    Batch graph processing on top of a streaming dataflow engine

    Built for end-to-end analytics

    Support for multiple iteration abstractions

    Graph algorithm building blocks

    A large open-source library of graph algorithms

  • Why streaming dataflow?

    Batch engines materialize data even if they don't have to: the graph is always loaded and materialized in memory, even if not needed, e.g. for mapping, filtering, transformation

    Communication and computation overlap

    We can do continuous graph processing (more after the break!)

  • End-to-end analytics

    Graphs don't appear out of thin air

    We need to support pre- and post-processing

    Gelly can be easily mixed with the DataSet API: pre-processing, graph analysis, and post-processing in the same Flink program

  • Iterative Graph Processing

    Gelly offers iterative graph processing abstractions on top of Flink's delta iterations

    vertex-centric

    scatter-gather

    gather-sum-apply

    partition-centric*

  • Flink Iteration Operators

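    [Figure: two iteration operators. Bulk iteration: Input → Iterative Update Function → Result, with the result replacing the input of the next iteration. Delta iteration: Workset → Iterative Update Function → Result, with the Solution Set kept as state]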

  • Optimization

    the runtime is aware of the iterative execution

    no scheduling overhead between iterations

    caching and state maintenance are handled automatically

    Push work out of the loop

    Maintain state as index

    Cache loop-invariant data
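
    The workset and solution-set idea behind delta iterations can be illustrated without Flink. The sketch below computes connected components by propagating minimum vertex ids and only revisits vertices whose value changed in the previous round. It is a single-machine illustration under the assumptions that the graph is undirected and every neighbor appears as a key of the adjacency map; the class and method names are hypothetical, and this is not how Flink implements the operator internally.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DeltaIterationComponents {
    // neighbors: undirected adjacency lists.
    // Returns, per vertex, the smallest vertex id in its component.
    static Map<Integer, Integer> run(Map<Integer, List<Integer>> neighbors) {
        // Solution set: the state that is updated in place across iterations
        Map<Integer, Integer> solutionSet = new HashMap<>();
        for (int v : neighbors.keySet()) {
            solutionSet.put(v, v);
        }
        // Workset: vertices whose component id changed in the last iteration
        Set<Integer> workset = new TreeSet<>(neighbors.keySet());
        while (!workset.isEmpty()) {
            Set<Integer> nextWorkset = new TreeSet<>();
            for (int v : workset) {
                int id = solutionSet.get(v);
                for (int n : neighbors.get(v)) {
                    if (id < solutionSet.get(n)) {
                        solutionSet.put(n, id);  // update only the changed entry
                        nextWorkset.add(n);      // and schedule it for the next round
                    }
                }
            }
            workset = nextWorkset;
        }
        return solutionSet;
    }

    public static void main(String[] args) {
        // Hypothetical graph with two components: {1, 2, 3} and {4, 5}
        Map<Integer, List<Integer>> neighbors = Map.of(
                1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
                4, List.of(5), 5, List.of(4));
        System.out.println(run(neighbors).get(3)); // prints 1
    }
}
```

    The loop ends as soon as the workset is empty, so vertices that have already converged cost nothing in later iterations.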

  • Vertex-Centric SSSP

    final class SSSPComputeFunction extends ComputeFunction {
      override def compute(vertex: Vertex, messages: MessageIterator) = {
        var minDistance = if (vertex.getId == srcId) 0 else Double.MaxValue
        while (messages.hasNext) {
          val msg = messages.next
          if (msg < minDistance) minDistance = msg
        }
        if (vertex.getValue > minDistance) {
          setNewVertexValue(minDistance)
          for (edge: Edge
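
    The slide's code is cut off after the start of the edge loop. The sketch below mirrors the visible logic in plain Java and adds a message-sending step (candidate distance = new distance + edge weight) that is an assumption, not recovered from the slide. It does not use the Gelly API; names are hypothetical, edge weights are assumed non-negative, and every edge target is assumed to be a key of the edge map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class VertexCentricSssp {
    // edges.get(v) maps each out-neighbor of v to the weight of the edge v -> neighbor.
    static Map<Integer, Double> run(Map<Integer, Map<Integer, Double>> edges, int srcId) {
        Map<Integer, Double> distance = new HashMap<>();
        Map<Integer, List<Double>> inbox = new HashMap<>();
        for (int v : edges.keySet()) {
            distance.put(v, Double.MAX_VALUE);
            inbox.put(v, new ArrayList<>());
        }
        boolean firstSuperstep = true;
        boolean messagesSent = true;
        while (messagesSent) {
            messagesSent = false;
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (int v : edges.keySet()) {
                outbox.put(v, new ArrayList<>());
            }
            for (int v : edges.keySet()) {
                // after the first superstep only vertices with messages are active
                if (!firstSuperstep && inbox.get(v).isEmpty()) {
                    continue;
                }
                // same logic as the visible part of the slide's compute()
                double minDistance = (v == srcId) ? 0.0 : Double.MAX_VALUE;
                for (double msg : inbox.get(v)) {
                    minDistance = Math.min(minDistance, msg);
                }
                if (distance.get(v) > minDistance) {
                    distance.put(v, minDistance);
                    // assumed continuation: offer a candidate distance to each out-neighbor
                    for (Map.Entry<Integer, Double> e : edges.get(v).entrySet()) {
                        outbox.get(e.getKey()).add(minDistance + e.getValue());
                        messagesSent = true;
                    }
                }
            }
            inbox = outbox;
            firstSuperstep = false;
        }
        return distance;
    }

    public static void main(String[] args) {
        // Hypothetical weighted graph: 1 -> 2 (1.0), 1 -> 3 (5.0), 2 -> 3 (1.0)
        Map<Integer, Map<Integer, Double>> edges = Map.of(
                1, Map.of(2, 1.0, 3, 5.0),
                2, Map.of(3, 1.0),
                3, Map.<Integer, Double>of());
        System.out.println(run(edges, 1).get(3)); // prints 2.0
    }
}
```

    Unreachable vertices keep the value `Double.MAX_VALUE`, matching the initial value used in the slide.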

  • Algorithms building blocks

    Allow operator re-use across graph algorithms

    when processing the same input wit