Gearpump akka streams


Transcript of Gearpump akka streams

Page 1: Gearpump akka streams

Implementing an akka-streams materializer for big data

The Gearpump Materializer

Kam Kasravi

Page 2: Gearpump akka streams

Technical Presentation

● Familiarity with the akka-streams flow and graph DSLs
● Familiarity with big data and real-time streaming platforms
● Familiarity with Scala

● Effort between the akka-streams and Gearpump teams started late last year
● Resulted in a number of pull requests into akka-streams to enable different materializers
● Close to completion, with good support of the akka-streams DSL (all GraphStages)
● Fairly seamless to switch between local and distributed

Page 3: Gearpump akka streams

Who am I?

● Committer on Apache Gearpump (incubating)
  - http://gearpump.apache.org
● Architect on Trusted Analytics Platform (TAP)
  - http://trustedanalytics.org
● Lead or Architect across many companies, industries
  - NYSE, eBay, PayPal, Yahoo, ...


Page 4: Gearpump akka streams

What is Apache Gearpump?

● Accepted into Apache incubator last March
● Similar to Apache Beam and Apache Flink (real-time message delivery)
● Heavily leverages the actor model and akka (more so than others)
● Unique features like dynamic DAG
● Excellent runtime visualization tooling of cluster and application DAGs
● One of the best big data performance profiles (both throughput and latency)

Page 5: Gearpump akka streams

Agenda

● Why?
  ○ Why integrate akka-streams into a big data platform?
● Big Data platform evolving features
  ○ Functionality big data platforms are embracing
● Prerequisites needed for any Big Data platform
  ○ Minimal features a big data platform must have
● Big data platform integration challenges
  ○ What concepts do not map well within big data platforms?
● Object models: akka-streams, Gearpump
● Materialization
  ○ ActorMaterializer - materializing the module tree
  ○ GearpumpMaterializer - rewriting the module tree

Page 6: Gearpump akka streams

Why?

● Akka-streams has limitations inherent within a single JVM
  ○ Throughput and latency are key big data features that require scaling beyond a single JVM
● Akka-streams DSL is a superset of other big data platform DSLs
  ○ Has a logical plan (declarative) that can be transformed to an execution plan (runtime)
● Akka-streams programming paradigm is declarative, composable, extensible*, stackable* and reusable*

* Provides a level of extensibility and functionality beyond most big data platform DSLs
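
As an illustration of "declarative" and "composable", here is a minimal sketch, assuming an Akka 2.4/2.5-era akka-streams on the classpath (the same API generation the later examples use). The object and value names are made up for this example. The flows are plain values until a materializer runs them.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object BlueprintDemo {
  // Declarative building blocks: nothing runs while these are defined.
  val double = Flow[Int].map(_ * 2)
  val increment = Flow[Int].map(_ + 1)

  // Composable: one Flow is attached to another, yielding a new Flow.
  val pipeline = double.via(increment)

  // Materialization turns the blueprint into a running stream.
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("blueprint-demo")
    implicit val materializer: ActorMaterializer = ActorMaterializer()
    try Await.result(Source(1 to 3).via(pipeline).runWith(Sink.seq), 5.seconds)
    finally system.terminate()
  }

  def main(args: Array[String]): Unit = println(run()) // Vector(3, 5, 7)
}
```

The same `pipeline` value can be attached to any other Source or Sink, which is the reuse the later slides rely on.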

Page 7: Gearpump akka streams

Extensible

● Extend GraphStage
● Extend Source, Sink, Flow or BidiFlow
● All derive from Graph


Page 8: Gearpump akka streams

Stackable

● Another term for nestable or recursive. Reference to Kleisli (theoretical).
● Source, Sink, Flow or BidiFlow may contain their own topologies


Page 9: Gearpump akka streams

Reusable

● Graph topologies can be attached anywhere (any Graph)
● Recent akka-streams feature is dynamic attachment via hubs
● Hubs will take advantage of Gearpump dynamic DAG within the GearpumpMaterializer


Page 10: Gearpump akka streams

Big Data platform evolving features (1)

● Big data platforms are moving to consolidate disparate APIs
  ○ Too many APIs: Concord, Flink, Heron, Pulsar, Spark, Storm, Samza
  ○ Common DSL is also an approach being taken by Apache Beam
  ○ Analogy to SQL - common grammar that different platforms execute

Page 11: Gearpump akka streams

Big Data platform evolving features (2)

● Big data platforms will increasingly require dynamic pipelines that are compositional and reusable
● Examples include:
  ○ Machine learning
  ○ IoT sensors

Page 12: Gearpump akka streams

Big Data platform evolving features (3)

● Machine learning use cases
  ○ Replace or update scoring models
  ○ Model Ensembles
    ■ concept drift
    ■ data drift

Page 13: Gearpump akka streams

Big Data platform evolving features (4)

● IoT use cases
  ○ Bring new sensors online with no interruption
  ○ Change or update configuration parameters at remote sensors

Page 14: Gearpump akka streams

Prerequisites needed for any Big Data platform (1)

1. Push and Pull
  ○ Downstream must be able to pull
  ○ Upstream must be able to push

2. Backpressure
  ○ Downstream must be able to backpressure all the way to source
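
The pull side of these prerequisites can be sketched with a toy, dependency-free model. Everything here (`DemandDrivenSource`, `request`, `drain`) is hypothetical and is neither akka-streams nor Gearpump API. It only illustrates that when downstream signals demand, upstream never emits more than was asked for, which is the essence of backpressure.

```scala
object BackpressureSketch {
  // Hypothetical toy source: emits at most as many elements as were requested.
  final class DemandDrivenSource(elements: Iterator[Int]) {
    def request(n: Int): List[Int] = {
      val batch = List.newBuilder[Int]
      var remaining = n
      while (remaining > 0 && elements.hasNext) {
        batch += elements.next()
        remaining -= 1
      }
      batch.result()
    }
  }

  // A slow downstream pulls in small batches; the source is never ahead of demand.
  def drain(source: DemandDrivenSource, batchSize: Int): List[List[Int]] = {
    val batches = List.newBuilder[List[Int]]
    var batch = source.request(batchSize)
    while (batch.nonEmpty) {
      batches += batch
      batch = source.request(batchSize)
    }
    batches.result()
  }

  def main(args: Array[String]): Unit =
    // Prints List(List(1, 2), List(3, 4), List(5))
    println(drain(new DemandDrivenSource(Iterator.range(1, 6)), batchSize = 2))
}
```

In a distributed setting the same demand signal has to travel across machine boundaries back to the source, which is why it is listed as a platform prerequisite.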

Page 15: Gearpump akka streams

Prerequisites needed for any Big Data platform (2)

3. Parallelization

4. Asynchronous

5. Bidirectional

Page 16: Gearpump akka streams

Big data platform integration challenges (1)

A number of GraphStages have completion or cancellation semantics. Big data pipelines are often infinite streams and do not complete. Cancel is often viewed as a failure.

● Balance[T]
● Completion[T]
● Merge[T]
● Split[T]

Page 17: Gearpump akka streams

Big data platform integration challenges (2)

A number of GraphStages have specific upstream and downstream ordering and timing directives.

● Batch[T]
● Concat[T]
● Delay[T]
● DelayInitial[T]
● Interleave[T]

Page 18: Gearpump akka streams

Big data platform integration challenges (3)

The async attribute as well as fusing do not map cleanly when distributing GraphStage functionality across machines.

● Graph.async
● Fusing

Page 19: Gearpump akka streams

Graph.async

● Collapses multiple operations (GraphStageLogic) into one actor
● Distributed scenarios where one may want actors within the same JVM or on the same machine

Page 20: Gearpump akka streams

Fusing

● Creates one or more islands delimited by async boundaries
● For distributed scenario no fusing should occur until the materializer can evaluate and optimize the execution plan
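
For reference, this is how an async boundary is declared with the local ActorMaterializer. It is a minimal sketch, assuming an Akka 2.4/2.5-era akka-streams; the object name is made up for this example. The boundary changes how stages are grouped into islands, not what the stream computes.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object AsyncBoundaryDemo {
  def run(): (Seq[Int], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("async-demo")
    implicit val materializer: ActorMaterializer = ActorMaterializer()
    try {
      // No boundary: the two map stages are eligible to be fused together.
      val fused = Source(1 to 3).map(_ + 1).map(_ * 2).runWith(Sink.seq)
      // .async marks a boundary after the first map, splitting the stream into islands.
      val split = Source(1 to 3).map(_ + 1).async.map(_ * 2).runWith(Sink.seq)
      (Await.result(fused, 5.seconds), Await.result(split, 5.seconds))
    } finally system.terminate()
  }

  // Both results are Vector(4, 6, 8); only the execution layout differs.
  def main(args: Array[String]): Unit = println(run())
}
```

A distributed materializer has to decide what such a boundary means when the islands may end up on different machines.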

Page 21: Gearpump akka streams

Object Models

● Akka-streams' GraphStage, Module, Shape
● Gearpump's Graph, Task, Partitioner

Page 22: Gearpump akka streams

Akka-streams Object Model

↪ Base type is a Graph. Common base type is a GraphStage
↪ Graph contains a
  ↳ Module contains a
    ↳ Shape
↪ Only a RunnableGraph can be materialized
↪ A RunnableGraph needs at least one Source and one Sink

Page 23: Gearpump akka streams

Akka-streams Graph[S, M]

● Graph is parameterized by
  ○ Shape
  ○ Materialized Value
● Graph contains a Module, which contains a Shape
  ○ Module is where the runtime is constructed and manipulated
● Graph's first level subtypes provide basic functionality
  ○ Source
  ○ Sink
  ○ Flow
  ○ BidiFlow

[Diagram: Graph, parameterized by S and M, with subtypes Source, Sink, Flow and BidiFlow, together with Module and Shape]

Page 24: Gearpump akka streams

GraphStage[S <: Shape]

[Diagram: class diagram relating Graph, GraphStageWithMaterializedValue, GraphStage, GraphStageModule and Module]

Page 25: Gearpump akka streams

GraphStage[S <: Shape] subtypes (incomplete)

↳ Balance[T]
↳ Batch[In, Out]
↳ Broadcast[T]
↳ Collect[In, Out]
↳ Concat[T]
↳ DelayInitial[T]
↳ DropWhile[T]
↳ Expand[In, Out]
↳ FlattenMerge[T, M]
↳ Fold[In, Out]
↳ FoldAsync[T]
↳ FutureSource[T]
↳ GroupBy[T, K]
↳ Grouped[T]
↳ GroupedWithin[T]
↳ Interleave[T]
↳ Intersperse[T]
↳ LimitWeighted[T]
↳ Map[In, Out]
↳ MapAsync[In, Out]
↳ Merge[T]
↳ MergePreferred[T]
↳ MergeSorted[T]
↳ OrElse[T]
↳ Partition[T]
↳ PrefixAndTail[T]
↳ Recover[T]
↳ Scan[In, Out]
↳ SimpleLinearGraph[T]
↳ Sliding[T]

Page 26: Gearpump akka streams

What about Module?

● Module is a recursive structure containing a Set[Module]
● Module is a declarative data structure used as the AST
● Module is used to represent a graph of nodes and edges from the original GraphStages
● Module contains downstream and upstream ports (edges)
● Materializers walk the module tree to create and run instances of publishers and subscribers.
● Each publisher and subscriber is an actor (ActorGraphInterpreter)
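
Walking a recursive module structure can be sketched with a toy tree. `ModuleNode` and `leaves` are hypothetical and are not akka's internal Module type; a List is used instead of a Set only to keep the traversal order deterministic for the example.

```scala
object ModuleTreeSketch {
  // Hypothetical stand-in for a module: a name plus nested submodules.
  final case class ModuleNode(name: String, subModules: List[ModuleNode] = Nil)

  // Depth-first walk that collects the atomic (leaf) modules, as a materializer would visit them.
  def leaves(module: ModuleNode): List[String] =
    if (module.subModules.isEmpty) List(module.name)
    else module.subModules.flatMap(leaves)

  val example: ModuleNode = ModuleNode("RunnableGraph", List(
    ModuleNode("SingleSource"),
    ModuleNode("Composite", List(
      ModuleNode("Broadcast"), ModuleNode("Map"), ModuleNode("Merge"))),
    ModuleNode("ActorRefSink")))

  def main(args: Array[String]): Unit =
    // Prints List(SingleSource, Broadcast, Map, Merge, ActorRefSink)
    println(leaves(example))
}
```

A real materializer does more at each leaf (it instantiates the stage logic and wires ports), but the traversal has this recursive shape.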

Page 27: Gearpump akka streams

Gearpump Object Model

↪ Graph[Node, Edge] holds
  ↳ Tasks (Node)
  ↳ Partitioners (Edge)
↪ This is a Gearpump Graph, not to be confused with akka-streams Graph.

Page 28: Gearpump akka streams

Gearpump Graph[N<:Task, E<:Partitioner]

● Graph is parameterized by
  ○ Node - must be a subtype of Task
  ○ Edge - must be a subtype of Partitioner

[Diagram: Graph, parameterized by N and E, with List[Task] and List[Partitioner]]
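
The shape of such a graph can be sketched without any Gearpump dependency. All types below (`Task`, `Partitioner`, `Graph`, `downstreamOf`) are hypothetical stand-ins written for this example, not Gearpump's classes; they only mirror the bounds `N <: Task` and `E <: Partitioner` from the slide.

```scala
object GearpumpGraphSketch {
  // Hypothetical stand-ins; Gearpump's real Task and Partitioner carry far more behaviour.
  trait Task { def name: String }
  trait Partitioner { def name: String }
  final case class NamedTask(name: String) extends Task
  final case class NamedPartitioner(name: String) extends Partitioner

  // Nodes are tasks; each edge joins two tasks through a partitioner.
  final case class Graph[N <: Task, E <: Partitioner](nodes: List[N], edges: List[(N, E, N)]) {
    def downstreamOf(node: N): List[N] =
      edges.collect { case (from, _, to) if from == node => to }
  }

  val broadcast = NamedTask("BroadcastTask")
  val mapA = NamedTask("TransformTask-A")
  val mapB = NamedTask("TransformTask-B")
  val shuffle = NamedPartitioner("shuffle")

  val graph = Graph(
    List(broadcast, mapA, mapB),
    List((broadcast, shuffle, mapA), (broadcast, shuffle, mapB)))

  def main(args: Array[String]): Unit =
    // Prints List(TransformTask-A, TransformTask-B)
    println(graph.downstreamOf(broadcast).map(_.name))
}
```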

Page 29: Gearpump akka streams

Task

[Diagram: class diagram with Task and GraphTask]

Page 30: Gearpump akka streams

GraphTask subtypes (incomplete)

↳ BalanceTask
↳ BatchTask[In, Out]
↳ BroadcastTask[T]
↳ CollectTask[In, Out]
↳ ConcatTask
↳ DelayInitialTask[T]
↳ DropWhileTask[T]
↳ ExpandTask[In, Out]
↳ FlattenMerge[T, M]
↳ FoldTask[In, Out]
↳ FutureSourceTask[T]
↳ GroupByTask[T, K]
↳ GroupedTask[T]
↳ GroupedWithinTask[T]
↳ InterleaveTask[T]
↳ IntersperseTask[T]
↳ LimitWeightedTask[T]
↳ MapTask[In, Out]
↳ MapAsyncTask[In, Out]
↳ MergeTask[T]
↳ OrElseTask[T]
↳ PartitionTask[T]
↳ PrefixAndTailTask[T]
↳ RecoverTask[T]
↳ ScanTask[In, Out]
↳ SlidingTask[T]

Page 31: Gearpump akka streams

Materializer Variations

1. AST (module tree) is matched for every module type (GearpumpMaterializer)
2. AST (module tree) is matched for certain module types
  ○ After distribution - local ActorMaterializer is used for operations on that worker
  ○ Materializer works more as a distribution coordinator

Page 32: Gearpump akka streams

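Example 1

[Diagram: Source → Broadcast → (Flow, Flow) → Merge → Sink]

import akka.NotUsed
import akka.actor.{ActorSystem, Props}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}
import akka.stream.scaladsl.GraphDSL.Implicits._

// The slide assumes an ActorSystem named `system` is in scope; the system name here is arbitrary.
implicit val system = ActorSystem("example1")
implicit val materializer = ActorMaterializer()

val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source((1 to 5))
val sink = Sink.actorRef(sinkActor, "COMPLETE")

val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}

val graph = RunnableGraph.fromGraph(GraphDSL.create() {
  implicit b =>
    val broadcast = b.add(Broadcast[Int](2))
    // Fan-in stage: a Merge with two inputs, not a second Broadcast
    val merge = b.add(Merge[Int](2))
    source ~> broadcast
    broadcast ~> flowA ~> merge
    broadcast ~> flowB ~> merge
    merge ~> sink
    ClosedShape
})

graph.run()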

Page 33: Gearpump akka streams

Example 1

[Code identical to Page 32. Diagram: Source → Broadcast → (Flow, Flow) → Merge → Sink, all GraphStages]

class SinkActor extends Actor {
  def receive: Receive = {
    case any: Any =>
      println(s"Confirm received: $any")
  }
}

Page 34: Gearpump akka streams

Example 1

[Diagram: the GraphStages Source → Broadcast → (Flow, Flow) → Merge → Sink and the corresponding Module Tree]

Module Tree

● GraphStageModule (stage=SingleSource)
● GraphStageModule (stage=StatefulMapConcat)
● GraphStageModule (stage=Broadcast)
● GraphStageModule (stage=Map)
● GraphStageModule (stage=Merge)
● ActorRefSink

Page 35: Gearpump akka streams

Example 1

[Code identical to Page 32. Diagram, labeled with the value names: source → broadcast → (flowA, flowB) → merge → sink, all GraphStages]

Page 36: Gearpump akka streams

Example 1

ActorMaterializer Output

processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowA
Confirm received: 1
Confirm received: 1
processing broadcasted element : 2 in flowB
Confirm received: 2
Confirm received: 2
processing broadcasted element : 3 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 4 in flowB
Confirm received: 3
Confirm received: 3
processing broadcasted element : 5 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5
Confirm received: COMPLETE

Page 37: Gearpump akka streams

Example 1

implicit val materializer = GearpumpMaterializer()

[The rest of the code is identical to the ActorMaterializer version on Page 32; only the materializer line changes. Diagram: source → broadcast → (flowA, flowB) → merge → sink, all GraphStages]

Page 38: Gearpump akka streams

Example 1

GearpumpMaterializer Output

processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowB
processing broadcasted element : 2 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 3 in flowA
processing broadcasted element : 4 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 1
processing broadcasted element : 5 in flowA
Confirm received: 1
Confirm received: 2
Confirm received: 2
Confirm received: 3
Confirm received: 3
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5

Page 39: Gearpump akka streams

Demo

GraphStageModule(stage=SingleSource)

ActorRefSink

GraphStageModule(stage=StatefulMapConcat)

GraphStageModule(stage=Broadcast)

GraphStageModule(stage=Map)

GraphStageModule(stage=Merge)

Page 40: Gearpump akka streams

ActorMaterializer

GraphStageModule(stage=SingleSource)

ActorRefSink

GraphStageModule(stage=StatefulMapConcat)

GraphStageModule(stage=Broadcast)

GraphStageModule(stage=Map)

GraphStageModule(stage=Merge)

1. Traverses the Module Tree

Page 41: Gearpump akka streams

ActorMaterializer

2. Builds a runtime graph of BoundaryPublisher and BoundarySubscribers (Reactive API).

3. Each Publisher or Subscriber contains an instance of GraphStageLogic specific to that GraphStage.

4. Each Publisher or Subscriber also contains an instance of ActorGraphInterpreter - an Actor that manages the message flow using GraphStageLogic.

Page 42: Gearpump akka streams

GearpumpMaterializer

GraphStageModule(stage=SingleSource)

ActorRefSink

GraphStageModule(stage=Broadcast)

GraphStageModule(stage=Map)

GraphStageModule(stage=Merge)

1. Rewrites the Module Tree into ‘local’ and ‘remote’ Gearpump Graphs.

GraphStageModule(stage=StatefulMapConcat)

Page 43: Gearpump akka streams

GearpumpMaterializer

GraphStageModule(stage=SingleSource)

ActorRefSink

GraphStageModule(stage=Broadcast)

GraphStageModule(stage=Map)

GraphStageModule(stage=Merge)

2. Choice of ‘local’ and ‘remote’ is determined by a ‘Strategy’. The default Strategy is to put Source and Sink types in local

GraphStageModule(stage=StatefulMapConcat)
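
Steps 1 and 2 can be sketched as a partition of the modules driven by a pluggable strategy. This is a toy model: `Placement`, `ModuleInfo`, `Strategy` and `split` are hypothetical names for this example and are not the GearpumpMaterializer's actual types. The default below follows the slide: Source and Sink types stay local, everything else goes remote.

```scala
object StrategySketch {
  sealed trait Placement
  case object Local extends Placement
  case object Remote extends Placement

  // Hypothetical description of a module: its stage name and whether it is a Source or Sink type.
  final case class ModuleInfo(name: String, isSourceOrSink: Boolean)

  type Strategy = ModuleInfo => Placement

  // Default from the slide: Source and Sink types are placed in the local graph.
  val defaultStrategy: Strategy = m => if (m.isSourceOrSink) Local else Remote

  // Returns (local modules, remote modules).
  def split(modules: List[ModuleInfo], strategy: Strategy): (List[ModuleInfo], List[ModuleInfo]) =
    modules.partition(m => strategy(m) == Local)

  val example1: List[ModuleInfo] = List(
    ModuleInfo("SingleSource", isSourceOrSink = true),
    ModuleInfo("StatefulMapConcat", isSourceOrSink = false),
    ModuleInfo("Broadcast", isSourceOrSink = false),
    ModuleInfo("Map", isSourceOrSink = false),
    ModuleInfo("Merge", isSourceOrSink = false),
    ModuleInfo("ActorRefSink", isSourceOrSink = true))

  def main(args: Array[String]): Unit = {
    val (local, remote) = split(example1, defaultStrategy)
    println(local.map(_.name))  // List(SingleSource, ActorRefSink)
    println(remote.map(_.name)) // List(StatefulMapConcat, Broadcast, Map, Merge)
  }
}
```

Swapping in a different `Strategy` function changes the placement without touching the traversal, which is the point of making it a separate concept.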

Page 44: Gearpump akka streams

GearpumpMaterializer

ActorRefSink

3. Inserts BridgeModules into both Graphs

SourceBridgeModule

SinkBridgeModule

SinkBridgeModule

GraphStageModule(stage=Broadcast)

GraphStageModule(stage=Map)

GraphStageModule(stage=Merge)

GraphStageModule(stage=StatefulMapConcat)

GraphStageModule(stage=SingleSource)

SourceBridgeModule

Page 45: Gearpump akka streams

GearpumpMaterializer

ActorRefSink

4. Local graph is passed to a LocalGraphMaterializer

SinkBridgeModule

GraphStageModule(stage=SingleSource)

SourceBridgeModule

LocalGraphMaterializer is a variant (subtype) of ActorMaterializer

Page 46: Gearpump akka streams

GearpumpMaterializer

5. Converts the remote graph’s Modules into Tasks

SourceBridgeTask

SinkBridgeTask

BroadcastTask

TransformTask

MergeTask

StatefulMapConcatTask
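
The conversion in step 5 can be pictured as a pattern match from stage type to task type. The sketch below is a toy that works on names only; `taskFor` is hypothetical, and the pairing of Map with TransformTask is inferred from the task names on this slide rather than taken from the Gearpump source.

```scala
object TaskConversionSketch {
  // Maps a remote module's stage name to the task name shown on the slide.
  // Stages without a known counterpart yield None instead of failing.
  def taskFor(stage: String): Option[String] = stage match {
    case "Broadcast"         => Some("BroadcastTask")
    case "Map"               => Some("TransformTask") // inferred pairing
    case "Merge"             => Some("MergeTask")
    case "StatefulMapConcat" => Some("StatefulMapConcatTask")
    case _                   => None
  }

  def main(args: Array[String]): Unit = {
    val remoteStages = List("StatefulMapConcat", "Broadcast", "Map", "Merge")
    // Prints List(StatefulMapConcatTask, BroadcastTask, TransformTask, MergeTask)
    println(remoteStages.flatMap(taskFor))
  }
}
```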

Page 47: Gearpump akka streams

GearpumpMaterializer

6. Sends this Graph to the Gearpump master

SourceBridgeTask

SinkBridgeTask

BroadcastTask

TransformTask

MergeTask

StatefulMapConcatTask

Page 48: Gearpump akka streams

GearpumpMaterializer

7. Materialization is controlled at BridgeTasks

SourceBridgeTask

SinkBridgeTask

BroadcastTask

TransformTask

MergeTask

StatefulMapConcatTask

Page 49: Gearpump akka streams

Example 2

No local graph. More typical of distributed apps.

implicit val materializer = GearpumpMaterializer()

val sink = GearSink.to(new LoggerSink[String])
val sourceData = new CollectionDataSource(
  List("red hat", "yellow sweater", "blue jack", "red apple", "green plant", "blue sky"))
val source = GearSource.from[String](sourceData)

source.filter(_.startsWith("red")).map("I want to order item: " + _).runWith(sink)

Page 50: Gearpump akka streams

Example 3

More complex Graph with loops

implicit val materializer = GearpumpMaterializer()

RunnableGraph.fromGraph(GraphDSL.create() {
  implicit builder =>
    // Brings the ~> and <~ operators into scope
    import GraphDSL.Implicits._

    val A = builder.add(Source.single(0)).out
    val B = builder.add(Broadcast[Int](2))
    val C = builder.add(Merge[Int](2))
    val D = builder.add(Flow[Int].map(_ + 1))
    val E = builder.add(Balance[Int](2))
    val F = builder.add(Merge[Int](2))
    val G = builder.add(Sink.foreach(println)).in

    C <~ F
    A ~> B ~> C ~> F
    B ~> D ~> E ~> F
    E ~> G
    ClosedShape
}).run()

Page 51: Gearpump akka streams

Summary

● Akka-streams provides a compelling programming model that enables declarative pipeline reuse and extensibility.

● Akka-streams allows different materializers to control and materialize different parts of the module tree.

● It’s possible to provide a seamless (or nearly seamless) conversion of akka-streams to run in a distributed setting by merely replacing ActorMaterializer with GearpumpMaterializer.

● Alternative distributed materializers can be implemented using a similar approach.

● Distributed akka-streams via Apache Gearpump will be available in the next release of Apache Gearpump (0.8.2) or will be made available within an akka specific repo.

Page 52: Gearpump akka streams

Thank you

twitter: @ApacheGearpump, @kkasravi