Productionalizing Spark Streaming

Post on 26-Jan-2015



Spark Summit 2013 Talk: At Sharethrough we have deployed Spark to our production environment to support several user-facing product features. While building these features we uncovered a consistent set of challenges across multiple streaming jobs. Addressing these challenges up front speeds up the development of future streaming jobs. In this talk we will discuss the three major challenges we encountered while developing production streaming jobs and how we overcame them. First, we will look at how to write jobs to ensure fault tolerance, since streaming jobs need to run 24/7 even under failure conditions. Second, we will look at the programming abstractions we created using functional programming and existing libraries. Finally, we will look at the way we test all the pieces of a job, from manipulating data through writing to external databases, to give us confidence in our code before we deploy to production.



Productionalizing Spark Streaming

Spark Summit 2013

Ryan Weald


Monday, December 2, 13


What We’re Going to Cover

•What we do and why we chose Spark

•Fault tolerance for long lived streaming jobs

•Common patterns and functional abstractions

•Testing before we “do it live”



Special focus on common patterns and their solutions



What is Sharethrough?

Advertising for the Modern Internet






Why Spark Streaming?



Why Spark Streaming

•Liked theoretical foundation of mini-batch

•Scala codebase + functional API

•Young project with opportunities to contribute

•Batch model for iterative ML algorithms



Great... Now productionalize it



Fault Tolerance



Keys to Fault Tolerance

1.Receiver fault tolerance

2.Monitoring job progress



Receiver Fault Tolerance

•Use Actors with supervisors

•Use self healing connection pools
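
The "self healing" idea can be sketched without any messaging library. The following is a minimal illustration, not the talk's implementation: `SelfHealing`, `FlakyConnection`, and `maxAttempts` are hypothetical names, and a real receiver would likely add backoff between attempts and thread safety.

```scala
import scala.util.Try

object SelfHealingSketch {
  // Hypothetical wrapper: `connect` opens a fresh connection of type C.
  final class SelfHealing[C](connect: () => C, maxAttempts: Int = 3) {
    private var current: Option[C] = None
    var reconnects: Int = 0 // exposed so monitoring can alert on connection churn

    // Reuse the open connection, or open one if none is held.
    private def connection(): C = current.getOrElse {
      val c = connect()
      current = Some(c)
      c
    }

    // Run `f` against the connection. On failure, drop the connection,
    // reconnect, and retry, up to `maxAttempts` attempts in total.
    def withConnection[A](f: C => A): Try[A] = {
      var attempt = 1
      var result = Try(f(connection()))
      while (result.isFailure && attempt < maxAttempts) {
        current = None
        reconnects += 1
        attempt += 1
        result = Try(f(connection()))
      }
      result
    }
  }

  // Toy connection: the first one opened always fails (stands in for a dropped socket).
  final class FlakyConnection(id: Int) {
    def read(): String =
      if (id == 1) throw new RuntimeException("connection reset")
      else s"message from connection $id"
  }

  private var opened = 0
  val pool = new SelfHealing[FlakyConnection](() => { opened += 1; new FlakyConnection(opened) })
  val firstRead: Try[String] = pool.withConnection(_.read())
}
```

Here the first read fails, the wrapper discards the broken connection, and the retry succeeds on a fresh one without the caller noticing.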



Use Actors

class RabbitMQStreamReceiver(uri: String, exchangeName: String, routingKey: String)
    extends Actor with Receiver with Logging {

  implicit val system = ActorSystem()

  override def preStart() = {
    // Your code to set up connections and actors
    // Include inner class to process messages
  }

  def receive: Receive = {
    case _ => logInfo("unknown message")
  }
}



Track All Outputs

•Low watermarks - Google MillWheel

•Database updated_at

•Expected output file size alerting
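
The three checks above can be expressed as small pure functions that an external monitor calls on a schedule. This is only a sketch under assumptions: the function names, the thresholds, and the simplified reading of a low watermark (the oldest event time not yet written out) are illustrative and are not taken from the talk or from MillWheel itself.

```scala
import java.time.{Duration, Instant}

object OutputMonitorSketch {
  // Simplified low watermark: the oldest event time among records still pending.
  def lowWatermark(pendingEventTimes: Seq[Instant]): Option[Instant] =
    if (pendingEventTimes.isEmpty) None
    else Some(pendingEventTimes.minBy(_.toEpochMilli))

  // Database updated_at check: the job is stale if its last write is older than maxLag.
  def isStale(lastUpdatedAt: Instant, now: Instant, maxLag: Duration): Boolean =
    Duration.between(lastUpdatedAt, now).compareTo(maxLag) > 0

  // File size check: flag output that deviates from the expected size by more than `tolerance`.
  def fileSizeLooksWrong(actualBytes: Long, expectedBytes: Long, tolerance: Double): Boolean =
    math.abs(actualBytes - expectedBytes).toDouble > expectedBytes * tolerance

  val now: Instant = Instant.parse("2013-12-02T12:00:00Z")
  val maxLag: Duration = Duration.ofMinutes(5)

  val freshJobStale: Boolean = isStale(now.minusSeconds(60), now, maxLag)
  val stuckJobStale: Boolean = isStale(now.minusSeconds(900), now, maxLag)
  val watermark: Option[Instant] = lowWatermark(Seq(now.minusSeconds(30), now.minusSeconds(120)))
}
```

Keeping the checks outside the streaming job means they still fire when the job itself has stopped making progress.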



Common Patterns & Functional Programming



Map -> Aggregate -> Store

Common Job Pattern



Mapping Data

// Function literal passed to map on the stream of raw requests
{ rawRequest =>
  val params = QueryParams.parse(rawRequest)
  (params.getOrElse("beaconType", "unknown"), 1L)
}
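
The mapping step can be tried on a plain collection. In the sketch below, `QueryParams.parse` is a hypothetical stand-in (the talk does not show the real helper): it assumes a raw request is a path followed by a query string and does no URL decoding.

```scala
object MappingSketch {
  // Hypothetical stand-in for the QueryParams helper used on the slide.
  object QueryParams {
    def parse(rawRequest: String): Map[String, String] = {
      // Keep only what follows the first '?'; empty if there is no query string.
      val query = rawRequest.dropWhile(_ != '?').drop(1)
      query.split('&').toSeq.flatMap { pair =>
        pair.split("=", 2) match {
          case Array(k, v) if k.nonEmpty => Some(k -> v)
          case _                         => None // skip malformed or empty pairs
        }
      }.toMap
    }
  }

  // Same body as the slide's function literal, named so it can be tested on its own.
  def toBeaconCount(rawRequest: String): (String, Long) = {
    val params = QueryParams.parse(rawRequest)
    (params.getOrElse("beaconType", "unknown"), 1L)
  }

  val requests = Seq("/beacon?beaconType=click&user=1", "/beacon?user=2")
  val mapped: Seq[(String, Long)] = requests.map(toBeaconCount)
}
```

Every request becomes a `(beaconType, 1L)` pair, which is the shape the aggregation step expects.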






Basic Aggregation

// beacons is DStream[(String, Long)]
// example: Seq(("click", 1L), ("click", 1L))
val sum: (Long, Long) => Long = _ + _
beacons.reduceByKey(sum)



What happens when we want to sum multiple things?



Long Basic Aggregation

// Example contents of the inputData stream
val inputData = Seq(
  ("user_1", (1L, 1L, 1L)),
  ("user_1", (2L, 2L, 2L))
)

def sum(l: (Long, Long, Long), r: (Long, Long, Long)) = {
  (l._1 + r._1, l._2 + r._2, l._3 + r._3)
}

inputData.reduceByKey(sum)



Now Sum 4 Ints instead

(ノಥ益ಥ)ノ ┻━┻



Monoids to the Rescue



WTF is a Monoid?

trait Monoid[T] {
  def zero: T
  def plus(r: T, l: T): T
}

* Just need to make sure plus is associative: (1 + 5) + 2 == 1 + (5 + 2)
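
Associativity matters because partial results from different partitions or batches may be combined in any grouping. A small sketch of checking the two monoid laws on sample values follows; the helper names `isAssociative` and `hasIdentity` are illustrative, and checking a few values is a spot check, not a proof.

```scala
object MonoidLawsSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Addition on Long: associative, with 0 as the identity.
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Subtraction fits the trait's shape but breaks associativity, so it is not a valid monoid.
  val subtraction: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l - r
  }

  // (a + b) + c must equal a + (b + c).
  def isAssociative[T](m: Monoid[T], a: T, b: T, c: T): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  // zero must leave a value unchanged on either side.
  def hasIdentity[T](m: Monoid[T], a: T): Boolean =
    m.plus(m.zero, a) == a && m.plus(a, m.zero) == a
}
```

With subtraction, `(1 - 5) - 2` is `-6` while `1 - (5 - 2)` is `-2`, so regrouping partial results would change the answer.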



Monoid Based Aggregation

object LongMonoid extends Monoid[(Long, Long, Long)] {
  def zero = (0L, 0L, 0L)
  def plus(r: (Long, Long, Long), l: (Long, Long, Long)) = {
    (l._1 + r._1, l._2 + r._2, l._3 + r._3)
  }
}

inputData.reduceByKey(LongMonoid.plus(_, _))



Twitter Algebird



Algebird Based Aggregation

import com.twitter.algebird._

val aggregator = implicitly[Monoid[(Long, Long, Long)]]

inputData.reduceByKey(aggregator.plus(_, _))
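
The `implicitly` call works because a tuple monoid can be derived from the monoids of its fields. The dependency-free sketch below illustrates that idea only; it is not Algebird's code, and `sumByKey` is a hypothetical local stand-in for `reduceByKey` on a plain `Seq`.

```scala
object TupleMonoidSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  object Monoid {
    implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
      def zero: Long = 0L
      def plus(l: Long, r: Long): Long = l + r
    }

    // A 3-tuple is a monoid whenever each of its fields is one.
    implicit def tuple3[A, B, C](implicit a: Monoid[A], b: Monoid[B], c: Monoid[C]): Monoid[(A, B, C)] =
      new Monoid[(A, B, C)] {
        def zero: (A, B, C) = (a.zero, b.zero, c.zero)
        def plus(l: (A, B, C), r: (A, B, C)): (A, B, C) =
          (a.plus(l._1, r._1), b.plus(l._2, r._2), c.plus(l._3, r._3))
      }
  }

  // Group by key and fold each group's values with whatever monoid is in scope for V.
  def sumByKey[K, V](data: Seq[(K, V)])(implicit m: Monoid[V]): Map[K, V] =
    data.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).foldLeft(m.zero)(m.plus)
    }

  val inputData = Seq(
    ("user_1", (1L, 1L, 1L)),
    ("user_1", (2L, 2L, 2L)),
    ("user_2", (5L, 0L, 1L))
  )
  val totals: Map[String, (Long, Long, Long)] = sumByKey(inputData)
}
```

Changing the value type then means changing only the data, since the compiler assembles the matching monoid from the field instances.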



How many unique users per publisher?



Too big for a naive in-memory Map



HyperLogLog FTW



HLL Aggregation

import com.twitter.algebird._

val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))
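
To see why HyperLogLog fits the monoid pattern, here is a deliberately simplified sketch. It is not Algebird's implementation: it assumes a 32-bit MurmurHash3 of the user id, a plain `Vector[Int]` of registers, and only the small-range (linear counting) correction. The point is that `plus` is an element-wise max, which is associative and has the all-zero register vector as its identity.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  val bits = 12           // same precision argument as HyperLogLogMonoid(12) on the slide
  val m: Int = 1 << bits  // number of registers

  type Registers = Vector[Int]
  val zero: Registers = Vector.fill(m)(0)

  // Sketch for a single item: exactly one register is set from the item's hash.
  def create(item: String): Registers = {
    val h = MurmurHash3.stringHash(item)
    val index = h >>> (32 - bits) // top `bits` bits pick the register
    val rest = h << bits          // remaining bits, left-aligned
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - bits) + 1
    zero.updated(index, rank)
  }

  // Monoid plus: element-wise max of the two register vectors.
  def plus(l: Registers, r: Registers): Registers =
    l.zip(r).map { case (a, b) => math.max(a, b) }

  // Standard HLL estimate, falling back to linear counting while registers are still empty.
  def estimate(regs: Registers): Double = {
    val alpha = 0.7213 / (1 + 1.079 / m)
    val raw = alpha * m * m / regs.map(r => math.pow(2.0, -r)).sum
    val empty = regs.count(_ == 0)
    if (raw <= 2.5 * m && empty > 0) m * math.log(m.toDouble / empty) else raw
  }

  val visits = Seq(("pub_a", "user_1"), ("pub_a", "user_2"), ("pub_a", "user_1"), ("pub_b", "user_3"))
  val perPublisher: Map[String, Registers] =
    visits.groupBy(_._1).map { case (pub, vs) =>
      pub -> vs.map(v => create(v._2)).foldLeft(zero)(plus)
    }
}
```

Because max is idempotent, seeing `user_1` twice leaves the sketch unchanged, and each publisher's state stays at a fixed 4096 registers however many users it has.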



Monoids == Reusable Aggregation



Common Job Pattern

Map -> Aggregate -> Store






How do we store the results?



Storage API Requirements

•Incremental updates (preferably associative)

•Pluggable to support “big data” stores

•Allow for testing jobs



Storage API

trait MergeableStore[K, V] {
  def get(key: K): V
  def put(kv: (K, V)): V

  /*
   * Should follow same associative property
   * as our Monoid from earlier
   */
  def merge(kv: (K, V)): V
}
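
An in-memory implementation of this trait is enough to exercise a job in tests. The sketch below makes two assumptions the slide leaves open: `put` and `merge` return the value now stored, and a missing key reads as the monoid's zero. `InMemoryStore` is a hypothetical name, not a Storehaus class.

```scala
import scala.collection.mutable

object InMemoryStoreSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  trait MergeableStore[K, V] {
    def get(key: K): V
    def put(kv: (K, V)): V
    def merge(kv: (K, V)): V
  }

  // Hypothetical test double backed by a mutable Map.
  class InMemoryStore[K, V](monoid: Monoid[V]) extends MergeableStore[K, V] {
    private val data = mutable.Map.empty[K, V]

    // Assumption: missing keys read as the monoid's zero.
    def get(key: K): V = data.getOrElse(key, monoid.zero)

    // Assumption: put returns the value that was written.
    def put(kv: (K, V)): V = { data(kv._1) = kv._2; kv._2 }

    // merge combines the stored value with the increment using the monoid's plus.
    def merge(kv: (K, V)): V = put(kv._1 -> monoid.plus(get(kv._1), kv._2))
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  val store = new InMemoryStore[String, Long](longSum)
  store.merge("click" -> 2L)
  store.merge("click" -> 3L)
}
```

Since `merge` delegates to an associative `plus`, increments from successive batches accumulate to the same total regardless of how they were grouped upstream.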



Twitter Storehaus



Storing Spark Results

def saveResults(result: DStream[(String, Long)],
                store: RedisStore[String, Long]) = {
  result.foreach { rdd =>
    rdd.foreach { element =>
      val (key, impressions) = element
      store.merge((key, impressions))
    }
  }
}



Everyone can benefit



Potential API additions?

class PairDStreamFunctions[K, V] {
  def aggregateByKey(aggregator: Monoid[V])
  def store(store: MergeableStore[K, V])
}
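
How such an API might read at the call site can be tried on local collections. The sketch below is purely illustrative and is not a Spark API: it enriches `Seq[(K, V)]` instead of a DStream, treats each `Seq` as one batch, and `PairSeqFunctions`, `mapStore`, and `totals` are hypothetical names.

```scala
import scala.collection.mutable

object PairFunctionsSketch {
  trait Monoid[V] {
    def zero: V
    def plus(l: V, r: V): V
  }

  trait MergeableStore[K, V] {
    def merge(kv: (K, V)): V
  }

  // Enrichment over a local Seq of pairs, mirroring the two proposed methods.
  implicit class PairSeqFunctions[K, V](pairs: Seq[(K, V)]) {
    // Aggregate step: fold each key's values with the supplied monoid.
    def aggregateByKey(aggregator: Monoid[V]): Seq[(K, V)] =
      pairs.groupBy(_._1).toSeq.map { case (k, kvs) =>
        k -> kvs.map(_._2).foldLeft(aggregator.zero)(aggregator.plus)
      }

    // Store step: merge every aggregated pair into the target store.
    def store(target: MergeableStore[K, V]): Unit =
      pairs.foreach(kv => target.merge(kv))
  }

  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Minimal store backed by a mutable Map, standing in for an external database.
  val totals = mutable.Map.empty[String, Long]
  val mapStore: MergeableStore[String, Long] = new MergeableStore[String, Long] {
    def merge(kv: (String, Long)): Long = {
      val updated = longSum.plus(totals.getOrElse(kv._1, longSum.zero), kv._2)
      totals(kv._1) = updated
      updated
    }
  }

  // Two "batches" run through Aggregate -> Store.
  val batch1 = Seq("click" -> 1L, "click" -> 1L, "view" -> 1L)
  val batch2 = Seq("click" -> 1L)
  Seq(batch1, batch2).foreach(batch => batch.aggregateByKey(longSum).store(mapStore))
}
```

The job body shrinks to one line per batch, and swapping the monoid or the store changes what is computed or where it lands without touching that line.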



Twitter Summingbird




Testing Your Jobs



Testing Best Practices

•Try to avoid full integration tests

•Use in-memory stores for testing

•Keep logic outside of Spark

•Use Summingbird in-memory platform???



Ryan Weald
@rweald

Thank You
