CHAPTER 10: SPARK STREAMING
Learning Spark by Holden Karau et al.
Overview: Spark Streaming
- A Simple Example
- Architecture and Abstraction
- Transformations: Stateless, Stateful
- Output Operations
- Input Sources: Core Sources, Additional Sources, Multiple Sources and Cluster Sizing
- 24/7 Operation: Checkpointing, Driver Fault Tolerance, Worker Fault Tolerance, Receiver Fault Tolerance, Processing Guarantees
- Streaming UI
- Performance Considerations: Batch and Window Sizes, Level of Parallelism, Garbage Collection and Memory Usage
- Conclusion
10.1 A Simple Example
Before we dive into the details of Spark Streaming, let’s consider a simple example. We will receive a stream of newline-delimited lines of text from a server running at port 7777, filter only the lines that contain the word error, and print them.
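A minimal sketch of this example in Scala (the hostname "localhost", the application name, and the 1-second batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Create a StreamingContext with a 1-second batch interval
val conf = new SparkConf().setAppName("StreamingErrorFilter")
val ssc = new StreamingContext(conf, Seconds(1))
// Connect to the server listening on port 7777
val lines = ssc.socketTextStream("localhost", 7777)
// Keep only the lines containing the word "error" and print them
val errorLines = lines.filter(_.contains("error"))
errorLines.print()
// Start the computation and wait for it to terminate
ssc.start()
ssc.awaitTermination()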
Spark Streaming programs are best run as standalone applications built using Maven or sbt. Spark Streaming, while part of Spark, ships as a separate Maven artifact and has some additional imports you will want to add to your project.
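For example, with sbt the dependency might be declared as follows (the Scala binary version and Spark version are assumptions and should match your environment), together with the typical imports:

libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.2.0"

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds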
10.2 Architecture and Abstraction
10.3 Transformations
Stateless transformations
- The processing of each batch does not depend on the data of its previous batches
- Include the common RDD transformations like map(), filter(), and reduceByKey()
Stateful transformations
- Use data or intermediate results from previous batches to compute the results of the current batch
- Include transformations based on sliding windows and on tracking state across time
10.3.1 Stateless Transformations
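As an illustration, a hedged sketch applying stateless transformations to the lines DStream from the earlier example (the whitespace split is only illustrative); each transformation below operates on one batch's RDD at a time:

val words = lines.flatMap(line => line.split(" "))
val wordPairs = words.map(word => (word, 1))
// Counts cover only the data that arrived in the current batch
val perBatchCounts = wordPairs.reduceByKey((x, y) => x + y)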
10.3.2 Stateful Transformations
Windowed transformations
- Compute results across a longer time period than the StreamingContext's batch interval, by combining results from multiple batches (see the sketch below)
- Figure: a windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two time steps, we compute a result over the previous 3 time steps
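A hedged sketch of a windowed aggregation, assuming ipPairDStream is a DStream of (ipAddress, 1) pairs and that both durations are multiples of the batch interval:

import org.apache.spark.streaming.Seconds

// Count requests per IP over the last 30 seconds of data, recomputed every 10 seconds
val windowedIpCounts = ipPairDStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,   // add counts from batches entering the window
  (x: Int, y: Int) => x - y,   // subtract counts from batches leaving the window
  Seconds(30),                 // window duration
  Seconds(10))                 // slide duration
// This inverse-function variant requires checkpointing to be enabled
windowedIpCounts.print()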
10.3.2 Stateful Transformations (cont.)
UpdateStateByKey transformation
- updateStateByKey() maintains state across the batches in a DStream by providing access to a state variable for DStreams of key/value pairs
- update(events, oldState) returns a newState (a sketch follows below)
  - events is a list of events that arrived in the current batch (may be empty)
  - oldState is an optional state object, stored within an Option; it might be missing if there was no previous state for the key
  - newState is also an Option; we can return an empty Option to specify that we want to delete the state
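A hedged sketch of a running count per key with updateStateByKey(), again assuming ipPairDStream is a DStream of (ipAddress, 1L) pairs and that checkpointing has been enabled on the StreamingContext:

// update(events, oldState): add the new events onto the previous total
def updateRunningCount(newValues: Seq[Long], state: Option[Long]): Option[Long] = {
  // Returning None here would delete the stored state for this key
  Some(state.getOrElse(0L) + newValues.sum)
}

val runningIpCounts = ipPairDStream.updateStateByKey(updateRunningCount _)
runningIpCounts.print()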
10.4 Output Operations
Specify what needs to be done with the final transformed data in a stream
Examples: print() and the save operations (saveAsTextFiles(), saveAsHadoopFiles())
Saving a DStream to text files in Scala:
ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")

Saving SequenceFiles from a DStream in Scala:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.SequenceFileOutputFormat

val writableIpAddressRequestCount = ipAddressRequestCount.map {
  case (ip, count) => (new Text(ip), new LongWritable(count))
}
writableIpAddressRequestCount.saveAsHadoopFiles[
  SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")
10.5 Input Sources
Spark Streaming has built-in support for a number of different data sources.
- "Core" sources are built into the Spark Streaming Maven artifact
- Others are available through additional artifacts, e.g., spark-streaming-kafka
10.5.1 Core Sources
Stream of files
- Allows a stream to be created from files written in a directory of a Hadoop-compatible filesystem
- Needs a consistent date format for the directory names, and the files have to be created atomically
- Example: streaming text files written to a directory in Scala
  val logData = ssc.textFileStream(logDirectory)
Akka actor stream
- Allows using Akka actors as a source for streaming
- To construct an actor stream: create an Akka actor that implements the org.apache.spark.streaming.receiver.ActorHelper interface
10.5.2 Additional Sources
Apache Kafka
Apache Flume (push-based receiver, pull-based receiver)
Custom input sources
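For instance, a hedged sketch of reading from Kafka via the spark-streaming-kafka artifact (the ZooKeeper quorum, consumer group, and topic name are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

// Map of topic name -> number of receiver threads for that topic
val topics = Map("logs" -> 1)
// Returns a DStream of (key, message) pairs
val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "log-group", topics)
val messages = kafkaStream.map(_._2)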
10.5.3 Multiple Sources and Cluster Sizing
- We can combine data from multiple input DStreams using operations like union(), as shown in the sketch below
- The receivers are executed in the Spark cluster, which allows multiple ones to be used
- Each receiver runs as a long-running task within Spark's executors, and hence occupies CPU cores allocated to the application
- Note: do not run Spark Streaming programs locally with master configured as "local" or "local[1]"
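A hedged sketch combining two socket receivers with union(); the hostnames are placeholders, and the master must provide enough cores for both receivers plus the processing itself (e.g., at least local[3] when testing locally):

// Two independent receivers, each occupying one executor core
val stream1 = ssc.socketTextStream("host1", 7777)
val stream2 = ssc.socketTextStream("host2", 7777)
// Process both inputs as a single DStream
val combined = stream1.union(stream2)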
10.6 “24/7” Operations
Spark provides strong fault tolerance guarantees: as long as the input data is stored reliably, Spark Streaming will always compute the correct result from it, offering "exactly once" semantics, even if workers or the driver fail.
To run Spark Streaming applications 24/7:
1. Set up checkpointing to a reliable storage system, such as HDFS or Amazon S3
2. Worry about the fault tolerance of the driver program and of unreliable input sources
10.6.1 Checkpointing
- The main mechanism that needs to be set up for fault tolerance
- Allows periodically saving data about the application to a reliable storage system, such as HDFS or Amazon S3, for use in recovering (see the call below)
- Serves two purposes:
  - Limiting the state that must be recomputed on failure
  - Providing fault tolerance for the driver
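Enabling checkpointing is a single call on the StreamingContext; the path below is a placeholder for a directory on reliable storage:

ssc.checkpoint("hdfs://...")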
10.6.2 Driver Fault Tolerance
- Requires a special way of creating our StreamingContext that takes in the checkpoint directory: use the StreamingContext.getOrCreate() function (a sketch follows below)
- In addition to writing your initialization code using getOrCreate(), you need to actually restart your driver program when it crashes
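A hedged sketch of that pattern (the checkpoint directory, application name, and batch interval are assumptions):

val checkpointDir = "hdfs://..."

def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FaultTolerantApp")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream transformations here before returning
  ssc
}

// On a fresh start this calls createStreamingContext();
// after a driver crash it rebuilds the context from the checkpoint data instead
val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
ssc.start()
ssc.awaitTermination()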
10.6.3 Worker Fault Tolerance
Spark Streaming uses the same techniques as Spark for its fault tolerance.
All the data received from external sources is replicated among the Spark workers
All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.
10.6.4 Receiver Fault Tolerance
- Spark Streaming restarts the failed receivers on other nodes in the cluster
- Receivers provide the following guarantees:
  - All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles) is reliable, because the underlying filesystem is replicated
  - For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down
10.6.5 Processing Guarantees
Spark Streaming provides exactly-once semantics for all transformations.
- Even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.
- However, when the transformed result is to be pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times.
10.7 Streaming UI
A UI page that lets us look at what streaming applications are doing (typically at http://<driver>:4040)
10.8 Performance Considerations
Batch and Window Sizes, Level of Parallelism, Garbage Collection and Memory Usage
10.8.1 Batch and Window Sizes
- The minimum batch size Spark Streaming can use is 500 milliseconds
- The best approach: start with a larger batch size (around 10 seconds) and work your way down to a smaller batch size
- If the processing times reported in the Streaming UI remain consistent, then you can continue to decrease the batch size
- Note: if they are increasing, you may have reached the limit for your application
10.8.2 Level of Parallelism
- Increasing the parallelism is a common way to reduce the processing time of batches
- Three ways (see the sketch below):
  - Increasing the number of receivers
  - Explicitly repartitioning received data
  - Increasing parallelism in aggregation
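As a hedged illustration of the last two techniques, assuming lines is an input DStream (the partition counts are arbitrary):

// Explicitly repartition received data before heavy processing
val repartitioned = lines.repartition(10)
// Pass an explicit partition count to aggregations such as reduceByKey()
val wordCounts = repartitioned
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y, 10)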
10.8.3 Garbage Collection and Memory Usage
- Java's garbage collection is an aspect that can cause problems
- To minimize large pauses due to GC, enable Java's Concurrent Mark-Sweep garbage collector; it consumes more resources overall, but introduces fewer pauses
- To reduce GC pressure (see the sketch below):
  - Cache RDDs in serialized form
  - Use Kryo serialization
  - Use an LRU cache
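A hedged sketch of these suggestions expressed as Spark configuration and a storage level (the keys are standard Spark settings; the DStream name is hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("TunedStreamingApp")
  // Run executors with the Concurrent Mark-Sweep collector to shorten GC pauses
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
  // Kryo serialization reduces the size and GC cost of serialized data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Cache DStream RDDs in serialized form to lower GC pressure
// someDStream.persist(StorageLevel.MEMORY_ONLY_SER)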
10.9 Conclusion
In this chapter, we have seen how to work with streaming data using DStreams.
Since DStreams are composed of RDDs, the techniques and knowledge you have gained from the earlier chapters remain applicable for streaming and real-time applications.
In the next chapter, we will look at machine learning with Spark.