Spark: A Brief History - Indian Institute of...
Transcript of Spark: A Brief History - Indian Institute of...
![Page 1: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/1.jpg)
Spark: A Brief History
https://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
![Page 2: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/2.jpg)
A Brief History:
2002
2002MapReduce @ Google
2004MapReduce paper
2006Hadoop @ Yahoo!
2004 2006 2008 2010 2012 2014
2014Apache Spark top-level
2010Spark paper
2008Hadoop Summit
![Page 3: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/3.jpg)
A Brief History: MapReduce
circa 1979 – Stanford, MIT, CMU, etc. set/list operations in LISP, Prolog, etc., for parallel processingwww-formal.stanford.edu/jmc/history/lisp/lisp.htm
circa 2004 – Google MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawatresearch.google.com/archive/mapreduce.html
circa 2006 – Apache Hadoop, originating from the Nutch Project Doug Cuttingresearch.yahoo.com/files/cutting.pdf
circa 2008 – Yahoo web scale search indexing Hadoop Summit, HUG, etc. developer.yahoo.com/hadoop/
circa 2009 – Amazon AWS Elastic MapReduce Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc. aws.amazon.com/elasticmapreduce/
![Page 4: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/4.jpg)
MapReduce use cases showed two major limitations:
1. difficultly of programming directly in MR
2. performance bottlenecks, or batch not fitting the use cases
In short, MR doesn’t compose well for large applications
Therefore, people built specialized systems as workarounds…
A Brief History: MapReduce
![Page 5: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/5.jpg)
A Brief History: MapReduce
MapReduce
General Batch Processing
Pregel Giraph
Dremel Drill Tez
Impala GraphLab
Storm S4
Specialized Systems: iterative, interactive, streaming, graph, etc.
The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4
![Page 6: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/6.jpg)
2002
2002MapReduce @ Google
2004MapReduce paper
2006Hadoop @ Yahoo!
2004 2006 2008 2010 2012 2014
2014Apache Spark top-level
2010Spark paper
2008Hadoop Summit
A Brief History: Spark
Spark: Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica USENIX HotCloud (2010) people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf !Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica NSDI (2012) usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
![Page 7: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/7.jpg)
A Brief History: Spark
Unlike the various specialized systems, Spark’s goal was to generalize MapReduce to support new apps within same engine
Two reasonably small additions are enough to express the previous models:
• fast data sharing • general DAGs
This allows for an approach which is more efficient for the engine, and much simpler for the end users
![Page 8: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/8.jpg)
A Brief History: Spark
The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4
used as libs, instead of specialized systems
![Page 9: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/9.jpg)
Some key points about Spark:
• handles batch, interactive, and real-timewithin a single framework
• native integration with Java, Python, Scala
• programming at a higher level of abstraction
• more general: map/reduce is just one setof supported constructs
A Brief History: Spark
![Page 10: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/10.jpg)
https://www.safaribooksonline.com/library/view/data-analytics-with/9781491913734/assets/dawh_0401.png
![Page 11: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/11.jpg)
Resilient Distributed Datasets A Fault-‐Tolerant Abstraction for In-‐Memory Cluster Computing
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley UC BERKELEY
![Page 12: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/12.jpg)
Motivation MapReduce greatly simplified “big data” analysis on large, unreliable clusters
But as soon as it got popular, users wanted more: » More complex, multi-‐stage applications (e.g. iterative machine learning & graph processing) » More interactive ad-‐hoc queries
Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)
![Page 13: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/13.jpg)
Motivation Complex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
In MapReduce, the only way to share data across jobs is stable storage slow!
![Page 14: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/14.jpg)
Examples
iter. 1 iter. 2 . . .
Input
HDFS read
HDFS write
HDFS read
HDFS write
Input
query 1
query 2
query 3
result 1
result 2
result 3
. . .
HDFS read
Slow due to replication and disk I/O, but necessary for fault tolerance
![Page 15: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/15.jpg)
iter. 1 iter. 2 . . .
Input
Goal: In-‐Memory Data Sharing
Input
query 1
query 2
query 3
. . .
one-‐time processing
10-‐100× faster than network/disk, but how to get FT?
![Page 16: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/16.jpg)
Challenge
How to design a distributed memory abstraction that is both fault-‐tolerant and efficient?
![Page 17: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/17.jpg)
Challenge Existing storage abstractions have interfaces based on fine-‐grained updates to mutable state » RAMCloud, databases, distributed mem, Piccolo
Requires replicating data or logs across nodes for fault tolerance » Costly for data-‐intensive apps » 10-‐100x slower than memory write
![Page 18: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/18.jpg)
Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory » Immutable, partitioned collections of records » Can only be built through coarse-‐grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage » Log one operation to apply to many elements » Recompute lost partitions on failure » No cost if nothing fails
![Page 19: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/19.jpg)
Input
query 1
query 2
query 3
. . .
RDD Recovery
one-‐time processing
iter. 1 iter. 2 . . .
Input
![Page 20: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/20.jpg)
Generality of RDDs Despite their restrictions, RDDs can express surprisingly many parallel algorithms » These naturally apply the same operation to many items
Unify many current programming models » Data flow models: MapReduce, Dryad, SQL, … » Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, …
Support new apps that these models don’t
![Page 21: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/21.jpg)
Memory bandwidth
Network bandwidth
Tradeoff Space
Granularity of Updates
Write Throughput
Fine
Coarse
Low High
K-‐V stores, databases, RAMCloud
Best for batch workloads
Best for transactional workloads
HDFS RDDs
![Page 22: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/22.jpg)
Spark Programming Interface
DryadLINQ-‐like API in the Scala language
Usable interactively from Scala interpreter
Provides: » Resilient distributed datasets (RDDs) » Operations on RDDs: transformations (build new RDDs), actions (compute and output results) » Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc)
![Page 23: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/23.jpg)
Spark Operations
Transformations (define a new RDD)
map filter
sample groupByKey reduceByKey sortByKey
flatMap union join
cogroup cross
mapValues
Actions (return a result to driver program)
collect reduce count save
lookupKey
![Page 24: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/24.jpg)
Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
messages.persist()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Master
messages.filter(_.contains(“foo”)).count
messages.filter(_.contains(“bar”)).count
tasks
results
Msgs. 1
Msgs. 2
Msgs. 3
Base RDD Transformed RDD
Action
Result: full-‐text search of Wikipedia in <1 sec (vs 20 sec for on-‐disk data) Result: scaled to 1 TB data in 5-‐7 sec
(vs 170 sec for on-‐disk data)
![Page 25: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/25.jpg)
Task Scheduler Dryad-‐like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-‐aware to avoid shuffles
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= cached data partition
![Page 26: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/26.jpg)
https://databricks-training.s3.amazonaws.com/slides/advanced-spark-training.pdf
![Page 27: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/27.jpg)
What is RDD?
Resilient Distributed Dataset- A big collection of data with following
properties- Immutable- Distributed- Lazily evaluated- Type inferred- Cacheable
https://www.slideshare.net/datamantra/anatomy-of-rdd
![Page 28: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/28.jpg)
Pseudo Monad
• Wraps iterator + partitions distribution
• Keeps track of history for fault tolerance
• Lazily evaluated, chaining of expressions
https://www.slideshare.net/deanchen11/scala-bay-spark-talk
![Page 29: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/29.jpg)
Partitions
● Logical division of data● Derived from Hadoop Map/Reduce● All Input,Intermediate and output data will be
represented as partitions● Partitions are basic unit of parallelism● RDD data is just collection of partitions
![Page 30: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/30.jpg)
Partition from Input Data
Data in HDFS
Chunk 1 Chunk 2 Chunk 3
Input Format
RDD
Partition 1 Partition 2 Partition 3
![Page 31: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/31.jpg)
Partition and Immutability
● All partitions are immutable ● Every transformation generates new partition● Partition immutability driven by underneath
storage like HDFS● Partition immutability allows for fault
recovery
![Page 32: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/32.jpg)
Partitions and Distribution
● Partitions derived from HDFS are distributedby default
● Partitions also location aware● Location awareness of partitions allow for
data locality● For computed data, using caching we can
distribute in memory also
![Page 33: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/33.jpg)
Accessing partitions
● We can access partition together rather single row at a time
● mapParititons API of RDD allows us that● Accessing partition at a time allows us to do
some partionwise operation which cannot be done by accessing single row.
![Page 34: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/34.jpg)
Partition for transformed Data
● Partitioning will be different for key/value pairs that are generated by shuffle operation
● Partitioning is driven by partitioner specified● By default HashPartitioner is used● You can use your own partitioner also
![Page 35: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/35.jpg)
Hash Partitioning
Input RDD
Partition 1 Partition 2 Partition 3
Hash(key)%no.of.partitions
Ouptut RDD Partition 1 Partition 2
![Page 36: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/36.jpg)
Custom Partitioner
● Partition the data according to your datastructure
● Custom partitioning allows control over no ofpartitions and the distribution of data acrosswhen grouping or reducing is done
![Page 37: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/37.jpg)
Look up operation
● Partitioning allows faster lookups● Lookup operation allows to look up for a
given value by specifying the key● Using partitioner, lookup determines which
partition look for● Then it only need to look in that partition● If no partition is specified, it will fallback to
filter
![Page 38: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/38.jpg)
Parent(Dependency)
● Each RDD has access to it’s parent RDD● Nil is the value of parent for first RDD● Before computing it’s value, it always
computes it’s parent● This chain of running allows for laziness
![Page 39: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/39.jpg)
Sub classing
● Each spark operator, creates an instance ofspecific sub class of RDD
● map operator results in MappedRDD,flatMap in FlatMappedRDD etc
● Subclass allows RDD to remember theoperation that is performed in thetransformation
![Page 40: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/40.jpg)
RDD transformationsval dataRDD = sc.textFile(args(1))
val splitRDD = dataRDD.flatMap(value => value.split(“ “)
Hadoop RDD
splitRDD: FlatMappedRDD
dataRDD :MappedRDD
Nil
![Page 41: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/41.jpg)
Compute
● Compute is the function for evaluation ofeach partition in RDD
● Compute is an abstract method of RDD● Each sub class of RDD like MappedRDD,
FilteredRDD have to override this method
![Page 42: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/42.jpg)
RDD actionsval dataRDD = sc.textFile(args(1))
val flatMapRDD = dataRDD.flatMap(value => value.split(“ “)
flatMapRDD.collect()
Hadoop RDD
FlatMap RDD
Mapped RDD
Nil
runJob
compute
compute
compute
![Page 43: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/43.jpg)
runJob API
● runJob API of RDD is the api to implementactions
● runJob allows to take each partition andallow you evaluate
● All spark actions internally use runJob api.
![Page 44: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/44.jpg)
Caching
● cache internally uses persist API● persist sets a specific storage level for a
given RDD● Spark context tracks persistent RDD● When first evaluates, partition will be put into
memory by block manager
![Page 45: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/45.jpg)
Block manager
● Handles all in memory data in spark● Responsible for
○ Cached Data ( BlockRDD)○ Shuffle Data○ Broadcast data
● Partition will be stored in Block with id (RDD.id, partition_index)
![Page 46: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/46.jpg)
How caching works?
● Partition iterator checks the storage level● if Storage level is set it calls
cacheManager.getOrCompute(partition)
● as iterator is run for each RDD evaluation, its transparent to user
![Page 47: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/47.jpg)
The State of Spark, and Where We're Going Next Matei Zaharia Spark Summit (2013) youtu.be/nU6vO2EJAb4
A Brief History: Spark
![Page 48: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/48.jpg)
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
E.g.: messages = textFile(...).filter(_.contains(“error”)).map(_.split(‘\t’)(2))
HadoopRDD path = hdfs://…
FilteredRDD func = _.contains(...)
MappedRDD func = _.split(…)
Fault Recovery
HadoopRDD FilteredRDD MappedRDD
![Page 49: Spark: A Brief History - Indian Institute of Sciencecds.iisc.ac.in/wp-content/uploads/DS256.2017.L15.Spark_.RDD_.pdf · Spark: A Brief History ... Justin Ma, Murphy McCauley, Michael](https://reader033.fdocuments.us/reader033/viewer/2022052711/5ac3a1a37f8b9ae06c8c58e7/html5/thumbnails/49.jpg)
Fault Recovery Results
119
57 56 58 58 81
57 59 57 59
0 20 40 60 80 100 120 140
1 2 3 4 5 6 7 8 9 10
Iteratrion
time (s)
Iteration
Failure happens