Distributed Processing Frameworks
Transcript of Distributed Processing Frameworks
Distributed Processing Frameworks
Author: Antonios Katsarakis
Literature
• MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat - OSDI'04.
• Spark: Cluster Computing with Working Sets
M. Zaharia et al. - HotCloud’10.
Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with the increasing
amount of data to process
• This leads us to the Big Data era
- Big data: Data sets are so large that the processing power of a
single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
MapReduce
• Proposed by Jeff Dean et al. (Google), 2004
- Cited more than 18,000 times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce program:
- Read the input data
- Map: filtering/transformation of the data
- Shuffle and sort
- Reduce: summary operation on the data
- Write the results
[Diagram: the input data is split into three chunks, each processed by a Map task
producing intermediate data; the intermediate data is shuffled to Reduce tasks,
which write the output data]
Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of large-scale computation
• Limitations:
- Restricted to batch processing
- Supports only map and reduce operations
- The shuffling phase introduces overheads
Spark
• Proposed by Matei Zaharia et al., 2010
- Cited more than 1,500 times
• Another programming model, based on
higher-order functions that execute
user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
Spark Model
• Resilient Distributed Datasets (RDDs):
immutable collections of objects
spread across a cluster
• Operations over RDDs:
1. Transformations: lazy operators
that create new RDDs
2. Actions: launch a computation
on an RDD
var count = readFile(…)
  .map(…)
  .filter(…)
  .reduceByKey()
  .count()

[Diagram: job (RDD) graph — the file is split into chunks (RDD0); each transformation
creates a new RDD (RDD1–RDD4, with map and filter pipelined into one stage), and the
count action returns the result; the graph is divided into Stage 1 and Stage 2]
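The lazy-transformation/eager-action split can be sketched with a toy class in Python. This is a model of the idea only, not Spark's API: transformations merely record a lineage of functions over the base data, and nothing runs until an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: base data plus a recorded lineage."""

    def __init__(self, data, lineage=None):
        self._data = data                   # stands in for a partitioned dataset
        self._lineage = lineage or []       # transformations recorded, not yet run

    # --- Transformations: return a new ToyRDD, no computation happens ---
    def map(self, f):
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._lineage + [("filter", p)])

    # --- Actions: walk the lineage and actually compute ---
    def collect(self):
        items = list(self._data)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:  # "filter"
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.count())  # 4  (the pipeline only runs when count() is called)
```

Keeping the lineage instead of the computed data is also what makes RDDs resilient: a lost partition can be recomputed by replaying its lineage over the original input.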
Critical Reflection
• Benefits:
- High-level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis at the thread level is hard
- Multipurpose application support makes performance improvements and
tuning really challenging
- The shuffling phase introduces overheads
Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance