Spark vs Hadoop
Transcript of Spark vs Hadoop
![Page 1: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/1.jpg)
Apache Spark Data Analytics. Comparison to the Existing Technology Using the Example of Apache Hadoop MapReduce.
Final Presentation
Seminar: „Data Science in the Era of Big Data“
Olesya Eidam
Technische Universität München
13.08.2015
![Page 2: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/2.jpg)
Introduction
A brief introduction of the existing big data analytics tools
![Page 3: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/3.jpg)
Source: [1]
The World of Big Data
Apache Hadoop and Spark within the context of big data analytics:
![Page 4: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/4.jpg)
Outline
1. Introduction
2. Hadoop
3. Spark
4. Spark vs. Hadoop MapReduce
5. Spark + HDFS
6. Machine Learning: K-Means
![Page 5: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/5.jpg)
Apache Hadoop
The framework for handling big data based on several interlocking technologies
![Page 6: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/6.jpg)
What is Hadoop?
The Hadoop project’s open-source software for reliable, scalable, distributed computing
Source: [7], [8]
![Page 7: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/7.jpg)
HDFS and YARN Architecture
A Hadoop cluster is characterized by a master-slave architecture, which uses the “shared-nothing” principle for effective data processing.
Source: [11]
![Page 8: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/8.jpg)
MapReduce: an example
MapReduce breaks the processing into two phases, the map phase and the reduce phase, both performed in a distributed, parallel way on a cluster of computers.
Source: [11]
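The two phases can be illustrated with a word count in plain Python. This is only a single-process sketch of the idea, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input lines are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce one after another in a single process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


# Hypothetical input; on a cluster each line could go to a different map task.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = word_count(lines)
print(counts)
```

On a real cluster the map calls and the reduce calls would run in parallel on different nodes, and the shuffle would move data over the network.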
![Page 9: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/9.jpg)
MapReduce within the Hadoop Framework
…represents a scalable solution, which can be extended to several reduce tasks…
Source: [18]
![Page 10: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/10.jpg)
Limitations of Hadoop MapReduce
…however, not necessarily a universally suitable solution, especially for tasks of growing importance.
Source: [2]
![Page 11: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/11.jpg)
Shuffle and Sort
Slow due to replication, serialization, and I/O. Inefficient for iterative algorithms and interactive data mining:
Source: [4]
![Page 12: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/12.jpg)
Apache Spark
An open-source project for fast, in-memory and large-scale data processing
![Page 13: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/13.jpg)
What is Spark?
“Effective, fast, general-purpose cluster computing framework with high level APIs in Java, Scala, Python and R”:
Source: [9]
![Page 14: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/14.jpg)
Spark's buildup
In addition to the benefits of HDFS, Spark relies on a DAG* pattern for complex, multi-step data pipelines and on in-memory data sharing across the DAG.
Source: [12] *DAG: Directed Acyclic Graph
![Page 15: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/15.jpg)
Anatomy of RDD*
Distributed collections of objects that can be cached in memory across cluster nodes.
Source: [5] *RDD: Resilient Distributed Datasets
Some RDD characteristics:
• immutable
• resilient
• distributed
• lazily evaluated
• cacheable/persistent
• fault-tolerant
![Page 16: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/16.jpg)
Actions and Transformations
Spark enables lazy evaluation through a dependency chain of RDDs. The DAG allows more complex operations to be run consistently.
Source: [14], [8]
Transformations
• Return pointers to new RDDs
• Transformations are lazy (not computed immediately)
• Transformed RDDs get recomputed when actions run on them
• RDDs can be persisted in memory or on disk
Actions
• Return values
• Actions result in a DAG of operations
• The DAG is compiled into stages, where each stage is executed as a series of tasks
• Tasks: fundamental units of work
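The difference between lazy transformations and actions can be mimicked with a small toy class in plain Python. This is a sketch for illustration only and not the Spark API: the class `ToyRDD`, its `computations` counter and all method bodies are assumptions of this example. Transformations only record the dependency on the parent, actions trigger the computation, and `persist` keeps the result in memory so that later actions do not recompute it.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records its lineage and computes on demand."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # base data, only set on the root
        self._parent = parent    # dependency on the parent dataset
        self._fn = fn            # function applied to the parent's iterator
        self._persist = False
        self._cache = None
        self.computations = 0    # how often this dataset was (re)computed

    # Transformations: return a new ToyRDD, nothing is computed yet.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda it: (x for x in it if pred(x)))

    def persist(self):
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return iter(self._cache)
        self.computations += 1
        if self._parent is None:
            data = iter(self._source)
        else:
            data = self._fn(self._parent._compute())
        if self._persist:
            self._cache = list(data)
            return iter(self._cache)
        return data

    # Actions: walk the dependency chain and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD(source=range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
runs_before_action = even_squares.computations   # still 0: only lineage exists

result = even_squares.collect()                  # first action
total = even_squares.count()                     # second action recomputes

cached = numbers.map(lambda x: x * x).persist()
cached.collect()
cached.count()                                   # served from memory
```

In this toy model `even_squares` is computed once per action, while `cached` is computed only once in total.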
![Page 17: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/17.jpg)
MapReduce vs Spark
Comparison to Hadoop MapReduce
![Page 18: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/18.jpg)
The Map Side
Spark does not merge or partition spill files. The output of the map phase is written to the OS buffer cache, and each map task outputs as many spill files as there are reducers.
Source: [6]
[Figure: map side, Hadoop MapReduce vs Spark]
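The statement that each map task produces one output per reducer can be sketched in plain Python. This is a minimal illustration and not Spark's or Hadoop's shuffle code: `partition`, `run_map_task`, the sample records and the use of CRC32 as the hash function are assumptions made for this example. CRC32 is used instead of Python's built-in `hash` because string hashes differ between Python processes.

```python
import zlib


def partition(key, num_reducers):
    """Assign a key to a reducer; CRC32 keeps the mapping stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(records, num_reducers):
    """One map task: put every (key, value) pair into the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


num_reducers = 3
# Hypothetical output of two map tasks before partitioning.
map_inputs = [
    [("deer", 1), ("bear", 1), ("river", 1)],
    [("car", 1), ("car", 1), ("river", 1)],
]
map_outputs = [run_map_task(records, num_reducers) for records in map_inputs]

# Reducer r collects bucket r from every map task.
reducer_inputs = [
    [pair for buckets in map_outputs for pair in buckets[r]]
    for r in range(num_reducers)
]

# Each map task holds one bucket ("file") per reducer.
total_buckets = len(map_outputs) * num_reducers
print(total_buckets, reducer_inputs)
```

With M map tasks and R reducers this scheme yields M × R buckets, which is why the number of intermediate files grows quickly on large jobs.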
![Page 19: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/19.jpg)
The Reduce Side
The map phase pushes the data to the reducers in the form of intermediate (shuffle) files. These files are written to the reducer’s memory, and the reduce functionality is invoked.
Source: [6]
[Figure: reduce side, Hadoop MapReduce vs Spark]
![Page 20: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/20.jpg)
Better for Iterative Computations
Data sharing in Hadoop is slow due to replication, serialization and disk I/O.
Source: [16]
[Figure: iterative computations, Hadoop MapReduce vs Spark]
![Page 21: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/21.jpg)
Better for Interactive Computations
For the same reason, Hadoop underperforms for interactive (low-latency) computations.
Source: [16]
[Figure: interactive computations, Hadoop MapReduce vs Spark]
![Page 22: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/22.jpg)
Spark on HDFS
Can Spark replace Hadoop?
![Page 23: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/23.jpg)
The combination of Hadoop and Spark
Operational applications augmented by in-memory performance:
Source: [14]
Hadoop features
Spark features
![Page 24: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/24.jpg)
K-Means
Use case in machine learning: iterative algorithm for clustering data
![Page 25: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/25.jpg)
The Algorithm
K-Means forms clusters of data points by minimizing the sum of squared distances between the data points and their centroids.
Source: [6]
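The iterative nature of the algorithm shows up in a small plain-Python sketch. It is not the Hadoop or Spark implementation compared in [6]: the helper names, the sample points, the random initialization and the iteration limit are assumptions made for this example. Each iteration assigns every point to its nearest centroid and then moves each centroid to the mean of its cluster, which is exactly the kind of repeated pass over the same data discussed in this presentation.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Assignment step: put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, max_iterations=20, seed=0):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = random.Random(seed).sample(points, k)  # assumed initialization
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # Update step: an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical data: two well-separated groups of 2D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(centroids, clusters)
```

Every iteration reads the full data set again, so keeping the points in memory between iterations pays off more the longer the algorithm runs.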
![Page 26: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/26.jpg)
A short comparison:
~227 Lines of Code
~64 Lines of Code
![Page 27: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/27.jpg)
Results by S. Gopalani, R. Arora
The results clearly showed that Spark's performance turned out to be considerably better in terms of time taken.
Source: [6]
Experimental Environment
• 64 MB and 1240 MB with a single node, and 1240 MB with two nodes
• Performance was monitored in terms of the time taken for clustering as per the requirements
• The machines used were configured as follows: 4 GB RAM, Linux Ubuntu, 500 GB hard drive
![Page 28: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/28.jpg)
Results by M. Zaharia et al.
Spark outperforms Hadoop by up to 20x in iterative machine learning and graph applications.
Source: [13]
![Page 29: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/29.jpg)
Source: [1]
High Performance Computing…
Apache Hadoop and Spark within the context of big data analytics:
![Page 30: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/30.jpg)
MPI and HARP Performance
HPC* tools perform better than Hadoop and Spark, but performance can be boosted using a hybrid approach with technologies that blend HPC and big data, including Spark and HARP.
Source: [17] *HPC: High Performance Computing
![Page 31: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/31.jpg)
Thank you for your attention!
...any questions?
![Page 32: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/32.jpg)
Literature
Resources used for this presentation
![Page 33: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/33.jpg)
[1] B. Zhang. A Brief Introduction of Existing Big Data Tools - A Presentation, Retrieved August 2015, URL: http://scholarwiki.indiana.edu/Z604/slides/big%20data%20tools%20v2.pdf
[2] G. Fox. Multi-faceted Classification of Big Data Uses and Proposed Architecture Integrating High Performance Computing and the Apache Stack – A Presentation for the Sixth International Workshop on Cloud Data Management, Cloud DB 2014, Chicago March 2014.
[3] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. Big Data (BigData Congress), 2014 IEEE International Congress on. IEEE, 2014.
[4] T. White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2010.
[5] T. Duarte. Anatomy of RDD - An Explanatory Video Illustration, Retrieved in June 2015. URL: http://www.sparkinternals.com/
![Page 34: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/34.jpg)
[6] S. Gopalani, R. Arora. Comparing Apache Spark and Map Reduce with Performance Analysis using K-Means. International Journal of Computer Applications (0975 - 8887), 113(1), March 2015.
[7] Apache, Inc. Apache™ Hadoop® Documentation, Retrieved in July 2015. URL: http://www.apache.org/
[8] Hortonworks, Inc. Hortonworks Data Platform: Getting Started Guide – A Whitepaper, May 2014
[9] Apache, Inc. Apache™ Spark Documentation, Retrieved in July 2015. URL: http://www.apache.org/
[10] A. Murthy, Hortonworks, Inc. Apache Hadoop 2 is now GA! – A Blog Entry, October 2013, Retrieved August 2015. URL: http://hortonworks.com/blog/apache-hadoop-2-is-ga/
![Page 35: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/35.jpg)
[11] Edureka!. Apache Hadoop 2.0 and YARN – Instruction, October 2013, Retrieved in August 2015, URL: http://www.edureka.co/blog/apache-hadoop-2-0-and-yarn/
[12] V. Shukla, R. Venkatesh. Hortonworks, Inc. Spark Webinar Presentation, October 2014
[13] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. University of California, Berkeley, 2012.
[14] MC Srivas, MapR Technologies, Inc. Why Spark on Hadoop Matters – A Presentation, July 2014.
[15] Y. Wang, R. Goldstone, W. Yu, T. Wang. Characterization and optimization of memory-resident MapReduce on HPC systems. 2014 IEEE 28th International Parallel & Distributed Processing Symposium.
![Page 36: Spark vs Hadoop](https://reader031.fdocuments.us/reader031/viewer/2022022203/5878524f1a28ab68198b684f/html5/thumbnails/36.jpg)
[16] Databricks, Inc. Intro to Apache Spark – A Workshop Presentation, Retrieved in August 2015. URL: http://training.databricks.com/workshop/itas_workshop.pdf
[17] S. Jha, J. Qiu, A. Luckow, P. Mantha, G. C. Fox. A tale of two data-intensive paradigms: Applications, abstractions, and architectures. Big Data (BigData Congress), 2014 IEEE International Congress on (pp. 645-652). IEEE. June 2014.
[18] IBM, Inc. What is MapReduce? – An Explanatory Article, Retrieved in August 2015. URL: http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/