Spark - The beginnings
Spark – The beginnings
Daniel Leon
Optymyze
[7th of November 2015]
Content
1) Hadoop Dilemma
2) Processing engines war
3) Spark ecosystem
4) Resilient Distributed Datasets
5) Spark application workflow
6) Conclusion
Big data technologies
Hadoop – Where does it end?
Hadoop Architecture
Hadoop evolution
Map Reduce workflow
Hadoop ecosystem
Beyond Map Reduce
• Complex iterative algorithms
• Interactive queries
• Real-time processing
Different processing model
• More operations available
• Flexible way of composing operations
• Pluggable data sources
• Built-in streaming capabilities
• Pluggable algorithms
Searching for another processing engine
Processing engine comparison
Processing engine comparison
Spark ecosystem
Spark ecosystem
100TB Daytona Sort Competition 2014
Resilient Distributed Dataset - RDD
• Stored in memory and storage
• Immutable
• Enables parallel operations on collections of elements
• Contains lineage information
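The properties above can be illustrated with a small pure-Python sketch. This is a toy model, not Spark's implementation: the class `ToyRDD` and its methods `lineage` and `compute_partition` are hypothetical names invented for this example. It assumes only that an RDD is split into partitions, is never modified in place, and remembers the steps that produced it so that any single partition can be recomputed.

```python
class ToyRDD:
    """Toy stand-in for an RDD: read-only partitions plus lineage."""

    def __init__(self, partitions=None, parent=None, fn=None, label="source"):
        # Source data is frozen into tuples so it cannot be edited later.
        self._partitions = (
            None if partitions is None else tuple(tuple(p) for p in partitions)
        )
        self._parent = parent  # the dataset this one was derived from
        self._fn = fn          # the function applied to each parent element
        self.label = label     # human-readable name of this step

    def map(self, fn, label):
        # A transformation returns a new object and leaves this one untouched.
        return ToyRDD(parent=self, fn=fn, label=label)

    def lineage(self):
        # Walk back to the source and list the steps in execution order.
        steps, node = [], self
        while node is not None:
            steps.append(node.label)
            node = node._parent
        return list(reversed(steps))

    def num_partitions(self):
        if self._parent is None:
            return len(self._partitions)
        return self._parent.num_partitions()

    def compute_partition(self, index):
        # Any single partition can be rebuilt from the source via the lineage.
        if self._parent is None:
            return list(self._partitions[index])
        return [self._fn(x) for x in self._parent.compute_partition(index)]

    def collect(self):
        # Partitions are independent, so they could be computed in parallel.
        result = []
        for index in range(self.num_partitions()):
            result.extend(self.compute_partition(index))
        return result


base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2, "map(double)")
shifted = doubled.map(lambda x: x + 1, "map(add one)")
```

In this sketch, `shifted.lineage()` lists the steps from the source onward, and `shifted.compute_partition(1)` rebuilds only the second partition. That mirrors how lineage information lets a lost partition be recomputed rather than restored from a replica.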
Resilient Distributed Dataset - RDD
Constructing RDDs
• Parallelize existing collections
  ◦ RDD = sc.parallelize(["a", "b", "c"])
• From files in HDFS, S3, Hive
  ◦ linesRDD = sc.textFile("README")
• Transforming an existing RDD
Operations on RDDs
• Transformations – lazy
  ◦ filter
  ◦ map
  ◦ groupBy
• Actions
  ◦ count
  ◦ collect
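The split between lazy transformations and actions can be shown by analogy, using Python's built-in `filter` and `map`, which are also lazy. This is only an analogy and does not use Spark: `list(...)` plays the role of an action such as `collect`, and the `calls` list is a hypothetical probe added to show when work actually happens.

```python
calls = []  # records every time the predicate really runs


def is_even(n):
    calls.append(n)
    return n % 2 == 0


numbers = range(1, 7)

# "Transformations": these only describe the pipeline, nothing is evaluated.
evens = filter(is_even, numbers)
squares = map(lambda n: n * n, evens)
calls_before_action = len(calls)

# "Action": asking for the results forces the whole pipeline to run.
result = list(squares)
calls_after_action = len(calls)
```

Here `calls_before_action` is zero because no element has been examined yet, and the predicate only runs once the final `list(...)` call asks for results. In PySpark the same shape would be a chain of `filter` and `map` on an RDD followed by `collect()`.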
Spark terminology
• Job – the work required to compute an RDD
• Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
• Task – a unit of work within a stage, corresponding to one RDD partition
• Shuffle – the transfer of data between stages
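The terms above can be made concrete with a toy word count in plain Python. This is a sketch under simplifying assumptions, not how Spark schedules work: the functions `map_task`, `shuffle`, `reduce_task` and `run_job` are hypothetical names, everything runs sequentially in one process, and the key is hashed to pick a bucket.

```python
from collections import defaultdict


def map_task(partition):
    """One task of the first stage: emit (word, 1) for one partition."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(task_outputs, num_buckets):
    """Move each pair to the bucket chosen by its key (the stage boundary)."""
    buckets = [defaultdict(list) for _ in range(num_buckets)]
    for pairs in task_outputs:
        for key, value in pairs:
            buckets[hash(key) % num_buckets][key].append(value)
    return buckets


def reduce_task(bucket):
    """One task of the second stage: sum the values of every key it owns."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(partitions, num_buckets=2):
    """The job: all the work needed to produce the final counts."""
    stage_one = [map_task(p) for p in partitions]   # one task per partition
    buckets = shuffle(stage_one, num_buckets)       # data moves between stages
    stage_two = [reduce_task(b) for b in buckets]   # one task per bucket
    counts = {}
    for partial in stage_two:
        counts.update(partial)  # each key lives in exactly one bucket
    return counts


partitions = [["spark is fast", "spark is lazy"], ["hadoop is batch"]]
word_counts = run_job(partitions)
```

Each call to `map_task` or `reduce_task` stands for one task. The two list comprehensions in `run_job` stand for the two stages, and `shuffle` is the only point where data from different partitions is mixed.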
Spark architecture
Conclusion
Spark:
• Is a complete, standalone solution for distributed processing
• Offers a fluent API
• Plugs into other big data frameworks
• Is one of the most actively developed Apache projects
Documentation
https://hadoopecosystemtable.github.io
https://databricks.com/spark/developer-resources
https://databricks.com/resources/slides
https://databricks.com/spark/training