Spark – The beginnings
Daniel Leon
Optymyze
7th of November 2015
Content
1) Hadoop Dilemma
2) Processing engines war
3) Spark ecosystem
4) Resilient Distributed Datasets
5) Spark application workflow
6) Conclusion
Big data technologies
Hadoop – Where does it end?
Hadoop Architecture
Hadoop evolution
Map Reduce workflow
Hadoop ecosystem
Beyond Map Reduce
• Complex iterative algorithms
• Interactive queries
• Real-time processing
Different processing model
• More operations available
• Flexible way of composing operations
• Pluggable data sources
• Streaming capabilities built in
• Pluggable algorithms
Searching for another processing engine
Processing engine comparison
Processing engine comparison
Spark ecosystem
Spark ecosystem
100TB Daytona Sort Competition 2014
Resilient Distributed Dataset – RDD
• Stored in memory and on storage
• Immutable
• Enables parallel operations on collections of elements
• Contains lineage information
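The last two properties work together: because a dataset is immutable and records how it was derived, a lost partition can be recomputed from its lineage instead of being replicated. A toy plain-Python model of that idea (illustrative only, not Spark's actual API or internals):

```python
class ToyRDD:
    """Toy model of an immutable dataset that records its lineage."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only set for source datasets
        self.parent = parent   # lineage: which dataset this was derived from
        self.fn = fn           # lineage: how to derive it from the parent

    def map(self, fn):
        # Transformations never mutate; they return a new dataset node.
        return ToyRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def compute(self):
        # Recompute from lineage on demand, as Spark does after a loss.
        if self._data is not None:
            return self._data
        return self.fn(self.parent.compute())

source = ToyRDD(data=[1, 2, 3])
derived = source.map(lambda x: x * 10)
print(derived.compute())  # [10, 20, 30] – source itself is untouched
```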
Resilient Distributed Dataset – RDD
Constructing RDDs
• Parallelize existing collections
  RDD = sc.parallelize(["a", "b", "c"])
• From files in HDFS, S3, Hive
  linesRDD = sc.textFile("README")
• Transforming an existing RDD
Operations on RDDs
• Transformations – lazy
  • filter
  • map
  • groupBy
• Actions
  • count
  • collect
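Lazy transformations behave much like Python iterators: chaining them builds a pipeline but computes nothing until an action forces evaluation. A rough plain-Python analogy (not Spark code):

```python
# Plain-Python analogy for lazy transformations vs. eager actions.
data = range(1, 6)

evens = filter(lambda x: x % 2 == 0, data)   # "transformation": lazy
doubled = map(lambda x: x * 2, evens)        # still lazy, nothing computed yet

result = list(doubled)                       # "action": forces evaluation
print(result)  # [4, 8]
```

In Spark the same shape holds: `rdd.filter(...).map(...)` only builds the lineage graph, and the work runs when an action such as `collect()` or `count()` is called.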
Spark terminology
• Job – the work required to compute an RDD
• Stage – a wave of work within a job, corresponding to one or more pipelined RDDs
• Task – a unit of work within a stage, corresponding to one RDD partition
• Shuffle – the transfer of data between stages
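The terms above can be sketched with a word-count-style job in plain Python. The partition layout and stage boundaries here are illustrative, not Spark's actual internals:

```python
from collections import defaultdict

# One RDD, two partitions: each partition is handled by one task.
partitions = [["a", "b", "a"], ["b", "b", "c"]]

# Stage 1: one task per partition emits (word, 1) pairs (pipelined map).
stage1_output = [[(word, 1) for word in part] for part in partitions]

# Shuffle: pairs move between stages so equal keys land together.
shuffled = defaultdict(list)
for task_output in stage1_output:
    for key, value in task_output:
        shuffled[key].append(value)

# Stage 2: sum the counts for each key group.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```

The whole computation is the job; the per-partition map work and the post-shuffle summing are the two stages, separated by the shuffle.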
Spark architecture
Conclusion
Spark is:
• A complete, standalone solution for distributed processing
• A fluent API
• Pluggable with other big data frameworks
• One of the most actively contributed-to Apache projects
Documentation
https://hadoopecosystemtable.github.io
https://databricks.com/spark/developer-resources
https://databricks.com/resources/slides
https://databricks.com/spark/training