Spark Processing 101
September 10, 2015
Justin Sun
Overview
- What is Spark?
- SparkContext
- Resilient Distributed Datasets (RDDs)
- Transformations
- Actions
- Code Examples
- Resources
What is Spark?
- General cluster computing system for Big Data
- Supports in-memory processing
- APIs for Scala, Java, and Python
- Additional libraries:
  - Spark Streaming – process live data streams
  - Spark SQL – SQL and DataFrames
  - MLlib – machine learning
  - GraphX – graph processing
SparkContext
- Starting point for working with Spark
- Specifies access to a cluster or the local machine
- Required if you write a standalone program
- Provided as 'sc' by the Spark shell

Scala:
val conf = new SparkConf().setAppName("Simple App")
val sc = new SparkContext(conf)
Java:
SparkConf conf = new SparkConf().setAppName("Simple App");
JavaSparkContext sc = new JavaSparkContext(conf);
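As a minimal sketch, a standalone Scala program can wrap the two lines above and point the context at a local master instead of a cluster (the object name and the "local[2]" setting here are illustrative, not from the original):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark locally with two worker threads;
    // on a real cluster this would be the cluster manager's URL.
    val conf = new SparkConf()
      .setAppName("Simple App")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    // ... use sc to build and run RDD operations ...
    sc.stop() // release resources when done
  }
}
```

In the Spark shell this setup is done for you, which is why 'sc' is already available there.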
Resilient Distributed Datasets (RDDs)
- Main abstraction in Spark
- Fault-tolerant
- Supports parallel operations
- Create RDDs by:
  - Calling sc.parallelize()
  - Reading data from an external source:
    - Text file – sc.textFile()
    - HDFS source
    - Cassandra
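A short sketch of both creation paths, assuming an existing SparkContext 'sc' (the file path is illustrative):

```scala
// From an in-memory collection: the data is split into partitions
// that can be operated on in parallel.
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From an external text file; each element of the resulting RDD
// is one line of the file. The same call accepts HDFS URLs.
val lines = sc.textFile("data/input.txt")
```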
Transformations
- Immutable after creation
- Enable parallel computations
- Input is an RDD, output is a pointer to an RDD
- Can be chained together
- Arguments are functions or closures
- Lazy evaluation: nothing happens until an action is run
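The chaining and laziness described above can be sketched as follows, assuming a SparkContext 'sc' (the file path and the word-length logic are illustrative):

```scala
// Each transformation returns a new RDD; none of these lines
// does any work yet because evaluation is lazy.
val lines     = sc.textFile("data/input.txt")
val words     = lines.flatMap(line => line.split(" "))
val longWords = words.filter(word => word.length > 3)
val lengths   = longWords.map(word => word.length)
// Only an action, e.g. lengths.count(), triggers the computation.
```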
Actions
- The program is run when an action is called
- Examples: reduce(), collect(), count(), first(), take()
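A sketch of the listed actions on a small RDD, assuming a SparkContext 'sc':

```scala
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

nums.count()                  // 5 — number of elements
nums.first()                  // 1 — first element
nums.take(3)                  // Array(1, 2, 3) — first three elements
nums.collect()                // Array(1, 2, 3, 4, 5) — pulls all data to the driver
nums.reduce((a, b) => a + b)  // 15 — combines elements with the given function
```

Note that collect() brings the entire dataset back to the driver program, so it is only safe on small RDDs.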
Visual Transformations
- DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
- Covers map(), filter(), flatMap(), and more
Code Examples
- http://spark.apache.org/docs/latest/quick-start.html
Resources
- Spark website – http://spark.apache.org/docs/latest
- Quick Start – http://spark.apache.org/docs/latest/quick-start.html
- DataBricks Developer Resources – https://databricks.com/spark/developer-resources
- Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
- edX.org online courses:
  - CS100.1X – Introduction to Big Data with Apache Spark
  - CS190.1X – Scalable Machine Learning