Spark Processing 101
September 10, 2015
Justin Sun

Overview
- What is Spark?
- SparkContext
- Resilient Distributed Datasets (RDDs)
- Transformations
- Actions
- Code Examples
- Resources

What is Spark?
- General cluster computing system for Big Data
- Supports in-memory processing
- APIs for Scala, Java, and Python
- Additional libraries:
  - Spark Streaming – Process live data streams
  - Spark SQL – SQL and DataFrames
  - MLlib – Machine learning
  - GraphX – Graph processing
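
Spark Context
- Starting point for working with Spark
- Specifies access to a cluster or the local machine
- Required if you write a standalone program
- Provided as ‘sc’ by the Spark shell

Scala:
    import org.apache.spark.{SparkConf, SparkContext}
    // The master URL is usually supplied at launch, e.g. spark-submit --master
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)

Java:
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);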

Resilient Distributed Datasets (RDDs)
- Main abstraction in Spark
- Fault-tolerant
- Supports parallel operations
- Create RDDs by:
  - Calling sc.parallelize()
  - Reading in data from an external source:
    - Text file – sc.textFile()
    - HDFS source
    - Cassandra
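
The two ways of creating an RDD listed above can be sketched in Scala as follows. This is a minimal illustration, not the presenter's code: the object name `CreateRdds`, the `local[*]` master, and the temporary sample file are all assumptions made for the sketch. Reading from Cassandra needs a separate connector library and is not shown.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, named here only for illustration.
object CreateRdds {
  // Build an RDD from an in-memory collection held by the driver program.
  def fromCollection(sc: SparkContext): RDD[Int] =
    sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Build an RDD with one element per line of a text file.
  // The path can also be an HDFS URI (hdfs://...); that case is not exercised here.
  def fromTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  def main(args: Array[String]): Unit = {
    // Assumption: "local[*]" runs Spark inside this JVM using all local cores.
    val conf = new SparkConf().setAppName("Create RDDs").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample input written to a temporary file.
      val tmp = Files.createTempFile("spark-101-", ".txt")
      Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))

      println(fromCollection(sc).count())
      println(fromTextFile(sc, tmp.toString).count())
    } finally {
      sc.stop()
    }
  }
}
```

In the Spark shell the `SparkConf` and `SparkContext` lines are unnecessary, because `sc` is already provided.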

Transformations
- RDDs are immutable after creation
- Enable parallel computations
- Input is an RDD, output is a pointer to an RDD
- Can be chained together
- Arguments are functions or closures
- Lazy evaluation: Nothing happens until an action is run
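
Lazy evaluation is easier to see with a counter that records when the function passed to a transformation actually runs. The sketch below is an illustration under several assumptions: it uses `sc.longAccumulator`, which requires Spark 2.0 or later (newer than the release current when this talk was given), it runs with a `local[*]` master, and the object name `LazyEvaluation` is hypothetical. It also assumes no task is retried, since a retried task could add to the accumulator more than once.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, named here only for illustration.
object LazyEvaluation {
  // Returns (map calls observed before the action, map calls observed after it).
  def run(sc: SparkContext): (Long, Long) = {
    // Accumulator used only to observe when the map function really executes.
    val calls = sc.longAccumulator("map calls")

    // Chained transformations: each call returns a new RDD, and nothing runs yet.
    val multiplesOfFour = sc.parallelize(1 to 10)
      .map { n => calls.add(1); n * 2 }
      .filter(_ % 4 == 0)

    // No action has been called so far, so the map function has not executed.
    val before = calls.value.longValue()

    // count() is an action: it triggers the whole chain above.
    multiplesOfFour.count()

    val after = calls.value.longValue()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: "local[*]" runs Spark inside this JVM using all local cores.
    val conf = new SparkConf().setAppName("Lazy Evaluation").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = run(sc)
      println(s"map calls before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

The accumulator is there only to make the timing visible; ordinary programs simply chain transformations and finish with an action.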

Actions
- Program is run when an action is called
- Examples:
  - reduce()
  - collect()
  - count()
  - first()
  - take()
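
A rough sketch of the five actions listed above, applied to a small RDD of integers. The object name `ActionsDemo`, the `Results` case class, the sample data, and the `local[*]` master are assumptions made for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, named here only for illustration.
object ActionsDemo {
  // Simple container for the value each action returns to the driver.
  final case class Results(sum: Int, all: Seq[Int], count: Long, first: Int, firstTwo: Seq[Int])

  def run(sc: SparkContext): Results = {
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    Results(
      sum = numbers.reduce(_ + _),       // combine all elements with a function
      all = numbers.collect().toSeq,     // bring every element back to the driver
      count = numbers.count(),           // number of elements
      first = numbers.first(),           // first element
      firstTwo = numbers.take(2).toSeq   // first n elements
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: "local[*]" runs Spark inside this JVM using all local cores.
    val conf = new SparkConf().setAppName("Actions Demo").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(run(sc))
    } finally {
      sc.stop()
    }
  }
}
```

Note that `collect()` copies the entire RDD into the driver's memory, so it is best kept for small results; `take()` and `first()` are safer for peeking at large datasets.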

Visual Transformations
- DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
- map()
- filter()
- flatMap()
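
The three transformations named above differ in how many output elements each input element produces: `map()` yields exactly one, `filter()` yields zero or one, and `flatMap()` yields zero or more. The sketch below shows this on two short strings; the object name `BasicTransformations`, the sample data, and the `local[*]` master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, named here only for illustration.
object BasicTransformations {
  // Returns (words, word lengths, lengths greater than 2), collected to the driver.
  def run(sc: SparkContext): (Seq[String], Seq[Int], Seq[Int]) = {
    val lines = sc.parallelize(Seq("hello world", "hi"))

    // flatMap: each line becomes zero or more words.
    val words = lines.flatMap(line => line.split(" "))

    // map: each word becomes exactly one number, its length.
    val lengths = words.map(word => word.length)

    // filter: keep only the lengths that satisfy the predicate.
    val longLengths = lengths.filter(len => len > 2)

    (words.collect().toSeq, lengths.collect().toSeq, longLengths.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: "local[*]" runs Spark inside this JVM using all local cores.
    val conf = new SparkConf().setAppName("Basic Transformations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (words, lengths, longLengths) = run(sc)
      println(words.mkString(", "))
      println(lengths.mkString(", "))
      println(longLengths.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```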

Code examples
- http://spark.apache.org/docs/latest/quick-start.html

Resources
- Spark website – http://spark.apache.org/docs/latest
- Quick Start – http://spark.apache.org/docs/latest/quick-start.html
- DataBricks Developer Resources – https://databricks.com/spark/developer-resources
- Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
- edX.org Online Courses
  - CS100.1X – Introduction to Big Data with Apache Spark
  - CS190.1X – Scalable Machine Learning