Post on 30-Nov-2020

Spark Processing 101

September 10, 2015

Justin Sun

Overview
- What is Spark?
- SparkContext
- Resilient Distributed Datasets (RDDs)
- Transformations
- Actions
- Code Examples
- Resources

What is Spark?
- General cluster computing system for Big Data
- Supports in-memory processing
- APIs for Scala, Java, and Python
- Additional libraries:
  - Spark Streaming – Process live data streams
  - Spark SQL – SQL and DataFrames
  - MLlib – Machine learning
  - GraphX – Graph processing

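SparkContext
- Starting point for working with Spark
- Specifies access to cluster or local machine
- Required if you write a standalone program
- Provided as ‘sc’ by the Spark shell

Scala:

    import org.apache.spark.{SparkConf, SparkContext}

    // Name the application, then create the context from that configuration
    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)

Java:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);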
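
As a rough sketch of how the configuration decides between a cluster and the local machine, the standalone program below sets the master URL explicitly. The `local[*]` master, the object name `LocalContextSketch`, and the helper `createLocalContext` are assumptions for illustration and do not come from the slides; when a job is launched on a real cluster, the master is usually supplied at launch time instead. The sketch assumes the Spark core library is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalContextSketch {
  // Builds a context that runs Spark inside this JVM (assumed "local[*]" master).
  def createLocalContext(): SparkContext = {
    val conf = new SparkConf()
      .setAppName("Simple App")
      .setMaster("local[*]") // assumption: use all local cores instead of a cluster URL
    new SparkContext(conf)
  }

  def main(args: Array[String]): Unit = {
    val sc = createLocalContext()
    try {
      println(s"Running against master: ${sc.master}")
    } finally {
      sc.stop() // release the context when the program is done
    }
  }
}
```

In the Spark shell none of this is needed, since the shell already provides the context as `sc`.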

Resilient Distributed Datasets (RDDs)
- Main abstraction in Spark
- Fault-tolerant
- Supports parallel operations
- Create RDDs by:
  - Calling sc.parallelize()
  - Reading in data from an external source:
    - Text file – sc.textFile()
    - HDFS source
    - Cassandra
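
The two most direct ways of creating an RDD listed above can be sketched as follows. This is a minimal illustration and not code from the talk: the object name `CreateRddSketch`, the helper names, the temporary input file, and the `local[*]` master are all assumptions made so the example can run on one machine with Spark on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CreateRddSketch {
  // RDD built from an in-memory collection.
  def fromCollection(sc: SparkContext): RDD[Int] =
    sc.parallelize(Seq(1, 2, 3, 4, 5))

  // RDD with one element per line of a text file.
  def fromTextFile(sc: SparkContext, path: String): RDD[String] =
    sc.textFile(path)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("Create RDDs").setMaster("local[*]"))
    try {
      // Hypothetical input: a temporary file written only for this sketch.
      val tmp = Files.createTempFile("rdd-sketch", ".txt")
      Files.write(tmp, "first line\nsecond line\n".getBytes("UTF-8"))

      println(fromCollection(sc).count())             // number of elements
      println(fromTextFile(sc, tmp.toString).count()) // number of lines
    } finally {
      sc.stop()
    }
  }
}
```

`sc.textFile()` also accepts a path with an `hdfs://` scheme. Reading from Cassandra relies on a separate connector library, which this sketch does not cover.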

Transformations
- RDDs are immutable after creation, so a transformation returns a new RDD instead of modifying its input
- Enable parallel computations
- Input is an RDD, output is a pointer to an RDD
- Can be chained together
- Arguments are functions or closures
- Lazy evaluation: nothing happens until an action is run
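
Chaining and lazy evaluation can be illustrated with a small sketch. The names `LazyChainSketch` and `evenTimesTen`, the sample numbers, and the `local[*]` master are assumptions chosen for the example; it assumes Spark is on the classpath. Calling `evenTimesTen` only describes the computation, and the work is done when `collect()` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LazyChainSketch {
  // Two chained transformations; each one returns a new RDD.
  def evenTimesTen(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 10)
      .filter(n => n % 2 == 0) // keep the even numbers
      .map(n => n * 10)        // scale each remaining element

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("Lazy chain").setMaster("local[*]"))
    try {
      val result = evenTimesTen(sc) // nothing has been computed yet
      // collect() is an action, so the chain above runs here.
      println(result.collect().mkString(", ")) // prints 20, 40, 60, 80, 100
    } finally {
      sc.stop()
    }
  }
}
```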

Actions
- Program is run when an action is called
- Examples:
  - reduce()
  - collect()
  - count()
  - first()
  - take()
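
The five actions listed above can be tried on a tiny RDD, as in the sketch below. The object `ActionsSketch`, the `Summary` case class, the sample data, and the `local[*]` master are assumptions made for illustration; the sketch assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsSketch {
  // Hypothetical holder for the results of each action.
  case class Summary(sum: Int, all: Seq[Int], size: Long, head: Int, firstTwo: Seq[Int])

  def summarize(sc: SparkContext): Summary = {
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
    Summary(
      sum = nums.reduce(_ + _),       // combine all elements: 15
      all = nums.collect().toSeq,     // bring every element to the driver
      size = nums.count(),            // number of elements: 5
      head = nums.first(),            // first element: 1
      firstTwo = nums.take(2).toSeq   // first two elements: 1, 2
    )
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("Actions").setMaster("local[*]"))
    try {
      println(summarize(sc))
    } finally {
      sc.stop()
    }
  }
}
```

Each action returns an ordinary value to the driver program, which is why `collect()` is best kept for data that is known to be small.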

Visual Transformations
- Databricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
  - map()
  - filter()
  - flatMap()
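
For readers without the visual guide at hand, the three transformations named above behave roughly as in this sketch. The object `BasicTransformationsSketch`, the helper names, the sample sentences, and the `local[*]` master are assumptions made for illustration; the sketch assumes Spark is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BasicTransformationsSketch {
  // flatMap: each line produces zero or more words.
  def words(lines: RDD[String]): RDD[String] = lines.flatMap(_.split(" "))

  // map: exactly one output element per input element.
  def lengths(ws: RDD[String]): RDD[Int] = ws.map(_.length)

  // filter: keep only the elements that satisfy the predicate.
  def longWords(ws: RDD[String]): RDD[String] = ws.filter(_.length > 2)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("Basic transformations").setMaster("local[*]"))
    try {
      val ws = words(sc.parallelize(Seq("to be or", "not to be")))
      println(ws.collect().mkString(" "))            // to be or not to be
      println(lengths(ws).collect().mkString(" "))   // 2 2 2 3 2 2
      println(longWords(ws).collect().mkString(" ")) // not
    } finally {
      sc.stop()
    }
  }
}
```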

Code examples
- http://spark.apache.org/docs/latest/quick-start.html

Resources
- Spark website – http://spark.apache.org/docs/latest
- Quick Start – http://spark.apache.org/docs/latest/quick-start.html
- Databricks Developer Resources – https://databricks.com/spark/developer-resources
- Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
- edX.org Online Courses
  - CS100.1X – Introduction to Big Data with Apache Spark
  - CS190.1X – Scalable Machine Learning