Transcript of Spark Processing 101

Page 1: Spark Processing 101

September 10, 2015

Justin Sun

Page 2: Overview

- What is Spark?
- SparkContext
- Resilient Distributed Datasets (RDDs)
- Transformations
- Actions
- Code Examples
- Resources

Page 3: What is Spark?

- General cluster computing system for Big Data
- Supports in-memory processing
- APIs for Scala, Java, and Python
- Additional libraries:
  - Spark Streaming – process live data streams
  - Spark SQL – SQL and DataFrames
  - MLlib – machine learning
  - GraphX – graph processing

Page 4: Spark Context

- Starting point for working with Spark
- Specifies access to a cluster or the local machine
- Required if you write a standalone program
- Provided as 'sc' by the Spark shell

Scala:

import org.apache.spark.{SparkConf, SparkContext}

// setAppName labels the job; add .setMaster("local[*]") to run on the local machine
val conf = new SparkConf().setAppName("Simple App")
val sc = new SparkContext(conf)

Java:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("Simple App");
JavaSparkContext sc = new JavaSparkContext(conf);

Page 5: Resilient Distributed Datasets (RDDs)

- Main abstraction in Spark
- Fault-tolerant
- Supports parallel operations
- Create RDDs by (both paths are sketched below):
  - Calling sc.parallelize()
  - Reading in data from an external source:
    - Text file – sc.textFile()
    - HDFS source
    - Cassandra
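A minimal sketch of both creation paths, assuming the 'sc' SparkContext from the previous page and a hypothetical local file data.txt:

Scala:

// From an in-memory collection: Spark partitions the Seq across workers
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// From an external source: each line of the file becomes one element
val lines = sc.textFile("data.txt")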

Page 6: Transformations

- Immutable after creation
- Enable parallel computations
- Input is an RDD, output is a pointer to an RDD
- Can be chained together (see the sketch below)
- Arguments are functions or closures
- Lazy evaluation: nothing happens until an action is run
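A short sketch of chaining, reusing the hypothetical 'lines' RDD from the previous page; these calls only record the lineage, and no data is processed yet:

Scala:

// Each transformation returns a new RDD; evaluation is deferred
val wordLengths = lines
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => word.length)          // word -> its length
  .filter(len => len > 3)            // keep lengths greater than 3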

Page 7: Actions

- Program is run when an action is called
- Examples (see the sketch below):
  - reduce()
  - collect()
  - count()
  - first()
  - take()
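Continuing the sketch from the previous page, calling an action is what finally runs the chained transformations (assumes the hypothetical 'wordLengths' RDD):

Scala:

// Each action forces evaluation of the whole lineage
val totalChars = wordLengths.reduce(_ + _)  // sum of all lengths
val howMany = wordLengths.count()           // number of elements
val firstFive = wordLengths.take(5)         // first five elements as a local array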

Page 8: Visual Transformations

DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf

- map()
- filter()
- flatMap()
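A quick contrast of the three operations on a tiny hypothetical RDD:

Scala:

val pairs = sc.parallelize(Seq("a b", "c"))
pairs.map(s => s.split(" "))       // RDD[Array[String]]: one array per input element
pairs.flatMap(s => s.split(" "))   // RDD[String]: "a", "b", "c" (results flattened)
pairs.filter(s => s.contains("a")) // RDD[String]: keeps only "a b"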

Page 9: Code examples

http://spark.apache.org/docs/latest/quick-start.html
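A standalone-app sketch in the spirit of the linked quick start, counting lines that contain 'a' and 'b'; the README.md path is an assumption, and the master URL would be supplied by spark-submit:

Scala:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // cache() keeps the RDD in memory across the two count() actions
    val logData = sc.textFile("README.md").cache()  // hypothetical input path
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}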

Page 10: Resources

- Spark website – http://spark.apache.org/docs/latest
- Quick Start – http://spark.apache.org/docs/latest/quick-start.html
- DataBricks Developer Resources – https://databricks.com/spark/developer-resources
- Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
- edX.org Online Courses:
  - CS100.1X – Introduction to Big Data with Apache Spark
  - CS190.1X – Scalable Machine Learning