Introduction to Spark - DataFactZ
Transcript of Introduction to Spark - DataFactZ
2
Introduction to Apache Spark
3
Agenda
What is Apache Spark? Architecture, Spark History, Spark vs. Hadoop, Getting Started
Scala - A Scalable Language
Spark Core: RDD, Transformations, Actions, Lazy Evaluation in action
Working with KV Pairs: Pair RDDs, Joins
Advanced Spark: Accumulators, Broadcast, Running on a Cluster, Standalone Programs
Spark SQL: DataFrames (SchemaRDD), Intro to Parquet, Parquet + Spark
Advanced Libraries: Spark Streaming, MLlib
4
What is Spark?
A distributed computing platform designed to be:
Fast - fast to develop distributed applications, fast to run distributed applications
General purpose - a single framework to handle a variety of workloads: batch, interactive, iterative, streaming, SQL
5
Fast & General Purpose
Speed: computations in memory; faster than MapReduce even for on-disk computations.
Generality: designed for a wide range of workloads; a single engine combines batch, interactive, iterative and streaming algorithms; rich high-level libraries and simple native APIs in Java, Scala and Python; reduces the management burden of maintaining separate tools.
6
Spark Architecture
DataFrame API / Packages
Spark Streaming, Spark SQL, MLlib, GraphX
Spark Core
Cluster managers: Standalone, YARN, Mesos
Data sources
7
Spark Unified Stack
8
Cluster Managers
Spark can run on a variety of cluster managers (spark-submit examples for each are sketched below):
Hadoop YARN - Yet Another Resource Negotiator is a cluster management technology and one of the key features in Hadoop 2.
Apache Mesos - abstracts CPU, memory, storage, and other compute resources away from machines, enabling fault-tolerant and elastic distributed systems.
Spark Standalone Scheduler – provides an easy way to get started on an empty set of machines.
Spark can leverage existing Hadoop infrastructure
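For illustration, here is a hedged sketch of submitting the same application to each manager; MyApp and myapp.jar are hypothetical names and the host/port values are placeholders:
# Hadoop YARN (Spark 1.x syntax)
$SPARK_HOME/bin/spark-submit --class MyApp --master yarn-cluster myapp.jar
# Apache Mesos
$SPARK_HOME/bin/spark-submit --class MyApp --master mesos://mesos-master:5050 myapp.jar
# Spark Standalone scheduler
$SPARK_HOME/bin/spark-submit --class MyApp --master spark://spark-master:7077 myapp.jar
# Local mode with 4 worker threads
$SPARK_HOME/bin/spark-submit --class MyApp --master local[4] myapp.jar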
9
Spark History
Started in 2009 as a research project in UC Berkeley RAD lab which became AMP Lab.
Spark researchers found that Hadoop MapReduce was inefficient for iterative and interactive computing.
Spark was designed from the beginning to be fast for interactive and iterative workloads, with support for in-memory storage and fault tolerance.
Apart from UC Berkeley, Databricks, Yahoo! and Intel are major contributors. Spark was open sourced in March 2010 and became an Apache Software Foundation project in June 2013.
10
Spark vs. Hadoop
Hadoop MapReduce: mostly suited for batch jobs; difficult to program directly in MR; batch doesn't compose well for large apps; specialized systems are needed as a workaround.
Spark: handles batch, interactive, and real-time within a single framework; native integration with Java, Python, Scala; programming at a higher level of abstraction; more general than MapReduce.
11
Getting Started
Multiple ways of using Spark:
Certified Spark distributions: DataStax Enterprise (Cassandra + Spark), Hortonworks HDP, MapR
Local/Standalone
Databricks Cloud
Amazon AWS EC2
12
Databricks Cloud
A hosted data platform powered by Apache Spark
Features: exploration and visualization, managed Spark clusters, production pipelines, support for 3rd-party apps (Tableau, Pentaho, QlikView)
Databricks Cloud trial: http://databricks.com/registration
13
Local Mode
Install Java JDK 6/7 on Mac OS X or Windows
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
Install Python 2.7 using Anaconda (only on Windows)
https://store.continuum.io/cshop/anaconda/
Download Apache Spark from Databricks and unzip the downloaded file
http://training.databricks.com/workshop/usb.zip
The provided link is for Spark 1.5.1, however the latest binary can also be obtained from
http://spark.apache.org/downloads.html
Change into the newly created spark-training directory
14
Exercise
The following steps demonstrate how to create a simple Spark program using Scala:
Create a collection of 1,000 integers
Use the collection to create a base RDD
Apply a function to keep only the numbers less than 50
Display the filtered values
Invoke the spark-shell and type the following code:
$SPARK_HOME/bin/spark-shell
val data = 0 to 1000
val distData = sc.parallelize(data)
val filteredData = distData.filter(s => s < 50)
filteredData.collect()
15
Functional Programming + Scala
16
Functional Programming
Computation as the evaluation of mathematical functions. Avoids changing state and mutable data.
Functions are treated as values, just like integers or literals. Functions can be passed as arguments and returned as results. Functions can be defined inside other functions.
Functions cannot have side effects. Functions communicate with the environment by taking arguments and returning results; they do not maintain state.
In a functional programming language the operations of a program should map input values to output values rather than change data in place (a small Scala sketch follows below).
Examples: Haskell, Scala
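A tiny Scala sketch of the idea, using plain collections (no Spark): the functional versions produce new values instead of mutating state.
// Imperative style: state is changed in place
var total = 0
for (n <- List(1, 2, 3, 4)) total += n

// Functional style: map input values to output values, no mutation
val doubled = List(1, 2, 3, 4).map(n => n * 2)   // List(2, 4, 6, 8)
val sum     = List(1, 2, 3, 4).reduce(_ + _)     // 10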
17
Scala – A Scalable Language
A multi-paradigm programming language with focus on functional programming.
A high-level language for the JVM: statically typed, object-oriented + functional. Generates bytecode that runs on top of any JVM. Comparable in speed to Java. Interoperates with Java: it can use any Java class and can be called from Java code.
Spark core is completely written in Scala.
Spark SQL, GraphX, Spark Streaming etc. are libraries written in Scala.
18
Scala – Main Features
What differentiates Scala from Java? Anonymous functions (Closures/Lambda functions).
Type inference (Statically Typed).
Implicit Conversions.
Pattern Matching.
Higher-order Functions.
19
Scala – Main Features
Anonymous functions (closures or lambda functions)
Regular function:
def containsString(x: String): Boolean = {
  x.contains("mysql")
}
Anonymous function:
x => x.contains("mysql")
_.contains("mysql") //shortcut notation
Type inference:
def squareFunc(x: Int) = {
  x * x
}
20
Scala – Main Features
Implicit conversions:
val a: Int = 1
val b: Int = 4
val myRange: Range = a to b
myRange.foreach(println)
// or, equivalently
(1 to 4).foreach(println)
Pattern matching:
val pairs = List((1, 2), (2, 3), (3, 4))
val result = pairs.filter(s => s._2 != 2)
val result2 = pairs.filter { case (x, y) => y != 2 } //same filter with pattern matching
Higher-order functions:
messages.filter(x => x.contains("mysql"))
messages.filter(_.contains("mysql"))
21
Scala – Exercise
1. Filter strings containing "mysql" from a list.
val lines = List("My first Scala program", "My first mysql query")
def containsString(x: String) = x.contains("mysql") //regular function
lines.filter(containsString) //higher order function
lines.filter(s => s.contains("mysql")) //anonymous function
lines.filter(_.contains("mysql")) //shortcut notation
2. From a list of tuples, keep the tuples that don't have 2 as their second element.
val pairs = List((1, 2), (2, 3), (3, 4))
pairs.filter(s => s._2 != 2) //no pattern matching
pairs.filter{ case(x, y) => y != 2 } //pattern matching
3. Functional operations map input to output and do not change data in place.
val nums = List(1, 2, 3, 4, 5)
val numSquares = nums.map(s => s * s) //returns square of each element
println(numSquares)
22
Spark Core
23
Directed Acyclic Graph (DAG)
DAG
A chain of MapReduce jobs forms a DAG; a Pig script defines a chain of MR jobs; a Spark program is also a DAG.
Limitations of Hadoop/MapReduce: a graph of MR jobs is scheduled to run sequentially, which is inefficient; between each MR job the DAG writes data to disk (HDFS); in MR the dataset is abstracted as KV pairs, called the KV store; MR jobs are batch processes, so the KV store cannot be queried interactively.
Advantages of Spark: Spark DAGs don't run like Hadoop/MR DAGs, so they are much more efficient; Spark DAGs run in memory as much as possible and spill over to disk only when needed; the Spark dataset is called an RDD; the RDD is stored in memory, so it can be queried interactively (a small sketch of lineage and lazy execution follows below).
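A minimal spark-shell sketch of the idea: the transformations only build up the DAG/lineage, and nothing runs until an action is called. README.md is the file shipped with the Spark download.
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)          // still nothing has executed
println(counts.toDebugString)  // prints the lineage (how the RDD was derived)
counts.count()                 // the action triggers execution of the whole DAG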
24
Resilient Distributed Dataset (RDD)
Spark's primary abstraction: a distributed collection of items called elements, which could be KV pairs or anything else.
RDDs are immutable. An RDD is a Scala object. Transformations and actions can be performed on RDDs.
An RDD can be created from an HDFS file, a local file, a parallelized collection, a JSON file, etc. (sketched below).
Data lineage (what makes an RDD resilient?): an RDD has lineage that keeps track of where the data came from and how it was derived. Lineage is stored in the DAG on the driver program. The DAG is logical only, because the compiler optimizes the DAG for efficiency.
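A hedged sketch of the creation options listed above, run in spark-shell; all file paths are hypothetical, and jsonFile is the Spark 1.x SQLContext API.
val fromCollection = sc.parallelize(1 to 100)                 // parallelized collection
val fromLocalFile  = sc.textFile("file:///tmp/input.txt")     // local file
val fromHdfsFile   = sc.textFile("hdfs:///data/input.txt")    // HDFS file
val sqlContext     = new org.apache.spark.sql.SQLContext(sc)  // already available as sqlContext in spark-shell
val fromJson       = sqlContext.jsonFile("data/people.json")  // JSON file (returns a DataFrame)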
25
RDD Visualized
26
RDD Operations
Transformations: operate on an RDD and return a new RDD; are lazily evaluated.
Actions: return a value after running a computation on an RDD.
Lazy evaluation: evaluation happens only when an action is called; deferring decisions allows better runtime optimization.
27
Spark Core
Transformations: operate on an RDD and return a new RDD; are lazily evaluated.
Actions: return a value after running a computation on an RDD; the DAG is evaluated only when an action takes place.
Lazy evaluation: only type checking happens when a DAG is compiled; evaluation happens only when an action is called; deferring decisions yields more information at runtime to better optimize the program. So a Spark program actually starts executing when an action is called (see the sketch below).
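A short spark-shell sketch of lazy evaluation; README.md is the file from the Spark download.
val lines    = sc.textFile("README.md")           // transformation: nothing is read yet
val filtered = lines.filter(_.contains("Spark"))  // transformation: still nothing executed
filtered.count()                                  // action: the file is read and the filter runs now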
28
Hello Spark! (Scala)
Simple word count app
Create an RDD from a text file:
val lines = sc.textFile("README.md")
Perform a series of transformations to compute the word count:
val words = lines.flatMap(_.split(" "))
val pairs = words.map(s => (s, 1))
val wordCounts = pairs.reduceByKey(_ + _)
Action: send word count results back to the driver program:
wordCounts.collect()
wordCounts.take(10)
Action: save word counts to a text file wordCounts.saveAsTextFile("../../WordCount")
How many times does the keyword “Spark” occur?
29
Hello Spark! (Python)
Simple word count app
Create an RDD from a text file:
lines = sc.textFile("README.md")
Perform a series of transformations to compute the word count:
words = lines.flatMap(lambda l: l.split(" "))
pairs = words.map(lambda s: (s, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
Action: send word count results back to the driver program:
wordCounts.collect()
wordCounts.take(10)
Action: save word counts to a text file wordCounts.saveAsTextFile("WordCount")
How many times does the keyword “Spark” occur?
30
Working with Key-Value Pairs
Creating pair RDDs
Many of Spark's input formats directly return key/value data. Transformations like map can also be used to create pair RDDs.
Creating a pair RDD from a CSV file that has two columns:
val pairs = sc.textFile("pairsCSV.csv").map(_.split(",")).map(s => (s(0), s(1)))
Transforming pair RDDs
Special transformations exist on pair RDDs which are not available for regular RDDs:
reduceByKey - combine values with the same key (has a built-in map-side reducer)
groupByKey - group values by key
mapValues - apply a function to each value of the pair without changing the keys
sortByKey - returns an RDD sorted by the keys
Joining pair RDDs
Two RDDs can be joined using their keys; only pair RDDs are supported (see the sketch below).
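A small sketch of the pair-RDD transformations and a join, using made-up data:
val sales  = sc.parallelize(Seq(("apples", 10), ("oranges", 5), ("apples", 3)))
val prices = sc.parallelize(Seq(("apples", 0.5), ("oranges", 0.75)))

val totals  = sales.reduceByKey(_ + _)   // (apples, 13), (oranges, 5)
val doubled = sales.mapValues(_ * 2)     // values doubled, keys unchanged
val sorted  = totals.sortByKey()         // sorted by key
val joined  = totals.join(prices)        // (apples, (13, 0.5)), (oranges, (5, 0.75))
joined.collect().foreach(println)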
31
Broadcast & Accumulator Variables
Broadcast variables
Read-only variables cached on each node. Useful to keep a moderately large input dataset on each node. Spark uses efficient BitTorrent-style algorithms to ship broadcast variables to each node, minimizing network costs while distributing the dataset.
val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value
Accumulators
Implement counters, sums, etc. in parallel; support associative addition. Natively supported types are numerics and standard mutable collections. Only the driver can read an accumulator's value; tasks can't.
val accum = sc.accumulator(0)
sc.parallelize(List(1, 2, 3, 4)).foreach(x => accum += x)
accum.value
32
Standalone Apps
Applications must define a main() method and must create a SparkContext (see the skeleton below).
Applications can be built using:
Java + Maven
Scala + SBT
SBT - Simple Build Tool: included with the Spark download and doesn't need to be installed separately; similar to Maven but supports incremental compilation and an interactive shell; requires a build.sbt configuration file.
IDEs like IntelliJ IDEA have Scala and SBT plugins available and can be configured to build and run Spark programs in Scala.
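A minimal sketch of such a standalone app (names are hypothetical; the master URL is left to spark-submit):
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyApp")
    val sc   = new SparkContext(conf)
    val counts = sc.textFile("README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}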
33
Building with SBT
build.sbt
Should include the Scala version and the Spark dependencies (a minimal example is sketched below).
Directory structure:
./myapp/src/main/scala/MyApp.scala
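The build.sbt file sits at the root of ./myapp. A minimal sketch, assuming the Scala 2.10 / Spark 1.5.1 combination used elsewhere in this deck:
name := "MyApp"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1" % "provided"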
Package the jar: from the ./myapp folder run
sbt package
a jar file is created in ./myapp/target/scala-2.10/myapp_2.10-1.0.jar
Submit with spark-submit, specifying a master URL or local:
$SPARK_HOME/bin/spark-submit \
--class "MyApp" \
--master local[4] \
target/scala-2.10/myapp_2.10-1.0.jar
34
Spark Cluster
35
Spark SQL + Parquet
36
Spark SQL
Spark's interface for working with structured and semi-structured data. Can load data from JSON, Hive, Parquet.
Data can be queried internally using SQL, Scala or Python, or from external BI tools.
Spark SQL provides a special RDD called a SchemaRDD (replaced by DataFrame since Spark 1.3). A SchemaRDD is an RDD of Row objects. Spark supports UDFs.
Spark SQL components: Catalyst Optimizer, Spark SQL Core, Hive support.
37
Spark SQL
38
DataFrames
An extension of the RDD API and a Spark SQL abstraction.
A distributed collection of data with named columns.
Equivalent to RDBMS tables or data frames in R/Pandas.
Can be built from a variety of structured data sources: Hive tables, JSON, databases, RDDs, etc.
39
Why DataFrame?
Lots of data formats are structured (schema-on-read): the data has inherent structure that is needed to make sense of it.
RDD programming with structured data is not intuitive.
DataFrame = RDD(Row) + Schema + DSL.
Write SQL, or use the domain-specific language (DSL), as sketched below.
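A hedged sketch of both styles on the same DataFrame; data/people.json and its name/age columns are hypothetical:
val df = sqlContext.read.json("data/people.json")

// Domain-specific language (DSL)
df.select("name", "age").filter(df("age") > 21).show()

// Equivalent SQL
df.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age > 21").show()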
40
Using Spark SQL
SQLContext: the entry point for all SQL functionality; extends the existing SparkContext to support SQL.
Reading JSON or Parquet files directly yields a DataFrame (SchemaRDD).
Register a DataFrame as a temp table; tables persist only as long as the program runs.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val parquetFile = sqlContext.parquetFile("../spark_training/data/wiki_parquet")
parquetFile.registerTempTable("wikiparquet")
val teenagers = sqlContext.sql(""" SELECT * FROM wikiparquet limit 2""")
cacheTable("people")
teenagers.collect.foreach(println)
41
Intro to Parquet
Business use case: analytics produce a lot of derived data and statistics. Compression is needed for efficient data storage. Compressing is easy, but deriving insights is not. A new mechanism is needed to store and retrieve data easily and efficiently in the Hadoop ecosystem.
42
Intro to Parquet (Contd.)
Solution: Parquet
A columnar storage format for the Hadoop ecosystem, independent of:
Processing framework (MapReduce, Spark, Cascading, Scalding, etc.)
Programming language (Java, Scala, Python, C++)
Data model (Avro, Thrift, ProtoBuf, POJO)
Supports nested data structures. Self-describing data format. Binary packaging for CPU efficiency.
43
Parquet Design Goals
Interoperability: model and language agnostic; supports a myriad of frameworks, query engines and data models.
Space (IO) efficiency: columnar storage (a row layout encodes one value at a time, a column layout encodes an array of values at a time).
Partitioning: vertical for projection pushdown, horizontal for predicate pushdown; read only the blocks that are needed, no need to scan the whole file.
Query/CPU efficiency: binary packaging for CPU efficiency; the right encoding for the right data.
44
Parquet File Partitioning
When to use partitioning?
Data is too large and takes a long time to read.
Data is always queried with conditions.
Columns have reasonable cardinality (not just male vs. female).
Choose column combinations that are frequently used together for filtering.
Partition pruning helps read only the directories being filtered.
45
Parquet with Spark
Spark fully supports the Parquet file format.
Spark 1.3 can automatically scan and merge files if the data model changes.
Spark 1.4 supports partition pruning: it can auto-discover partition folders and scan only the folders required by the predicate (a read-back sketch follows below).
df.write.partitionBy("year", "month", "day").parquet("path/to/output")
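Reading the partitioned output back: a filter on the partition columns only touches the matching year=/month=/day= folders (the path and columns are the hypothetical ones from the write above).
val events = sqlContext.read.parquet("path/to/output")
events.filter(events("year") === 2015 && events("month") === 10).count()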
46
SQL Exercise (Twitter Study) - old style, without DataFrames
//create a case class to assign schema to structured data
case class Tweet(tweet_id: String, retweet: String, timestamp: String, source: String, text: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
//sc.textFile("data/tweets.csv").map(s => s.split(",")).map(s => Tweet(s(0), s(3), s(5), s(6), s(7))).take(5).foreach(println)
val tweets = sc.textFile("data/tweets.csv").map(s => s.split(",")).map(s => Tweet(s(0), s(3), s(5), s(6), s(7)))
tweets.registerTempTable("tweets")
//show the top 10 tweets by the number of re-tweets
val top10Tweets = sqlContext.sql("""select text, sum(IF(retweet is null, 0, 1)) rtcount from tweets group by text order by rtcount desc limit 10""")
top10Tweets.collect.foreach(println)
47
SQL Exercise (Twitter Study) - using spark-csv
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import sqlContext.implicits._
val csvSchema = StructType(List(StructField("tweet_id",StringType,true), StructField("retweet",StringType,true), StructField("timestamp",StringType,true), StructField("source",DoubleType,true), StructField("text",StringType,true)))
val tweets = new CsvParser().withSchema(csvSchema).withDelimiter(',').withUseHeader(false).csvFile(sqlContext, "data/tweets.csv")
tweets.registerTempTable("tweets")
//show the top 10 tweets by the number of re-tweets
val top10Tweets = sqlContext.sql("""select text, sum(IF(retweet is null, 0, 1)) rtcount from tweets where text != "" group by text order by rtcount desc limit 10""")
top10Tweets.collect.foreach(println)
48
Advanced Libraries
49
Spark Streaming
Big-data apps need to process large data streams in real time.
Streaming API similar to that of Spark Core; scales to 100s of nodes; fault-tolerant stream processing; integrates with batch + interactive processing.
Stream processing as a series of small batch jobs: divide the live stream into batches of X seconds; each batch is processed as an RDD; results of the RDD operations are returned as batches.
Requires additional setup (checkpointing) to run 24/7.
As of Spark 1.2 the APIs are available only in Scala/Java; the Python API is experimental (a minimal word-count sketch follows below).
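A minimal network word-count sketch, assuming a text source on localhost:9999 (e.g. started with nc -lk 9999):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc    = new StreamingContext(sc, Seconds(10))       // 10-second micro-batches
val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()        // output operation: prints the first 10 elements of each batch
ssc.start()
ssc.awaitTermination()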
50
DStreams - Discretized Streams
The abstraction provided by the Streaming API: a sequence of data arriving over time, represented as a sequence of RDDs.
Can be created from various sources: Flume, Kafka, HDFS.
Offer two types of operations: transformations (yield new DStreams) and output operations (write data to external systems).
New time-related operations, such as sliding windows, are also offered.
51
DStream Transformations
Stateless: processing of one batch doesn't depend on the previous batch. Similar to any RDD transformation (map, filter, reduceByKey). Transformations are applied to each individual RDD of the DStream. Data within the same batch can be joined using join, cogroup, etc.; data from multiple DStreams can be combined using union; transform can be applied to the RDDs within a DStream individually.
Stateful: uses intermediate results from previous batches; requires checkpointing to enable fault tolerance. Two types: windowed operations (transformations based on a sliding window of time) and updateStateByKey (track state across events for each key: (key, event) -> (key, state)).
52
DStream Output Operations
Specify what needs to be done with the final transformed data. If no output operation is specified, the DStream is not evaluated; if there is no output operation in the entire streaming context, the context will not start.
Common output operations:
print() - prints the first 10 elements from each batch of the DStream
saveAsTextFiles() - saves the output to files
foreachRDD() - runs an arbitrary operation on each RDD of the DStream
foreachPartition() - writes each partition to an external database (used inside foreachRDD)
53
Machine Learning - MLlib
Spark's machine learning library, designed to run in parallel on clusters. Consists of a variety of learning algorithms accessible from all of Spark's APIs. A set of functions to call on RDDs, but it introduces a few new data types: Vectors and LabeledPoints.
A typical machine learning task consists of the following steps (a minimal sketch follows below):
Data preparation: start with an RDD of raw data (text etc.) and clean it up.
Feature extraction: convert text to numerical features and create an RDD of vectors.
Model training: apply a learning algorithm to the RDD of vectors, resulting in a model object.
Model evaluation: evaluate the model using a test dataset; tune the model and its parameters; apply the model to real data to perform predictions.
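A deliberately tiny sketch of the training and prediction steps with hand-made features (a real pipeline would do the data preparation and feature extraction described above):
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(0.5, 3.0))
))
val model = LogisticRegressionWithSGD.train(training, 10)   // 10 iterations
val prediction = model.predict(Vectors.dense(1.5, 0.5))     // returns 0.0 or 1.0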
54
Tips & Tricks
55
Performance Tuning
Shuffle in Spark - performance issues
Code on driver vs. workers - cause of errors
Serialization - "task not serializable" errors
56
Shuffle in Spark
reduceByKey vs. groupByKey: both can solve the same problem, but groupByKey can cause out-of-disk errors (a small comparison is sketched below).
Prefer reduceByKey, combineByKey, foldByKey over groupByKey
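Both produce the same word counts, but reduceByKey combines values on each partition before the shuffle:
val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map((_, 1))

// groupByKey ships every (key, value) pair across the network before summing
val grouped = words.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first (map-side combine), shuffling far less data
val reduced = words.reduceByKey(_ + _)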
57
Execution on Driver vs. Workers
What is the driver program? The program that declares transformations and actions on RDDs, submits requests to the Spark master, and creates the SparkContext.
The main program is executed on the driver; transformations are executed on the workers; actions may transfer data from the workers to the driver.
collect() sends all the partitions to the driver, so collect() on large RDDs can cause out-of-memory errors. Instead use saveAsTextFile(), count() or take(N).
58
Serialization Errors
Serialization error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableExcept
Happens when you initialize a variable on the driver/master and use it on the workers: Spark will try to serialize the object and send it to the workers, and will error out if the object is not serializable. A typical example is creating a DB connection on the driver and using it on the workers.
Some available fixes: make the class serializable; declare the instance within the lambda function; make the non-serializable object static and create it once per worker using rdd.foreachPartition (sketched below); create the DB connection on each worker.
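A hedged sketch of the foreachPartition fix; records is an RDD, and createConnection/insert stand in for whatever non-serializable database client is actually used:
records.foreachPartition { partition =>
  val conn = createConnection()           // hypothetical: created on the worker, once per partition
  partition.foreach(r => conn.insert(r))  // hypothetical insert call
  conn.close()
}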
59
Where do I go from here?
60
Community
spark.apache.org/community.html
Worldwide events: goo.gl/2YqJZK
Video and presentation archives: spark-summit.org
Dev resources: databricks.com/spark/developer-resources
Workshops: databricks.com/services/spark-training
61
Books
Learning Spark - Holden Karau, Andy Konwinski, Matei Zaharia, Patrick Wendell: shop.oreilly.com/product/0636920028512.do
Fast Data Processing with Spark - Holden Karau: shop.oreilly.com/product/9781782167068.do
Spark in Action - Chris Fregly: sparkinaction.com/
62
Where can I find all the code and examples?
All the code presented in this class and the assignments + data can be found on my github:
https://github.com/snudurupati/spark_training
Instructions on how to download, compile and run are also given there. I will keep adding new code and examples, so keep checking it!
63