Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (NCR)
Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data
CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Instructor: Duen Horng (Polo) Chau
What is Spark? Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
http://spark.apache.org
What is Spark SQL? (Formerly called Shark)
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
Project History [latest: v2.2]
Spark project started in 2009 at UC Berkeley AMPLab, open sourced 2010
Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scales to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …
http://en.wikipedia.org/wiki/Apache_Spark
Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g., iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both require faster data sharing across parallel jobs
Is MapReduce dead? Not really.
http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
Data Sharing in MapReduce
[Figure: iterative jobs (iter. 1, iter. 2, …) and interactive queries (query 1-3) each read input from HDFS and write results back to HDFS between steps]
Slow due to replication, serialization, and disk I/O
Data Sharing in Spark
[Figure: after one-time processing of the input, iterations and queries share data through distributed memory instead of HDFS]
10-100× faster than network and disk
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python, R
http://www.scala-lang.org/old/faq/4
Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html
Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3) into the base RDD (lines), applies the transformations to build the transformed RDDs, caches the resulting messages (Cache 1-3), and returns results to the driver. filter and map are transformations; count is an action that triggers execution.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html
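The same pipeline can be sketched in plain Python over an in-memory list, to show what each step computes (real Spark distributes the work across the cluster and keeps cachedMsgs in cluster memory; the log lines here are made up):

```python
# Toy stand-in for the log file (tab-separated: level, component, message).
log = [
    "INFO\tstartup\tok",
    "ERROR\tdisk\tfull",
    "INFO\theartbeat\tok",
    "ERROR\tnetwork\ttimeout",
]

lines = log                                           # spark.textFile(...)
errors = [l for l in lines if l.startswith("ERROR")]  # filter(_.startsWith("ERROR"))
messages = [l.split("\t")[2] for l in errors]         # map(_.split('\t')(2))
cached_msgs = messages                                # cache(): kept in memory

# Actions, like cachedMsgs.filter(_.contains("foo")).count
print(sum(1 for m in cached_msgs if "full" in m))     # 1
print(sum(1 for m in cached_msgs if "timeout" in m))  # 1
```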
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
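The lineage idea can be mocked up in a few lines of Python (a toy stand-in, not Spark's implementation): each dataset records only its parent and the function that derives it, so lost results can always be recomputed from the source.

```python
class ToyRDD:
    """Stores lineage (parent + transformation), not data."""
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn

    def filter(self, pred):
        return ToyRDD(self, lambda rows: [r for r in rows if pred(r)])

    def map(self, f):
        return ToyRDD(self, lambda rows: [f(r) for r in rows])

    def compute(self):
        # Walk back to the source RDD, then replay each transformation.
        if self.parent is None:
            return self.fn(None)
        return self.fn(self.parent.compute())

source = ToyRDD(None, lambda _: ["ok\tall\tgood", "error\tdisk\tfull"])
messages = source.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2])

# Even if a cached copy is lost, compute() rebuilds it from lineage.
print(messages.compute())  # ['full']
```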
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)  // initial parameter vector

// repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
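The same loop in plain Python on a tiny invented dataset (a sketch of the math only; Spark would distribute the map and reduce steps across the cluster):

```python
import math
import random

# Toy dataset: (features, label) pairs with labels in {-1, +1}.
data = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -1.0), -1), ((-2.0, -0.5), -1)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)
w = [random.random(), random.random()]  # initial parameter vector
ITERATIONS = 100

for _ in range(ITERATIONS):
    # "map": per-point gradient contribution; "reduce": sum them up.
    gradient = [0.0, 0.0]
    for x, y in data:
        scale = (1 / (1 + math.exp(-y * dot(w, x))) - 1) * y
        gradient = [g + scale * xi for g, xi in zip(gradient, x)]
    w = [wi - gi for wi, gi in zip(w, gradient)]

print("Final w:", w)
```

Caching `data` matters here because the same dataset is scanned once per iteration; in Hadoop each scan would re-read from disk.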
Logistic Regression Performance
[Figure: running time (s), 0-4000, vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark. Hadoop: 127 s/iteration. Spark: first iteration 174 s, further iterations 6 s]
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
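To illustrate two of the key-value operators, here is what they compute, written in plain Python (the function names mirror Spark's operators, but this is not Spark code):

```python
from collections import defaultdict

def reduce_by_key(pairs, fn):
    """Like reduceByKey: merge the values of each key with fn."""
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc

def group_by_key(pairs):
    """Like groupByKey: collect all values of each key into a list."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

pairs = [("a", 1), ("b", 2), ("a", 3)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 4, 'b': 2}
print(group_by_key(pairs))                       # {'a': [1, 3], 'b': [2]}
```

In real Spark, reduceByKey combines values on each node before shuffling, so it is usually preferred over groupByKey for aggregations.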
Spark Users
Spark SQL: Hive on Spark
Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
Hive Architecture
[Figure: Client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, and Physical Plan Execution; the plan runs as MapReduce jobs over HDFS, with a Metastore alongside]
Spark SQL Architecture
[Figure: same as the Hive architecture, except the Physical Plan executes on Spark instead of MapReduce, and a Cache Manager is added next to the Query Optimizer]
[Engle et al., SIGMOD 2012]
Using Spark SQL
CREATE TABLE mydata_cached AS SELECT …
Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark
Benchmark Query 1
SELECT * FROM grep WHERE field LIKE '%XYZ%';
Benchmark Query 2
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
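To see what this query computes, it can be run against tiny invented tables in SQLite (a stand-in engine used for illustration; the table and column names follow the benchmark schema, but the rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
CREATE TABLE userVisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
INSERT INTO rankings VALUES ('url1', 10), ('url2', 30);
INSERT INTO userVisits VALUES
  ('1.2.3.4', 'url1', '1999-06-01', 5.0),
  ('1.2.3.4', 'url2', '1999-07-01', 7.0),
  ('5.6.7.8', 'url1', '2001-01-01', 9.0);  -- outside the date range
""")
# The benchmark query: join, filter by date, aggregate per source IP,
# and keep the single highest-earning IP.
row = conn.execute("""
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1
""").fetchone()
print(row)  # ('1.2.3.4', 20.0, 12.0)
```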
Behavior with Not Enough RAM
[Figure: iteration time (s) vs. % of working set in memory. Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s]
What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.
Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: micro-batches at T=1, T=2, … each run map; reduceByWindow combines results across the window]
[Zaharia et al., HotCloud 2012]
map() vs flatMap()
The best explanation: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey
flatMap = map + flatten
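In plain Python terms (a sketch of the semantics, not Spark's API):

```python
def spark_map(fn, data):
    # map: exactly one output element per input element
    return [fn(x) for x in data]

def spark_flat_map(fn, data):
    # flatMap: fn returns a sequence per element; the sequences are flattened
    return [y for x in data for y in fn(x)]

lines = ["to be", "or not"]
print(spark_map(str.split, lines))       # [['to', 'be'], ['or', 'not']]
print(spark_flat_map(str.split, lines))  # ['to', 'be', 'or', 'not']
```

This is why the word-count example uses flatMap: each line yields many words, and we want one flat collection of words, not a collection of lists.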
Streaming Spark (cont.)
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
[Zaharia et al., HotCloud 2012]
Spark Streaming
Create and operate on RDDs from live data streams at set intervals
Data is divided into batches for processing
Streams may be combined as part of processing or analyzed with higher-level transforms
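The micro-batch idea can be modeled in pure Python (a toy sketch over invented data; real Spark Streaming batches a live stream by time interval and distributes the work):

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Divide a (finite) event stream into fixed-size batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def windowed_counts(batches, window):
    """Word counts over a sliding window of the last `window` batches,
    like reduceByWindow over per-batch word counts."""
    history = []
    for batch in batches:
        history.append(Counter(batch))
        yield sum(history[-window:], Counter())

stream = ["spark", "hive", "spark", "hdfs", "spark", "hive"]
for counts in windowed_counts(micro_batches(stream, 2), window=2):
    print(dict(counts))
```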
Spark Platform
[Figure: the Spark stack. Data Storage: standard FS / HDFS / CFS / S3; Resource Management: YARN / Spark standalone / Mesos; Execution: RDD with Scala / Python / Java APIs; libraries on top: Spark SQL (Shark), Spark Streaming, GraphX, MLlib]
GraphX
Parallel graph processing
Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge
Limited algorithms
» PageRank
» Connected Components
» Triangle Counts
Alpha component
MLlib
Scalable machine learning library
Interoperates with NumPy
Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
MLlib 2.0 (part of Spark 2.0)
https://spark.apache.org/docs/latest/mllib-guide.html
Spark 2 (very new still)
New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Spark 2.0.0 has API-breaking changes
Partly why HW3 uses Spark 1.6 (also, the Cloudera distribution's Spark 2 support is in beta)
More details: https://spark.apache.org/releases/spark-release-2-0-0.html