High-Speed In-Memory Analytics over Hadoop and Hive...
Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data
Slides adopted from Matei Zaharia (MIT) and Oliver Vagner (NCR)
CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech
Instructor: Duen Horng (Polo) Chau

What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop
Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
http://spark.apache.org

What is Spark SQL? (Formerly called Shark)
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x

Project History [latest: v2.2]
Spark project started in 2009 at UC Berkeley AMPLab, open sourced 2010
Became Apache Top-Level Project in Feb 2014
Shark/Spark SQL started summer 2011
Built by 250+ developers and people from 50 companies
Scales to 1000+ nodes in production
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …
http://en.wikipedia.org/wiki/Apache_Spark

Why a New Programming Model?
MapReduce greatly simplified big data analysis
But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries
Both require faster data sharing across parallel jobs

Is MapReduce dead? Not really.
http://www.reddit.com/r/compsci/comments/296aqr/on_the_death_of_mapreduce_at_google/
http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/

Data Sharing in MapReduce
[Figure: iterative jobs (iter. 1, iter. 2, ...) do an HDFS read of their input and an HDFS write of their output at every iteration; interactive queries (query 1, query 2, query 3, ...) each do an HDFS read of the same input to produce result 1, result 2, result 3.]
Slow due to replication, serialization, and disk IO

Data Sharing in Spark
[Figure: iterative jobs (iter. 1, iter. 2, ...) pass data through distributed memory; after one-time processing of the input, queries (query 1, query 2, query 3, ...) read from distributed memory.]
10-100× faster than network and disk

Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure
Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala, Python console
» Supported languages: Java, Scala, Python, R
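
The RDD ideas on this slide (operators that only describe work, results that can be kept in memory and reused) can be illustrated with a toy, single-process class. This is only a plain-Python sketch: `ToyRDD` and all of its methods are hypothetical names chosen for illustration, not Spark's API, and nothing here is distributed.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute, parent=None, label="source"):
        self._compute = compute       # function that builds this dataset's items
        self.parent = parent          # dataset this one was derived from
        self.label = label
        self._cache_enabled = False
        self._cached = None
        self.computations = 0         # how many times the items were actually built

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    # Transformations: record what to do; nothing is computed yet.
    def map(self, func):
        return ToyRDD(lambda: [func(x) for x in self.collect()], self, "map")

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)], self, "filter")

    def cache(self):
        self._cache_enabled = True
        return self

    # Actions: force evaluation of the recorded chain.
    def collect(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

    def count(self):
        return len(self.collect())


log = ToyRDD.from_list([
    "ERROR\tdb\tfoo failed",
    "INFO\tweb\tok",
    "ERROR\tweb\tbar failed",
])
messages = (log.filter(lambda line: line.startswith("ERROR"))
               .map(lambda line: line.split("\t")[2])
               .cache())

foo_hits = messages.filter(lambda m: "foo" in m).count()
bar_hits = messages.filter(lambda m: "bar" in m).count()
```

The second `count` reuses the cached messages, so the source list is read only once (`log.computations` stays at 1).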

http://www.scala-lang.org/old/faq/4
Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html
Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")            // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache()

    cachedMsgs.filter(_.contains("foo")).count          // action
    cachedMsgs.filter(_.contains("bar")).count
    // . . .

[Figure: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1, Block 2, Block 3) and keeps its part of cachedMsgs in memory (Cache 1, Cache 2, Cache 3).]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html

Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.:

    val messages = textFile(...).filter(_.contains("error"))
                                .map(_.split('\t')(2))

[Figure: lineage chain] HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
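
Lineage-based recovery can be sketched in plain Python: keep the list of transformations, and when a partition is lost, replay that list on the corresponding input block only. The names below (`apply_lineage`, `blocks`, `partitions`) and the sample log lines are assumptions made for this illustration; Spark's actual bookkeeping is more involved.

```python
def apply_lineage(source_partition, lineage):
    """Replay the recorded transformations on one input partition."""
    data = list(source_partition)
    for kind, func in lineage:
        if kind == "filter":
            data = [x for x in data if func(x)]
        elif kind == "map":
            data = [func(x) for x in data]
        else:
            raise ValueError(f"unknown transformation: {kind}")
    return data


# Stand-ins for two HDFS blocks of a log file (made-up contents).
blocks = {
    0: ["error\tdb\tdisk full", "info\tweb\tok"],
    1: ["error\tweb\ttimeout", "error\tdb\tlock wait"],
}

# The lineage of `messages`: filter, then map (mirrors the slide's example).
lineage = [
    ("filter", lambda line: "error" in line),
    ("map", lambda line: line.split("\t")[2]),
]

# Build every partition once.
partitions = {i: apply_lineage(block, lineage) for i, block in blocks.items()}
before_failure = dict(partitions)

# Simulate losing the worker that held partition 1, then rebuild only that
# partition from its input block and the recorded lineage.
del partitions[1]
partitions[1] = apply_lineage(blocks[1], lineage)
```

Only the lost partition is recomputed; partition 0 is untouched, and no replica of the derived data was needed.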

Example: Logistic Regression

    val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
    var w = Vector.random(D)                                 // initial parameter vector
    for (i <- 1 to ITERATIONS) {                             // repeated MapReduce steps to do gradient descent
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
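
The same gradient-descent loop can be run without a cluster to see what each step computes. The following is a plain-Python sketch under stated assumptions: labels are +1 or -1, the four sample points are made up, and the helper names (`dot`, `point_gradient`, `train`) are hypothetical. The list comprehension plays the role of `map` and the column-wise sum plays the role of `reduce`.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, x, y):
    """Per-point term from the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations=10, seed=0):
    rng = random.Random(seed)
    dims = len(points[0][0])
    w = [rng.random() for _ in range(dims)]            # initial parameter vector
    for _ in range(iterations):
        # "map": one gradient term per point
        terms = [point_gradient(w, x, y) for x, y in points]
        # "reduce": element-wise sum of all terms
        gradient = [sum(column) for column in zip(*terms)]
        w = [wi - gi for wi, gi in zip(w, gradient)]   # w -= gradient
    return w


# Made-up, linearly separable toy data: (features, label).
points = [
    ((2.0, 1.5), 1),
    ((1.0, 2.5), 1),
    ((-1.5, -2.0), -1),
    ((-2.0, -1.0), -1),
]
w = train(points)
```

Because `points` is reused in every iteration, this is the access pattern where keeping the data in memory pays off.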

Logistic Regression Performance
[Figure: running time (s), 0 to 4000, vs number of iterations (1, 5, 10, 20, 30), for Hadoop and Spark]
Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
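
For readers new to the key-based operators, here is what three of them compute, written as plain-Python functions over lists of (key, value) pairs. These are hypothetical single-machine sketches of the semantics only (the function names and sample data are assumptions), not how Spark implements or distributes them.

```python
from collections import defaultdict


def reduce_by_key(pairs, func):
    """Combine all values that share a key with a two-argument function."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out


def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    right_groups = group_by_key(right)
    return [(key, (lv, rv)) for key, lv in left for rv in right_groups.get(key, [])]


visits = [("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")]
page_names = [("index.html", "Home"), ("about.html", "About")]

visit_counts = reduce_by_key([(url, 1) for url, _ in visits], lambda a, b: a + b)
visitors = group_by_key(visits)
named_visits = join(visits, page_names)
```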

Spark Users

Spark SQL: Hive on Spark

Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?

Hive Architecture
[Figure: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan, and execution stages; execution runs on MapReduce, with the metastore and HDFS as backing stores.]

Spark SQL Architecture
[Figure: same layout as Hive: a client (CLI, JDBC) talks to the driver, which contains the SQL parser, query optimizer, physical plan, and execution stages plus a cache manager; execution runs on Spark instead of MapReduce, with the metastore and HDFS as backing stores.]
[Engle et al, SIGMOD 2012]

Using Spark SQL

    CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported
Can also call from Scala to mix with Spark

Benchmark Query 1

    SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark Query 2

    SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
    FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
    WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
    GROUP BY V.sourceIP
    ORDER BY earnings DESC
    LIMIT 1;
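
To see what Benchmark Query 2 returns, it can be tried on a few rows with Python's built-in `sqlite3` module. This is only a sketch: SQLite stands in for Hive or Spark SQL, the join is written with an explicit `JOIN ... ON`, the column types are assumed, and the rows are made up for illustration rather than taken from the benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
    CREATE TABLE userVisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO rankings VALUES (?, ?)", [
    ("a.com", 10),
    ("b.com", 30),
])
conn.executemany("INSERT INTO userVisits VALUES (?, ?, ?, ?)", [
    ("1.1.1.1", "a.com", "1999-03-01", 5.0),
    ("1.1.1.1", "b.com", "1999-06-01", 7.0),
    ("2.2.2.2", "b.com", "1999-07-01", 9.0),
    ("2.2.2.2", "a.com", "2001-01-01", 100.0),   # outside the date range
])

query = """
    SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
    FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
    WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
    GROUP BY V.sourceIP
    ORDER BY earnings DESC
    LIMIT 1;
"""
top_earner = conn.execute(query).fetchone()
conn.close()
```

The query joins visits to page ranks, keeps visits in the date range, aggregates per source IP, and returns the single IP with the highest ad revenue.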

Behavior with Not Enough RAM
[Figure: iteration time (s) vs % of working set in memory]
Cache disabled: 68.8 s
25%: 58.1 s
50%: 40.7 s
75%: 29.7 s
Fully cached: 11.5 s

What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

    tweetStream
      .flatMap(_.toLower.split)
      .map(word => (word, 1))
      .reduceByWindow("5s", _ + _)

[Figure: at each interval (T=1, T=2, …), a map step followed by reduceByWindow]
[Zaharia et al, HotCloud 2012]

map() vs flatMap()
The best explanation:
https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey
flatMap = map + flatten
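
The identity "flatMap = map + flatten" can be checked on ordinary Python lists. In this minimal sketch, `flat_map` is a hypothetical helper written for illustration (it is not a Spark call) and the two sample strings are made up.

```python
from itertools import chain


def flat_map(func, items):
    """flatMap = map + flatten: apply func, then concatenate the resulting lists."""
    return list(chain.from_iterable(map(func, items)))


tweets = ["Spark is fast", "Hive on Spark"]

mapped = list(map(str.split, tweets))      # one list of words per tweet
flattened = flat_map(str.split, tweets)    # a single list of words
```

`map` keeps one output element per input element (here, a list per tweet), while `flat_map` merges those lists into one sequence of words.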

Streaming Spark
Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency
[Zaharia et al, HotCloud 2012]
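
The micro-batch idea behind Streaming Spark (treat the stream as a series of small batches and keep recent state in memory) can be sketched in plain Python as a word count over a sliding window of batches. Everything here is an assumption for illustration: the function name `windowed_word_counts`, measuring the window in batches rather than seconds, and the sample tweets.

```python
from collections import Counter, deque


def windowed_word_counts(batches, window):
    """Yield word counts over the last `window` batches after each new batch."""
    recent = deque(maxlen=window)    # per-batch counts kept in memory
    for batch in batches:
        # flatMap: split every tweet in the batch into lowercase words
        words = [w for tweet in batch for w in tweet.lower().split()]
        # per-batch count (the map to (word, 1) plus a local reduce)
        recent.append(Counter(words))
        # reduce over the window: add up the counts of the retained batches
        total = Counter()
        for counts in recent:
            total.update(counts)
        yield dict(total)


batches = [["Spark is fast"], ["spark streaming"], ["hive"]]
results = list(windowed_word_counts(batches, window=2))
```

Once the window is full, the oldest batch drops out of `recent`, so the words from the first batch no longer appear in the third result.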

Spark Streaming
Create and operate on RDDs from live data streams at set intervals
Data is divided into batches for processing
Streams may be combined as a part of processing or analyzed with higher level transforms

Spark Platform
[Figure: layered stack, top to bottom]
GraphX, Spark SQL / Shark, Spark Streaming, MLlib
Scala / Python / Java
Execution: RDD
Resource Management: YARN / Spark / Mesos
Data Storage: Standard FS / HDFS / CFS / S3

GraphX
Parallel graph processing
Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge
Limited algorithms
» PageRank
» Connected Components
» Triangle Counts
Alpha component

MLlib
Scalable machine learning library
Interoperates with NumPy
Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent

MLlib 2.0 (part of Spark 2.0)
https://spark.apache.org/docs/latest/mllib-guide.html

Spark 2 (very new still)
New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Spark 2.0.0 has API-breaking changes
Partly why HW3 uses Spark 1.6 (also, Cloudera distribution's Spark 2 support is in beta)
More details: https://spark.apache.org/releases/spark-release-2-0-0.html