Map-Reduce and Spark -...

CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy 1 Duke CS, Fall 2017 CompSci 516: Database Systems


Page 1: Map-Reduce and Spark - db.cs.duke.edudb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lectures/...Where are we now? We learnt ü Relational Model and Query Languages ü SQL,

CompSci 516 Database Systems

Lecture 12: Map-Reduce and Spark

Instructor: Sudeepa Roy


Announcements

• Practice midterm posted on Sakai
  – First prepare and then attempt!
• Midterm next Wednesday 10/11 in class
  – Closed book/notes, no electronic devices
  – Everything until and including Lecture 12 included
• HW2 due in 2 weeks
  – First run your code on your local machine to ensure that it is correct, then on AWS
  – Remember to stop AWS instances!


Where are we now?

We learnt
✓ Relational Model and Query Languages
  ✓ SQL, RA, RC
  ✓ Postgres (DBMS)
  ✓ XML (overview)
  ▪ HW1
✓ Database Normalization
✓ DBMS Internals
  – Storage
  – Indexing
  – Query Evaluation
  – Operator Algorithms
  – External sort
  – Query Optimization

• Today:
  – MapReduce and Spark


Reading Material

• Recommended (optional) readings:
  – Chapter 2 (Sections 1, 2, 3) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html
  – Original Google MR paper by Jeff Dean and Sanjay Ghemawat, OSDI ’04: http://research.google.com/archive/mapreduce.html
  – “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” (see course website), by Matei Zaharia et al., 2012

Acknowledgement: Some of the following slides have been borrowed from Prof. Shivnath Babu, Prof. Dan Suciu, Prajakta Kalmegh, and Junghoon Kang


MapReduce


Big Data

• It cannot be stored in one machine
  – store the data sets on multiple machines: Google File System
• It cannot be processed in one machine
  – parallelize computation on multiple machines: MapReduce

Ack: Slide by Junghoon Kang


The Map-Reduce Framework

• Google published the MapReduce paper in OSDI 2004, a year after the Google File System paper
• A high-level programming paradigm
  – allows many important data-oriented processes to be written simply
• Processes large data by:
  – applying a function to each logical record in the input (map)
  – categorizing and combining the intermediate results into summary values (reduce)


Where does Google use MapReduce?

Input to MapReduce:
● crawled documents
● web request logs

Output of MapReduce:
● inverted indices
● graph structure of web documents
● summaries of the number of pages crawled per host
● the set of most frequent queries in a day

Ack: Slide by Junghoon Kang


Storage Model

• Data is stored in large files (TB, PB)
  – e.g. market-basket data (more when we do data mining)
  – or web data
• Files are divided into chunks
  – typically many MB (64 MB)
  – sometimes each chunk is replicated for fault tolerance (later in distributed DBMS)
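As a rough illustration of the chunking arithmetic, the sketch below counts how many 64 MB chunks a file needs. The function names and the replication factor of 3 are assumptions made for this example; the slide only says that chunks are sometimes replicated.

```python
import math

CHUNK_SIZE_MB = 64  # chunk size quoted on the slide


def num_chunks(file_size_mb, chunk_size_mb=CHUNK_SIZE_MB):
    """Number of fixed-size chunks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / chunk_size_mb)


def stored_copies(file_size_mb, replication=3):
    """Total chunk copies kept when every chunk is replicated (factor is an assumption)."""
    return num_chunks(file_size_mb) * replication


one_tb_in_mb = 1024 * 1024
print(num_chunks(one_tb_in_mb))     # 16384
print(stored_copies(one_tb_in_mb))  # 49152
```

Even a 1 TB file already yields thousands of chunks, which is what gives the map phase its parallelism: each map process can work on one chunk.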


Map-Reduce Steps

• Input is typically (key, value) pairs
  – but could be objects of any type
• Map and Reduce are performed by a number of processes
  – physically located in some processors

[Figure: input key-value pairs → Map → Shuffle (sort by key, same key to the same reducer) → Reduce → output lists]


Map-Reduce Steps

1. Read data
2. Map
   – extract some info of interest in (key, value) form
3. Shuffle and sort
   – send same keys to the same reduce process


4. Reduce
   – operate on the values of the same key
   – e.g. transform, aggregate, summarize, filter
5. Output the results (key, final-result)
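The five steps can be sketched as a single-process simulation of word counting. This is only an in-memory illustration under assumed names (`map_fn`, `shuffle`, `reduce_fn`, `run_job`); a real engine runs the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Step 2 (Map): emit (word, 1) for every word in one input record."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Step 3 (Shuffle and sort): group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Step 4 (Reduce): aggregate all values that share a key."""
    return word, sum(counts)


def run_job(records):
    """Steps 1-5 chained together on an in-memory list of (doc_id, text) records."""
    intermediate = [kv for doc_id, text in records for kv in map_fn(doc_id, text)]
    return [reduce_fn(key, values) for key, values in shuffle(intermediate)]


docs = [(1, "the quick brown fox"), (2, "the lazy dog"), (3, "the fox")]
print(run_job(docs))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```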


Simple Example: Map-Reduce

• Word counting
• Inverted indexes

Ack: Slide by Prof. Shivnath Babu


Map Function

• Each map process works on a chunk of data
• Input: (input-key, value)
• Output: (intermediate-key, value), which may not be the same as the input key and value
• Example: list all doc ids containing a word
  – output of map: (word, docid); emits each such pair
  – word is key, docid is value
  – duplicate elimination can be done at the reduce phase


Reduce Function

• Input: (intermediate-key, list-of-values-for-this-key)
  – list can include duplicates
  – each map process can leave its output on the local disk; the reduce process can retrieve its portion
• Output: (output-key, final-value)
• Example: list all doc ids containing a word
  – output will be a list of (word, [doc-id1, doc-id5, ….])
  – if the count is needed, reduce counts #docs; output will be a list of (word, count)
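A minimal sketch of the doc-id example, assuming documents arrive as (docid, text) pairs. The map function emits duplicates freely and the reduce function removes them, as the slides suggest; all function names here are illustrative.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, docid) for every occurrence; duplicates are allowed here."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_docs(word, doc_ids):
    """Eliminate duplicates and return the list of documents containing the word."""
    return word, sorted(set(doc_ids))


def reduce_count(word, doc_ids):
    """Variant: return only the number of distinct documents."""
    return word, len(set(doc_ids))


def run(records, reducer):
    groups = defaultdict(list)
    for doc_id, text in records:
        for word, emitted_id in map_fn(doc_id, text):
            groups[word].append(emitted_id)  # grouping stands in for the shuffle
    return [reducer(word, ids) for word, ids in sorted(groups.items())]


docs = [(1, "spark spark hadoop"), (5, "spark")]
print(run(docs, reduce_docs))   # [('hadoop', [1]), ('spark', [1, 5])]
print(run(docs, reduce_count))  # [('hadoop', 1), ('spark', 2)]
```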


Example Problem: MapReduce

Explain how the query will be executed in MapReduce

• SELECT a, max(b) as topb
• FROM R
• WHERE a > 0
• GROUP BY a

Specify the computation performed in the map and the reduce functions


Map

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Each map task
  – Scans a block of R
  – Calls the map function for each tuple
  – The map function applies the selection predicate to the tuple
  – For each tuple satisfying the selection, it outputs a record with key = a and value = b
• When each map task scans multiple relations, it needs to output something like key = a and value = (‘R’, b), which has the relation name ‘R’


Shuffle

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• The MapReduce engine reshuffles the output of the map phase and groups it on the intermediate key, i.e. the attribute a
• Note that the programmer has to write only the map and reduce functions; the shuffle phase is done by the MapReduce engine (although the programmer can rewrite the partition function), but you should still mention this in your answers
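To make the partition function concrete, here is a minimal sketch assuming a hash-modulo scheme: each intermediate key is hashed and assigned to one of the reduce tasks. The names `partition` and `shuffle` and the choice of CRC32 as the hash are illustrative assumptions, not the API of any particular engine.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reduce task using a stable hash modulo the number of reducers."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route each (key, value) pair to its reduce task, then group values by key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)
    return [dict(sorted(bucket.items())) for bucket in buckets]


# (a, b) pairs as emitted by the map phase of the example query
map_output = [(1, 10), (2, 20), (1, 30), (3, 5)]
for task_id, groups in enumerate(shuffle(map_output, num_reducers=2)):
    print(task_id, groups)
```

Because the partition depends only on the key, all values for the same a end up at the same reduce task, which is the property the reduce phase relies on.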


Reduce

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Each reduce task
  – computes the aggregate value max(b) = topb for each group (i.e. a) assigned to it (by calling the reduce function)
  – outputs the final results: (a, topb)
• Multiple aggregates can be output by the reduce phase, like key = a and value = (sum(b), min(b)) etc.
• Sometimes a second (third etc.) level of Map-Reduce phase might be needed
• A local combiner can be used to compute local max before data gets reshuffled (in the map tasks)
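Putting the map, combiner, shuffle, and reduce steps for this query together, a single-process sketch could look as follows. It assumes R is given as blocks of (a, b) tuples; the function names and sample data are made up for illustration.

```python
from collections import defaultdict


def map_fn(tup):
    """Apply the selection a > 0 and emit key = a, value = b."""
    a, b = tup
    if a > 0:
        yield a, b


def combine(pairs):
    """Local combiner: keep only the largest b per a within one map task."""
    local_max = {}
    for a, b in pairs:
        local_max[a] = b if a not in local_max else max(local_max[a], b)
    return list(local_max.items())


def reduce_fn(a, bs):
    """Compute topb = max(b) for one group."""
    return a, max(bs)


def run_query(blocks):
    groups = defaultdict(list)
    for block in blocks:                        # one map task per block of R
        emitted = [kv for tup in block for kv in map_fn(tup)]
        for a, b in combine(emitted):           # combiner runs inside the map task
            groups[a].append(b)                 # shuffle: group on attribute a
    return [reduce_fn(a, bs) for a, bs in sorted(groups.items())]


R_blocks = [[(1, 10), (1, 30), (-2, 99)], [(1, 20), (3, 5), (0, 7)]]
print(run_query(R_blocks))  # [(1, 30), (3, 5)]
```

The combiner is safe here because max is associative and commutative, so taking a local max first does not change the final result.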


More Terminology

• A Map-Reduce “Job”
  – e.g. count the words in all docs
  – complex queries can have multiple MR jobs
• Map or Reduce “Tasks”
  – A group of map or reduce “functions”
  – scheduled on a single “worker”
• Worker
  – a process that executes one task at a time
  – one per processor, so 4-8 per machine
• A master controller
  – divides the data into chunks
  – assigns different processors to execute the map function on each chunk
  – other/same processors execute the reduce functions on the outputs of the map functions

However, there is no uniform terminology across systems

Ack: Slide by Prof. Dan Suciu


Why is Map-Reduce Popular?

• Distributed computation before MapReduce
  – how to divide the workload among multiple machines?
  – how to distribute data and program to other machines?
  – how to schedule tasks?
  – what happens if a task fails while running?
  – … and … and ...
• Distributed computation after MapReduce
  – how to write Map function?
  – how to write Reduce function?
• Developers’ tasks made easy

Ack: Slide by Junghoon Kang


Handling Fault Tolerance in MR

• Although the probability that any single machine fails is low, a failure somewhere among thousands of machines is common
• Worker Failure
  – The master sends a heartbeat to each worker node
  – If a worker node fails, the master reschedules the tasks handled by the worker
• Master Failure
  – The whole MapReduce job gets restarted through a different master
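The worker-failure handling can be sketched as a small bookkeeping class. This is a toy model under stated assumptions: the `Master` class, its method names, the timeout value, and the round-robin reassignment are all hypothetical, and time is passed in explicitly instead of being read from a clock.

```python
class Master:
    """Toy master that tracks heartbeat replies and reschedules tasks (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}    # worker -> time of its last heartbeat reply
        self.assignment = {}   # task -> worker currently responsible for it

    def heartbeat(self, worker, now):
        """Record that a worker answered a heartbeat at time `now`."""
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.assignment[task] = worker

    def reschedule_failed(self, now):
        """Move tasks of workers that missed the timeout to live workers, round-robin."""
        alive = sorted(w for w, t in self.last_seen.items() if now - t <= self.timeout)
        if not alive:
            raise RuntimeError("no live workers left")
        moved = []
        for task, worker in sorted(self.assignment.items()):
            if worker not in alive:
                self.assignment[task] = alive[len(moved) % len(alive)]
                moved.append(task)
        return moved


master = Master(timeout=10)
master.heartbeat("w1", now=0)
master.heartbeat("w2", now=0)
master.assign("map-0", "w1")
master.assign("map-1", "w2")
master.heartbeat("w1", now=20)           # w2 stays silent
print(master.reschedule_failed(now=25))  # ['map-1']
print(master.assignment)                 # {'map-0': 'w1', 'map-1': 'w1'}
```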

Ack: Slide by Junghoon Kang

Other aspects of MapReduce

• Locality
  – The input data is managed by GFS
  – Choose the cluster of MapReduce machines such that those machines contain the input data on their local disk
  – We can conserve network bandwidth
• Task granularity
  – It is preferable for the number of tasks to be a multiple of the number of worker nodes
  – The smaller the partition size, the faster the failover and the finer the granularity in load balancing, but it incurs more overhead
  – Need a balance
• Backup Tasks
  – In order to cope with a “straggler”, the master schedules backup executions of the remaining in-progress tasks

Ack: Slide by Junghoon Kang

Apache Hadoop

• Apache Hadoop has an open-source version of GFS and MapReduce
  – GFS -> HDFS (Hadoop Distributed File System)
  – Google MapReduce -> Hadoop MapReduce
• You can download the software and implement your own MapReduce applications

Ack: Slide by Junghoon Kang

MapReduce Pros and Cons

• MapReduce is good for off-line batch jobs on large data sets
• MapReduce is not good for iterative jobs due to high I/O overhead, as each iteration needs to read/write data from/to GFS
• MapReduce is bad for jobs on small datasets and jobs that require low-latency response

Ack: Slide by Junghoon Kang

Spark

See the RDD paper from the course website


What is Spark?

Distributed in-memory large scale data processing engine!

• Not a modified version of Hadoop
• Separate, fast, MapReduce-like engine
  – In-memory data storage for very fast iterative queries
  – General execution graphs and powerful optimizations
  – Up to 40x faster than Hadoop
  – Up to 100x faster (2-10x on disk)
• Compatible with Hadoop’s storage APIs
  – Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Ack: Slide by Prajakta Kalmegh


Applications (Big Data Analysis)

• In-memory analytics & anomaly detection (Conviva)
• Interactive queries on data streams (Quantifind)
• Exploratory log analysis (Foursquare)
• Traffic estimation w/ GPS data (Mobile Millennium)
• Twitter spam classification (Monarch)
• ...

Ack: Slide by Prajakta Kalmegh


Why a New Programming Model?

• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
  – More complex, multi-stage iterative applications (graph algorithms, machine learning)
  – More interactive ad-hoc queries
  – More real-time online processing
• All three of these apps require fast data sharing across parallel jobs

NOTE: What were the workarounds in MR world? Ysmart [1], Stubby [2], PTF [3], Haloop [4], Twister [5]

Ack: Slide by Prajakta Kalmegh


Data Sharing in MapReduce

[Figure: two panels. Iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . . Interactive queries: Input → HDFS read → query 1, query 2, query 3, . . . → result 1, result 2, result 3]

Slow due to replication, serialization, and disk IO

Ack: Slide by Prajakta Kalmegh


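Data Sharing in Spark

[Figure: two panels. Iterative job: Input → iter. 1 → iter. 2 → . . . with data shared through distributed memory. Interactive queries: Input → one-time processing → distributed memory → query 1, query 2, query 3, . . .]

10-100× faster than network and disk

Ack: Slide by Prajakta Kalmegh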


RDD: Spark Programming Model

• Key idea: Resilient Distributed Datasets (RDDs)
  – Distributed collections of objects that can be cached in memory or stored on disk across cluster nodes
  – Manipulated through various parallel operators
  – Automatically rebuilt on failure (How? Use Lineage)

Ack: Slide by Prajakta Kalmegh


Additional Slides on Spark (Optional Reading)

Ack: The following slides are by Prajakta Kalmegh


More on RDDs

• Transformations: Created through deterministic operations on either
  ‣ data in stable storage or
  ‣ other RDDs
• Lineage: RDD has enough information about how it was derived from other datasets
• Immutable: RDD is a read-only, partitioned collection of records
  ‣ Checkpointing of RDDs with long lineage chains can be done in the background.
  ‣ Mitigating stragglers: We can use backup tasks to recompute transformations on RDDs
• Persistence level: Users can choose a re-use storage strategy (caching in memory, storing the RDD only on disk, or replicating it across machines); they can also choose a persistence priority for data spills
• Partitioning: Users can ask that an RDD’s elements be partitioned across machines based on a key in each record


RDD Transformations and Actions

*http://www.tothenew.com/blog/spark-1o3-spark-internals/

*https://spark.apache.org/docs/1.0.1/cluster-overview.html

Note: Lazy Evaluation: A very important concept
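Lazy evaluation means that transformations only record what should be done, and nothing is computed until an action asks for a result. The toy class below illustrates the idea in plain Python; `LazyDataset` and its methods are hypothetical stand-ins and not the Spark API.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions execute them."""

    def __init__(self, source, ops=()):
        self._source = source   # zero-argument function that reads the input
        self._ops = tuple(ops)  # recorded transformations, in order

    def map(self, fn):
        """Transformation: returns a new dataset, computes nothing."""
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        """Transformation: returns a new dataset, computes nothing."""
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        """Action: reads the source and applies every recorded transformation."""
        out = []
        for item in self._source():
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        """Action: number of elements after all transformations."""
        return len(self.collect())


reads = []


def source():
    reads.append("read")  # lets us observe when the input is actually touched
    return [1, 2, 3, 4, 5, 6]


evens_squared = LazyDataset(source).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(len(reads))               # 0: no work has been done yet
print(evens_squared.collect())  # [4, 16, 36]
print(len(reads))               # 1: the action triggered the read
```

Deferring the work this way is what lets an engine see the whole chain of transformations before running any of them.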


DAG of RDDs

*https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/


Fault Tolerance

• RDDs track the series of transformations used to build them (their lineage) to recompute lost data
• E.g.:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

[Figure: lineage chain: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]

Borrowed slide

Tradeoff: Low computation cost (cache more RDDs) vs. high memory cost (not much work for GC)
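The lineage idea can be imitated in a few lines of plain Python: each dataset remembers its parent and the function that derives it, so a lost copy can be rebuilt on demand. The `Dataset` class, the sample log lines, and the reuse of the names HadoopRDD, FilteredRDD, and MappedRDD as labels are assumptions for illustration; this is not how Spark is implemented.

```python
class Dataset:
    """Toy lineage node: knows its parent and how to derive itself from it."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self.compute = compute  # source: () -> rows; derived: rows -> rows
        self.parent = parent
        self._rows = None       # materialized copy, which may be lost

    def rows(self):
        """Return the data, recomputing it from the lineage if it is missing."""
        if self._rows is None:
            if self.parent is None:
                self._rows = self.compute()
            else:
                self._rows = self.compute(self.parent.rows())
        return self._rows

    def lose(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._rows = None

    def lineage(self):
        """Names of the datasets this one was derived from, oldest first."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain[::-1]


LOG = ["10:01\terror\tdisk full", "10:02\tinfo\tjob started", "10:03\terror\ttask lost"]

text = Dataset("HadoopRDD", lambda: list(LOG))
errors = Dataset("FilteredRDD", lambda rows: [r for r in rows if "error" in r], text)
messages = Dataset("MappedRDD", lambda rows: [r.split("\t")[2] for r in rows], errors)

print(messages.rows())     # ['disk full', 'task lost']
errors.lose()
messages.lose()
print(messages.rows())     # rebuilt by replaying the filter and map steps
print(messages.lineage())  # ['HadoopRDD', 'FilteredRDD', 'MappedRDD']
```

Keeping more materialized copies avoids recomputation but costs memory, which is the tradeoff the slide points out.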


Representing RDDs

• Graph-based representation. Five components:


Representing RDDs (Dependencies)

[Figure: dependency types between RDD partitions: one-to-one, many-to-one, many-to-many (shuffle)]


Representing RDDs (An example)


Advantages of the RDD model


Checkpoint!

• Data Sharing in Spark and Some Applications
• RDD Definition, Model, Representation, Advantages


Other Engine Features: Implementation

• Not covered in detail
• Some summary:
• Spark local vs Spark Standalone vs Spark cluster (resource sharing handled by Yarn/Mesos)
• Job Scheduling: DAGScheduler vs TaskScheduler (Fair vs FIFO at task granularity)
• Memory Management: deserialized in-memory (fastest) vs serialized in-memory vs on-disk persistent
• Support for Checkpointing: Tradeoff between using lineage for recomputing partitions vs checkpointing partitions on stable storage
• Interpreter Integration: Ship external instances of variables referenced in a closure along with the closure class to worker nodes in order to give them access to these variables