MapReduce - Cornell University

MapReduce: Simplified Data Processing on Large Clusters (Without the Agonizing Pain). Presented by Aaron Nathan

Transcript of MapReduce - Cornell University

Page 1: MapReduce - Cornell University

MapReduce: Simplified Data Processing on Large Clusters

(Without the Agonizing Pain)

Presented by Aaron Nathan

Page 2: MapReduce - Cornell University

The Problem

•  Massive amounts of data
    – >100 TB (the internet)
    – Needs simple processing
•  Computers aren’t perfect
    – Slow
    – Unreliable
    – Misconfigured
•  Requires complex (i.e. bug-prone) code

Page 3: MapReduce - Cornell University

MapReduce to the Rescue!

•  Common Functional Programming Model
    – Map Step
        map (in_key, in_value) -> list(out_key, intermediate_value)
        •  Split a problem into a lot of smaller subproblems
    – Reduce Step
        reduce (out_key, list(intermediate_value)) -> list(out_value)
        •  Combine the outputs of the subproblems to give the original problem’s answer
•  Each “function” is independent
•  Highly Parallelizable

Page 4: MapReduce - Cornell University

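Algorithm Picture

[Figure: DATA is split across several MAP tasks, which emit key:value pairs (K1:v, K2:v, K3:v). The pairs are grouped by key (K1:v,v,v; K2:v,v,v; K3:v,v,v) and handed to REDUCE tasks, each of which produces an "Answer!". An aggregator is also shown.]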

Page 5: MapReduce - Cornell University

Some Example Code

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
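
The pseudocode above can be tried out on a single machine. The following is a minimal Python sketch, not the Google library: `run_mapreduce` is a hypothetical in-memory driver that stands in for the cluster, and the two sample documents are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple


def map_word_count(doc_name: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Map step: emit (word, "1") for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_word_count(word: str, counts: Iterable[str]) -> str:
    """Reduce step: sum the counts emitted for one word."""
    return str(sum(int(c) for c in counts))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate in map_fn(in_key, in_value):
            grouped[out_key].append(intermediate)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


# Made-up sample input: (document name, document contents)
docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_word_count, reduce_word_count)
```

Counts stay strings here only to mirror the `"1"` / `ParseInt` / `AsString` convention of the pseudocode.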

Page 6: MapReduce - Cornell University

Some Example Applications

•  Distributed Grep
•  URL Access Frequency Counter
•  Reverse Web Link Graph
•  Term-Vector per Host
•  Distributed Sort
•  Inverted Index

Page 7: MapReduce - Cornell University

The Implementation

•  Google Clusters
    – 100s-1000s Dual Core x86 Commodity Machines
    – Commodity Networking (100 Mbps / 1 Gbps)
    – GFS
•  Google Job Scheduler
•  Library linked in C++

Page 8: MapReduce - Cornell University

Execution

Page 9: MapReduce - Cornell University

The Master

•  Maintains the state and identity of all workers
•  Manages intermediate values
•  Receives signals from Map workers upon completion
•  Broadcasts signals to Reduce workers as they work
•  Can retask completed Map workers to Reduce workers.

Page 10: MapReduce - Cornell University

In Case of Failure

•  Periodic Pings from Master -> Workers
    – On failure resets state of assigned task of dead worker
•  Simple system proves resilient
    – Works in case of 80 simultaneous machine failures!
•  Master failure is unhandled.
•  Worker Failure doesn’t affect output (output identical whether failure occurs or not)
    – Each map writes to local disk only
    – If a mapper is lost, the data is just reprocessed
    – Non-deterministic map functions aren’t guaranteed identical output
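
The ping-and-reset idea can be sketched as a small bookkeeping table. This is an illustration only: the `TaskTable` class, the timeout value and the worker and task names are all hypothetical, and time is passed in explicitly so the example stays deterministic.

```python
class TaskTable:
    """Hypothetical master-side record of worker pings and task assignments."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker -> time of last successful ping
        self.assigned = {}   # task -> worker currently running it
        self.idle = []       # tasks waiting to be (re)scheduled

    def ping_ok(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, task, worker):
        self.assigned[task] = worker

    def reap_dead_workers(self, now):
        """Reset every task held by a worker that has missed its pings."""
        dead = {w for w, seen in self.last_seen.items()
                if now - seen > self.timeout_s}
        for task, worker in list(self.assigned.items()):
            if worker in dead:
                del self.assigned[task]
                self.idle.append(task)  # eligible for rescheduling
        return dead


table = TaskTable(timeout_s=10)
table.ping_ok("w1", now=0)
table.ping_ok("w2", now=0)
table.assign("map-0", "w1")
table.assign("map-1", "w2")
table.ping_ok("w2", now=12)            # w1 stays silent
dead = table.reap_dead_workers(now=15)
```

Because map output lives on the worker's local disk, a fuller version would also put that worker's already completed map tasks back on the idle list.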

Page 11: MapReduce - Cornell University

Preserving Bandwidth

•  Machines are in racks with small interconnects
    – Use location information from GFS
    – Attempts to put tasks for workers and input slices on the same rack
    – Usually results in LOCAL reads!

Page 12: MapReduce - Cornell University

Backup Execution Tasks

•  What if one machine is slow?
•  Can delay the completion of the entire MR Operation!
•  Answer: Backup (Redundant) Executions
    – Whoever finishes first completes the task!
    – Enabled towards the end of processing
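
A rough way to picture backup execution is to launch the same task twice and keep whichever copy finishes first. The sketch below uses Python threads as stand-ins for machines; `run_task`, the worker names and the delays are invented for illustration.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_task(worker: str, delay_s: float) -> str:
    """Stand-in for a map or reduce task; delay_s simulates machine speed."""
    time.sleep(delay_s)
    return worker


def run_with_backup(primary_delay_s: float, backup_delay_s: float) -> str:
    """Run two copies of the same task and return the first finisher."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(run_task, "primary", primary_delay_s),
            pool.submit(run_task, "backup", backup_delay_s),
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Leaving the with-block still waits for the straggler thread;
        # a real scheduler would simply discard the slower copy's result.
        return next(iter(done)).result()


winner = run_with_backup(primary_delay_s=0.3, backup_delay_s=0.05)
```

This only works because both copies produce the same output, which is the same assumption that makes re-execution after a failure safe.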

Page 13: MapReduce - Cornell University

Partitioning

•  M = number of Map Tasks (the number of input splits)
•  R = number of Reduce Tasks (the number of intermediate key splits)
•  W = number of worker computers
•  In General:
    – M = sizeof(Input) / 64 MB
    – R = W * n (where n is a small number)
•  Typical Scenario: Input Size = 12 TB, M = 200,000, R = 5000, W = 2000
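
The rules of thumb above can be checked against the typical scenario with a few lines of arithmetic. In this sketch the helper names are made up, binary units are assumed (1 TB = 2^40 bytes), and n = 2.5 is inferred from R / W = 5000 / 2000 rather than stated on the slide.

```python
MB = 2**20
TB = 2**40


def num_map_tasks(input_bytes: int, split_bytes: int = 64 * MB) -> int:
    """M = sizeof(Input) / 64 MB, rounded up to whole splits."""
    return (input_bytes + split_bytes - 1) // split_bytes


def num_reduce_tasks(workers: int, n: float) -> int:
    """R = W * n, where n is a small number."""
    return round(workers * n)


m = num_map_tasks(12 * TB)        # 196608, i.e. roughly 200,000
r = num_reduce_tasks(2000, 2.5)   # 5000
```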

Page 14: MapReduce - Cornell University

Custom Partitioning

•  Default Partitioned on intermediate key
    – Hash(intermediate_key) mod R
•  What if user has a priori knowledge about the key?
    – Allow for user-defined hashing function
    – Ex. Hash(Hostname(url_key))
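
The two partitioning schemes can be compared side by side. This is a sketch under assumptions: CRC32 stands in for the unspecified hash function, and the function names and URLs are illustrative.

```python
import zlib
from urllib.parse import urlparse


def stable_hash(text: str) -> int:
    """Deterministic hash (CRC32); Python's built-in hash() varies per process."""
    return zlib.crc32(text.encode("utf-8"))


def default_partition(intermediate_key: str, r: int) -> int:
    """Hash(intermediate_key) mod R"""
    return stable_hash(intermediate_key) % r


def host_partition(url_key: str, r: int) -> int:
    """Hash(Hostname(url_key)) mod R: all URLs of one host share a partition."""
    host = urlparse(url_key).hostname or ""
    return stable_hash(host) % r


R = 8
urls = ["http://example.com/a", "http://example.com/b/c"]
partitions = {host_partition(u, R) for u in urls}
```

With `host_partition`, both URLs land in the same reduce partition, so all entries for a host end up in one output file. `default_partition` gives no such guarantee.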

Page 15: MapReduce - Cornell University

The Combiner

•  If reducer is associative and commutative
    – (2 + 5) + 4 = 11 or 2 + (5 + 4) = 11
    – (15 + x) + 2 = 2 + (15 + x)
•  Repeated intermediate keys can be merged
    – Saves network bandwidth
    – Essentially like a local reduce task
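
For a word-count style job, a combiner is just the reduce logic run on one map worker's output before anything is sent over the network. The sketch below assumes integer counts and a made-up input sentence; `map_words` and `combine` are illustrative names.

```python
from collections import Counter


def map_words(contents: str):
    """Map step: one (word, 1) pair per word occurrence."""
    return [(word, 1) for word in contents.split()]


def combine(pairs):
    """Local pre-reduce on one map worker: merge repeated keys by summing."""
    combined = Counter()
    for key, count in pairs:
        combined[key] += count
    return list(combined.items())


raw = map_words("to be or not to be")   # 6 pairs would cross the network
local = combine(raw)                    # 4 pairs after combining
```

Because addition is associative and commutative, the reducer gets the same totals whether it sums the raw pairs or the combined ones.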

Page 16: MapReduce - Cornell University

I/O Abstractions

•  How to get initial key value pairs to map?
    – Define an input “format”
    – Make sure splits occur in reasonable places
    – Ex: Text
        •  Each line is a key/value pair
        •  Can come from GFS, BigTable, or anywhere really!
    – Output works analogously
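
A text input format might look like the reader below. It assumes, following the MapReduce paper's description of text mode, that the key is the byte offset of the line and the value is the line's contents; the function name and sample data are illustrative.

```python
from typing import Iterator, Tuple


def text_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset, line) pairs, one per line of text input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Records end at line boundaries, so no record is cut in half.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


records = list(text_records(b"alpha\nbeta\ngamma\n"))
```

Any source that can be iterated as key/value pairs could be wrapped in a generator of the same shape.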

Page 17: MapReduce - Cornell University

Skipping Bad Records

•  What if a user makes a mistake in map/reduce
•  And it’s only apparent on a few records..
•  Worker sends message to Master
•  Skip record on >1 worker failure and tell others to ignore this record
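
The skip rule can be sketched as a failure counter kept by the master. Everything here is hypothetical: the `SkipTracker` class, the retry loop and the use of `int` as a map function that crashes on one bad record.

```python
from collections import Counter


class SkipTracker:
    """Hypothetical master-side count of crashes per input record."""

    def __init__(self, max_failures: int = 1):
        self.max_failures = max_failures
        self.failures = Counter()

    def report_failure(self, record_id: int) -> None:
        self.failures[record_id] += 1

    def should_skip(self, record_id: int) -> bool:
        # Skip once a record has caused more than max_failures crashes.
        return self.failures[record_id] > self.max_failures


def run_map_task(records, map_fn, tracker):
    out = []
    for record_id, record in records:
        if tracker.should_skip(record_id):
            continue
        try:
            out.append(map_fn(record))
        except Exception:
            tracker.report_failure(record_id)  # "worker sends message to master"
            raise
    return out


def run_until_done(records, map_fn, tracker, max_attempts=5):
    for _ in range(max_attempts):
        try:
            return run_map_task(records, map_fn, tracker)
        except Exception:
            continue  # the task is rescheduled and retried
    raise RuntimeError("task kept failing")


tracker = SkipTracker()
result = run_until_done([(0, "1"), (1, "oops"), (2, "3")], int, tracker)
```

The task crashes twice on record 1; on the third attempt the record is skipped and the remaining records go through.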

Page 18: MapReduce - Cornell University

Removing Unnecessary Development Pain

•  Local MapReduce Implementation that runs on development machine
•  Master has HTTP page with status of entire operation
    – Shows “bad records”
•  Provide a Counter Facility
    – Master aggregates “counts” and displays them on Master HTTP page

Page 19: MapReduce - Cornell University

A look at the UI (in 1994)

http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0013.html

Page 20: MapReduce - Cornell University

Performance Benchmarks

Sorting AND Searching

Page 21: MapReduce - Cornell University

Search (Grep)

•  Scan through 10^10 100-byte records (1 TB)
•  M = 15000, R = 1
•  Startup time
    – GFS Localization
    – Program Propagation
•  Peak -> 30 GB/sec!

Page 22: MapReduce - Cornell University

Sort

•  50 lines of code
•  Map -> key + text line
•  Reduce -> Identity
•  M = 15000, R = 4000
    – Partition on initial bytes of intermediate key
•  Sorts in 891 sec!
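
The sort works because the partition function preserves key order, so the R sorted partitions can simply be concatenated. The sketch below runs in one process and makes several assumptions: a 10-byte key taken from the start of each line, partitioning on the first byte only, and a tiny made-up input.

```python
def range_partition(key: bytes, r: int) -> int:
    """Map the first byte of the key to one of r ordered partitions."""
    first = key[0] if key else 0
    return first * r // 256


def mapreduce_sort(lines, r=4):
    partitions = [[] for _ in range(r)]
    for line in lines:
        key = line[:10]  # Map: emit (key, text line)
        partitions[range_partition(key, r)].append((key, line))
    output = []
    for part in partitions:
        # Each reduce task receives its keys in sorted order;
        # Reduce itself is the identity and just emits the line.
        output.extend(line for _, line in sorted(part))
    return output


lines = [b"pear", b"apple", b"zebra", b"mango", b"Banana"]
sorted_lines = mapreduce_sort(lines)
```

A plain hash partition would balance load but scatter neighbouring keys across partitions, so the concatenated output would no longer be globally sorted.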

Page 23: MapReduce - Cornell University

What about Backup Tasks?

Page 24: MapReduce - Cornell University

And wait… it’s useful!

NB: August 2004

Page 25: MapReduce - Cornell University

Open Source Implementation

•  Hadoop
•  http://hadoop.apache.org/core/
    – Relies on HDFS
    – All interfaces look almost exactly like MapReduce paper
•  There is even a talk about it today!
    – 4:15 B17 CS Colloquium: Mike Cafarella (UWash)

Page 26: MapReduce - Cornell University

Active Disks for Large-Scale Data Processing

Page 27: MapReduce - Cornell University

The Concept

•  Use aggregate processing power
    – Networked disks allow for higher throughput
•  Why not move part of the application onto the disk device?
    – Reduce data traffic
    – Increase parallelism further

Page 28: MapReduce - Cornell University

Shrinking Support Hardware…

Page 29: MapReduce - Cornell University

Example Applications

•  Media Database
    – Find similar media data by “fingerprint”
•  Real Time Applications
    – Collect multiple sensor data quickly
•  Data Mining
    – POS Analysis required ad hoc database queries

Page 30: MapReduce - Cornell University

Approach

•  Leverage the parallelism available in systems with many disks
•  Operate with a small amount of state, processing data as it streams off the disk
•  Execute relatively few instructions per byte of data

Page 31: MapReduce - Cornell University

Results - Nearest Neighbor Search

•  Problem: Determine k items closest to a particular item in a database
    – Perform comparisons on the drive
    – Returns the disk’s closest matches
    – Server does final merge
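
The split between disk-side filtering and the server-side merge can be sketched as follows. This is a toy model, not the paper's implementation: items are plain numbers, distance is the absolute difference, and each "disk" is just a Python list.

```python
import heapq


def disk_top_k(items, query, k):
    """Runs on each (hypothetical) active disk: return its k closest items."""
    return heapq.nsmallest(k, items, key=lambda x: abs(x - query))


def server_merge(per_disk_results, query, k):
    """Runs on the server: merge the short candidate lists into the final k."""
    candidates = [x for result in per_disk_results for x in result]
    return heapq.nsmallest(k, candidates, key=lambda x: abs(x - query))


disks = [[1, 9, 14], [7, 8, 30], [2, 11, 25]]
query, k = 10, 2
nearest = server_merge([disk_top_k(d, query, k) for d in disks], query, k)
```

Each disk sends back at most k items, so the server sees k items per disk instead of the whole database.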

Page 32: MapReduce - Cornell University

Media Mining Example

•  Perform low level image tasks on the disk!
•  Edge Detection performed on disk
    – Sent to server as edge image
    – Server does higher level processing

Page 33: MapReduce - Cornell University

Why not just use a bunch of PCs?

•  The performance increase is similar
•  In fact, the paper essentially used this setup to actually benchmark their results!
•  Supposedly this could be cheaper
•  The paper doesn’t really give a good argument for this…
    – Possibly reduced bandwidth on disk IO channel
    – But who cares?

Page 34: MapReduce - Cornell University

Some Questions

•  What could a disk possibly do better than the host processor?
•  What added cost is associated with this mediocre processor on the HDD?
•  Are new dependencies introduced on hardware and software?
•  Perhaps other (better) places to do this type of local parallel processing?
•  Maybe in 2001 this made more sense?