MapReduce & Pig & Spark Ioanna Miliou Giuseppe Attardi Advanced Programming Università di Pisa


Page 1: MapReduce - didawikinf.di.unipi.itdidawikinf.di.unipi.it/.../magistraleinformatica/pa/mapreduce.pdf · Example continued : Word Count in Web Pages • MapReduce library gathers together

MapReduce & Pig & Spark

Ioanna Miliou, Giuseppe Attardi

Advanced Programming, Università di Pisa


Hadoop

• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
• Framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• It is designed to detect and handle failures at the application layer.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.


Hadoop

• The project includes these modules:
  – Hadoop Common: The common utilities that support the other Hadoop modules.
  – Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  – Hadoop YARN: A framework for job scheduling and cluster resource management.
  – Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.


Hadoop

• Other Hadoop-related projects at Apache include:
  – Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop.
  – Avro: A data serialization system.
  – Cassandra: A scalable multi-master database with no single points of failure.
  – Chukwa: A data collection system for managing large distributed systems.
  – HBase: A scalable, distributed database that supports structured data storage for large tables.
  – Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
  – Mahout: A scalable machine learning and data mining library.
  – Tez: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases.
  – ZooKeeper: A high-performance coordination service for distributed applications.


Hadoop

  – Pig: A high-level data-flow language and execution framework for parallel computation.
  – Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.


Hadoop Stack


What is MapReduce?


• MapReduce is the heart of Hadoop®
• Programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

Proposed by Dean and Ghemawat at Google


What is it?

• Processing engine of Hadoop
• Used for big data batch processing
• Parallel processing of huge data volumes
• Fault tolerant
• Scalable


Why use it?

• Your data is in the Terabyte/Petabyte range
• You have huge I/O
• Hadoop framework takes care of
  – Job and task management
  – Failures
  – Storage
  – Replication

You just write Map and Reduce jobs


Big Users

• Users
  – Facebook
  – Yahoo
  – Amazon
  – Ebay
• Providers
  – Amazon
  – Cloudera
  – HortonWorks
  – MapR


Map & Reduce

The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform.

1. The map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

map(k1, v1) → list(k2, v2)

2. The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples.

reduce(k2, list(v2)) → list(v2)

As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
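The two signatures above can be sketched as a tiny single-process driver. This is only an illustration of the programming model, not how Hadoop executes jobs; `run_mapreduce`, `max_mapper` and `max_reducer` are hypothetical names chosen for this example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to every (k1, v1) record, group by k2, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: each input record may emit any number of (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)  # grouping by key stands in for the shuffle
    # Reduce phase: one call per distinct key, visited in sorted key order.
    output = []
    for k2 in sorted(groups):
        for value in reducer(k2, groups[k2]):
            output.append((k2, value))
    return output


def max_mapper(sensor, readings):
    # map(k1, v1) -> list(k2, v2): emit one pair per reading.
    for reading in readings:
        yield (sensor, reading)


def max_reducer(sensor, readings):
    # reduce(k2, list(v2)) -> list(v2): keep only the largest reading.
    yield max(readings)


result = run_mapreduce(
    [("a", [3, 9]), ("b", [4]), ("a", [5])], max_mapper, max_reducer
)
print(result)
```

The reducer is only called once all map output has been grouped, which mirrors the rule that reduce always runs after map.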


Typical Problem solved by MapReduce

• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results

[Figure: Input Data → Map (four parallel tasks) → Shuffle → Reduce (two tasks) → Results]


Example: Word Count in Web Pages

A typical exercise for a new engineer in his or her first week

• Input is files with one document per record
• Specify a map function that takes a key/value pair
    key = document URL
    value = document contents
• Output of map function is (potentially many) key/value pairs. In our case, output (word, “1”) once per word in the document

“document1”, “Apple Orange Mango Orange Grapes Plum”

“Apple”, “1”
“Orange”, “1”
“Mango”, “1”
…


Example continued: Word Count in Web Pages

• MapReduce library gathers together all pairs with the same key (shuffle/sort)
• The reduce function combines the values for a key. In our case, compute the sum
• Output of reduce paired with key and saved

key = “Apple”   values = “1”
key = “Mango”   values = “1”
key = “Orange”  values = “1”, “1”
key = “Plum”    values = “1”
key = “Grapes”  values = “1”

“Apple”, “1”
“Orange”, “2”
“Mango”, “1”
“Grapes”, “1”
“Plum”, “1”
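The word count walk-through can be reproduced with a few lines of plain Python. This is a minimal in-memory sketch, assuming whitespace tokenization and integer counts instead of the string “1” used on the slide; the function names are made up for illustration.

```python
from collections import defaultdict


def map_word_count(url, contents):
    # Emit (word, 1) once per word occurrence in the document.
    for word in contents.split():
        yield (word, 1)


def reduce_word_count(word, counts):
    # Combine all values gathered for one key: here, their sum.
    return sum(counts)


def word_count(documents):
    grouped = defaultdict(list)
    for url, contents in documents.items():
        for word, one in map_word_count(url, contents):
            grouped[word].append(one)  # shuffle/sort: gather pairs by key
    return {word: reduce_word_count(word, ones) for word, ones in grouped.items()}


counts = word_count({"document1": "Apple Orange Mango Orange Grapes Plum"})
print(counts)
```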


Example Pseudo-code


map() reduce()


MapReduce wrappers

Wrappers have been developed in order to:
• provide better control over the MapReduce code
• aid in source code development

Some well-known examples:
• Sawzall (Google)
• Pig (originally Yahoo, now Apache)
• Hive (Facebook)
• DryadLINQ (Microsoft)


Widely applicable at Google

• Implemented as a C++ library linked to user programs
• Can read and write many different data types

Example uses:


Example: Generating Language Model Statistics

• Used in the statistical machine translation system
  o need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)
• Easy with MapReduce:
  o map: extract 5-word sequences => count from document
  o reduce: combine counts, and keep if count large enough
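A rough sketch of this job in plain Python follows. It assumes whitespace tokenization and runs both phases in one process; `map_ngrams` and `count_ngrams` are hypothetical names, and the threshold of 4 is the one quoted above.

```python
from collections import Counter

N = 5          # sequence length from the slide
MIN_COUNT = 4  # keep sequences seen at least this many times


def map_ngrams(document, n=N):
    # map: emit (n-word sequence, 1) for every position in the document.
    words = document.split()
    for i in range(len(words) - n + 1):
        yield (tuple(words[i:i + n]), 1)


def count_ngrams(documents, n=N, min_count=MIN_COUNT):
    totals = Counter()
    for document in documents:
        for ngram, one in map_ngrams(document, n):
            totals[ngram] += one  # reduce: combine counts per sequence
    # reduce: keep only sequences whose count is large enough
    return {ngram: count for ngram, count in totals.items() if count >= min_count}


corpus = ["a b c d e f"] * 4 + ["a b c d e"]
frequent = count_ngrams(corpus)
print(frequent)
```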


Example: Joining with Other Data

• Example: generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
  o per-host information might be in per-process data structure, or might involve RPC to a set of machines containing data for all sites
• Easy with MapReduce:
  o map: extract host name from URL, look up per-host info, combine with per-doc data and emit
  o reduce: identity function (just emit key/value directly)
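The map-side join described above might look like the following sketch. The per-host table is modelled as an in-process dictionary (one of the two options the slide mentions); `HOST_INFO`, its fields and the example URLs are invented for illustration.

```python
from urllib.parse import urlparse

# Hypothetical per-host table held in a per-process data structure.
HOST_INFO = {
    "example.org": {"pages_on_host": 2},
    "example.com": {"pages_on_host": 1},
}


def map_join(url, doc_summary, host_info=HOST_INFO):
    # Extract the host name from the URL, look up per-host info,
    # combine it with the per-document data and emit.
    host = urlparse(url).netloc
    yield (url, {**doc_summary, **host_info.get(host, {})})


def reduce_identity(key, values):
    # Identity reduce: emit every key/value pair unchanged.
    for value in values:
        yield (key, value)


docs = [
    ("http://example.org/a", {"title": "A"}),
    ("http://example.com/x", {"title": "X"}),
]
joined = [
    out
    for url, summary in docs
    for key, value in map_join(url, summary)
    for out in reduce_identity(key, [value])
]
print(joined)
```

All the work happens in the map function, so the reduce phase has nothing left to aggregate.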


MapReduce: Scheduling

• One master, many workers
  o Input data split into M map tasks (typically 64 MB in size)
  o Reduce phase partitioned into R reduce tasks
  o Tasks are assigned to workers dynamically
  o Often: M = 200,000, R = 4,000, workers = 2,000
• Master assigns each map task to a free worker
  o Considers locality of data to worker when assigning task
  o Worker reads task input (often from local disk)
  o Worker produces R local files containing intermediate k/v pairs
• Master assigns each reduce task to a free worker
  o Worker reads intermediate k/v pairs from map workers
  o Worker sorts & applies user’s Reduce op to produce the output
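To make the "R local files" step concrete, here is a small sketch of how one map worker could split its output into R buckets. The slide does not say how keys are assigned to reduce tasks; hashing the key modulo R is an assumption made here, and `partition` and `split_map_output` are hypothetical names.

```python
import zlib


def partition(key, num_reduce_tasks):
    # Assumed partitioning rule: stable hash of the key modulo R.
    # crc32 is used because Python's built-in hash() of str varies between runs.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def split_map_output(pairs, num_reduce_tasks):
    # One bucket per reduce task, standing in for the R local files.
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets


pairs = [("Apple", 1), ("Orange", 1), ("Mango", 1), ("Orange", 1)]
buckets = split_map_output(pairs, num_reduce_tasks=3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Whatever the rule, every pair with the same key must land in the same bucket so that a single reduce task sees all values for that key.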


Task Granularity and Pipelining

Fine granularity tasks: many more map tasks than machines
• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing

Often use 200,000 map / 5,000 reduce tasks w/ 2,000 machines


Fault tolerance: Handled via re-execution

• On worker failure:
  o Detect failure via periodic heartbeats
  o Re-execute completed and in-progress map tasks
  o Re-execute in-progress reduce tasks
  o Task completion committed through master
• Master failure:
  o State is checkpointed: new master recovers & continues

Robust: Once Google lost 1600 of 1800 machines, but finished fine


Refinement: Backup tasks

• Slow workers significantly lengthen completion time
  o Other jobs consuming resources on machine
  o Bad disks with soft errors transfer data very slowly
  o Weird things: processor caches disabled (!!)
• Solution: Near end of phase, spawn backup copies of tasks
  o Whichever one finishes first "wins"
• Effect: Dramatically shortens job completion time
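The effect of backup tasks can be seen in a toy timing model. The sketch below assumes one task per worker, all tasks starting at time 0, and a free machine available for every backup copy; the durations are made-up numbers, not measurements.

```python
def completion_time(task_durations, backup_start=None, backup_duration=None):
    """Time at which the last task finishes, optionally with backup copies."""
    finish_times = []
    for duration in task_durations:
        finish = duration
        # A task still running at backup_start gets a backup copy;
        # whichever copy finishes first "wins".
        if backup_start is not None and duration > backup_start:
            finish = min(duration, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)


durations = [10, 11, 9, 60]  # one straggler on a slow machine
without_backup = completion_time(durations)
with_backup = completion_time(durations, backup_start=12, backup_duration=10)
print(without_backup, with_backup)
```

In this model the phase is only as fast as its slowest task, so a single straggler dominates unless a backup copy overtakes it.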


Refinement: Locality Optimization

Master scheduling policy:
• Asks for locations of replicas of input file blocks
• Map tasks typically split into 64 MB
• Map tasks scheduled so that input block replicas are on the same machine or the same rack

Effect: Thousands of machines read input at local disk speed
• Without this, rack switches limit read rate


Refinement: Skipping Bad Records

Map/Reduce functions sometimes fail for particular inputs
• Best solution is to debug & fix, but not always possible

On seg fault:
• Send UDP packet to master from signal handler
• Include sequence number of record being processed

If master sees K failures for same record (typically K set to 2 or 3):
• Next worker is told to skip the record

Effect: Can work around bugs in third-party libraries
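The bookkeeping behind record skipping can be imitated in a single process. In this sketch a Python exception stands in for the seg fault and the UDP report, and a `Counter` stands in for the master's failure table; all names are hypothetical and the retry limit is an added assumption.

```python
from collections import Counter


class RecordFailure(Exception):
    """Carries the sequence number of the record that crashed the task."""

    def __init__(self, seq):
        super().__init__(f"record {seq} failed")
        self.seq = seq


def run_task(records, process, skip):
    results = []
    for seq, record in enumerate(records):
        if seq in skip:
            continue  # the worker was told to skip this record
        try:
            results.append(process(record))
        except Exception as exc:
            # Stand-in for the signal handler reporting the sequence number.
            raise RecordFailure(seq) from exc
    return results


def run_with_skipping(records, process, max_failures=2, max_attempts=10):
    failures = Counter()  # master's view: sequence number -> failure count
    skip = set()
    for _ in range(max_attempts):
        try:
            return run_task(records, process, skip), sorted(skip)
        except RecordFailure as failure:
            failures[failure.seq] += 1
            if failures[failure.seq] >= max_failures:
                skip.add(failure.seq)  # K failures seen: skip on the next run
    raise RuntimeError("task did not complete within max_attempts")


results, skipped = run_with_skipping(["1", "2", "x", "4"], int)
print(results, skipped)
```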


Other Refinements

• Optional secondary keys for ordering
• Compression of intermediate data
• Combiner: useful for saving network bandwidth
• Local execution for debugging/testing
• User-defined counters
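Of these refinements, the combiner is the easiest to show in code. The sketch below pre-aggregates word counts on the map side before anything would be sent over the network; it assumes the reduce operation is a sum (associative and commutative), and the function names are illustrative only.

```python
from collections import Counter


def map_words(text):
    # Plain map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in text.split()]


def combine(pairs):
    # Combiner: partial sums computed locally on the map worker.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


map_output = map_words("Orange Apple Orange Orange Apple")
combined = combine(map_output)
print(len(map_output), "pairs before combining,", len(combined), "after:", combined)
```

The reducer still sums whatever it receives, so the final counts are unchanged while fewer pairs cross the network.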


“Play around”

• Amazon Elastic MapReduce (Amazon EMR)
• Hortonworks Sandbox
• MapR Sandbox for Hadoop
• Qubole
• Microsoft Azure HDInsight
• Cloudera


MapReduce examples in Java


Serializable vs Writable

• Serializable stores the class name and the object representation to the stream; other instances of the class are referred to by a handle to the class name: this approach is not usable with random access
• For the same reason, the sorting needed for the shuffle and sort phase cannot be used with Serializable
• The deserialization process creates a new instance of the object, while Hadoop needs to reuse objects to minimize computation
• Hadoop introduces the two interfaces Writable and WritableComparable that solve these problems


Writable wrappers


Implementing Writable: the SumCount class


Glossary


WordCount

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• Input Data: The text of the book “Flatland” by Edwin Abbott


WordCount mapper


WordCount reducer


WordCount results


TopN: We want to find the top-n used words of a text file

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• Input Data: The text of the book “Flatland” by Edwin Abbott
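The slides implement TopN in Java with a mapper and a reducer; the sketch below only shows the same computation in plain Python on an inline string. The tokenization rule (lowercase letters and apostrophes) and the alphabetical tie-breaking are assumptions of this example.

```python
import re
from collections import Counter


def top_n(text, n):
    # Map step: split the text into lowercase words.
    words = re.findall(r"[a-z']+", text.lower())
    # Reduce step: count each word, then keep the n most frequent.
    counts = Counter(words)
    # Sort by descending count; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


sample = "the cat and the hat and the bat"
top_two = top_n(sample, 2)
print(top_two)
```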


TopN mapper


TopN reducer


TopN results


MEAN: We want to find the mean max temperature for every month

• http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/

• Input Data: Temperature in Milan (DDMMYYYY, MIN, MAX)
02012015,-2,7
03012015,-1,8
04012015,1,16
…
29012015,0,5
30012015,0,9
31012015,-3,6
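A compact way to see the computation is the sum/count pattern: the mapper emits (month, (max, 1)) and the reducer adds sums and counts before dividing. The sketch below runs it in memory on the six sample rows shown above; it assumes dates are written as DDMMYYYY without separators, and the function names are hypothetical.

```python
from collections import defaultdict


def map_max_temperature(line):
    date, _min_temp, max_temp = [field.strip() for field in line.split(",")]
    month_year = date[2:]  # "02012015" -> "012015"
    # Emit a (sum, count) pair so partial results can be added together.
    return month_year, (int(max_temp), 1)


def reduce_mean(sum_counts):
    total = sum(s for s, _ in sum_counts)
    count = sum(c for _, c in sum_counts)
    return total / count


def mean_max_by_month(lines):
    grouped = defaultdict(list)
    for line in lines:
        month_year, sum_count = map_max_temperature(line)
        grouped[month_year].append(sum_count)
    return {month: reduce_mean(values) for month, values in grouped.items()}


sample = [
    "02012015,-2, 7", "03012015,-1,8", "04012015,1,16",
    "29012015,0,5", "30012015,0,9", "31012015,-3,6",
]
means = mean_max_by_month(sample)
print(means)
```

Emitting (sum, count) rather than a running mean keeps the reduce step safe to apply to partial results, which is also what makes a combiner usable for this job.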


Mean mapper


Mean reducer


Mean results


TODO: k-means clustering algorithm

• We want to aggregate 2D points in clusters using the k-means algorithm

• Input data: A random set of points
2.2705 0.9178
1.8600 2.1002
2.0915 1.3679
-0.1612 0.8481
…


k-means algorithm

Input: data points D, number of clusters k

1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of centroids.

Repeat steps 2 and 3 until there are no more changes in the membership of the data points.

Output: data points with cluster memberships
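The three steps can be written down directly in plain Python. This is a single-machine sketch, assuming Euclidean distance, random initialization by sampling k input points, and an iteration cap as a safety net; none of these choices come from the slides.

```python
import math
import random


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, seed=0, max_iterations=100):
    # Step 1: initialize k centroids randomly (here: k distinct input points).
    centroids = random.Random(seed).sample(points, k)
    membership = None
    for _ in range(max_iterations):
        # Step 2: associate each data point with the nearest centroid.
        new_membership = [nearest_centroid(p, centroids) for p in points]
        if new_membership == membership:
            break  # no change in membership: converged
        membership = new_membership
        # Step 3: recalculate the position of each centroid.
        for i in range(k):
            members = [p for p, m in zip(points, membership) if m == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return membership, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
membership, centroids = k_means(points, k=2)
print(membership, centroids)
```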


MapReduce examples in Python


WordCount using mrjob

“a”, 936
“ab”, 6
“abbot”, 3
“abbott”, 2
“abbreviated”, 1
…


Product Recommendations

• Goal: For each product a client buys, generate a ‘people who bought this also bought this’ recommendation

• Input Data: product_id_1, product_id_2
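One way to read this task is as two counting passes: first count how often each pair of products is bought together, then keep the most frequent partner for each product. The sketch below assumes every input line is a pair of products from the same purchase and treats the relation as symmetric; the product ids and function names are invented for illustration.

```python
from collections import Counter, defaultdict


def coincident_purchase_frequency(pairs):
    # Pass 1: count how often each ordered pair of products appears together.
    counts = Counter()
    for first, second in pairs:
        counts[(first, second)] += 1
        counts[(second, first)] += 1  # assumed symmetric relation
    return counts


def top_recommendations(counts, n=1):
    # Pass 2: for each product, keep the n most frequently co-purchased ones.
    by_product = defaultdict(list)
    for (product, other), frequency in counts.items():
        by_product[product].append((other, frequency))
    return {
        product: [other for other, _ in sorted(others, key=lambda x: (-x[1], x[0]))[:n]]
        for product, others in by_product.items()
    }


purchases = [("p1", "p2"), ("p1", "p2"), ("p1", "p3"), ("p2", "p3")]
frequency = coincident_purchase_frequency(purchases)
recommendations = top_recommendations(frequency)
print(recommendations)
```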


Coincident Purchase Frequency


Top Recommendations


But…

Suppose you have:

• user data in one file,
• website data in another,

and you need to find

• the top 5 most visited pages by users aged 18-25.
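The logic of this query (filter, join, group, count, order, limit) fits in a few lines when the data is small enough for one machine. The record layouts below, (name, age) for users and (name, url) for visits, are assumptions made for this sketch, as are the sample values.

```python
from collections import Counter


def top_pages(users, visits, n=5, min_age=18, max_age=25):
    # Filter: keep users in the requested age range.
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: visits per URL made by eligible users.
    clicks = Counter(url for name, url in visits if name in eligible)
    # Order + limit: the n most visited pages, ties broken by URL.
    return sorted(clicks.items(), key=lambda item: (-item[1], item[0]))[:n]


users = [("ann", 19), ("bob", 40), ("cat", 25)]
visits = [
    ("ann", "/home"), ("bob", "/home"), ("cat", "/home"),
    ("cat", "/news"), ("bob", "/news"),
]
top = top_pages(users, visits)
print(top)
```

Expressed as raw MapReduce jobs, each of these steps needs its own map or reduce stage, which is the motivation for a higher-level language.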


In MapReduce


In Pig Latin


What is Apache Pig?

Idea: a MapReduce program essentially performs a group-by-aggregation in parallel over a cluster of machines.

• Pig is a high-level platform for creating MapReduce programs used with Hadoop.
• The language for this platform is called Pig Latin. It combines high-level declarative querying in the spirit of SQL, and low-level, procedural programming à la MapReduce.

Developed at Yahoo


Pig

• Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
• At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject).


Pig Latin

Pig Latin has the following key properties:

• Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
• Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
• Extensibility. Users can create their own functions to do special-purpose processing.


Performance


Pig Highlights

• User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM)
• UDFs can be written to take advantage of the combiner
• Four join implementations built in: hash, fragment-replicate, merge, skewed
• Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned
• Order by provides total ordering across reducers in a balanced way
• Writing load and store functions is easy once an InputFormat and OutputFormat exist
• Piggybank, a collection of user-contributed UDFs


Who uses Pig for what?

• 70% of production jobs at Yahoo (tens of thousands per day)
• Also used by Twitter, LinkedIn, Ebay, AOL, …
• Used to
  – Process web logs
  – Build user behavior models
  – Process images
  – Build maps of the web
  – Do research on raw data sets


Components

[Figure: User machine (Pig resides on user machine) → Hadoop Cluster (Job executes on cluster)]

No need to install anything extra on your Hadoop cluster.


So, why Pig?

• Faster development
  – Fewer lines of code
  – Don’t re-invent the wheel
• Flexible
  – Metadata is optional
  – Extensible
  – Procedural programming


But…

• Do you need your program to run faster?

• Does your analytic job run for hours?


Limitations of MapReduce

One of the major drawbacks of MapReduce is its inefficiency in running iterative algorithms.

MapReduce is not designed for iterative processes: after each iteration, the results have to be written to disk in order to pass them on to the next iteration.

Result: degradation of performance


Limitations of Pig

Pig uses batch-oriented frameworks, which means your analytics jobs will run for many minutes or hours.

Spark is faster!


What is Apache Spark?

• A fast and general compute engine for large-scale data processing.
• The major feature: the ability to perform in-memory computation (the data can be cached in memory).
• Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Developed at the University of California at Berkeley


Spark

• It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
• For certain tasks, it is tested to be up to 100x faster (data in memory) or 10x faster (data on disk) than Hadoop MapReduce
• It can run on the Hadoop YARN manager and can read data from HDFS.


Spark

• Designed to be used with a range of programming languages and on a variety of architectures.
• Increasingly popular with a wide range of developers, thanks to speed, simplicity, and broad support for existing development environments and storage systems.
• Relatively accessible to those learning to work with it for the first time.
• One of Apache's largest and most vibrant projects, with over 500 contributors from more than 200 organizations responsible for code in the software release.


Why?

• Spark was developed mainly to overcome a shortcoming of MapReduce: it is not optimized for iterative algorithms and interactive data analysis, which perform cyclic operations on the same set of data.
• Spark overcomes this problem by providing a new storage primitive called Resilient Distributed Datasets (RDDs).


Resilient Distributed Datasets (RDDs)

The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient.
• Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data.
• Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes.

Once data is loaded into an RDD, two basic types of operation can be carried out:
• Transformations, which create a new RDD by changing the original through processes such as mapping, filtering, and more;
• Actions, such as counts, which measure but do not change the original data.
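The split between transformations and actions, and the idea of lineage, can be imitated with a toy class. `ToyRDD` below is purely illustrative and is not Spark's API: it is single-process, keeps its source in a Python list, and supports only `map`, `filter`, `collect` and `count`.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source    # original data, never modified
        self._lineage = lineage  # recorded (kind, function) steps

    # Transformations: return a new ToyRDD with one more lineage step.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage over the source; nothing is cached, so a lost
        # result could always be rebuilt the same way.
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD([1, 2, 3, 4, 5, 6])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect(), even_squares.count(), numbers.collect())
```

The original dataset is untouched by the transformations, and the derived dataset is described entirely by its lineage until an action asks for a result.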


Word Count in Spark


Another example: logistic regression

A common machine learning algorithm for classifying objects such as, say, spam vs. non-spam emails.


Pig vs Spark

• Pig
  – This is the best data loading tool available inside Hadoop.
  – It uses a scripting language called Pig Latin, which is more workflow driven.
  – Don't need to be an expert Java programmer but need a few coding skills.
  – Is also an abstraction layer on top of map-reduce.
  – Simple to write and control.
• Spark
  – Pretty much the successor to map-reduce in Hadoop, with an emphasis on in-memory computing.
  – You'll need to be a pretty good Java programmer to use this.
  – Much lower level.


How to choose a platform?

• The decision to choose a particular platform for a certain application usually depends on the following important factors:
  – data size
  – speed or throughput optimization
  – model development (Training/Applying a model)


Example: k-means clustering algorithm

The k-means algorithm is used here to give more insight into how analytics algorithms behave on different platforms.

Characteristics:
• popular and widely used
• iterative nature
• compute-intensive task (calculating the centroids)
• aggregation of the local results to obtain a global solution


k-means algorithm

Input: data points D, number of clusters k

1. initialize k centroids randomly
2. associate each data point in D with the nearest centroid. This will divide the data points into k clusters.
3. recalculate the position of centroids.

Repeat steps 2 and 3 until there are no more changes in the membership of the data points.

Output: data points with cluster memberships


k-means on MapReduce

Map
Input: data points D, number of clusters k and centroids
1. for each data point d ∈ D do
2.   assign d to the closest centroid
Output: centroids with associated data points

Reduce
Input: centroids with associated data points
1. compute the new centroids by calculating the average of data points in cluster
2. write the global centroids to the disk
Output: new centroids
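One iteration of this scheme can be sketched as a map function, a reduce function and a small driver. The example runs in memory and returns the new centroids instead of writing them to disk; Euclidean distance and all function names are assumptions of this sketch.

```python
import math
from collections import defaultdict


def kmeans_map(point, centroids):
    # Map: assign the data point to the closest centroid (emit index, point).
    index = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return index, point


def kmeans_reduce(cluster_points):
    # Reduce: the new centroid is the average of the points in the cluster.
    return tuple(sum(axis) / len(cluster_points) for axis in zip(*cluster_points))


def kmeans_iteration(points, centroids):
    clusters = defaultdict(list)
    for point in points:
        index, assigned = kmeans_map(point, centroids)
        clusters[index].append(assigned)  # shuffle: group points by centroid
    # A centroid with no assigned points is left where it was.
    return [
        kmeans_reduce(clusters[i]) if clusters[i] else centroids[i]
        for i in range(len(centroids))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
print(new_centroids)
```

A driver would call `kmeans_iteration` repeatedly, feeding each result back in as the next set of centroids, until they stop changing; on Hadoop each such call is a separate job.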


k-means on Pig Latin

REGISTER udf.jar;
DEFINE find_centroid FindCentroid('$centroids');
points = LOAD 'points.txt' AS (id:int, pos:double);
centroided = FOREACH points GENERATE pos, find_centroid(pos) AS centroid;
grouped = GROUP centroided BY centroid;
result = FOREACH grouped GENERATE group, AVG(centroided.pos);
STORE result INTO 'output';


k-means on Spark

Similar to the MapReduce-based implementation

• Instead of writing the global centroids to the disk, they are written to memory, which speeds up the processing and reduces the disk I/O overhead.
• The data will be loaded into the system memory in order to provide faster access.


References

• J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, CA, pp. 137-150, 2004.
• Hadoop: Open source implementation of MapReduce. http://lucene.apache.org/hadoop/
• C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins. Pig Latin: A Not-so-foreign Language for Data Processing. In Proceedings of SIGMOD '08, 2008.
• D. Singh and C. K. Reddy. A survey on platforms for big data analytics. Journal of Big Data, 2014.
• M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. USENIX NSDI, 2012.