CTBD X preparation - GitHub Pages
Exam Preparation - Guido Salvaneschi
Lecture Material
Lectures
• Intro to distributed systems
• MapReduce
• HDFS
• Hive, HBase, Yarn
• Futures, Promises, Actors
• Spark
• Spark streaming
Papers
• MapReduce
• GFS
• Spark
Exercises
• MapReduce
• Futures, Actors
• Spark
Warning!
• These are just examples of the kind of questions that can appear in the exam.
• They are not supposed to be complete (of course).
• They are not representative of the coverage of the course topics in the exam.
• They do not cover questions about coding (but the "simple" exercises provide good examples for that).
Explain 3 reasons that motivate building a system in a distributed way
Why Distributed Systems
• Functional distribution
  • Computers have different functional capabilities (e.g., file server, printer) yet may need to share resources
  • Client/server
  • Data gathering / data processing
• Incremental growth
  • Easier to evolve the system
  • Modular expandability
• Inherent distribution in the application domain
  • Banks, reservation services, distributed games, mobile apps
  • Physically, or across administrative domains
  • Cash register and inventory systems for supermarket chains
  • Computer-supported collaborative work
Why Distributed Systems
• Economics
  • Collections of microprocessors offer a better price/performance ratio than large mainframes.
  • Low price/performance ratio: a cost-effective way to increase computing power.
• Better performance
  • Load balancing
  • Replication of processing power
  • A distributed system may have more total computing power than a mainframe. Ex.: 10,000 CPU chips, each running at 50 MIPS. It is not possible to build a 500,000-MIPS single processor, since that would require a 0.002 ns instruction cycle. Enhanced performance through load distribution.
• Increased reliability
  • Exploit the independent-failures property
  • If one machine crashes, the system as a whole can still survive.
• Another driving force: the existence of large numbers of personal computers, and the need for people to collaborate and share information.
Explain 3 goals (and challenges) of distributed systems
Goals and challenges of distributed systems
• Transparency
  • How to achieve the single-system image
• Performance
  • The system provides high (computing, storage, ...) performance
• Scalability
  • The ability to serve more users and provide acceptable response times with increased amounts of data
• Openness
  • An open distributed system can be extended and improved incrementally
  • Requires publication of component interfaces and standard protocols for accessing interfaces
• Reliability / fault tolerance
  • Maintain availability even when individual components fail
• Heterogeneity
  • Network, hardware, operating system, programming languages, different developers
• Security
  • Confidentiality, integrity and availability
Which techniques can be used to make a system scalable? Briefly explain them.
Scaling techniques
• Distribution
  • Splitting a resource (such as data) into smaller parts, and spreading the parts across the system (cf. DNS)
Scaling techniques
• Replication
  • Replicate resources (services, data) across the system, so they can be accessed in multiple places
  • Caching to avoid recomputation
  • Increased availability reduces the probability that a bigger system breaks
• Hiding communication latencies
  • Avoid waiting for responses to remote service requests
  • Use asynchronous communication
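As a minimal illustration of the caching technique above, a memoized function in Python; the function and counter names are purely illustrative:

```python
from functools import lru_cache

# Sketch of "caching to avoid recomputation": memoize an expensive
# function so repeated requests skip the real work. The counter shows
# how often the underlying computation actually runs.
calls = 0

@lru_cache(maxsize=None)
def expensive(n: int) -> int:
    global calls
    calls += 1
    return n * n

print(expensive(7))  # 49, computed
print(expensive(7))  # 49, served from the cache
print(calls)         # 1
```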
Show the signature of the Map function and the Reduce function in MapReduce. What are the Map phase and the Reduce phase responsible for?
Functional programming "foundations"
• map in MapReduce ↔ map in FP
  • map :: (a → b) → [a] → [b]
  • Example: double all numbers in a list.
  • > map ((*) 2) [1,2,3]
    [2,4,6]
• In a purely functional setting, an element of a list being computed by map cannot see the effects of the computations on other elements.
• If the order of application of a function f to elements in a list is commutative, then we can reorder or parallelize execution.
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
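The same map example in Python (one of Spark's API languages), as a minimal sketch:

```python
# map applies a function independently to every element, so the
# per-element computations can be reordered or parallelized.
doubled = list(map(lambda x: 2 * x, [1, 2, 3]))
print(doubled)  # [2, 4, 6]
```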
Functional programming "foundations"
• reduce in MapReduce ↔ fold in FP
  • foldl :: (b → a → b) → b → [a] → b
  • Move over the list, apply f to each element and an accumulator. f returns the next accumulator value, which is combined with the next element.
  • Example: sum of all numbers in a list.
  • > foldl (+) 0 [1,2,3]
    6
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
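The fold above can likewise be sketched in Python, where functools.reduce plays the role of foldl:

```python
from functools import reduce

# reduce threads an accumulator through the list, combining it with
# each element in turn, like foldl (+) 0 [1,2,3] in the slide.
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6
```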
MapReduce Basic Programming Model
• Transform a set of input key-value pairs to a set of output values:
  • Map: (k1, v1) → list(k2, v2)
  • The MapReduce library groups all intermediate pairs with the same key together.
  • Reduce: (k2, list(v2)) → list(v2)
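A minimal in-memory sketch of these signatures for word count; `map_fn`, `reduce_fn`, and `run` are our illustrative names, not the Hadoop API. The grouping step in the middle is what the MapReduce library performs between the two phases:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Map: (k1, v1) -> list((k2, v2))
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Reduce: (k2, list(v2)) -> list(v2)
    return [sum(counts)]

def run(docs):
    grouped = defaultdict(list)  # the "shuffle": group pairs by key
    for doc_id, text in docs:
        for word, count in map_fn(doc_id, text):
            grouped[word].append(count)
    return {w: reduce_fn(w, cs)[0] for w, cs in grouped.items()}

print(run([("d1", "a b a"), ("d2", "b")]))  # {'a': 2, 'b': 2}
```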
What is the problem with "stragglers" (slow workers) and what can be done to solve this problem?
Stragglers & Backup Tasks
• Problem: "Stragglers" (i.e., slow workers) significantly lengthen the completion time.
• Solution: Close to completion, spawn backup copies of the remaining in-progress tasks.
  • Whichever one finishes first "wins".
• Additional cost: a few percent more resource usage.
  • Example: A sort program without backup tasks takes 44% longer.
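The backup-task idea can be sketched with a thread pool: launch a speculative second copy of a slow task and take whichever finishes first. The task names and durations below are made up for illustration:

```python
import concurrent.futures
import time

def task(label, delay):
    time.sleep(delay)  # simulate work of varying speed
    return label

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    straggler = pool.submit(task, "straggler", 0.5)  # slow original
    backup = pool.submit(task, "backup", 0.05)       # speculative copy
    # Wait only until the first copy completes: it "wins".
    done, _ = concurrent.futures.wait(
        [straggler, backup],
        return_when=concurrent.futures.FIRST_COMPLETED)
    winner = next(iter(done)).result()
print(winner)  # backup
```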
Sketch the GFS architecture, presenting the components that constitute it and the main interactions.
GFS - Overview
Explain what a future is
• Placeholder object for a value that may not yet exist
• The value of the Future is supplied concurrently and can subsequently be used
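A minimal sketch in Python with concurrent.futures (the course uses Scala futures; this only illustrates the placeholder idea, and `slow_answer` is our illustrative name):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_answer():
    time.sleep(0.1)  # simulate a slow computation
    return 21 * 2

pool = ThreadPoolExecutor(max_workers=1)
fut = pool.submit(slow_answer)  # returns at once: fut is a placeholder
print(fut.result())             # blocks until the value exists: 42
pool.shutdown()
```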
Which underlying data structure is used by Apache Spark? Show a minimal example and indicate where such a data structure is used.
RDD (Resilient Distributed Datasets)
• Restricted form of distributed shared memory
  • Immutable, partitioned collection of records
  • Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
• Efficient fault tolerance using lineage
  • Log coarse-grained operations instead of fine-grained data updates
  • An RDD has enough information about how it is derived from other datasets
  • Recompute lost partitions on failure
Spark and RDDs
• Implements Resilient Distributed Datasets (RDDs)
• Operations on RDDs
  • Transformations: define a new dataset based on previous ones
  • Actions: start a job to execute on the cluster
• Well-designed interface to represent RDDs
  • Makes it very easy to implement transformations
  • Most Spark transformation implementations are < 20 LoC
More on RDDs
Work with distributed collections as you would with local ones
• Resilient Distributed Datasets (RDDs)
  • Immutable collections of objects spread across a cluster
  • Built through parallel transformations (map, filter, etc.)
  • Automatically rebuilt on failure
  • Controllable persistence (e.g., caching in RAM)
  • Different storage levels available; fallback to disk possible
• Operations
  • Transformations (e.g. map, filter, groupBy, join)
    • Lazy operations that build RDDs from other RDDs
  • Actions (e.g. count, collect, save)
    • Return a result or write it to storage
Workflow with RDDs
• Create an RDD from a data source: <list>
• Apply transformations to an RDD: map, filter
• Apply actions to an RDD: collect, count

distFile = sc.textFile("...", 4)
• RDD distributed in 4 partitions
• Elements are lines of input
• Lazy evaluation means no execution happens now
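The lazy-transformation / eager-action split can be mimicked in plain Python without a Spark cluster: a generator pipeline, like an RDD transformation, does no work until something consumes it. This is only an analogy, not Spark code:

```python
evaluations = 0

def upper(line):
    # Count how often the per-element work actually runs.
    global evaluations
    evaluations += 1
    return line.upper()

lines = ["spark is fast", "hadoop is batch"]
transformed = (upper(l) for l in lines)  # "transformation": nothing runs yet
print(evaluations)          # 0
result = list(transformed)  # "action": forces the computation
print(evaluations)          # 2
print(result)
```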
Give a possible explanation for why the performance of the PageRank computation differs significantly between Hadoop and Spark
[Figure: PageRank iteration time (s) vs. number of machines. Hadoop: 171 s on 30 machines, 80 s on 60 machines; Spark: 23 s on 30 machines, 14 s on 60 machines.]
Spark
• Fast, expressive cluster computing system compatible with Apache Hadoop
  • Works with any Hadoop-supported storage system (HDFS, S3, Avro, ...)
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
  • Up to 100× faster
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
  • Often 2-10× less code
PageRank
• Give pages ranks (scores) based on links to them
  • Links from many pages → high rank
  • Link from a high-rank page → high rank
• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark's in-memory caching
  • Multiple iterations over the same data
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png
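A toy PageRank on a 3-page graph, in plain Python (not Spark), makes the iterative structure concrete: every iteration re-reads the same link data, which Spark keeps cached in memory while Hadoop re-reads it from disk between MapReduce jobs. The damping factor 0.85 and 10 iterations are conventional choices, not values from the lecture:

```python
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(links, iterations=10, d=0.85):
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        # Each page sends rank / out-degree along its outgoing links...
        for page, outs in links.items():
            for dest in outs:
                contribs[dest] += ranks[page] / len(outs)
        # ...and ranks are recomputed from the received contributions.
        ranks = {page: (1 - d) + d * c for page, c in contribs.items()}
    return ranks

for page, rank in sorted(pagerank(links).items()):
    print(f"{page}: {rank:.3f}")
```

Page c, linked from both a and b, ends up with the highest rank, matching the intuition in the bullets above.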
What is a resource management system, e.g., Apache YARN?
Resource Management
• Typically implemented by a system deployed across the nodes of a cluster
• Layer below "frameworks" like Hadoop
  • On any node, the system keeps track of availabilities
  • Applications on top use this information, and estimations of their own requirements, to choose where to deploy something
• RM systems (RMSs) differ in the abstractions/interfaces provided and the actual scheduling decisions
Given the scenario X, what is the technology/approach that you would recommend for solving problem Y?
• MapReduce
• HDFS
• A database
• HBase
• Apache Spark
• Spark streaming
• ...
MapReduce vs. Traditional RDBMS

            MapReduce                     Traditional RDBMS
Data size   Petabytes                     Gigabytes
Access      Batch                         Interactive and batch
Updates     Write once, read many times   Read and write many times
Structure   Dynamic schema                Static schema
Integrity   Low                           High (normalized data)
Scaling     Linear                        Non-linear (general SQL)
A Summary
[Figure: summary chart positioning systems along two axes: programming model (declarative vs. procedural) and data organization (structured, flat, raw types)]
Event-driven applications
• Can we use existing technologies for batch processing?
  • They are not designed to minimize latency
• We need a whole new model!
Esper in a nutshell
• EPL: a rich language to express rules
  • Grounded in the DSMS approach
    • Windowing
    • Relational select, join, aggregate, ...
    • Relation-to-stream operators to produce output
    • Sub-queries
  • Queries can be combined to form a graph
  • Introduces some features of CEP languages
    • Pattern detection
• Designed for performance
  • High throughput
  • Low latency
Goals
Batch, interactive, streaming: one stack to rule them all?
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem (Hadoop/HDFS)