Post on 06-Oct-2020
CS378 – Big Data Programming
Lecture 5: Summarization Patterns
CS378 – Fall 2017, Big Data Programming, slide 1
Review
• Assignment 2 – Questions?
• mrunit – How do you test map() or reduce() calls that produce multiple outputs?
• Issues with calculating variance
Summarization
• Other summarizations of interest – min, max, median
• Suppose we are interested in these metrics for paragraph length (Assignment 2 data)
  – If paragraph lengths are normally distributed, then the median will be very near the mean
  – If the distribution of paragraph lengths is skewed, then the mean and median will be very different
Summarization
• Min and max are straightforward
• For each paragraph, output two values
  – Min length (the length of the current paragraph)
  – Max length (the length of the current paragraph)
  – Key?
• Combiner will get a key and a list of values
  – Select the min and max from the list, output the values
  – Key?
• Reducer does the same
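The pattern above can be sketched outside Hadoop. This is a minimal Python simulation (not Hadoop code): the mapper emits each paragraph's length as both a candidate min and max under a single constant key, and the combiner and reducer run the same selection logic. The key name and function names are illustrative.

```python
# Sketch of the min/max summarization pattern, simulated in Python.
# A single constant key ("stats") routes every value to one reducer.

def map_paragraph(paragraph):
    # Emit the paragraph length (here, in words) as both min and max.
    n = len(paragraph.split())
    return ("stats", (n, n))

def combine(key, values):
    # Keep the smallest min and largest max from the (min, max) pairs.
    lo = min(v[0] for v in values)
    hi = max(v[1] for v in values)
    return (key, (lo, hi))

# Because min and max are associative and commutative, the reducer
# can be the exact same function as the combiner.
reduce_minmax = combine

paragraphs = ["a b c", "a b", "a b c d e"]
pairs = [map_paragraph(p) for p in paragraphs]
key, result = reduce_minmax("stats", [v for _, v in pairs])
```

This also answers the "Key?" question in spirit: any constant key works, since all partial results must meet at one reducer.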
Summarization
• Median – Get all the values, sort them, then find the middle
• Since our computation is distributed, no mapper sees all the values
• Should we send them all to one reducer?
  – Not utilizing map-reduce (computation not distributed)
  – Data sizes likely too large to keep in memory
Summarization
• Median
  – Keep the unique paragraph lengths, and
  – The frequency of each length
• Map output: <paragraph length, 1>
• Combiner gets a list of these pairs and updates the count for recurring lengths
• Reducer does the same, then computes the median
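A small Python sketch of this idea (illustrative, not Hadoop code): the combiner/reducer folds <length, count> pairs into a frequency table, and the median is read off by walking the distinct lengths in sorted order until half the total count has been passed. For an even total this sketch returns the upper-middle value; a fuller version would average the two middle lengths.

```python
# Sketch: compute the median from (length, count) pairs, so the
# reducer holds only the distinct lengths rather than all values.
from collections import Counter

def combine_counts(pairs):
    # Combiner/reducer step: sum the counts for recurring lengths.
    counts = Counter()
    for length, c in pairs:
        counts[length] += c
    return counts

def median_from_counts(counts):
    total = sum(counts.values())
    mid = total // 2
    seen = 0
    for length in sorted(counts):
        seen += counts[length]
        if seen > mid:
            return length

counts = combine_counts([(5, 1), (3, 1), (5, 1), (7, 1), (3, 1)])
med = median_from_counts(counts)  # values sorted: 3, 3, 5, 5, 7
```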
Summarization
• Median
  – Hadoop provides the SortedMapWritable class
  – Can associate a frequency count with a paragraph length
  – Keeps the lengths in sorted order
• See the example in Chapter 2 of MapReduce Design Patterns
• How could we compute all in one pass over the data?
  – min, max, median
Counters
• The Hadoop map-reduce infrastructure provides counters
  – Accessed by group name
  – Cannot have a large number of counters (for example, you can't use counters to solve WordCount)
  – A few tens of counters can be used
• Counters are stored in memory on the JobTracker
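Conceptually, counters behave like a small map keyed by (group, name) that each task increments and the framework sums, which is why they must stay few in number. A Python sketch of that aggregation (illustrative semantics, not the Hadoop API):

```python
# Sketch: counters as (group, name) -> count, incremented per task
# and merged by the framework. The merged table lives in JobTracker
# memory, so only a few tens of counters are practical.
from collections import Counter

def aggregate_counters(per_task_counters):
    total = Counter()
    for c in per_task_counters:
        total.update(c)  # sums counts for matching (group, name) keys
    return total

# Hypothetical group/counter names for illustration:
task1 = Counter({("Records", "paragraphs"): 100, ("Records", "empty"): 2})
task2 = Counter({("Records", "paragraphs"): 80})
merged = aggregate_counters([task1, task2])
```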
Counters (Figure 2-6, MapReduce Design Patterns)
How Hadoop MapReduce Works
• We've seen some terms like:
  – Job, JobTracker, TaskTracker (MapReduce 1)
  – Job, ResourceManager, NodeManager (YARN, MapReduce 2)
• Let's look at what they do
• Details from Chapter 7, Hadoop: The Definitive Guide, 4th Edition
How Hadoop MapReduce Works (Figure 7-1, Hadoop: The Definitive Guide, 4th Edition)
Job Submission
• Job submission
  – Do the input files exist? Can splits be computed?
  – Does the output directory exist?
    • If yes, the job fails; Hadoop expects to create this directory
  – Copy resources to HDFS:
    • JAR files
    • Configuration file
    • Computed file splits
Resource Manager
• Creates tasks (work to be done)
  – Map task for each input split
  – Requested number of reducer tasks
  – Job setup, job cleanup tasks
• Map tasks are assigned to task trackers that are "close" to the input split location
  – Data-local preferred
  – Rack-local next
• Reduce tasks can go anywhere. Why?
• Scheduling algorithm orders the tasks
Task Execution
• Configured for several map and reduce tasks
• Each task has status info (state, progress, counters)
• Periodically sends info to MRAppMaster
  – Running, successful completion, failed
  – Progress (% complete)
• For a new task:
  – Copy files to the local file system (JAR, configuration)
  – Launch a new JVM (YarnChild drives execution)
  – Load the mapper/reducer class and run the task
Task Progress
• Reading an input pair (mapper or reducer)
• Writing an output pair (mapper or reducer)
• Setting the status description
• Incrementing a counter
• Reporting progress
Task Progress
• Mapper – straightforward
  – How much of the input has been processed
• Reducer – more complicated
  – Sort, shuffle, and reduce are considered here
  – Progress is an estimate of how much of the total work has been done
  – One-third allocated to each
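The one-third allocation can be written out as a tiny formula, sketched here in Python (an illustration of the estimate described above, not Hadoop's actual implementation):

```python
# Sketch: reducer progress with shuffle, sort, and reduce each
# allocated one-third of the total work.

def reduce_progress(phase, fraction):
    # phase: 0 = shuffle, 1 = sort, 2 = reduce; fraction in [0, 1]
    # of the current phase that has completed.
    return (phase + fraction) / 3.0

p = reduce_progress(2, 0.5)  # halfway through the reduce phase -> 5/6
```

So a reducer that has finished its shuffle and sort and is halfway through reducing reports roughly 83% complete.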
Shuffle (Figure 7-4, Hadoop: The Definitive Guide, 4th Edition)
MapReduce in Hadoop (Figure 2.4, Hadoop: The Definitive Guide)
The number of reduce tasks is not governed by the size of the input, but instead is specified independently. In "The Default MapReduce Job" on page 227, you will see how to choose the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner—which buckets keys using a hash function—works very well.
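The default partitioner's behavior can be sketched in one line of Python (Hadoop's HashPartitioner actually uses the key's Java hashCode(); Python's hash() stands in for it here):

```python
# Sketch of default hash partitioning: every record with the same
# key lands in the same partition, so one reducer sees all of a
# key's values.

def partition(key, num_reduce_tasks):
    # Python's % always yields a non-negative result for a positive
    # divisor, so this is a valid partition index in [0, n).
    return hash(key) % num_reduce_tasks

p1 = partition("apple", 4)
p2 = partition("apple", 4)  # same key -> same partition, always
```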
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in "Shuffle and Sort" on page 208.
Figure 2-4. MapReduce data flow with multiple reduce tasks
Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle because the processing can be carried out entirely in parallel (a few examples are discussed in "NLineInputFormat" on page 247). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner