Google’s MapReduce

  • Google’s MapReduce

    Connor Poske
    Florida State University

  • Outline

    Part I:
    - History
    - MapReduce architecture and features
    - How it works

    Part II:
    - MapReduce programming model and example

  • Initial History

    There is a demand for large-scale data processing. The folks at Google have discovered certain common themes for processing very large input sizes:
    - Multiple machines are needed
    - There are usually 2 basic operations on the input data:
      1) Map
      2) Reduce

  • Map

    - Similar to the Lisp primitive
    - Applies a single function to multiple inputs

    In the MapReduce model, the map function applies an operation to a list of pairs of the form (input_key, input_value), and produces a set of INTERMEDIATE key/value tuples.

    Map(input_key, input_value) -> (output_key, intermediate_value) list

  • Reduce

    - Accepts the set of intermediate key/value tuples as input
    - Applies a reduce operation to all values that share the same key

    Reduce(output_key, intermediate_value list) -> output list

  • Quick example

    Pseudo-code that counts the number of occurrences of each word in a large collection of documents:

    Map(String fileName, String fileContents)
      // fileName is the input key, fileContents is the input value
      for each word w in fileContents
        EmitIntermediate(w, 1)

    Reduce(String word, Iterator values)
      // word: intermediate key, values: a list of counts
      int count = 0
      for each v in values
        count += v
      Emit(AsString(count))
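The pseudo-code above can be tried out on a single machine. The following is only a minimal in-memory sketch, not the MapReduce library itself: the names `MapWords`, `ReduceCounts`, and `CountWords` are made up for illustration, and the grouping step in `CountWords` stands in for the work the real library does across many machines.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, int>;

// Map: emit (word, 1) for every whitespace-separated word in the contents.
std::vector<KeyValue> MapWords(const std::string& file_name,
                               const std::string& file_contents) {
  (void)file_name;  // the input key is not needed for word count
  std::vector<KeyValue> intermediate;
  std::istringstream words(file_contents);
  std::string w;
  while (words >> w) intermediate.emplace_back(w, 1);
  return intermediate;
}

// Reduce: sum all counts that share the same word.
int ReduceCounts(const std::string& word, const std::vector<int>& values) {
  (void)word;  // the key is not needed to compute the sum
  int count = 0;
  for (int v : values) count += v;
  return count;
}

// Hypothetical single-process driver: run Map on each document,
// group the intermediate pairs by key, then run Reduce once per key.
std::map<std::string, int> CountWords(
    const std::vector<std::pair<std::string, std::string>>& documents) {
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& doc : documents) {
    for (const auto& kv : MapWords(doc.first, doc.second)) {
      grouped[kv.first].push_back(kv.second);
    }
  }
  std::map<std::string, int> result;
  for (const auto& entry : grouped) {
    result[entry.first] = ReduceCounts(entry.first, entry.second);
  }
  return result;
}
```

Calling `CountWords` with a list of (file name, file contents) pairs returns one total per distinct word. Summing the values rather than counting them keeps Reduce correct even if some counts were already partially added up before reaching it.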

  • The idea sounds good, but...

    We can’t forget about the problems arising from large-scale, multiple-machine data processing:
    - How do we parallelize everything?
    - How do we balance the input load?
    - How do we handle failures?

    Enter the MapReduce model

  • MapReduce

    - The MapReduce implementation is an abstraction that hides these complexities from the programmer
    - The user defines the Map and Reduce functions
    - The MapReduce implementation automatically distributes the data, then applies the user-defined functions to the data
    - Actual code is slightly more complex than the previous example

  • MapReduce Architecture

    - User program with Map and Reduce functions
    - Cluster of average (commodity) PCs
    - Upon execution, the cluster is divided into:
      - Master worker
      - Map workers
      - Reduce workers

  • Execution Overview

    - Split up the input data, start up the program on all machines
    - Master machine assigns M Map and R Reduce tasks to idle worker machines
    - The Map function is executed and its results are buffered locally
    - Periodically, data in local memory is written to disk. The locations of the data on disk are forwarded to the master

    --Map phase complete

    - Reduce worker uses RPCs to read intermediate data from the Map machines. The data is sorted by key.
    - Reduce worker iterates over the data and passes each unique key, along with its associated values, to the Reduce function
    - Master wakes up the user program, and the MapReduce call returns.
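The reduce-side steps (sort by key, then hand each unique key and its values to Reduce) can be sketched as below. This is an illustration under stated assumptions, not the library's code: `RunReducePhase` and `ReduceFn` are hypothetical names, and a plain in-memory vector stands in for the intermediate data that a real reduce worker would fetch over RPC.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;
using ReduceFn =
    std::function<void(const std::string&, const std::vector<std::string>&)>;

// Sorts the intermediate pairs by key, then calls `reduce` once per unique
// key with all of the values that share that key.
void RunReducePhase(std::vector<KeyValue> intermediate, const ReduceFn& reduce) {
  // stable_sort keeps values for the same key in their arrival order.
  std::stable_sort(intermediate.begin(), intermediate.end(),
                   [](const KeyValue& a, const KeyValue& b) {
                     return a.first < b.first;
                   });
  std::size_t i = 0;
  while (i < intermediate.size()) {
    const std::string& key = intermediate[i].first;
    std::vector<std::string> values;
    std::size_t j = i;
    // After sorting, equal keys are adjacent, so one linear scan groups them.
    while (j < intermediate.size() && intermediate[j].first == key) {
      values.push_back(intermediate[j].second);
      ++j;
    }
    reduce(key, values);
    i = j;
  }
}
```

Sorting first means the grouping needs only a single pass and never has to hold more than one key's values at a time, which is presumably why the sort comes before the iteration in the overview.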

  • Execution Overview

  • Master worker

    - Stores state information about Map and Reduce tasks and the workers running them
      - Idle, in-progress, or completed
    - Stores the locations and sizes on disk of intermediate file regions on Map machines
      - Pushes this information incrementally to workers with in-progress reduce tasks
    - Displays the status of the entire operation via HTTP
      - Runs an internal HTTP server
      - Displays progress, e.g. bytes of intermediate data, bytes of output, processing rates, etc.

  • Parallelization

    - Map() runs in parallel, creating different intermediate output from different input keys and values
    - Reduce() runs in parallel, each instance working on a different key
    - All data is processed independently by different worker machines
    - The Reduce phase cannot begin until the Map phase is completely finished!

  • Load Balancing

    User defines a MapReduce spec object:

    MapReduceSpecification spec;
    spec.set_machines(2000);
    spec.set_map_megabytes(100);
    spec.set_reduce_megabytes(100);

    That’s it! The library will automatically take care of the rest.

  • Fault Tolerance

    Master pings workers periodically

    switch (ping response)
      case (idle): assign task if possible
      case (in-progress): do nothing
      case (completed): reset to idle
      case (no response): reassign task
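The ping handling above maps directly onto a small decision function. The sketch below is only one way to write it: the `PingResponse` and `Action` enums and the `HandlePing` function are invented for this example, and the `task_available` flag is an assumption that models "if possible".

```cpp
// Possible outcomes of pinging a worker (hypothetical names).
enum class PingResponse { kIdle, kInProgress, kCompleted, kNoResponse };

// What the master decides to do next (hypothetical names).
enum class Action { kAssignTask, kDoNothing, kResetToIdle, kReassignTask };

// Mirrors the switch on the slide: one decision per ping response.
Action HandlePing(PingResponse response, bool task_available) {
  switch (response) {
    case PingResponse::kIdle:
      // "Assign task if possible": only when there is work left to hand out.
      return task_available ? Action::kAssignTask : Action::kDoNothing;
    case PingResponse::kInProgress:
      return Action::kDoNothing;
    case PingResponse::kCompleted:
      return Action::kResetToIdle;
    case PingResponse::kNoResponse:
      // The worker is presumed dead, so its task goes to another machine.
      return Action::kReassignTask;
  }
  return Action::kDoNothing;  // unreachable for valid enum values
}
```

Keeping the decision separate from the actual RPC and scheduling code makes the policy easy to test on its own.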

  • Fault Tolerance

    - What if a map task completes but the machine fails before the intermediate data is retrieved via RPC?
      - Re-execute the map task on an idle machine
    - What if the intermediate data is partially read, but the machine fails before all reduce operations can complete?
    - What if the master fails? PWNED (the whole MapReduce operation is aborted and has to be restarted)

  • Fault Tolerance

    Skipping bad records
    - Optional parameter to change the mode of execution
    - When enabled, the MapReduce library detects records that cause crashes and skips them

    Bottom line: MapReduce is very robust in its ability to recover from failure and handle errors
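To make the bad-record idea concrete, here is a rough single-process sketch. It is an analogy rather than the library's mechanism: a thrown C++ exception stands in for a crash, and `MapSkippingBadRecords` is a hypothetical helper, not part of the MapReduce API.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Runs map_fn over every record. A record whose processing throws is
// remembered by index and skipped, so one bad record cannot stop the job.
std::set<std::size_t> MapSkippingBadRecords(
    const std::vector<std::string>& records,
    const std::function<void(const std::string&)>& map_fn) {
  std::set<std::size_t> skipped;
  for (std::size_t i = 0; i < records.size(); ++i) {
    try {
      map_fn(records[i]);
    } catch (const std::exception&) {
      skipped.insert(i);  // note the offending record and move on
    }
  }
  return skipped;
}
```

The returned set of indices lets the caller report how many records were dropped, which matters because skipping trades completeness for progress.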

  • Part II: Programming Model

    - The MapReduce library is extremely easy to use
    - Involves setting up only a few parameters, and defining the map() and reduce() functions:
      1) Define map() and reduce()
      2) Define and set parameters for the MapReduceInput object
      3) Define and set parameters for the MapReduceOutput object
      4) Main program

  • Map()

    class WordCounter : public Mapper {
     public:
      virtual void Map(const MapInput &input) {
        // parse each word and for each word
        // emit(word, 1)
      }
    };
    REGISTER_MAPPER(WordCounter);

  • Reduce()

    class Adder : public Reducer {
     public:
      virtual void Reduce(ReduceInput *input) {
        // Iterate over all entries with the same key
        // and add the values
      }
    };
    REGISTER_REDUCER(Adder);

  • Main()

    int main(int argc, char **argv) {
      MapReduceSpecification spec;
      MapReduceInput *input;

      // Store list of input files into spec
      // (start at 1: argv[0] is the program name)
      for (int i = 1; i < argc; ++i) {
        input = spec.add_input();
        input->set_format("text");
        input->set_filepattern(argv[i]);
        input->set_mapper_class("WordCounter");
      }

  • Main()

      // Specify the output files
      MapReduceOutput *out = spec.output();

      out->set_filebase("/gfs/test/freq");
      out->set_num_tasks(100);
      // freq-00000-of-00100
      // freq-00001-of-00100
      out->set_format("text");
      out->set_reducer_class("Adder");

  • Main()

      // Tuning parameters and actual MapReduce call
      MapReduceResult result;
      if (!MapReduce(spec, &result)) abort();

      return 0;
    }  // end main

  • Other possible uses

    - Distributed grep
      - Map emits a line if it matches a supplied pattern
      - Reduce simply copies intermediate data to output
    - Count URL access frequency
      - Map processes logs of web page requests and emits (URL, 1)
      - Reduce adds all values for each URL and emits (URL, count)
    - Inverted index
      - Map parses each document and emits a sequence of (word, document ID) pairs
      - Reduce accepts all pairs for a given word, sorts the list by document ID, and emits (word, list(document ID))
    - Many more
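As one worked case from the list above, the inverted index can be sketched in a few lines. This is a single-machine illustration with made-up names (`MapDocument`, `ReducePostings`, `BuildIndex`), and dropping duplicate document IDs in the reducer is an extra choice of this sketch rather than something the slide states.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Posting = std::pair<std::string, int>;  // (word, document ID)

// Map: parse one document and emit a (word, document ID) pair per word.
std::vector<Posting> MapDocument(int doc_id, const std::string& text) {
  std::vector<Posting> out;
  std::istringstream words(text);
  std::string w;
  while (words >> w) out.emplace_back(w, doc_id);
  return out;
}

// Reduce: sort the document IDs collected for one word.
// Assumption of this sketch: repeated IDs are also collapsed to one entry.
std::vector<int> ReducePostings(std::vector<int> doc_ids) {
  std::sort(doc_ids.begin(), doc_ids.end());
  doc_ids.erase(std::unique(doc_ids.begin(), doc_ids.end()), doc_ids.end());
  return doc_ids;
}

// Hypothetical driver: group the Map output by word, then Reduce each group.
std::map<std::string, std::vector<int>> BuildIndex(
    const std::vector<std::pair<int, std::string>>& docs) {
  std::map<std::string, std::vector<int>> grouped;
  for (const auto& d : docs) {
    for (const auto& p : MapDocument(d.first, d.second)) {
      grouped[p.first].push_back(p.second);
    }
  }
  for (auto& entry : grouped) {
    entry.second = ReducePostings(std::move(entry.second));
  }
  return grouped;
}
```

Looking up a word in the returned map gives the sorted list of documents that contain it, which is the (word, list(document ID)) output described for the Reduce step.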

  • Conclusion

    - MapReduce provides an easy-to-use, clean abstraction for large-scale data processing
    - Very robust in fault tolerance and error handling
    - Can be used for multiple scenarios
    - Restricting the programming model to the Map and Reduce paradigms makes it easy to parallelize computations and make them fault-tolerant