Hadoop MapReduce

About Apache Hadoop - MapReduce concept

Table of Contents
1. Introduction
2. Key Ideas Behind MapReduce
3. What is MapReduce?
4. Hadoop implementation of MapReduce
5. Anatomy of a MapReduce Job Run
5.1. Job Submission
5.2. Job Initialization
5.3. Task Assignment
5.4. Task Execution
5.5. Progress and Status Updates
5.6. Job Completion
6. Shuffle and Sort in Hadoop
7. MapReduce example: Weather Dataset

1. Introduction

Many scientific applications require processes for handling data that no longer fit on a single cost-effective computer. Besides experimental data, scientific simulations are creating vast data stores that require new methods to analyze and organize the data. Parallel/distributed processing of data-intensive applications typically involves partitioning the data into multiple segments that can be processed independently, using the same executable application program, in parallel on an appropriate computing platform, and then reassembling the results to produce the completed output data. A MapReduce programmer is able to focus on the problem that needs to be solved, since only the map and reduce functions need to be implemented; the framework takes care of the lower-level mechanisms that control the data flow, a burden the programmer would otherwise have to deal with.

2. Key Ideas Behind MapReduce

Assume failures are common. A well designed, fault tolerant service must cope with failures up to a point without impacting the quality of service; failures should not result in inconsistencies or indeterminism from the user perspective. As servers go down, other cluster nodes should seamlessly step in to handle the load, and overall performance should gracefully degrade as server failures pile up. Just as important, a broken server that has been repaired should be able to seamlessly rejoin the service without manual reconfiguration by the administrator. Mature implementations of the MapReduce programming model are able to cope robustly with failures through a number of mechanisms, such as automatic task restarts on different cluster nodes.

Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have processing nodes and storage nodes linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Process data sequentially and avoid random access. Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access and instead organize computations so that data is processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine.
On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost difference between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain. MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce's design explicitly trade latency for throughput.

Hide system-level details from the application developer. According to many guides on the practice of software engineering written by experienced industry professionals, one of the key reasons why writing code is difficult is that the programmer must simultaneously keep track of many details in short-term memory, ranging from the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algorithm that requires special treatment). This imposes a high cognitive load and requires intense concentration, which leads to a number of recommendations about a programmer's environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The challenges in writing distributed software are greatly compounded: the programmer must manage details across several threads, processes, or machines. Of course, the biggest headache in distributed programming is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns. This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are taught to use low-level devices such as mutexes and to apply high-level design patterns such as producer-consumer queues to tackle these challenges, but the truth remains: concurrent programs are notoriously difficult to reason about and even harder to debug.

MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and is therefore easy for the programmer to reason about. MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines. The first is under the control of the programmer, while the second is exclusively the responsibility of the execution framework or runtime. The advantage is that the execution framework only needs to be designed and verified for correctness once; thereafter, as long as the developer expresses computations in the programming model, code is guaranteed to behave as expected.
The upshot is that the developer is freed from having to worry about system-level details (e.g., no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.

Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define scalability along at least two dimensions. First, in terms of data: given twice the amount of data, the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of resources: given a cluster twice the size, the same algorithm should take no more than half as long to run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across a wide range of settings: on data ranging from gigabytes to terabytes, on clusters consisting of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors without requiring any modifications whatsoever, not even tuning of parameters. The truth is that most current algorithms are far from this ideal. In the domain of text processing, for example, most algorithms today assume that data fits in memory on a single machine. For the most part, this is a fair assumption. But what happens when the amount of data doubles in the near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable solution, as the amount of data is growing faster than the price of memory is falling. Furthermore, the price of a machine does not scale linearly with the amount of available memory beyond a certain point (once again, the scaling up vs. scaling out argument). Quite simply, algorithms that require holding intermediate data in memory on a single machine will break on sufficiently large datasets; moving from a single machine to a cluster architecture requires fundamentally different algorithms. Perhaps the most exciting aspect of MapReduce is that it represents a small step toward algorithms that behave in this ideal manner. Recall that the programming model maintains a clear separation between what computations need to occur and how those computations are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce programming model is simple enough that it is actually possible, in many circumstances, to approach the ideal scaling characteristics discussed above. If running an algorithm on a particular dataset takes 100 machine hours, then we should be able to finish in an hour on a cluster of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours. With MapReduce, this isn't so far from the truth, at least for some applications.

Data/code co-location. The phrase "data distribution" is misleading, since one of the key ideas behind MapReduce is to move the code, not the data. However, the more general point remains: in order for computation to occur, we need to somehow feed data to the code. In MapReduce, this issue is inextricably intertwined with scheduling and relies heavily on the design of the underlying distributed file system. To achieve data locality, the scheduler starts tasks on the node that holds a particular block of data (i.e., on its local drive) needed by the task. This has the effect of moving code to the data.
If this is not possible (e.g., a node is already running too many tasks), new tasks will be started elsewhere, and the necessary data will be streamed over the network. An important optimization here is to prefer nodes that are on the same rack in the datacenter as the node holding the relevant data block, since inter-rack bandwidth is significantly less than intra-rack bandwidth.

Synchronization. In general, synchronization refers to the mechanisms by which multiple concurrently running processes join up, for example, to share intermediate results or otherwise exchange state information. In MapReduce, synchronization is accomplished by a barrier between the map and reduce phases of processing. Intermediate key-value pairs must be grouped by key, which is accomplished by a large distributed sort involving all the nodes that executed map tasks and all the nodes that will execute reduce tasks. This necessarily involves copying intermediate data over the network, and therefore the process is commonly known as "shuffle and sort". A MapReduce job with m mappers and r reducers involves up to m × r distinct copy operations, since each mapper may have intermediate output going to every reducer. Note that the reduce computation cannot start until all the mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted, since the execution framework cannot otherwise guarantee that all values associated with the same key have been gathered.

Error and fault handling. The MapReduce execution framework must accomplish all the tasks above in an environment where errors and faults are the norm, not the exception. Since MapReduce was explicitly designed around low-end commodity servers, the runtime must be especially resilient. In large clusters, disk failures are common and RAM experiences more errors than one might expect. Datacenters suffer from both planned outages (e.g., system maintenance and hardware upgrades) and unexpected outages (e.g., power failure, connectivity loss, etc.). And that's just hardware. No software is bug free: exceptions must be appropriately trapped, logged, and recovered from. Large-data problems have a penchant for uncovering obscure corner cases in code that is otherwise thought to be bug-free. Furthermore, any sufficiently large dataset will contain corrupted data or records that are mangled beyond a programmer's imagination, resulting in errors that one would never think to check for or trap. The MapReduce execution framework must thrive in this hostile environment.

3. What is MapReduce?

MapReduce is a programming model for data-intensive applications proposed by Google. MapReduce is used by Google and Yahoo to power their web search. MapReduce was first described in a research paper from Google, and more than ten thousand distinct programs have been implemented using MapReduce at Google. MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. It works like a Unix pipeline:

cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output

One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. Therefore, a developer can focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity).

MapReduce borrows ideas from functional programming: the programmer defines Map and Reduce tasks to process large sets of distributed data. The key strengths of the MapReduce programming model are the high degree of parallelism combined with the simplicity of the programming model and its applicability to a large variety of application domains. This requires dividing the workload across a large number of machines. The degree of parallelism depends on the input data size. The map function processes the input pairs (key1, value1), returning some other intermediary pairs (key2, value2). The intermediary pairs are then grouped together according to their key, and each group is processed by the reduce function, which outputs new pairs of the form (key3, value3). The approach assumes that there are no dependencies between the input data, which makes the problem easy to parallelize. The number of parallel reduce tasks is limited by the number of distinct "key" values emitted by the map function.

MapReduce usually also incorporates a framework which supports MapReduce operations. A master controls the whole MapReduce process. The MapReduce framework is responsible for load balancing, re-issuing tasks if a worker has failed or is too slow, and so on. The master divides the input data into separate units, sends individual chunks of data to the mapper machines, and collects the information once a mapper is finished. When the mappers are finished, the reducer machines are assigned work. All key/value pairs with the same key will be sent to the same reducer.

Fig 1. MapReduce Computational Model

MapReduce can refer to three distinct but related concepts. First, MapReduce is a programming model, which is the sense discussed above. Second, MapReduce can refer to the execution framework (i.e., the runtime) that coordinates the execution of programs written in this particular style. Finally, MapReduce can refer to the software implementation of the programming model and the execution framework: for example, Google's proprietary implementation vs. the open-source Hadoop implementation in Java.

Part of the design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent node ids and values may contain the adjacency lists of those nodes. In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

When we start a map/reduce workflow, the framework will split the input into segments, passing each segment to a different machine. Each machine then runs the mapper on the portion of data attributed to it. The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. Implicit between the map and reduce phases is a distributed "group by" operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers.

The diagram below illustrates the overall MapReduce word count process.

Fig 2. The overall MapReduce word count process

A simple word count algorithm in MapReduce is shown in Figure 3. This algorithm counts the number of occurrences of every word in a text collection. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we've seen the word once).

Fig 3. Pseudo-code for the word count algorithm in MapReduce

The MapReduce execution framework guarantees that all values associated with the same key are brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted in alphabetical order, and each file will contain roughly the same number of words.
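Since the pseudo-code of Figure 3 is not reproduced in this transcript, the following is a sketch of the same word count algorithm expressed in the Hadoop Java API (the newer org.apache.hadoop.mapreduce API, which later sections of this document also use); class names are illustrative, and each public class would normally live in its own source file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: sums the ones associated with each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The design mirrors the pseudo-code: the mapper imposes the key-value structure (word, 1) on raw text, and the framework's group-by-key step delivers all the ones for a word to a single reduce() call, where they are summed.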

Fig 4. Execution overview

Figure 4 shows the overall flow of a MapReduce operation in the original Google implementation. When the user program calls the MapReduce function, the following sequence of actions occurs (the numbered labels in Figure 4 correspond to the numbers in the list below):

1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.
2. One of the copies of the program is special: the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.

After successful completion, the output of the MapReduce execution is available in the R output files (one per reduce task, with file names as specified by the user). Typically, users do not need to combine these R output files into one file; they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
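The partitioning function mentioned in step 4 decides which of the R reducers receives each intermediate key. By default, Hadoop hashes the key and takes it modulo R, so all pairs with the same key land on the same reducer. The sketch below, written against the newer org.apache.hadoop.mapreduce API, shows what such a partitioner looks like; the class name is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to one of the R reduce partitions.
// This mirrors the default hash partitioning: same key -> same reducer.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take modulo R.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

A custom partitioner can be plugged into a job (in the newer API via the Job's setPartitionerClass() method), for example to route related keys to the same reducer.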

4. Hadoop implementation of MapReduce

Hadoop is an open-source, MapReduce-based framework from Apache for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides the Hadoop Distributed File System (HDFS), which stores data on the compute nodes, providing a very high aggregate bandwidth across the cluster. HDFS is the primary storage system used by Hadoop applications. The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

The name Hadoop commonly refers to the core of the platform, on top of which the other subprojects offer higher-level services. This core combines the storage framework and the processing framework: the Hadoop Distributed File System library, the MapReduce library, and a common core library, all working together. It is the original project that paved the way for the others. Those include HBase (a columnar database), Hive (a data warehousing and mining tool), Pig (scripting), and Chukwa (log analysis), all of which depend on the availability of the platform. There are also ZooKeeper (a coordination service), which is independent of Hadoop availability and used by HBase, and Avro (serialization/deserialization), designed to support the requirements of the main service components.

Figure 5. A global view on the framework's subprojects dependencies

Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which is translated by the runtime engine to MapReduce jobs) for querying the data.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
Mahout: A scalable machine learning and data mining library.

The Google File System (GFS) supports Google's proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. Although MapReduce doesn't strictly require a distributed file system, it is difficult to realize many of the advantages of the programming model without a storage substrate that behaves much like the DFS. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.

The distributed file system adopts a master-slave architecture in which the master maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions) and the slaves manage the actual data blocks. In GFS, the master is called the GFS master, and the slaves are called GFS chunkservers. In Hadoop, the same roles are filled by the namenode and datanodes, respectively. In HDFS, an application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. In response to the client request, the namenode returns the relevant block id and the location where the block is held (i.e., which datanode). The client then contacts the datanode to retrieve the data. Blocks are themselves stored on standard single-machine file systems, so HDFS lies on top of the standard OS stack (e.g., Linux). An important feature of the design is that data is never moved through the namenode. Instead, all data transfer occurs directly between clients and datanodes; communication with the namenode involves only the transfer of metadata. By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. To create a new file and write data to HDFS, the application client first contacts the namenode, which updates the file namespace after checking permissions and making sure the file doesn't already exist. The namenode allocates a new block on a suitable datanode, and the application is directed to stream data directly to it. From the initial datanode, data is further propagated to additional replicas.
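From the client's point of view, this read/write path is hidden behind the Hadoop FileSystem API; the namenode lookups and datanode streaming described above happen inside the calls below. This is a minimal sketch, with a hypothetical path, assuming the client's Configuration is already pointed at the cluster (otherwise it defaults to the local file system).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // The configuration tells the client where the namenode is.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write: the namenode allocates blocks; data streams directly to datanodes.
    Path path = new Path("/user/demo/hello.txt");   // hypothetical path
    try (FSDataOutputStream out = fs.create(path)) {
      out.writeBytes("hello HDFS\n");
    }

    // Read: the namenode returns block locations; the client reads from datanodes.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(path)))) {
      System.out.println(reader.readLine());
    }
  }
}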

The architecture of a complete Hadoop cluster is shown in Figure 6.

Figure 6. Architecture of a complete Hadoop cluster

The NameNode coordinates almost all read/write and access operations between clients and the DataNodes in the cluster; the DataNodes store, read, and write the information; and the BackupNode is in charge of accelerating some heavy operations such as boot-up and ensuring failover data recovery, among others. In MapReduce, the JobTracker coordinates the deployment of application tasks over the DataNodes, as well as summarizing their results, while the TaskTracker processes running on them receive these tasks and execute them.

There are some differences between the Hadoop implementation of MapReduce and Google's implementation. In Hadoop, the reducer is presented with a key and an iterator over all values associated with that particular key. The values are arbitrarily ordered. Google's implementation allows the programmer to specify a secondary sort key for ordering the values (if desired), in which case values associated with each key would be presented to the developer's reduce code in sorted order. Another difference: in Google's implementation the programmer is not allowed to change the key in the reducer. That is, the reducer output key must be exactly the same as the reducer input key. In Hadoop, there is no such restriction, and the reducer can emit an arbitrary number of output key-value pairs (with different keys).

In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split), and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework makes the final determination based on the physical layout of the data. The situation is similar for the reduce phase: a reducer object is initialized for each reduce task, and the Reduce method is called once per intermediate key. In contrast with the number of map tasks, the programmer can precisely specify the number of reduce tasks. The reducer in MapReduce receives all values associated with the same key at once. However, it is possible to start copying intermediate key-value pairs over the network to the nodes running the reducers as soon as each mapper finishes; this is a common optimization and is implemented in Hadoop.

A Hadoop MapReduce job is divided up into a number of map tasks and reduce tasks. Tasktrackers periodically send heartbeat messages to the jobtracker, and these heartbeats also double as a vehicle for task allocation. If a tasktracker is available to run tasks (in Hadoop parlance, has empty task slots), the return acknowledgment of the tasktracker heartbeat contains task allocation information. The number of reduce tasks is equal to the number of reducers specified by the programmer. The number of map tasks, on the other hand, depends on many factors: the number of mappers specified by the programmer serves as a hint to the execution framework, but the actual number of tasks depends on both the number of input files and the number of HDFS data blocks occupied by those files. Each map task is assigned a sequence of input key-value pairs, called an input split in Hadoop. Input splits are computed automatically, and the execution framework strives to align them to HDFS block boundaries so that each map task is associated with a single data block. In scheduling map tasks, the jobtracker tries to take advantage of data locality: if possible, map tasks are scheduled on the slave node that holds the input split, so that the mapper will be processing local data.

In Hadoop, mappers are Java objects with a Map method (among others). A mapper object is instantiated for every map task by the tasktracker. The life-cycle of this object begins with instantiation, where a hook is provided in the API to run programmer-specified code. This means that mappers can read in side data, providing an opportunity to load state, static data sources, dictionaries, etc. After initialization, the Map method is called (by the execution framework) on all key-value pairs in the input split. Since these method calls occur in the context of the same Java object, it is possible to preserve state across multiple input key-value pairs within the same map task. After all key-value pairs in the input split have been processed, the mapper object provides an opportunity to run programmer-specified termination code. The actual execution of reducers is similar to that of the mappers. Each reducer object is instantiated for every reduce task. The Hadoop API provides hooks for programmer-specified initialization and termination code. After initialization, for each intermediate key in the partition (defined by the partitioner), the execution framework repeatedly calls the Reduce method with an intermediate key and an iterator over all values associated with that key. The programming model also guarantees that intermediate keys will be presented to the Reduce method in sorted order. The process is fault tolerant: map or reduce tasks that do not complete (for example, due to data availability issues) are reattempted a number of times and then redistributed to other nodes.
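To make these life-cycle hooks concrete, here is a minimal sketch using the newer org.apache.hadoop.mapreduce API, where the initialization and termination hooks are called setup() and cleanup() (the older mapred API uses configure() and close() instead); the stop-word side data is made up purely for illustration.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Set<String> stopWords = new HashSet<>(); // state kept across calls
  private long recordsSeen = 0;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Initialization hook: load side data (dictionaries, static resources, etc.).
    // Hard-coded here; in practice it might come from the distributed cache or HDFS.
    stopWords.add("the");
    stopWords.add("a");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per key-value pair in the input split; instance fields preserve
    // state across calls within the same map task.
    recordsSeen++;
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty() && !stopWords.contains(token)) {
        context.write(new Text(token), new IntWritable(1));
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Termination hook: runs after all pairs in the split have been processed.
    System.err.println("Processed " + recordsSeen + " records in this map task");
  }
}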

Figure 7. Hadoop MapReduce lifecycle

5. Anatomy of a MapReduce Job Run

This section uncovers the steps Hadoop takes to run a job. The whole process is illustrated in Figure 8. At the highest level, there are four independent entities:
- The client, which submits the MapReduce job.
- The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
- The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
- The distributed filesystem, which is used for sharing job files between the other entities.

Figure 8. Anatomy of a MapReduce Job Run

5.1. Job Submission

The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it (step 1 in Figure 8). Having submitted the job, runJob() polls the job's progress once a second and reports the progress to the console if it has changed since the last report. When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.

The job submission process implemented by JobClient's submitJob() method does the following:
- Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
- Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
- Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
- Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
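From the application side, all of this is triggered by a few lines of driver code. The sketch below uses the classic org.apache.hadoop.mapred API (JobConf and JobClient, as described in this section) with the library-provided identity mapper and reducer, so that it stays self-contained; the job name and paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class SubmitSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitSketch.class); // the job JAR is located via this class
    conf.setJobName("identity job");                // placeholder name

    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);     // matches the default TextInputFormat keys
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(2);                      // the reduce-task count is taken exactly

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not already exist

    // runJob() creates a JobClient, calls submitJob(), and polls progress (steps 1-4 above).
    JobClient.runJob(conf);
  }
}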

5.2. Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5). To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.

5.3. Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7). Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from. Having chosen a job, the jobtracker now chooses a task for the job. Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously. (The precise number depends on the number of cores and the amount of memory on the tasktracker.) The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task. To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations. For a map task, however, it takes account of the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are running on.

5.4. Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed from the distributed cache by the application to the local disk. Second, it creates a local working directory for the task and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task. TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). The child process communicates with its parent through the umbilical interface. This way it informs the parent of the task's progress every few seconds until the task is complete.

5.5. Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a significant length of time, it's important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code). When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle. For example, if the task has run the reducer on half its input, then the task's progress is 5/6, since it has completed the copy and sort phases (1/3 each) and is halfway through the reduce phase (1/6). If a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every three seconds, and if set, it notifies the tasktracker of the current task status. Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a minimum, as the heartbeat interval is actually dependent on the size of the cluster: for larger clusters, the interval is longer), and the status of all the tasks being run by the tasktracker is sent in the call. The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks. Finally, as mentioned earlier, the JobClient receives the latest status by polling the jobtracker every second. Clients can also use JobClient's getJob() method to obtain a RunningJob instance, which contains all of the status information for the job.

5.6. Job Completion

When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to "successful". Then, when the JobClient polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the runJob() method. The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property. Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).

6. Shuffle and Sort in Hadoop

MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is known as the shuffle. The shuffle is an area of the codebase where refinements and improvements are continually being made. In many ways, the shuffle is the heart of MapReduce and is where the "magic" happens. The following figure illustrates the shuffle and sort phase:

Figure 9. Shuffle and sort in MapReduce

Map side:
- Map outputs are buffered in memory in a circular buffer.
- When the buffer reaches a threshold, its contents are spilled to disk.
- Spills are merged into a single, partitioned file (sorted within each partition); the combiner runs here.

Reduce side:
- First, map outputs are copied over to the reducer machine.
- The sort is a multi-pass merge of map outputs (happening in memory and on disk); the combiner runs here as well.
- The final merge pass feeds directly into the reducer.
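Because the combiner is an optional, user-supplied class, it has to be wired into the job explicitly. A minimal sketch follows, reusing the word count mapper and reducer sketched in Section 3 (the reducer is associative, so it can double as the combiner); it uses the Hadoop 2.x style Job.getInstance(), which older releases replace with new Job(conf).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);

    job.setMapperClass(WordCountMapper.class);     // from the earlier sketch
    job.setCombinerClass(WordCountReducer.class);  // runs on map-side spills and merges
    job.setReducerClass(WordCountReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running the combiner during the spill and merge steps shrinks the intermediate data that has to be copied across the network during the shuffle, which is exactly where the design trades a little map-side CPU for a lot of saved bandwidth.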

7. MapReduce example: Weather Dataset

The task: create a program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data (source: NCDC). The data is stored using a line-oriented ASCII format in which each line is a record. The mission is to calculate the maximum temperature each year around the world; the problem is that there are millions of temperature measurement records.

Figure 10. NCDC raw data

For our example, we will write a program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semistructured and record-oriented. The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature, which are always present and are of fixed width. The figure below shows a sample line with some of the salient fields highlighted. The line has been split into multiple lines to show each field; in the real file, fields are packed into one line with no delimiters.

Figure 11. NCDC raw data

Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. The whole dataset is made up of a large number of relatively small files, since there are tens of thousands of weather stations. The data was preprocessed so that each year's readings were concatenated into a single file. MapReduce works by breaking the processing into two phases: the map and the reduce. Both map and reduce phases have key-value pairs as input and output. Programmers have to specify two functions: the map and reduce functions. The input to the map phase is the raw NCDC data. Here, the key is the offset of the beginning of the line and the value is the line itself. The map function pulls out the year and the air temperature from each input value. The reduce function takes these pairs as input and produces the maximum temperature for each year as the result. To visualize the way the map works, consider the following sample lines of input data. Original NCDC Format

Input file for the map function, stored in HDFS

Output of the map function, running in parallel for each block

The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees each year paired with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick the maximum reading. This is the final output: the maximum global temperature recorded in each year.

The whole data flow

Start the local Hadoop cluster

Open five Cygwin windows and arrange them in a similar fashion to the figure below.

1. Start the namenode in the first window by executing:
cd hadoop-0.19.1
bin/hadoop namenode

2. Start the secondary namenode in the second window by executing:
cd hadoop-0.19.1
bin/hadoop secondarynamenode

3. Start the job tracker in the third window by executing:
cd hadoop-0.19.1
bin/hadoop jobtracker

4. Start the data node in the fourth window by executing:
cd hadoop-0.19.1
bin/hadoop datanode

5. Start the task tracker in the fifth window by executing:
cd hadoop-0.19.1
bin/hadoop tasktracker

Figure 12. Start the local hadoop cluster

Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Figure 13 shows the implementation of our map method.

Fig 13. Mapper for maximum temperature example

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer). The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in. The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.
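Since Figure 13 itself is not reproduced in this transcript, the following is a sketch of a mapper along the lines just described; the NCDC column offsets and the quality-code check are assumptions based on the common form of this example, not taken from the figure.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;  // assumed NCDC sentinel for a missing reading

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);          // assumed year columns
    int airTemperature;
    if (line.charAt(87) == '+') {                  // Integer.parseInt dislikes a leading '+'
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}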

The reduce function is similarly defined using a Reducer, as illustrated in Figure 14.

Fig 14. Reducer for maximum temperature example

Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. In this case, the output types of the reduce function are also Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.

The third piece of code runs the MapReduce job (see Figure 15). A Job object forms the specification of the job and gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class to the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR containing this class. Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths. The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job; this precaution is to prevent data loss. Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same, as they are in our case. If they are different, the map output types can be set using the setMapOutputKeyClass() and setMapOutputValueClass() methods. The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat. After setting the classes that define the map and reduce functions, we are ready to run the job. The waitForCompletion() method on Job submits the job and waits for it to finish. The method's boolean argument is a verbose flag, so in this case the job writes information about its progress to the console.
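Again, since Figures 14 and 15 are not reproduced here, the following is a sketch of a reducer and driver matching the description above. It assumes the MaxTemperatureMapper sketched earlier, and it uses the Hadoop 2.x style Job.getInstance(), which older releases replace with new Job(conf); in a real project each class would live in its own source file.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reducer: keeps a running maximum of the temperatures seen for each year.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

// Driver: builds the job specification and submits it.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "Max temperature");
    job.setJarByClass(MaxTemperature.class);        // locate the job JAR via this class

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Submit the job and wait; 'true' prints progress to the console.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}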

Fig 15. Application to find the maximum temperature in the weather dataset

A test run: the output from running the job provides some useful information. For example, we can see that the job was given an ID of job_local_0009, and that it ran one map task and one reduce task. Knowing the job and task IDs can be very useful when debugging MapReduce jobs.

Output in HDFS

Map Reduce Chart
