Hadoop Fundamentals
Dr. Kirti Wankhede, IT Dept., SIMSR
Data!
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage
• The New York Stock Exchange generates about one terabyte of new trade data per day
• On a recent one-week tour, I personally took 15 GB of photos while travelling. So imagine the storage required for all the photos taken in a single day all over the world!
Data Explosion
• relational
• text
• audio
• video
• images
Big Data (globally)
– creates over 30 billion pieces of content per day
– stores 30 petabytes of data
– produces over 90 million tweets per day
Big Data
– logs over 300 gigabytes of transactions per day
– stores more than 1.5 terabytes of aggregated data
Big Data
• International Data Corporation (IDC) estimates the data created in 2010 at 1.2 ZETTABYTES (1.2 trillion gigabytes)
• Companies continue to generate large amounts of data; here are some Jan 2014 stats:
– Facebook ~ 6 billion messages every 20 minutes
– eBay ~ 2 billion page views a day, ~ 80 petabytes of storage
– Satellite images by Skybox Imaging ~ 1 terabyte per day
Big Data Challenges
Sorting 10 TB takes:
– on 1 node: 2.5 days
– on a 100-node cluster: 35 mins
Big Data Challenges
• "Fat" servers imply high cost
– use cheap commodity nodes instead
• A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines
Hadoop
• Existing tools were not designed to handle such large amounts of data
• "The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing." - http://hadoop.apache.org (4 September, 2007)
– Process Big Data on clusters of commodity hardware
– Vibrant open-source community
– Many products and tools reside on top of Hadoop
Use Commodity Hardware
• "Cheap" commodity server hardware
– No need for super-computers; use commodity, unreliable hardware
– Not desktops
Data Storage
• Storage capacity has grown exponentially but read speed has not kept up
– 1990:
• Store 1,400 MB
• Transfer speed of 4.5 MB/s
• Read the entire drive in ~ 5 minutes
– 2010:
• Store 1 TB
• Transfer speed of 100 MB/s
• Read the entire drive in ~ 3 hours
• Hadoop - 100 drives working at the same time can read 1TB of data in 2 minutes
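The arithmetic behind these figures can be checked with a short sketch (plain Java, illustrative numbers only, assuming an idealized drive with no seek overhead):

```java
public class ReadTime {
    // Seconds to read totalMB of data at mbPerSec per drive,
    // with the data split evenly across `drives` drives read in parallel.
    public static double readSeconds(double totalMB, double mbPerSec, int drives) {
        return totalMB / (mbPerSec * drives);
    }

    public static void main(String[] args) {
        double oneDrive = readSeconds(1_000_000, 100, 1);    // 1 TB on one 100 MB/s drive
        double hundred  = readSeconds(1_000_000, 100, 100);  // same 1 TB across 100 drives
        System.out.printf("1 drive: ~%.1f hours%n", oneDrive / 3600);   // ~2.8 hours
        System.out.printf("100 drives: ~%.1f minutes%n", hundred / 60); // ~1.7 minutes
    }
}
```

This is the core idea behind HDFS striping data across many machines: parallel disks turn an hours-long scan into minutes.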
Why Use Hadoop?
• Cheaper - scales to petabytes or more
• Faster - parallel data processing
• Better - suited for particular types of Big Data problems
• A set of "cheap" commodity hardware
• Networked together
• Resides in the same location
– Set of servers in a set of racks in a data center
Hadoop Cluster
Hadoop System Principles
• Scale-Out rather than Scale-Up
• Bring code to data rather than data to code
• Deal with failures - they are common
• Abstract complexity of distributed and concurrent applications
Scale-Out Instead of Scale-Up
• It is harder and more expensive to scale-up
– Add additional resources to an existing node (CPU, RAM)
– New units must be purchased if required resources cannot be added
– Also known as scaling vertically
• Scale-Out
– Add more nodes/machines to an existing distributed application
– Software layer is designed for node additions or removal
– Hadoop takes this approach - a set of nodes are bonded together as a single distributed system
– Very easy to scale down as well
Code to Data
• Traditional data processing architecture
– nodes are broken up into separate processing and storage nodes connected by a high-capacity link
– Many data-intensive applications are not CPU demanding, causing bottlenecks in the network
• Hadoop co-locates processors and storage
– Code is moved to data (size is tiny, usually in KBs)
– Processors execute code and access underlying local storage
Failures are Common
• Given a large number of machines, failures are common
– Large warehouses may see machine failures weekly or even daily
• Hadoop is designed to cope with node failures
– Data is replicated
– Tasks are retried
Abstract Complexity
• Hadoop abstracts many complexities in distributed and concurrent applications
– Defines a small number of components
– Provides simple and well-defined interfaces of interaction between these components
• Frees developers from worrying about system-level challenges
– race conditions, data starvation
– processing pipelines, data partitioning, code distribution, etc.
• Allows developers to focus on application development and business logic
History of Hadoop
• Started as a sub-project of Apache Nutch
– Nutch's job is to index the web and expose it for searching
– Open-source alternative to Google
– Started by Doug Cutting
• In 2004 Google published its Google File System (GFS) and MapReduce framework papers
• Doug Cutting and the Nutch team implemented Google's frameworks in Nutch
• In 2006 Yahoo! hired Doug Cutting to work on Hadoop with a dedicated team
• In 2008 Hadoop became an Apache Top Level Project
– http://hadoop.apache.org
Naming Conventions?
• Doug Cutting drew inspiration from his family
– Lucene: Doug's wife's middle name
– Nutch: a word for "meal" that his son used as a toddler
– Hadoop: yellow stuffed elephant named by his son
Comparisons to RDBMS
• Until recently many applications utilized Relational Database Management Systems (RDBMS) for batch processing
– Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
– Hadoop doesn't fully replace relational products; many architectures would benefit from both Hadoop and a relational product(s)
• Scale-Out vs. Scale-Up
– RDBMS products scale up
• Expensive to scale for larger installations
• Hits a ceiling when storage reaches 100s of terabytes
– Hadoop clusters can scale-out to 100s of machines and to petabytes of storage
Comparisons to RDBMS (Continued)
• Structured Relational vs. Semi-Structured vs. Unstructured
– RDBMS works well for structured data - tables that conform to a predefined schema
– Hadoop works best on semi-structured and unstructured data
• Semi-structured data may have a schema that is loosely followed
• Unstructured data has no structure whatsoever and is usually just blocks of text (or, for example, images)
• At processing time, types for keys and values are chosen by the implementer
– Certain types of input data will not easily fit into a relational schema, such as images, JSON, XML, etc.
Comparison to RDBMS
• Offline batch vs. online transactions
– Hadoop was not designed for real-time or low-latency queries
– Products that do provide low-latency queries, such as HBase, have limited query functionality
– Hadoop performs best for offline batch processing on large amounts of data
– RDBMS is best for online transactions and low-latency queries
– Hadoop is designed to stream large files and large amounts of data
– RDBMS works best with small records
Comparison to RDBMS
• Hadoop and RDBMS frequently complement each other within an architecture
• For example, a website that
– has a small number of users– produces a large amount of audit logs
1. Utilize RDBMS to provide rich User Interface and enforce data integrity
2. RDBMS generates large amounts of audit logs; the logs are moved periodically to the Hadoop cluster
3. All logs are kept in Hadoop; Various analytics are executed periodically
4. Results copied to RDBMS to be used by Web Server; for example "suggestions" based on audit history
Comparing: RDBMS vs. Hadoop
Feature             | Traditional RDBMS       | Hadoop / MapReduce
Data Size           | Gigabytes (Terabytes)   | Petabytes (Exabytes)
Access              | Interactive and batch   | Batch - NOT interactive
Updates             | Read / write many times | Write once, read many times
Structure           | Static schema           | Dynamic schema
Integrity           | High (ACID)             | Low
Scaling             | Nonlinear               | Linear
Query response time | Can be near immediate   | Has latency (due to batch processing)
Common Hadoop Distributions
• Open Source
– Apache
• Commercial
– Cloudera
– Hortonworks
– MapR
– AWS MapReduce
– Microsoft HDInsight (Beta)
What types of business problems for Hadoop?
• Social Media Data: Win customers' hearts. With Hadoop, you can mine Twitter, Facebook and other social media conversations for sentiment data about you and your competition, and use it to make targeted, real-time decisions that increase market share.
• Web Clickstream Data: Show them the way. Hadoop makes it easier to analyze, visualize and ultimately change how visitors behave on your website.
• Server Log Data: Fortify security and compliance. Security breaches happen. And when they do, your server logs may be your best line of defense. Hadoop takes server-log analysis to the next level by speeding and improving security forensics and providing a low-cost platform to show compliance.
What types of business problems for Hadoop?
• Machine and Sensor Data: Gain insight from your equipment. Your machines know things. From out in the field to the assembly line floor, machines stream low-cost, always-on data. Hadoop makes it easier for you to store and refine that data and identify meaningful patterns, providing you with the insight to make proactive business decisions.
• Geolocation Data: Profit from predictive analytics. Geolocation data is plentiful, and that's part of the challenge. The costs to store and process voluminous amounts of data often outweigh the benefits. Hadoop helps reduce data storage costs while providing value-driven intelligence, from asset tracking to predicting behavior to enable optimization.
What types of business problems for Hadoop?
Risk Modeling
Customer Churn Analysis
Recommendation Engine
Ad Targeting
Point of Sale Transaction Analysis
Threat Analysis
Trade Surveillance
Search Quality
Data Sandbox
Source: http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf
Companies Using Hadoop
• Facebook
• Yahoo
• Amazon
• eBay
• American Airlines
• The New York Times
• Federal Reserve Board
• IBM
• Orbitz
• A9.com – Amazon: To build Amazon's product search indices; process millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
• Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for Ad Systems and Web Search
• AOL : Used for a variety of things ranging from statistics generation to running advanced algorithms for doing behavioral analysis and targeting; cluster size is 50 machines, Intel Xeon, dual processors, dual core, each with 16GB Ram and 800 GB hard-disk giving us a total of 37 TB HDFS capacity.
• Facebook: To store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning; 320 machine cluster with 2,560 cores and about 1.3 PB raw storage;
• FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log analysis, data mining and machine learning
• University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store and serve physics data;
Example Applications and Organizations using Hadoop
More Hadoop Applications
• Adknowledge - to build the recommender system for behavioral targeting, plus other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
• Contextweb - to store ad serving logs and use them as a source for ad optimizations/analytics/reporting/machine learning; 23 machine cluster with 184 cores and about 35TB raw storage. Each (commodity) node has 8 cores, 8GB RAM and 1.7 TB of storage.
• Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4GHz Xeon Processor, 2 GB RAM, 72GB Hard Drive)
• NetSeer - Up to 1000 instances on Amazon EC2 ; Data storage in Amazon S3; Used for crawling, processing, serving and log analysis
• The New York Times: large-scale image conversions; EC2 to run Hadoop on a large virtual cluster
• Powerset / Microsoft - Natural Language Search; up to 400 instances on Amazon EC2 ; data storage in Amazon S3
Hadoop Ecosystem
• At first Hadoop was mainly known for two core products:
– HDFS: Hadoop Distributed FileSystem
– MapReduce: distributed data processing framework
• Today, in addition to HDFS and MapReduce, the term also represents a multitude of products:
– HBase: Hadoop column database; supports batch and random reads and limited queries
– Zookeeper: highly-available coordination service
– Oozie: Hadoop workflow scheduler and manager
– Pig: data processing language and execution environment
– Hive: data warehouse with SQL interface
Supported Operating Systems
• Each distribution will support its own list of Operating Systems (OS)
• Common OS supported
– Red Hat Enterprise Linux
– CentOS
– Oracle Linux
– Ubuntu
– SUSE Linux Enterprise Server
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more…
• Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable
• Processing (MapReduce): parallelized (scalable) processing; fault tolerant
• Other tools / frameworks
– Data access: HBase, Hive, Pig, Mahout
– Tools: Hue, Sqoop
– Monitoring: Greenplum, Cloudera
• Layers (bottom to top): Hadoop Core - HDFS, MapReduce API, Data Access, Tools & Libraries, Monitoring & Alerting
What are the core parts of a Hadoop distribution?
• HDFS storage
– Redundant (3 copies)
– For large files - large blocks, 64 or 128 MB per block
– Can scale to 1000s of nodes
• MapReduce API
– Batch (job) processing
– Distributed and localized to clusters (Map)
– Auto-parallelizable for huge amounts of data
– Fault-tolerant (auto retries)
– Adds high availability and more
• Other libraries
– Pig, Hive, HBase, others
Typical Hadoop Cluster
• 40 nodes/rack
• 1000-4000 nodes in a cluster, comprising machines in various roles (masters, slaves, and clients)
• 1 Gbps bandwidth within a rack, 8 Gbps out of rack
• Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB)
HDFS - Hadoop Distributed File System
• Very large distributed file system
– 10K nodes, 100 million files, 10 PB
• Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Optimized for batch processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth
• User space, runs on heterogeneous OS
Distributed File System
• Data coherency
– Write-once-read-many access model
– Client can only append to existing files
• Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
• Intelligent client
– Client can find location of blocks
– Client accesses data directly from DataNode
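The block arithmetic implied above can be sketched in a few lines (plain Java; block size and replication factor are the configurable defaults, not fixed values):

```java
public class BlockMath {
    // How many blocks a file occupies, using ceiling division
    public static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // How many block copies the cluster stores once replication is applied
    public static long storedBlocks(long fileBytes, long blockBytes, int replication) {
        return numBlocks(fileBytes, blockBytes) * replication;
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        System.out.println(numBlocks(1024 * mb, 128 * mb));       // 8 blocks for a 1 GB file
        System.out.println(storedBlocks(1024 * mb, 128 * mb, 3)); // 24 block copies on disk
    }
}
```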
Hadoop Core
MapReduce (Job Scheduling / Execution System)
Hadoop Distributed File System (HDFS)
Hadoop Core (HDFS)
• Name Node stores file metadata
• files split into 64 MB blocks
• blocks replicated across 3 Data Nodes
Hadoop Core (HDFS)
Hadoop Core (HDFS)
• Datanode - where HDFS actually stores the data, there are usually quite a few of these.
• Namenode - the 'master' machine. It controls all the metadata for the cluster, e.g., which blocks make up a file, and which datanodes those blocks are stored on.
• Secondary Namenode - this is NOT a backup namenode, but is a separate service that keeps a copy of both the edit logs, and filesystem image, merging them periodically to keep the size reasonable.
– this is soon being deprecated in favor of the backup node and the checkpoint node, but the functionality remains similar (if not the same)
• Data can be accessed using either the Java API, or the Hadoop command line client.
Hadoop Core (HDFS)
• The default replication value is 3, i.e. data is stored on three nodes: two on the same rack, and one on a different rack.
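As a toy illustration of that placement rule (this is not HDFS's actual placement code, and the rack names are hypothetical):

```java
import java.util.HashSet;
import java.util.List;

public class ReplicaPlacement {
    // Sketch of the default layout: one replica on the writer's rack,
    // the other two on (different nodes of) a single remote rack.
    public static List<String> placeReplicas(String localRack, String remoteRack) {
        return List.of(localRack, remoteRack, remoteRack);
    }

    public static void main(String[] args) {
        List<String> placement = placeReplicas("rack-A", "rack-B");
        System.out.println(placement); // [rack-A, rack-B, rack-B]
        // Two copies share a rack; losing an entire rack still leaves a copy
        System.out.println(new HashSet<>(placement).size()); // 2 distinct racks
    }
}
```

The design trades off write bandwidth (fewer cross-rack transfers) against fault tolerance (surviving a whole-rack failure).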
MapReduce
• Simple data-parallel programming model designed for scalability and fault-tolerance
• Framework for distributed processing of large data sets
• Originally designed by Google
• Pluggable user code runs in a generic framework
• Pioneered by Google - processes 20 petabytes of data per day
Hadoop Core (MapReduce)
• Job Tracker distributes tasks and handles failures
• tasks are assigned based on data locality
• Task Trackers can execute multiple tasks
MapReduce Paradigm
• Programming model developed at Google
• Sort/merge based distributed computing
• Initially intended for their internal search/indexing application, but now used extensively by more organizations (e.g., Yahoo, Amazon.com, IBM, etc.)
• It is functional-style programming (e.g., LISP) that is naturally parallelizable across a large cluster of workstations or PCs
• The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing required inter-machine communication (this is the key to Hadoop's success)
Understanding Hadoop Clusters and the Network
MapReduce Example - WordCount
What is MapReduce used for?
• At Google:– Index construction for Google Search– Article clustering for Google News– Statistical machine translation
• At Yahoo!:– “Web map” powering Yahoo! Search– Spam detection for Yahoo! Mail
• At Facebook:– Data mining– Ad optimization– Spam detection
What is MapReduce used for?
• In research:– Astronomical image analysis (Washington)– Bioinformatics (Maryland)– Analyzing Wikipedia conflicts (PARC)– Natural language processing (CMU) – Particle physics (Nebraska)– Ocean climate simulation (Washington)
Limitations of MapReduce
• Batch processing, not interactive
• Designed for a specific problem domain
• MapReduce programming paradigm not commonly understood (functional)
• Lack of trained support professionals
• API / security model are moving targets
Basics of Map Reduce Programming
Dr. Kirti Wankhede, IT Dept., SIMSR
Traditional way
Map-Reduce way
Map Reduce Workflow
• Application writer specifies
– A pair of functions called Map and Reduce and a set of input files, and submits the job
• Workflow
– Input phase generates a number of FileSplits from input files (one per Map task)
– The Map phase executes a user function to transform input key/value pairs into a new set of key/value pairs
– The framework sorts & shuffles the key/value pairs to output nodes
– The Reduce phase combines all key/value pairs with the same key into new key/value pairs
– The output phase writes the resulting pairs to files
• All phases are distributed, with many tasks doing the work
– Framework handles scheduling of tasks on the cluster
– Framework handles recovery when a node fails
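The phases above can be simulated in plain Java (a single-process sketch of map, shuffle/sort, and reduce for word counting; no Hadoop dependency, so the distribution and fault-tolerance parts are intentionally absent):

```java
import java.util.*;
import java.util.stream.*;

public class MiniMapReduce {
    // Map phase: each input line becomes a list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all values that share a key
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> run(List<String> input) {
        // Shuffle & sort: group map outputs by key (a TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> kv : map(line))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```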
MapReduce Framework
• The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
• The responsibility of the user is to implement mappers and reducers - classes that extend Hadoop-provided base classes to solve a specific problem. As shown in the figure, a mapper takes input in the form of key/value pairs (k1, v1) and transforms them into another key/value pair (k2, v2). The MapReduce framework sorts a mapper's output key/value pairs and combines each unique key with all its values (k2, {v2, v2, …}). These key/value combinations are delivered to reducers, which translate them into yet another key/value pair (k3, v3).
Why key/value data?
• We need this framework to provide us with a way of expressing our problems that doesn't require us to be an expert in the execution mechanics; we want to express the transformations required on our data and then let the framework do the rest. MapReduce, with its key/value interface, provides such a level of abstraction, whereby the programmer only has to specify these transformations and Hadoop handles the complex process of applying this to arbitrarily large data sets.
• Some real-world examples of data that is key/value pairs:
– An address book relates a name (key) to contact information (value)
– A bank account uses an account number (key) to associate with the account details (value)
– The index of a book relates a word (key) to the pages on which it occurs (value)
– On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values)
Important features of key/value data
• Keys must be unique, but values need not be
• Each value must be associated with a key, but a key could have no values
• Careful definition of the key is important; deciding whether or not the counts are applied with case sensitivity will give different results
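The case-sensitivity point can be seen in a small sketch (plain Java; the word list is hypothetical):

```java
import java.util.*;

public class KeyChoice {
    // Count occurrences of each word, optionally folding case before
    // the word is used as a key
    public static Map<String, Integer> count(List<String> words, boolean caseSensitive) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            String key = caseSensitive ? w : w.toLowerCase();
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("Hadoop", "hadoop", "HDFS");
        System.out.println(count(words, true));  // {HDFS=1, Hadoop=1, hadoop=1}
        System.out.println(count(words, false)); // {hadoop=2, hdfs=1}
    }
}
```

Same data, two different key definitions, two different results: that is why the key definition deserves care.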
MapReduce framework
• The responsibility of the MapReduce framework (based on the user-supplied code) is to provide the overall coordination of execution. This includes
– choosing appropriate machines (nodes) for running mappers
– starting and monitoring the mapper's execution
– choosing appropriate locations for the reducer's execution
– sorting and shuffling the output of mappers and delivering it to reducer nodes
– starting and monitoring the reducer's execution
MapReduce Execution Pipeline
➤ Driver - This is the main program that initializes a MapReduce job. It defines job-specific configuration, and specifies all of its components (including input and output formats, mapper and reducer, use of a combiner, use of a custom partitioner, and so on). The driver can also get back the status of the job execution.
➤ Context - The driver, mappers, and reducers are executed in different processes, typically on multiple machines. A context object is available at any point of MapReduce execution. It provides a convenient mechanism for exchanging required system and job-wide information. Keep in mind that context coordination happens only when an appropriate phase (driver, map, reduce) of a MapReduce job starts. This means that, for example, values set by one mapper are not available in another mapper (even if another mapper starts after the first one completes), but are available in any reducer.
➤ Input data - This is where the data for a MapReduce task is initially stored. This data can reside in HDFS, HBase, or other storage. Typically, the input data is very large - tens of gigabytes or more.
➤ InputFormat - This defines how input data is read and split. InputFormat is a class that defines the InputSplits that break input data into tasks, and provides a factory for RecordReader objects that read the file. Several InputFormats are provided by Hadoop. InputFormat is invoked directly by a job's driver to decide (based on the InputSplits) the number and location of the map task execution.
➤ InputSplit - An InputSplit defines a unit of work for a single map task in a MapReduce program. A MapReduce program applied to a data set is made up of several (possibly several hundred) map tasks. The InputFormat (invoked directly by a job driver) defines the number of map tasks that make up the mapping phase. Each map task is given a single InputSplit to work on. After the InputSplits are calculated, the MapReduce framework starts the required number of mapper jobs in the desired locations.
➤ RecordReader - Although the InputSplit defines a data subset for a map task, it does not describe how to access the data. The RecordReader class actually reads the data from its source (inside a mapper task), converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method. The RecordReader class is defined by the InputFormat.
➤ Mapper - The mapper performs the user-defined work of the first phase of the MapReduce program. From the implementation point of view, a mapper implementation takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map execution. The map typically transforms the input pair into an output pair (k2, v2), which is used as an input for shuffle and sort. A new instance of a mapper is instantiated in a separate JVM instance for each map task that makes up part of the total job input. The individual mappers are intentionally not provided with a mechanism to communicate with one another in any way. This allows the reliability of each map task to be governed solely by the reliability of the local machine.
➤ Partition - A subset of the intermediate key space (k2, v2) produced by each individual mapper is assigned to each reducer. These subsets (or partitions) are the inputs to the reduce tasks. Each map task may emit key/value pairs to any partition. All values for the same key are always reduced together, regardless of which mapper they originated from. As a result, all of the map nodes must agree on which reducer will process the different pieces of the intermediate data. The Partitioner class determines which reducer a given key/value pair will go to. The default Partitioner computes a hash value for the key, and assigns the partition based on this result.
➤ Shuffle - Each node in a Hadoop cluster might execute several map tasks for a given job. Once at least one map function for a given node is completed, and the keys' space is partitioned, the run time begins moving the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
➤ Sort - Each reduce task is responsible for processing the values associated with several intermediate keys. The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop to form keys/values (k2, {v2, v2, …}) before they are presented to the reducer.
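The default hash-based partitioning described above can be sketched in plain Java (mirroring the usual hash-modulo arithmetic; this is an illustration, not Hadoop's source code):

```java
public class HashPartition {
    // Mask off the sign bit so the hash is non-negative, then take it
    // modulo the number of reducers
    public static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // The same key always maps to the same reducer, no matter which
        // mapper emitted it - which is exactly what the framework needs
        System.out.println(partitionFor("hadoop", 4) == partitionFor("hadoop", 4)); // true
    }
}
```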
➤ Reducer - A reducer is responsible for the execution of user-provided code for the second phase of job-specific work. For each key assigned to a given reducer, the reducer's reduce() method is called once. This method receives a key, along with an iterator over all the values associated with the key. The values associated with a key are returned by the iterator in an undefined order. The reducer typically transforms the input key/value pairs into output pairs (k3, v3).
➤ OutputFormat - The way that job output (job output can be produced by the reducer or the mapper, if a reducer is not present) is written is governed by the OutputFormat. The responsibility of the OutputFormat is to define a location of the output data and the RecordWriter used for storing the resulting data.
➤ RecordWriter - A RecordWriter defines how individual output records are written.
• The following are two optional components of MapReduce execution:
➤ Combiner - This is an optional processing step that can be used for optimizing MapReduce job execution. If present, a combiner runs after the mapper and before the reducer. An instance of the Combiner class runs in every map task and some reduce tasks. The Combiner receives all data emitted by mapper instances as input, and tries to combine values with the same key, thus reducing the keys' space, and decreasing the number of keys (not necessarily data) that must be sorted. The output from the Combiner is then sorted and sent to the reducers.
➤ Distributed cache - An additional facility often used in MapReduce jobs is a distributed cache. This is a facility that enables the sharing of data globally by all nodes on the cluster. The distributed cache can be a shared library to be accessed by each task, a global lookup file holding key/value pairs, jar files (or archives) containing executable code, and so on. The cache copies over the file(s) to the machines where the actual execution occurs, and makes them available for local usage.
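The combiner's effect on the volume of intermediate records can be sketched in plain Java (a single mapper's output, simulated without Hadoop):

```java
import java.util.*;

public class CombinerEffect {
    // Pre-aggregate one mapper's (word, 1) output locally, as a combiner would
    public static Map<String, Integer> combine(List<String> mapperKeys) {
        Map<String, Integer> combined = new HashMap<>();
        for (String k : mapperKeys) combined.merge(k, 1, Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = List.of("the", "the", "the", "fox", "the");
        System.out.println(emitted.size());          // 5 records shipped without a combiner
        System.out.println(combine(emitted).size()); // 2 records after combining
    }
}
```

For an associative, commutative reduce function like summing, the final counts are unchanged, but far fewer records cross the network to the reducers.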
MapReduce Job
MapReduce Program Structure
class MapReduce {
    class Mapper ... { map ... }
    class Reducer ... { reduce ... }
    main() {
        JobConf conf = new JobConf("MR.Class");
        ...
    }
}
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Configuration of a Job
• JobConf object
– JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution.
– JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used
– Indicates the set of input files (setInputPaths(JobConf, Path...) / addInputPath(JobConf, Path) and setInputPaths(JobConf, String) / addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path))
Input Splitting
• An input split will normally be a contiguous group of records from a single input file
• If the number of requested map tasks is larger than the number of files, or the individual files are larger than the suggested fragment size, there may be multiple input splits constructed from each input file
• The user has considerable control over the number of input splits
Specifying Input Formats
• The Hadoop framework provides a large variety of input formats:
– KeyValueTextInputFormat: key/value pairs, one per line
– TextInputFormat: the key is the line number, and the value is the line
– NLineInputFormat: similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input
– MultiFileInputFormat: an abstract class that lets the user implement an input format that aggregates multiple files into one split
– SequenceFileInputFormat: the input file is a Hadoop sequence file, containing serialized key/value pairs
Setting the Output Parameters
• The framework requires that the output parameters be configured, even if the job will not produce any output.
• The framework will collect the output from the specified tasks and place them into the configured output directory.
A Simple Map Function: IdentityMapper
How Many Maps?
• The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
• The right level of parallelism for maps seems to be around 10-100 maps per node.
• It is best if each map takes at least a minute to execute.
• setNumMapTasks(int) can be used to hint at a higher number of map tasks.
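The one-map-per-block rule above can be sketched with plain arithmetic (my own illustration of the heuristic, not Hadoop source; the method name defaultMapCount is made up):

```java
public class MapCountSketch {
    // One map task per input block, as the slides describe.
    static long defaultMapCount(long totalInputBytes, long blockSizeBytes) {
        // ceiling division: a partial final block still needs a map task
        return (totalInputBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long block = 128L * 1024 * 1024;              // 128 MB block size
        System.out.println(defaultMapCount(tenTb, block)); // 81920 maps
    }
}
```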
A Simple Reduce Function: IdentityReducer
Configuring the Reduce Phase
• The user must supply the framework with five pieces of information:
– The number of reduce tasks; if zero, no reduce phase is run
– The class supplying the reduce method
– The input key and value types for the reduce task; by default, the same as the reduce output
– The output key and value types for the reduce task
– The output file type for the reduce task output
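A sketch of supplying those five pieces of information through the old JobConf API (MyReducer is a hypothetical class name):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;

public class ReducePhaseConfig {
    static void configure(JobConf conf) {
        conf.setNumReduceTasks(2);                    // number of reduce tasks; 0 skips the reduce phase
        conf.setReducerClass(MyReducer.class);        // the class supplying the reduce method (hypothetical)
        // the input key/value types for the reduce task default to the output types below
        conf.setOutputKeyClass(Text.class);           // output key type
        conf.setOutputValueClass(IntWritable.class);  // output value type
        conf.setOutputFormat(TextOutputFormat.class); // output file type
    }
}
```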
Reducer
• Reducer reduces a set of intermediate values which share a key to a smaller set of values.
• Reducer has 3 primary phases: shuffle, sort and reduce.
• Shuffle
– Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
• Sort
– The framework groups Reducer inputs by key in this stage (since different mappers may have emitted the same key)
– The shuffle and sort phases occur simultaneously; map outputs are merged as they are fetched.
How Many Reduces?
• The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
• With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish.
• With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
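The 0.95/1.75 heuristic above is simple arithmetic; a sketch (illustrative only, the class and method names are mine):

```java
public class ReduceCountSketch {
    // factor is 0.95 (one wave of reduces) or 1.75 (two waves, better load balancing)
    static int suggestedReducers(int nodes, int reduceSlotsPerNode, double factor) {
        return (int) (factor * nodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        // e.g. 100 nodes, mapred.tasktracker.reduce.tasks.maximum = 2
        System.out.println(suggestedReducers(100, 2, 0.95)); // 190
        System.out.println(suggestedReducers(100, 2, 1.75)); // 350
    }
}
```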
How Many Reduces?
• Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
• Reducer NONE
– It is legal to set the number of reduce tasks to zero if no reduction is desired.
– In this case the outputs of the map tasks go directly to the FileSystem, into the output path set by setOutputPath(Path).
– The framework does not sort the map outputs before writing them out to the FileSystem.
Reporter
• Reporter is a facility for Map/Reduce applications to report progress, set application-level status messages and update Counters.
• Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive.
JobTracker
• JobTracker is the central location for submitting and tracking MR jobs in a network environment.
• JobClient is the primary interface by which a user job interacts with the JobTracker
– Provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the Map/Reduce cluster's status information, and so on.
– JobClient.runJob(conf);
Job Submission and Monitoring
• The job submission process involves:
– Checking the input and output specifications of the job.
– Computing the InputSplit values for the job.
– Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
– Copying the job's jar and configuration to the Map/Reduce system directory on the FileSystem.
– Submitting the job to the JobTracker and optionally monitoring its status.
Outline
• MapReduce Types
• Input Formats
• Output Formats
MapReduce Types
• General form
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
– The map output types (K2, V2) must equal the reduce input types; the reduce output types (K3, V3) may differ from them.
• Java interface
– OutputCollector emits key-value pairs
– Reporter updates counters and status
MapReduce Types
• Combine function
– The same form as the reduce function, except for its output types
– Its output types are the same as the map output types
– Often the combine and reduce functions are the same
• Partition function
– Operates on the intermediate key and value types
– Returns the partition index
MapReduce Types
• Input types are set by the input format
– Ex) setInputFormat(TextInputFormat.class) generates keys of type LongWritable and values of type Text
• Other types are set explicitly by calling methods on the JobConf
– Ex) JobConf conf; conf.setMapOutputKeyClass(Text.class)
• Intermediate types are also used as the final output types by default
– If K2 and K3 are the same, only setOutputKeyClass() needs to be called
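A sketch of the calls above on a JobConf (illustrative; it assumes a job whose map emits Text/IntWritable pairs):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class TypeConfig {
    static void configure(JobConf conf) {
        conf.setInputFormat(TextInputFormat.class); // K1 = LongWritable, V1 = Text

        // intermediate (map output) types, K2 / V2
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        // final (reduce output) types, K3 / V3; if the map output types
        // are not set explicitly, these are used for them as well
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
    }
}
```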
Hadoop Data Types
• Java primitives map to Writable wrappers:
– int → IntWritable
– long → LongWritable
– boolean → BooleanWritable
– float → FloatWritable
– byte → ByteWritable
• We can also use the following built-in data types as keys and values:
– Text: stores UTF-8 text
– BytesWritable: stores a sequence of bytes
– VIntWritable and VLongWritable: store variable-length integer and long values
– NullWritable: a zero-length Writable type that can be used when you don't want to use a key or value type
MapReduce Types
• Why can't the types be determined from a combination of the mapper and the reducer?
– Type erasure: the Java compiler deletes generic types at compile time
• The configuration isn't checked at compile time
• It is possible to configure a MapReduce job with incompatible types
• Type conflicts are detected at runtime
– So the types must be given to Hadoop explicitly
The Default MapReduce Job
• Default input format
– TextInputFormat
• LongWritable (key)
• Text (value)
• setNumMapTasks(1)
– Does not set the number of map tasks to one
• 1 is only a hint
Choosing the Number of Reducers
• The optimal number of reducers is related to the total number of available reducer slots
– Total number of available reducer slots = total nodes * mapred.tasktracker.reduce.tasks.maximum
• Have slightly fewer reducers than total slots
– Tolerates a few failures without extending job execution time
Keys and values in Streaming
• A streaming application can control the separator
– Default separator: tab character
– Separators may be configured independently for maps and reduces
• The number of leading fields treated as the map output key is configurable
– Set the first n fields as the key via stream.num.map.output.key.fields
– Ex) Output is a,b,c (and the separator is a comma), n=2
• Key: a,b  Value: c
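The field-splitting rule in the example above can be mimicked in plain Java (my own sketch of the semantics of stream.num.map.output.key.fields, not Hadoop Streaming source):

```java
public class StreamingKeySketch {
    // Splits a streaming output line into {key, value}, where the key
    // is the first numKeyFields fields and the value is the remainder.
    static String[] keyValue(String line, String sep, int numKeyFields) {
        String[] fields = line.split(java.util.regex.Pattern.quote(sep), -1);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < numKeyFields && i < fields.length; i++) {
            if (i > 0) key.append(sep);
            key.append(fields[i]);
        }
        StringBuilder value = new StringBuilder();
        for (int i = numKeyFields; i < fields.length; i++) {
            if (i > numKeyFields) value.append(sep);
            value.append(fields[i]);
        }
        return new String[] { key.toString(), value.toString() };
    }

    public static void main(String[] args) {
        String[] kv = keyValue("a,b,c", ",", 2);
        System.out.println("Key: " + kv[0] + "  Value: " + kv[1]); // Key: a,b  Value: c
    }
}
```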
• MapReduce Types
• Input Formats– Input Splits and Records
– Text Input
– Binary Input
– Multiple Inputs
– Database Input(and Output)
• Output Formats
Input Formats
Input Splits and Records
• InputSplit (org.apache.hadoop.mapred package)
– A chunk of the input processed by a single map
– Each split is divided into records
– A split is just a reference to the data (it doesn't contain the input data itself)
• Ordering the splits
– The largest split is processed first (to minimize the job runtime)
Input Splits and Records - InputFormat
• InputFormat creates the input splits and divides them into records
• The numSplits argument of the getSplits() method is a hint
– InputFormat is free to return a different number of splits
• The client sends the calculated splits to the jobtracker
– The jobtracker schedules map tasks to process them on the tasktrackers
• RecordReader
– Iterates over records
– Used by the map task to generate record key-value pairs
Input Splits and Records - FileInputFormat
• The base class for all InputFormat implementations that use files as their data source
Input Splits and Records - FileInputFormat
• FileInputFormat offers four static methods for setting a JobConf's input paths
– addInputPath() and addInputPaths()
• Add a path or paths to the list of inputs
• Can be called repeatedly
– setInputPaths()
• Sets the entire list of paths in one call (replacing any paths set in previous calls)
• A path may represent
– A file
– A directory (all files in the directory are considered input)
• An error occurs if a subdirectory exists (avoided by using a glob or a filter)
– A collection of files and directories specified by a glob
Input Splits and Records - FileInputFormat
• Filters
– FileInputFormat applies a default filter
• Excludes hidden files
– Use the setInputPathFilter() method
• Acts in addition to the default filter
• Refer to page 61
Input Splits and Records – FileInputFormat input splits
• FileInputFormat splits only large files, i.e. those larger than an HDFS block
– Normally the split size is the size of an HDFS block
• It is possible to control the split size
– Setting the maximum split size smaller than the block size forces splits smaller than a block
Input Splits and Records – FileInputFormat input splits
• The split size calculation (the computeSplitSize() method):
max(minimumSize, min(maximumSize, blockSize))
• By default the split size is blockSize
• Control the split size via the minimum and maximum sizes
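The computeSplitSize() calculation the slides describe can be sketched as a one-liner (the surrounding class is my own illustration):

```java
public class SplitSizeSketch {
    // max(minSize, min(maxSize, blockSize)): the maximum size caps the
    // split, the minimum size floors it, and the block size is the default.
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024; // 64 MB HDFS block
        // default: no effective min/max, so the split size is the block size
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, block));
        // a maximum split size below the block size forces smaller splits
        System.out.println(computeSplitSize(1, 32L * 1024 * 1024, block));
    }
}
```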
Input Splits and Records – Small files and CombineFileInputFormat
• Hadoop works better with a small number of large files than a large number of small files
– FileInputFormat generates splits such that each split is all or part of a single file
– A lot of small input files means a lot of bookkeeping overhead
• Use CombineFileInputFormat to pack many files into each split
– Designed to work well with small files
– Takes node and rack locality into account when packing blocks into a split
– Worthwhile when you already have a large number of small files in HDFS
• Avoiding many small files is a good idea anyway
– Reduces the number of seeks
– Merge small files into larger files, for example by using a SequenceFile
Input Splits and Records – Preventing splitting
• Some applications don't want files to be split
– They want a single mapper to process each file in its entirety
• Two ways of ensuring an existing file is not split
– Set the minimum split size larger than the largest file size
– Override the isSplitable() method to return false
Input Splits and Records – Processing a whole file as a record
• WholeFileRecordReader
– Takes a FileSplit and converts it into a single record
Text Input - TextInputFormat
• TextInputFormat is the default InputFormat
– Key: the byte offset of the beginning of the line (LongWritable); not the line number
– Value: the contents of the line, excluding any line terminators (Text)
• Each split knows the size of the preceding splits
– A global file offset = the offset within the split + the size of the preceding splits
• Logical records do not usually fit neatly into HDFS blocks (a line may cross a block boundary)
Text Input - NLineInputFormat
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input
• To receive a fixed number of lines of input, use NLineInputFormat as the InputFormat
– N: the number of lines of input each mapper receives
– Control N via the mapred.line.input.format.linespermap property
• Inefficient if a map task takes only a small number of lines of input
– Due to task setup overhead
Text Input - XML
• Use the StreamXmlRecordReader class for XML
– In the org.apache.hadoop.streaming package
– Set stream.recordreader.class to org.apache.hadoop.streaming.StreamXmlRecordReader
Binary Input
• SequenceFileInputFormat
– Hadoop's sequence file format stores sequences of binary key-value pairs
• The data is splittable (sequence files have sync points)
• Use SequenceFileInputFormat to read it
• SequenceFileAsTextInputFormat
– Converts the sequence file's keys and values to Text objects
• Using the toString() method
• SequenceFileAsBinaryInputFormat
– Retrieves the sequence file's keys and values as opaque binary objects
Multiple Inputs
• Use MultipleInputs when you
– Have data sources that provide the same type of data but in different formats
• The sources need to be parsed differently
• Ex) one might be tab-separated plain text, the other a binary sequence file
– Want to use a different mapper per source
– The map outputs must have the same types
• Reducers are not aware of the different mappers
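A configuration sketch of the scenario above using the old-API MultipleInputs helper (the mapper class names and input paths are hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultiSourceConfig {
    static void configure(JobConf conf) {
        // tab-separated plain text source, parsed by one mapper...
        MultipleInputs.addInputPath(conf, new Path("text-input"),
                TextInputFormat.class, TextRecordMapper.class);
        // ...and a binary sequence file source, parsed by another;
        // both mappers must emit the same output key/value types
        MultipleInputs.addInputPath(conf, new Path("seq-input"),
                SequenceFileInputFormat.class, BinaryRecordMapper.class);
    }
}
```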
Outline
• MapReduce Types
• Input Formats
• Output Formats
– Text Output
– Binary Output
– Multiple Outputs
– Lazy Output
– Database Output
Output Formats
• The OutputFormat class hierarchy
Text Output
• TextOutputFormat (default)
– Writes records as lines of text
– Keys and values may be of any type
• toString() is called on them
– Each key-value pair is separated by a tab character
• The separator can be set via the mapred.textoutputformat.separator property
Binary Output
• SequenceFileOutputFormat
– Writes sequence files as its output
– Compact and readily compressed (useful as input to a further MapReduce job)
• SequenceFileAsBinaryOutputFormat
– Writes keys and values in raw binary format into a SequenceFile container
• MapFileOutputFormat
– Writes MapFiles as output
– The keys in a MapFile must be added in order
• The reducers must emit keys in sorted order (a requirement only for this format)
Multiple Output
• MultipleOutputFormat and MultipleOutputs
– Help to produce multiple files per reducer
• MultipleOutputFormat
– The names of the multiple files are derived from the output keys and values
– An abstract class with two concrete subclasses:
• MultipleTextOutputFormat
• MultipleSequenceFileOutputFormat
• MultipleOutputs
– Can emit different types for each output (unlike MultipleOutputFormat)
– Offers less control over the naming of outputs
Multiple Output
• Difference between MultipleOutputFormat and MultipleOutputs
– MultipleOutputs is more fully featured
– MultipleOutputFormat has more control over the output directory structure and file naming
Lazy Output
• LazyOutputFormat helps applications that don't want to create empty files
– FileOutputFormat subclasses create output files even if they are empty
• To use it
– Call its setOutputFormatClass() method with the JobConf and the underlying output format