Hadoop Introduction
Rob Hughes

Page 1

Hadoop Introduction
Rob Hughes

1

Page 2

Why the need for Hadoop?

• LOTS!!! of data causes some problems:

• In 1990 a typical hard drive could store 1,370 MB of data with a transfer speed of 4.4MB/s. So you could read all the data from a full drive in around five minutes.

• Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.

• Hard drive access speeds have not kept up with storage capacity. Stupid seek time!
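Checking the arithmetic in the bullets above: 1,370 MB ÷ 4.4 MB/s ≈ 311 seconds, a little over five minutes, while 1,000,000 MB ÷ 100 MB/s = 10,000 seconds, roughly 2.8 hours.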

2

Page 3

How to mitigate access times

• Partition data into more-or-less equal size chunks spread across separate drives. Process data by working in parallel.

• With a hundred drives each holding a hundredth of the data (about 10 GB per drive, read at 100 MB/s in roughly 100 seconds), working in parallel you could read an entire terabyte of data in under 2 minutes.

• But this creates other problems…

3

Page 4

More Problems

• Increasing the amount of hardware increases the likelihood of hardware failure.

• If computer fails you lose part of the computation.

• If drive fails, you lose part of your data.

• Intermediate results of processing now stored on multiple drives. Data may need to be combined to produce final result.

4

Page 5

Again, why the need for Hadoop?

• Distributed processing (parallelism).

• Distributed data (replication).

• Fault-tolerance.

• Mechanism to combine data at key points during processing.

5

Page 6

Hadoop Features

• Data Compression.

• Separation of concerns:

• Hadoop manages the complexity of data storage and replication, coordination of hundreds to thousands of machines, and provides a fault tolerant platform for data access and job execution.

• Developers develop instead of becoming distributed system experts. Hadoop defines an API for packaging and submitting jobs, an API hook to update job progress, and a file system to capture job results.

• Data locality.

• Scalable + Commodity hardware.

6

Page 7

HDFS (Hadoop Distributed File System)

• Core component of Hadoop.

• Exhibits all the characteristics of a distributed file system.

• Files managed across a network of servers.

• Data file size can grow beyond limits of physical server.

• Scalable storage of data.

• Tolerate failure of nodes without losing access to data.

• HDFS does this by keeping multiple replicas of each block on different nodes.

7

Page 8

HDFS

• Designed for storing very large files and large amounts of data.

• Capable of storing petabytes of data.

• Designed to run on commodity hardware.

• Doesn’t require expensive or highly available hardware.

• Runs on large clusters of inexpensive commonly available hardware.

• Chance of node failure is high for large clusters but HDFS is designed to survive in the face of failure.

• Optimized for write-once, read many times.

• Optimized for high throughput to read entire dataset.

8

Page 9

HDFS – Not Ideal For:

• Low-latency data access (tens of milliseconds).

• Large number of files.

• Due to Namenode memory constraints, the number of files is currently limited. Scales to millions but not billions of files.

• Multiple writers, or writes in the middle of files.

• Files may only be written by a single writer, and modifications are always made at the end of the file.

9

Page 10

HDFS -- Blocks

• Disks are organized into blocks, which are the basic unit of storage.

• A block is the minimum amount of data that can be read or written.

• Filesystems perform I/O to individual disks in terms of multiple disk blocks.

• HDFS has a notion of a block as well, but the block size is much larger than in typical filesystems—64 MB by default.

• Goal is to keep the number of relatively slow disk seeks low.

• Hadoop operations designed to operate on data the size of an HDFS block. Allows operations to access data with a single disk seek.

• Favors throughput over low-latency.

• Files stored in HDFS as one or more HDFS blocks.

• File replication in HDFS occurs at block level.
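For example, with the default 64 MB block size above, a 200 MB file is stored as four HDFS blocks: three full 64 MB blocks plus one 8 MB block. Unlike a native filesystem block, the final, partially filled HDFS block only occupies as much underlying disk space as it actually holds.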

10

Page 11

HDFS Architecture

• A collection of HDFS nodes/servers is known as an HDFS cluster.

• An HDFS cluster is made up of two types of nodes (Namenode and Datanodes) operating in a master-worker relationship.

• Namenode (the master) – A server running a special piece of software called the NameNode.

• Datanodes (the workers) – Servers running a special piece of software called the DataNode.

11

Page 12

HDFS Architecture—Nodes

• Namenode

• Maintains the filesystem tree and the metadata for all the files and directories in the tree.

• Information is stored persistently on the local disk in two files: the namespace image and the edit log.

• It is recommended to configure HDFS to write copies of the namespace image and edit log files to a remote NFS-mounted filesystem (see the configuration note after this list).

• Determines the mapping of blocks to Datanodes.

• Knows the Datanodes on which all the blocks for a given file are located. This information is stored in memory and not persisted to disk.

• Datanodes

• Store and retrieve blocks when requested.

• Perform block creation, deletion, and replication upon instruction from the Namenode.

• Periodically (and at system startup) report their list of blocks back to the Namenode.
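A minimal sketch of the NFS recommendation above, assuming Hadoop 1.x property names: dfs.name.dir in hdfs-site.xml (dfs.namenode.name.dir in Hadoop 2.x) accepts a comma-separated list of directories, for example /data/dfs/name,/mnt/nfs/dfs/name, and the Namenode writes its namespace image and edit log to every directory in the list, so one copy ends up on the NFS mount.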

12

Page 13

HDFS Architecture (misc.)

• Namenode is a single point of failure.

• Cannot use the filesystem without the Namenode.

• In the Apache Hadoop distribution, manual procedures are needed to bring another node online as the new Namenode. Involves recovering namespace image and edit log from failed Namenode server or from external copy of those files.

• Hadoop 2.x offers federated Namenodes and high-availability (HA) features.

• Secondary Namenode—Optional node type.

• Not a standby for Namenode as the name may imply.

• Main role is to periodically merge namespace image with the edit log to prevent the edit log from becoming too large.

13

Page 14

HDFS Access

• Hadoop and third-party clients are used to access the HDFS filesystem.

• A command-line interface (CLI) program named “hadoop”.

• Various Hadoop Java libraries provide programmatic filesystem access (see the sketch after this list).

• A C library called libhdfs is also bundled with Hadoop.

• Clients access the filesystem on behalf of a user or program, hiding the interaction with the Namenode and Datanodes.
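As a minimal sketch of the programmatic access mentioned above, a client might read a file from HDFS as follows. The package name and path are hypothetical, and the Configuration is assumed to pick up the cluster settings from core-site.xml/hdfs-site.xml on the classpath; the CLI equivalent would be something like "hadoop fs -cat /user/example/words.txt".

package com.hadoop.examples.hdfs;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Reads cluster settings from the configuration files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; the client library hides the Namenode/Datanode interaction.
        Path path = new Path("/user/example/words.txt");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}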

14

Page 15

Sample HDFS Cluster

[Figure: three racks of servers, each server with local storage. Every server runs an HDFS DataNode over its local storage, except for one server that runs the NameNode.]

Page 16

Hadoop Network Topology

• Hadoop takes a simple approach in which the network is represented as a tree.

• “Distance” between nodes is important.

• For high-volume data processing the limiting factor is how rapidly data can be transferred between nodes.

• Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node.

• Network distance and data locality optimizations are key features that distinguish HDFS from other distributed file systems.

16

Page 17

Sample Hierarchical/Tree Network Topology

[Figure: the HDFS cluster as a logical root, with racks (Rack 1 … RackN) as the next level down, and the Namenode and Datanodes (Datanode1 … DatanodeN) as leaves within their racks.]

17

Page 18

Sample Topology Adding Data Center Layer

[Figure: the same tree with a data center level inserted. The cluster root ("/") branches into Data Center 1 and DC2, each data center into racks (Rack 1 … RackN), and each rack into its Datanodes, with the Namenode in one of the racks.]

18

Distance between two nodes is the sum of their distances to their closest common ancestor: D=0 for the same node, D=2 for two nodes in the same rack, D=4 for nodes in different racks of the same data center, and D=6 for nodes in different data centers.

Page 19

HDFS Cluster—Flat Topology

[Figure: Datanode1, Datanode2, …, DatanodeN and the Namenode all attached directly to the cluster root ("/").]

19

With no rack or data center levels, the distance is D=0 from a node to itself and D=2 between any two distinct nodes.

Page 20

HDFS Write: Replication Factor=3

[Figure: an HDFS client (the DistributedFileSystem and FSDataOutputStream classes in the client JVM) writes a file. 1–2: the client asks the Namenode to create the file. 3: the client writes data to the FSDataOutputStream. 4: the data is split into packets and pipelined from the client to a first DataNode, which forwards each packet to a second DataNode, which forwards it to a third (replication factor 3), each replica of the HDFS block labelled b1. 5: acknowledgements flow back along the same pipeline. The client learns block locations from the Namenode. 6: the client closes the stream. 7: the Namenode is told the file is complete.]
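A minimal client-side sketch of the create/write/close sequence in the figure above, assuming a configured cluster and a hypothetical path; the packet pipelining and acknowledgements of steps 4–5 happen inside the stream returned by create():

package com.hadoop.examples.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: ask the Namenode to create the file.
        FSDataOutputStream out = fs.create(new Path("/user/example/output.txt"));

        // Steps 3-5: data is buffered into packets and pipelined to the Datanodes.
        out.writeBytes("hello hdfs\n");

        // Steps 6-7: close the stream; the Namenode marks the file complete.
        out.close();
        fs.close();
    }
}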

Page 21

Hadoop Modes

• Standalone

• No separate daemons are started; all of Hadoop/MapReduce/HDFS runs within a single Java Virtual Machine (JVM).

• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.

• Pseudo-distributed

• Each Hadoop/MapReduce/HDFS daemon runs in its own JVM, but all on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.

• Cluster

• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of the Hadoop/HDFS cluster.

21
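The three modes above are typically selected purely by configuration. As a rough guide for Hadoop 1.x: standalone is the default (fs.default.name=file:///, mapred.job.tracker=local), pseudo-distributed points both properties at localhost, and cluster mode points them at the actual Namenode and Jobtracker hosts.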

Page 22

Exercise 2

22

Page 23

MapReduce

• We have skirted the issue so far: what good is all that data without analysis?

• MapReduce is one of the core components of Hadoop and provides a programming model for processing data.

• Analyzing data with Hadoop is broken up into two primary phases: the Map phase and the Reduce phase (hence the name).

• Each phase has key-value pairs as input and output. The types of those key-value pairs are selectable by the programmer.

• The MapReduce API provides a number of available input and output types. Types are extensible.

• The programmer must specify two functions: the map function and the reduce function.

• The output key-value types of the Map phase must be the same as the input key-value types of the Reduce phase.

• The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key—known as “The Shuffle”.

• MapReduce provides an API for creating a ‘job’ and submitting it to Hadoop for execution.

23

Page 24

Map Phase

• Input: Key/value pairs

• Value represents data set to be processed

• Map Function: User-defined and applied to every value in the data set.

• Output: New list of key/value pairs.

• Output key/value types may be different than input types.
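In the notation often used to describe MapReduce (for example in Hadoop: The Definitive Guide), the map function has the general form map: (K1, V1) → list(K2, V2).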

24

Page 25

Reduce Phase

• Input: Intermediate key/value pairs output from the Map Phase.

  – Data is sorted and grouped by key before being passed to the reduce function.

Map Function Output:          Input to Reduce Function:
(K3, V1)                      (K1, [V1, V2])
(K1, V1)              =>      (K2, [V1])
(K1, V2)                      (K3, [V1])
(K2, V1)

• Reduce Function: User-defined function applied to each grouping (by key) of values.

  – Typically a function that takes a large number of key/value pairs and produces a smaller number of key/value pairs.

  – Hence the name “reduce”.

• Output: Finalized set of key/value pairs.

• All values with the same key will eventually be processed by the same reduce task.

25
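In the same notation, the reduce function has the general form reduce: (K2, list(V2)) → list(K3, V3); the framework guarantees that all values sharing a key are grouped together before reduce is called.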

Page 26

MapReduce (cont)

• In addition to the Map and Reduce phases there are a few data processing steps.

• Input – Turn raw data into key-value pairs for input into Map phase.

• “Shuffle” – Step to turn output key-value pairs from Map phase into input key-value pairs for Reduce phase. The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key.

• Output – Write results of Reduce phase to file system.

26

Page 27

Scaling MapReduce

• Leverage data stored in HDFS.

• Use Hadoop to move computations to nodes hosting part of the data.

• MapReduce Job—unit of work to be completed for client.

• Consists of:

• Input data.

• MapReduce program.

• Configuration information.

• Hadoop divides job into two types of tasks:

• Map tasks.

• Reduce tasks.

27

Page 28

Scaling MapReduce: Division Of Labor

• Hadoop divides input data into fixed-sized pieces called splits.

• Hadoop creates one map task for each split. The map task executes the user-defined map function for each record in the split.
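For example, a 1 GB input file stored in 64 MB HDFS blocks yields 16 splits with the default split size of one block, and therefore 16 map tasks.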

28

Page 29

Map and Reduce Tasks

[Figure: each input split feeds one map task, which runs map (and optionally combine), sorts its output, and partitions it (part 1 … part n). The shuffle delivers each partition to the corresponding reduce task, which merges and sorts the incoming partitions and runs reduce. Each reduce task writes its output part to HDFS, which then replicates it.]

29

Page 30

Hadoop Architecture

• Hadoop cluster uses two node types to facilitate job execution:

• Jobtracker – Server running a special piece of software known as JobTracker.

• Tasktracker – Server(s) running a special piece of software known as TaskTracker.

30

Page 31

Hadoop Architecture--Nodes

• Jobtracker--The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.

• Keeps track of progress for each job.

• Reschedules failed tasks on another tasktracker.

• Tasktracker--Runs tasks and sends progress reports to the jobtracker.

• Note: This architecture is for Hadoop 1.x. Architecture has changed for Hadoop 2.x to overcome 1.x scaling issues.

31

Page 32

Sample HDFS/Hadoop Cluster

[Figure: three racks of servers, each server with local storage. Most servers run an HDFS DataNode plus a TaskTracker; one server runs the JobTracker and another runs the NameNode.]

Page 33

Hadoop Data Locality Optimization

[Figure: a map task running inside a TaskTracker reads its input split from an HDFS block (a, b, c). The block may be on the same server (data-local), on a different server in the same rack (rack-local), or on a server in another rack (off-rack).]

• Hadoop does its best to run each map task on a node where that task’s input split is stored.

• Optimal split size is one HDFS block, since a block is the largest unit of input guaranteed to be stored on a single node.

Page 34

MapReduce Example

Input dataset: SINGLE.TXT*

aardwolves
draftable
flatbread
tutor
trout
tourt

• Find all input rows containing a word that is an anagram of another word in the input.

• Input to the Map Phase is a single file, SINGLE.TXT, containing all words.

• Output of the Map Phase is a set of key/value pairs, one key/value pair for each line of the input dataset. The key is the word’s letters in sorted order; the value is the word itself:

aadelorsvw  aardwolves
aabdeflrt   draftable
aabdeflrt   flatbread
orttu       tutor
orttu       trout
orttu       tourt

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip

34

Page 35

MapReduce Example

• Input to the Reduce phase is the sorted and grouped keys and values:

<snip>
aadelorsvw  [aardwolves]
aabdeflrt   [draftable,flatbread]
<snip>
orttu       [tutor,trout,tourt]
<snip>

• The idea being that anagrams will show up as key/value pairs with multiple values in the group (keyed by the common sorted letters).

• Output of the Reduce phase will be records with multiple values for the same key:

Output: part-00000

<snip>
aabdeflrt draftable,flatbread
<snip>
orttu tutor,trout,tourt
<snip>

35

Page 36

Map Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AnagramMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Sort the word's characters; anagrams share the same sorted form.
        String word = value.toString();
        char[] wordChars = word.toCharArray();
        Arrays.sort(wordChars);
        String sortedWord = new String(wordChars);

        // Emit (sorted letters, original word).
        sortedText.set(sortedWord);
        originalText.set(word);
        outputCollector.collect(sortedText, originalText);
    }
}

36

36

Page 37

Reduce Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text anagramKey, Iterator<Text> anagramValues,
            OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
        // Concatenate all words that share this sorted-letter key.
        String output = "";
        while (anagramValues.hasNext()) {
            Text anagram = anagramValues.next();
            output = output + anagram.toString() + "~";
        }

        // Only emit keys with two or more words, i.e. genuine anagram groups.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(anagramKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

37

37

Page 38

Package The Job

package com.hadoop.examples.anagrams;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AnagramJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.anagrams.AnagramJob.class);
        conf.setJobName("anagramcount");
        conf.setKeepTaskFilesPattern(".*");

        // Key/value types produced by the job.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        // Wire up the user-defined map and reduce functions.
        conf.setMapperClass(AnagramMapper.class);
        conf.setReducerClass(AnagramReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
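Assuming the three classes above are compiled and packaged into a jar (the jar name and HDFS paths below are placeholders), the job can be submitted with the hadoop CLI, for example:

hadoop jar anagrams.jar com.hadoop.examples.anagrams.AnagramJob /user/example/input /user/example/output

The first program argument becomes args[0] (the input path) and the second becomes args[1] (the output path), which must not already exist.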

38

Page 39

MapReduce Example 2

Dataset 1: NAMES-F.TXT*

Aaren
Aarika
Abagael
Abagail
Abbe
Zsazsa
Zulema
Zuzana

Dataset 2: NAMES-M.TXT*

Aaron
Ab
Abba
Abbe
Abbey
Zerk
Zollie
Zolly

• Compare the two datasets for a set of common names.

• Input to the Map Phase is both datasets (female and male names).

• Output of the Map Phase is a set of key/value pairs, one key/value pair for each line of the input datasets. Both the key and the value are set to the name from the input record.

• Input to the Reduce phase is the sorted and grouped keys and values.

• The idea being that common names will show up as key/value pairs with multiple values in the group.

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip

39

Page 40

Map Function

package com.hadoop.examples.commonnames;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CommonNamesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Emit (name, name): grouping by key brings identical names together.
        String word = value.toString();
        originalText.set(word);
        outputCollector.collect(originalText, originalText);
    }
}

40

40

Page 41

Reduce Function

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text namesKey, Iterator<Text> namesValues,
            OutputCollector<Text, Text> results, Reporter reporter) throws IOException {
        // Concatenate every occurrence of this name across the input datasets.
        String output = "";
        while (namesValues.hasNext()) {
            Text names = namesValues.next();
            output = output + names.toString() + "~";
        }

        // Only emit names with two or more occurrences, i.e. names common to the datasets.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(namesKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

41

41

Page 42

Package The Job

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesJob {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.commonnames.CommonNamesJob.class);
        conf.setJobName("commonnames");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(CommonNamesMapper.class);
        conf.setReducerClass(CommonNamesReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Both name files live under the input path given on the command line.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

42

42

Page 43

Hadoop Modes

• Standalone

• No separate daemons are started; all of Hadoop/MapReduce/HDFS runs within a single Java Virtual Machine (JVM).

• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.

• Pseudo-distributed

• Each Hadoop/MapReduce/HDFS daemon runs in its own JVM, but all on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.

• Cluster

• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of the Hadoop/HDFS cluster.

43

Page 44

Exercise 3

44

Page 45

Hadoop Introduction

• Vendor: Copyright © 2011 The Apache Software Foundation.
  Website: http://hadoop.apache.org/
  Description: The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

• The project includes these subprojects:

  • Hadoop Common: The common utilities that support the other Hadoop subprojects.

  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

  • Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

45

Page 46

Additional Resources

• Hadoop: The Definitive Guide, Third Edition, by Tom White. Copyright 2011 Tom White, ISBN 978-1-449-31152-0.

• Cloudera:

• Linux packages and Virtual Machines:

• https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH3PackagesandDownloads

• Apache Hadoop Motivation Webinars (free registration required)

• http://www.cloudera.com/resource/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop/

46

Page 47

Backup Slides

47

Page 48

Alternatives

• MPI

• PVM

48

Page 49

HDFS Access

• Talk about HTTP/Proxy access??

• Talk about pluggable filesystems???

49

Page 50

HDFS

• File structures

• SequenceFile

• MapFile

50

Page 51

MapReduce Features

• Counters

• Sorting

• Joins

• Side Data

51