Hadoop Map Reduce

eldawy/18FCS226/slides/CS226-10-17-MapReduce.pdf

10/17/2018


MapReduce

2-in-1

A programming paradigm

A query execution engine

A kind of functional programming

We focus on the MapReduce execution engine of Hadoop through YARN


Overview

[Figure: Developer, MR Program, Driver, MR Job, Master node, Slave nodes]


Code Example
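As an illustration of the shape of a MapReduce program, here is a minimal word-count sketch in plain Python. It is a single-process simulation, not the Hadoop Java API; the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical.

```python
from collections import defaultdict

def map_fn(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    groups = defaultdict(list)
    # Map: apply map_fn to every record.
    for line in lines:
        for key, value in map_fn(line):
            # Shuffle (simulated): group values by key.
            groups[key].append(value)
    # Reduce: one call per distinct key, in sorted key order.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

result = run_job(["Hadoop runs MapReduce", "MapReduce runs on Hadoop"])
print(result)
```

The developer supplies only the two functions; everything inside `run_job` stands in for work that the framework would do.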


Job Execution Overview

[Figure: Job submission (on the driver), then Job preparation, Map, Shuffle, Reduce, Cleanup]


Job Submission

Execution location: Driver node

A driver machine should have the following

Compatible Hadoop binaries

Cluster configuration files

Network access to the master node

Collects job information from the user

Input and output paths

Map, reduce, and any other functions

Any additional user configuration

Packages all this in a Hadoop Configuration


Hadoop Configuration

Key (String)    Value (String)
Input           hdfs://user/eldawy/README.txt
Output          hdfs://user/eldawy/wordcount
Mapper          edu.ucr.cs.cs226.eldawy.WordCount
Reducer         …
JAR File        …
User-defined    User-defined

[Figure: the configuration is serialized over the network to the master node]
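The string-to-string nature of the configuration can be sketched as follows. This is an illustration only: the key names are hypothetical, and JSON is used as a stand-in wire format, which is an assumption and not what Hadoop itself uses.

```python
import json

# Hypothetical key names; every value is stored as a string.
conf = {
    "input": "hdfs://user/eldawy/README.txt",
    "output": "hdfs://user/eldawy/wordcount",
    "mapper": "edu.ucr.cs.cs226.eldawy.WordCount",
    "my.custom.threshold": str(10),  # user-defined values are strings too
}

# Driver side: serialize the whole configuration into bytes.
wire = json.dumps(conf).encode("utf-8")

# Master side: decode the bytes back into the same key/value pairs.
received = json.loads(wire.decode("utf-8"))

# Typed values must be parsed back from their string form.
threshold = int(received["my.custom.threshold"])
```

Because both keys and values are plain strings, any user-defined parameter has to be encoded as text by the driver and parsed again by the task that reads it.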


Job Preparation

Runs on the master node

Gets the job ready for parallel execution

Collects the JAR file that contains the user-defined functions, e.g., Map and Reduce

Writes the JAR and configuration to HDFS to be accessible by the executors

Looks at the input file(s) to decide how many map tasks are needed

Makes some sanity checks

Finally, it pushes the BRB (Big Red Button)


Job Preparation

[Figure: Configuration and JAR File go from the master node to HDFS; InputFormat#getSplits() produces Split1, Split2, …, SplitM, one per mapper (Mapper1, Mapper2, …, MapperM); each FileInputSplit has a Path, Start, and End]
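How the input size determines the number of map tasks can be sketched as below. This is a simplified model assuming fixed-size byte ranges; the `Split` record and `get_splits` function are hypothetical and only mirror the Path/Start/End fields named above, and the real split logic may differ in details such as how a small trailing range is handled.

```python
from collections import namedtuple

# Hypothetical record mirroring the Path/Start/End fields.
Split = namedtuple("Split", ["path", "start", "end"])

def get_splits(path, file_size, split_size):
    """Cut [0, file_size) into consecutive byte ranges of at most split_size."""
    splits = []
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)
        splits.append(Split(path, start, end))
        start = end
    return splits

# Sizes are in arbitrary units for illustration.
splits = get_splits("hdfs://user/eldawy/README.txt", file_size=300, split_size=128)
num_map_tasks = len(splits)  # one map task per split
```

With these numbers the file is cut into the ranges [0, 128), [128, 256), and [256, 300), so three map tasks would be scheduled.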


Map Phase

Runs in parallel on worker nodes

M Mappers:

Read the input

Apply the map function

Apply the combine function (if configured)

Store the map output

There is no guaranteed ordering for processing the input splits
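The effect of the optional combine step can be sketched in plain Python. This is a single-process simulation under the assumption of a word-count style job; `map_fn`, `combine`, and `run_mapper` are hypothetical names.

```python
from collections import Counter

def map_fn(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1

def combine(pairs):
    """Locally sum the values per key before the output leaves the mapper."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def run_mapper(split_lines, use_combiner=True):
    # Read the input and apply the map function to every record.
    raw = [pair for line in split_lines for pair in map_fn(line)]
    # Apply the combine function only if it is configured.
    return combine(raw) if use_combiner else raw

split = ["a b a", "b a"]
without = run_mapper(split, use_combiner=False)
with_combiner = run_mapper(split, use_combiner=True)
```

Here the mapper emits five pairs without the combiner and only two with it, which is the point of combining: less map output to store and transfer.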


Map Phase

[Figure: Master node; input splits IS1, IS2, IS3, IS4, IS5, …, ISM]


Map Task

Reads the job configuration and task information (mostly, InputSplit)

Instantiates an object of the Mapper class

Instantiates a record reader for the assigned input split

Calls Mapper#setup(Context)

Reads records one-by-one from the record reader and passes them to the map function

The map function writes the output to the context
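The steps above can be sketched as a small simulation. The classes below are hypothetical Python stand-ins, not the Hadoop classes; only the order of the calls is meant to match the list.

```python
class Context:
    """Hypothetical stand-in for the map context: holds config, collects output."""
    def __init__(self, conf):
        self.conf = conf
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))

class LengthMapper:
    """Example mapper: emits (line_length, line) for sufficiently long lines."""
    def setup(self, context):
        # Called once per task, before any record is processed.
        self.min_length = int(context.conf.get("min.length", "0"))

    def map(self, key, value, context):
        if len(value) >= self.min_length:
            context.write(len(value), value)

def record_reader(split_lines):
    """Yield (offset, line) pairs, one record at a time."""
    offset = 0
    for line in split_lines:
        yield offset, line
        offset += len(line) + 1  # +1 for the newline

def run_map_task(mapper_class, split_lines, conf):
    context = Context(conf)              # job configuration and task information
    mapper = mapper_class()              # instantiate the Mapper class
    reader = record_reader(split_lines)  # record reader for the assigned split
    mapper.setup(context)                # setup is called once
    for key, value in reader:            # records are passed one by one
        mapper.map(key, value, context)
    return context.output

out = run_map_task(LengthMapper, ["hi", "hello", "hey there"], {"min.length": "3"})
```

The user code never loops over the input itself; the task drives the loop and the map function only sees one record at a time.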


MapContext

Keeps track of which input split is being read and which records are being processed

Holds all the job configuration and some additional information about the map task

Materializes the map output


Map Output

What really happens to the map output?

It depends on the number of reducers

0 reducers: Map output is written directly to HDFS as the final answer

1+ reducers: Map output is passed to the shuffle phase


Shuffle Phase

Executed only in the case of one or more reducers

Transfers data between the mappers and reducers

Groups records by their keys to ensure local processing in the reduce phase


Shuffle Phase

[Figure: mappers Map1, Map2, Map3, …, MapM; reducers Reduce1, Reduce2, …, ReduceN]


Shuffle Phase (Map-side)

[Figure: in Mapi, the Input Split passes through map to produce (k, v) pairs; Partition assigns them to partitions 0, 1, …, N-1, with keys ranging from kA to kZ; the partitions correspond to Reduce1, Reduce2, …, ReduceN]
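A minimal sketch of map-side partitioning, assuming hash partitioning on string keys and a sort within each partition. The function names are hypothetical, and `zlib.crc32` is used only as a deterministic hash for the illustration; it is not the hash function Hadoop uses.

```python
import zlib

def partition(key, num_reducers):
    """Deterministic hash partitioning: a key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side_shuffle(pairs, num_reducers):
    # One bucket per reducer, numbered 0 .. N-1.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    # Sort each partition by key so reducers can merge sorted runs.
    return [sorted(p) for p in partitions]

pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
parts = map_side_shuffle(pairs, num_reducers=3)
```

Since the partition depends only on the key, all pairs with the same key end up at the same reducer regardless of which mapper produced them.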


Shuffle Phase (Reduce-side)

[Figure: Reducej copies part1, part2, part3, …, partM from Map1, Map2, Map3, …, MapM; steps: Copy, Sort, Reduce over the (k, v) pairs]
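The copy and sort steps on the reduce side can be sketched as a merge of sorted runs. This is a simulation that assumes each mapper has already sorted its partitions by key; `reduce_side_shuffle` and the nested-list layout are hypothetical.

```python
import heapq

def reduce_side_shuffle(map_outputs, reducer_id):
    """map_outputs[m][r] is the sorted partition r produced by mapper m."""
    # Copy: fetch this reducer's partition from every mapper.
    copied = [mapper_parts[reducer_id] for mapper_parts in map_outputs]
    # Sort: merge the already-sorted runs into one sorted list.
    return list(heapq.merge(*copied))

map_outputs = [
    [[("a", 1), ("c", 1)], [("b", 1)]],            # mapper 1: partitions 0 and 1
    [[("a", 2)],           [("b", 3), ("d", 1)]],  # mapper 2: partitions 0 and 1
]
merged = reduce_side_shuffle(map_outputs, reducer_id=0)
```

After the merge, all pairs with the same key are adjacent, which is what lets the reduce step process one key group at a time.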


Reduce Phase

Apply the reduce function to each group of records that share the same key

[Figure: sorted pairs (k1, v), (k1, v), (k2, v), (k2, v), (k3, v), (k3, v), (k3, v), …, (kN, v); one reduce call per key group; each call writes to the output]
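Grouping sorted records by key and calling reduce once per group can be sketched as follows. This is an illustration with a hypothetical summing `reduce_fn`, assuming the input is already sorted by key.

```python
from itertools import groupby
from operator import itemgetter

def reduce_fn(key, values):
    """Example reduce function: sum the values of one key group."""
    yield key, sum(values)

def run_reduce(sorted_pairs):
    output = []
    # groupby relies on equal keys being adjacent, i.e. on sorted input.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        output.extend(reduce_fn(key, values))
    return output

result = run_reduce(
    [("k1", 1), ("k1", 4), ("k2", 2), ("k3", 1), ("k3", 1), ("k3", 5)]
)
```

The reduce function receives an iterator over the values rather than a full list, so a large group does not need to fit in memory at once.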


Output Writing

Materializes the final output to disk

All results from one process (mapper/reducer) are stored in a subdirectory

An OutputFormat is used to:

Create any files in the output directory

Write the output records one-by-one to the output

Merge the results from all the tasks (if needed)

While the output writing runs in parallel, the final commit step runs on a single machine
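The write-then-commit pattern can be sketched on a local file system. The directory layout and file names below (`_temporary`, `part-NNNNN`, `_SUCCESS`) are only loosely modeled on Hadoop's conventions and should be read as assumptions; the functions are hypothetical.

```python
import os
import tempfile

def write_task_output(job_dir, task_id, records):
    """Each task writes into its own temporary subdirectory."""
    task_dir = os.path.join(job_dir, "_temporary", f"task_{task_id}")
    os.makedirs(task_dir)
    path = os.path.join(task_dir, f"part-{task_id:05d}")
    with open(path, "w") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")
    return path

def commit_job(job_dir):
    """Single-process commit: move every part file into the output directory."""
    temp_root = os.path.join(job_dir, "_temporary")
    for task in sorted(os.listdir(temp_root)):
        task_dir = os.path.join(temp_root, task)
        for name in os.listdir(task_dir):
            os.rename(os.path.join(task_dir, name), os.path.join(job_dir, name))
        os.rmdir(task_dir)
    os.rmdir(temp_root)
    # Empty marker file signalling that the job finished.
    open(os.path.join(job_dir, "_SUCCESS"), "w").close()

job_dir = tempfile.mkdtemp()
write_task_output(job_dir, 0, [("a", 3)])  # tasks could run in parallel
write_task_output(job_dir, 1, [("b", 2)])
commit_job(job_dir)                        # commit runs once, in one place
files = sorted(os.listdir(job_dir))
```

Keeping each task's files in a separate subdirectory until the commit means a failed or duplicated task cannot leave partial files in the final output.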


MapReduce Examples

Input: A log file

Filter

Aggregation

Conversion
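The three example jobs can be sketched over a tiny in-memory log. The log format (`<ip> <status> <bytes>`) and all function names are hypothetical; filter and conversion are modeled as map-only jobs, and aggregation as a map followed by a reduce.

```python
from collections import defaultdict

# Hypothetical log format: "<ip> <status> <bytes>"
log = [
    "10.0.0.1 200 512",
    "10.0.0.2 404 0",
    "10.0.0.1 200 128",
    "10.0.0.3 500 0",
]

def filter_map(line):
    """Filter (map-only): keep only error responses."""
    ip, status, size = line.split()
    if int(status) >= 400:
        yield line

def convert_map(line):
    """Conversion (map-only): rewrite each record as CSV."""
    yield ",".join(line.split())

def aggregate_map(line):
    """Aggregation, map side: emit (ip, bytes)."""
    ip, status, size = line.split()
    yield ip, int(size)

def aggregate_reduce(ip, sizes):
    """Aggregation, reduce side: total bytes per ip."""
    return ip, sum(sizes)

errors = [out for line in log for out in filter_map(line)]
csv_rows = [out for line in log for out in convert_map(line)]

# Simulated shuffle: group the map output by key before reducing.
groups = defaultdict(list)
for line in log:
    for ip, size in aggregate_map(line):
        groups[ip].append(size)
totals = [aggregate_reduce(ip, sizes) for ip, sizes in sorted(groups.items())]
```

Filter and conversion need no reducer because each output record depends on a single input record; aggregation needs one because records for the same key may come from different splits.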


Advanced Issues

Map failures

Reduce failures

Straggler problem

Custom keys and values

Efficient sorting on serialized data

Pipeline MapReduce jobs
