Distributed Computing with Apache Hadoop. Introduction to MapReduce.


Transcript of Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Page 1: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Distributed Computing with

Apache Hadoop

Introduction to MapReduce

Konstantin V. Shvachko

Birmingham Big Data Science Group October 19, 2011

Page 2: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Computing

• The history of computing started a long time ago

• Fascination with numbers

– Vast universe with simple strict rules

– Computing devices

– Crunch numbers

• The Internet

– Universe of words, fuzzy rules

– Different type of computing

– Understand meaning of things

– Human thinking

– Errors & deviations are a part of the study


Computer History Museum, Mountain View

Page 3: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Words vs. Numbers

• In 1997 IBM built the Deep Blue supercomputer

– Played chess against world champion G. Kasparov

– The human race was defeated

– Strict rules of chess

– Fast, deep analysis of the current state

– Still numbers


• In 2011 IBM built the Watson computer to play Jeopardy!

– Questions and hints in human terms

– Analysis of texts from libraries and the Internet

– The human champions were defeated

Page 4: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Big Data

• Computations that need the power of many computers

– Large datasets: hundreds of TBs, PBs

– Or use of thousands of CPUs in parallel

– Or both

• Cluster as a computer


What is a PB?

1 KB = 1000 Bytes

1 MB = 1000 KB

1 GB = 1000 MB

1 TB = 1000 GB

1 PB = 1000 TB

1 EB = 1000 PB

Page 5: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Examples – Science

• Fundamental physics: Large Hadron Collider (LHC)

– Smashing together high-energy protons at nearly the speed of light

– 1 PB of event data per second, most of it filtered out

– 15 PB of data per year

– 150 computing centers around the world

– 160 PB of disk + 90 PB of tape storage

• Math: Big Numbers

– The two-quadrillionth (2·10^15) digit of π is 0

– A pure CPU workload

– 12 days of cluster time

– 208 years of CPU-time on a cluster with 7600 CPU cores

• Big Data – Big Science


Page 6: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Examples – Web

• Search engine Webmap

– Map of the Internet

– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage

• Internet Search Index

– Traditional application

• Social Network Analysis

– Intelligence

– Trends


Page 7: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

The Sorting Problem

• Classic in-memory sorting

– Complexity: number of comparisons

• External sorting

– Cannot load all data in memory

– 16 GB RAM vs. 200 GB file

– Complexity: + disk IOs (bytes read or written)

• Distributed sorting

– Cannot load data on a single server

– 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set

– Complexity: + network transfers


            Worst       Average     Space
Bubble Sort O(n²)       O(n²)       In-place
Quicksort   O(n²)       O(n log n)  In-place
Merge Sort  O(n log n)  O(n log n)  Double

Page 8: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

What do we do?

• We need a lot of computers

• How do we make them work together?


Page 9: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Hadoop

• Apache Hadoop is an ecosystem of tools for processing “Big Data”

• Started in 2005 by D. Cutting and M. Cafarella

• Consists of two main components, providing a unified view of the cluster:

1. HDFS – a distributed file system

– File system API connecting thousands of drives

2. MapReduce – a framework for distributed computations

– Splitting jobs into parts executable on one node

– Scheduling and monitoring of job execution

• Used everywhere today; becoming the standard for distributed computing

• Hadoop is an open source project


Page 10: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MapReduce

• MapReduce

– 2004, Jeffrey Dean and Sanjay Ghemawat, Google

– “MapReduce: Simplified Data Processing on Large Clusters”

• Computational model

– What is a computational model?

• Turing machine, Java

– Split large input data into small enough pieces, process in parallel

• Execution framework

– Compilers, interpreters

– Scheduling, Processing, Coordination

– Failure recovery


Page 11: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Functional Programming

• Map: a higher-order function

– applies a given function to each element of a list

– returns the list of results

• Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]

• Example. Map( x², [0,1,2,3,4,5] ) = [0,1,4,9,16,25]


Page 12: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Functional Programming: reduce


• Reduce / fold: a higher-order function

– Iterates a given function over a list of elements

– Applies the function to the previous result and the current element

– Returns a single result

• Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15


Page 13: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Functional Programming

• Reduce( x * y, [0,1,2,3,4,5] ) = ?

Page 14: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Functional Programming

• Reduce( x * y, [0,1,2,3,4,5] ) = 0

– The list contains 0, so every partial product, and therefore the final result, is 0

Page 15: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Example: Sum of Squares

• Composition of

– a map followed by

– a reduce applied to the results of the map

• Example.

– Map( x², [1,2,3,4,5] ) = [1,4,9,16,25]

– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55

• Map is easily parallelizable

– Compute x² for 1, 2, 3 on one node and for 4, 5 on another

• Reduce is notoriously sequential

– We need all the squares on one node to compute the total sum


Square pyramidal number:

1 + 4 + … + n² = n(n+1)(2n+1) / 6
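
As a small illustration of the map-then-reduce composition, here is the sum of squares written with Java streams. This is a sketch in modern Java, not code from the talk:

import java.util.Arrays;
import java.util.List;

public class SumOfSquares {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4, 5);
        int sum = xs.stream()
                    .map(x -> x * x)           // Map: square each element
                    .reduce(0, Integer::sum);  // Reduce: fold with addition
        System.out.println(sum);               // prints 55 = 5·6·11/6
    }
}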

Page 16: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Computational Model

• MapReduce is a Parallel Computational Model

• Map-Reduce algorithm = job

• Operates with key-value pairs: (k, V)

– Primitive types, Strings, or more complex structures

• Map-Reduce job input and output is a list of pairs {(k, V)}

• An MR job is defined by 2 functions:

• map: (k1; v1) → {(k2; v2)}

• reduce: (k2; {v2}) → {(k3; v3)}
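
In the Hadoop Java API these two functions correspond to the Mapper and Reducer base classes. Their abridged signatures, with the generic parameter names from org.apache.hadoop.mapreduce, show the same key-value types:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // map: (k1, v1) → {(k2, v2)}, pairs emitted via Context.write()
    protected void map(KEYIN key, VALUEIN value, Context context)
        throws IOException, InterruptedException { /* user code */ }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // reduce: (k2, {v2}) → {(k3, v3)}
    protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
        throws IOException, InterruptedException { /* user code */ }
}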


Page 17: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Job Workflow

[Figure: workflow of a job counting consonants (C) and vowels (V) in the words “dogs like cats”. Map: dogs → (C, 3), (V, 1); like → (C, 2), (V, 2); cats → (C, 3), (V, 1). Reduce: (C, 8), (V, 4).]

Page 18: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

The Algorithm


Map(null, word)
    nC = Consonants(word)
    nV = Vowels(word)
    Emit(“Consonants”, nC)
    Emit(“Vowels”, nV)

Reduce(key, {n1, n2, …})
    nRes = n1 + n2 + …
    Emit(key, nRes)

Page 19: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Computation Framework

• Two virtual clusters: HDFS and MapReduce

– Physically tightly coupled. Designed to work together

• Hadoop Distributed File System. View data as files and directories

• MapReduce is a Parallel Computation Framework

– Job scheduling and execution framework


Page 20: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

HDFS Architecture Principles

• The name space is a hierarchy of files and directories

• Files are divided into blocks (typically 128 MB)

• Namespace (metadata) is decoupled from data

– Fast namespace operations, not slowed down by data streaming

• Single NameNode keeps the entire name space in RAM

• DataNodes store data blocks on local drives

• Blocks are replicated on 3 DataNodes for redundancy and availability
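
Clients work with HDFS through the FileSystem API as if it were an ordinary file system. A minimal sketch of reading a file (the path is made up for illustration):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // Contacts the NameNode for metadata; block data is read from DataNodes
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/user/demo/words.txt"))));
        for (String line = in.readLine(); line != null; line = in.readLine())
            System.out.println(line);
        in.close();
    }
}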


Page 21: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MapReduce Framework

• Job Input is a file or a set of files in a distributed file system (HDFS)

– Input is split into blocks of roughly the same size

– Blocks are replicated to multiple nodes

– Block holds a list of key-value pairs

• Map task is scheduled to one of the nodes containing the block

– Map task input is node-local

– Map task result is node-local

• Map task results are grouped: one group per reducer; each group is sorted

• Reduce task is scheduled to a node

– Reduce task transfers the targeted groups from all mapper nodes

– Computes and stores results in a separate HDFS file

• Job output is a set of files in HDFS, with #files = #reducers


Page 22: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MapReduce Example: Mean

• Mean

• Input: large text file

• Output: average length of words in the file µ

• Example: µ({dogs, like, cats}) = 4

µ = (1/n) · ∑ xi,  i = 1, …, n

Page 23: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Mean Mapper

• Map input is the set of words {w} in the partition

– Key = null Value = w

• Map computes

– Number of words in the partition

– Total length of the words ∑length(w)

• Map output

– <“count”, #words>

– <“length”, #totalLength>


Map(null, w)
    Emit(“count”, 1)
    Emit(“length”, length(w))

Page 24: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Single Mean Reducer

• Reduce input

– {<key, {value}>}, where

– key is “count” or “length”

– value is an integer

• Reduce computes

– Total number of words: N = sum of all “count” values

– Total length of words: L = sum of all “length” values

• Reduce Output

– <“count”, N>

– <“length”, L>

• The result

– µ = L / N


Reduce(key, {n1, n2, …})
    nRes = n1 + n2 + …
    Emit(key, nRes)

Analyze()
    read(“part-r-00000”)
    print(“mean = ” + L/N)

Page 25: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Mean: Mapper, Reducer


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMean {

    private final static Text COUNT_KEY = new Text("count");
    private final static Text LENGTH_KEY = new Text("length");
    private final static LongWritable ONE = new LongWritable(1);

    public static class WordMeanMapper
            extends Mapper<Object, Text, Text, LongWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken();
                context.write(LENGTH_KEY, new LongWritable(word.length()));
                context.write(COUNT_KEY, ONE);
            }
        }
    }

    public static class WordMeanReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterable<LongWritable> values,
                Context context) throws IOException, InterruptedException {
            long sum = 0; // long, not int: totals can exceed Integer.MAX_VALUE
            for (LongWritable val : values)
                sum += val.get();
            context.write(key, new LongWritable(sum));
        }
    }

. . . . . . . . . . . . . . . .

Page 26: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Mean: main()


. . . . . . . . . . . . . . . .

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
            new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordmean <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "word mean");
        job.setJarByClass(WordMean.class);
        job.setMapperClass(WordMeanMapper.class);
        job.setCombinerClass(WordMeanReducer.class); // local pre-aggregation
        job.setReducerClass(WordMeanReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setNumReduceTasks(1); // single reducer => single file part-r-00000
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        Path outputpath = new Path(otherArgs[1]);
        FileOutputFormat.setOutputPath(job, outputpath);
        boolean result = job.waitForCompletion(true);
        analyzeResult(outputpath);
        System.exit(result ? 0 : 1);
    }

. . . . . . . . . . . . . . . .

Page 27: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Mean: analyzeResult()


. . . . . . . . . . . . . . . .

    private static void analyzeResult(Path outDir) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path reduceFile = new Path(outDir, "part-r-00000");
        if (!fs.exists(reduceFile)) return;
        long count = 0, length = 0;
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(reduceFile)));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            StringTokenizer st = new StringTokenizer(line);
            String key = st.nextToken();
            String value = st.nextToken();
            if (key.equals("count")) count = Long.parseLong(value);
            else if (key.equals("length")) length = Long.parseLong(value);
        }
        in.close();
        double average = (double) length / count;
        System.out.println("The mean is: " + average);
    }

} // end WordMean

Page 28: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MapReduce Implementation

• A single master JobTracker shepherds the distributed herd of TaskTrackers

1. Job scheduling and resource allocation

2. Job monitoring and job lifecycle coordination

3. Cluster health and resource tracking

• Job is defined

– Program: myJob.jar file

– Configuration: conf.xml

– Input, output paths

• JobClient submits the job to the JobTracker

– Calculates and creates splits based on the input

– Writes myJob.jar and conf.xml to HDFS


Page 29: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MapReduce Implementation

• JobTracker divides the job into tasks: one map task per split.

– Assigns a TaskTracker for each task, collocated with the split

• TaskTrackers execute tasks and report status to the JobTracker

– TaskTracker can run multiple map and reduce tasks

– Map and Reduce Slots

• Failed attempts reassigned to other TaskTrackers

• Job execution status and results reported back to the client

• Scheduler lets many jobs run in parallel


Page 30: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Example: Standard Deviation

• Standard deviation

• Input: large text file

• Output: standard deviation σ of word lengths

• Example: σ({dogs, like, cats}) = 0

• How many jobs are needed?


σ = sqrt( (1/n) · ∑ (xi − µ)² ),  i = 1, …, n

Page 31: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Standard Deviation: Hint

σ² = (1/n) · ∑ (xi − µ)²
   = (1/n) · ∑ xi² − 2µ · (1/n) · ∑ xi + µ²
   = (1/n) · ∑ xi² − µ²

So a single job suffices: compute n, ∑ xi, and ∑ xi² in one pass.

Page 32: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Standard Deviation Mapper

• Map input is the set of words {w} in the partition

– Key = null Value = w

• Map computes

– Number of words in the partition

– Total length of the words ∑length(w)

– The sum of the lengths squared ∑length(w)²

• Map output

– <“count”, #words>

– <“length”, #totalLength>

– <“squared”, #sumLengthSquared>


Map(null, w)
    Emit(“count”, 1)
    Emit(“length”, length(w))
    Emit(“squared”, length(w)²)
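
Compared to the WordMean mapper above, only one extra emit is needed. A sketch that would compile inside the WordMean class, reusing its COUNT_KEY, LENGTH_KEY, and ONE; the SQUARED_KEY name and class name are illustrative, not from the talk:

public static class WordStdDevMapper
        extends Mapper<Object, Text, Text, LongWritable> {
    private final static Text SQUARED_KEY = new Text("squared");
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            long len = itr.nextToken().length();
            context.write(COUNT_KEY, ONE);                           // n
            context.write(LENGTH_KEY, new LongWritable(len));        // ∑ xi
            context.write(SQUARED_KEY, new LongWritable(len * len)); // ∑ xi²
        }
    }
}

The summing reducer from WordMean works unchanged.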

Page 33: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Standard Deviation Reducer

• Reduce input

– {<key, {value}>}, where

– key = “count”, “length”, “squared”

– value is an integer

• Reduce computes

– Total number of words: N = sum of all “count” values

– Total length of words: L = sum of all “length” values

– Sum of length squares: S = sum of all “squared” values

• Reduce Output

– <“count”, N>

– <“length”, L>

– <“squared”, S>

• The result

– µ = L / N

– σ = sqrt(S / N - µ2)


Reduce(key, {n1, n2, …})
    nRes = n1 + n2 + …
    Emit(key, nRes)

Analyze()
    read(“part-r-00000”)
    print(“mean = ” + L/N)
    print(“std.dev = ” + sqrt(S/N − (L*L)/(N*N)))

Page 34: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Combiner, Partitioner

• Combiners perform local aggregation before the shuffle & sort phase

– Optimization to reduce data transfers during shuffle

– In the Mean example it reduces the transfer from many keys per mapper to only two

• Partitioners assign intermediate (map) key-value pairs to reducers

– Responsible for dividing up the intermediate key space

– Not used with single Reducer


[Figure: the MapReduce pipeline: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]
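
A partitioner maps an intermediate key to a reducer index. A minimal sketch; this modulo scheme is essentially what Hadoop's default HashPartitioner does:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashingPartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        // Mask the sign bit, then spread keys evenly over the reducers
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is installed on a job with job.setPartitionerClass(HashingPartitioner.class).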

Page 35: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Distributed Sorting

• Sort a dataset that cannot be entirely stored on one node

• Input:

– Set of files. 100 byte records.

– The first 10 bytes of each record is the key and the rest is the value.

• Output:

– Ordered list of files: f1, … fN

– Each file fi is sorted, and

– If i < j then for any keys k ∈ fi and r ∈ fj (k ≤ r)

– Concatenation of files in the given order must form a completely sorted record set


Page 36: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Naïve MapReduce Sorting

• If the output could be stored on one node

• The input to any Reducer is always sorted by key

– Shuffle sorts Map outputs

• One identity Mapper and one identity Reducer would do the trick

– Identity: <k,v> → <k,v>


[Figure: one identity Mapper and one identity Reducer sort the input “dogs like cats”; the shuffle orders the keys, and the output is: cats, dogs, like]
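
Hadoop's base Mapper and Reducer classes already behave as identity functions, but writing them out makes the trick explicit. A sketch, assuming an input format that yields Text key-value pairs (e.g. KeyValueTextInputFormat):

public static class IdentityMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // pass through: <k,v> → <k,v>
    }
}

public static class IdentityReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values)
            context.write(key, v); // keys arrive sorted from the shuffle
    }
}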

Page 37: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Naïve Sorting: Multiple Maps

• Multiple identity Mappers and one identity Reducer – same result

– Does not work for multiple Reducers


[Figure: multiple identity Mappers over splits of “dogs like cats” feed a single identity Reducer, which again outputs: cats, dogs, like]

Page 38: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Sorting: Generalization

• Define a hash function, such that

– h: {k} → [1,N]

– Preserves the order: k ≤ s → h(k) ≤ h(s)

– e.g., h(k) is a fixed-size prefix of the string k (its first 2 bytes)

• Identity Mapper

• With a specialized Partitioner

– Computes the hash h(k) of the key and assigns <k,v> to reducer Rh(k)

• Identity Reducer

– Number of reducers is N: R1, …, RN

– The input of Ri is all pairs whose key satisfies h(k) = i

– Ri is an identity reducer, which writes output to HDFS file fi

– The choice of hash function guarantees that keys from fi are less than keys from fj if i < j

• The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
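
A sketch of such an order-preserving partitioner, using the first byte of the key to pick the reducer. The real TeraSort partitioner derives its split points by sampling the input; this fixed scheme simply assumes uniformly distributed, nonempty keys:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OrderPreservingPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int prefix = key.getBytes()[0] & 0xFF; // first byte of the key, 0..255
        // k ≤ s implies getPartition(k) ≤ getPartition(s)
        return prefix * numPartitions / 256;
    }
}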


Page 39: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Undirected Graphs

• “A Discipline of Programming” E. W. Dijkstra. Ch. 23.

– Good old classics

• A graph is defined by V = {v}, E = {<v,w> | v,w ∈ V}

• Undirected graph: E is symmetrical, that is <v,w> ∈ E ≡ <w,v> ∈ E

• Different representations of E

1. Set of pairs

2. <v, {direct neighbors}>

3. Adjacency matrix

• From representation 1 to 2 in one MR job (see the example after this list)

– Identity Mapper

– Combiner = Reducer

– Reducer joins values for each vertex
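
For example, with E = { <1,2>, <2,1>, <1,3>, <3,1> }, the identity mapper emits each pair as <key = v, value = w>, and the reducer joins the values per vertex: <1, {2,3}>, <2, {1}>, <3, {1}>.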


Page 40: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

Connected Components

• Partition set of nodes V into disjoint subsets V1, …, VN

– V = V1 U … U VN

– No paths using E from Vi to Vj if i ≠ j

– Gi = <Vi, Ei >

• Representation of connected component

– key = min{Vi}

– value = Vi

• Chain of MR jobs

• Initial data representation

– E is partitioned into sets of records (blocks)

– <v,w> ∈ E → <min(v,w), {v,w}> = <k, C>


Page 41: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

MR Connected Components

• Mapper / Reducer Input

– {<k, C>}, where C is a subset of V, k = min(C)

• Mapper

• Reducer

• Iterate. Stop when stabilized


Map({<k, C>})
    For all <ki, Ci> and <kj, Cj>:
        if Ci ∩ Cj ≠ Ø then
            C = Ci U Cj
            Emit(min(C), C)

Reduce(k, {C1, C2, …})
    resC = C1 U C2 U …
    Emit(k, resC)
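
For example, starting from the pairs <1, {1,2}>, <2, {2,3}>, <4, {4,5}>: the map merges {1,2} and {2,3} (they share vertex 2) and emits <1, {1,2,3}> and <4, {4,5}>; the next iteration changes nothing, so the components are {1,2,3} and {4,5}.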

Page 42: Distributed Computing with Apache Hadoop. Introduction to MapReduce.

The End
