CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu

Word Count over a Given Set of Web Pages

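Input pages: "see bob throw" and "see spot run"

Counts for "see bob throw": see 1, bob 1, throw 1

Counts for "see spot run": see 1, spot 1, run 1

Combined word counts: bob 1, run 1, see 2, spot 1, throw 1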

Can we do word count in parallel?
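
One way to see why the answer is yes: the per-page counting and the per-word summing are each independent pieces of work. The following is a minimal single-process sketch of that idea in Python, not Hadoop code; the function names `map_page`, `reduce_word`, and `word_count` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

def map_page(page):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]

def reduce_word(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)

def word_count(pages):
    # Each page is mapped independently, so these calls could run in parallel.
    pairs = [pair for page in pages for pair in map_page(page)]
    # Group the pairs by word so that each word reaches exactly one reduce call.
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    # Each word is reduced independently of the others as well.
    return dict(reduce_word(w, c) for w, c in sorted(groups.items()))

counts = word_count(["see bob throw", "see spot run"])
print(counts)  # {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

The two pages from the slide are used as input, and the result matches the combined counts shown above.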

The MapReduce Framework (pioneered by Google)

Automatic Parallel Execution in MapReduce (Google)

Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task to avoid a slow task slowing down the whole job
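
The two behaviors on this slide can be imitated in a few lines. This is only a rough sketch using Python threads, under the assumption that a task is an ordinary function; `run_with_retries`, `run_with_backup`, `run_task`, and `flaky_task` are hypothetical names and do not correspond to any Hadoop or Google API.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_with_retries(task, max_attempts=3):
    """Restart a failed task, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt

def run_task(task_id, delay):
    """Hypothetical task: sleep to simulate work, then return a result."""
    time.sleep(delay)
    return f"result of task {task_id}"

def run_with_backup(task_id, delays):
    """Launch one copy of the task per delay and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(delays))
    copies = [pool.submit(run_task, task_id, d) for d in delays]
    done, _ = wait(copies, return_when=FIRST_COMPLETED)
    # The slower copy keeps running in the background here; a real
    # framework would kill it once another copy has finished.
    pool.shutdown(wait=False)
    return next(iter(done)).result()

def flaky_task(attempt):
    # Simulate a node failure on the first attempt only.
    if attempt == 1:
        raise RuntimeError("node failed")
    return f"ok on attempt {attempt}"

retried = run_with_retries(flaky_task)
# One copy is a slow straggler (0.3 s); the backup copy takes 0.01 s.
result = run_with_backup(7, delays=[0.3, 0.01])
print(retried, "|", result)
```

Because every copy of a task computes the same result, it does not matter which copy finishes first; the job simply stops waiting for the straggler.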

MapReduce in Hadoop (1)

MapReduce in Hadoop (2)

MapReduce in Hadoop (3)

Data Flow in a MapReduce Program in Hadoop

• InputFormat

• Map function

• Partitioner

• Sorting & Merging

• Combiner

• Shuffling

• Merging

• Reduce function

• OutputFormat

1:many
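
To make the order of these stages concrete, here is a simplified in-memory sketch in Python. It is an illustration under several assumptions, not Hadoop's implementation: the job is word count, there are two reduce tasks (`NUM_REDUCERS` is a made-up constant), the partitioner hashes the key, and the InputFormat and OutputFormat are reduced to a list of splits and a returned dictionary. All function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import groupby

NUM_REDUCERS = 2  # assumed number of reduce tasks for this illustration

def map_fn(line):
    # Map function: one input record can produce many output pairs.
    return [(word, 1) for word in line.split()]

def partition(key):
    # Partitioner: a deterministic hash picks the reduce task for a key.
    return zlib.crc32(key.encode()) % NUM_REDUCERS

def combine(sorted_pairs):
    # Combiner: pre-aggregate counts per key on the map side.
    return [(k, sum(v for _, v in grp))
            for k, grp in groupby(sorted_pairs, key=lambda kv: kv[0])]

def reduce_fn(key, values):
    # Reduce function: final aggregation for one key.
    return key, sum(values)

def run_job(splits):
    shuffled = defaultdict(list)  # reduce task number -> pairs sent to it
    for split in splits:          # stands in for InputFormat: one split per map task
        buckets = defaultdict(list)
        for line in split:
            for key, value in map_fn(line):
                buckets[partition(key)].append((key, value))
        for r, pairs in buckets.items():
            # Sort each partition, combine it, then "shuffle" it to reducer r.
            shuffled[r].extend(combine(sorted(pairs)))
    output = {}
    for r in range(NUM_REDUCERS):
        merged = sorted(shuffled[r])  # merge the sorted runs from all map tasks
        for key, grp in groupby(merged, key=lambda kv: kv[0]):
            k, total = reduce_fn(key, [v for _, v in grp])
            output[k] = total
    return output  # stands in for OutputFormat, which would write this out

print(run_job([["see bob throw see"], ["see spot run"]]))
```

Note how the combiner turns the two `("see", 1)` pairs from the first split into a single `("see", 2)` before anything is shuffled, which is the reason for running it on the map side.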

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job

Lifecycle of a MapReduce Job

[Timeline figure (axis: Time): Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2]

How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
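
As a rough back-of-the-envelope model of where splits and waves come from, the sketch below assumes one map task per input split and a fixed number of task slots in the cluster. The function `plan_job` and all of its parameters are hypothetical, and the sizes are example values rather than anything taken from the slides; real Hadoop settings depend on the configuration and version.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b

def plan_job(input_bytes, split_bytes, num_reduce_tasks, map_slots, reduce_slots):
    # Assumption: the input is cut into fixed-size splits, one map task each.
    num_splits = ceil_div(input_bytes, split_bytes)
    num_map_tasks = num_splits
    return {
        "splits": num_splits,
        "map_tasks": num_map_tasks,
        # A wave is one round of tasks that fills every available slot.
        "map_waves": ceil_div(num_map_tasks, map_slots),
        "reduce_waves": ceil_div(num_reduce_tasks, reduce_slots),
    }

MiB = 2 ** 20
plan = plan_job(
    input_bytes=1024 * MiB,  # 1 GiB of input (example value)
    split_bytes=64 * MiB,    # example split size
    num_reduce_tasks=4,
    map_slots=8,
    reduce_slots=2,
)
print(plan)  # {'splits': 16, 'map_tasks': 16, 'map_waves': 2, 'reduce_waves': 2}
```

With these example numbers the job runs in two map waves and two reduce waves, the same shape as the timeline on this slide.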

Job Configuration Parameters

• 190+ parameters in Hadoop

• Set manually or defaults are used
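
The "set manually or defaults are used" rule can be pictured as a two-level lookup. The sketch below models it with Python's `ChainMap`; it is not how Hadoop stores its configuration. The four parameter names and default values are recalled from older Hadoop releases and should be treated as assumptions to check against the documentation for your version; `effective_config` is a hypothetical helper.

```python
from collections import ChainMap

# Assumed defaults, as recalled from older Hadoop releases; verify for your version.
DEFAULTS = {
    "mapred.reduce.tasks": 1,             # number of reduce tasks
    "io.sort.mb": 100,                    # map-side sort buffer size, in MB
    "io.sort.factor": 10,                 # number of streams merged at once
    "mapred.compress.map.output": False,  # compress map output before shuffle
}

def effective_config(overrides):
    """Manually set values win; anything not set falls back to the default."""
    return ChainMap(overrides, DEFAULTS)

conf = effective_config({"mapred.reduce.tasks": 4, "io.sort.mb": 200})
print(conf["mapred.reduce.tasks"])  # 4 (set manually)
print(conf["io.sort.factor"])       # 10 (default)
```

With 190+ such parameters, most jobs end up running largely on defaults, which is why the choice of defaults matters for performance.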

How to sort data using Hadoop?
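
One common answer, sketched here as an assumption rather than as the approach the lecture necessarily takes, is to let the framework's own sort do the work: use identity map and reduce functions, and partition keys by range so that the reduce outputs, read in order, form one sorted sequence. The names `range_partition` and `mapreduce_sort` and the boundary values are hypothetical.

```python
import bisect

def range_partition(key, boundaries):
    """Send a key to the reduce task whose key range contains it."""
    return bisect.bisect_right(boundaries, key)

def mapreduce_sort(records, boundaries):
    # One partition per reduce task; boundaries split the key space into ranges.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in records:  # identity map: emit each key unchanged
        partitions[range_partition(key, boundaries)].append(key)
    # The framework sorts the keys that arrive at each reduce task.
    sorted_parts = [sorted(part) for part in partitions]
    # Identity reduce; concatenating outputs in partition order is globally sorted.
    return [key for part in sorted_parts for key in part]

print(mapreduce_sort([42, 7, 19, 88, 3, 56], boundaries=[20, 50]))
# [3, 7, 19, 42, 56, 88]
```

With a plain hash partitioner each reduce output would still be sorted internally, but the outputs could not simply be concatenated, which is why the range boundaries matter.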

Let us look at a complete example MapReduce program in Hadoop