Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

14
Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Shivnath Babu Duke University Duke University

Transcript of Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Page 1: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Towards Automatic Optimization of MapReduce Programs

(Position Paper)

Shivnath BabuShivnath Babu

Duke UniversityDuke University

Page 2: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

JAQL

Roadmap

• Call to action to improve automatic optimization techniques in MapReduce frameworks

• Challenges & promising directions

Hadoop

HDFS

Pig Hive …

Page 3: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as aMapReduce job

Page 4: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as aMapReduce job

Page 5: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Map Wave 1

ReduceWave 1

Map Wave 2

ReduceWave 2

Input Splits

Lifecycle of a MapReduce JobTime

How are the number of splits, number of map and reducetasks, memory allocation to tasks, etc., determined?

Page 6: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Job Configuration Parameters

• 190+ parameters in Hadoop

• Set manually or defaults are used

• Are defaults or rules-of-thumb good enough?

Page 7: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Ru

nn

ing

tim

e (

se

co

nd

s)

Experiments

On EC2 andlocal clusters

Ru

nn

ing

tim

e (s

eco

nd

s)

Ru

nn

ing

tim

e (

min

ute

s)

Ru

nn

ing

tim

e (m

inu

tes)

Page 8: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

• Performance at default and rule-of-thumb settings can be poor

• Cross-parameter interactions are significant

Illustrative Result: 50GB Terasort17-node cluster, 64+32 concurrent map+reduce slots

mapred.reduce.tasks

io.sort.factor

io.sort.record.percent

10 10 0.15

Runningtime

10 500 0.15

28 10 0.15

300 10 0.15

300 500 0.15

Based onpopularrule-of-thumb

Page 9: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Problem Space

Current approaches:• Predominantly manual• Post-mortem analysis

Job configurationparameters

Declarative HiveQL/Pigoperations

Multi-jobworkflows

Performanceobjectives

Cost in pay-as-you-goenvironment

Energyconsiderations

Com

plex

ity

Spa

ce o

f ex

ecut

ion

choi

ces

Is this where we want to be?

Page 10: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Good planGood settingof parameters

Can DB Query Optimization Technology Help?

But:

– MapReduce jobs are not declarative

– No schema about the data

– Impact of concurrent jobs & scheduling?

– Space of parameters is huge

Optimizer:• Enumerate• Cost • Search

QueryDatabaseExecution

Engine

MapReducejob

Hadoop Results

Can we:

– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years

• Or innovate!

– Exploit design & usage properties of MapReduce frameworks

Page 11: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Spectrum of Query Optimizers

Conventional

OptimizersRule-based

Cost models + statistics about data

AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks

Insight: Predictability(RBO) >> Predictability(CBO)

Page 12: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Spectrum of Query Optimizers

Conventional

OptimizersRule-based

Cost models + statistics about data

AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks

Insight: Predictability(RBO) >> Predictability(CBO)

LearningOptimizers(learn from

execution & adapt)

TuningOptimizers(proactively

try different plans)

Page 13: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Spectrum of Query Optimizers

Conventional

OptimizersRule-based

Cost models + statistics about data

LearningOptimizers(learn from

execution & adapt)

Exploit usage & design properties of MapReduce frameworks:

• High ratio of repeated jobs to new jobs

• Schema can be learned (e.g., Pig scripts)

• Common sort-partition-merge skeleton

• Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)

• Fine-grained and pluggable scheduler

TuningOptimizers(proactively

try different plans)

Page 14: Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Summary• Call to action to improve automatic optimization

techniques in MapReduce frameworks– Automated generation of optimized Hadoop configuration

parameter settings, HiveQL/Pig/JAQL query plans, etc.

– Rich history to learn from

– MapReduce execution creates unique opportunities/challenges