Herodotos Herodotou, Fei Dong, Shivnath Babu - Duke University

Transcript (26 pages)

Page 1: Title Slide

Herodotos Herodotou, Fei Dong, Shivnath Babu
Duke University

Page 2: Analysis in the Big Data Era

Data Parallel Systems
- Scalability
- Fault-tolerance
- Reliability
- Flexibility

Cloud Computing
- Elasticity
- Reduced costs
- Disaster relief
- Highly automated

Page 3: Data Parallel Systems in the Cloud

Power to the (non-expert) users
- Can provision a cluster of choice in minutes
- Can avoid the middleman (system administrator)
- Pay only for the resources used

Burden to the (non-expert & expert) users
- Make decisions at various levels (resources → job)
- Meet workload requirements

Page 4: Executing MR Workloads on the Cloud

Goal: Execute MapReduce workload in < 2 hours and < $10

Decisions:
- Cluster resources: number of nodes, node type (~machine specs)
- Hadoop-level settings: number of map/reduce slots, memory settings, ...
- Job-level settings: number of map/reduce tasks, use of compression, ...

[Diagram: cluster nodes running TaskTrackers (TT) and DataNodes (DN), executing map and reduce tasks over input and output data]

Page 5: Multi-objective Cluster Provisioning

[Charts: Running Time (min) and Cost ($) of the same workload across Amazon EC2 instance types (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge), i.e., different machine specs]

Page 6: Moving from Development to Production

Development Cluster
- Smaller, less powerful than the production cluster
- Used to develop & test MapReduce jobs

Moving to the Production Cluster raises:
- How will the jobs behave?
- What settings to use for running the jobs?

Page 7: Cluster Sizing Problem

Determine cluster resources & job-level configuration parameters to meet requirements on execution time & cost for a given workload.
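
One way to state the problem formally (this notation is mine, not from the slides): given a workload W, search over cluster resources r (number and type of nodes) and job-level configurations c, optimizing one objective subject to a constraint on the other, e.g.

    \min_{r,\,c} \; \mathrm{cost}(W, r, c) \quad \text{subject to} \quad \mathrm{time}(W, r, c) \le T_{\max}

or, symmetrically, minimize time(W, r, c) subject to cost(W, r, c) \le C_{\max}. The What-if Engine described later supplies the estimates of time and cost used during this search.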

Page 8: Solution: Elastisizer

Elastisizer, part of the Starfish system (http://www.cs.duke.edu/starfish)

Components:
- Cluster Sizing Query Interface
- Resource Optimizer
- Configuration Optimizer
- What-if Engine
- Profiler

Page 9: Cluster Sizing Query Interface

1. Workload
   - MapReduce jobs
   - Input data properties
   (abstracted using job profiles)

2. Cluster Resources
   - Number of nodes
   - Type of nodes

3. Job Configurations
   - Number of map/reduce tasks
   - Task-level memory allocation

   (2 & 3: either specify or ask for recommendations)

4. Performance Requirements
   - Execution time
   - Monetary cost
   (optimize for one, possibly subject to a constraint on the other)
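
To make the four parts of a query concrete, here is a hypothetical sketch of how such a query could be represented; the class and field names are mine, not the actual Elastisizer interface:

    # Hypothetical sketch of a cluster sizing query; names are illustrative,
    # not the actual Elastisizer API. A field left as None means
    # "ask for a recommendation".
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class ClusterSpec:
        num_nodes: Optional[int] = None       # e.g., 30
        node_type: Optional[str] = None       # e.g., "m1.xlarge"

    @dataclass
    class JobConfig:
        num_map_tasks: Optional[int] = None
        num_reduce_tasks: Optional[int] = None
        task_memory_mb: Optional[int] = None

    @dataclass
    class ClusterSizingQuery:
        workload: List[str]                   # job profiles abstracting the jobs + input data
        cluster: ClusterSpec = field(default_factory=ClusterSpec)
        job_config: JobConfig = field(default_factory=JobConfig)
        objective: str = "cost"               # optimize "cost" or "time" ...
        max_time_min: Optional[float] = None  # ... subject to a constraint on the other
        max_cost_usd: Optional[float] = None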

Page 10: Cluster Sizing Example Queries

What-if question: How will the jobs behave?
- Specify the workload, the production cluster's resources & particular configuration settings. Ask for performance.

Optimization request: What settings to use?
- Specify the workload and the production cluster's resources. Ask for configuration settings.

[Diagram: both queries are posed from the development cluster and refer to the production cluster]
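
Continuing the hypothetical sketch from Page 9, the two example queries might be expressed as:

    # Uses the hypothetical ClusterSizingQuery/ClusterSpec/JobConfig classes
    # sketched after Page 9 (illustrative only).

    # What-if question: fully specified resources and settings; ask for performance.
    whatif_query = ClusterSizingQuery(
        workload=["profile_TS", "profile_WC"],
        cluster=ClusterSpec(num_nodes=30, node_type="m1.xlarge"),
        job_config=JobConfig(num_map_tasks=240, num_reduce_tasks=60, task_memory_mb=1024),
    )

    # Optimization request: resources fixed, settings left as None ("recommend").
    optimization_query = ClusterSizingQuery(
        workload=["profile_TS", "profile_WC"],
        cluster=ClusterSpec(num_nodes=30, node_type="m1.xlarge"),
        objective="time",
    )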

Page 11: What-if Engine

[Diagram]
Inputs: a what-if question; job profiles; input data properties, cluster resources, and configuration settings (possibly hypothetical)
Output: properties of the hypothetical job

Page 12: Job Profile

Concise representation of program execution
- Records information at the level of "task phases"
- Generated by the Profiler through measurement, or by the What-if Engine through estimation

[Diagram: phases of a map task: Read (input split from DFS), Map, Collect (serialize, partition into the memory buffer), Spill (sort, [combine], [compress]), Merge]

Page 13: Job Profile Fields

Dataflow: amount of data flowing through task phases
- Map output bytes
- Number of spills
- Number of records in buffer per spill

Costs: execution times at the level of task phases
- Read phase time in the map task
- Map phase time in the map task
- Spill phase time in the map task

Dataflow Statistics: statistical information about dataflow
- Width of input key-value pairs
- Map selectivity in terms of records
- Map output compression ratio

Cost Statistics: statistical information about resource costs
- I/O cost for reading from local disk, per byte
- CPU cost for executing the Mapper, per record
- CPU cost for uncompressing the input, per byte
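
As a concrete (and purely illustrative) rendering of these four categories, a job profile could be represented roughly as follows; the field names follow the slide's examples, not the exact Starfish schema:

    # Hypothetical sketch of a job profile; names follow the slide's examples,
    # not the exact Starfish schema.
    from dataclasses import dataclass

    @dataclass
    class JobProfile:
        # Dataflow: amount of data flowing through task phases
        map_output_bytes: int
        num_spills: int
        records_per_spill: int
        # Costs: execution times (ms) at the level of task phases
        read_phase_ms: float
        map_phase_ms: float
        spill_phase_ms: float
        # Dataflow Statistics: statistical information about dataflow
        input_pair_width_bytes: float
        map_selectivity_records: float       # output records / input records
        map_output_compression_ratio: float
        # Cost Statistics: statistical information about resource costs
        local_io_cost_per_byte: float
        mapper_cpu_cost_per_record: float
        uncompress_cpu_cost_per_byte: float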

Page 14: What-if Engine

[Diagram]
Inputs: a what-if question; job profiles; input data properties, cluster resources, and configuration settings (possibly hypothetical)
Inside the engine: Virtual Profile Estimation → virtual job profiles → Task Scheduler Simulator
Output: properties of the hypothetical job
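
A minimal sketch of that flow, with placeholder stand-ins of my own for the two stages (this is not Starfish code; the real engine uses the estimation models of Page 15 and a detailed scheduler simulator):

    def estimate_virtual_profile(job_profile, input_data, cluster, config):
        # Placeholder: the real step applies the cardinality, relative black-box,
        # and white-box models described on Page 15.
        return job_profile

    def simulate_task_scheduling(virtual_profile, cluster, config):
        # Placeholder: the real step simulates waves of map/reduce task execution
        # over the cluster's task slots using the virtual profile's per-phase timings.
        slots = cluster["num_nodes"] * cluster["map_slots_per_node"]
        waves = max(1, -(-config["num_map_tasks"] // slots))   # ceiling division
        return waves * virtual_profile["map_task_minutes"]

    def whatif(job_profile, input_data, cluster, config):
        """Predict running time (min) and monetary cost ($) for a hypothetical setting."""
        vp = estimate_virtual_profile(job_profile, input_data, cluster, config)
        time_min = simulate_task_scheduling(vp, cluster, config)
        cost_usd = cluster["num_nodes"] * cluster["price_per_hour"] * time_min / 60.0
        return time_min, cost_usd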

Page 15: Virtual Profile Estimation

Given the profile for job j, estimate the profile for job j' based on new input data, cluster resources, & configuration.

[Diagram: Profile for j + (Input Data, Resources, Configuration) → (Virtual) Profile for j']
- Dataflow Statistics of j + input data → Cardinality Models → Dataflow Statistics of j'
- Cost Statistics of j + resources → Relative Black-box Models → Cost Statistics of j'
- Dataflow Statistics of j' + configuration → White-box Models → Dataflow of j'
- Dataflow + Cost Statistics of j' → White-box Models → Costs of j'
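
Refining the placeholder from the Page 14 sketch, the four model families compose roughly as follows (my paraphrase of the diagram; the model callables are passed in rather than implemented):

    # Sketch of composing the estimation models; 'models' is a dict of callables:
    # 'cardinality', 'relative', 'wb_dataflow', 'wb_costs'. Illustrative only.

    def estimate_virtual_profile(profile_j, new_input, new_resources, new_config, models):
        # 1. Cardinality models: dataflow statistics for j' from j's statistics + new input data.
        dataflow_stats = models["cardinality"](profile_j["dataflow_stats"], new_input)
        # 2. Relative black-box models: cost statistics for j' on the new cluster resources.
        cost_stats = models["relative"](profile_j["cost_stats"], new_resources)
        # 3. White-box models: dataflow fields for j' from the statistics + configuration.
        dataflow = models["wb_dataflow"](dataflow_stats, new_config)
        # 4. White-box models: per-phase costs for j' from the dataflow + cost statistics.
        costs = models["wb_costs"](dataflow, cost_stats)
        return {"dataflow_stats": dataflow_stats, "cost_stats": cost_stats,
                "dataflow": dataflow, "costs": costs}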

Page 16: Relative Models for Cost Statistics (CS)

[Diagram: CS_dev for job j, measured on dev cluster resources r1 → relative model → CS_prod for job j on prod cluster resources r2]

Example: on the dev cluster, "Local I/O cost / byte" = 17 and "Sort CPU cost / key" = 6; the corresponding values on the prod cluster are unknown and must be predicted.

Training jobs j1, j2, ..., jn are run on both clusters, and their measured cost statistics are fed to a supervised learning algorithm (see the sketch below).
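
A minimal sketch of that training step. The numbers are toy values, and the learner (scikit-learn's RandomForestRegressor, used here because it handles multi-output regression) is a stand-in for whatever supervised algorithm the system actually uses:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Each row holds the cost statistics of one training job (e.g., local I/O cost
    # per byte, sort CPU cost per key, ...), measured on the dev and prod clusters.
    # Toy numbers for illustration only.
    cs_dev  = np.array([[17.0, 6.0], [12.0, 5.0], [20.0, 8.0], [15.0, 7.0]])
    cs_prod = np.array([[ 9.0, 3.0], [ 7.0, 2.5], [11.0, 4.0], [ 8.0, 3.5]])

    relative_model = RandomForestRegressor(n_estimators=50, random_state=0)
    relative_model.fit(cs_dev, cs_prod)        # learn the dev -> prod mapping

    # Predict prod-cluster cost statistics for a job profiled only on the dev cluster.
    print(relative_model.predict(np.array([[17.0, 6.0]])))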

Page 17: Generating Training Samples

Requirements
- Good coverage of the prediction space
- Generate samples efficiently

Idea 1: Try to cover the space based on the workload
- Apriori Training Benchmark: run a sample of the jobs in the workload
- Assumes knowledge of the full workload
- Running time grows with workload & data
- New jobs are not represented in the training data

Page 18: Generating Training Samples

Requirements
- Good coverage of the prediction space
- Generate samples efficiently

Idea 2: Focus on the spectrum of resource usage
- Fixed Training Benchmark: run representative MapReduce jobs with different resource usage behavior
- Independent of the actual workload & data

Page 19: Generating Training Samples

Requirements
- Good coverage of the prediction space
- Generate samples efficiently

Idea 3: Focus on the spectrum of resource usage + exploit the job profile abstraction
- Custom Training Benchmark: small, synthetic workload (generation sketched below)
- Tasks within jobs behave differently in terms of CPU, I/O, memory, and network
- Diverse training samples
- No knowledge of the actual workload/input data needed
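
A sketch of how such a custom benchmark could be generated: a handful of synthetic job specifications whose tasks stress CPU, I/O, memory, and the network to different degrees. The parameterization below is illustrative, not the actual benchmark generator:

    import itertools, random

    def custom_training_benchmark(num_jobs=12, seed=42):
        """Return a small, diverse set of synthetic job specs (illustrative only)."""
        rng = random.Random(seed)
        levels = [0.2, 0.5, 0.9]                               # low / medium / high intensity
        space = list(itertools.product(levels, repeat=4))      # (cpu, io, mem, net) combinations
        rng.shuffle(space)
        jobs = []
        for cpu, io, mem, net in space[:num_jobs]:
            jobs.append({
                "cpu_intensity": cpu,       # e.g., extra computation per record
                "io_intensity": io,         # e.g., map output bytes per input byte
                "memory_intensity": mem,    # e.g., buffer usage per task
                "network_intensity": net,   # e.g., shuffle volume per map output byte
                "compression": rng.choice([True, False]),
            })
        return jobs

    for spec in custom_training_benchmark(num_jobs=4):
        print(spec)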

Page 20: Experimental Evaluation

Different scenarios: Cluster × Workload × Data

EC2 Node Type | CPU (EC2 units) | Memory  | I/O Perf. | Cost/hour | #Maps/node | #Reds/node | Max Mem/task
m1.small      | 1 (1 x 1)       | 1.7 GB  | moderate  | $0.085    | 2          | 1          | 300 MB
m1.large      | 4 (2 x 2)       | 7.5 GB  | high      | $0.34     | 3          | 2          | 1024 MB
m1.xlarge     | 8 (4 x 2)       | 15 GB   | high      | $0.68     | 4          | 4          | 1536 MB
c1.medium     | 5 (2 x 2.5)     | 1.7 GB  | moderate  | $0.17     | 2          | 2          | 300 MB
c1.xlarge     | 20 (8 x 2.5)    | 7 GB    | high      | $0.68     | 8          | 6          | 400 MB
cc1.4xlarge   | 33.5 (8)        | 23 GB   | very high | $1.60     | 8          | 6          | 1536 MB
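
Given this table, the monetary cost of a run follows directly from the node count, the hourly price, and the running time. A minimal sketch, assuming the per-instance-hour billing EC2 used at the time (hours rounded up):

    import math

    # Hourly prices taken from the table above.
    PRICE_PER_HOUR = {
        "m1.small": 0.085, "m1.large": 0.34, "m1.xlarge": 0.68,
        "c1.medium": 0.17, "c1.xlarge": 0.68, "cc1.4xlarge": 1.60,
    }

    def run_cost(node_type, num_nodes, running_time_min):
        hours_billed = math.ceil(running_time_min / 60.0)   # round up to full instance-hours
        return num_nodes * PRICE_PER_HOUR[node_type] * hours_billed

    # Example: 30 m1.xlarge nodes running for 100 minutes.
    print(run_cost("m1.xlarge", 30, 100))   # 30 * 0.68 * 2 = 40.8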

Page 21: Experimental Evaluation

Different scenarios: Cluster × Workload × Data

Abbr. | MapReduce Program                       | Domain                | Dataset
CO    | Word Co-occurrence                      | Natural Lang. Proc.   | Wikipedia (10GB – 22GB)
WC    | WordCount                               | Text Analytics        | Wikipedia (30GB – 180GB)
TS    | TeraSort                                | Business Analytics    | TeraGen (30GB – 1TB)
LG    | LinkGraph                               | Graph Processing      | Wikipedia (compressed ~6x)
JO    | Join                                    | Business Analytics    | TPC-H (30GB – 1TB)
TF    | Term Frequency – Inverse Document Freq. | Information Retrieval | Wikipedia (30GB – 180GB)

Page 22: Multi-objective Cluster Provisioning

[Charts: Actual vs. Predicted Running Time (min) and Cost ($) for each EC2 instance type of the target cluster (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge)]

Workload: all jobs, 22GB – 180GB per job
Instance type for source cluster: m1.large

Page 23: Moving from Development to Production

Profile a job on the development cluster, optimize for the production cluster
- Development cluster: 10 m1.large nodes (40 EC2 units), ~60 GB
- Production cluster: 30 m1.xlarge nodes (240 EC2 units), ~180 GB

[Chart: Running Time (min) per MapReduce job (TS, WC, LG, JO, TF, CO) under three settings: Rules-of-Thumb, Profile from Development, Profile from Production]

Page 24: Tuning Cluster Size

Profile a job on a 10-node cluster, predict for other sizes
- Profile cluster: 10 m1.large nodes (40 EC2 units), ~60 GB
- Prediction clusters: 5–30 m1.large nodes (40 EC2 units), ~60 GB

[Charts: Actual vs. Predicted Running Time (min) vs. Number of Nodes (5, 10, 20, 30) for LinkGraph and Word Co-occurrence]
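
Reusing the hypothetical whatif() sketch from Page 14, tuning cluster size amounts to profiling once and then sweeping the node count with what-if questions instead of re-running the job at every size:

    # Depends on the whatif() sketch after Page 14; all values below are toy numbers.
    profile = {"map_task_minutes": 2.0}      # would come from the Profiler on the 10-node cluster
    config = {"num_map_tasks": 360}
    for n in (5, 10, 20, 30):
        cluster = {"num_nodes": n, "map_slots_per_node": 3, "price_per_hour": 0.34}  # m1.large
        t_min, cost = whatif(profile, None, cluster, config)
        print(f"{n:2d} nodes: predicted {t_min:5.1f} min, ${cost:.2f}")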

Page 25: Training Benchmarks Evaluation

[Charts: (a) Training Time (min) per instance type (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge) for the Apriori, Fixed, and Custom benchmarks; (b) Relative Prediction Error over the Apriori benchmark per instance type (m1.small, m1.xlarge, c1.medium, c1.xlarge) for the Apriori, Fixed, and Custom benchmarks]

Page 26: More info: www.cs.duke.edu/starfish

Starfish components:
- Job-level MapReduce configuration
- Workflow optimization
- Workload management
- Data layout tuning
- Cluster sizing

[Diagram: example workflow of jobs J1, J2, J3, J4]