Page 1

CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics

Omid Alipourfard, Hongqiang Harry Liu, Jianshu Chen, Shivaram Venkataraman, Minlan Yu, Ming Zhang

Page 2

Recurring jobs are popular

(Figure: apps and data flow into big-data frameworks; recurring jobs are highlighted.)

Microsoft reports 40% of key jobs at Bing are rerun periodically.

Page 3

Cluster size, machine type, and provider choices

- AWS: r3.8xlarge, i2.8xlarge, m4.8xlarge, c4.8xlarge, r4.8xlarge, m3.8xlarge, ...
- Azure: A0, A1, A2, A3, A4, A11, A12, D1, D2, D3, L4s, ...
- Google Cloud: n1-standard-4, n1-highmem-2, n1-highcpu-4, f1-micro, ... (plus configurable VMs)

Tens of machine types per provider, multiplied by the choice of cluster size, yield hundreds of instance-type and instance-count combinations.

Page 4

Choosing a good configuration is important

Better performance:

- For the same cost, the best and worst running times differ by up to 3x.
- The worst configuration has fast CPUs but is bottlenecked on memory.

Lower cost:

- For the same performance, the best and worst costs differ by up to 12x.
- The worst configuration pays for expensive dedicated disks it does not need.
- $2 vs. $24 per job adds up over hundreds of monthly runs.


Page 5

How do we find the best cloud configuration for a recurring job, given its representative workload? The best configuration is the one that minimizes cost subject to a performance constraint.


Page 6

Key challenges

- High accuracy: close to the optimal configuration
- Adaptivity: works across all big-data apps
- Low overhead: only runs a few configurations

(Existing approaches fall into two families: modeling and searching.)

Page 7

Existing solution: searching

- Systematically search each dimension (coordinate descent, sketched below)
  - One resource at a time: RAM, CPU, disk, cluster size
- Problem: not accurate
  - Performance/cost curves are non-convex across the many resources
  - Searching one dimension and fixing it early can mislead the search in later dimensions
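A minimal sketch of such a coordinate-descent baseline, assuming a hypothetical run_cost(config) helper that launches the job and returns its cost; this is an illustration, not the paper's implementation:

```python
# Hypothetical coordinate-descent search over a cloud configuration space.
# run_cost(config) is assumed to run the job on `config` and return its cost.

def coordinate_descent(dimensions, start, run_cost):
    """dimensions: dict of name -> candidate values; start: dict of name -> value."""
    best = dict(start)
    best_cost = run_cost(best)
    for name, candidates in dimensions.items():   # search one dimension at a time
        for value in candidates:
            trial = dict(best, **{name: value})
            cost = run_cost(trial)
            if cost < best_cost:
                best, best_cost = trial, cost
        # The value chosen for `name` is now frozen; on a non-convex cost surface
        # this early commitment can mislead the search over later dimensions.
    return best, best_cost
```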


Page 8

Existing solution: modeling

- Model the resource-performance/cost tradeoffs
  - Ernest [NSDI '16] models machine learning apps for each machine type
- Problem: not adaptive; a model built for one setting does not carry over across
  - Frameworks (e.g., MapReduce or Spark)
  - Applications (e.g., machine learning or databases)
  - Machine types (e.g., memory- vs. CPU-intensive)

Page 9

Goals: high accuracy, adaptivity, low overhead

- Modeling (Ernest): not general
- Searching (coordinate descent): not accurate
- Searching (exhaustive): high overhead
- CherryPick: aims for all three

Page 10

Key idea of CherryPick

- Adaptivity: black-box modeling
  - No need to know the structure of each application
- Accuracy: model only well enough to rank configurations
  - The model does not need to be accurate everywhere
- Low overhead: interactive searching
  - Smartly select the next run based on existing runs (see the loop sketch below)
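Putting the three ideas together, the search is a simple loop. The sketch below is illustrative, with hypothetical helpers run_job(config) (runs the job and returns its cost) and a model object exposing update and suggest; it is not CherryPick's actual code:

```python
# Illustrative interactive search loop: run a config, update a black-box model,
# and let the model suggest the most promising config to try next.

def interactive_search(candidate_configs, run_job, model, max_runs=10):
    observations = {}                        # config -> measured cost
    config = candidate_configs[0]            # start with any configuration
    for _ in range(max_runs):
        cost = run_job(config)               # run the job on this configuration
        observations[config] = cost
        model.update(observations)           # refit the black-box model
        config, expected_gain = model.suggest(candidate_configs, observations)
        if expected_gain < 0.01:             # stop when little improvement is expected
            break
    return min(observations, key=observations.get)   # best configuration found
```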

Page 11

Workflow of CherryPick

Start with any config -> Run the config -> Black-box modeling -> Rank and choose the next config -> either run it (the interactive search loop) or return the chosen config


Page 18

Workflow of CherryPick

The loop is driven by Bayesian optimization: a prior function provides the black-box model, and an acquisition function ranks candidates and chooses the next config.


Page 20

Prior function for black-box modeling

Feature vector: CPU / RAM / machine count / ...

(The figure shows the actual cost function curve across all configurations.)


Page 22: Shivaram Venkataraman, Minlan Yu, Ming Zhang Omid ...

Challenge: what can we infer about configurations with two runs? There are many valid functions passing through two points

22

Prior function for black box modeling

Page 23

Prior function for black-box modeling

Solution: confidence intervals capture where the true cost function likely lies.
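As an illustration of such a prior (not CherryPick's exact code), a Gaussian process fit to two observed runs yields a mean cost estimate plus a confidence interval for every other configuration. The feature values and costs below are made up:

```python
# Illustrative only: a Gaussian process prior over configurations, fit to two
# observed runs, gives a mean estimate and a confidence interval everywhere else.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Feature vectors (CPU cores, RAM in GB, machine count); values are made up.
X_observed = np.array([[16, 122, 4],
                       [32,  60, 8]], dtype=float)
y_observed = np.array([12.0, 9.5])        # measured cost of the two runs (made up)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)

X_candidates = np.array([[16, 30, 8], [64, 244, 4]], dtype=float)
mean, std = gp.predict(X_candidates, return_std=True)
for x, m, s in zip(X_candidates, mean, std):
    # roughly a 95% confidence interval on the predicted cost
    print(x, f"cost ~ {m:.1f} +/- {1.96 * s:.1f}")
```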


Page 25

Acquisition function for choosing the next config

Build an acquisition function on top of the prior function: for each candidate, calculate the expected improvement over the current best configuration.
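A standard expected-improvement acquisition for cost minimization, written against the Gaussian process sketched earlier; an illustration rather than the paper's exact formulation:

```python
# Expected improvement (EI) for minimizing cost under a GP posterior:
# EI(x) = E[max(best_cost - f(x), 0)].
import numpy as np
from scipy.stats import norm

def expected_improvement(gp, X_candidates, best_cost, xi=0.01):
    mean, std = gp.predict(X_candidates, return_std=True)
    std = np.maximum(std, 1e-9)              # guard against zero predictive variance
    improvement = best_cost - mean - xi      # expected cost reduction vs. best so far
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)

# The candidate with the highest EI becomes the next configuration to run, e.g.:
# next_config = X_candidates[np.argmax(expected_improvement(gp, X_candidates,
#                                                           y_observed.min()))]
```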

Page 26

Acquisition function for choosing the next config

The candidate with the highest expected improvement is chosen next.

Page 27

Acquisition function for choosing the next config

CherryPick then runs the newly chosen configuration.

Page 28

Iterative searching

The new measurement is used to update the model, and the search repeats.

Page 29

CherryPick searches in the area that matters

(Figure: a database application with 66 candidate configurations.)


Page 32

Noise in the cloud

- Noise is common in the cloud
  - The shared environment brings inherent noise from other tenants
  - It gets even noisier under failures and stragglers
  - Strawman solution: run each configuration multiple times, but the overhead is high
- Bayesian optimization handles additive noise well
  - Merge the noise, learned from monitoring or historical data, into the confidence interval (as sketched below)
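One way to fold a measured noise level into the prior, as an illustration only: scikit-learn's GaussianProcessRegressor takes an observation-noise variance through its alpha parameter, which widens the confidence intervals accordingly. The numbers are made up:

```python
# Illustrative: account for additive measurement noise in the GP prior by passing
# an estimated noise variance (from monitoring or historical runs) as `alpha`.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X = np.array([[16, 122, 4], [32, 60, 8]], dtype=float)   # made-up feature vectors
y = np.array([12.0, 9.5])                                  # made-up measured costs
noise_std = 0.5                                            # assumed noise level

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                              alpha=noise_std ** 2,        # additive noise variance
                              normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(np.array([[64, 244, 4]], dtype=float), return_std=True)
print(f"predicted cost {mean[0]:.1f} with uncertainty {std[0]:.1f}")
```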

Page 33

Challenge: multiplicative noise

Much of the noise in the cloud is multiplicative.

- Example: a VM's I/O slows down by 20% due to overloading
  - Writing 1 GB to disk (normally 5 s) now takes 6 s
  - Writing 10 GB to disk (normally 50 s) now takes 60 s

Apply the log function to the cost to turn multiplicative noise into additive noise, bounded by the maximum relative variance in a given cloud (see the sketch below).
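A tiny numeric check of the log trick, using the slide's example numbers: the absolute slowdown grows with job size, but after a log transform the shift is the same constant, so an additive-noise model can absorb it.

```python
# A 20% I/O slowdown is multiplicative in running time, but becomes a constant
# additive shift (log 1.2) after a log transform of the cost.
import math

normal = [5.0, 50.0]                    # normal write times for 1 GB and 10 GB (s)
slowed = [t * 1.2 for t in normal]      # the same writes under a 20% slowdown

for t0, t1 in zip(normal, slowed):
    print(f"raw difference: {t1 - t0:4.1f} s   "
          f"log difference: {math.log(t1) - math.log(t0):.3f}")
```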

Page 34

Further customizations

- Discretize features
  - Deals with infeasible configs and a large search space
- Stopping condition
  - Trades off accuracy against search cost
  - Uses the acquisition function as the knob (e.g., stop when the expected improvement is small)
- Starting condition
  - The initial samples should cover the whole space
  - Uses quasi-random sampling (see the sketch below)
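An illustrative way to pick well-spread starting points from a discretized space is a quasi-random (Sobol) sequence; the grid values and SciPy-based snapping below are assumptions for the sketch, not the paper's settings:

```python
# Illustrative: choose starting configurations with a quasi-random Sobol sequence
# so the initial runs cover the discretized feature space evenly.
from scipy.stats import qmc

# Example discretized feature space (values are made up).
cores_per_vm = [4, 8, 16, 32]
ram_per_vm_gb = [15, 30, 61, 122]
cluster_sizes = [4, 8, 12, 16]
space = [cores_per_vm, ram_per_vm_gb, cluster_sizes]

sampler = qmc.Sobol(d=len(space), scramble=True, seed=0)
unit_points = sampler.random(n=4)               # quasi-random points in [0, 1)^3

# Snap each unit point onto the discrete grid to obtain starting configurations.
starts = [tuple(dim[int(u * len(dim))] for u, dim in zip(point, space))
          for point in unit_points]
print(starts)
```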

Page 35

Evaluation settings

5 big-data benchmarks

- Database: TPC-DS, TPC-H
- MapReduce: TeraSort
- Machine learning: SparkML regression, SparkML k-means

66 cloud configurations

- 30 GB to 854 GB of RAM
- 12 to 112 cores
- 5 machine types

Page 36

Evaluation settings

Metrics

- Accuracy: running cost compared to the optimal configuration
- Overhead: searching cost of suggesting a configuration

Comparing with

- Searching: random search, coordinate descent
- Modeling: Ernest [NSDI '16]

Page 37

CherryPick has high accuracy with low overhead

(Figure for TPC-DS: running cost normalized by the optimal config (%) on the y-axis vs. searching cost normalized by CherryPick's median (%) on the x-axis, comparing CherryPick, random search, coordinate search, and Ernest. Annotations: coordinate search is not accurate, with 20% more searching cost in the median and 78% higher running cost in the tail; Ernest is not adaptable; a further annotation marks 3.7 times the searching cost.)

Page 38

CherryPick corrects Amazon guidelines

- Machine type
  - AWS suggests R3, I2, and M4 as good options for running TPC-DS
  - CherryPick found that, at our scale, C4 is the best
- Cluster size
  - AWS has no suggestion
  - A bad cluster size can cost 3 times more than the right cluster config
- Combining the two choices is even harder

Page 39

Conclusions

Goals: adaptivity, low overhead, high accuracy

- Black-box modeling: requires running the cloud configuration
- Restricted amount of information: only a few runs are available

“... do not solve a more general problem ... try to get the answer that you really need” —Vladimir Vapnik

Page 40

Thanks!


Please try our tool at:

https://github.com/yale-cns/cherrypick