ACM Symposium on Cloud Computing (SoCC) - Hotspot ...

Hotspot mitigation for the masses Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Nagpal, Ramesh Chandra ACM Symposium On Cloud Computing 2019

Page 1:

Hotspot mitigation for the masses

Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Nagpal, Ramesh Chandra

ACM Symposium On Cloud Computing 2019

Page 2:

Enterprise cloud company

~ 15,000 customers worldwide

~ 40,000 private cloud deployments

Page 3:

Private clouds

From converged: SAN based, remote I/Os

to hyper-converged infrastructures (HCI): distributed file-system favouring local I/Os, one controller VM per node

Page 4:

602 private clouds

~ 4-node clusters, 13 VMs per node, long-tail distribution

~ 1.31:1 vCPU/thread, up to 9:1

~25% CPU, ~2% I/Os (dynamic allocation), ~44% memory (static allocation)

small clusters and beefy nodes fit SMB needs

oversubscribed cores

moderate load

no relationship between dimensions

see the distributions in the paper

Page 5:

Fix hotspots induced by dynamic resource allocation

Cron based, threshold based

NP-hard, no holy grail

Scheduler specialisation may alter its applicability

[Figure: cpu and mem usage]

Acropolis Dynamic Scheduler (ADS)
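
The threshold-based trigger mentioned on this slide can be illustrated with a minimal sketch. The slides do not give the threshold value or the data model ADS uses, so the 85% threshold, the `Node` class and the `hot_nodes` helper below are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: the slides do not state the value ADS uses.
CPU_THRESHOLD = 0.85


@dataclass
class Node:
    name: str
    cpu_capacity: float  # any unit, e.g. GHz
    cpu_used: float      # measured consumption, same unit


def hot_nodes(nodes, threshold=CPU_THRESHOLD):
    """Return the names of the nodes whose CPU utilisation exceeds the threshold."""
    return [n.name for n in nodes if n.cpu_used / n.cpu_capacity > threshold]


cluster = [
    Node("n1", cpu_capacity=100.0, cpu_used=92.0),
    Node("n2", cpu_capacity=100.0, cpu_used=40.0),
    Node("n3", cpu_capacity=100.0, cpu_used=25.0),
]
print(hot_nodes(cluster))  # ['n1']
```

A cron-based trigger would presumably run such a check periodically and call the scheduler only when the returned list is non-empty.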

Page 6:

Doing great for the 1%

Page 7:

~10,000 customers, 130+ countries, +50% growth YoY

Established in 2009

Tech unicorn in 2013

~3,700 employees

Doing ok for the 99%

Page 8:

Inside ADS

Exact approach on top of BtrPlace

Constraint programming backend

Resource model

Consumptions retrieved from monitoring system

Resource demand is a projection plus conditional scale-up

Storage controller CPU usage as a proxy for I/O usage

Objective

Minimise data movement

Tend to balance

Actuation

VM migrations (up to 2 in parallel)

Admin notification upon no solutions
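
To make the idea of an exact approach that minimises data movement concrete, here is a toy sketch. It is not BtrPlace and does not use constraint programming: it swaps in a brute-force enumeration that is only practical for a handful of VMs. The 85% threshold, the `mitigate` function and the use of a per-VM size as the cost of moving it are assumptions made for illustration.

```python
from itertools import product


def mitigate(vms, nodes, placement, threshold=0.85):
    """Exhaustively search for the cheapest hotspot-free placement.

    vms:       {vm: (cpu_demand, move_cost)}  (move_cost is a hypothetical
               stand-in for the amount of data moved by a migration)
    nodes:     {node: cpu_capacity}
    placement: {vm: node}, the current placement
    Returns (total_cost, new_placement), or None when no placement keeps
    every node under the threshold.
    """
    vm_names = sorted(vms)
    node_names = sorted(nodes)
    best = None
    for candidate in product(node_names, repeat=len(vm_names)):
        # Compute the CPU load each node would carry under this candidate.
        load = {n: 0.0 for n in node_names}
        for vm, node in zip(vm_names, candidate):
            load[node] += vms[vm][0]
        if any(load[n] > threshold * nodes[n] for n in node_names):
            continue  # still has a hotspot
        # Cost: sum of move_cost over the VMs that change node.
        cost = sum(vms[vm][1]
                   for vm, node in zip(vm_names, candidate)
                   if placement[vm] != node)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(vm_names, candidate)))
    return best


nodes = {"n1": 100.0, "n2": 100.0}
vms = {"a": (50.0, 8), "b": (40.0, 2), "c": (20.0, 4)}
placement = {"a": "n1", "b": "n1", "c": "n2"}  # n1 carries 90 > 85: hotspot

print(mitigate(vms, nodes, placement))
# (2, {'a': 'n1', 'b': 'n2', 'c': 'n2'})
```

When `mitigate` returns `None`, no plan exists under the modelled constraints, which is the situation where the slide mentions notifying the administrator.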

Page 9:

Lessons learnt

Looking at 2,668 clusters that called ADS at least once

Page 10:

Service latency is good enough

0.5% undecidable problems

Working with an exact approach

Continuous search helps yield better mitigation plans

de-facto sizing limit

Scale beyond sizing limits

[Figure: duration (sec.) of the first solution, last solution and latency at the 90th, 93rd, 94th, 95th and 99th percentiles]

[Figure: CCDF of saved migrations (%)]

[Figure: duration (sec.) by cluster size (4 to 256 nodes)]

In the paper: engineering particularities

Page 11:

FEAT-42

Still NP-hard, still no holy grail

Optimise to reduce undecidable rate, migrations

Beware of false quick wins

The dataset bias dilemma

Looking for workload agnostic optimisations

Page 12:

Low overall load, local hotspots.

Manage only supposedly misplaced VMs

Pin "well placed" VMs

Local search to reduce the problem size

Available in BtrPlace

Enabled in ADS 1.0 during the prototyping phase
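
The problem-size reduction described on this slide can be sketched as a filtering step that runs before the solver. The slides do not say how BtrPlace or ADS decides that a VM is misplaced; the sketch below assumes, for illustration only, that a VM is misplaced when it runs on a node above a hypothetical 85% CPU threshold. `split_vms` is a made-up helper name.

```python
def split_vms(placement, demand, capacity, threshold=0.85):
    """Split VMs into (managed, pinned) lists.

    placement: {vm: node}, demand: {vm: cpu_demand}, capacity: {node: cpu_capacity}
    Assumption: a VM is "supposedly misplaced" when its node is a hotspot.
    Managed VMs are handed to the solver; pinned VMs must stay where they are.
    """
    load = {n: 0.0 for n in capacity}
    for vm, node in placement.items():
        load[node] += demand[vm]
    hot = {n for n in capacity if load[n] > threshold * capacity[n]}
    managed = sorted(vm for vm, node in placement.items() if node in hot)
    pinned = sorted(vm for vm, node in placement.items() if node not in hot)
    return managed, pinned


placement = {"a": "n1", "b": "n1", "c": "n2", "d": "n3"}
demand = {"a": 50.0, "b": 40.0, "c": 20.0, "d": 10.0}
capacity = {"n1": 100.0, "n2": 100.0, "n3": 100.0}

print(split_vms(placement, demand, capacity))
# (['a', 'b'], ['c', 'd'])
```

With only two managed VMs instead of four, the solver explores far fewer placements, at the price of never considering a plan that moves a pinned VM.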

Page 13:

Local search considered useful and harmful

Over-filtering issues reported

Moved to a 2-phase resolution

[Figure: improvement wrt. full resolution (%) in latency, migrations and solved problems, for three strategies: retry without local search on timeout, retry without local search if unsolvable, pure local search. Plotted values: 62.99, 1.38, 0.3; 34.41, 1.61, 0.22; 32.26, 1.62, −2.34]

Local search enabled, then disabled if needed

Trigger reconsidered over time
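
The 2-phase resolution can be sketched as follows, using the "retry if unsolvable" trigger named in the figure. The toy `solve` function is a brute-force stand-in for the real solver, and the 85% threshold, the function names and the "VMs on hot nodes are the only movable ones" rule are assumptions. The example is built so that local search over-filters: fixing the hotspot requires moving a VM that local search had pinned.

```python
from itertools import product

THRESHOLD = 0.85  # hypothetical value, not taken from the slides


def solve(demand, capacity, placement, local_search):
    """Toy exact solver: fewest migrations that remove every hotspot, or None."""
    nodes = sorted(capacity)
    load = {n: sum(d for vm, d in demand.items() if placement[vm] == n)
            for n in nodes}
    hot = {n for n in nodes if load[n] > THRESHOLD * capacity[n]}
    # With local search, only VMs hosted on hot nodes may move (assumption).
    movable = sorted(vm for vm in demand
                     if not local_search or placement[vm] in hot)
    best = None
    for targets in product(nodes, repeat=len(movable)):
        candidate = dict(placement)
        candidate.update(zip(movable, targets))
        new_load = {n: 0.0 for n in nodes}
        for vm, node in candidate.items():
            new_load[node] += demand[vm]
        if any(new_load[n] > THRESHOLD * capacity[n] for n in nodes):
            continue
        moves = sum(1 for vm in demand if candidate[vm] != placement[vm])
        if best is None or moves < best[0]:
            best = (moves, candidate)
    return None if best is None else best[1]


def solve_two_phase(demand, capacity, placement):
    """Phase 1 uses local search; phase 2 retries without it if unsolvable."""
    plan = solve(demand, capacity, placement, local_search=True)
    if plan is not None:
        return plan, "local search"
    return solve(demand, capacity, placement, local_search=False), "full resolution"


capacity = {"n1": 100.0, "n2": 100.0, "n3": 100.0}
demand = {"a": 60.0, "b": 30.0, "c": 40.0, "d": 20.0, "e": 40.0, "f": 20.0}
placement = {"a": "n1", "b": "n1", "c": "n2", "d": "n2", "e": "n3", "f": "n3"}

plan, phase = solve_two_phase(demand, capacity, placement)
print(phase, plan)
```

Here n1 carries 90 and neither `a` nor `b` fits on n2 or n3 as they stand, so phase 1 reports no solution; phase 2 finds a plan with two migrations by also moving one of the small VMs that phase 1 had pinned.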

Page 14:

Practical effectiveness

73.28% if ADS issues a plan

12.24% if unsolvable

Complex to analyse without A/B testing

The success rate is a consequence of subjective modelling choices

How many clusters in a clean state after a call to ADS?

Page 15:

Conclusion

It is about supporting diverse workloads

Not all enhancements are safe

Tools and knowledge bases are crucial

Incremental improvements from observation: small wins matter

Trading quality for capability

It is not about developing a new feature, it is about checking its side effects

Exhibit and characterise outliers

Test changes to detect regressions