Distributed Autonomous Virtual Resource Management in Datacenters Using Finite-Markov Decision Process
Liuhua Chen, Haiying Shen and Karan Sapra
Department of Electrical and Computer Engineering, Clemson University

Page 1:

Distributed Autonomous Virtual Resource Management in Datacenters Using Finite-Markov Decision Process

Liuhua Chen, Haiying Shen and Karan Sapra

Department of Electrical and Computer Engineering, Clemson University

Page 2:

Load Balancing is Critical For IaaS Cloud

[Figure: a datacenter, with a DC network connecting PMs that each host VMs. VM = Virtual Machine; PM = Physical Machine]

o Increase resource utilization
o Reduce response time / SLA violations
o Increase profit

Page 3:

Existing LB Methods Have Limitations

Reactive Methods
- Reactively perform VM migration upon the occurrence of PM overloads
× Selecting migration VMs and destination PMs generates high delay and high overhead

Proactive Methods
- Predict VM resource demand over a short horizon for sufficient resource provisioning or load balancing
× Cannot guarantee a long-term load balance state

Page 4:

Requirements for High Performance LB

o Proactively handle the potential load imbalance problem
o Generate low overhead and delay for load balancing
o Maintain a long-term load balance state

Page 5:

Solution for High Performance LB

o Solution: Finite-Markov Decision Process (MDP)

× Consider current situation
× Consider next step
o Consider many steps

[Figure: example MDP with two states (s1, s2) and two actions (a1, a2); each transition is labeled with a transition probability and a reward]

State / Action / Transition probability / Reward

Page 6:

MDP-based Load Balancing

[Figure: workflow. The MDP model combines PM states, PM actions, transition probabilities and rewards; the optimal action policy π is computed from it and periodically updated from the resource utilization monitor; each PM periodically checks its own state, refers to π, and takes the prescribed action.]

o Determine s, a and r
o Monitor & calculate transition probabilities
o Compute the optimal policy π
o Each PM checks its own state, refers to π, and takes the prescribed action
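The per-PM step above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `pm` is a hypothetical object exposing `current_state()` and `migrate_out()`, and `policy` is the precomputed π mapping PM states to actions.

```python
def balance_step(pm, policy):
    """One round of the per-PM control loop: check own state, refer to
    the precomputed optimal policy pi, and take the prescribed action.
    `pm` is a hypothetical helper object; a state missing from `policy`
    means "no migration"."""
    s = pm.current_state()      # resource utilization monitor
    action = policy.get(s)      # refer to pi
    if action is not None:
        pm.migrate_out(action)  # move out a VM in the given load state
    return action
```

In the full system this step would run periodically; the experiments later in the deck use a 300-second interval.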

Page 7:

Advantages of MDP-based Load Balancing

o Achieves long-term load balance, hence reduces SLA violations (SLA between tenant and provider)

o Builds one MDP used by all PMs

o Reduces the overhead and delay of load balancing

Page 8:

Challenges of the MDP Design

1. MDP components must be well designed for low overhead
   - VM resource utilization changes over time
   - VMs have continuous resource utilization values
   → Large action space; frequent updates of transition probabilities

2. Transition probabilities in the MDP must be stable
   - Probability changes over time
   - Probability changes over load thresholds
   → Difficult to determine the load threshold for states; frequent updates of probabilities

Page 9:

Solution to Challenge 1

o Action: moving out a specific VM
  - needs to record the state transitions of a PM for moving out each VM
  - generates a prohibitive cost due to the many VMs
  - is not accurate due to time-varying VM load

Page 10:

Solution to Challenge 1

VM/PM states (3 × 3 grid over CPU and MEM utilization levels):

            CPU: Low   Med   High
  MEM Low:        s9    s6    s3
  MEM Med:        s8    s5    s2
  MEM High:       s7    s4    s1

o Action: moving out a VM state (high, med, low)
  - uses a PM load state as a state
  - records the transitions between PM load states by moving out a VM in a load state
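The discretization above can be sketched as a simple mapping from continuous utilization to levels. The 0.3/0.7 thresholds are illustrative assumptions (the paper's thresholds are not given on this slide), and the state numbering follows my reading of the flattened grid:

```python
def level(u, low=0.3, high=0.7):
    """Discretize a utilization value in [0, 1] into a level index:
    0 = Low, 1 = Med, 2 = High. Thresholds are illustrative."""
    return 0 if u < low else (1 if u < high else 2)

def state_id(cpu, mem):
    """Map (CPU, MEM) utilization to one of the nine states in the grid
    above: s9 = (Low, Low), ..., s1 = (High, High)."""
    return 9 - 3 * level(cpu) - level(mem)
```

Because both dimensions are reduced to three fixed levels, the state and action sets stay small and constant even as individual VM loads vary.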

Page 11:

Solution to Challenge 1

o Action: moving out a VM state (high, med, low)
  - uses a PM load state as a state
  - records the transitions between PM load states by moving out a VM in a load state

o Advantages:
  - the total number of VM-states in the action set does not change
  - each VM-state itself does not change, so the associated transition probability does not change

Page 12:

Challenge 2

o The MDP's transition probabilities should be stable; otherwise, the MDP
  - cannot accurately provide guidance
  - must be updated very frequently to keep the transition probabilities accurate

o We have studied transition probabilities on VM migrations based on traces (Google Cluster trace and PlanetLab VM trace):
  - stable under slightly varying load thresholds
  - stable over time
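One way to obtain such trace-based probabilities is frequency counting over observed (state, action, next-state) transitions. This is a generic sketch of that idea, not the paper's exact estimator:

```python
from collections import defaultdict

def estimate_transition_probs(transitions):
    """Estimate P(s' | s, a) by normalized frequency counts over
    observed (s, a, s_next) triples, e.g. extracted from the Google
    Cluster or PlanetLab traces."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
        for sa, nexts in counts.items()
    }
```

The stability result on the next slide is what makes this viable: because the estimated probabilities barely change over time or under slightly different load thresholds, they do not need frequent re-estimation.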

Page 13:

Trace Study on Transition Probabilities

[Figure: two plots of transition probability (0.0–1.0) for the actions VM-high, VM-med and VM-low; left: vs. load threshold (0.7, 0.8, 0.9); right: vs. simulation time (25, 50, 75 hr)]

Stable under varying threshold and over time

The probabilities that PM-high transits to PM-medium by taking different actions, using the Google Cluster trace

Page 14:

MDP Components

o State: classification of the resource utilization of a PM

o Action: a migration of a VM in a certain state (VM-state)

o Transition probability: the probability that state s will transit to state s' after taking action a

o Reward: given after transitioning to state s' from state s by taking action a

[Figure: example MDP model with states s1, s2; actions a1, a2; transition probabilities and rewards labeling each transition]

Page 15:

Rewarding Policy

Rewards for state changes:
- PM-high → PM-low: positive reward A
- PM-high → PM-med: positive reward B, with B > A

Rewards for no action:
- PM-low stays PM-low: positive reward C
- PM-med stays PM-med: positive reward D, with C > D
- PM-high stays PM-high: negative reward

• Goal: avoid the heavily loaded state of PMs while constraining the number of VM migrations
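A minimal encoding of this rewarding policy. The numeric constants are made up for illustration; only the orderings B > A and C > D and the negative overload penalty come from the slide:

```python
# Illustrative constants; the policy only requires B > A, C > D, and a
# negative reward for staying overloaded.
A, B, C, D, OVERLOAD_PENALTY = 4.0, 8.0, 2.0, 1.0, -5.0

def reward(state, action, next_state):
    """Reward for moving from `state` to `next_state` via `action`;
    action is None when no VM is migrated out."""
    if action is None:
        if state == "PM-high":
            return OVERLOAD_PENALTY       # stayed overloaded without acting
        return C if state == "PM-low" else D  # C > D: reward light load
    if state == "PM-high" and next_state == "PM-med":
        return B                          # B > A, as specified on the slide
    if state == "PM-high" and next_state == "PM-low":
        return A
    return 0.0
```

Rewarding no-ops in the low and medium states is what constrains the number of migrations: doing nothing is only penalized when the PM is overloaded.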

Page 16:

MDP-based Load Balancing

[Figure: the workflow from page 6, repeated: the MDP model (PM states, PM actions, transition probabilities, rewards) yields the optimal action policy π, periodically updated from the resource utilization monitor; each PM periodically checks its state, refers to π, and takes action.]

Page 17:

Optimal Action Determination

Value-iteration algorithm: find, for each state, the action that can most quickly lead to the maximum reward. The value V of a state is calculated from the transition probabilities and rewards.

Basic idea:
• Set V = 0
• Iteratively update V by the Bellman equations
• Update the optimal actions for the states in each iteration

The number of iterations indicates how many future steps are considered
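The steps above are standard value iteration. A minimal sketch, assuming transition probabilities `P[(s, a)] -> {s': prob}` and rewards `R[(s, a, s')]` are given; the discount factor and iteration count are illustrative choices, not values from the paper:

```python
def value_iteration(states, actions, P, R, gamma=0.9, n_iter=50):
    """Compute an optimal policy pi for a finite MDP. Each iteration
    looks one more step into the future, matching the note above that
    the iteration count sets the planning horizon."""
    V = {s: 0.0 for s in states}           # Set V = 0
    pi = {s: None for s in states}
    for _ in range(n_iter):
        new_V = {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                if (s, a) not in P:        # action unavailable in s
                    continue
                # Bellman backup: expected reward plus discounted value
                q = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                if q > best_q:
                    best_a, best_q = a, q
            new_V[s], pi[s] = best_q, best_a   # update V and the action
        V = new_V
    return pi, V
```

Because the state and action sets are tiny (nine PM states, three VM-state actions), this computation is cheap and a single policy π can be shared by all PMs.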

Page 18:

Experimental Setup

Simulator: CloudSim

Traces: PlanetLab VM trace, Google Cluster trace

Two implemented versions:
- MDP: uses the MDP model for migration VM selection
- MDP*: uses the model for both migration VM selection and destination PM selection

Comparison methods: Sandpiper [1], CloudScale [2]

[Figure: simulated datacenter with trace-driven VMs and PMs (×1000 and ×100) connected by a DC network]

Load balancing is conducted every 300 seconds, 30 times

[1] T. Wood, P. J. Shenoy, A. Venkataramani, and M. S. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proc. of NSDI, 2007.
[2] Z. Shen, S. Subbiah, X. Gu, and J. Wilkes. CloudScale: Elastic resource scaling for multi-tenant cloud systems. In Proc. of SOCC, 2011.

Page 19:

The Number of VM Migrations

[Figure: total number of VM migrations (0–80) vs. load (1.5×, 2×, 2.5× the original trace load) for MDP, MDP*, Sandpiper and CloudScale]

Result: MDP* < MDP < Sandpiper < CloudScale
Conclusion: MDP* has the least number of migrations, i.e., the lowest load balancing overhead

Page 20:

The Number of Overloaded PMs

[Figure: total number of overloaded PMs (0–50) vs. load (1.5×, 2×, 2.5× the original trace load) for MDP, MDP*, Sandpiper and CloudScale]

Result: MDP* < MDP < Sandpiper < CloudScale
Conclusion: MDP* has the least number of overloads, i.e., maintains the load balance state for the longest time

Page 21:

The CPU Time Consumption

[Figure: CPU time (0–6 ms) vs. VM/PM ratio (2.5, 3, 3.5) for MDP, MDP*, Sandpiper and CloudScale]

Result: MDP* < MDP < CloudScale < Sandpiper
Conclusion: MDP* is the fastest in load balancing

Page 22:

Total Energy Consumption

[Figure: energy consumption (0–6 kWh) vs. VM/PM ratio (2.5, 3, 3.5) for MDP, MDP*, Sandpiper and CloudScale]

Result: MDP* < MDP < Sandpiper < CloudScale
Conclusion: MDP* consumes the least amount of energy

Page 23:

Conclusion from Experimental Results

o The MDP model:
  - maintains the load balance state for a longer time
  - reduces SLA violations
  - reduces the load balancing overhead and delay

Page 24:

Summary

o Motivation: provide a load balancing method that can reduce SLA violations and meanwhile reduce the load balancing overhead and delay

o We propose an MDP-based load balancing method as an online decision-making strategy to enhance the performance of cloud datacenters

o We conducted trace-driven experiments to show that our method outperforms previous reactive and proactive methods

o Future work: make our method fully distributed to increase its scalability