Thesis Proposal VARUN GUPTA 1. Thesis Proposal VARUN GUPTA 2.

Stochastic Models and Analysis for Resource

Management in Server Farms

Thesis Proposal

VARUN GUPTA

1

SMART : Stochastic Models and Analysis for Resource

managemenT in server farms

Thesis Proposal

VARUN GUPTA

2

SARCASM : Stochastic models and Analysis for ResourCe

mAnagement in Server farMs

Thesis Proposal

VARUN GUPTA

3

Supercomputers

Data center pods

Cloud computing

Multi-core chipsArray-of-Wimpy-Nodes4

Server farm popularity is on the rise

+ high compute capacity+ incremental growth+ fault-tolerance+ efficient resource utilization+ energy efficiency+ high parallelism

5

BackendserversFrontend

dispatcher/router

Design Choice 1:Which server to send a job to?

Design Choice 2:Scheduling policy

for backend servers?

Design Choice 3: When to turn servers on/off for efficiency?

Design Choice 4: How many servers to buy? Of what capacity?

A simple server farm “template”

6

BackendserversFrontend

dispatcher/router

Design Choice 1:Dispatching policy

Design Choice 2:Scheduling policy

Design Choice 3: Dynamic capacity scaling

Design Choice 4: Provisioning

A simple server farm “template”

Thesis GoalStochastic modeling and analysis to answer the

questions faced by server farm designers/managers

Long history of stochastic modeling and analysis• Erlang (1909): Operator provisioning in telephone exchanges• Inventory/production management• Call center staffing

Several gaps between traditional models and compute server farms• New constraints• New opportunities• New metrics

Bridge these gaps by developing new models andanalysis techniques relevant to requirements of today’sserver farms

7

8

Application GAPDesign Choice

Dispatching

Backend Scheduling

Capacity scaling

Provisioning

Web server farms

No analysis for PS server farms

Energy management

in Data centers

Setup penalties + unpredictable demands

Fully replicated DBs

High variance in job sizes

Database servers Thrashing

VM management

VM migration

and departures

A summary of explored gaps

A glimpse of results

9

10


Dispatching

Backend Scheduling

Capacity scaling

Provisioning

Web server farms


Energy management

in Data centers





VM management

VM migration

and departures


11

Application 1 : Web server farms/cluster computing

Immediatedispatch

12

PS

12

Q: Good load balancing dispatchers? How many servers? Existing work limited to First-Come-First-Served servers or Exponential job size distribution

GAP : Processor Sharing servers + high-variance job sizes

PS

PS

Application 1 : Web server farms/cluster computing

Immediatedispatch

13

PS

13

PS

PS

???Poisson

arrivals

Join-Shortest-Queue (JSQ) : most popular

• Balances load• Greedy

KHomogeneous

Servers

Q: Is JSQ optimal for general job size distribution?Bonomi [90] : Optimal for Exponential job size distribution when job sizes unknown

Q: Analysis of JSQ for general job size distribution?

Model : Dispatching policies for M/G/K-PS

10

12

14

16

18

20

Det

Exp

Bim-1

Wei

b-1

Wei

b-2

Bim-2

MeanResponseTime

RANDOM

???PS

PSSimulation Results

Increasing job-size variance (same mean) 14

10

12

14

16

18

20

Det

Exp

Bim-1

Wei

b-1

Wei

b-2

Bim-2

MeanResponseTime

RANDOM

JSQ

???PS

PS

Increasing job-size variance (same mean)

Simulation Results

15

10

12

14

16

18

20

Det

Exp

Bim-1

Wei

b-1

Wei

b-2

Bim-2

MeanResponseTime

RANDOM

Round-Robin

JSQ

???PS



10

12

14

16

18

20

Det

Exp

Bim-1

Wei

b-1

Wei

b-2

Bim-2

MeanResponseTime

RANDOM

Round-Robin

JSQOPT-0

???PS



18

PS

18

PS

PS

JSQPoisson

arrivals

KHomogeneous

Servers

Conjecture: JSQ is near-optimal (even among size-aware dispatching policies)

Performance of JSQ is “nearly-insensitive” to the job size distribution

Model : Dispatching policies for M/G/K-PS

19

Contribution 1: The Single-Queue-Approximation

Goal : Approx. for mean response time under Exponential job sizes

Compensate for the effect of other queues via state-dependent arrival rates

• λ(n) easier to approximate (only need to worry about λ(1), λ(2))

• < 2% error in mean response time for up to 64 servers

PS

Mn/M/1/PS

λ(n)≈

M/M/K-JSQ/PS

JSQ

PS

PSλ(n) = state-dependent arrival rate

[Performance’07]

Contribution 2: Many-server heavy-traffic analysis (PROPOSED)

Goal 1: Quantify the “near-insensitive” behavior Goal 2: Optimal dispatching policies for heterogeneous servers

Hard to prove anything in general, must resort to limiting regimes

The many-server heavy-traffic scaling

• Shows the effect of job size variability• Intuition into behavior of JSQ 20

JSQλ = K - constant

PS

PS

PS

PS

K → ∞

21


Dispatching

Backend Scheduling

Capacity scaling

Provisioning

Web server farms


Energy management

in Data centers


fully replicated DBs



VM management

VM migration

and departures


22

Q: When to turn servers ON/OFF to adapt to demand?

Existing work assumes zero setup delays, knowledge of future demand pattern

GAP : setup penalties non-zero + unpredictable demand patterns

Application 2 : Energy-Performance trade-off in

Data centers/Cloud computing

23

First-In-First-Out buffer

Poisson

arrivals

ON

ON

SETUP

OFF

Contribution: New traffic-oblivious policy DELAYEDOFF• Servers turn off after idle for twait

• If arrival sees all servers busy, turn a new server ON• Most-Recently-Busy (MRB) dispatching: send job to server which

idled last

Theorem: Under DELAYEDOFF, as the load , the number of ON servers is concentrated around

DELAYEDOFF is asymptotically

optimal

Model : Dynamic capacity scaling in M/M/∞ with setup delays

[Performance’10]

24

Simulation Results for DELAYEDOFF

PROPOSED: Refine DELAYEDOFF, prove performance guarantees

25


Dispatching

Backend Scheduling

Capacity scaling

Provisioning

Web server farms


Energy management

in Data centers


fully replicated DBs



VM management

VM migration

and departures


26

Application 3 : Fully replicated databases

27


Q: How many servers, and what speed?

No exact analysis, approximations good for low job size variance

GAP : Job sizes have very high variance


28


Poisson

arrivals


29


The Holy Grail of queueing theory (model for many other applications)

yet no exact analysis!

Lee-Longton [1959] :

Poisson

arrivalssquared coeff. of

variation of job size dist.

typically C2 > 20

Model : M/G/K/FCFS

30

Contribution 1: Inapproximability results

Goal: No accurate approx. based only on first 2 moments

Pick a subclass of distributions• Analytically tractable • Large enough to fix 2 moments, but wiggle room to prove gap

0

2

4

6

Increasing 3rd moment →

E[Delay]

Lee-Longton Approximation

H2

{G | 2 moments}

[QUESTA’10]

Contribution 2: Tight moment-based bounds

Goal: Better approximation using n moments?

[QUESTA ’10] : Conjectured extremal distributions

M/G/K/FCFS under light-traffic• Extremality should be invariant to load• Verify conjectures for n = 2,3• Also for other queueing systems with no exact analysis

31

E[Delay]

{G | n moments}

?

?

tight bounds | n moments

, 4, 5, 6… proposed work

32


Dispatching

Backend Scheduling

Capacity scaling

Provisioning

Web server farms


Energy management

in Data centers





VM management

VM migration

and departures


33

500MB

1.5GB

1GB

2GB

1GB

1GB

Application 5 : Managing VMs in the cloud

34

Q: Which server to start VM on? What capacity servers to buy?Assumption of permanent itemsGAP : VMs depart + VM migration possible

Contribution : Stochastic bin packing model with job departure/migrationPROPOSED: develop packing/migration schemes for efficient packing

Application 5 : Managing VMs in the cloud

500M

1G

250M

500M

Time line for proposed work

35

Application GAP Status Proposed Work

Web server farms

No analysisfor PS server

farms

70% CompletedNov ‘10-Jan ‘11

• Optimal dispatch policies for heterogeneous servers

• Characterizing “near-insensitivity”

Energymanagement in Data centers

Setup penalties +

unpredictable demands

80% CompletedFeb-Mar ’11

• Refine the proposed DELAYEDOFF policy

• Performance guarantees for traffic- oblivious capacity scaling



CompletedVerifying conjectures on tight

moment- based bounds beyond the scope of the thesis

Database servers

Thrashing Completed

VM managementVM migration

and departures

10% CompletedOct’ 10-Jan ’11

Develop and analyze heuristics for online stochastic bin packing with item departures and migrations

Expected Graduation: MAY2011

# jobs at server

Speed

ON

SETUP

OFF

JSQPS

PS

M/G/K

36

Dispatch

VM

VM

VM

37

[Performance’07]

V. Gupta, M. Harchol-Balter, K. Sigman, and W. Whitt. Analysis of join-the-shortest-queue routing for web server farms. PERFORMANCE 2007

[Performance’10]

V. Gupta, A. Gandhi, M. Harchol-Balter, and M. Kozuch. Optimality analysis of energy-performance trade-off for server farm management. PERFORMANCE 2010.

[QUESTA’10] V. Gupta, J. Dai, M. Harchol-Balter, and B. Zwart. On the inapproximability of M/G/K: why two moments of job size distribution are not enough. Queueing Systems, Vol 64, 2010.

[TR’08] V. Gupta, J. Dai, M. Harchol-Balter, and B. Zwart. The effect of higher moments of job size distribution on the performance of an M/G/K queueing system. Technical Report ,CMU, 2008.

[Sigmetrics’09] V. Gupta and M. Harchol-Balter. Self-adaptive admission control policies for resource-sharing systems. SIGMETRICS 2009.

V. Gupta and T. Osogami, On Markov-Krein characterization of mean sojourn time in M/G/K. Under submission.

Other Work

Time-varying systems

[Sigmetrics’06] V. Gupta, M. Harchol-Balter, A. Scheller-Wolf, and U. Yechiali. Fundamental characteristics of queues with fluctuating load. Sigmetrics 2006

[MAMA’08a] V. Gupta, and P. Harrison. Fluid level in a reservoir with ON-OFF source. MAMA 2008.

Single Server

Scheduling

[MAMA’08b] V. Gupta. Finding the optimal quantum size: Sensitivity analysis of the M/G/1 round-robin queue. MAMA 2008.

[Performance’10b]

V. Gupta, M. Burroughs, and M. Harchol-Balter. Analysis of scheduling policies under correlated job sizes. Performance 2010.

Distributed Data

placement

[INFOCOM’10] S. Borst, V. Gupta, and A. Walid. Distributed caching algorithms for Content Distribution Networks. INFOCOM 2010.

[SOCC’10]H. Amur, J. Cipar, V. Gupta, M. Kozuch, G.Ganger, and K. Schwan, Robust and flexible power-proportional storage. Symposium on Cloud Computing, 2010.

Epidemics [INFOCOM’08]M. Vojnovic, V. Gupta, T. Karagiannis, and C. Gkantsidis. Sampling strategies for epidemic-style information dissemination. INFOCOM 2008.

Stability analysis

A. Busic, V.Gupta, and J. Mairesse. Stability of the bipartite matching model. Under Submission

References

Thesis Proposal VARUN GUPTA 1. Thesis Proposal VARUN GUPTA 2.

Documents

Transcript of Thesis Proposal VARUN GUPTA 1. Thesis Proposal VARUN GUPTA 2.