
OnCall: Defeating Traffic Spikes with a Free-Market Application Cluster

James Norris • Keith Coleman • Armando Fox • George Candea

Stanford University

Motivation

CNN.com: September 11

4x traffic in a single day

8x traffic on second day

Offline for 2.5 hours, diminished service afterwards

[Chart: CNN.com page views. Average: 40 M; 9/11: 162.4 M; 9/12: 337.4 M]

Slashdot Effect

Variable Traffic

Ticket sales

Contests

etc.

What to do?

Three Options

One Option: Overprovision

+ Works for steady state fluctuations (but not optimal)

– Too expensive for spike conditions (8x servers for CNN)

Another Option: Graceful Degradation

+ Provides basic service continuity

– Full features (including revenue-generating features) may be lost

Better Option: Dynamic Allocation

What is OnCall?

OnCall is…

a cluster management system designed to multiplex several (possibly competing) dynamic web applications onto a single cluster.

Goal: Make spike handling possible while providing useful resource guarantees to all apps

Solution: Marketplace of Applications

Applications rent and lend computing resources according to pre-defined market policies

Generic Platform

Based on VMs

Application generic

Fast app swapping

Marketplace

Market Rounds

Offline

Each application is assigned ownership of G computers at a fixed price (or rate)

Online

1. Determine market equilibrium price, P, by querying each application

2. Calculate new allocation sizes at price P

3. Adjust allocations, moving computers from sellers to buyers

4. Repeat every time quantum, t

Offline Market: G

“G”

Each app “owns” G nodes

Resource guarantees

Never have to sell: no matter what the price or what other apps’ demands, an app is guaranteed use of its G nodes

Can lend by choice (if there are renters at desired price)

Can rent extra nodes (if it needs to and/or can afford to)

Online Market

[Diagram: the Marketplace queries three application Policies; 10 nodes in cluster]

Marketplace: How many nodes do you want for $5 each?

Policies: 7 nodes, 5 nodes, 2 nodes

Marketplace: 7 + 5 + 2 = 14, but I only have 10 nodes!

Marketplace: How many nodes do you want for $10 each?

Policies: 5 nodes, 3 nodes, 2 nodes

Marketplace: 5 + 3 + 2 = 10. Perfect!

Online Market: Policies

Inputs:

From Marketplace

Price P

Performance stats: CPU usage, Disk I/O, etc.

From Application

Application inputs: Time of day, Historical usage

Output: # of computers desired at price P

Example Market Policy

• For each round, application A computes the number of nodes, n, it needs to handle current traffic

• Ex: Application A has a price threshold of $6:

– If (P < $6), A will ask for n nodes

– If (P ≥ $6), A will only ask for min(n, G) nodes – it can’t afford to rent extras
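The threshold policy above can be sketched as a small function. This is only an illustration under stated assumptions: the function name, its parameters, and the sample values of n and G are hypothetical; only the $6 threshold and the min(n, G) rule come from the slide.

```python
def nodes_requested(price, n, G, threshold=6.0):
    """Nodes application A asks for at `price` (hypothetical helper).

    n: nodes A needs to handle current traffic.
    G: nodes A owns.
    threshold: the $6 price threshold from the example.
    """
    if price < threshold:
        return n            # cheap enough: ask for everything it needs
    return min(n, G)        # too expensive: never ask for more than it owns


# Spike (n > G): demand drops to G once the price reaches the threshold.
print([nodes_requested(p, n=5, G=3) for p in (0, 4, 6, 10)])  # [5, 5, 3, 3]

# No spike (n < G): demand stays at n at every price.
print([nodes_requested(p, n=2, G=3) for p in (0, 4, 6, 10)])  # [2, 2, 2, 2]
```

The two printed lists correspond to the two cases named on the slide: n > G (spike) and n < G (no spike).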

[Figures: Nodes Requested vs. Price (P) for the example policy, one panel for n < G (no spike) and one for n > G (spike); each panel marks n, G, and the price threshold]

Finding the Equilibrium

[Figures: “Individual Policy Functions” and “Combined Policy Functions”, each plotting Nodes against Price]

• Sample points along the different policy functions

• Determine the price at which the total number of nodes desired by all apps equals the total number of nodes available on the cluster
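A minimal sketch of the sampling approach described in the two bullets above. The function name, the three step-shaped policies, and their thresholds are hypothetical; they are chosen only so that total demand is 14 at $5 and 10 at $10 on a 10-node cluster. Because sampled demand moves in steps, the sketch accepts the lowest sampled price at which total demand fits within the cluster, which is an assumption rather than the deck's stated rule of exact equality.

```python
def find_equilibrium_price(policies, total_nodes, prices):
    """Scan sampled prices upward; return (price, demands) for the first
    price at which total demand fits in the cluster, or None."""
    for price in sorted(prices):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= total_nodes:
            return price, demands
    return None


# Hypothetical policies: each maps a price to the number of nodes desired.
def app_a(price):
    return 7 if price < 10 else 5

def app_b(price):
    return 5 if price < 8 else 3

def app_c(price):
    return 2

result = find_equilibrium_price([app_a, app_b, app_c],
                                total_nodes=10,
                                prices=range(1, 13))
print(result)  # (10, [5, 3, 2])
```

A finer price grid, or a bisection over price, would be a natural refinement when the policy functions are monotonically non-increasing in price.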

Competitive vs Cooperative

Competitive Environments

Ex: ASP, where app owners may be in competition

Cooperative Environments

Ex: Search engine, Yahoogle

Quick Case Study

App 1: Paid web search (very high value in low latency)

App 2: Ad-supported web search (high value in low latency)

App 3: Crawler (latency OK, starvation not)

For each app, model utility of running at a given time

Benefit: If you add an app, just need to model that app, not remodel whole system

Platform

Platform Overview

[Diagram components]

Internet

L7 Load Balancers

Network Attached Storage containing Application VM capsules

Cluster node running VMM with OnCall Manager & Marketplace

Cluster nodes running VMMs, OnCall Responders, and Application VMs

Does this work?

Simulation Testbed

Three Simulations, Four Traits

– Spike handling under unconstrained resources

– Spike handling under constrained resources

– Resource guarantees

– Fast server activation

U.C. Berkeley X Cluster

– 30 Nodes (double CNN.com)

– Dual 1 GHz PIII, 1.5 GB RAM

– VMware GSX Server on Linux

Sim 1: Spike Handling

• G = 10 for both apps

• App 1 handles spikes, App 2 makes $$

• Notice: Lag time between node assigned and node active

[Figures: App 1 and App 2. Each plots # Assigned, # Active, and Usage (# Nodes, left axis) and Price (right axis) against Market Round]

Sim 2: Resource Constraints

• G1 = 12, G2 = 6, G3 = 12

• App 1 has higher budget than App 2, but both spike

• App 1 handles spikes, App 2 sees guarantee, App 3 makes $$

• App 2 buys more when App 1’s spike subsides

[Figures: App 1, App 2, and App 3. Each plots # Nodes (left axis) and Price (right axis) against Market Round]

Sim 3: Fast Activation

OnCall Optimal: Load VMs from suspended state

OnCall Limited: Load VMs from shutdown state

Standard with OS: OS already installed on node

Standard without OS: Must install OS first

Significance:

• Worst case, > 2x improvement

– When a spike lasts only 30 minutes, this is significant

• If you can start up quickly, an accurate predictor is not critical

Platform               Time until Active (s)
OnCall Optimal         5-10
OnCall Limited         50-120
Standard with OS       270-330
Standard without OS    710-750

Questions?

Notes and Assumptions

Homogeneity Assumption

Cluster is assumed to be homogeneous: all nodes rented at same price (for simplicity)

Swapping Costs

Time delay cost in start up / shut down of an app on a node.

If a rental contract is renewed, app runs on same node.

“P” Only for Extras

Apps only pay price P for nodes above and beyond their own G

Ex: Using 40, G = 30

40 – 30 = 10 nodes at price P
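The arithmetic above, as a short sketch. The function name and the $5 price are hypothetical; the rule that only nodes beyond G are charged at P comes from the slide.

```python
def rental_cost(nodes_used, G, price):
    """Cost of the extra nodes an app rents in one round (hypothetical helper).

    Only nodes beyond the app's own G are charged at the market price P.
    """
    rented = max(0, nodes_used - G)
    return rented * price


# Slide example: using 40 nodes with G = 30 means 10 nodes rented at price P.
print(rental_cost(40, 30, price=5))  # 50

# An app using no more than its own G rents nothing.
print(rental_cost(20, 30, price=5))  # 0
```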

Runtime Operation

Runtime cycle repeats every t

1. Marketplace calculates equilibrium price (and thus application allocations)

2. Manager assigns apps to physical nodes (minimizing shutdowns and startups)

3. Manager signals Responders to shut down and start new apps, as necessary

4. At end of round, Manager gathers new usage stats; reports stats to Market Policies

5. Repeat
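Step 2 of the cycle above can be illustrated with a simple placement routine. This is a sketch, not OnCall's actual Manager logic: the function name, the dictionary representation of the cluster, and the node and app names are all assumptions. It keeps each app on the nodes it already occupies, up to its new allocation, so that only the remaining nodes need a shutdown or startup.

```python
def assign_nodes(current, allocation):
    """Map a new allocation onto physical nodes with as few swaps as possible.

    current:    {node: app or None}, the placement from the previous round.
    allocation: {app: node count} decided by the marketplace for this round.
    Returns the new {node: app or None}.
    """
    if sum(allocation.values()) > len(current):
        raise ValueError("allocation exceeds cluster size")

    remaining = dict(allocation)
    placement = {}
    free = []
    for node, app in current.items():
        if app is not None and remaining.get(app, 0) > 0:
            placement[node] = app      # rental renewed: app stays on same node
            remaining[app] -= 1
        else:
            free.append(node)          # node will be swapped or left idle

    for app, count in remaining.items():
        for _ in range(count):
            placement[free.pop()] = app    # start app on a freed node
    for node in free:
        placement[node] = None             # leftover nodes stay idle
    return placement


current = {"n1": "A", "n2": "A", "n3": "B", "n4": "B"}
new_placement = assign_nodes(current, {"A": 3, "B": 1})
print(new_placement)  # {'n1': 'A', 'n2': 'A', 'n3': 'B', 'n4': 'A'}
```

In this example only node n4 changes apps; the other three nodes keep running what they ran in the previous round.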

Marketplace Optimality

What is “optimal?”

Under resource constraints, those applications with the most utility to derive from the use of additional nodes are given those nodes

Utility Curves

Curve specifies: dollar value an application derives from possessing a certain number of nodes for a specific time quantum.

Trivially: Utility curves are always monotonically non-decreasing (i.e. it is never worse to own more nodes at a given total cost)

To be optimal: Marginal utility curves are always monotonically non-increasing (i.e. every additional node is worth the same or less than the one before)

[Figure: Utility and Marginal Utility curves plotted against Number of Nodes]
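One way to see why non-increasing marginal utility matters: under that assumption, handing out nodes one at a time to whichever app values its next node most yields an allocation with the highest total utility. The sketch below is a centralized illustration of that idea, not the marketplace mechanism itself; the function name, app names, and dollar values are hypothetical.

```python
import heapq


def allocate_by_marginal_utility(marginal_utils, total_nodes):
    """Greedy allocation of nodes by marginal utility.

    marginal_utils: {app: [value of 1st node, value of 2nd node, ...]},
                    each list assumed to be non-increasing.
    Returns {app: nodes granted}.
    """
    allocation = {app: 0 for app in marginal_utils}
    # Max-heap (via negated values) holding each app's value for its next node.
    heap = [(-vals[0], app) for app, vals in marginal_utils.items() if vals]
    heapq.heapify(heap)

    for _ in range(total_nodes):
        if not heap:
            break                      # no app wants another node
        _, app = heapq.heappop(heap)
        allocation[app] += 1
        k = allocation[app]
        vals = marginal_utils[app]
        if k < len(vals):
            heapq.heappush(heap, (-vals[k], app))
    return allocation


utils = {"search": [9, 7, 4, 1], "crawler": [5, 3, 2]}
print(allocate_by_marginal_utility(utils, total_nodes=4))
# {'search': 3, 'crawler': 1}
```

Here the four nodes go to the four largest marginal values (9, 7, 5, 4), for a total utility of 25; any other split of four nodes between the two apps is worth less.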

Profit Through Efficiency

“Shut Down” App

ASP shuts down servers when it can buy them for less than the cost of keeping them running (A/C, utilities, etc.)

ASP can then add additional capacity and sell only when profitable

Marketplace Fairness

Markets are optimal if…

…they are free and fair

Anti-competitive behavior

Monopoly/Oligopoly

Aggressive tactics

Fairness through Regulation

Ensure enough distinct owners, so no monopoly

Fine or ban app that engages in overtly anti-competitive behavior

Future Work

VM caching

Cache VMs to local disk (speculatively or as read from NAS)

Fault tolerance

Add master-backup fault tolerance to the OnCall Manager

Performance statistics

Provide market policies with additional statistics (e.g. end-to-end response time)

Scalable data layer

Add support for scalable persistent stores that would allow replication on the data tier.

Multiplexing

Study trade-offs of running several applications on one node