Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure...

25
Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT), Ishai Menache, Nikolaj Bjorner (MSR), Asaf Valadarsky, Michael Schapira (Hebrew U) Striking the Right Utilization- Availability Balance in the WAN Manya Ghobadi MIT

Transcript of Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure...

Page 1: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT),

Ishai Menache, Nikolaj Bjorner (MSR), Asaf Valadarsky, Michael Schapira (Hebrew U)

Striking the Right Utilization-Availability Balance in the WAN

Manya GhobadiMIT

Page 2: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

𝑥"𝑥#

Control Uncertainty

Loss will be ≤ $100 with probability 99%

$10,000 Gain/Loss𝑥$

𝑦"𝑦#𝑦$

How to invest smartly in the stock market?

Solution: financial risk theory2

Page 3: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Traffic Engineering (TE) problem

10 Gbps

BOS NYC10 Gbps

10 Gbps

Traffic demand:10 Gbps

Throughput/Packet loss

• How to configure the allocation of traffic on network paths?• Goal: efficiently utilizing the network to match the current traffic

demand (periodic process)

𝑥"𝑥#𝑥$

3

Page 4: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Extensive research on TE in a broad variety of environments• Wide-area networks

• Kumar et al. [NSDI’18]• Liu et al. [SIGCOMM’14]• Kumar et al. [SIGCOMM’15]• Jain et al. [SIGCOMM’13]• Hong et al. [SIGCOMM’13]

• ISP networks• Jiang et al.[SIGMETRICS’09] • Kandula et al. [SIGCOMM’05]• Fortz et al. [INFOCOM’2000]

• Data center networks• Alizadeh et al. [SIGCOMM’14]• Akyildiz et al. [Journal of Comp. Nets.’14]• Benson et al. [CoNEXT’11]

4

Page 5: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

TE problem in Wide-Area Networks

5

Microsoft WAN

Page 6: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

TE problem in Wide-Area Networks

Challenging:•Billion dollar infrastructure•High efficiency and availability

[B. Fortz, Internet Traffic Engineering by Optimizing OSPF Weights, INFOCOM’2000]

Solution: •Model the network as a graph•Solve a Linear Program

Objective

Constraints

Microsoft WAN

6

Page 7: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Failures

AvailabilityUtilization

Competing goals: high utilization and availability

7

Page 8: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Failures

AvailabilityUtilization

Competing goals: high utilization and availability

8

Page 9: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Traffic engineering under failures

10 Gbps

BOS NYC10 Gbps

5 Gbps

Robust against k simultaneous link failures

Admissible traffic: 5 Gbps all the time

• Today: optimize for the worst conceivable (potentially unlikely) failure scenarios

• Problem: under-utilizing the network

p(fail) = 10-2

p(fail) = 10-1

p(fail) = 10-2

99.999% of the time9

Page 10: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Traffic engineering under failures

10 Gbps

BOS NYC10 Gbps

5 Gbps

Robust against k simultaneous link failures

Admissible traffic: 5 Gbps 99.999% of the time

p(fail) = 10-2

p(fail) = 10-1

p(fail) = 10-2

Failures

AvailabilityUtilization

10

Page 11: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Our approach to traffic engineering under failures

10 Gbps

BOS NYC

10 Gbps

5 Gbps

p(fail) = 10-2

p(fail) = 10-1

p(fail) = 10-2

Admissible traffic Availability 5 Gbps 99.999%

10 Gbps 99.99%15 Gbps 99.8%20 Gbps 98%

• Use the failure probabilities to reason about the likelihood of failure scenarios• Provide a mathematical probabilistic guarantee for availability

11

Page 12: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Our approach to traffic engineering under failures

10 Gbps

BOS NYC

10 Gbps

5 Gbps

• Use the failure probabilities to reason about the likelihood of failure scenarios• Provide a mathematical probabilistic guarantee for availability

For all flows, 90% of the demand is satisfied 99.9% of the timeFor all flows, loss will be ≤ 10% of the demand 99.9% of the time

demand𝑑'

𝑦"𝑦#𝑦$

Uncertainty vector

𝑥"𝑥#𝑥$

Flow allocation vector

12

Page 13: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

𝑦"𝑦#𝑦$

𝑥"𝑥#𝑥$

Main idea𝑥"𝑥#𝑥$

𝑦"𝑦#𝑦$

The loss will be ≤ $100 with probability 99%Find x that minimizes the loss with probability β

The loss will be ≤ 10% of the demand with probability 99%Find x that minimizes the loss with probability β

demand𝑑'

13

Page 14: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Loss (%)

Prob

abili

ty

Key technique: scenario-based formulation

0 5 10

A failure scenario • One link failure• Correlated link failures

• Unsatisfied demand• Packet loss

Target probability β = 0.95

β = 0.95

VaRᵦ = 10%

Loss ≤ 10% of demand w probability 0.95

14

Page 15: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Prob

abili

ty

0 5 10

β = 0.95

𝜓(𝑥, 𝑉𝑎𝑅) = 𝑃(𝑞|𝐿(𝑥, 𝑦(𝑞)) ≤ 𝑉𝑎𝑅)

𝑚𝑖𝑛{𝑉𝑎𝑅|𝜓(𝑥, 𝑉𝑎𝑅) ≥ 𝛽}

Target probability β = 0.95VaRᵦ = 10%

Key technique: scenario-based formulation

15

Loss (%)

Page 16: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Prob

abili

ty

0 5 10

β = 0.95

𝜓(𝑥, 𝑉𝑎𝑅) = 𝑃(𝑞|𝐿(𝑥, 𝑦(𝑞)) ≤ 𝑉𝑎𝑅)

𝑚𝑖𝑛{𝑉𝑎𝑅|𝜓(𝑥, 𝑉𝑎𝑅) ≥ 𝛽}What about the worst 5% of scenarios?

𝑚𝑖𝑛{𝐸[𝐿𝑜𝑠𝑠|𝐿𝑜𝑠𝑠 ≥ 𝑉𝑎𝑅]}

TeaVaR: Traffic Engineering Applying Value-at-Risk

16

Loss (%)

Page 17: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

• Achieving fairness across network users.

• Enabling computational tractability as the network scales.

• Capturing fast rerouting of traffic in data plane.

• Accounting for correlated failures.

Challenges unique to networking

17

Page 18: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Achieving fairness

Starvation-aware loss function: • Worst case normalized unmet demand

demand𝑑'

Ri: routes for flow ixr: flow allocation on route r ∈ Ri

yr: binary variable indicating if route r is up

Objective: Find x that minimizes the loss with probability β

Satisfied demand for flow i: ΣC∈DE𝑥C𝑦C

𝐿(𝑥, 𝑦) = 𝑚𝑎𝑥'[1 −ΣC∈DE𝑥C𝑦C

𝑑']H

𝑥"𝑥#

𝑦"𝑦#

𝑥$ 𝑦$

18

Page 19: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Handling scale

Google’s B4 topology

Topology # Edges # Scenarios

B4 38 O(1E11)

IBM 48 O(1E14)

MSFT 100 O(1E30)

ATT 112 O(1E33)

98

98.5

99

99.5

100

B4 IBM MSFT ATT

Cove

rage

(%

)

All Scenarios Our approach

0255075

100125

B4 IBM MSFT ATT

Run

time

(s)

19

Page 20: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

System architecture

TeaVaRLinear

Optimization

Topology

Flow allocationsFailure probability

of scenarios

Target availability(0.99, 0.999,..)

Flow demands

𝑥"𝑥#

𝑦"demand

𝑑' 𝑥$𝑦#𝑦$

20

Page 21: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability
Page 22: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Evaluations•Topologies: B4, IBM, ATT, and MSFT•Traffic matrix: •Four months of MSFT traffic matrix (one sample/hour), for the rest of topologies, used 24 TMs from YATES [SOSR’18]

•Tunnel selection:•Our optimization framework is orthogonal to tunnel selection•Oblivious paths, link disjoint paths, and k-shortest paths

•Baselines: •SMORE [NSDI’18]•FFC [SIGCOMM’14]•B4 [SIGCOMM’13]•ECMP

22

Page 23: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Availability vs. demand scale

90

92

94

96

98

100

1.0 1.5 2.1 2.6 3.1

Avai

labi

lity

(%)

Demand Scale

SMORE B4 FFC-1

FFC-2 ECMP TeaVaR

• Availability is measured as the probability mass of scenarios in which demand is fully satisfied (“all-or-nothing” requirement)• If a TE scheme’s bandwidth allocation is unable to fully satisfy demand in 0.1% of

scenarios, it has an availability of 99.9%

23

Page 24: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Robustness to probability estimates

Noise in probability estimations

% error in throughput

1% 1.43%

5% 2.95%

10% 3.07%

15% 3.95%

20% 6.73%

24

Page 25: Striking the Right Utilization- Availability Balance inthe WANA failure scenario •One link failure •Correlated link failures •Unsatisfied demand •Packet loss Target probability

Summary• TeaVaR uses financial risk theory for solving Traffic

Engineering under failures.

• TeaVaR’s approach is applicable to networking resource allocation problems such as capacity planning.

• Code and demo available at: http://teavar.csail.mit.edu/

25