Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in...

34
Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford

Transcript of Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in...

Page 1: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

Failure Resilient RoutingSimple Failure Recovery with Load Balancing

Martin Suchara

in collaboration with:D. Xu, R. Doverspike,

D. Johnson and J. Rexford

Page 2: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

2

Acknowledgement

Special thanks to Olivier Bonaventure, Quynh Nguyen, Kostas Oikonomou, Rakesh Sinha, Robert Tarjan, Kobus van der Merwe, and Jennifer Yates.

We gratefully acknowledge the support of the DARPA CORONET Program, Contract N00173-08-C-2011. The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

Page 3: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

3

Failure Recovery and Traffic Engineering in IP Networks Uninterrupted data delivery when links or

routers fail

Major goal: re-balance the network load after failure

Failure recovery essential for Backbone network operators Large datacenters Local enterprise networks

Page 4: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

4

Overview

I. Failure recovery: the challenges

II. Architecture: goals and proposed design

III. Optimizations: of routing and load balancing

IV. Evaluation: using synthetic and realistic topologies

V. Conclusion

Page 5: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

5

Challenges of Failure Recovery Existing solutions reroute traffic to avoid failures

Can use, e.g., MPLS local or global protection

Prompt failure detection Global path protection is slow

Balance the traffic after rerouting Challenging with local path protection

primary tunnel

backup tunnel

primary tunnel

backup tunnellocal

global

Page 6: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

6

Overview

I. Failure recovery: the challenges

II. Architecture: goals and proposed design

III. Optimizations: of routing and load balancing

IV. Evaluation: using synthetic and realistic topologies

V. Conclusion

Page 7: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

7

Architectural Goals

3. Detect and respond to failures quickly

1. Simplify the network Allow use of minimalist cheap routers Simplify network management

2. Balance the load Before, during, and after each failure

Page 8: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

8

The Architecture – Components

Management system Knows topology, approximate traffic

demands, potential failures Sets up multiple paths and calculates load

splitting ratios

Minimal functionality in routers Path-level failure notification Static configuration No coordination with other routers

Page 9: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

9

The Architecture• topology design• list of shared risks• traffic demands

t

s

• fixed paths• splitting ratios

0.25

0.25

0.5

Page 10: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

10

The Architecture

t

slink cut

• fixed paths• splitting ratios

0.25

0.25

0.5

Page 11: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

11

The Architecture

t

slink cut

• fixed paths• splitting ratios

0.25

0.25

0.5path probing

Page 12: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

12

The Architecture

t

slink cutpath

probing

• fixed paths• splitting ratios

0.5

0.5

0

Page 13: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

13

The Architecture: Summary

1. Offline optimizations

2. Load balancing on end-to-end paths

3. Path-level failure detection

How to calculate the paths and

splitting ratios?

Page 14: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

14

Overview

I. Failure recovery: the challenges

II. Architecture: goals and proposed design

III. Optimizations: of routing and load balancing

IV. Evaluation: using synthetic and realistic topologies

V. Conclusion

Page 15: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

15

Goal I: Find Paths Resilient to Failures

A working path needed for each allowed failure state (shared risk link group)

Example of failure states:S = {e1}, { e2}, { e3}, { e4}, { e5}, {e1, e2}, {e1, e5}

e1 e3e2e4 e5

R1 R2

Page 16: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

16

Goal II: Minimize Link Loads

minimize ∑s ws∑e

Φ(ues)

while routing all trafficlink utilization ue

s

costΦ(ues)

aggregate congestion cost weighted for all failures:

links indexed by e

ues =1

Cost function is a penalty for approaching capacity

failure state weight

failure states indexed by s

Page 17: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

17

Our Three Solutions

capabilities of routers

cong

estio

n

Suboptimal solution

Solution not scalable

Good performance and practical?

Too simple solutions do not do well Diminishing returns when adding functionality

Page 18: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

18

1. Optimal Solution: State Per Network Failure Edge router must learn which links failed

Custom splitting ratios for each failure

0.40.4

0.2

Failure Splitting Ratios

- 0.4, 0.4, 0.2

e40.7, 0, 0.3

e1 & e20, 0.6, 0.4

… …

configuration:

0.7

0.3

e4e3

e1 e2

e5 e6

one entry per failure

Page 19: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

19

1. Optimal Solution: State Per Network Failure Solve a classical multicommodity flow for each

failure case s:

min load balancing objectives.t. flow conservation

demand satisfaction edge flow non-negativity

Decompose edge flow into paths and splitting ratios

Does not scale with number of potential failure states

Page 20: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

20

2. State-Dependent Splitting: Per Observable Failure Edge router observes which paths failed Custom splitting ratios for each observed

combination of failed paths

0.40.4

0.2

Failure Splitting Ratios

- 0.4, 0.4, 0.2

p2 0.6, 0, 0.4

… …

configuration:

0.6

0.4

p1

p2

p3

NP-hard unless paths are fixed

at most 2#paths entries

Page 21: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

21

2. State-Dependent Splitting: Per Observable Failure

If paths fixed, can find optimal splitting ratios:

Heuristic: use the same paths as the optimal solution

min load balancing objectives.t. flow conservation

demand satisfaction path flow non-negativity

Page 22: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

22

3. State-Independent Splitting: Across All Failure Scenarios Edge router observes which paths failed Fixed splitting ratios for all observable failures

0.40.4

0.2

p1, p2, p3:

0.4, 0.4, 0.2

configuration:

0.667

0.333

Non-convex optimization even with fixed paths

p1

p2

p3

Page 23: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

23

3. State-Independent Splitting: Across All Failure Scenarios

Heuristic to compute splitting ratios Use averages of the optimal solution

weighted by all failure case weights

Heuristic to compute paths The same paths as the optimal solution

ri = ∑s ws ris

fraction of traffic on the i-th path

Page 24: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

24

Our Three Solutions

1. Optimal solution

2. State-dependent splitting

3. State-independent splitting

How well do they work in practice?

Page 25: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

25

Overview

I. Failure recovery: the challenges

II. Architecture: goals and proposed design

III. Optimizations: of routing and load balancing

IV. Evaluation: using synthetic and realistic topologies

V. Conclusion

Page 26: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

26

Simulations on a Range of Topologies

Single link failures

Shared risk failures for the tier-1 topology 954 failures, up to 20 links simultaneously

Page 27: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

27

Congestion Cost – Tier-1 IP Backbone with SRLG Failures

increasing load

Additional router capabilities improve performance up to a point

obje

ctiv

e va

lue

network traffic

State-dependent splitting indistinguishable from optimum

State-independent splitting not optimal but simple

How do we compare to OSPF? Use optimized OSPF link weights [Fortz, Thorup ’02].

Page 28: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

28

Congestion Cost – Tier-1 IP Backbone with SRLG Failures

increasing load

OSPF uses equal splitting on shortest paths. This restriction makes the performance worse.

obje

ctiv

e va

lue

network traffic

OSPF with optimized link weights can be suboptimal

Page 29: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

29

Average Traffic Propagation Delay in Tier-1 Backbone Service Level Agreements guarantee 37 ms

mean traffic propagation delay

Need to ensure mean delay doesn’t increase much

Page 30: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

30

Number of Paths– Tier-1 IP Backbone with SRLG Failures

Number of paths almost independent of the load

number of paths

cdf

number of paths

For higher traffic load slightly more paths

Page 31: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

31

Number of Paths – Various Topologies

More paths for larger and more diverse topologies

number of pathsnumber of paths

cdf

Greatest number of paths in the tier-1 backbone

Page 32: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

32

Overview

I. Failure recovery: the challenges

II. Architecture: goals and proposed design

III. Optimizations: of routing and load balancing

IV. Evaluation: using synthetic and realistic topologies

V. Conclusion

Page 33: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

33

Conclusion Simple mechanism combining path protection

and traffic engineering

Favorable properties of state-dependent splitting algorithm:

Path-level failure information is just as good as complete failure information

Page 34: Failure Resilient Routing Simple Failure Recovery with Load Balancing Martin Suchara in collaboration with: D. Xu, R. Doverspike, D. Johnson and J. Rexford.

34

Thank You!