Traffic Engineering with Forward Fault Correction (FFC) Hongqiang “Harry” Liu, Srikanth Kandula,...

Page 1:

Traffic Engineering with Forward Fault Correction (FFC)

Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University)

Page 2:

Cloud services require large network capacity

• Cloud services generate growing traffic on cloud networks.
• Cloud networks are expensive (e.g. cost of WAN: $100M/year).

Page 3:

TE is critical to effectively utilizing networks

Traffic Engineering

WAN Network
• Microsoft SWAN
• Google B4
• ……

Datacenter Network
• DevoFlow
• MicroTE
• ……

Page 4:

But, TE is also vulnerable to faults

[Figure: a TE controller and the network. The controller reads the network view and writes the network configuration, with frequent updates for high utilization.]

• Control-plane faults
• Data-plane faults

Page 5:

Control plane faults

Failures or long delays to configure a network device

[Figure: the TE Controller sends TE configurations to a Switch.]

• RPC failure
• Firmware bugs
• Memory shortage
• Overloaded CPU

Page 6:

Congestion due to control plane faults

Link Capacity: 10

New flows (traffic demands): s1→s2 (10), s1→s3 (10), s1→s4 (10)

Flows: s2→s4 (10), s3→s4 (10)

[Figure: traffic distributions over a four-switch topology (s1, s2, s3, s4) before and after a TE update. A configuration failure during the update leaves a link congested.]

Page 7:

Data plane faults

Link and switch failures

Rescaling: Sending traffic proportionally to residual paths

Link Capacity: 10

Flows: s2→s4 (10), s3→s4 (10)

[Figure: the four-switch topology (s1, s2, s3, s4) before and after link failures. After rescaling, one link is congested.]
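
The rescaling rule above can be sketched in a few lines. This is only an illustration: the function name `rescale`, the path labels, and the 7/3 split are assumptions chosen to resemble the slide's numbers, not values taken from the authors' system.

```python
def rescale(path_load, failed_paths):
    """Redistribute a flow's traffic over its residual paths.

    path_load: dict mapping a path label to the traffic currently sent on it.
    failed_paths: set of path labels that are no longer usable.
    The flow's total traffic is kept and split in proportion to the
    loads that the surviving paths already carried.
    """
    total = sum(path_load.values())
    residual = {p: load for p, load in path_load.items() if p not in failed_paths}
    residual_total = sum(residual.values())
    if residual_total == 0:
        # No surviving path carried traffic, so there is no proportion to follow.
        raise ValueError("no residual path carries traffic")
    return {p: total * load / residual_total for p, load in residual.items()}


# Hypothetical flow of size 10 split 7/3 over two paths; the direct path fails.
after_failure = rescale({"s2-s4": 7, "s2-s1-s4": 3}, {"s2-s4"})
```

Here the surviving path ends up carrying the whole flow (10 units), which is how rescaling can overload a link that was previously within capacity.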

Page 8:

Control and data plane faults in practice

In production networks:
• Faults are common.
• Faults cause severe congestion.

Control plane: fault rate = 0.1% to 1% per TE update.

Data plane: fault rate = 25% per 5 minutes.

Page 9:

State of the art for handling faults

• Heavy over-provisioning
  Drawback: big loss in throughput
• Reactive handling of faults
  • Control plane faults: retry
  • Data plane faults: re-compute TE and update networks
  Drawbacks: cannot prevent congestion; slow (seconds to minutes); blocked by control plane faults

Page 10:

How about handling congestion proactively?

Page 11:

Forward fault correction (FFC) in TE

• [Bad News] Individual faults are unpredictable.
• [Good News] The number of simultaneous faults is small.

• Packet loss: FEC guarantees no information loss under up to k arbitrary packet drops, with careful data encoding.
• Network faults: FFC guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.

Page 12:

Example: FFC for control plane faults

Link Capacity: 10

[Figure: Non-FFC. Traffic distributions over the four-switch topology (s1, s2, s3, s4) before and after a TE update; a configuration failure causes congestion.]

Page 13:

Example: FFC for control plane faults

Link Capacity: 10

[Figure: Control Plane FFC (k=1). Traffic distributions over the four-switch topology (s1, s2, s3, s4) before and after a TE update, with two configuration-failure cases shown; neither is labeled as congested.]

Page 14:

Trade-off: network utilization vs. robustness

• Non-FFC: Throughput: 50
• K=1 (Control Plane FFC): Throughput: 47
• K=2 (Control Plane FFC): Throughput: 44

[Figure: one traffic distribution over the four-switch topology (s1, s2, s3, s4) per configuration.]

Page 15:

Systematically realizing FFC in TE

• Formulation: How to merge FFC into existing TE framework?
• Computation: How to find FFC-TE efficiently?

Page 16:

Basic TE linear programming formulations

LP formulations

• TE decisions: sizes of flows $b_f$; traffic on paths $l_{f,t}$
• TE objective (max.): maximizing throughput
• Basic TE constraints (s.t.): deliver all granted flows; no overloaded link
• FFC constraints: no overloaded link up to a bounded number of control plane faults, link failures, and switch failures
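
For reference, the basic TE problem outlined above can be written as a path-based LP. This is a sketch under assumptions, not the slide's exact formulation: only $b_f$ and $l_{f,t}$ appear in the transcript, while the demand $d_f$, the path set $T_f$ of flow $f$, and the link capacity $c_e$ are names introduced here for illustration.

```latex
% Assumed symbols (not on the slide): d_f = demand of flow f,
% T_f = paths available to flow f, c_e = capacity of link e.
\begin{align*}
\max \quad & \sum_{f} b_f
  && \text{(maximizing throughput)} \\
\text{s.t.} \quad
& \sum_{t \in T_f} l_{f,t} \ge b_f
  && \forall f \quad \text{(deliver all granted flows)} \\
& \sum_{f} \sum_{t \in T_f,\; t \ni e} l_{f,t} \le c_e
  && \forall e \quad \text{(no overloaded link)} \\
& 0 \le b_f \le d_f, \quad l_{f,t} \ge 0
  && \forall f,\; t \in T_f
\end{align*}
```

The FFC constraints listed on the slide would be added on top of these basic constraints.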

Page 17:

Formulating control plane FFC

Total load under faults?

Flows $f_1$, $f_2$, $f_3$ share the link between s1 and s2. $l_i^{old}$: $f_i$'s load in old TE; $l_i^{new}$: $f_i$'s load in new TE. With k = 1 there are $\binom{3}{1}$ cases:

Fault on $f_1$: $l_1^{old} + l_2^{new} + l_3^{new} \le$ link cap

Fault on $f_2$: $l_1^{new} + l_2^{old} + l_3^{new} \le$ link cap

Fault on $f_3$: $l_1^{new} + l_2^{new} + l_3^{old} \le$ link cap

With n flows and FFC protection k: #constraints = $\binom{n}{k}$ for each link.

Challenge: too many constraints
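
The enumeration above can be sketched as follows. It is a minimal illustration for a single link, assuming (as in the slide's three cases) that a faulty flow keeps its old load and that exactly k flows are faulty per case; cases with fewer than k faults are not enumerated. The function names and the example loads are hypothetical.

```python
from itertools import combinations
from math import comb


def fault_case_loads(old, new, k):
    """Return the link load for every case where exactly k flows stay on old TE."""
    n = len(old)
    loads = {}
    for faulty in combinations(range(n), k):
        # A faulty flow keeps its old load; every other flow moves to its new load.
        loads[faulty] = sum(old[i] if i in faulty else new[i] for i in range(n))
    return loads


def is_ffc_safe(old, new, k, link_cap):
    """True when none of the enumerated fault cases overloads the link."""
    return all(load <= link_cap for load in fault_case_loads(old, new, k).values())


def num_fault_cases(n, k):
    """Number of cases enumerated per link: choose k faulty flows out of n."""
    return comb(n, k)


# Hypothetical loads of three flows on one link of capacity 10.
old_te = [3, 3, 0]
aggressive_new_te = [0, 0, 10]   # fills the link, leaving no slack for a fault
protected_new_te = [0, 0, 7]     # leaves room for one flow stuck on old TE
```

With these numbers `is_ffc_safe(old_te, aggressive_new_te, 1, 10)` is false and `is_ffc_safe(old_te, protected_new_te, 1, 10)` is true. `num_fault_cases(100, 3)` already exceeds 160,000 cases for one link, which is the scaling problem the slide points at.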

Page 18:

An efficient and precise solution to FFC

Total load under faults ≤ link capacity

Total additional load due to faults ≤ link spare capacity

$x_i$: additional load due to fault-i

Given $X = \{x_1, \dots, x_n\}$, FFC requires that the sum of arbitrary k elements in $X$ is ≤ link spare capacity. This is one constraint per choice of k elements: O($\binom{n}{k}$).

Equivalently, taking the mth largest element in $X$ for m = 1, …, k: the sum of the k largest elements ≤ link spare capacity. This is a single constraint.

Expressing the mth largest element with $X$?

Our approach: A lossless compression from O($\binom{n}{k}$) constraints to O(kn) constraints.
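
The equivalence behind this compression can be checked numerically: every choice of k elements fits within the spare capacity exactly when the k largest elements do. The sketch below uses hypothetical function names and example values; it only illustrates the reduction and is not the LP encoding itself.

```python
from heapq import nlargest
from itertools import combinations


def all_subsets_ok(x, k, spare):
    """Brute force: check the sum of every k-element choice from x."""
    return all(sum(choice) <= spare for choice in combinations(x, k))


def k_largest_ok(x, k, spare):
    """Compressed check: only the k largest elements need to be summed."""
    return sum(nlargest(k, x)) <= spare


# Hypothetical additional loads caused by five individual faults.
additional_load = [3, 1, 4, 1, 5]
```

For example, with k = 2 both checks accept a spare capacity of 9 and reject a spare capacity of 8, because the two largest additional loads are 5 and 4.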

Page 19:

Sorting network

A comparison takes two inputs, e.g. $x_1$ and $x_2$, and outputs $\max\{x_1, x_2\}$ and $\min\{x_1, x_2\}$.

[Figure: a sorting network over inputs $x_1, x_2, x_3, x_4$ with intermediate values $z_1, \dots, z_8$, organized into a 1st round and a 2nd round. The outputs labeled 1st largest and 2nd largest are marked.]

1st largest + 2nd largest ≤ link spare capacity

• Complexity: O(kn) additional variables and constraints.

• Throughput: optimal in control-plane and data plane if paths are disjoint.
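
A partial sorting network of this kind can be sketched with plain max/min comparisons. This is an illustration under assumptions: it uses a bubble-style network in which each round pushes the largest remaining value out, which may differ in wiring from the slide's figure, and the function name is hypothetical. In an LP each comparison would become extra variables and constraints rather than a Python call.

```python
def k_largest_via_network(xs, k):
    """Extract the k largest values of xs using only pairwise comparisons.

    Returns (largest, comparisons): the k largest values in descending
    order and the number of compare operations used.
    """
    if not 1 <= k <= len(xs):
        raise ValueError("k must be between 1 and len(xs)")
    values = list(xs)
    largest = []
    comparisons = 0
    for _ in range(k):
        carry = values[0]
        rest = []
        for x in values[1:]:
            # One comparison: the max moves on, the min stays for the next round.
            hi, lo = max(carry, x), min(carry, x)
            comparisons += 1
            carry = hi
            rest.append(lo)
        largest.append(carry)
        values = rest
    return largest, comparisons


# Hypothetical additional loads; two rounds give the 1st and 2nd largest.
top_two, used = k_largest_via_network([3, 1, 4, 1, 5], 2)
```

Each round uses fewer than n comparisons, so k rounds stay within k·n comparisons, in line with the O(kn) bound on the slide.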

Page 20:

FFC extensions

• Differential protection for different traffic priorities
• Minimizing congestion risks without rate limiters
• Control plane faults on rate limiters
• Uncertainty in current TE
• Different TE objectives (e.g. max-min fairness)
• …

Page 21:

Evaluation overview

• Testbed experiment
  • FFC can be implemented in commodity switches
  • FFC has no data loss due to congestion under faults
• Large-scale simulation
  • A WAN network with O(100) switches and O(1000) links
  • Injecting faults based on real failure reports
  • Single priority traffic in a well-provisioned network
  • Multiple priority traffic in a well-utilized network

Page 22:

FFC prevents congestion with negligible throughput loss

[Figure: two bar charts with y-axis "Ratio (%)" from 0 to 100: "FFC Throughput / Optimal Throughput" and "FFC Data-loss / Non-FFC Data-loss". Series: Single priority; High priority (High FFC protection); Medium priority (Low FFC protection); Low priority (No FFC protection). Value labels on the chart: 160% and <0.01%.]

Page 23:

Conclusions

• Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults.

• FFC proactively handles these faults.
  • Guarantee: no congestion when #faults ≤ k.
  • Efficiently computable with low throughput overhead in practice.

[Figure: FFC positioned between heavy network over-provisioning and high risk of congestion.]