RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...


Page 1: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

RedPlane: Enabling Fault-Tolerant Stateful In-Switch Applications

Daehyeok Kim§‡

Jacob Nelson‡, Dan Ports‡, Vyas Sekar§, Srinivasan Seshan§

§Carnegie Mellon University ‡Microsoft


Page 2: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Programmable networks are stateful

Classical switches: “stateless” packet processing

Match: IP addr/prefix ➔ Action: Forward (port)

Programmable “data plane” switches (programmable switching ASICs): “stateful” packet processing

Match: (IP addr, port) ➔ Stateful action: NAT (IP addr, port)

Stateful in-switch applications: network functions, monitoring, accelerating distributed systems

Page 3: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Problem: Switch failure

Switch failures are prevalent [1, 2]

Match (Flow ID): (10.0.0.1, 4321)

NAT action (IP, Port): (192.168.10.1, 1234)

NAT 1

NAT 2

Flow state does not exist! ➔ Connection broken

Other stateful apps suffer from the same problem!

[1] Liu et al., CrystalNet: Faithfully emulating large production networks. In ACM SOSP 2017.

[2] Meza et al., A large scale study of data center network reliability. In ACM IMC 2017.

Page 4: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Strawman solutions

S1: Checkpoint-recovery

S2: Chain replication among switches

Application (data plane)

Switch control plane

External state store

Requires a custom routing policy

Updates can be lost

Consumes additional resources

Mismatch between the control and data plane performance ➔ miss some state / packet drops

Page 5: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Our work: RedPlane

One big fault-tolerant switch abstraction!

App code + RedPlane P4 API

APIs that allow easy integration with apps

RedPlane-enabled app

Correct state replication entirely in the data plane

External state store

Inexpensive replicated state store on commodity servers

Page 6: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Outline

RedPlane motivation

RedPlane design

Results


Page 7: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

RedPlane design overview

Developer

App code + RedPlane P4 API

P4 Compiler

RedPlane-enabled app

External state store

Page 8: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Challenge 1: Correct replication in the data plane

Strawman: strict correctness used in server-based replicated systems

RedPlane-enabled app

External state store

Replication requests

Responses

Buffer a packet until the state is replicated (exactly-once semantics)

Ensure replication messages are delivered in order and reliably (linearizability)

Page 9: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Challenge 1: Correct replication in the data plane

Strawman: strict correctness used in server-based replicated systems

RedPlane-enabled app

External state store

Replication requests

Responses

Expensive to buffer entire packets

Expensive to realize reliable transport in the switch data plane

Page 10: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Linearizable mode: Relaxed correctness

Insight: End-to-end network apps already tolerate lossy networks!

Our approach: Linearizability-based relaxed correctness

RedPlane-enabled app

External state store

Permitting some input/output packet loss ➔ No need to buffer entire packets

In-order and reliable message delivery ➔ Provides linearizability

Page 11: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Basic RedPlane protocol: Realizing the linearizable mode in the data plane

Example: per-flow packet counter

RedPlane-enabled app

External state store

Init (k=Red)

External state: Red: 0

Piggyback an output packet instead of buffering locally!

1. Sends a state initialization request

Page 12: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Basic RedPlane protocol: Realizing the linearizable mode in the data plane

Example: per-flow packet counter

RedPlane-enabled app

External state store

ACK (k=Red)

Switch local state: Red: 1

External state: Red: 0

1. Sends a state initialization request

2. Receives an ACK & initializes the local state

Page 13: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Basic RedPlane protocol: Realizing the linearizable mode in the data plane

Example: per-flow packet counter

RedPlane-enabled app

External state store

Replication (k=Red, v=1)

Switch local state: Red: 1

External state: Red: 1

1. Sends a state initialization request

2. Receives an ACK & initializes the local state

3. Replicates the updated state

Page 14: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Basic RedPlane protocol: Realizing the linearizable mode in the data plane

Example: per-flow packet counter

RedPlane-enabled app

External state store

ACK (k=Red)

Switch local state: Red: 1

External state: Red: 1

1. Sends a state initialization request

2. Receives an ACK & initializes the local state

3. Replicates the updated state

4. Receives an ACK & releases the output packet
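The four steps above can be sketched as a small single-threaded simulation. This is an illustrative model only, not the RedPlane P4 code: the class names `StateStore` and `Switch`, the message dictionaries, and the choice to return the piggybacked packet on every ACK are assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class StateStore:
    """In-memory stand-in for the external state store (hypothetical)."""
    state: dict = field(default_factory=dict)

    def handle(self, msg: dict) -> dict:
        key = msg["key"]
        if msg["type"] == "init":
            # Create the entry on first contact; reply with the stored value.
            value = self.state.setdefault(key, 0)
        elif msg["type"] == "replicate":
            value = self.state[key] = msg["value"]
        else:
            raise ValueError(f"unknown message type: {msg['type']}")
        # The piggybacked packet rides back on the ACK (assumed behavior).
        return {"type": "ack", "key": key, "value": value, "packet": msg["packet"]}


@dataclass
class Switch:
    """Hypothetical model of a RedPlane-enabled per-flow packet counter."""
    store: StateStore
    local: dict = field(default_factory=dict)

    def process(self, key: str, packet: str) -> str:
        if key not in self.local:
            # Steps 1-2: send Init with the packet piggybacked, then
            # initialize the local state from the ACK.
            ack = self.store.handle({"type": "init", "key": key, "packet": packet})
            self.local[key] = ack["value"]
            packet = ack["packet"]
        # Application logic: count the packet.
        self.local[key] += 1
        # Step 3: replicate the updated state; the output packet travels
        # with the request instead of being buffered on the switch.
        ack = self.store.handle(
            {"type": "replicate", "key": key, "value": self.local[key], "packet": packet}
        )
        # Step 4: the ACK returns the output packet, which is released.
        return ack["packet"]
```

With `sw = Switch(StateStore())`, calling `sw.process("Red", "P")` walks through Init, ACK, Replication (k=Red, v=1), and the final ACK, leaving both the switch local state and the external state at Red: 1.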

Page 15: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Inconsistency due to unreliable channel

Problem: state in the switch and state store can be inconsistent due to out-of-order requests or request packet loss

RedPlane-enabled app

External state store

Replication (k=Red, v=1)

Switch local state: Red: 1

External state: Red: 0

Inconsistent!

Page 16: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Sequencing and lightweight retransmission

Our approach: A simple UDP-based transport with sequencing and lightweight retransmission

RedPlane-enabled app

External state store

Replication (k=Red, v=1, seq=1)

Switch local state: Red: 1

External state: Red: 0

Buffers only the RedPlane header by leveraging the packet mirroring & truncating feature in the ASIC

Commits the latest requests only based on a sequence number
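A minimal sketch of the two ideas on this slide, assuming per-key sequence numbers that start at 1. `SequencedStore` and `ReplicationSender` are hypothetical names; the real system implements the switch side in P4 with mirrored, truncated headers, which this Python model only imitates by keeping `(key, seq) -> value` entries.

```python
class SequencedStore:
    """State-store side: commit a request only if it is newer than the last one."""

    def __init__(self):
        self.values = {}    # key -> committed value
        self.last_seq = {}  # key -> highest committed sequence number

    def on_replicate(self, key, value, seq):
        if seq > self.last_seq.get(key, 0):
            self.values[key] = value
            self.last_seq[key] = seq
        # Stale or duplicate requests are ignored; the ACK always reports
        # the highest committed sequence number.
        return self.last_seq.get(key, 0)


class ReplicationSender:
    """Switch side: remember only the small request header until it is ACKed."""

    def __init__(self):
        self.next_seq = {}  # key -> last sequence number assigned
        self.pending = {}   # (key, seq) -> value, i.e. header only, no payload

    def make_request(self, key, value):
        seq = self.next_seq.get(key, 0) + 1
        self.next_seq[key] = seq
        self.pending[(key, seq)] = value
        return (key, value, seq)

    def on_ack(self, key, acked_seq):
        # An ACK for seq n covers every earlier request for the same key.
        for k, s in list(self.pending):
            if k == key and s <= acked_seq:
                del self.pending[(k, s)]

    def retransmit(self):
        # Called on a timeout: resend every request that is still unacknowledged.
        return [(k, v, s) for (k, s), v in sorted(self.pending.items())]
```

If request seq=2 arrives before seq=1, the store commits v=2 and later ignores the reordered seq=1, so the external state never moves backwards. A lost request stays in `pending` and is resent by `retransmit()`.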

Page 17: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Challenge 2: Transparent to routing policies

A switch failure or recovery can cause traffic to be rerouted to another switch

Ensure that a packet accesses the correct state irrespective of the location of the current switch

RedPlane-enabled apps on Switch-1, Switch-2, and Switch-3

External state store

Page 18: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Accessing stale state

Problem: A packet may access stale state during failover or recovery

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 2

Switch local state: Red: 2

Link failure → Local state is still alive

Page 19: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Accessing stale state

Problem: A packet may access stale state during failover or recovery

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 4

Switch local states: Red: 4 and Red: 2

Page 20: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Accessing stale state

Problem: A packet may access stale state during failover or recovery

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 4

Switch local states: Red: 4 and Red: 2

Red: 2 ➔ Reads stale state!

Page 21: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Lease-based state ownership management

Our approach: For a given flow, ensuring only one switch processes packets at a time using leases

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 4

Switch local states: Red: 4 and Red: 2

Lease state: Red: Switch-2

Page 22: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Lease-based state ownership management

Our approach: For a given flow, ensuring only one switch processes packets at a time using leases

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 4

Switch local states: Red: 4 and Red: 2

Lease state: Red: Switch-2

Waits until the current lease expires

Page 23: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Lease-based state ownership management

Our approach: For a given flow, ensuring only one switch processes packets at a time using leases

RedPlane-enabled apps on Switch-1 and Switch-2

External state store

External state: Red: 4

Switch local states: Red: 4 and Red: 4

Lease state: Red: Switch-1
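The lease handover shown across these slides can be sketched as follows. This is a simplified model under stated assumptions: `LeaseStore`, its method names, the explicit `now` argument, and the rule that a lease is never renewed by replication are all choices made for illustration, not details taken from the talk.

```python
class LeaseStore:
    """Hypothetical state store that grants one lease per flow key at a time."""

    def __init__(self, lease_period: float):
        self.lease_period = lease_period
        self.values = {}  # flow key -> latest replicated value
        self.leases = {}  # flow key -> (owner switch, expiry time)

    def request_lease(self, key, switch, now):
        owner, expiry = self.leases.get(key, (None, 0.0))
        if owner not in (None, switch) and now < expiry:
            # Another switch still owns the flow: the requester must wait
            # until the current lease expires.
            return None
        self.leases[key] = (switch, now + self.lease_period)
        # The grant carries the latest state, so the new owner never
        # starts from a stale local copy.
        return self.values.setdefault(key, 0)

    def replicate(self, key, switch, value, now):
        owner, expiry = self.leases.get(key, (None, 0.0))
        if owner != switch or now >= expiry:
            return False  # only the current lease holder may update the state
        self.values[key] = value
        return True
```

In the slides' scenario, Switch-2 holds the lease for Red and has replicated Red: 4. A request from Switch-1 returns `None` while that lease is valid; once it expires, the same request succeeds and returns 4, and the lease state becomes Red: Switch-1.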

Page 24: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Challenge 3: Handling high traffic volume

Switch data plane operates at up to a few billion packets per second

RedPlane-enabled app

External state store

Unable to keep up with high replication request rate

High performance overhead

Page 25: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Challenge 3: Handling high traffic volume

Switch data plane operates at up to a few billion packets per second

RedPlane-enabled app

External state store

“Bounded-inconsistency mode” for write-centric applications

Lease-based state management allows local reads for read-centric apps
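The slide names the bounded-inconsistency mode without describing its mechanism. One way such a mode could work, shown here purely as an assumption, is to update state locally on every packet and replicate a snapshot only once per fixed period, so the external copy lags by a bounded amount. The class name, the `period` parameter, and the plain dictionary used as the store are all hypothetical.

```python
class BoundedInconsistencyCounter:
    """Write-centric counter that replicates periodic snapshots (assumed design)."""

    def __init__(self, store: dict, period: float):
        self.store = store      # stand-in for the external state store
        self.period = period    # maximum time between snapshots
        self.local = {}         # flow key -> packet count on the switch
        self.last_sync = 0.0

    def on_packet(self, key, now):
        # Every packet updates local state without waiting for replication.
        self.local[key] = self.local.get(key, 0) + 1
        if now - self.last_sync >= self.period:
            # Replicate one snapshot instead of one request per packet.
            self.store.update(self.local)
            self.last_sync = now
        return self.local[key]
```

In this sketch the replication request rate depends on `period` rather than on the packet rate, at the cost that updates made since the last snapshot can be lost on failure.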

Page 26: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Putting it all together

RedPlane provides a fault-tolerant state store abstraction to applications

RedPlane-enabled app

RedPlane protocol

External state store

Linearizability-based correctness definition (Correctness)

Bounded-inconsistency for write-centric applications (Correctness, Performance)

Lease-based state management (Routing agnostic, Performance)

Sequencing and lightweight retransmission mechanism (Correctness)

Page 27: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Outline

RedPlane motivation

RedPlane design

Results


Page 28: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Implementation

Developer

App code + RedPlane P4 API

P4 Compiler

RedPlane-enabled app

External state store

RedPlane protocol

P4 modules

Various P4 applications for evaluation

Replicated state store on servers in C++

Six switches (2 programmable, 4 regular) and 10 commodity servers

Page 29: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

How does RedPlane affect application latency?

[Figure: CDF of latency (μs; axis ticks at 1, 10, 100) for Switch-NAT (w/o FT), RedPlane-NAT, and Server FT-NAT. Annotations: "State initialization overhead (once per flow)" and "6X".]

Page 30: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

How much BW overhead does RedPlane add?

[Figure: BW consumption (%) per application (NAT, Firewall, Load balancer, EPC-SGW, Sync-Counter), split into original packets, RedPlane requests, and RedPlane responses. Data labels, in the order shown: 99.8, 99.9, 99.9, 91.4, 48.8, 25.6, 25.6.]

Page 31: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

How fast can connectivity be recovered?

[Figure: Throughput (Gbps) over time (sec; 0 to 60) for Baseline (no failure), Failure, and Failure+RedPlane. Annotations: "Switch-1 failed" and "Switch-1 recovered".]

Page 32: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Other results

Throughput of RedPlane-enabled applications

Low switch resource overhead of reliable replication protocol

Less than 13% of switch ASIC resource usage

Model checking of the RedPlane protocol using TLA+

Page 33: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Future directions

Better support for write-centric apps

Supporting non-partitionable states

Automatically enabling fault-tolerance with compiler/language support

Next generation switch architectures for fault-tolerance


Page 34: RedPlane: Enabling Fault-Tolerant Stateful In-Switch ...

Conclusions

Switch failures can affect the correctness of stateful in-switch apps

RedPlane provides a fault-tolerant state store abstraction

• Linearizability-based practical correctness definition for in-switch apps

• Bounded inconsistency mode for write-centric apps

• Sequencing and lightweight retransmission for reliable replication

• Lease-based state ownership management

Offers fault tolerance with minimal performance and resource overhead

• No per-packet latency overhead for read-centric apps

• End-to-end connectivity is recovered within a second

github.com/daehyeok-kim/redplane-public