
Slingshot: Time-Critical Multicast for Clustered Applications

Mahesh Balakrishnan, Stefan Pleisch, Ken Birman

Cornell University

The Contemporary Datacenter

Building-wide super-clusters: 1000s of commodity blade-servers

Typically used as commercial website back-ends: Amazon, etc.

Software Paradigms: SOA, Eventing, Publish/Subscribe…

… many-to-many communication, Multicast!

Multicast in the Datacenter

IP Multicast available: adding reliability to it is a well-researched technology…

Scalability dimensions: Number of receivers, Number of senders? Number of groups?

Metrics: Throughput, Timeliness?

Time-Critical Applications

… dealing in perishable data: stock quotes, location updates

… willing to trade complete reliability for timeliness

… requiring tunable reliability/timeliness/overhead tradeoffs

Probabilistic Guarantee of Timeliness? For x% overhead, y% of lost packets are recovered in time t. Remainder can be optionally recovered in time t’.

Design Space

Reactive vs. Proactive

Reactive: Loss Discovery

ACK

Sender-Based Sequencing

• If the multicast rate in a group is constant, the inter-multicast time at any sender goes up linearly with the number of senders

Gossip – Scalable

Proactive: FEC – Tunable
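The linear growth in inter-multicast time can be checked with a short calculation. This is a minimal sketch, not part of the original slides: the function name `inter_multicast_time_ms` is hypothetical, and it assumes the group rate is split evenly across senders. The 1000 packets per second and 64 senders are the figures from the Evaluation Setup slide.

```python
def inter_multicast_time_ms(group_rate_pps: float, num_senders: int) -> float:
    """Average gap, in milliseconds, between two multicasts from the same sender.

    Assumes (for illustration) that the constant group rate is shared
    evenly by all senders.
    """
    per_sender_rate_pps = group_rate_pps / num_senders
    return 1000.0 / per_sender_rate_pps


if __name__ == "__main__":
    # Constant group rate of 1000 packets per second, growing sender count.
    for senders in (1, 8, 64, 512):
        gap = inter_multicast_time_ms(1000.0, senders)
        print(f"{senders:4d} senders -> one packet every {gap:g} ms per sender")
```

With sender-based sequencing a receiver can only notice a gap when a later packet from the same sender arrives, so this per-sender gap is a lower bound on loss-discovery delay under that scheme.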

Slingshot Overview

Receiver-Based FEC:

Senders send initially via unreliable IP Multicast

Phase 1: Receivers repair losses by proactively sending each other FEC repair packets

Phase 2: Remaining losses are recovered from the sender

Each receiver sends an error correction (XOR) packet to c randomly selected receivers with the last r packets it received

Rate-of-fire parameter (r, c): Allows tuning of overhead-timeliness tradeoff

Protocol Details 0

Two Packet Types:

Data Packet: Packet ID (Sender, SeqNo) | Application Payload

Repair Packet: XOR of Data Packets | List of Data Packet IDs: (sender1,seqno1), (sender2,seqno2)….

Application MTU: 1024, less than Network MTU

Terminology: the data packets XORed into a repair packet are said to be "included" in that repair packet
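The two packet types can be sketched as plain records. This is an illustrative sketch only: the class and function names (`DataPacket`, `RepairPacket`, `make_repair`, `xor_bytes`) are hypothetical, and zero-padding shorter payloads before XORing is an assumption, since the slides do not say how payloads of different lengths are combined.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

PacketId = Tuple[str, int]  # (sender, seqno)


@dataclass(frozen=True)
class DataPacket:
    sender: str
    seqno: int
    payload: bytes  # application payload

    @property
    def packet_id(self) -> PacketId:
        return (self.sender, self.seqno)


@dataclass(frozen=True)
class RepairPacket:
    xor_payload: bytes               # XOR of the included data payloads
    included: Tuple[PacketId, ...]   # IDs of the included data packets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (an assumption)."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


def make_repair(packets: Sequence[DataPacket]) -> RepairPacket:
    """Build a repair packet that includes every packet in `packets`."""
    acc = b""
    for p in packets:
        acc = xor_bytes(acc, p.payload)
    return RepairPacket(acc, tuple(p.packet_id for p in packets))
```

Because XOR is its own inverse, XORing the repair payload with all but one of the included payloads yields the remaining one.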

Protocol Details 1

Data Structures:

Data Buffer: received data packets

Repair Bin: pointers to last <r data packets

Arrival of Data Packet dp at Receiver:

dp is added to the data buffer

&dp (a pointer to dp) is added to the repair bin

If repair bin size equals r, a repair packet rp is created from its contents, and the repair bin is cleared

rp is dispatched to c random receivers
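The data-packet arrival steps above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `Receiver` class, its method names, and the returned `(target, included, xor_payload)` tuples are hypothetical, packet IDs stand in for the pointers held in the repair bin, and actual network sends are left out.

```python
import random
from typing import Dict, List, Optional, Tuple

PacketId = Tuple[str, int]  # (sender, seqno)
Outgoing = Tuple[str, Tuple[PacketId, ...], bytes]  # (target, included IDs, XOR)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (an assumption)."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


class Receiver:
    def __init__(self, name: str, group: List[str], r: int, c: int,
                 seed: Optional[int] = None) -> None:
        self.peers = [p for p in group if p != name]  # never target ourselves
        self.r, self.c = r, c
        self.data_buffer: Dict[PacketId, bytes] = {}  # received data packets
        self.repair_bin: List[PacketId] = []          # references to recent packets
        self.rng = random.Random(seed)

    def on_data(self, sender: str, seqno: int, payload: bytes) -> List[Outgoing]:
        """Handle one data packet; return the repair packets to send, if any."""
        pid = (sender, seqno)
        self.data_buffer[pid] = payload
        self.repair_bin.append(pid)
        if len(self.repair_bin) < self.r:
            return []
        # Bin is full: XOR its contents into one repair packet, then clear it.
        included = tuple(self.repair_bin)
        xor_payload = b""
        for p in included:
            xor_payload = xor_bytes(xor_payload, self.data_buffer[p])
        self.repair_bin.clear()
        # Dispatch the same repair packet to c randomly chosen receivers.
        targets = self.rng.sample(self.peers, min(self.c, len(self.peers)))
        return [(t, included, xor_payload) for t in targets]
```

Returning the outgoing packets instead of sending them keeps the sketch testable without sockets; a real receiver would hand them to its transport.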

Protocol Details 2

Arrival of Repair Packet rp at Receiver:

If #(missing included data packets) is

0: rp is discarded

1: the missing data packet is recovered by XORing rp with the other r-1 data packets

>1: rp is stored in a special buffer, in case future data packet arrivals and recoveries make it usable
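The three cases for an arriving repair packet can be sketched as below. This is an illustrative sketch under stated assumptions: the names `on_repair`, `retry_pending`, and `pending` (standing in for the "special buffer") are hypothetical, payloads are zero-padded before XORing, and the slides do not specify when stored repair packets are re-examined, so the retry loop is one possible policy.

```python
from typing import Dict, List, Optional, Tuple

PacketId = Tuple[str, int]                    # (sender, seqno)
Repair = Tuple[Tuple[PacketId, ...], bytes]   # (included IDs, XOR of payloads)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (an assumption)."""
    n = max(len(a), len(b))
    return bytes(x ^ y for x, y in zip(a.ljust(n, b"\0"), b.ljust(n, b"\0")))


def on_repair(data_buffer: Dict[PacketId, bytes], pending: List[Repair],
              repair: Repair) -> Optional[PacketId]:
    """Apply one repair packet; return the ID it recovered, or None."""
    included, xor_payload = repair
    missing = [pid for pid in included if pid not in data_buffer]
    if not missing:
        return None                 # 0 missing: discard
    if len(missing) > 1:
        pending.append(repair)      # >1 missing: keep for later
        return None
    # Exactly 1 missing: XOR out every packet we already hold.
    acc = xor_payload
    for pid in included:
        if pid != missing[0]:
            acc = xor_bytes(acc, data_buffer[pid])
    data_buffer[missing[0]] = acc
    return missing[0]


def retry_pending(data_buffer: Dict[PacketId, bytes],
                  pending: List[Repair]) -> List[PacketId]:
    """Re-examine stored repair packets until no further recovery happens."""
    recovered: List[PacketId] = []
    progress = True
    while progress:
        progress = False
        for repair in list(pending):
            pending.remove(repair)  # on_repair re-stores it if still unusable
            pid = on_repair(data_buffer, pending, repair)
            if pid is not None:
                recovered.append(pid)
                progress = True
    return recovered
```

A receiver would call `retry_pending` after each data packet arrival or successful recovery, since either can reduce a stored repair packet to a single missing entry.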

Evaluation Setup

64-node rack-style cluster at Cornell

Loss rate fixed at 1%: packets dropped at end buffers

All nodes send and receive

Inter-node latencies = 50-100 microseconds

Group Data Rate: 1000 packets per second

Each node multicasts about 15.6 packets per second (1000/64), i.e. one packet every 64 milliseconds

Slingshot Tunability

For 27% overhead, 93.5% of lost packets are recovered in an average of 3.5 milliseconds

Example Tradeoff Points between Overhead, Timeliness, and Reliability

Overhead and Recovered Packets plotted on left y-axis, Recovery Time on right

Slingshot vs SRM

Slingshot recovers 93% in 10 ms, 97% in 25 ms

Fastest SRM packet recovery is 2.2 seconds

SRM recovers 93% in 4.85 seconds, 97% in 5.1 seconds

2-3 Orders of Magnitude faster

Slingshot Scalability: Group Size

[Figure: Scalability in Group Size. x-axis: Group Size (10 to 360); y-axis: Fraction of Lost Packets Recovered (0.84 to 0.98)]

Gossip-Style Scalability: Insensitive to scale beyond a certain size

Simulation Results:

Conclusion

Slingshot provides a tunable, probabilistic guarantee of timeliness

Outperforms SRM by 2 orders of magnitude in a 64 node system

Insensitive to number of senders

Future Work:

Achieve scalability in other dimensions (number of groups)

Build a time-critical middleware layer that uses Slingshot as a generic primitive