![Page 1: Slingshot: Time-Critical Multicast for Clustered Applications Mahesh Balakrishnan Stefan Pleisch Ken Birman Cornell University.](https://reader036.fdocuments.us/reader036/viewer/2022062315/5697c0221a28abf838cd3671/html5/thumbnails/1.jpg)
Slingshot: Time-Critical Multicast for Clustered Applications
Mahesh Balakrishnan, Stefan Pleisch, Ken Birman
Cornell University
The Contemporary Datacenter
Building-wide super-clusters: 1000s of commodity blade-servers
Typically used as commercial website back-ends: Amazon, etc.
Software Paradigms: SOA, Eventing, Publish/Subscribe…
… many-to-many communication, Multicast!
Multicast in the Datacenter
IP Multicast available: adding reliability to it is a well-researched technology…
Scalability dimensions:
• Number of receivers
• Number of senders?
• Number of groups?
Metrics:
• Throughput
• Timeliness?
Time-Critical Applications
… dealing in perishable data: stock quotes, location updates
… willing to trade complete reliability for timeliness
… requiring tunable reliability/timeliness/overhead tradeoffs
Probabilistic Guarantee of Timeliness?
• For x% overhead, y% of lost packets are recovered in time t.
• Remainder can be optionally recovered in time t’.
Design Space
Reactive vs. Proactive
Reactive: Loss Discovery
ACK
Sender-Based Sequencing
• If the multicast rate in a group is constant, the inter-multicast time at any sender goes up linearly with the number of senders
Gossip – Scalable
Proactive: FEC – Tunable
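The sender-based sequencing bullet can be checked with simple arithmetic. The sketch below assumes the group rate is split evenly across senders (an assumption, not something the slide states); the function name is illustrative only.

```python
def inter_multicast_time_ms(group_rate_pps: float, num_senders: int) -> float:
    """Average gap, in milliseconds, between two multicasts from one sender.

    Assumes every sender contributes an equal share of the group rate.
    """
    per_sender_rate_pps = group_rate_pps / num_senders
    return 1000.0 / per_sender_rate_pps


# With a fixed group rate of 1000 packets per second, the gap grows
# linearly with the number of senders.
for senders in (1, 8, 64):
    print(senders, inter_multicast_time_ms(1000.0, senders))
```

A receiver that discovers a loss only when the next packet from the same sender arrives has to wait at least this long, which is why loss discovery by sender sequence numbers slows down as senders are added.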
Slingshot Overview
Receiver-Based FEC:
Senders send initially via unreliable IP Multicast
Phase 1: Receivers repair losses by proactively sending each other FEC repair packets
Phase 2: Remaining losses are recovered from the sender
Each receiver sends an error correction packet, the XOR of the last r packets it received, to c randomly selected receivers
Rate-of-fire parameter (r, c): Allows tuning of overhead-timeliness tradeoff
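As a rough illustration of how (r, c) drives overhead: a receiver emits one repair packet to each of c peers for every r data packets it receives, so it sends c/r repair packets per data packet. The helper below is a hypothetical sketch of that packet-count ratio only; it is not necessarily the overhead metric plotted later in the talk.

```python
def repair_packets_per_data_packet(r: int, c: int) -> float:
    """Repair packets sent by one receiver per data packet it receives.

    One XOR packet is built per r received data packets and is sent to
    c receivers, giving c / r repair packets per data packet.
    """
    if r <= 0 or c < 0:
        raise ValueError("r must be positive and c must be non-negative")
    return c / r


# Larger r (or smaller c) lowers overhead, at the cost of slower or
# less likely recovery.
print(repair_packets_per_data_packet(8, 2))
```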
Protocol Details 0
Two Packet Types:
Data Packet:
• Packet ID (Sender, SeqNo)
• Application Payload
Repair Packet:
• XOR of Data Packets
• List of Data Packet IDs: (sender1, seqno1), (sender2, seqno2), …
Diagram label: Application MTU: 1024, Less than Network MTU
Terminology: the data packets XORed into a repair packet are said to be "included" in that repair packet
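The two packet types can be sketched as plain data classes. The field names, the use of strings for sender IDs, and the zero-padding of shorter payloads are assumptions made for illustration; the slides do not specify a wire format.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Tuple

PacketId = Tuple[str, int]  # (sender, seqno); types are assumed


@dataclass(frozen=True)
class DataPacket:
    packet_id: PacketId
    payload: bytes


@dataclass(frozen=True)
class RepairPacket:
    included_ids: Tuple[PacketId, ...]  # IDs of the included data packets
    xor_payload: bytes                  # XOR of their payloads


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one (an assumption)."""
    n = max(len(a), len(b))
    a = a.ljust(n, b"\x00")
    b = b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(packets: List[DataPacket]) -> RepairPacket:
    """Build a repair packet that includes every packet in `packets`."""
    xor_payload = reduce(xor_bytes, (p.payload for p in packets), b"")
    return RepairPacket(tuple(p.packet_id for p in packets), xor_payload)
```

Because XOR is its own inverse, XORing the repair payload with all but one of the included payloads yields the remaining one, which is what makes single-loss recovery possible.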
Protocol Details 1
Data Structures:
• Data Buffer: received data packets
• Repair Bin: pointers to the last < r data packets
Arrival of Data Packet dp at Receiver:
• dp is added to the data buffer
• &dp (a pointer to dp) is added to the repair bin
• If repair bin size equals r, a repair packet rp is created from its contents, and the repair bin is cleared
• rp is dispatched to c random receivers
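The data-arrival steps above can be sketched as follows. The `Receiver` class, the `send` callback, and the choice to ignore duplicate data packets are hypothetical; only the buffer, the repair bin, and the (r, c) behavior come from the slide.

```python
import random
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

PacketId = Tuple[str, int]  # (sender, seqno)


def xor_all(payloads: Iterable[bytes]) -> bytes:
    """XOR byte strings together, zero-padding shorter ones (assumed)."""
    out = bytearray()
    for p in payloads:
        if len(p) > len(out):
            out.extend(bytes(len(p) - len(out)))
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)


class Receiver:
    def __init__(self, r: int, c: int, peers: Sequence[str],
                 send: Callable[[str, Tuple[PacketId, ...], bytes], None],
                 rng: Optional[random.Random] = None) -> None:
        self.r, self.c = r, c
        self.peers = list(peers)
        self.send = send  # hypothetical transport callback
        self.rng = rng or random.Random()
        self.data_buffer: Dict[PacketId, bytes] = {}
        self.repair_bin: List[PacketId] = []  # IDs stand in for pointers

    def on_data(self, pid: PacketId, payload: bytes) -> None:
        if pid in self.data_buffer:
            return  # duplicate; ignoring it is an assumption
        self.data_buffer[pid] = payload
        self.repair_bin.append(pid)
        if len(self.repair_bin) == self.r:
            ids = tuple(self.repair_bin)
            xor_payload = xor_all(self.data_buffer[i] for i in ids)
            self.repair_bin.clear()
            # Dispatch the repair packet to c distinct random receivers.
            k = min(self.c, len(self.peers))
            for peer in self.rng.sample(self.peers, k):
                self.send(peer, ids, xor_payload)
```

Keeping only packet IDs in the repair bin mirrors the slide's "pointers": the payloads live once in the data buffer and are read when the bin fills.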
Protocol Details 2
Arrival of Repair Packet rp at Receiver:
If #(missing included data packets) ==
• 0: rp is discarded
• 1: the missing data packet is recovered by XORing rp with the other r-1 data packets
• >1: rp is stored in a special buffer, in case future data packet arrivals and recoveries make it usable
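A minimal sketch of the three cases, assuming equal-length payloads and a simple list as the "special buffer". The class and method names are hypothetical, and the policy of re-checking stored repair packets after every arrival or recovery is one possible reading of the slide.

```python
from typing import Dict, Iterable, List, Tuple

PacketId = Tuple[str, int]
Repair = Tuple[Tuple[PacketId, ...], bytes]  # (included IDs, XOR payload)


def xor_all(payloads: Iterable[bytes]) -> bytes:
    out = bytearray()
    for p in payloads:
        if len(p) > len(out):
            out.extend(bytes(len(p) - len(out)))
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)


class RepairHandler:
    def __init__(self) -> None:
        self.data_buffer: Dict[PacketId, bytes] = {}
        self.pending: List[Repair] = []  # repairs with more than one loss

    def _missing(self, ids: Tuple[PacketId, ...]) -> List[PacketId]:
        return [i for i in ids if i not in self.data_buffer]

    def _recover(self, ids, xor_payload: bytes, lost: PacketId) -> None:
        known = [self.data_buffer[i] for i in ids if i != lost]
        self.data_buffer[lost] = xor_all([xor_payload, *known])

    def on_repair(self, ids: Tuple[PacketId, ...], xor_payload: bytes) -> str:
        missing = self._missing(ids)
        if len(missing) == 0:
            return "discarded"
        if len(missing) == 1:
            self._recover(ids, xor_payload, missing[0])
            self._retry_pending()
            return "recovered"
        self.pending.append((ids, xor_payload))
        return "stored"

    def on_data(self, pid: PacketId, payload: bytes) -> None:
        self.data_buffer[pid] = payload
        self._retry_pending()

    def _retry_pending(self) -> None:
        # Re-scan stored repairs until no further recovery is possible.
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                ids, xor_payload = entry
                missing = self._missing(ids)
                if len(missing) > 1:
                    continue
                self.pending.remove(entry)  # now usable or redundant
                if len(missing) == 1:
                    self._recover(ids, xor_payload, missing[0])
                    progress = True
```

One recovery can unlock another stored repair packet, which is why the retry loop runs until a full pass makes no progress.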
Evaluation Setup
64-node rack-style cluster at Cornell
Loss rate fixed at 1%: packets dropped at end buffers
All nodes send and receive
Inter-node latencies = 50-100 microseconds
Group Data Rate: 1000 packets per second
Each node multicasts about 16 packets per second (1000/64), i.e., one packet every 64 milliseconds
Slingshot Tunability
For 27% overhead, 93.5% of lost packets are recovered in an average of 3.5 milliseconds
Example tradeoff points between overhead, timeliness, and reliability
Overhead and Recovered Packets plotted on left y-axis, Recovery Time on right
Slingshot vs SRM
Slingshot recovers 93% in 10 ms, 97% in 25 ms
Fastest SRM packet recovery is 2.2 seconds
SRM recovers 93% in 4.85 seconds, 97% in 5.1 seconds
2-3 Orders of Magnitude faster
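As a quick check of the "2-3 orders of magnitude" figure, the ratios of the SRM and Slingshot recovery times quoted on this slide at the 93% and 97% levels work out as:

```latex
\[
\frac{4.85\ \text{s}}{10\ \text{ms}} = 485 \approx 10^{2.7},
\qquad
\frac{5.1\ \text{s}}{25\ \text{ms}} = 204 \approx 10^{2.3}
\]
```

Both ratios fall between $10^{2}$ and $10^{3}$.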
Slingshot Scalability: Group Size
[Figure: Scalability in Group Size (simulation results). x-axis: Group Size, 10 to 360; y-axis: Fraction of Lost Packets Recovered, 0.84 to 0.98]
Gossip-Style Scalability: Insensitive to scale beyond a certain size
Conclusion
Slingshot provides a tunable, probabilistic guarantee of timeliness
Outperforms SRM by 2 orders of magnitude in a 64 node system
Insensitive to number of senders
Future Work:
• Achieve scalability in other dimensions (number of groups)
• Build a time-critical middleware layer that uses Slingshot as a generic primitive