Solving TCP Incast (and more) With Aggressive TCP Timeouts
Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat,
David Andersen, Greg Ganger, Garth Gibson, Brian Mueller*
Carnegie Mellon University, *Panasas Inc.
PDL Retreat 2009
(Posted 20-Jan-2016)
Cluster-based Storage Systems
[Figure: client connected to storage servers through a commodity Ethernet switch]
• Ethernet: 1-10 Gbps
• Round-trip time (RTT): 10-100 µs
Cluster-based Storage Systems: Synchronized Read
[Figure: client reads a data block striped across four storage servers (1-4) through a switch]
• The client requests a data block; each server returns its portion, a Server Request Unit (SRU)
• The read is synchronized: the client sends the next batch of requests only after receiving the complete block
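The barrier-synchronized read pattern above can be sketched as follows. This is an illustrative model, not the authors' code; the `Server` class and its `read` method are hypothetical stand-ins for the storage servers:

```python
class Server:
    """Hypothetical storage server holding one stripe of each block."""
    def __init__(self, stripe):
        self.stripe = stripe

    def read(self, sru_bytes):
        # Return this server's Server Request Unit (SRU) for the block.
        return self.stripe[:sru_bytes]

def synchronized_read(servers, sru_bytes):
    """Request one SRU from every server. The data block is complete
    only when ALL responses have arrived (a barrier), so the client
    issues its next batch of requests only after the slowest server."""
    responses = [srv.read(sru_bytes) for srv in servers]  # issued in parallel in practice
    return b"".join(responses)
```

Because of the barrier, one slow (or timed-out) server stalls the entire read, which is what makes this workload so sensitive to TCP timeouts.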
Synchronized Read Setup
• Test on an Ethernet-based storage cluster
• Client performs synchronized reads
• Increase the number of servers involved in the transfer
• Data block size is fixed (file-system read)
• TCP used as the data transfer protocol
TCP Throughput Collapse
[Figure: goodput collapses as the number of servers grows. Cluster setup: 1 Gbps Ethernet, unmodified TCP, S50 switch, 1 MB block size]
• This collapse is called TCP Incast
• Cause of throughput collapse: coarse-grained TCP timeouts
Solution: µsecond TCP + no minRTO
[Figure: throughput (Mbps) vs. number of servers; unmodified TCP collapses as servers are added, while our solution sustains high throughput]
• High throughput for up to 47 servers
• Simulation scales to thousands of servers
Overview
• Problem: coarse-grained TCP timeouts (200 ms) are too expensive for datacenter applications
• Solution: microsecond-granularity timeouts
  • Improves datacenter application throughput & latency
  • Also safe for use in the wide-area (Internet)
Outline
• Overview
Why are TCP timeouts expensive?
• How do coarse-grained timeouts affect apps?
• Solution: Microsecond TCP Retransmissions
• Is the solution safe?
TCP: Data-driven Loss Recovery
[Figure: sender transmits packets 1-5; packet 2 is lost; the receiver ACKs packet 1 for each later arrival]
• Three duplicate ACKs for packet 1 signal that packet 2 is probably lost
• The sender retransmits packet 2 immediately; the receiver then sends Ack 5
• In datacenters, data-driven recovery completes in µseconds after a loss
TCP: Timeout-driven Loss Recovery
[Figure: sender transmits packets 1-5; all but packet 1 are lost, so no duplicate ACKs arrive and the sender must wait out the Retransmission Timeout (RTO) before retransmitting]
• Timeouts are expensive: milliseconds to recover after a loss
TCP: Loss Recovery Comparison
[Figure: side-by-side timelines of data-driven recovery (duplicate ACKs, immediate retransmit) and timeout-driven recovery (waiting out the RTO)]
• Data-driven recovery is super fast (µs) in datacenters
• Timeout-driven recovery is slow (ms)
RTO Estimation and Minimum Bound
• Jacobson's TCP RTO estimator:
  • RTO_estimated = SRTT + (4 × RTTVAR)
  • Actual RTO = max(minRTO, RTO_estimated)
• Minimum RTO bound (minRTO) = 200 ms
  • TCP timer granularity
  • Safety [Allman99]
• minRTO (200 ms) >> datacenter RTT (100 µs)
  • One TCP timeout lasts 1000 datacenter RTTs!
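A minimal sketch of this estimator (Jacobson's algorithm as standardized in RFC 6298), with the standard gains alpha = 1/8 and beta = 1/4. Times are in seconds; the 200 ms floor is the default being criticized here:

```python
def make_rto_estimator(min_rto=0.200, alpha=1/8, beta=1/4):
    """Jacobson's RTO estimator with a minimum bound (minRTO).

    Returns an `update` function that takes one RTT sample (seconds)
    and returns the RTO the sender should use.
    """
    srtt = None    # smoothed RTT
    rttvar = None  # RTT variation

    def update(rtt_sample):
        nonlocal srtt, rttvar
        if srtt is None:
            # First measurement initializes the estimator (RFC 6298).
            srtt = rtt_sample
            rttvar = rtt_sample / 2
        else:
            rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
            srtt = (1 - alpha) * srtt + alpha * rtt_sample
        rto_estimated = srtt + 4 * rttvar
        return max(min_rto, rto_estimated)  # the minRTO floor can dominate

    return update
```

With a 100 µs datacenter RTT the estimated RTO is on the order of 300 µs, yet the returned value is pinned at 200 ms: one timeout lasts roughly a thousand RTTs, exactly the mismatch this slide describes.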
Outline
• Overview
• Why are TCP timeouts expensive?
How do coarse-grained timeouts affect apps?
• Solution: Microsecond TCP Retransmissions
• Is the solution safe?
Single-Flow TCP Request-Response
[Figure: timeline of client, switch, and server: the request is sent, the server's response is dropped at the switch, and the response is resent only after the 200 ms retransmission timeout]
Apps Sensitive to 200 ms Timeouts
• Single-flow request-response
  – Latency-sensitive applications
• Barrier-synchronized workloads
  • Parallel cluster file systems
    – Throughput-intensive
  • Search: multi-server queries
    – Latency-sensitive
Link Idle Time Due To Timeouts
[Figure: synchronized read across four servers; server 4's response is dropped. Servers 1-3 finish quickly, and the client's link sits idle until the 200 ms timeout fires and server 4 resends its response]
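The cost of that idle period can be worked out directly. The model below is a simplification of mine (one timeout per block, no other TCP dynamics), not the paper's analysis, but it shows the order of magnitude: a 1 MB block takes about 8 ms to transfer on a 1 Gbps link, so a single 200 ms timeout leaves the link idle over 95% of the time.

```python
def effective_goodput_bps(block_bytes, link_bps, rto_s):
    """Goodput when each block suffers one timeout: the block's bits
    divided by (transfer time + idle RTO period)."""
    transfer_s = block_bytes * 8 / link_bps
    return block_bytes * 8 / (transfer_s + rto_s)

# 1 MB (10**6 B) block on a 1 Gbps link:
#   ~38 Mbps with a 200 ms RTO vs ~889 Mbps with a 1 ms RTO
```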
Client Link Utilization
[Figure: client link utilization over time; the link is idle for the full 200 ms timeout period]
200 ms Timeouts → Throughput Collapse
• [Nagle04] called this Incast
  • Provided application-level solutions
  • Cause of throughput collapse: TCP timeouts
• [FAST08]: search for network-level solutions to TCP Incast
[Figure: throughput collapse. Cluster setup: 1 Gbps Ethernet, 200 ms minRTO, S50 switch, 1 MB block size]
Results from Our Previous Work (FAST08)

Network-Level Solution | Results / Conclusions
---------------------- | ---------------------
Increase switch buffer size | Delays throughput collapse; collapse still inevitable; expensive
Alternate TCP implementations (avoiding timeouts, aggressive data-driven recovery, disabling slow start) | Throughput collapse inevitable because timeouts are inevitable (complete window loss is a common case)
Ethernet flow control | Limited effectiveness (works only for simple topologies); head-of-line blocking
Reducing minRTO (in simulation) | Very effective, but raises implementation concerns (µs timers for OS and TCP) and safety concerns
Outline
• Overview
• Why are TCP timeouts expensive?
• How do coarse-grained timeouts affect apps?
Solution: Microsecond TCP Retransmissions
  • ...and eliminate minRTO
• Is the solution safe?
µsecond Retransmission Timeouts (RTO)
• RTO = max(minRTO, f(RTT))
• minRTO: currently 200 ms; could it be 200 µs? Or 0?
• f(RTT): RTT is currently tracked in milliseconds; track it in µseconds instead
Lowering minRTO to 1 ms
• Lower minRTO to as low a value as possible without changing timers or the TCP implementation
• Simple one-line change to Linux
• Uses low-resolution 1 ms kernel timers
Default minRTO: Throughput Collapse
[Figure: throughput collapse with unmodified TCP (200 ms minRTO)]
Lowering minRTO to 1 ms Helps
[Figure: 1 ms minRTO sustains throughput longer than unmodified TCP (200 ms minRTO) but still collapses]
• Millisecond retransmissions are not enough
Requirements for µsecond RTO
• TCP must track RTT in microseconds
  • Modify internal data structures
  • Reuse the timestamp option
• Efficient high-resolution kernel timers
  • Use HPET for efficient interrupt signaling
Solution: µsecond TCP + no minRTO
[Figure: throughput vs. number of servers for unmodified TCP (200 ms minRTO), 1 ms minRTO, and microsecond TCP + no minRTO; the last sustains high throughput for up to 47 servers]
Simulation: Scaling to Thousands
[Figure: simulated throughput at scale. Block size = 80 MB, buffer = 32 KB, RTT = 20 µs]
Synchronized Retransmissions at Scale
• Simultaneous retransmissions lead to successive timeouts
• Successive RTO = RTO × 2^backoff
Simulation: Scaling to Thousands
• Desynchronize retransmissions to scale further
• Successive RTO = (RTO + rand(0.5) × RTO) × 2^backoff
• For use within datacenters only
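The desynchronized backoff above can be sketched as below; `rng.uniform(0.0, 0.5)` is my reading of the slide's rand(0.5), i.e. a uniform draw in [0, 0.5]:

```python
import random

def successive_rto(base_rto, num_timeouts, desynchronize=False, rng=random):
    """Retransmission timeout after `num_timeouts` successive timeouts.

    Standard TCP exponential backoff:  RTO * 2**num_timeouts
    Desynchronized (datacenter-only):  (RTO + rand(0.5)*RTO) * 2**num_timeouts

    The random term spreads out retransmissions from flows that timed
    out simultaneously, so they do not all retry (and collide) in
    lockstep at the bottleneck switch.
    """
    if desynchronize:
        base_rto = base_rto + rng.uniform(0.0, 0.5) * base_rto
    return base_rto * (2 ** num_timeouts)
```

For example, with a 1 ms base RTO, the third successive timeout waits 8 ms under standard backoff, and between 8 ms and 12 ms under the desynchronized variant.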
Outline
• Overview
• Why are TCP timeouts expensive?
• The Incast Workload
• Solution: Microsecond TCP Retransmissions
Is the solution safe?
  • Interaction with Delayed-ACK within datacenters
  • Performance in the wide-area
Delayed-ACK (for RTO > 40 ms)
• Delayed-ACK: an optimization to reduce the number of ACKs sent
[Figure: three receiver timelines. A lone packet 1 is ACKed only after the 40 ms delayed-ACK timer fires; when packets 1 and 2 both arrive, a cumulative Ack 2 is sent immediately; an out-of-order arrival triggers an immediate ACK for the last in-order byte (Ack 0)]
µsecond RTO and Delayed-ACK
• Premature timeout: the sender's RTO fires before the delayed-ACK timer on the receiver
[Figure: with RTO < 40 ms, the sender times out and retransmits packet 1 even though it was delivered; with RTO > 40 ms, Ack 1 arrives (after the 40 ms delay) before the timeout can fire]
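This interaction reduces to a simple timing comparison. The toy model below is my simplification (single unacknowledged packet, a fixed 40 ms delayed-ACK timer as used in many stacks), not the authors' analysis:

```python
def spurious_retransmit(rto_s, delayed_ack_s=0.040, rtt_s=0.0001):
    """True if the sender's timeout fires before the delayed ACK for a
    successfully delivered packet can arrive.

    The ACK for a lone packet arrives roughly delayed_ack_s + rtt_s
    after the send; the sender retransmits at rto_s. With microsecond
    RTOs and no minRTO, rto_s << 40 ms, so the retransmission is
    premature even though nothing was lost.
    """
    ack_arrival_s = delayed_ack_s + rtt_s
    return rto_s < ack_arrival_s
```

This is why the 200 ms default never trips over delayed ACKs, while a microsecond RTO does unless the delayed-ACK interaction is accounted for.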
Impact of Delayed-ACK
[Figure: measured impact of the delayed-ACK interaction on throughput]
Is It Safe for the Wide-Area?
• Stability: could we cause congestion collapse?
  • No: wide-area RTOs are in the 10s to 100s of ms
  • No: timeouts result in rediscovering link capacity (slowing the rate of transfer)
• Performance: do we time out unnecessarily?
  • [Allman99]: reducing minRTO increases the chance of premature timeouts
    – Premature timeouts slow the transfer rate
  • Today: detect and recover from premature timeouts
  • Wide-area experiments to determine the performance impact
Wide-area Experiment
• Do microsecond timeouts harm wide-area throughput?
[Figure: BitTorrent seeds running microsecond TCP + no minRTO alongside seeds running standard TCP, serving BitTorrent clients across the wide area]
Wide-area Experiment: Results
[Figure: throughput comparison between the two configurations]
• No noticeable difference in throughput
Conclusion
• Microsecond-granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput
• Safe for wide-area communication
• Linux patch: http://www.cs.cmu.edu/~vrv/incast/
• Code (simulation, cluster) and scripts: http://www.cs.cmu.edu/~amarp/dist/incast/incast_1.1.tar.gz