TCP for Data Center Networks
TCP for Data Center Networks
Deepti Surjyendu Ray
What is a Datacenter?
• A facility used for housing a large amount of computer and communications equipment, maintained by an organization for the purpose of handling the data necessary for its operations. (MSDN Glossary)
• A data center (sometimes spelled datacenter) is a centralized repository, either physical or virtual, for the storage, management, and dissemination of data and information organized around a particular body of knowledge or pertaining to a particular business.
What is a Datacenter network?
• Data centers consist of:
– server racks with servers (compute nodes/storage),
– switches,
– connecting links, along with their topology.
• The network architecture is typically a tree of routing and switching elements, with progressively more specialized and expensive equipment moving up the network hierarchy.
Properties of a typical Datacenter network
• Characteristics of a datacenter network:
– High fan-in of the tree.
– High-bandwidth, low-latency workload.
– Clients that issue barrier-synchronized requests in parallel.
– Relatively small amount of data per request.
– Network constraint: small switch buffers.
The TCP Incast problem
• Incast: TCP throughput collapse, i.e.
– a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP.
• Leading to:
– gross underutilization of link capacity in many-to-one communication networks, such as data center networks.
The root of TCP Incast
• Highly bursty, fast data transmissions overfill Ethernet switch buffers.
• This causes:
– intense packet loss, which results in TCP timeouts.
• TCP timeouts last hundreds of milliseconds.
– TCP timeout ≈ 100s of ms
• But the round-trip time of a data center network is around hundreds of microseconds.
– RTT ≈ 100s of µs
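The mismatch above can be quantified with a quick back-of-envelope calculation, using the slide's round figures (200 ms and 100 µs) rather than measured values:

```python
# Back-of-envelope: how many datacenter RTTs fit inside one TCP timeout?
# The 200 ms and 100 us figures are the slide's round numbers, not measurements.
tcp_timeout_s = 200e-3     # minimum TCP timeout: 200 ms
datacenter_rtt_s = 100e-6  # typical datacenter RTT: ~100 us

wasted_rtts = tcp_timeout_s / datacenter_rtt_s
print(wasted_rtts)  # 2000.0 -- one timeout idles the link for ~2000 RTTs
```

Every timeout therefore stalls a flow for roughly two thousand round trips during which its share of the bottleneck link sits idle.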
Round trips!
• RTT << TCP timeout.
• The sender must wait for a TCP timeout before retransmitting, i.e. the retransmission timeout (RTO).
• Coarse-grained RTOs reduce application throughput by 90%.
Link Idle Time Due To Timeouts
• RTT << TCP timeout.
• The sender must wait for a TCP timeout before retransmitting, i.e. the retransmission timeout (RTO).
• Coarse-grained RTOs reduce application throughput by 90%.
Induced timeout due to barrier synchronization
• The client cannot make forward progress until the responses from every server for the current request have been received.
• Barrier-synchronized workloads are becoming increasingly common in today's commodity clusters.
• E.g. parallel reads/writes in cluster file systems like Lustre and Panasas.
• Search queries sent to dozens of nodes, with results returned to be sorted.
Barrier Synchronization: a typical request pattern in Data Centers
Idle Link issue!
Idle Link issue!
200 ms timeouts ⇒ Throughput Collapse
• Adding more servers to the network overflows the switch buffer.
• This overflow causes severe packet loss.
• Under such packet loss, TCP experiences a timeout lasting a minimum of 200 ms.
Proposed solution to TCP Incast
A two-pronged attack on the problem:
• System extensions to enable microsecond-granularity retransmissions
– Fine-grained TCP retransmission through high-resolution Linux kernel timers.
– Reducing RTOmin improves system throughput.
• Removing acknowledgement delay
– The client acknowledges every other packet, thus reducing network load.
Motivation to resolve Incast using TCP
• TCP is well understood and mature, facilitating its use as a transport protocol in data centers.
• Commodity Ethernet switches are cost-competitive with specialized technology, e.g. InfiniBand.
• Because TCP is well understood, we can harness the TCP stack and modify it to overcome the limitation imposed by the small buffers in switches.
Solution Domain
Insight into fine-grained TCP
• Premise:
– The timers must operate on a granularity close to the RTT of the network: hundreds of µs or less.
• Commodity Ethernet switches are cost-competitive with specialized technology, e.g. InfiniBand.
• Because TCP is well understood, we can harness the TCP stack and modify it to overcome the limitation imposed by the small buffers in switches.
RTO Estimation and Minimum Bound
• Jacobson's TCP RTO estimator:
– RTO_estimated = SRTT + (4 × RTTVAR)
• Actual RTO = max(minRTO, RTO_estimated)
• Minimum RTO bound (minRTO) = 200 ms
– TCP timer granularity
– Safety (Allman99)
– minRTO (200 ms) >> datacenter RTT (100 µs)
– One TCP timeout lasts ~2000 datacenter RTTs!
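The estimator on this slide can be sketched in a few lines. The EWMA gains (1/8 for SRTT, 1/4 for RTTVAR) are the standard RFC 6298 values and are an assumption here, since the slide gives only the final formula:

```python
# Jacobson-style RTO estimation, as on the slide:
#   RTO_estimated = SRTT + 4 * RTTVAR,  actual RTO = max(minRTO, RTO_estimated)
# The EWMA gains (1/8 for SRTT, 1/4 for RTTVAR) are the standard RFC 6298
# values, assumed here because the slide only states the final formula.

def update_rto(srtt, rttvar, rtt_sample, min_rto=0.200):
    """Fold one RTT sample (seconds) into the estimate; return (srtt, rttvar, rto)."""
    rttvar = 0.75 * rttvar + 0.25 * abs(srtt - rtt_sample)
    srtt = 0.875 * srtt + 0.125 * rtt_sample
    rto_estimated = srtt + 4 * rttvar
    return srtt, rttvar, max(min_rto, rto_estimated)

# With steady 100 us RTTs the estimate converges toward ~100 us ...
srtt, rttvar = 100e-6, 50e-6
for _ in range(20):
    srtt, rttvar, rto = update_rto(srtt, rttvar, 100e-6)
print(rto)  # 0.2 -- the 200 ms floor dominates, not the RTT-based estimate
```

This is exactly the problem the slide describes: the estimator itself adapts to microsecond RTTs, but the 200 ms minRTO clamp throws that precision away.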
RTO Estimation and Minimum Bound
• Jacobson's TCP RTO estimator:
– RTO_estimated = SRTT + (4 × RTTVAR)
• Actual RTO = max(minRTO, RTO_estimated)
• Minimum RTO bound (minRTO) = 200 ms
– TCP timer granularity
– minRTO (200 ms) >> datacenter RTT (100 µs)
– One TCP timeout lasts ~2000 datacenter RTTs!
Evaluation workload
• The test client requests a block of data striped across n servers.
• Each server therefore responds with blocksize/n bytes of data.
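This fixed-block, synchronized-read pattern is simple to sketch; the 1 MB block size below is an illustrative value, not one taken from the slides:

```python
# Fixed-block synchronized read: the client requests one block, striped
# evenly across n servers, so each server sends back blocksize/n bytes.
def per_server_bytes(blocksize, n):
    # assume the block divides evenly across servers, for simplicity
    return blocksize // n

block = 1024 * 1024  # 1 MB block -- illustrative value, not from the slides
for n in (1, 4, 16, 64):
    print(n, per_server_bytes(block, n))
# As n grows, each response shrinks, but all n responses arrive in the same
# RTT window -- exactly the burst that overfills the switch buffer.
```

Because the total transfer stays fixed, adding servers does not add load; it only makes the many-to-one burst more tightly synchronized.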
µsecond Retransmission Timeouts (RTO)
RTO = max(minRTO, f(RTT))
Does eliminating RTOmin help avoid TCP Incast collapse?
Simulation result
Reducing RTOmin in simulation from the current default of 200 ms to microseconds improves goodput.
Real world cluster
Experiments on a real cluster validate the simulation result: reducing RTOmin improves goodput.
Real world cluster
Experiments on a real cluster validate the simulation result: reducing RTOmin improves goodput.
TCP Requirements for µsecond RTO
• TCP must track RTT in microseconds.
– Requires efficient high-resolution kernel timers.
• Use the HPET (High Precision Event Timer) for efficient interrupt signaling.
• The HPET is a programmable hardware timer consisting of a free-running up-counter and several comparators and registers, which modern operating systems can set.
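In userspace terms, the required microsecond time accounting looks like the following sketch. This is a hypothetical analogue for illustration only; the actual work patches the Linux kernel's TCP stack and relies on high-resolution timers backed by hardware such as the HPET:

```python
import time

# Microsecond-resolution RTT tracking: timestamp each segment on send and
# compute the RTT when its ACK arrives. Userspace analogue of the kernel
# change; real TCP does this inside the stack with high-resolution timers.
send_times = {}

def on_send(seq):
    send_times[seq] = time.monotonic_ns()

def on_ack(seq):
    # RTT sample in microseconds, ready to feed the RTO estimator
    return (time.monotonic_ns() - send_times.pop(seq)) / 1_000

on_send(1)
rtt_us = on_ack(1)
```

The key point is the unit: sampling with nanosecond clocks and keeping RTTs in microseconds, rather than rounding to the kernel's traditional millisecond-scale tick.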
Modifications to the TCP stack
• The minimal modifications required of the TCP stack to support hrtimers are:
– microsecond-resolution time accounting, to track RTTs with greater precision;
– redefinition of TCP constants;
– replacement of low-resolution timers with hrtimers.
µsecond TCP + no minRTO
For a 48-node cluster, providing TCP retransmissions at microsecond granularity eliminates Incast collapse for up to 47 servers.
Simulation: Scaling to thousands
In simulation, introducing a randomized component to the RTO desynchronizes retransmissions following timeouts and avoids goodput degradation for a large number of flows.
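One way to realize that randomized component is to scale each timeout by an independent random factor; the specific form below (uniform in [1.0, 1.5]) is an illustrative choice, not necessarily the exact randomization used in the paper:

```python
import random

# Desynchronize retransmissions: flows that time out together should not all
# fire their retransmissions at the same instant. Scaling each RTO by an
# independent random factor spreads them out. The uniform(1.0, 1.5) range is
# an assumption for illustration, not the paper's exact formula.
def randomized_rto(rto, rng=random):
    return rto * rng.uniform(1.0, 1.5)

base = 200e-6  # a microsecond-granularity base RTO
timeouts = [randomized_rto(base) for _ in range(5)]
print(all(base <= t <= 1.5 * base for t in timeouts))  # True
```

Without the jitter, synchronized flows that lost packets in the same buffer overflow would retransmit simultaneously and overflow the buffer again; the random spread breaks that cycle.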
Conclusion
• Microsecond-granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput.
• The change is safe for wide-area communication.
• This paper presented a practical, effective, and safe solution to eliminate TCP Incast in data center environments:
– microsecond-granularity TCP timeouts;
– randomized retransmissions.
Future Work
• The practical implementation of the proposed work is demonstrated on about 48 servers in a data center.
• Its practical implementation needs to be validated on thousands of machines.
• Narrow down the TCP variables of interest for introducing microsecond granularity, to reduce the problem space.