Computer network (5)
Data Center Networking
Stanford CS144 Lecture 17
Philip Levis, 11/30/11
• Low latencies: µs
• High capacity: GigE, 10 GigE
• Specialized traffic
• Centrally managed
Topology
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
Storage Workload
(picture courtesy of Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
Query Workload
(picture courtesy of Alizadeh et al., “Data Center TCP (DCTCP)”)
Problems
Per-Pair Bandwidth
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
Incast
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
Incast Details
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
Mixed traffic
• Low latency for short flows
• High burst tolerance (incast)
• High throughput for long flows
Recent Research
• New switching topology: Al-Fares et al.
• Fix TCP incast: Vasudevan et al.
• Data Center TCP: Alizadeh et al.
Per-Pair Bandwidth
(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)
Fat Tree
Fat Tree
(figure labels: (k/2)^2, k/2, k/2, k)
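The labels above match the fat-tree construction in Al-Fares et al.: with k-port switches there are k pods, each holding k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches. A minimal sketch of that arithmetic follows; the function name and dictionary keys are illustrative, not from the lecture.

```python
def fat_tree_sizes(k):
    """Element counts for a fat tree built from k-port switches (k even)."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even number of ports, at least 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches_per_pod": half,
        "aggregation_switches_per_pod": half,
        "core_switches": half * half,      # (k/2)^2
        # k pods x k/2 edge switches x k/2 hosts per edge switch = k^3/4
        "hosts": k * half * half,
    }


if __name__ == "__main__":
    for ports in (4, 48):
        print(ports, fat_tree_sizes(ports))
```

For example, 48-port switches give 576 core switches and 27,648 hosts under this construction.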
Switching
Prefix          Port
10.2.0.0/24     0
10.2.1.0/24     1
0.0.0.0/0       (suffix table)
Suffix          Port
0.0.0.2/8       2
0.0.0.3/8       3
10.2.0.X
10.2.1.X
X.X.X.2
X.X.X.3
TCAM
Encoder
Prefix Next Hop Port
00 10.2.0.1 0
01 10.2.1.1 1
10 10.4.1.1 2
11 10.4.1.2 3
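The two-level lookup in the Switching table can be sketched as follows. This is a software illustration only, not how a TCAM works internally. It assumes that a suffix entry such as 0.0.0.2/8 matches on the low 8 bits of the destination address, and that the 0.0.0.0/0 prefix entry hands the packet to the suffix table. The names PREFIX_TABLE, SUFFIX_TABLE, and lookup are hypothetical.

```python
import ipaddress

# First level: ordinary prefixes (traffic staying inside the pod).
PREFIX_TABLE = [
    (ipaddress.ip_network("10.2.0.0/24"), 0),
    (ipaddress.ip_network("10.2.1.0/24"), 1),
]

# Second level: match on the last octet (host ID), reached via 0.0.0.0/0.
SUFFIX_TABLE = {
    2: 2,  # 0.0.0.2/8 -> port 2
    3: 3,  # 0.0.0.3/8 -> port 3
}


def lookup(dst):
    """Return the output port for dst, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    for network, port in PREFIX_TABLE:
        if addr in network:
            return port
    host_id = int(addr) & 0xFF  # low 8 bits of the address
    return SUFFIX_TABLE.get(host_id)


if __name__ == "__main__":
    for dst in ("10.2.0.3", "10.2.1.2", "10.3.0.2", "10.4.1.3"):
        print(dst, "->", lookup(dst))
```

Because the suffix depends on the host ID rather than the destination subnet, traffic leaving the pod is spread across the upward ports by destination host.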
Not Perfect
(figure labels: (k/2)^2, k/2, k/2, k)
Fat-Tree Status
Incast
• RTO = SRTT + (4 X RTTVAR)
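A minimal sketch of how this RTO is typically maintained, following the smoothing constants in RFC 6298 (alpha = 1/8, beta = 1/4). The class name and the min_rto parameter are illustrative, and the RFC's clock-granularity term is left out so the code matches the formula on the slide.

```python
class RtoEstimator:
    """Tracks SRTT and RTTVAR and returns RTO = SRTT + 4 x RTTVAR."""

    ALPHA = 1 / 8  # gain for SRTT
    BETA = 1 / 4   # gain for RTTVAR

    def __init__(self, min_rto=1.0):
        self.min_rto = min_rto  # seconds; the lower bound applied to RTO
        self.srtt = None
        self.rttvar = None

    def update(self, rtt_sample):
        """Fold in one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt_sample))
            self.srtt = ((1 - self.ALPHA) * self.srtt
                         + self.ALPHA * rtt_sample)
        rto = self.srtt + 4 * self.rttvar
        return max(rto, self.min_rto)


if __name__ == "__main__":
    est = RtoEstimator(min_rto=1.0)
    print(est.update(100e-6))  # a 100 µs data center RTT still yields 1.0 s
```

With data center RTTs in the microsecond range, the computed value is far below any conventional minimum, so the floor alone determines the timeout.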
Behavior
(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)
RFC 6298 (2.4): “Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.”
• In practice, often 200 ms
RFC 2581: “The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay acknowledgments. Specifically, an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the arrival of the first unacknowledged packet.”
• In practice, often 40 ms
Solutions
• Proposal 1: Adjust RTO (Vasudevan et al.)
• Proposal 2: DCTCP (Alizadeh et al.)
RTT
RTT 2
RTO
• Make RTOmin 200µs
• Timeout = (RTO + (rand(0.5) x RTO))
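The two changes above can be sketched together. This assumes rand(0.5) means a uniform random value between 0 and 0.5, so the timeout lands between 1x and 1.5x RTO. The names RTO_MIN and randomized_timeout are illustrative.

```python
import random

RTO_MIN = 200e-6  # 200 µs lower bound, in seconds


def randomized_timeout(rto, rng=random):
    """Return RTO + rand(0.5) x RTO, after applying the 200 µs floor.

    rng is anything with a uniform(a, b) method, e.g. the random module
    or a seeded random.Random instance.
    """
    rto = max(rto, RTO_MIN)
    return rto + rng.uniform(0.0, 0.5) * rto


if __name__ == "__main__":
    rng = random.Random(144)
    # Several senders with the same RTO get different timeouts.
    for _ in range(4):
        print(round(randomized_timeout(1e-3, rng) * 1e6), "µs")
```

The random term desynchronizes senders that lost packets in the same burst, so they do not all retransmit at the same instant.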
Improvement
Wide Area
DCTCP
• Three goals
  • Low latency for short flows
  • High burst tolerance (incast)
  • High throughput for long flows
• Basic approach: keep switch queues short
Queue Length
• RTT measurements are noisy
• At high speeds, very small
  • GigE: 10 packets is 120µs
  • 10GigE: 10 packets is 12µs
• Use ECN (explicit congestion notification)
  • RFC 3168
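The queueing delays quoted above follow from serialization time. A small worked calculation, assuming 1500-byte packets (the size is not stated on the slide, but it reproduces the slide's numbers); the function name is illustrative.

```python
def queue_delay_us(packets, link_bps, packet_bytes=1500):
    """Time to drain `packets` queued packets onto the link, in microseconds."""
    bits = packets * packet_bytes * 8
    return bits / link_bps * 1e6


if __name__ == "__main__":
    print("GigE:   ", queue_delay_us(10, 1e9), "µs")
    print("10GigE: ", queue_delay_us(10, 10e9), "µs")
```

A difference of ten queued packets therefore changes the RTT by only tens of microseconds, which is hard to separate from measurement noise.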
Setting ECN
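(figure: switch queue with marking threshold K; packets that arrive while the queue is longer than K get the ECN bit set)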
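A minimal sketch of threshold marking at a switch queue, assuming the decision uses the instantaneous queue length seen by an arriving packet. The class, its parameters, and the "ce" dictionary key are hypothetical names for illustration.

```python
from collections import deque


class MarkingQueue:
    """FIFO queue that sets the ECN bit when occupancy exceeds K."""

    def __init__(self, k, capacity):
        self.k = k                # marking threshold, in packets
        self.capacity = capacity  # buffer size, in packets
        self.queue = deque()

    def enqueue(self, packet):
        """Queue a packet (a dict); return False if it is dropped."""
        if len(self.queue) >= self.capacity:
            return False  # buffer full: tail drop
        # Mark if the queue is already longer than K; keep earlier marks.
        packet["ce"] = packet.get("ce", False) or len(self.queue) > self.k
        self.queue.append(packet)
        return True

    def dequeue(self):
        return self.queue.popleft() if self.queue else None


if __name__ == "__main__":
    q = MarkingQueue(k=2, capacity=4)
    for seq in range(5):
        pkt = {"seq": seq}
        accepted = q.enqueue(pkt)
        print(seq, "accepted" if accepted else "dropped", pkt.get("ce"))
```

Marking starts well before the buffer fills, so senders hear about congestion while there is still room to absorb a burst.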
Monitoring α
• Per RTT, measure F, the fraction of packets sent that had the ECN bit set
  • DCTCP acks copy the ECN bit of the corresponding data packets into ECN-Echo field
• Compute α, EWMA of F
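The EWMA step can be sketched as α ← (1 − g) × α + g × F. The gain g is a tunable parameter; 1/16 is used here as an assumed default, and the function name is illustrative.

```python
def update_alpha(alpha, marked_acks, total_acks, g=1 / 16):
    """Fold one RTT's marked fraction F into the running estimate alpha."""
    if total_acks == 0:
        return alpha  # nothing acknowledged this RTT: keep the estimate
    f = marked_acks / total_acks  # F: fraction with ECN-Echo set
    return (1 - g) * alpha + g * f


if __name__ == "__main__":
    alpha = 0.0
    # Ten RTTs in which every packet was marked: alpha climbs toward 1.
    for rtt in range(10):
        alpha = update_alpha(alpha, marked_acks=10, total_acks=10)
        print(rtt, round(alpha, 4))
```

A small g makes α respond slowly, so it reflects sustained congestion rather than a single marked burst.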
Adjusting cwnd
• cwnd = cwnd x (1 - α/2)
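A short sketch of this window reduction. The min_cwnd floor is an assumption added so the window never reaches zero, and the function name is illustrative.

```python
def reduce_cwnd(cwnd, alpha, min_cwnd=1.0):
    """Apply cwnd = cwnd x (1 - alpha / 2), with an assumed lower bound."""
    return max(cwnd * (1 - alpha / 2), min_cwnd)


if __name__ == "__main__":
    for alpha in (0.0625, 0.5, 1.0):
        print(alpha, reduce_cwnd(100.0, alpha))
```

When every packet is marked (α = 1) the window is halved, as in standard TCP. When only a few packets are marked, the reduction is proportionally smaller, which is how long flows keep their throughput while queues stay short.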
DCTCP Caveat
“We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks.”
Data Center Networks
• Very different from wide area Internet
  • Tiny RTTs
  • Different traffic patterns
  • Single administrative domain
• Standards (e.g., IETF) much less important
• A lot of very novel network design