Computer network (5)

Data Center Networking Stanford CS144 Lecture 17 Philip Levis, 11/30/11

Page 1: Computer network (5)

Data Center Networking

Stanford CS144 Lecture 17

Philip Levis, 11/30/11

Page 2: Computer network (5)
Page 3: Computer network (5)

• Low latencies: µs

• High capacity: GigE, 10 GigE

• Specialized traffic

• Centrally managed

Page 4: Computer network (5)

Topology

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Page 5: Computer network (5)

Storage Workload

(picture courtesy of Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Page 6: Computer network (5)

Query Workload

(picture courtesy of Alizadeh et al., “Data Center TCP (DCTCP)”)

Page 7: Computer network (5)

Problems

Page 8: Computer network (5)

Per-Pair Bandwidth

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Page 9: Computer network (5)

Incast

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Page 10: Computer network (5)

Incast Details

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Page 11: Computer network (5)

Mixed traffic

• Low latency for short flows

• High burst tolerance (incast)

• High throughput for long flows

Page 12: Computer network (5)

Recent Research

• New switching topology: Al-Fares et al.

• Fix TCP incast: Vasudevan et al.

• Data Center TCP: Alizadeh et al.

Page 13: Computer network (5)

Per-Pair Bandwidth

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Page 14: Computer network (5)

Fat Tree

Page 15: Computer network (5)

Fat Tree

(k/2)^2

k/2, k/2, k
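The labels on this slide are the building-block counts of a k-ary fat tree. As a rough sketch, assuming the construction described by Al-Fares et al. (k pods, each with k/2 edge and k/2 aggregation switches, k/2 hosts per edge switch, and (k/2)^2 core switches), the sizes can be tabulated as follows. The function name and dictionary keys are illustrative, not from the slides.

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a k-ary fat tree built from k-port switches.

    Assumes the Al-Fares et al. construction; key names are illustrative.
    """
    if k <= 0 or k % 2 != 0:
        raise ValueError("k must be a positive, even port count")
    half = k // 2
    return {
        "pods": k,
        "edge_switches_per_pod": half,
        "aggregation_switches_per_pod": half,
        "core_switches": half * half,        # (k/2)^2
        "hosts_per_edge_switch": half,
        # k pods * (k/2 edge switches) * (k/2 hosts each) = k^3 / 4
        "hosts": k * half * half,
    }


if __name__ == "__main__":
    for k in (4, 48):
        print(k, fat_tree_sizes(k))
```

With 48-port switches this gives 576 core switches and 27,648 hosts, which is why commodity switches are attractive for this topology.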

Page 16: Computer network (5)

Switching

Prefix          Port
10.2.0.0/24     0
10.2.1.0/24     1
0.0.0.0/0       (suffix table)

Suffix          Port
0.0.0.2/8       2
0.0.0.3/8       3

TCAM

10.2.0.X
10.2.1.X
X.X.X.2
X.X.X.3

Encoder

Prefix    Next Hop    Port
00        10.2.0.1    0
01        10.2.1.1    1
10        10.4.1.1    2
11        10.4.1.2    3
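The two-level lookup shown on this slide can be sketched in software: try the prefix table first, and if only the default 0.0.0.0/0 entry matches, fall through to a suffix match on the last octet. This is a minimal illustration using the table values from the slide; the names `PREFIX_TABLE`, `SUFFIX_TABLE`, and `lookup_port` are assumptions, and a real switch does this in a TCAM rather than in a loop.

```python
import ipaddress

# Primary table: destination prefix -> output port (values from the slide).
PREFIX_TABLE = [
    (ipaddress.ip_network("10.2.0.0/24"), 0),
    (ipaddress.ip_network("10.2.1.0/24"), 1),
]

# Secondary table: last octet of the destination (the /8 suffix) -> output port.
SUFFIX_TABLE = {2: 2, 3: 3}


def lookup_port(dst: str) -> int:
    """Return the output port for dst using prefix-then-suffix matching."""
    addr = ipaddress.ip_address(dst)
    # All prefixes here have the same length, so first match is longest match.
    for network, port in PREFIX_TABLE:
        if addr in network:
            return port
    # Only 0.0.0.0/0 matched: fall through to the suffix table.
    last_octet = int(addr) & 0xFF
    if last_octet in SUFFIX_TABLE:
        return SUFFIX_TABLE[last_octet]
    raise LookupError(f"no route for {dst}")


if __name__ == "__main__":
    for dst in ("10.2.0.3", "10.2.1.2", "10.3.0.2", "10.0.1.3"):
        print(dst, "->", lookup_port(dst))
```

Matching on the suffix spreads traffic for other pods across the uplinks by host ID, so flows to different destinations take different upward ports.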

Page 17: Computer network (5)

Not Perfect

(k/2)^2

k/2, k/2, k

Page 18: Computer network (5)

Fat-Tree Status

Page 19: Computer network (5)

Incast

• RTO = SRTT + (4 X RTTVAR)
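To make the RTO formula concrete, here is a small sketch of an estimator in the style of RFC 6298. The smoothing gains (1/8 for SRTT, 1/4 for RTTVAR) and the first-sample initialization come from that RFC, not from the slide; the class name and the `min_rto` parameter are illustrative assumptions.

```python
class RtoEstimator:
    """RTT smoothing and RTO computation in the style of RFC 6298."""

    SRTT_GAIN = 1 / 8    # weight of a new sample in SRTT (RFC 6298)
    RTTVAR_GAIN = 1 / 4  # weight of a new sample in RTTVAR (RFC 6298)

    def __init__(self, min_rto: float = 0.0):
        self.min_rto = min_rto  # lower bound on the RTO, in seconds
        self.srtt = None
        self.rttvar = None

    def update(self, sample: float) -> float:
        """Fold in one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            deviation = abs(self.srtt - sample)
            self.rttvar = (1 - self.RTTVAR_GAIN) * self.rttvar + self.RTTVAR_GAIN * deviation
            self.srtt = (1 - self.SRTT_GAIN) * self.srtt + self.SRTT_GAIN * sample
        # RTO = SRTT + 4 x RTTVAR, subject to the configured floor.
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


if __name__ == "__main__":
    est = RtoEstimator(min_rto=0.2)  # a 200 ms floor
    for rtt in (100e-6, 120e-6, 90e-6):  # data-center-scale RTTs
        print(f"sample={rtt * 1e6:.0f}us rto={est.update(rtt) * 1e3:.1f}ms")
```

With microsecond RTTs and a millisecond-scale floor, the floor dominates the computed value, which is the mismatch behind incast.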

Page 20: Computer network (5)

Behavior

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Page 21: Computer network (5)

RFC 6298

“(2.4) Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.”

• In practice, often 200 ms

RFC 2581

“The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay acknowledgments. Specifically, an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the arrival of the first unacknowledged packet.”

• In practice, often 40 ms

Page 22: Computer network (5)

Solutions

• Proposal 1: Adjust RTO (Vasudevan et al.)

• Proposal 2: DCTCP (Alizadeh et al.)

Page 23: Computer network (5)

RTT

Page 24: Computer network (5)

RTT 2

Page 25: Computer network (5)

RTO

• Make RTOmin 200µs

• Timeout = (RTO + (rand(0.5) x RTO))
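The two changes above can be sketched together. This assumes `rand(0.5)` means a uniform random value in [0, 0.5], and that the 200µs floor is applied before the random component is added; both are readings of the slide rather than details it states. The function and constant names are illustrative.

```python
import random

RTO_MIN = 200e-6  # proposed minimum RTO: 200 microseconds


def retransmit_timeout(rto: float, rng=random) -> float:
    """Return a randomized timeout: RTO + rand(0.5) * RTO.

    Assumes rand(0.5) is uniform on [0, 0.5]. The jitter desynchronizes
    senders so they do not all retransmit at the same instant.
    """
    rto = max(rto, RTO_MIN)
    return rto + rng.uniform(0.0, 0.5) * rto


if __name__ == "__main__":
    rng = random.Random(144)  # seeded only to make the demo repeatable
    for _ in range(3):
        print(f"{retransmit_timeout(1e-3, rng) * 1e6:.0f} us")
```

Each call returns a value between 1x and 1.5x the (floored) RTO, so senders that lost packets in the same burst spread their retransmissions out.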

Page 26: Computer network (5)

Improvement

Page 27: Computer network (5)

Wide Area

Page 28: Computer network (5)

DCTCP

• Three goals

  • Low latency for short flows

  • High burst tolerance (incast)

  • High throughput for long flows

• Basic approach: keep switch queues short

Page 29: Computer network (5)

Queue Length

• RTT measurements are noisy

• At high speeds, very small

  • GigE: 10 packets is 120µs

  • 10GigE: 10 packets is 12µs

• Use ECN (explicit congestion notification)

  • RFC 3168
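The queueing-delay figures above can be checked with a short calculation. This sketch assumes 1500-byte packets, which the slide does not state but which reproduces its numbers; the function name is illustrative.

```python
def queueing_delay_us(packets: int, link_bps: float, packet_bytes: int = 1500) -> float:
    """Time in microseconds to drain `packets` queued packets at `link_bps`.

    packet_bytes=1500 is an assumption (a typical Ethernet MTU).
    """
    bits = packets * packet_bytes * 8
    return bits / link_bps * 1e6


if __name__ == "__main__":
    print(f"GigE:   {queueing_delay_us(10, 1e9):.0f} us")
    print(f"10GigE: {queueing_delay_us(10, 10e9):.0f} us")
```

A queue of only ten packets changes the RTT by tens of microseconds or less, which is hard to separate from measurement noise. That is the motivation for an explicit signal such as ECN.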

Page 30: Computer network (5)

Setting ECN

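(figure: switch queue with marking threshold K; the switch sets the ECN bit on packets that arrive when the queue is longer than K)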
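A toy model of threshold marking may help here. It assumes a packet is marked when the queue already holds more than K packets at the moment it arrives, and dropped when the buffer is full. The class name, the return strings, and the exact comparison are illustrative assumptions, not switch code.

```python
from collections import deque


class MarkingQueue:
    """FIFO queue that sets an ECN mark once occupancy exceeds threshold K."""

    def __init__(self, k: int, capacity: int):
        self.k = k                # marking threshold, in packets
        self.capacity = capacity  # buffer size, in packets
        self.packets = deque()    # entries are (payload, ecn_marked)

    def enqueue(self, payload) -> str:
        """Return 'dropped', 'marked', or 'queued' for the arriving packet."""
        if len(self.packets) >= self.capacity:
            return "dropped"
        # Mark based on the occupancy seen on arrival.
        marked = len(self.packets) > self.k
        self.packets.append((payload, marked))
        return "marked" if marked else "queued"


if __name__ == "__main__":
    q = MarkingQueue(k=2, capacity=5)
    print([q.enqueue(i) for i in range(6)])
```

Senders receive marks well before the buffer overflows, so they can slow down while there is still headroom to absorb bursts.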

Page 31: Computer network (5)

Monitoring α

• Per RTT, measure F, the fraction of packets sent that had the ECN bit set

  • DCTCP acks copy the ECN bit of the corresponding data packets into ECN-Echo field

• Compute α, EWMA of F
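The EWMA step can be written out as a small function. The update rule α ← (1 − g)·α + g·F and the gain g = 1/16 follow the DCTCP paper as I understand it; the slide itself gives neither the formula nor a value for g, so treat both as assumptions. The function name is illustrative.

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1 / 16) -> float:
    """Update the congestion estimate alpha once per RTT.

    marked / total is F, the fraction of packets in the last window whose
    acks carried ECN-Echo. g is the EWMA gain (1/16 is an assumed default).
    """
    if total <= 0:
        return alpha  # nothing was acked in this window; keep the estimate
    f = marked / total
    return (1 - g) * alpha + g * f


if __name__ == "__main__":
    alpha = 0.0
    # Three windows of 10 packets with 0, 5, and 10 marked packets.
    for marked in (0, 5, 10):
        alpha = update_alpha(alpha, marked, 10)
        print(f"F={marked / 10:.1f} alpha={alpha:.4f}")
```

Because α moves only a fraction g toward each new F, a single heavily marked window does not cause an overreaction.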

Page 32: Computer network (5)

Adjusting cwnd

• cwnd = cwnd x (1 - α/2)
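A quick sketch shows how this window cut scales with the amount of congestion. The function name is illustrative, and α is assumed to lie in [0, 1] because it is a smoothed fraction.

```python
def reduce_cwnd(cwnd: float, alpha: float) -> float:
    """Apply the DCTCP window reduction: cwnd * (1 - alpha / 2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is a fraction and must lie in [0, 1]")
    return cwnd * (1 - alpha / 2)


if __name__ == "__main__":
    for alpha in (0.0, 0.1, 0.5, 1.0):
        print(f"alpha={alpha:.1f} cwnd 100 -> {reduce_cwnd(100, alpha):.1f}")
```

When every packet is marked (α = 1) the window is halved, as in standard TCP. When few packets are marked the cut is small, which keeps throughput high for long flows while queues stay short.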

Page 33: Computer network (5)

DCTCP Caveat

“We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks.”

Page 34: Computer network (5)

Data Center Networks

• Very different from the wide-area Internet

  • Tiny RTTs

  • Different traffic patterns

  • Single administrative domain

• Standards (e.g., IETF) much less important

• A lot of very novel network design