Computer network (5)

Post on 18-Jul-2015

Data Center Networking

Stanford CS144 Lecture 17

Philip Levis, 11/30/11

• Low latencies: µs

• High capacity: GigE, 10 GigE

• Specialized traffic

• Centrally managed

Topology

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Storage Workload

(picture courtesy of Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Query Workload

(picture courtesy of Alizadeh et al., “Data Center TCP (DCTCP)”)

Problems

Per-Pair Bandwidth

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Incast

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Incast Details

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

Mixed traffic

• Low latency for short flows

• High burst tolerance (incast)

• High throughput for long flows

Recent Research

• New switching topology: Al-Fares et al.

• Fix TCP incast: Vasudevan et al.

• Data Center TCP: Alizadeh et al.

Per-Pair Bandwidth

(picture courtesy of Al-Fares et al, “A Scalable, Commodity Data Center Network Architecture”)

Fat Tree

Fat Tree

(Figure labels: (k/2)² core switches; k pods, each with k/2 aggregation switches and k/2 edge switches)
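
A minimal sketch of how the fat-tree counts scale with the switch port count k, assuming the k-ary fat tree construction described by Al-Fares et al. (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)² core switches, k/2 hosts per edge switch). The function name `fat_tree_size` is illustrative and not from the lecture.

```python
def fat_tree_size(k):
    """Element counts for a k-ary fat tree built from k-port switches."""
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even integer >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_per_pod": half,
        "aggregation_per_pod": half,
        "core": half * half,
        # k pods x (k/2 edge switches) x (k/2 hosts per edge switch) = k^3 / 4
        "hosts": k * half * half,
    }


print(fat_tree_size(4))
print(fat_tree_size(48)["hosts"])
```

With 48-port switches this gives 27,648 hosts, which is why commodity switches are attractive for this topology.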

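Switching

Prefix table:

Prefix          Port
10.2.0.0/24     0
10.2.1.0/24     1
0.0.0.0/0       (go to suffix table)

Suffix table:

Suffix          Port
0.0.0.2/8       2
0.0.0.3/8       3

Matching destinations: port 0: 10.2.0.X, port 1: 10.2.1.X, port 2: X.X.X.2, port 3: X.X.X.3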
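
The two-level lookup on this slide can be sketched roughly as follows. This is an illustration only: the table entries come from the slide, but the list layout, the `lookup` function, and the use of `None` to mean "fall through to the suffix table" are assumptions, and a real switch would do this in hardware.

```python
import ipaddress

# First level: ordinary prefixes. A port of None means "consult the suffix table".
PREFIX_TABLE = [
    ("10.2.0.0/24", 0),
    ("10.2.1.0/24", 1),
    ("0.0.0.0/0", None),
]

# Second level: match on the last 8 bits of the address (0.0.0.2/8, 0.0.0.3/8).
SUFFIX_TABLE = [
    (2, 2),
    (3, 3),
]


def lookup(dst):
    """Return the output port for dst, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)

    # Longest-prefix match over the first-level table.
    best = None
    for prefix, port in PREFIX_TABLE:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, port)

    if best is None:
        return None
    if best[1] is not None:
        return best[1]

    # Fell through to the default route: match on the host suffix instead.
    last_octet = int(addr) & 0xFF
    for suffix, port in SUFFIX_TABLE:
        if last_octet == suffix:
            return port
    return None


print(lookup("10.2.0.3"))  # intra-pod: matched by prefix
print(lookup("10.3.0.2"))  # inter-pod: matched by suffix
```

Matching inter-pod traffic on the host suffix spreads flows to different destinations across the uplinks while keeping all packets to one destination on the same path.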

TCAM

Encoder

Prefix    Next Hop    Port
00        10.2.0.1    0
01        10.2.1.1    1
10        10.4.1.1    2
11        10.4.1.2    3

Not Perfect

(Figure labels: (k/2)² core switches; k pods, each with k/2 aggregation switches and k/2 edge switches)

Fat-Tree Status

Incast

• RTO = SRTT + (4 × RTTVAR)
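
A rough sketch of how SRTT and RTTVAR feed the RTO formula, following the smoothing rules in RFC 6298 (gains of 1/8 and 1/4). The class name `RtoEstimator` and the `min_rto` parameter are illustrative, and the clock-granularity term from the RFC is left out to match the formula on the slide.

```python
class RtoEstimator:
    """Tracks SRTT and RTTVAR and derives RTO = SRTT + 4 * RTTVAR (seconds)."""

    ALPHA = 1 / 8  # gain for SRTT
    BETA = 1 / 4   # gain for RTTVAR

    def __init__(self, min_rto=1.0):
        self.min_rto = min_rto  # lower bound applied to every computed RTO
        self.srtt = None
        self.rttvar = None

    def update(self, rtt):
        """Fold in one RTT sample and return the new RTO."""
        if self.srtt is None:
            # First measurement initializes both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto()

    def rto(self):
        return max(self.min_rto, self.srtt + 4 * self.rttvar)


# A 100 microsecond data center RTT with and without a 1 second floor.
print(RtoEstimator(min_rto=1.0).update(100e-6))
print(RtoEstimator(min_rto=0.0).update(100e-6))
```

With data center RTTs in the microsecond range, the floor rather than the measured RTT determines the timeout, so a single timeout costs thousands of RTTs.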

Behavior

(from Phanishayee et al, “Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems”)

RFC 6298 (2.4): “Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second.”

• In practice, often 200 ms

RFC 2581: “The delayed ACK algorithm specified in [Bra89] SHOULD be used by a TCP receiver. When used, a TCP receiver MUST NOT excessively delay acknowledgments. Specifically, an ACK SHOULD be generated for at least every second full-sized segment, and MUST be generated within 500 ms of the arrival of the first unacknowledged packet.”

• In practice, often 40 ms

Solutions

• Proposal 1: Adjust RTO (Vasudevan et al.)

• Proposal 2: DCTCP (Alizadeh et al.)

RTT

RTT 2

RTO

• Make RTOmin 200µs

• Timeout = RTO + (rand(0.5) × RTO)
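
The two changes above could be sketched as follows. This assumes `rand(0.5)` means a uniform random value between 0 and 0.5; the function name `randomized_timeout` and the `rng` parameter are illustrative.

```python
import random

RTO_MIN = 200e-6  # 200 microseconds, the minimum RTO proposed on the slide


def randomized_timeout(rto, rng=random):
    """Return RTO + rand(0.5) * RTO, after applying the RTO floor.

    Assumption: rand(0.5) is drawn uniformly from [0, 0.5].
    """
    rto = max(rto, RTO_MIN)
    return rto + rng.uniform(0.0, 0.5) * rto


# Ten senders that time out together retransmit at slightly different times.
rng = random.Random(144)
print([round(randomized_timeout(1e-3, rng) * 1e6) for _ in range(10)])
```

The random component desynchronizes senders that lost packets in the same burst, so their retransmissions do not collide at the switch again.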

Improvement

Wide Area

DCTCP

• Three goals

  • Low latency for short flows

  • High burst tolerance (incast)

  • High throughput for long flows

• Basic approach: keep switch queues short

Queue Length

• RTT measurements are noisy

• At high speeds, queueing delay is very small

  • GigE: 10 packets is 120µs

  • 10GigE: 10 packets is 12µs

• Use ECN (explicit congestion notification)

  • RFC 3168
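
The queueing-delay figures can be checked with a short calculation. This sketch assumes 1500-byte packets, which the slide does not state but which reproduces its numbers; the function name `queue_delay_us` is illustrative.

```python
def queue_delay_us(packets, link_bps, packet_bytes=1500):
    """Time in microseconds to drain `packets` queued packets at `link_bps`.

    Assumption: every packet is packet_bytes long (1500 by default).
    """
    bits = packets * packet_bytes * 8
    return bits * 1_000_000 / link_bps


print(queue_delay_us(10, 1_000_000_000))   # GigE
print(queue_delay_us(10, 10_000_000_000))  # 10 GigE
```

A delay difference of tens of microseconds is hard to separate from noise in RTT samples, which is the motivation for an explicit signal from the switch.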

Setting ECN

• Switch sets the ECN bit on arriving packets when the queue length exceeds a threshold K
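
A toy model of threshold marking at a switch queue, for illustration only. It assumes a packet is marked when the queue occupancy on its arrival is greater than K; the `MarkingQueue` class, the dictionary packet representation, and the `"ce"` key are hypothetical.

```python
from collections import deque


class MarkingQueue:
    """FIFO queue that marks arriving packets once occupancy exceeds K."""

    def __init__(self, k, capacity):
        self.k = k                # marking threshold, in packets
        self.capacity = capacity  # buffer size, in packets
        self.packets = deque()

    def enqueue(self, packet):
        """Queue a packet; return False if the buffer is full and it is dropped."""
        if len(self.packets) >= self.capacity:
            return False
        if len(self.packets) > self.k:
            packet["ce"] = True  # set the ECN congestion bit
        self.packets.append(packet)
        return True

    def dequeue(self):
        return self.packets.popleft() if self.packets else None


queue = MarkingQueue(k=2, capacity=5)
burst = [{"ce": False} for _ in range(6)]
accepted = [queue.enqueue(p) for p in burst]
print(accepted)
print([p["ce"] for p in burst])
```

Marking on instantaneous queue length, well below the buffer capacity, lets senders react before any packet is dropped.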

Monitoring α

• Per RTT, measure F, the fraction of packets sent that had the ECN bit set

  • DCTCP acks copy the ECN bit of the corresponding data packets into ECN-Echo field

• Compute α, EWMA of F
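
The running estimate α could be maintained along these lines. The gain `g = 1/16` is the value suggested in the DCTCP paper rather than something stated on the slide, and the function name `update_alpha` is illustrative.

```python
def update_alpha(alpha, marked, total, g=1 / 16):
    """Fold one RTT's worth of ACK feedback into the EWMA alpha.

    marked: packets in the last window whose ACKs carried ECN-Echo
    total:  packets acknowledged in the last window
    g:      EWMA gain (assumed value; 1/16 is suggested in the DCTCP paper)
    """
    if total == 0:
        return alpha  # no feedback this RTT, keep the old estimate
    f = marked / total
    return (1 - g) * alpha + g * f


alpha = 0.0
for marked in (0, 5, 10, 10):  # marked packets out of 10 in four successive RTTs
    alpha = update_alpha(alpha, marked, 10)
    print(round(alpha, 4))
```

Because α is smoothed over several RTTs, it reflects the extent of congestion rather than a single marked packet.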

Adjusting cwnd

• cwnd = cwnd × (1 - α/2)
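
A small sketch of the window reduction, to show how it behaves at the extremes. The one-segment floor `min_cwnd` and the function name `reduce_cwnd` are assumptions added for illustration.

```python
def reduce_cwnd(cwnd, alpha, min_cwnd=1.0):
    """Apply cwnd = cwnd * (1 - alpha / 2), never going below min_cwnd segments."""
    return max(min_cwnd, cwnd * (1 - alpha / 2))


for alpha in (0.0, 0.1, 0.5, 1.0):
    print(alpha, reduce_cwnd(100.0, alpha))
```

When every packet is marked (α = 1) the window is halved, as in standard TCP; when few packets are marked the reduction is correspondingly gentle, which keeps throughput high for long flows.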

DCTCP Caveat

“We stress that DCTCP is designed for the data center environment. In this paper, we make no claims about suitability of DCTCP for wide area networks.”

Data Center Networks

• Very different from the wide area Internet

  • Tiny RTTs

  • Different traffic patterns

  • Single administrative domain

• Standards (e.g., IETF) much less important

• A lot of very novel network design