Data Center Networks

Professor H. T. Kung

Harvard School of Engineering and Applied Sciences


Copyright © 2010 by H. T. Kung

(Lecture #3)

1/04/2010

2

Three Approaches

Main References

VL2: A Scalable and Flexible Data Center Network, SIGCOMM 2009 (Lecture #1 12/21/2009)

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric, SIGCOMM 2009 (Lecture #2 12/23/2009)

BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers, SIGCOMM 2009 (Lecture #3, today’s lecture)

3

Approach 1:

Virtual Layer Two Approach

Use a highly redundant multipath layer-3 network as a virtual layer-2 network

[Figure labels: complete bipartite layer-3 interconnection; multi-rooted tree]

4

Approach 2: The PortLand Approach

Switches discover their position in the topology

Pseudo MAC (PMAC) addresses are assigned to all end hosts to encode their position in the topology

The hierarchical PMAC addresses enable efficient, provably loop-free forwarding with small switch state

[Figure: multi-rooted tree with core, aggregation, and edge switch layers above the hosts, grouped into Pod 0 through Pod 3]

5

Approach 3:

Server-centric Source-routing

This is a peer-to-peer approach in which the peer nodes keep state and do the routing

Can use commodity switches

Graceful performance degradation under faulty conditions

Suited to shipping-container based, modular data centers, where physical access by service personnel can be difficult or not allowed due to regulations

Not a multi-rooted tree!

6

Review of the Last Week’s Exam (1/3)

(1) These are true-false questions.

(a) [2] For VL2, the rack and cluster switches in Ref. #1 can actually be IP routers. (True)

(b) [2] For PortLand, the rack and cluster switches in Ref. #1 can actually be IP routers. (False)

(c) [2] When PortLand uses TCP to avoid packet loss in a data center, a TCP header will need to be added to each packet. (True)

(d) [2] In VL2 and PortLand, when a new host is added to the data center, the network will automatically learn the position of the host so it can be reached by other hosts. (True)

(e) [2] Multicast support is useful for GFS. (True)

(f) [4] In both VL2 and PortLand, multi-rooted tree topologies are used. Is it true that the multi-rooted tree topologies are useful for all of the following purposes: scaling network bandwidth, fault tolerance and multicast? (True)

7

Review of the Last Week’s Exam (2/3)

(2) [6] We noted in class that putting servers to sleep will save power, but may make local disks unavailable. Give three ideas on how to solve or alleviate this problem.

Answer: data replication, robotic arms for disk drive insertion/removal, software cache, and putting storage on a switching/network fabric rather than CPU buses.

(3) [15] (The first and second correct answers earn 5 and 10 points, respectively) VL2 and PortLand share similar approaches in several aspects in providing large layer-2 networks for data centers. For example, they both use multi-rooted tree topologies. Please describe two other areas where both methods share similar approaches. Please give succinct answers in the bullet form. Hints: Think about addressing.

Answer:

i. Hierarchical addressing

VL2: hierarchical IP addresses

PortLand: hierarchical Pseudo MAC (PMAC) addresses

ii. Separation of host identifier and host location

VL2: AA vs. LA

PortLand: AMAC vs. PMAC

8

Review of the Last Week’s Exam (3/3)

(4) [10 points] VL2 and PortLand share drawbacks in some similar ways. Describe one such area where both methods may potentially have similar performance problems. Please use no more than a total of 30 words in your answers. Hints: think about possible congestion or update issues.

Answer:

i. Congestion problem for "elephant flows"

ii. Update delay and overhead for location addresses LA and PMAC

(5) [15 points] When discussing PortLand in class, we showed a three-layer multi-rooted tree based on k-port switches with k = 6 (slide 12 of Lecture #2). We noted that the total amount of bandwidth connecting the top two layers of switches is less than that connecting the bottom two layers of switches. As pointed out by someone in class, we can fix this problem by adding some additional switches in the top layer. How many additional switches do we need? Show the resulting drawing, like the one on slide 12. To save time in drawing, you should just add nodes and links on top of the existing drawing of slide 12.

Answer:

Add three additional switches in the top layer.

For the three switches in the middle layer of each pod, connect each switch to a separate added switch.

9

Container-based Datacenter (1/2)

Placing the server racks (thousands of servers) into a standard shipping container and integrating heat exchange and power distribution into the container

Air handling is similar to in-rack cooling and typically allows higher power densities than regular raised-floor datacenters

The container-based facility has achieved extremely high energy efficiency ratings compared with typical datacenters today

Microsoft Data Center Near Chicago

(9/30/2009)

Source: http://www.datacenterknowledge.com/archives/2009/09/30/microsoft-unveils-its-container-powered-cloud

10

Container-based Datacenter (2/2)

The shipping-container based, modular data center (MDC) offers a new way in which data centers are built and deployed. In an MDC, up to a few thousand servers are interconnected via switches to form the network infrastructure, say, a typical two- or three-level tree in current practice. All the servers and switches are then packed into a standard 20- or 40-foot shipping container

No longer tied to a fixed location, organizations can place the MDC anywhere they intend and then relocate as their requirements change

In addition to a high degree of mobility, an MDC has other benefits including shorter deployment time, higher system and power density, and lower cooling and manufacturing cost

11

BCube: A Network Architecture

for Modular Data Centers

BCube is a network architecture specifically designed for shipping-container based, modular data centers

At the core of the BCube architecture is its server-centric network structure, where servers with multiple network ports connect to multiple layers of commercial off-the-shelf (COTS) mini-switches. Servers act not only as end hosts, but also as relay nodes for each other. BCube supports various bandwidth-intensive applications

BCube exhibits graceful performance degradation as the server and/or switch failure rate increases. This property is of special importance for shipping-container data centers, since once the container is sealed and operational, it becomes very difficult to repair or replace its components

12

Goals

Support bandwidth-intensive traffic patterns among data center servers:

One-to-one

One-to-several (e.g., distributed file systems)

One-to-all (e.g., application data broadcasting)

All-to-all (e.g., MapReduce)

Beyond using commodity servers, go one step further by using only low-end COTS mini-switches. This option eliminates expensive high-end switches

Different from a traditional data center, it is difficult or even impossible to service an MDC once it is deployed. Therefore, BCube needs to achieve graceful performance degradation in the presence of server and switch failures

13

Approach

Take the server-centric approach, rather than the switch-oriented practice. It places intelligence on MDC servers and works with commodity switches

Provide multiple parallel short paths between any pair of servers

BCube not only provides high one-to-one bandwidth, but also greatly improves fault tolerance and load balancing

BCube accelerates one-to-x traffic by constructing edge-disjoint complete graphs and multiple edge-disjoint server spanning trees.

Moreover, due to its low diameter, BCube provides high network capacity for all-to-all traffic such as MapReduce

BCube runs a source routing protocol called BSR (BCube Source Routing). BSR places routing intelligence solely onto servers. By taking advantage of the multi-path property of BCube and by actively probing the network, BSR balances traffic and handles failures without link-state distribution (this is a typical p2p probing method). With BSR, the capacity of BCube decreases gracefully as the server and/or switch failure rate increases

BCube uses more wires than the tree structure. “But wiring is a solvable issue for containers which are at most 40-feet long” (a strange argument!)

14

Requirement 1: Support for Bandwidth-intensive Traffic

One-to-one, which is the basic traffic model in which one server moves data to another server. For example, this takes place on server pairs that exchange large amounts of data such as disk backup. Good one-to-one support also results in good several-to-one and all-to-one support

One-to-several, in which one server transfers the same copy of data to several receivers. Current distributed systems such as GFS, HDFS, and CloudStore, replicate data chunks of a file several times (typically three) at different chunk servers to improve reliability. When a chunk is written into the file system, it needs to be simultaneously replicated to several servers.

One-to-all, in which a server transfers the same copy of data to all the other servers in the cluster. There are several cases in which one-to-all happens: to upgrade the system image, to distribute application binaries, or to distribute specific application data

All-to-all, in which every server transmits data to all the other servers. The representative example of all-to-all traffic is MapReduce. The reduce phase of MapReduce needs to shuffle data among many servers, thus generating an all-to-all traffic pattern

15

Requirement 2: Use of Low-end Commodity Switches

Current data centers use commodity PC servers, but high-end switches/routers. We want to use low-end non-programmable COTS switches instead of the high-end ones, based on the observation that the per-port price of the low-end switches is much lower than that of the high-end ones

The COTS switches, however, can speak only the spanning tree protocol, which cannot fully utilize the links in advanced network structures (why?). The switch boxes are generally not as open as the server computers. Re-programming the switches for new routing and packet forwarding algorithms is much harder, if not impossible, compared with programming the servers. This is a challenge we need to address

16

Requirement 3: Graceful Performance Degradation

Given that we only assume commodity servers and switches in a shipping-container data center, we should assume a failure model of frequent component failures. Moreover, an MDC is prefabricated in a factory, and it is rather difficult, if not impossible, to service an MDC once it is deployed in the field, due to operational and space constraints (“data center in a shipping-container” is analogous to “system on a chip” built with low-power transistors which may fail)

Therefore, it is important that we design our network architecture to be fault tolerant and to degrade gracefully in the presence of continuous component failures

17

BCube’s Recursively Defined Topology

Let n be the expansion factor at each level. That is, the total number of servers increases by a factor of n with each additional level. Throughout this class, we assume n = 4, unless stated otherwise

BCube_k at level k is constructed by connecting n copies (here n = 4) of BCube_(k-1) at level k-1 using n^k n-port switches

Each switch connects n servers, each in a separate BCube_(k-1)

Each server in BCube_k has k + 1 ports, each connecting to a switch at a separate level

How many paths are there between server 00 and server 21? (see a later slide)

[Figure: BCube_1 (i.e., k = 1)]

18

Constructing Level 2 from Level 1

For BCube_k, we have:

k + 1 levels: level-0 through level-k

The number of servers is n^(k+1)

The number of n-port switches at each level is the same, that is, n^k. Thus the total number of switches is (k + 1)n^k

For example, with n = 8 and k = 3, BCube_3 connects 8^4 = 4096 servers in four levels by using 8^3 = 512 8-port switches at each level

Note that switches only connect to servers and never directly connect to other switches. We can treat the switches as dummy crossbars that connect several neighboring servers and let servers relay traffic for each other
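The counts above can be checked by building the structure directly. The following is a minimal sketch, not the authors' code: the function name `build_bcube` and the choice to key each switch by its level plus the server digits it does not vary are assumptions made for illustration.

```python
from itertools import product

def build_bcube(n, k):
    """Return (servers, switches) for BCube_k built from n-port switches.

    A server is a tuple of k+1 digits (a_k, ..., a_0), each in 0..n-1.
    A switch is keyed by (level, remaining digits) and maps to the n
    servers that agree on every digit except the one at `level`.
    """
    servers = list(product(range(n), repeat=k + 1))
    switches = {}
    for s in servers:
        for level in range(k + 1):
            pos = k - level  # tuple index of digit a_level
            key = (level, s[:pos] + s[pos + 1:])
            switches.setdefault(key, []).append(s)
    return servers, switches

servers, switches = build_bcube(n=8, k=3)
print(len(servers))   # 4096 = 8^4
print(len(switches))  # 2048 = 4 levels x 8^3 switches per level
```

Because every switch entry lists only servers, the sketch also reflects the point that switches never connect to other switches.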

19

How to Route from Server 00 to Server 21?

[Figure labels: Level 0: fix 1st digit; Level 1: fix 2nd digit]

The blue path fixes the 1st digit first and then the 2nd digit, whereas the red path uses the reverse order

Note that the blue and red paths are node-disjoint. This is not an accident!

Question: Are there other paths from 00 to 21?

There is no magic here: The BCube topology is essentially the well-known hypercube topology, generalized to n values per digit. Routing over BCube can be understood by examining the intuitive routing we can easily see on a hypercube

20

Hypercube

[Figure: recursive construction of binary q-cubes]

(a) Binary 1-cube (2-node), built of two binary 0-cubes, labeled 0 and 1

(b) Binary 2-cube (4-node), built of two binary 1-cubes, labeled 0 and 1

(c) Binary 3-cube (8-node), built of two binary 2-cubes, labeled 0 and 1

(d) Binary 4-cube (16-node), built of two binary 3-cubes, labeled 0 and 1

Source: Slides from “Introduction to Parallel Processing: Algorithms and Architectures” by Behrooz Parhami

21

The 64-Node Hypercube

[Figure: the 64-node hypercube; only sample wraparound links are shown to avoid clutter]

Isomorphic to the 4 x 4 x 4 3D torus (each has 64 x 6/2 = 192 links)



22

Neighbors of a Node in a Hypercube

ID of node x: x_(q-1) x_(q-2) . . . x_2 x_1 x_0

Dimension-0 neighbor N_0(x): x_(q-1) x_(q-2) . . . x_2 x_1 x_0'

Dimension-1 neighbor N_1(x): x_(q-1) x_(q-2) . . . x_2 x_1' x_0

. . .

Dimension-(q-1) neighbor N_(q-1)(x): x_(q-1)' x_(q-2) . . . x_2 x_1 x_0

(x_i' denotes the complement of bit x_i; these are the q neighbors of node x)

Nodes whose labels differ in k bits (at Hamming distance k) are connected by a shortest path of length k

Both node- and edge-symmetric

Strengths: symmetry, log diameter, and linear bisection width

Weakness: poor scalability due to many long interconnection wires

[Figure: a node x of the 4-cube and its neighbors along Dim 0 through Dim 3]
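The neighbor rule for a binary hypercube amounts to flipping one bit of the node ID. A minimal sketch follows; the function names are chosen here for illustration and do not come from the slides.

```python
def neighbors(x, q):
    """Return the q neighbors of node x in a binary q-cube.

    The dimension-i neighbor is x with bit i complemented.
    """
    return [x ^ (1 << i) for i in range(q)]

def hamming_distance(x, y):
    """Number of differing bits, which is the shortest path length."""
    return bin(x ^ y).count("1")

print([format(v, "04b") for v in neighbors(0b0000, 4)])
# ['0001', '0010', '0100', '1000']
print(hamming_distance(0b0000, 0b1011))  # 3
```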



23

BCube Uses Switches to Implement Hypercube Links

[Figure: a 16-node hypercube next to a 16-node BCube, in which switches (Sw) take the place of the hypercube links]

24

Hypercube Routing Gives BCube Routing

[Figure: a 16-node hypercube next to a 16-node BCube, with the corresponding routes through the switches (Sw)]

Thus BCubeRouting is the same as the routing algorithm for Hypercube

25

Single-path Routing in BCube

In BCubeRouting, A = a_k a_(k-1) ... a_0 is the source server and B = b_k b_(k-1) ... b_0 is the destination server. We systematically build a series of intermediate servers by “correcting” one digit of the previous server. Hence the path length is at most k+1

Note that the intermediate switches in the path can be uniquely determined by their two adjacent servers, and hence are omitted from the path
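The digit-correcting idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions: servers are tuples (a_k, ..., a_0), and the `order` parameter (the sequence of levels to correct) is a name introduced here, defaulting to level k down to level 0.

```python
def bcube_routing(src, dst, order=None):
    """Return the list of servers on a path from src to dst.

    Each hop corrects one digit of the current server to match dst.
    Switches are omitted: two adjacent servers differ in exactly one
    digit, which identifies the switch between them.
    """
    k = len(src) - 1
    if order is None:
        order = range(k, -1, -1)
    path = [src]
    cur = list(src)
    for level in order:
        pos = k - level  # tuple index of digit a_level
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append(tuple(cur))
    return path

print(bcube_routing((0, 0), (2, 1), order=[1, 0]))  # [(0, 0), (2, 0), (2, 1)]
print(bcube_routing((0, 0), (2, 1), order=[0, 1]))  # [(0, 0), (0, 1), (2, 1)]
```

The two calls reproduce the two routes from server 00 to server 21 discussed earlier: they differ only in the order in which the digits are corrected.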

26

Multi-paths for One-to-one Traffic

Two paths between a source server and a destination server are parallel if they are node-disjoint, i.e., the intermediate servers and switches on one path do not appear on the other

Theorem. There are k + 1 parallel paths between any two servers in a BCube_k

BCube should also support several-to-one and all-to-one traffic patterns well. We can fully utilize the multiple links of the destination server to accelerate these x-to-one traffic patterns
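One way to see where the k + 1 parallel paths come from is to start one path at each level of the source. The sketch below follows the general shape of the construction described in the BCube paper as I understand it, so treat the details as assumptions: for a level where source and destination differ, digits are corrected in rotated order starting at that level; for a level where they agree, the path first detours to a neighbor at that level (here the hypothetical choice `(digit + 1) mod n`) and corrects that digit last. It assumes src != dst and n >= 2.

```python
def route(src, dst, order):
    """Correct digits of src toward dst in the given order of levels."""
    k = len(src) - 1
    cur, path = list(src), [src]
    for level in order:
        pos = k - level
        if cur[pos] != dst[pos]:
            cur[pos] = dst[pos]
            path.append(tuple(cur))
    return path

def build_path_set(src, dst, n):
    """Return k+1 paths from src to dst, one leaving src at each level."""
    k = len(src) - 1
    paths = []
    for i in range(k, -1, -1):
        pos = k - i
        if src[pos] != dst[pos]:
            # Rotated order: i, i-1, ..., wrapping around modulo k+1.
            order = [(i - t) % (k + 1) for t in range(k + 1)]
            paths.append(route(src, dst, order))
        else:
            # Detour through a level-i neighbor; digit i is fixed last.
            detour = list(src)
            detour[pos] = (src[pos] + 1) % n
            order = [(i - 1 - t) % (k + 1) for t in range(k + 1)]
            paths.append([src] + route(tuple(detour), dst, order))
    return paths

for p in build_path_set((0, 0), (0, 1), n=4):
    print(p)
# [(0, 0), (1, 0), (1, 1), (0, 1)]
# [(0, 0), (0, 1)]
```

Note that the detour path is longer than the direct one, which is why the parallel paths are not all shortest paths.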

27

Speedup for One-to-several Traffic

Edge-disjoint complete graphs with k + 2 servers can be efficiently constructed in a BCube_k. These complete graphs can speed up data replications in distributed file systems like GFS

28

BCube Source Routing (BSR)

In BSR, the source server decides which path a packet flow should traverse by probing the network and encodes the path in the packet header

Source routing has the following advantages:

The source can control the routing path without coordination with the intermediate servers (this is suited for data center management, why?)

Intermediate servers are not involved in routing and just forward packets based on the packet header. This simplifies their functionality

By reactively probing the network, we can avoid link state broadcasting, which suffers from scalability concerns when thousands of servers are in operation

When a new flow arrives, the source sends probe packets over multiple parallel paths. The intermediate servers process the probe packets to fill in the needed information, e.g., the minimum available bandwidth of their input/output links. The destination returns a probe response to the source. When the source receives the responses, it uses a metric to select the best path, e.g., the one with maximum available bandwidth

29

The PathSelection Procedure

A source uses BuildPathSet to obtain k + 1 parallel paths and then probes these paths. If one path is found not available, the source uses the Breadth First Search (BFS) algorithm to find another parallel path. For n = 8 and k = 3, the execution time of BFS is less than 1 millisecond

An intermediate server updates the available bandwidth field of the probe packet if its available bandwidth is smaller than the existing value

A destination server updates the available bandwidth field of the probe packet if the available bandwidth of the incoming link is smaller than the value carried in the probe packet. It then sends the value back to the source in a probe response message
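The probe handling described above reduces to a running minimum along each path followed by a maximum across paths. A minimal sketch, assuming a hypothetical `available_bw` dictionary keyed by (from_server, to_server) links and made-up bandwidth values; real probes are packets processed hop by hop, which this simulates in a loop.

```python
def probe(path, available_bw):
    """Simulate a probe: each hop lowers the field if its link has less."""
    field = float("inf")
    for link in zip(path, path[1:]):
        field = min(field, available_bw[link])
    return field

def select_path(paths, available_bw):
    """Pick the path whose probe response reports the most bandwidth."""
    return max(paths, key=lambda p: probe(p, available_bw))

# Two parallel paths from server 00 to server 21 (hypothetical values, Gb/s).
path_a = [(0, 0), (2, 0), (2, 1)]
path_b = [(0, 0), (0, 1), (2, 1)]
available_bw = {
    ((0, 0), (2, 0)): 0.9, ((2, 0), (2, 1)): 0.3,
    ((0, 0), (0, 1)): 0.6, ((0, 1), (2, 1)): 0.7,
}
print(probe(path_a, available_bw))  # 0.3
print(probe(path_b, available_bw))  # 0.6
print(select_path([path_a, path_b], available_bw))  # [(0, 0), (0, 1), (2, 1)]
```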

30

Path Adaptation

During the lifetime of a flow, its path may break due to various failures and the network condition may change significantly as well. The source periodically (say, every 10 seconds) performs path selection to adapt to network failures and dynamic network conditions

When an intermediate server finds that the next hop of a packet is not available, it sends a path failure message back to the source. As long as there are paths available, the source does not probe the network immediately when the message is received. Instead, it switches the flow to one of the available paths obtained from the previous probing. When the probing timer expires, the source will perform another round of path selection and try its best to maintain k + 1 parallel paths

When multiple flows between two servers arrive simultaneously, they may select the same path. To make things worse, after the path selection timers expire, they will probe the network and switch to another path simultaneously. This results in path oscillation. We mitigate this symptom by injecting randomness into the timeout value of the path selection timers
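The randomized timer can be sketched as follows. The slides give only the nominal period (about 10 seconds) and say that randomness is injected; the uniform distribution and the 20% jitter fraction below are assumptions for illustration.

```python
import random

def next_probe_delay(base=10.0, jitter=0.2, rng=random):
    """Return a path-selection timeout in seconds, spread around `base`.

    `jitter` is the maximum fractional deviation (an assumed parameter),
    so flows that started together drift apart instead of re-probing and
    switching paths in lockstep.
    """
    return base * (1.0 + rng.uniform(-jitter, jitter))

rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [next_probe_delay(rng=rng) for _ in range(3)]
print(all(8.0 <= d <= 12.0 for d in delays))  # True
```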

31

Packaging and Wiring

We show how packaging and wiring can be addressed for a container with 2048 servers and 1280 8-port switches (a partial BCube with n = 8 and k = 3). The interior size of a 40-foot container is 12m x 2.35m x 2.38m

In the container, we deploy 32 racks in two columns, with 16 racks in each column. Each rack accommodates 44 rack units (or 1.96m high)

We use 32 rack units to host 64 servers as the current practice can pack two servers into one unit, and 10 rack units to host 40 8-port switches. The 8-port switches are small enough, and we can easily put 4 into one rack unit. Altogether, we use 42 rack units and have 2 unused units
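The rack budget above is simple arithmetic and can be checked mechanically. All figures in this sketch are taken from the slides; only the variable names are introduced here.

```python
RACKS = 32
UNITS_PER_RACK = 44
SERVERS_PER_RACK = 64
SWITCHES_PER_RACK = 40
SERVERS_PER_UNIT = 2   # two servers fit in one rack unit
SWITCHES_PER_UNIT = 4  # four 8-port mini-switches fit in one rack unit

server_units = SERVERS_PER_RACK // SERVERS_PER_UNIT
switch_units = SWITCHES_PER_RACK // SWITCHES_PER_UNIT
used_units = server_units + switch_units

print(server_units, switch_units)   # 32 10
print(used_units)                   # 42
print(UNITS_PER_RACK - used_units)  # 2 unused units per rack
print(RACKS * SERVERS_PER_RACK)     # 2048 servers in the container
print(RACKS * SWITCHES_PER_RACK)    # 1280 switches in the container
```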

32

Packaging and Wiring (Cont.)

As for wiring, Gigabit Ethernet copper wires can be 100 meters long, which is much longer than the perimeter of a 40-foot container. And there is enough space to accommodate these wires. We use 64 servers within a rack to form a BCube_1 and 16 8-port switches within the rack to interconnect them

The wires of the BCube_1 are inside the rack and do not go out. The inter-rack wires are level-2 and level-3 wires and we place them on the top of the racks

We divide the 32 racks into four super-racks. A super-rack forms a BCube_2 and there are two super-racks in each column. We evenly distribute the level-2 and level-3 switches into all the racks, so that there are 8 level-2 and 16 level-3 switches within every rack. The level-2 wires are within a super-rack and level-3 wires are between super-racks

Our calculation shows that the maximum number of level-2 and level-3 wires along a rack column is 768 (256 and 512 for level-2 and level-3, respectively). The diameter of an Ethernet wire is 0.54cm. The maximum space needed is approximately 176 cm^2 < (20cm)^2. Since the available height from the top of the rack to the ceiling is 42cm, there is enough space for all the wires
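The 176 cm^2 figure can be reproduced by summing the circular cross-sections of the wires. This sketch assumes that reading of the calculation; it ignores the gaps between bundled round wires, which is one reason comparing against a 20 cm x 20 cm budget leaves a comfortable margin.

```python
import math

level2_wires = 256
level3_wires = 512
wire_diameter_cm = 0.54

wires = level2_wires + level3_wires
area_cm2 = wires * math.pi * (wire_diameter_cm / 2) ** 2

print(wires)                 # 768
print(round(area_cm2))       # 176
print(area_cm2 < 20.0 ** 2)  # True
```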

33

Graceful Degradation

The aggregate bottleneck throughput (ABT) is the throughput of the bottleneck flow times the number of total flows in the all-to-all traffic model. ABT reflects the all-to-all network capacity

[Figure: two plots, with x-axes Server Failure Rate (%) and Switch Failure Rate (%)]
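The ABT definition translates directly into code. A minimal sketch with made-up per-flow throughputs; the function name is chosen here for illustration.

```python
def aggregate_bottleneck_throughput(flow_throughputs):
    """ABT = (throughput of the slowest flow) x (number of flows)."""
    if not flow_throughputs:
        raise ValueError("need at least one flow")
    return min(flow_throughputs) * len(flow_throughputs)

# Hypothetical throughputs for three flows, in arbitrary units.
print(aggregate_bottleneck_throughput([2.0, 1.0, 4.0]))  # 3.0
```

In an all-to-all pattern among N servers there are N(N-1) flows, so a single slow flow drags the whole metric down. That is why ABT is a demanding measure of how evenly the network spreads load.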

34

Implementation Architecture

The BCube architecture includes a BCube protocol stack. The BCube stack is located between the TCP/IP protocol driver and the Ethernet NDIS driver. The BCube driver sits at layer 2.5: to the TCP/IP driver, it is an NDIS driver; to the real Ethernet driver, it is a protocol driver

If we directly use the 32-bit addresses, we need many bytes to store the complete path. For example, we need 32 bytes when the maximum path length is 8. We leverage the fact that neighboring servers in BCube differ in only one digit in their address arrays to reduce the space needed for an intermediate server, from four bytes to only one byte
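The one-byte-per-hop idea can be sketched as follows. Since neighboring servers differ in a single digit, each hop only needs to say which digit changes and to what value. The bit layout used here (top 2 bits for the level, low 6 bits for the new digit value) is an assumption for illustration, and the function names are hypothetical.

```python
def encode_path(path):
    """Encode a server path as one byte per hop: (level << 6) | digit."""
    k = len(path[0]) - 1
    out = bytearray()
    for prev, nxt in zip(path, path[1:]):
        diff = [lvl for lvl in range(k + 1) if prev[k - lvl] != nxt[k - lvl]]
        if len(diff) != 1:
            raise ValueError("adjacent servers must differ in exactly one digit")
        level = diff[0]
        digit = nxt[k - level]
        if level > 3 or digit > 63:
            raise ValueError("does not fit the assumed 2-bit/6-bit layout")
        out.append((level << 6) | digit)
    return bytes(out)

def decode_path(src, encoded):
    """Rebuild the full server path from the source and the hop bytes."""
    k = len(src) - 1
    cur, path = list(src), [src]
    for b in encoded:
        level, digit = b >> 6, b & 0x3F
        cur[k - level] = digit
        path.append(tuple(cur))
    return path

hops = encode_path([(0, 0), (2, 0), (2, 1)])
print(list(hops))                 # [66, 1]
print(decode_path((0, 0), hops))  # [(0, 0), (2, 0), (2, 1)]
```

With this layout a path of 8 hops takes 8 bytes rather than the 32 bytes needed for eight 32-bit addresses.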

35

Implementation Architecture (Cont.)

Components:

•BSR Protocol for Routing

•Neighbor Maintenance Protocol (maintains a neighbor status table)

•Packet sending/receiving part (interacts with the TCP/IP stack)

•Packet Forwarding Engine (relays packets for other servers)

Header:

•Between the Ethernet Header and IP Header

•Contains typical fields

•Similar to DCell: 1-1 mapping between IP and BCube addresses

•Different from DCell: every BCube packet stores the complete path and a next hop index (NHI)

–Using 1-digit address difference between neighbors, path is stored efficiently

36

Packet Forwarding Engine

Neighbor Status Table:

•Maintained by Neighbor Maintenance Protocol

•Consists of Neighbor MACs, connecting output ports, and a Status Flag indicating availability

•Table is almost static (MACs change when a neighboring NIC is replaced, status flag changes when the neighbor’s status changes.)

Forwarding:

•Only one lookup for forwarding:

–Gets the packet, checks the NHA (next hop array) for status and MAC of the next hop

–Checks the Neighbor Status Table if it is alive

–Does Checksum

–Forwards the packet to the identified output port

•Because of PCI Interface limitations (160Mb/s) software implementation is used

37

Testbed

•16 Servers + 8 8-port Gigabit Ethernet mini-switches

–BCube_1 with 4 BCube_0s

•No disk I/O

•No Ethernet flow control

38

CPU Overhead for Packet Forwarding

39

Bandwidth-Intensive Application Support

MTU: 9KB

Tests: 1-1, 1-M, 1-All, All-All

Topology: [figure]

40


41


42

Performance Comparisons

43

Cost, Power, and Wiring Comparison

44

Conclusion

By installing a small number of network ports at each server and using COTS mini-switches as crossbars, and putting routing intelligence at the server side, BCube forms a server-centric architecture

We have shown that BCube significantly accelerates one-to-x traffic patterns and provides high network capacity for all-to-all traffic

The BSR routing protocol further enables graceful performance degradation

Future work will study how to scale the current server-centric design from the single container to multiple containers
