[Fat-tree]: A Scalable, Commodity Data Center Network
Architecture
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat. Presented by:
Bassma Aldahlan Outline Cloud Computing Data Center Networks
Problem Space Paper Goal
Background Architecture Implementation Evaluation Results and
Ongoing work Cloud Computing "Cloud computing is a model for
enabling convenient, on-demand network access to a shared pool of
configurable computing resources (for example, networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or service
provider interaction." Data Center Networks Data Centers have
recently received significant attention as a cost-effective
infrastructure for storing large volumes of data and hosting
large-scale service applications Data Center Networks A data center
(DC) is a facility consisting of servers (physical machines),
storage and network devices (e.g., switches, routers, and cables),
power distribution systems, cooling systems A data center network
is the communication infrastructure used in a data center, and is
described by the network topology, routing/switching equipment, and
the protocols used (e.g., Ethernet and IP). Cloud vs. Datacenter
Cloud
an off-premise form of computing that stores data on the Internet
services are outsourced to third-party cloud providers Less secure
does not require time or capital to get up and running (less
expensive) Datacenter on-premise hardware that stores data within
an organization's local network run by an in-house IT department
More secure More expensive Problem Space Inter-node communication
bandwidth is the principal bottleneck in DCs (or large-scale
clusters). Two existing solutions: 1) Specialized hardware and
communication protocols (e.g., InfiniBand or Myrinet), which scale
to clusters of thousands of nodes with high bandwidth. Cons: no
commodity parts (more expensive); not natively compatible with
TCP/IP applications. 2) Commodity Ethernet switches and routers to
interconnect cluster machines. Cons: scale poorly; non-linear cost
increase with cluster size. Design
Goal The goal of this paper is to design a data center
communication architecture that meets the following goals: Scalable
interconnection bandwidth Arbitrary host communication at full
bandwidth. Economies of scale Commodity switches Backward
compatibility Compatible with hosts running Ethernet and IP Common
data center topology
Background Common data center topology [figure: Internet at top; a
layer-3 router at the core; layer-2/3 switches in the aggregation
layer; layer-2 switches in the access layer; servers at the bottom]
Background (Cont.) Oversubscription
Ratio of the worst-case achievable aggregate bandwidth among the
end hosts to the total bisection bandwidth of a particular
communication topology. Oversubscription lowers the total cost of
the design. Typical designs: factor of 2.5:1 (400 Mbps) to 8:1
(125 Mbps). An oversubscription of 1:1 indicates that all hosts may
potentially communicate with arbitrary other hosts at the full
bandwidth of their network interface. Multi-rooted DC network:
multiple core switches at the top level, using Equal-Cost Multi-Path
(ECMP) routing.
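To sanity-check the slide's figures, a tiny Python sketch (the function name and the 1 Gbps GigE host-link assumption are illustrative, not from the slides):

def effective_host_bandwidth(nic_mbps, oversubscription):
    # Worst-case achievable per-host bandwidth under oversubscription
    return nic_mbps / oversubscription

for ratio in (1.0, 2.5, 8.0):
    print(f"{ratio}:1 -> {effective_host_bandwidth(1000, ratio):.0f} Mbps per host")
# 1.0:1 -> 1000 Mbps, 2.5:1 -> 400 Mbps, 8.0:1 -> 125 Mbps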
Background (Cont.) Cost [figure] Background (Cont.) Cost [figure]
Architecture Clos
Networks/Fat-Trees
Adopt a special instance of a Clos topology Similar trends in
telephone switches led to designing a topology with high bandwidth
by interconnecting smaller commodity switches. Why Fat-Tree? A fat
tree has identical bandwidth at any bisection; each layer has the
same aggregate bandwidth. Architecture (Cont.) Fat-Trees
Inter-connect racks (of servers) using a fat-tree topology. K-ary
fat tree: three-layer topology (edge, aggregation, and core). Each
pod consists of two layers of k/2 switches and (k/2)² servers; each
k-port switch in the lower layer is directly connected to k/2
hosts. Each edge switch connects to k/2 servers & k/2 aggr.
switches; each aggr. switch connects to k/2 edge & k/2 core
switches. There are (k/2)² core switches; each connects to k pods.
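To make the arithmetic concrete, a small Python sketch (names are my own) derives the element counts of a k-ary fat tree; the k = 4 case matches the testbed used later in the evaluation:

def fat_tree_counts(k):
    # Element counts for a k-ary fat tree built from k-port switches
    assert k % 2 == 0, "k must be even"
    edge = k * (k // 2)           # k pods, k/2 edge switches each
    aggregation = k * (k // 2)    # k pods, k/2 aggregation switches each
    core = (k // 2) ** 2          # each core switch reaches all k pods
    hosts = k * (k // 2) ** 2     # (k/2)^2 hosts per pod
    return {"switches": edge + aggregation + core, "hosts": hosts}

print(fat_tree_counts(4))   # {'switches': 20, 'hosts': 16}
print(fat_tree_counts(48))  # {'switches': 2880, 'hosts': 27648}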
Architecture (Cont.) Addressing
IP addresses are allocated within the private 10.0.0.0/8 block. The
pod switches: 10.pod.switch.1, where pod denotes the pod number (in
[0, k−1]), and switch denotes the position of that switch in the
pod (in [0, k−1], starting from left to right, bottom to top). The
core switches: 10.k.j.i, where j and i denote that switch's
coordinates in the (k/2)² core switch grid (each in [1, k/2],
starting from top-left). Hosts: 10.pod.switch.ID, where ID is the
host's position in that subnet (in [2, k/2+1], starting from left
to right).
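An illustrative Python sketch (function name mine) enumerating the resulting addresses for k = 4:

def fat_tree_addresses(k):
    # Switch and host addresses under the 10.0.0.0/8 scheme above
    pod_switches = [f"10.{pod}.{sw}.1"
                    for pod in range(k) for sw in range(k)]
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, k // 2 + 1)
                     for i in range(1, k // 2 + 1)]
    hosts = [f"10.{pod}.{sw}.{hid}"
             for pod in range(k)
             for sw in range(k // 2)       # hosts attach to edge switches
             for hid in range(2, k // 2 + 2)]
    return pod_switches, core_switches, hosts

pods, cores, hosts = fat_tree_addresses(4)
print(len(pods), len(cores), len(hosts))  # 16 4 16
print(cores)  # ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']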
Architecture (Cont.) [figure] Architecture (Cont.) Two-level Routing
structure:
Use two-level lookups to distribute traffic and maintain packet
ordering. The first level is a prefix lookup, used to route down
the topology to servers. The second level is a suffix lookup, used
to route up towards the core. Packet ordering is maintained by
using the same ports for the same server. Architecture (Cont.) To implement the two-level
routing:
They use Content-Addressable Memory (CAM): fast at finding a match
against a bit pattern; it performs parallel searches among all its
entries in a single clock cycle. Lookup engines use a special kind
of CAM, called Ternary CAM (TCAM).
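A toy software rendering of a ternary match in Python (the entry values and the priority-by-order convention are illustrative; a real TCAM compares all entries in parallel in hardware):

def tcam_lookup(addr, entries):
    # Return the port of the first (highest-priority) matching entry;
    # address bits where the mask is 0 are "don't care"
    for value, mask, port in entries:
        if addr & mask == value & mask:
            return port
    return None

ip = lambda s: int.from_bytes(bytes(map(int, s.split("."))), "big")
entries = [
    (ip("10.2.0.0"), ip("255.255.255.0"), 0),  # prefix (left-handed) entry
    (ip("10.2.1.0"), ip("255.255.255.0"), 1),  # prefix (left-handed) entry
    (ip("0.0.0.2"),  ip("0.0.0.255"),     2),  # suffix (right-handed) entry
    (ip("0.0.0.3"),  ip("0.0.0.255"),     3),  # suffix (right-handed) entry
]
print(tcam_lookup(ip("10.2.1.2"), entries))  # 1: prefix match routes down
print(tcam_lookup(ip("10.3.0.2"), entries))  # 2: suffix match routes up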
Architecture (Cont.) Routing Algorithm: [figure: worked example of
the two-level routing tables at the edge, aggregation, and core
switches, showing /24 prefix entries and /8 suffix entries with
their output ports, and /16 entries at the core] Architecture
(Cont.) Two-level Routing table
Traffic diffusion occurs in the first half of a packet's journey.
Subsequent packets to the same host follow the same path, avoiding
packet reordering. A centralized protocol initializes the routing
table.
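A minimal sketch of the lookup semantics in Python, with a hypothetical table for one aggregation switch of a k = 4 fat tree, showing how the suffix step diffuses upward traffic while keeping one path per destination host:

import ipaddress

def two_level_lookup(dst_ip, prefixes, suffixes):
    # First level: a terminating prefix match routes down toward servers
    addr = ipaddress.IPv4Address(dst_ip)
    for network, port in prefixes:
        if addr in network:
            return port
    # Second level: the host-ID suffix picks the uplink, so packets to the
    # same host share a path while different hosts spread across uplinks
    return suffixes[int(addr) & 0xFF]

# Hypothetical tables for aggregation switch 10.2.2.1 (pod 2, k = 4):
prefixes = [(ipaddress.ip_network("10.2.0.0/24"), 0),  # down to edge 10.2.0.1
            (ipaddress.ip_network("10.2.1.0/24"), 1)]  # down to edge 10.2.1.1
suffixes = {2: 2, 3: 3}  # host IDs .2/.3 -> upward ports 2/3

print(two_level_lookup("10.2.1.2", prefixes, suffixes))  # 1 (stay in pod)
print(two_level_lookup("10.3.0.3", prefixes, suffixes))  # 3 (up toward core)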
Architecture (Cont.) Flow Classification
Used with dynamic port reassignment in pod switches to avoid local
congestion. A flow: a sequence of packets that share the same
source and destination IP. In particular, pod switches: 1.
Recognize subsequent packets of the same flow, and forward them on
the same outgoing port (to avoid packet reordering). 2.
Periodically reassign a minimal number of flow output ports to
minimize any disparity between the aggregate flow capacity of
different ports (to ensure fair flow balancing). Architecture (Cont.) Flow
scheduling
Routing large flows plays the most important role in determining
the achievable bisection bandwidth of a network. Schedule large
flows to minimize their overlap with one another. Assign a new flow
to the least-loaded port initially.
Architecture (Cont.) Edge Switches Prevent long flows from sharing
the same link: assign long-lived flows to different links to avoid
congestion. Assign a new flow to the least-loaded port initially;
detect any outgoing flow with a size above a threshold; notify the
central scheduler about the source and destination of all active
large flows; assign the flow to an un-contended path. Architecture
(Cont.) Central Scheduler
The central scheduler tracks all active large flows and assigns
them non-conflicting paths. It is replicated, and maintains boolean
state for all links signifying their availability to carry large
flows. For inter-pod traffic: there are (k/2)² possible paths, each
corresponding to a core switch; the scheduler searches through the
core switches to find one whose corresponding path components do
not include a reserved link; marks those links as reserved; and
notifies the relevant lower- and upper-layer switches in the source
pod with the correct outgoing port that corresponds to that flow's
chosen path. Architecture (Cont.) For intra-pod large
flows:
The scheduler garbage-collects flows whose last update is older
than a given time, clearing their reservations.
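A toy sketch of the scheduler's bookkeeping in Python (my own simplification: a candidate path is modeled as the set of links through one core switch, and replication is ignored):

class CentralScheduler:
    # Tracks large flows and reserves non-conflicting inter-pod paths
    def __init__(self):
        self.reserved_links = set()   # links currently assigned to large flows
        self.flow_paths = {}          # flow id -> the links reserved for it

    def place_flow(self, flow_id, candidate_paths):
        # candidate_paths: one link-set per core switch, (k/2)^2 in total
        for links in candidate_paths:
            if self.reserved_links.isdisjoint(links):
                self.reserved_links.update(links)
                self.flow_paths[flow_id] = set(links)
                return links          # caller installs ports along this path
        return None                   # every candidate crosses a reserved link

    def garbage_collect(self, flow_id):
        # Invoked when a flow's last update is older than a given time
        self.reserved_links -= self.flow_paths.pop(flow_id, set())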
Implementation Flow Classifier: every few seconds, the heuristic
attempts to switch, if needed, the output port of at most three
flows to minimize the difference between the aggregate flow
capacity of its output ports. Two-level table: Click is a modular
software router architecture
that supports implementation of experimental router designs. A
Click router is a graph of packet processing modules called
elements that perform tasks such as routing table lookup; a new
Click element is built. Flow Scheduler: the element FlowReporter is
implemented, which resides in all edge switches and detects
outgoing flows whose size is larger than a given threshold. It
sends regular notifications to the central scheduler about these
active large flows.
Flow classification heuristic (pseudocode):
  On every incoming packet:
    Hash the source and destination IP fields of the packet
    if Seen(hash) then
      Lookup previously assigned port x
    else
      Record the new flow f
      Assign f to the least-loaded upward port x
    end
    Send the packet on port x
  Periodically, for every aggregate switch:
    Find pmax and pmin with the largest and smallest aggregate
    outgoing traffic
    D = pmax − pmin
    Find the largest flow f assigned to port pmax whose size is
    smaller than D
    if such a flow exists then
      Switch the output port of flow f to pmin
    end
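A runnable rendering of the same heuristic in Python (my own simplification: it moves one flow per rebalancing pass rather than the up-to-three described above):

from collections import defaultdict

class FlowClassifier:
    # Pin each flow to one upward port, then periodically rebalance
    def __init__(self, num_up_ports):
        self.ports = list(range(num_up_ports))
        self.port_load = defaultdict(float)   # port -> aggregate traffic
        self.flow_port = {}                   # (src, dst) -> assigned port
        self.flow_size = defaultdict(float)   # (src, dst) -> bytes seen

    def on_packet(self, src, dst, size):
        flow = (src, dst)
        if flow not in self.flow_port:        # new flow: least-loaded port
            self.flow_port[flow] = min(self.ports,
                                       key=lambda p: self.port_load[p])
        port = self.flow_port[flow]
        self.flow_size[flow] += size
        self.port_load[port] += size
        return port                           # send the packet on this port

    def rebalance(self):
        pmax = max(self.ports, key=lambda p: self.port_load[p])
        pmin = min(self.ports, key=lambda p: self.port_load[p])
        d = self.port_load[pmax] - self.port_load[pmin]
        movable = [f for f, p in self.flow_port.items()
                   if p == pmax and self.flow_size[f] < d]
        if movable:                           # largest flow smaller than D
            f = max(movable, key=lambda f: self.flow_size[f])
            self.flow_port[f] = pmin
            self.port_load[pmax] -= self.flow_size[f]
            self.port_load[pmin] += self.flow_size[f]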
Fault-Tolerance Failure b/w upper layer and core switches
In this scheme, each switch in the network maintains a BFD
(Bidirectional Forwarding Detection) session with each of its
neighbors to determine when a link or neighboring switch fails
Failure b/w upper layer and core switches: for outgoing inter-pod
traffic, the local routing table marks the affected link as
unavailable and chooses another core switch. For incoming inter-pod
traffic, the core switch broadcasts a tag to the upper switches
directly connected to it, signifying its inability to carry traffic
to that entire pod; upper switches then avoid that core switch when
assigning flows destined to that pod. Fault-Tolerance Failure b/w
lower and upper layer switches
For outgoing inter- and intra-pod traffic from the lower layer, the
local flow classifier sets the cost to infinity and does not assign
it any new flows; it chooses another upper-layer switch. For
intra-pod traffic using the upper-layer switch as an intermediary,
the switch broadcasts a tag notifying all lower-level switches;
these check the tag when assigning new flows and avoid the failed
link. For inter-pod traffic coming into the upper-layer switch, it
sends a tag to all its core switches signifying its inability to
carry traffic; the core switches mirror this tag to all upper-layer
switches, which then avoid the affected core switch when assigning
new flows.
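A minimal sketch of the reaction logic on a pod switch in Python (BFD is an existing protocol and is only stubbed here; the tag format and helper names are my own):

class FaultAwarePodSwitch:
    # Skip failed links and tagged core switches when assigning new flows
    def __init__(self, up_ports, core_of):
        self.link_ok = {p: True for p in up_ports}  # BFD-reported liveness
        self.core_of = core_of                      # uplink port -> core id
        self.avoid = set()                          # (core, dest pod) tags

    def bfd_down(self, port):
        self.link_ok[port] = False                  # cost set to infinity

    def on_tag(self, core, pod):
        self.avoid.add((core, pod))                 # core cannot reach pod

    def pick_uplink(self, dst_pod):
        usable = [p for p, ok in self.link_ok.items()
                  if ok and (self.core_of[p], dst_pod) not in self.avoid]
        return min(usable) if usable else None      # simplest tie-break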
Power and Heat Issues Results: Heat & Power Consumption [figure]
Evaluation Experiment Description In a 4-port fat-tree there are 16
hosts, four pods (each with four switches), and four core switches:
a total of 20 switches and 16 end hosts. These 36 elements are
multiplexed onto ten physical machines, interconnected by a 48-port
ProCurve 2900 switch with 1 Gigabit Ethernet links. Each pod of
switches is hosted on one machine; each pod's hosts are hosted on
one machine; and the two remaining machines run two core switches
each.
Evaluation (Cont.) For the tree comparison: four machines running
four hosts each, and four machines each running four pod switches
with one additional uplink. The four pod switches are connected to
a 4-port core switch running on a dedicated machine, giving 3.6:1
oversubscription on the uplinks from the pod switches to the core
switch (4 hosts × 96 Mbit/s = 384 Mbit/s feeding one uplink); these
links are bandwidth-limited to 106.67 Mbit/s, and all other links
are limited to 96 Mbit/s. Each host generates a constant 96 Mbit/s
of outgoing traffic. Results: Network Utilization [figure]
Packaging Increased wiring
overhead is inherent to the fat-tree topology Each pod consists of
12 racks with 48 machines each, and 48 individual 48-port GigE
switches Place the 48 switches in a centralized rack. Cables move
in sets of 12 from pod to pod and in sets of 48 from racks to pod
switches; this opens additional opportunities for packing to reduce
wiring complexity Minimize total cable length by placing racks
around the pod switch in two dimensions Packaging [figure] Conclusion
Bandwidth is the scalability bottleneck in large-scale clusters.
Existing solutions are expensive and limit cluster size. The
fat-tree topology offers scalable routing and backward
compatibility with TCP/IP and Ethernet. Large numbers of commodity
switches have the potential of displacing high-end switches in DCs
the same way clusters of commodity PCs have displaced
supercomputers for high-end computing environments. Thank you Q & A