[Fat-tree]: A Scalable, Commodity Data Center Network
Architecture
Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat. Presented by:
Bassma Aldahlan Outline Cloud Computing Data Center Networks
Problem Space Paper Goal
Background Architecture Implementation Evaluation Results and
Ongoing work Cloud Computing "Cloud computing is a model for
enabling convenient, on-demand network access to a shared pool of
configurable computing resources (for example, networks, servers,
storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or service
provider interaction." Data Center Networks Data Centers have
recently received significant attention as a cost-effective
infrastructure for storing large volumes of data and hosting
large-scale service applications Data Center Networks A data center
(DC) is a facility consisting of servers (physical machines),
storage and network devices (e.g., switches, routers, and cables),
power distribution systems, cooling systems A data center network
is the communication infrastructure used in a data center, and is
described by the network topology, routing/switching equipment, and
the protocols used (e.g., Ethernet and IP). Cloud vs. Datacenter
Cloud
an off-premise form of computing that stores data on the Internet
services are outsourced to third-party cloud providers Less secure
does not require time or capital to get up and running (less
expensive) Datacenter on-premise hardware that stores data within
an organization's local network run by an in-house IT department
More secure More expensive Problem Space Inter-node communication
bandwidth is the principal bottleneck in DCs (or large-scale
clusters). Two existing solutions: 1) Specialized hardware and
communication protocols (e.g., InfiniBand or Myrinet), which scale
to clusters of thousands of nodes with high bandwidth. Cons: no
commodity parts (more expensive); not natively compatible with
TCP/IP applications. 2) Commodity Ethernet switches and routers to
interconnect cluster machines. Cons: scale poorly; non-linear cost
increase with cluster size. Design
Goal The goal of this paper is to design a data center
communication architecture that meets the following goals: Scalable
interconnection bandwidth Arbitrary host communication at full
bandwidth. Economies of scale Commodity switches Backward
compatibility Compatible with hosts running Ethernet and IP Common
data center topology
Background Common data center topology [figure: Internet at top; a
layer-3 router at the core; layer-2/3 switches in the aggregation
layer; layer-2 switches in the access layer; servers at the bottom]
Background (Cont.) Oversubscription
Ratio of the worst-case achievable aggregate bandwidth among the
end hosts to the total bisection bandwidth of a particular
communication topology. Oversubscription lowers the total cost of
the design. Typical designs: factor of 2.5:1 (400 Mbps) to 8:1
(125 Mbps). An oversubscription of 1:1 indicates that all hosts may
potentially communicate with arbitrary other hosts at the full
bandwidth of their network interface. Multi-rooted DC network:
multiple core switches at the top level, using Equal-Cost Multi-Path
(ECMP) routing.
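To sanity-check the slide's figures, a tiny Python sketch (the function name and the 1 Gbps GigE host-link assumption are illustrative, not from the slides):

def effective_host_bandwidth(nic_mbps, oversubscription):
    # Worst-case achievable per-host bandwidth under oversubscription
    return nic_mbps / oversubscription

for ratio in (1.0, 2.5, 8.0):
    print(f"{ratio}:1 -> {effective_host_bandwidth(1000, ratio):.0f} Mbps per host")
# 1.0:1 -> 1000 Mbps, 2.5:1 -> 400 Mbps, 8.0:1 -> 125 Mbps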
Background (Cont.) Cost [figure] Background (Cont.) Cost [figure]
Architecture Clos
Networks/Fat-Trees
Adopt a special instance of a Clos topology Similar trends in
telephone switches led to designing a topology with high bandwidth
by interconnecting smaller commodity switches. Why Fat-Tree? A fat
tree has identical bandwidth at any bisection; each layer has the
same aggregate bandwidth. Architecture (Cont.) Fat-Trees
Inter-connect racks (of servers) using a fat-tree topology. K-ary
fat tree: three-layer topology (edge, aggregation, and core). Each
pod consists of two layers of k/2 switches and (k/2)² servers; each
k-port switch in the lower layer is directly connected to k/2
hosts. Each edge switch connects to k/2 servers & k/2 aggr.
switches; each aggr. switch connects to k/2 edge & k/2 core
switches. There are (k/2)² core switches; each connects to k pods.
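To make the arithmetic concrete, a small Python sketch (names are my own) derives the element counts of a k-ary fat tree; the k = 4 case matches the testbed used later in the evaluation:

def fat_tree_counts(k):
    # Element counts for a k-ary fat tree built from k-port switches
    assert k % 2 == 0, "k must be even"
    edge = k * (k // 2)           # k pods, k/2 edge switches each
    aggregation = k * (k // 2)    # k pods, k/2 aggregation switches each
    core = (k // 2) ** 2          # each core switch reaches all k pods
    hosts = k * (k // 2) ** 2     # (k/2)^2 hosts per pod
    return {"switches": edge + aggregation + core, "hosts": hosts}

print(fat_tree_counts(4))   # {'switches': 20, 'hosts': 16}
print(fat_tree_counts(48))  # {'switches': 2880, 'hosts': 27648}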
Architecture (Cont.) Addressing
IP addresses are allocated within the private 10.0.0.0/8 block. The
pod switches: 10.pod.switch.1, where pod denotes the pod number (in
[0, k−1]), and switch denotes the position of that switch in the
pod (in [0, k−1], starting from left to right, bottom to top). The
core switches: 10.k.j.i, where j and i denote that switch's
coordinates in the (k/2)² core switch grid (each in [1, k/2],
starting from top-left). Hosts: 10.pod.switch.ID, where ID is the
host's position in that subnet (in [2, k/2+1], starting from left
to right).
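An illustrative Python sketch (function name mine) enumerating the resulting addresses for k = 4:

def fat_tree_addresses(k):
    # Switch and host addresses under the 10.0.0.0/8 scheme above
    pod_switches = [f"10.{pod}.{sw}.1"
                    for pod in range(k) for sw in range(k)]
    core_switches = [f"10.{k}.{j}.{i}"
                     for j in range(1, k // 2 + 1)
                     for i in range(1, k // 2 + 1)]
    hosts = [f"10.{pod}.{sw}.{hid}"
             for pod in range(k)
             for sw in range(k // 2)       # hosts attach to edge switches
             for hid in range(2, k // 2 + 2)]
    return pod_switches, core_switches, hosts

pods, cores, hosts = fat_tree_addresses(4)
print(len(pods), len(cores), len(hosts))  # 16 4 16
print(cores)  # ['10.4.1.1', '10.4.1.2', '10.4.2.1', '10.4.2.2']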
Architecture (Cont.) [figure] Architecture (Cont.) Two-level Routing
structure:
Use two-level lookups to distribute traffic and maintain packet
ordering. The first level is a prefix lookup, used to route down
the topology to servers. The second level is a suffix lookup, used
to route up towards the core. Packet ordering is maintained by
using the same ports for the same server. Architecture (Cont.) To implement the two-level
routing:
They use Content-Addressable Memory (CAM): fast at finding a match
against a bit pattern; it performs parallel searches among all its
entries in a single clock cycle. Lookup engines use a special kind
of CAM, called Ternary CAM (TCAM).
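A toy software rendering of a ternary match in Python (the entry values and the priority-by-order convention are illustrative; a real TCAM compares all entries in parallel in hardware):

def tcam_lookup(addr, entries):
    # Return the port of the first (highest-priority) matching entry;
    # address bits where the mask is 0 are "don't care"
    for value, mask, port in entries:
        if addr & mask == value & mask:
            return port
    return None

ip = lambda s: int.from_bytes(bytes(map(int, s.split("."))), "big")
entries = [
    (ip("10.2.0.0"), ip("255.255.255.0"), 0),  # prefix (left-handed) entry
    (ip("10.2.1.0"), ip("255.255.255.0"), 1),  # prefix (left-handed) entry
    (ip("0.0.0.2"),  ip("0.0.0.255"),     2),  # suffix (right-handed) entry
    (ip("0.0.0.3"),  ip("0.0.0.255"),     3),  # suffix (right-handed) entry
]
print(tcam_lookup(ip("10.2.1.2"), entries))  # 1: prefix match routes down
print(tcam_lookup(ip("10.3.0.2"), entries))  # 2: suffix match routes up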
Architecture (Cont.) Routing Algorithm: [figure: worked example of
the two-level routing tables at the edge, aggregation, and core
switches, showing /24 prefix entries and /8 suffix entries with
their output ports, and /16 entries at the core] Architecture
(Cont.) Two-level Routing table
Traffic diffusion occurs in the first half of a packet's journey.
Subsequent packets to the same host follow the same path, avoiding
packet reordering. A centralized protocol initializes the routing
table.
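A minimal sketch of the lookup semantics in Python, with a hypothetical table for one aggregation switch of a k = 4 fat tree, showing how the suffix step diffuses upward traffic while keeping one path per destination host:

import ipaddress

def two_level_lookup(dst_ip, prefixes, suffixes):
    # First level: a terminating prefix match routes down toward servers
    addr = ipaddress.IPv4Address(dst_ip)
    for network, port in prefixes:
        if addr in network:
            return port
    # Second level: the host-ID suffix picks the uplink, so packets to the
    # same host share a path while different hosts spread across uplinks
    return suffixes[int(addr) & 0xFF]

# Hypothetical tables for aggregation switch 10.2.2.1 (pod 2, k = 4):
prefixes = [(ipaddress.ip_network("10.2.0.0/24"), 0),  # down to edge 10.2.0.1
            (ipaddress.ip_network("10.2.1.0/24"), 1)]  # down to edge 10.2.1.1
suffixes = {2: 2, 3: 3}  # host IDs .2/.3 -> upward ports 2/3

print(two_level_lookup("10.2.1.2", prefixes, suffixes))  # 1 (stay in pod)
print(two_level_lookup("10.3.0.3", prefixes, suffixes))  # 3 (up toward core)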
Architecture (Cont.) Flow Classification
Used with dynamic port reassignment in pod switches to avoid local
congestion. A flow: a sequence of packets that share the same
source and destination IP. In particular, pod switches: 1.
Recognize subsequent packets of the same flow, and forward them on
the same outgoing port (to avoid packet reordering). 2.
Periodically reassign a minimal number of flow output ports to
minimize any disparity between the aggregate flow capacity of
different ports (to ensure fair flow balancing). Architecture (Cont.) Flow
scheduling
Routing large flows plays the most important role in determining
the achievable bisection bandwidth of a network. Schedule large
flows to minimize their overlap with one another. Assign a new flow
to the least-loaded port initially.
Architecture (Cont.) Edge Switches Prevent long flows from sharing
the same link: assign long-lived flows to different links to avoid
congestion. Assign a new flow to the least-loaded port initially;
detect any outgoing flow with a size above a threshold; notify the
central scheduler about the source and destination of all active
large flows; assign the flow to an un-contended path. Architecture
(Cont.) Central Scheduler
The central scheduler tracks all active large flows and assigns
them non-conflicting paths. It is replicated, and maintains boolean
state for all links signifying their availability to carry large
flows. For inter-pod traffic: there are (k/2)² possible paths, each
corresponding to a core switch; the scheduler searches through the
core switches to find one whose corresponding path components do
not include a reserved link; marks those links as reserved; and
notifies the relevant lower- and upper-layer switches in the source
pod with the correct outgoing port that corresponds to that flow's
chosen path. Architecture (Cont.) For intra-pod large
flows:
The scheduler garbage-collects flows whose last update is older
than a given time, clearing their reservations.
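A toy sketch of the scheduler's bookkeeping in Python (my own simplification: a candidate path is modeled as the set of links through one core switch, and replication is ignored):

class CentralScheduler:
    # Tracks large flows and reserves non-conflicting inter-pod paths
    def __init__(self):
        self.reserved_links = set()   # links currently assigned to large flows
        self.flow_paths = {}          # flow id -> the links reserved for it

    def place_flow(self, flow_id, candidate_paths):
        # candidate_paths: one link-set per core switch, (k/2)^2 in total
        for links in candidate_paths:
            if self.reserved_links.isdisjoint(links):
                self.reserved_links.update(links)
                self.flow_paths[flow_id] = set(links)
                return links          # caller installs ports along this path
        return None                   # every candidate crosses a reserved link

    def garbage_collect(self, flow_id):
        # Invoked when a flow's last update is older than a given time
        self.reserved_links -= self.flow_paths.pop(flow_id, set())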
Implementation Flow Classifier: every few seconds, the heuristic
attempts to switch, if needed, the output port of at most three
flows to minimize the difference between the aggregate flow
capacity of its output ports. Two-level table: Click is a modular
software router architecture
that supports implementation of experimental router designs. A
Click router is a graph of packet processing modules called
elements that perform tasks such as routing table lookup; a new
Click element is built. Flow Scheduler: the element FlowReporter is
implemented, which resides in all edge switches and detects
outgoing flows whose size is larger than a given threshold. It
sends regular notifications to the central scheduler about these
active large flows.
Flow classification heuristic (pseudocode):
  On every incoming packet:
    Hash the source and destination IP fields of the packet
    if Seen(hash) then
      Lookup previously assigned port x
    else
      Record the new flow f
      Assign f to the least-loaded upward port x
    end
    Send the packet on port x
  Periodically, for every aggregate switch:
    Find pmax and pmin with the largest and smallest aggregate
    outgoing traffic
    D = pmax − pmin
    Find the largest flow f assigned to port pmax whose size is
    smaller than D
    if such a flow exists then
      Switch the output port of flow f to pmin
    end
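A runnable rendering of the same heuristic in Python (my own simplification: it moves one flow per rebalancing pass rather than the up-to-three described above):

from collections import defaultdict

class FlowClassifier:
    # Pin each flow to one upward port, then periodically rebalance
    def __init__(self, num_up_ports):
        self.ports = list(range(num_up_ports))
        self.port_load = defaultdict(float)   # port -> aggregate traffic
        self.flow_port = {}                   # (src, dst) -> assigned port
        self.flow_size = defaultdict(float)   # (src, dst) -> bytes seen

    def on_packet(self, src, dst, size):
        flow = (src, dst)
        if flow not in self.flow_port:        # new flow: least-loaded port
            self.flow_port[flow] = min(self.ports,
                                       key=lambda p: self.port_load[p])
        port = self.flow_port[flow]
        self.flow_size[flow] += size
        self.port_load[port] += size
        return port                           # send the packet on this port

    def rebalance(self):
        pmax = max(self.ports, key=lambda p: self.port_load[p])
        pmin = min(self.ports, key=lambda p: self.port_load[p])
        d = self.port_load[pmax] - self.port_load[pmin]
        movable = [f for f, p in self.flow_port.items()
                   if p == pmax and self.flow_size[f] < d]
        if movable:                           # largest flow smaller than D
            f = max(movable, key=lambda f: self.flow_size[f])
            self.flow_port[f] = pmin
            self.port_load[pmax] -= self.flow_size[f]
            self.port_load[pmin] += self.flow_size[f]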
Fault-Tolerance Failure b/w upper layer and core switches
In this scheme, each switch in the network maintains a BFD
(Bidirectional Forwarding Detection) session with each of its
neighbors to determine when a link or neighboring switch fails
Failure b/w upper layer and core switches: for outgoing inter-pod
traffic, the local routing table marks the affected link as
unavailable and chooses another core switch. For incoming inter-pod
traffic, the core switch broadcasts a tag to the upper switches
directly connected to it, signifying its inability to carry traffic
to that entire pod; upper switches then avoid that core switch when
assigning flows destined to that pod. Fault-Tolerance Failure b/w
lower and upper layer switches
For outgoing inter- and intra-pod traffic from the lower layer, the
local flow classifier sets the cost to infinity and does not assign
it any new flows; it chooses another upper-layer switch. For
intra-pod traffic using the upper-layer switch as an intermediary,
the switch broadcasts a tag notifying all lower-level switches;
these check the tag when assigning new flows and avoid the failed
link. For inter-pod traffic coming into the upper-layer switch, it
sends a tag to all its core switches signifying its inability to
carry traffic; the core switches mirror this tag to all upper-layer
switches, which then avoid the affected core switch when assigning
new flows.
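A minimal sketch of the reaction logic on a pod switch in Python (BFD is an existing protocol and is only stubbed here; the tag format and helper names are my own):

class FaultAwarePodSwitch:
    # Skip failed links and tagged core switches when assigning new flows
    def __init__(self, up_ports, core_of):
        self.link_ok = {p: True for p in up_ports}  # BFD-reported liveness
        self.core_of = core_of                      # uplink port -> core id
        self.avoid = set()                          # (core, dest pod) tags

    def bfd_down(self, port):
        self.link_ok[port] = False                  # cost set to infinity

    def on_tag(self, core, pod):
        self.avoid.add((core, pod))                 # core cannot reach pod

    def pick_uplink(self, dst_pod):
        usable = [p for p, ok in self.link_ok.items()
                  if ok and (self.core_of[p], dst_pod) not in self.avoid]
        return min(usable) if usable else None      # simplest tie-break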
Power and Heat Issues Results: Heat & Power Consumption [figure]
Evaluation Experiment Description In a 4-port fat-tree there are 16
hosts, four pods (each with four switches), and four core switches:
a total of 20 switches and 16 end hosts. These 36 elements are
multiplexed onto ten physical machines, interconnected by a 48-port
ProCurve 2900 switch with 1 Gigabit Ethernet links. Each pod of
switches is hosted on one machine; each pod's hosts are hosted on
one machine; and the two remaining machines run two core switches
each.
Evaluation (Cont.) For the tree comparison: four machines running
four hosts each, and four machines each running four pod switches
with one additional uplink. The four pod switches are connected to
a 4-port core switch running on a dedicated machine, giving 3.6:1
oversubscription on the uplinks from the pod switches to the core
switch (4 hosts × 96 Mbit/s = 384 Mbit/s feeding one uplink); these
links are bandwidth-limited to 106.67 Mbit/s, and all other links
are limited to 96 Mbit/s. Each host generates a constant 96 Mbit/s
of outgoing traffic. Results: Network Utilization [figure]
Packaging Increased wiring
overhead is inherent to the fat-tree topology Each pod consists of
12 racks with 48 machines each, and 48 individual 48-port GigE
switches Place the 48 switches in a centralized rack. Cables move
in sets of 12 from pod to pod and in sets of 48 from racks to pod
switches; this opens additional opportunities for packing to reduce
wiring complexity Minimize total cable length by placing racks
around the pod switch in two dimensions Packaging [figure] Conclusion
Bandwidth is the scalability bottleneck in large-scale clusters.
Existing solutions are expensive and limit cluster size. The
fat-tree topology offers scalable routing and backward
compatibility with TCP/IP and Ethernet. Large numbers of commodity
switches have the potential of displacing high-end switches in DCs
the same way clusters of commodity PCs have displaced
supercomputers for high-end computing environments. Thank you Q & A