c-Through: Part-time Optics in Data Centers
Guohui Wang*, David G. Andersen†, Michael Kaminsky‡, Konstantina Papagiannaki‡, T. S. Eugene Ng*, Michael Kozuch‡, Michael Ryan‡
*Rice University, †Carnegie Mellon University, ‡Intel Labs Pittsburgh
Overview
• How to achieve high throughput among remote servers in a data center
• Hybrid packet and optical circuit switched data center network architecture (HyPaC)
• Core switching intelligence resides in the end hosts (see-through -> c-Through)
Background
• Emerging applications that need to handle large amounts of data
– VM migration
– Data mining, e.g. Hadoop
• Oversubscription problem
– Growing numbers of servers interconnected via switches
– Ex.: 1000 hosts with 10 Gb/s links sharing one uplink = oversubscription ratio of 1000
• Higher throughput is needed among remote servers within a data center
• Optical circuit switching technology
– Higher bandwidth / slower switching speed
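The oversubscription arithmetic above can be sketched in a few lines. A minimal sketch, assuming the slide's example means 1000 hosts, each with a 10 Gb/s access link, behind a single 10 Gb/s uplink (the uplink figure is an assumption, not stated on the slide):

```python
def oversubscription_ratio(num_hosts: int, host_link_gbps: float, uplink_gbps: float) -> float:
    """Worst-case oversubscription: aggregate host demand divided by uplink capacity."""
    return (num_hosts * host_link_gbps) / uplink_gbps

# The slide's example: 1000 hosts x 10 Gb/s behind a 10 Gb/s uplink (assumed)
print(oversubscription_ratio(1000, 10, 10))  # 1000.0
```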
Optical Circuit Switching
• Technologies
– MEMS-based optical circuit switch
• Does not encode/decode packets
• No transceivers required
– Wavelength Division Multiplexing (WDM)
• Higher transmission rates
• Characteristics
– Pros
• 40 Gb/s ~ 100 Gb/s throughput (vs. at most ~40 Gb/s on electrical Ethernet)
– Cons
• ~20 ms to switch to a new I/O port
– reconfiguration time for rotating the mirrors
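The 20 ms reconfiguration cost only pays off if circuits are held up long enough to amortize it. A back-of-the-envelope sketch (the model and numbers are illustrative, not from the paper):

```python
def circuit_efficiency(hold_time_s: float, reconfig_s: float = 0.020) -> float:
    """Fraction of time a circuit actually carries traffic, given that each
    reconfiguration wastes `reconfig_s` seconds of the optical link."""
    return hold_time_s / (hold_time_s + reconfig_s)

# Holding a circuit for 1 s amortizes the 20 ms switching time to ~2% overhead;
# holding it for only 20 ms wastes half the circuit's time.
print(round(circuit_efficiency(1.0), 3))
print(circuit_efficiency(0.020))  # 0.5
```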
HyPaC Network Architecture
• ToR switches are connected both electrically and optically
• Each rack can have at most one high-bandwidth (optical) connection at a time
• Design Choices
– Traffic Demand Estimation
• Performed at the end hosts (servers): increase the per-connection socket buffer size and observe end-host buffer occupancy at runtime
• Provides a view of head-of-line (HOL) demand per destination
– Traffic Demultiplexing
• Partition the electrical and optical networks into two logical networks using VLANs
– Circuit Utilization Optimization
• Buffer additional data in TCP socket buffers and rely on TCP to send data in bursts
Design and Implementation: Managing Optical Paths
• Traffic Measurement
– Increase the per-socket TCP buffer size at runtime
– Buffering happens at the end hosts
– Each server computes, for each destination rack, the total number of bytes waiting in socket buffers, and reports these per-destination-rack demands to the optical manager
– Easy to scale (DRAM is cheaper on end hosts than on ToR switches)
• Utilization Optimization
– Adjust the size of the per-socket buffers
• Optical Configuration Manager
– Collects traffic measurements from the end hosts, determines how optical paths should be configured, issues configuration directives to the switches, and informs hosts which paths are optically connected
– A small central manager attached to the optical switch (similar to a router control plane)
– Tries to maximize the amount of traffic offloaded to the optical network
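"Maximize the traffic offloaded to the optical network" is a matching problem over the rack-to-rack demand matrix (the paper solves it with Edmonds' maximum-weight matching). A brute-force sketch of the idea, simplified to a directed assignment in which every rack gets exactly one circuit (the real system uses an undirected matching and may leave racks unmatched; the demand matrix below is hypothetical):

```python
from itertools import permutations

def best_circuit_assignment(demand):
    """Pick a one-to-one rack pairing (each rack gets at most one circuit)
    maximizing the total buffered demand carried over optical paths.
    Brute force over permutations -- fine for a handful of racks."""
    n = len(demand)
    best, best_pairs = -1, None
    for perm in permutations(range(n)):
        # perm[i] is the rack that rack i's circuit points at
        if any(perm[i] == i for i in range(n)):  # no self-circuits
            continue
        total = sum(demand[i][perm[i]] for i in range(n))
        if total > best:
            best, best_pairs = total, [(i, perm[i]) for i in range(n)]
    return best, best_pairs

# Hypothetical per-rack demand matrix (bytes waiting in socket buffers)
demand = [[0, 50, 10],
          [5,  0, 80],
          [90, 20, 0]]
total, pairs = best_circuit_assignment(demand)
print(total, pairs)  # 220 [(0, 1), (1, 2), (2, 0)]
```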
Design and Implementation: Traffic De-multiplexing
• VLAN-based network isolation
– Assign two VLANs to the ToR switch
– VLAN-s: electrical packets
– VLAN-c: optical packets
– The topology of VLAN-c changes rapidly, so protocols with long convergence times should not be used
• Traffic de-multiplexing on hosts
– Each host runs a management daemon that informs the kernel about the inter-rack connectivity
– The kernel de-multiplexes traffic onto the optical and electrical paths
– Broadcast and multicast packets are always scheduled over the electrical network
– Optical paths have higher transmission priority => high utilization, no flow starvation
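The kernel's per-packet de-multiplexing decision reduces to a small predicate. A minimal sketch, assuming each rack has at most one optical peer at a time (rack numbering and function names are illustrative):

```python
def choose_path(dst_rack, optical_peer, is_broadcast=False):
    """Per-packet path choice made by the host kernel.
    `optical_peer` is the rack this host's rack currently has an optical
    circuit to (at most one at a time), or None. In c-Through the two
    return values correspond to tagging the packet into VLAN-c or VLAN-s."""
    if is_broadcast:
        return "electrical"   # broadcast/multicast always use the electrical network
    if optical_peer is not None and dst_rack == optical_peer:
        return "optical"      # the circuit is currently up to the destination rack
    return "electrical"

print(choose_path(dst_rack=3, optical_peer=3))                     # optical
print(choose_path(dst_rack=2, optical_peer=3))                     # electrical
print(choose_path(dst_rack=3, optical_peer=3, is_broadcast=True))  # electrical
```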
Design and Implementation: System Implementation
• The total memory consumption of all socket buffers on each server rarely exceeds 200 MB
• Socket stats are read using netstat
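Aggregating netstat output into per-destination-rack demand might look like the sketch below. The column layout, the Send-Q-as-demand proxy, and the "third octet identifies the rack" convention are all assumptions for illustration; the real system reads actual socket buffer occupancy:

```python
# Hypothetical netstat-style lines; the third column (Send-Q) is used here
# as a proxy for bytes buffered toward each destination.
SAMPLE = """\
tcp 0 1048576 10.0.1.5:34567 10.0.2.9:5001 ESTABLISHED
tcp 0 524288 10.0.1.5:34568 10.0.2.7:5001 ESTABLISHED
tcp 0 262144 10.0.1.5:34569 10.0.3.4:5001 ESTABLISHED
"""

def rack_of(ip):
    """Assume the third octet identifies the rack (illustrative convention)."""
    return int(ip.split(".")[2])

def per_rack_demand(netstat_output):
    """Sum buffered bytes per destination rack, as reported to the optical manager."""
    demand = {}
    for line in netstat_output.splitlines():
        proto, recvq, sendq, local, remote, state = line.split()
        rack = rack_of(remote.rsplit(":", 1)[0])
        demand[rack] = demand.get(rack, 0) + int(sendq)
    return demand

print(per_rack_demand(SAMPLE))  # {2: 1572864, 3: 262144}
```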
Evaluation: Emulation Environment
• Pseudo electrical/optical architecture
– Limit rack-to-rack communication (only 1 flow per rack)
– Emulated optical circuit switching time
Evaluation: Micro-benchmark Evaluation
• Fig. 5
– Today's TCP performance
– 40:1 oversubscription ratio
– Reconfiguration = 5 ms
– TCP rapidly adapts to dynamically re-provisioned paths
• Tbl. 2
– The output scheduler does not significantly decrease network throughput
Evaluation: Hadoop Sort
• Hadoop: a MapReduce implementation (a processing framework for large amounts of data)
• Hadoop sort
– Input data size = output data size
– Requires high inter-rack network bandwidth
Discussion
• Applicability of the HyPaC Architecture
– Traffic concentration (is required)
• Hadoop's default buffering config does not support buffering large amounts of data
• Traffic concentration with large buffer sizes suits a HyPaC architecture well
– Zero/Loose Synchronization
• The pairwise connections and the reconfiguration interval impose a minimum time to contact all racks of interest, called the circuit visit delay
• If the synchronization time is shorter than the circuit visit delay, the reconfigured optical paths become useless; latency-sensitive workloads suffer for the same reason
• Making Applications Optics-Aware
– Buffer more data
– The optical manager could be integrated into a cluster-wide job/physical-resource manager
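The circuit visit delay admits a simple back-of-the-envelope bound. One simple model (mine, not the paper's): if each reconfiguration grants a rack one new peer and circuits are held for a fixed stability period, visiting every other rack takes:

```python
def circuit_visit_delay(num_racks, stability_period_s, reconfig_s=0.020):
    """Minimum time for one rack to get a circuit to every other rack of
    interest, when each reconfiguration grants it one new peer."""
    return (num_racks - 1) * (stability_period_s + reconfig_s)

# With 8 racks, a 1 s stability period, and 20 ms switching, a rack
# needs ~7.14 s before it has had a circuit to every peer.
print(round(circuit_visit_delay(8, 1.0), 2))  # 7.14
```

Any application whose synchronization interval is shorter than this bound cannot benefit from the optical paths, which is the point the slide makes.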
Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers
Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat
University of California, San Diego
Overview
• The motivation and the optical/electrical hybrid idea are almost the same as c-Through: how to provide high throughput between nodes in the same network separated by switches
• Core architecture resides in the switch
Optimal Architecture (Simulation)
• 64 pods, each with 1024 hosts
• Core/optical switches integrated into 1 giant switch
The parameter w influences cost both positively and negatively. Larger values of w reduce the number of fibers and core circuit switch ports, reducing cost. But larger values of w also lead to more internal fragmentation, which is unused capacity on a superlink resulting from insufficient demand. The effect of w on system cost is related to the number of flows between pods. In our configuration, between w = 4 and w = 8 was optimal.
Design and Implementation: Software
• Topology Manager
– Edmonds' algorithm
– Runs on a single server
• Circuit Switch Manager
– Programmable circuit-switching software running on a Glimmerglass optical switch in synchronous mode (can process multiple circuit-switching requests)
• Pod Switch Manager
– Runs on each pod switch
– Manages the flow table and interfaces with the topology manager
– Uses L2 forwarding
– Supports multipath forwarding using Link Aggregation Groups
– LIMITATION: cannot split traffic travelling to the same port across both the packet switch and the circuit switch (a limitation of the Monaco switch)
• Basic Workflow
1. Control Loop
• The TM issues commands to the PSMs to get the octet-counter matrix
• Each flow is classified as a mouse or an elephant (below or above 15 Mb/s)
2. Estimate Demand
– A flow-rate matrix is not good enough; max-min fair bandwidth is a better estimate
3. Compute New Topology
– Objective: maximize throughput
– Max-weighted matching problem on bipartite graphs
4. Notify down, change topology, notify up
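Step 2's "max-min fair bandwidth" is the classic water-filling allocation. A single-link sketch of the idea (Helios estimates max-min fair demand across the whole fabric; this shows only the one-link building block):

```python
def max_min_fair(demands, capacity):
    """Classic water-filling: repeatedly give every unsatisfied flow an
    equal share of what's left; flows wanting less than the share are
    satisfied exactly, and the leftover is redistributed."""
    alloc = [0.0] * len(demands)
    active = list(range(len(demands)))
    remaining = float(capacity)
    while active:
        share = remaining / len(active)
        satisfied = [i for i in active if demands[i] <= share]
        if not satisfied:
            for i in active:           # everyone left gets an equal share
                alloc[i] = share
            break
        for i in satisfied:
            alloc[i] = float(demands[i])
            remaining -= demands[i]
        active = [i for i in active if demands[i] > share]
    return alloc

# Three flows wanting 2, 4, and 10 Gb/s share a 9 Gb/s link:
print(max_min_fair([2, 4, 10], 9))  # [2.0, 3.5, 3.5]
```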
Evaluation
• Pod-Level Stride (PStride)
– Each host in a source pod i sends 1 TCP flow to each host in a destination pod j = (i+k) mod 4, with k rotating from 1 to 3 after each stability period
– Pod rotation
• Host-Level Stride (HStride)
– Each host i (numbered from 0 to 23) sends 6 TCP flows simultaneously to host j = (i+6+k) mod 24, with k rotating from 0 to 12 after each stability period
• Random
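The two stride patterns above can be generated directly from their formulas (function names are mine; the formulas are the slide's):

```python
def pstride_pairs(k, num_pods=4):
    """Pod-Level Stride: pod i targets pod j = (i + k) mod num_pods."""
    return [(i, (i + k) % num_pods) for i in range(num_pods)]

def hstride_pairs(k, num_hosts=24, offset=6):
    """Host-Level Stride: host i targets host j = (i + offset + k) mod num_hosts."""
    return [(i, (i + offset + k) % num_hosts) for i in range(num_hosts)]

print(pstride_pairs(1))      # [(0, 1), (1, 2), (2, 3), (3, 0)]
print(hstride_pairs(0)[:3])  # [(0, 6), (1, 7), (2, 8)]
```

Rotating k shifts every pair each stability period, which is what forces the topology manager to keep reconfiguring circuits.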
Evaluation: Debouncing and EDC
• Debouncing
– Compensates for the slow settling of the cable signal that occurs when switching ports
– Avoids routing problems, etc.
– Part of the switch's specification
• Electronic Dispersion Compensation (EDC)
– Adds to the circuit-switching delay; removes the noise from light that has travelled a long distance
• Disabling both improved throughput!
Scalable Flow-Based Networking with DIFANE
Minlan Yu∗, Jennifer Rexford∗, Michael J. Freedman∗, Jia Wang†
∗Princeton University, Princeton, NJ, USA  †AT&T Labs - Research, Florham Park, NJ, USA
Overview
• Background
– Growing DCN complexity -> more complex switch configuration
– Flow-based switches
• Flexible policies: dropping or forwarding packets based on rules that match on bits in the packet header
• Goal
– Creating a dynamic flow-based switching architecture
Existing Flow-Based Switch Configuration
• Existing control schemes
– Send a microflow's first packet to a centralized controller, which issues a rule to the switch software
– The centralized mechanism is a bottleneck
– Better performance and scalability require keeping all traffic in the data plane
• Switches should be able to do this on their own
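The reactive, controller-centric scheme the slide criticizes can be sketched as below. Every cache miss makes a round trip to the controller, which is the bottleneck DIFANE removes by keeping misses in the data plane (class and rule names are hypothetical):

```python
class Controller:
    """Centralized controller: maps a microflow's first packet to a rule."""
    def __init__(self, policy):
        self.policy = policy            # dst address -> action

    def handle_miss(self, flow):
        src, dst = flow
        return (flow, self.policy.get(dst, "drop"))

class Switch:
    """Switch whose flow table is filled reactively, one miss at a time."""
    def __init__(self, controller):
        self.flow_table = {}
        self.controller = controller
        self.misses = 0                 # each miss is a controller round trip

    def forward(self, flow):
        if flow not in self.flow_table:
            self.misses += 1            # first packet leaves the data plane
            match, action = self.controller.handle_miss(flow)
            self.flow_table[match] = action
        return self.flow_table[flow]

sw = Switch(Controller({"10.0.0.2": "fwd_port_1"}))
print(sw.forward(("10.0.0.1", "10.0.0.2")))  # first packet: controller consulted
print(sw.forward(("10.0.0.1", "10.0.0.2")), sw.misses)  # cached; misses == 1
```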
DIFANE Design Decisions
• Reducing the Overhead of Cache Misses
– Process all packets in the data plane (hardware)
– Rule caching with wildcards
• Scaling to Large Networks and Many Rules
– Partition and distribute the flow rules
– One primary controller manages policy, computes rules, and divides the rules across the switches
– Each switch handles its portion of the rule space
• Consistent topology information distribution with the link-state protocol
– The link-state protocol runs among the switches
DIFANE Architecture
• Basic workflow
1. The ingress switch receives a packet
2. It redirects the packet to an authority switch
3. The authority switch tells the ingress switch to cache the result
4. Subsequent packets can be sent directly to the egress switch
• Rule partition and allocation
– Precompute low-level rules
• The controller pre-computes the low-level rules from the high-level policies by substituting high-level names with network addresses
• The rule space is partitioned, and each portion is assigned to one or more authority switches
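Mapping a packet to its authority switch is then a lookup in a non-overlapping partition of the flow space. A 1-D sketch partitioning by destination address (DIFANE actually partitions a multi-dimensional wildcard space; names and the 0-255 space are illustrative):

```python
import bisect

class RulePartition:
    """Non-overlapping flow-space partition: each contiguous slice of the
    (here: destination-address) space is owned by one authority switch."""
    def __init__(self, boundaries, authority_switches):
        # boundaries[i] is the first address owned by partition i+1
        assert len(boundaries) + 1 == len(authority_switches)
        self.boundaries = boundaries
        self.switches = authority_switches

    def authority_for(self, dst_addr):
        """Ingress switches redirect cache misses to this authority switch."""
        return self.switches[bisect.bisect_right(self.boundaries, dst_addr)]

# Hypothetical: a 0-255 address space split across three authority switches
part = RulePartition([64, 192], ["auth-A", "auth-B", "auth-C"])
print(part.authority_for(10))   # auth-A
print(part.authority_for(100))  # auth-B
print(part.authority_for(200))  # auth-C
```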
Handling Network Dynamics
• Changes to rules
– When administrators modify the policies
– When network events affect the mapping between policies and rules (e.g. failures)
• Topology dynamics: authority switch failure
– The routing protocol propagates the failure of the specific authority switch
– The rule partitions owned by that authority switch are recomputed and redirected to a backup switch
– Authority switch addition
• The controller randomly selects an existing authority switch, divides its partition in two, and lets the new switch handle one half
• Host Mobility
– Install rules at the new ingress switch on demand; rules at the old ingress switch are removed by TIMEOUT
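The timeout-based removal of stale rules at the old ingress switch amounts to a TTL cache. A minimal sketch with an injectable clock so the behavior is deterministic (class and rule names are hypothetical):

```python
class CachedRules:
    """Ingress-switch rule cache whose entries expire after a timeout,
    so rules for a departed host age out of the old ingress switch."""
    def __init__(self, ttl, clock):
        self.ttl, self.clock = ttl, clock
        self.rules = {}                  # match -> (action, installed_at)

    def install(self, match, action):
        self.rules[match] = (action, self.clock())

    def lookup(self, match):
        entry = self.rules.get(match)
        if entry is None:
            return None
        action, installed_at = entry
        if self.clock() - installed_at > self.ttl:
            del self.rules[match]        # expired: next packet re-triggers caching
            return None
        return action

# Simulated clock so the example is deterministic
now = [0.0]
cache = CachedRules(ttl=5.0, clock=lambda: now[0])
cache.install("host-X", "port-3")
print(cache.lookup("host-X"))  # port-3
now[0] = 10.0
print(cache.lookup("host-X"))  # None (timed out after the host moved)
```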
Wildcard Rule Handling
• Caching wildcard rules
– Overlapping rules can cause trouble
– Ingress switches may cache only the highest-priority rules
– To prevent conflicting cached rules, each authority switch may only install caching rules within its own flow range
• Partitioning wildcard rules
– Allocate non-overlapping flow ranges to the authority switches
– Split rules so that each rule belongs to only one authority switch
– Make cuts that align with rule boundaries
– Duplicate authority rules to reduce stretch
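The cut-and-duplicate trade-off can be shown in one dimension: a rule whose range spans the cut must appear in both partitions. A 1-D sketch (DIFANE cuts a multi-dimensional wildcard space; rule names and ranges are illustrative):

```python
def partition_rules(rules, cut):
    """Split a 1-D flow space at `cut`; a rule whose range spans the cut is
    duplicated into both partitions (the 'duplicating authority rules' case)."""
    left, right = [], []
    for name, lo, hi in rules:   # each rule covers addresses [lo, hi]
        if hi < cut:
            left.append(name)
        elif lo >= cut:
            right.append(name)
        else:                    # spans the boundary: appears in both partitions
            left.append(name)
            right.append(name)
    return left, right

rules = [("R1", 0, 63), ("R2", 32, 95), ("R3", 64, 127)]
print(partition_rules(rules, 64))  # (['R1', 'R2'], ['R2', 'R3'])
```

Cuts that align with rule boundaries avoid this duplication entirely, which is why the slide lists boundary-aligned cuts first.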