EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers

Prof. Sherief Reda, School of Engineering, Brown University


Material from:
•  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition
•  Computer Architecture: A Quantitative Approach by Hennessy and Patterson

Warehouse-scale computers


[Figure: the data center as a computer (L. Barroso)]

Performance metrics for clusters

•  Supercomputers:
–  Execution time
–  Threads communicate using message passing
–  FLOPS (FLOP/s): theoretical peak or measured with a standard benchmark (e.g., LINPACK is used for the Top-500 supercomputer ranking)
•  Data center scale:
–  Abundant parallelism through request-level parallelism
–  Latency is an important metric because it is seen by users
–  Bing study: users search less as response time increases
–  Service Level Objectives (SLOs)/Service Level Agreements (SLAs), e.g., 99% of requests must complete within 100 ms (see the sketch after this list)
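As a hedged illustration of checking such an SLO, here is a minimal sketch; the percentile helper and the latency samples are my own assumptions, and only the "99% within 100 ms" target comes from the slide.

```python
# Minimal sketch: check a latency SLO of the form
# "99% of requests must complete within 100 ms".
# The latency samples below are made up for illustration.

def percentile(samples_ms, pct):
    """Return the pct-th percentile (nearest-rank) of a list of latencies."""
    ordered = sorted(samples_ms)
    rank = max(1, int(round(pct / 100.0 * len(ordered))))
    return ordered[rank - 1]

def meets_slo(samples_ms, pct=99.0, target_ms=100.0):
    """True if at least pct% of requests finished within target_ms."""
    return percentile(samples_ms, pct) <= target_ms

if __name__ == "__main__":
    # Hypothetical measurements: mostly fast requests with a slow tail.
    latencies_ms = [12, 18, 25, 31, 40, 55, 70, 85, 95, 140]
    print("p99 =", percentile(latencies_ms, 99), "ms")   # 140 ms
    print("SLO met?", meets_slo(latencies_ms))           # False
```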


Anatomy of a data center


•  A 7-ft rack (42U) holds 40 1U servers and a rack-level Ethernet switch
•  Rack switches have uplinks connecting to cluster-level switches

Servers

•  Data centers:
–  1U servers; each has two quad-core processors & 16 GB of DRAM
–  Disk attached to each node; 1 Gbps Ethernet
•  Supercomputers:
–  Use more expensive, high-density blade servers and GPGPUs
–  A blade chassis holds a number of blades (e.g., a Cray XE6 blade has four 16-core AMD Opteron processors and 64 GB)
–  The chassis provides a shared power supply and cooling

[Figure: blade chassis]

Power usage of servers


[Figure: server power usage breakdown. Assumes two-socket x86 servers, 16 DIMMs and 8 disk drives per server, and an average utilization of 80%.]

Storage hierarchy of a data center


Data centers use global distributed storage (disks attached to each compute node), whereas supercomputers use Network Attached Storage (NAS) devices connected directly to the cluster network.

Data center network


Datacenters often reduce networking cost by oversubscribing the network at the top-of-rack switch. In an oversubscribed network, the ratio of server-facing ports to fabric (uplink) ports exceeds 1:1. For example, with 2:1 oversubscription, the uplinks provide only half the aggregate bandwidth of the server ports. Each server can still peak at 10 Gbps, but if all servers transmit simultaneously they can only average 5 Gbps each. In practice, oversubscription ratios of 4 to 10 are common.
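A minimal sketch of the oversubscription arithmetic above; the 2:1 ratio and the 10/5 Gbps figures come from the text, while the 40-server, 20-uplink rack configuration is an illustrative assumption.

```python
# Sketch of top-of-rack oversubscription arithmetic.
# Server NIC speed and port counts are illustrative assumptions.

def oversubscription_ratio(servers, nic_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth."""
    return (servers * nic_gbps) / (uplinks * uplink_gbps)

def per_server_bw_gbps(servers, nic_gbps, uplinks, uplink_gbps):
    """Worst-case average bandwidth per server when every server
    transmits through the rack uplinks at the same time."""
    uplink_total = uplinks * uplink_gbps
    return min(nic_gbps, uplink_total / servers)

if __name__ == "__main__":
    # 40 servers with 10 Gbps NICs and 20 x 10 Gbps uplinks -> 2:1.
    print(oversubscription_ratio(40, 10, 20, 10))          # 2.0
    print(per_server_bw_gbps(40, 10, 20, 10), "Gbps")      # 5.0 Gbps
```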

Quantifying latency, bandwidth and capacity


•  Data center assumption: 2,400 servers, each with 16 GB of DRAM and four 2 TB disk drives
•  Each group of 80 servers is connected through a 1 Gbps link to a rack-level switch that has an additional eight 1 Gbps ports for connecting the rack to the cluster-level switch (see the sketch below)
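The slide's numbers can be totaled in a small sketch; everything below simply restates the stated assumptions (2,400 servers, 16 GB of DRAM and four 2 TB disks per server, 80 servers and eight 1 Gbps uplinks per rack).

```python
# Sketch: aggregate DRAM/disk capacity and rack-level oversubscription
# for the example cluster described on the slide.

SERVERS           = 2400
DRAM_GB_PER_SRV   = 16
DISKS_PER_SRV     = 4
DISK_TB           = 2
SERVERS_PER_RACK  = 80
RACK_UPLINK_PORTS = 8     # 1 Gbps each, same speed as the server links

total_dram_tb = SERVERS * DRAM_GB_PER_SRV / 1024
total_disk_tb = SERVERS * DISKS_PER_SRV * DISK_TB
racks         = SERVERS // SERVERS_PER_RACK
rack_oversub  = SERVERS_PER_RACK / RACK_UPLINK_PORTS   # both sides at 1 Gbps

print(f"racks: {racks}")                                # 30
print(f"total DRAM: {total_dram_tb:.1f} TB")            # 37.5 TB
print(f"total disk: {total_disk_tb} TB")                # 19200 TB
print(f"rack oversubscription: {rack_oversub:.0f}:1")   # 10:1
```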

Support equipment


Power distribution and losses


•  More losses inside the servers to convert from AC to DC
•  Total losses are about 11% (see the sketch below)
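A hedged sketch of how a total loss of roughly 11% arises as a product of per-stage efficiencies; the individual stage values below are illustrative assumptions, not figures from the slide.

```python
# Sketch: end-to-end power-delivery efficiency as a product of stage
# efficiencies. The per-stage values are illustrative assumptions chosen
# to land near the ~11% total loss quoted on the slide.

stages = {
    "UPS":                          0.94,   # assumed
    "PDU/transformer":              0.99,   # assumed
    "low-voltage wiring":           0.99,   # assumed
    "server power supply (AC->DC)": 0.97,   # assumed
}

efficiency = 1.0
for name, eff in stages.items():
    efficiency *= eff

print(f"end-to-end efficiency: {efficiency:.1%}")
print(f"total losses: {1 - efficiency:.1%}")   # close to the ~11% above
```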

Heat flow inside a facility


•  Datacenters employ raised floors. The underfloor area often contains power cables to racks, but its primary purpose is to distribute cool air to the server racks through perforated tiles.

•  Heat recirculation contaminates the cold aisles, which is a main source of inefficiency.

Heat exchange systems


Data center energy efficiency


Power usage effectiveness (PUE) reflects the quality of the datacenter building infrastructure itself, and captures the ratio of total building power to IT power.

[Figure: survey results of over 1,100 datacenters in 2012]

Check Google’s green web http://www.google.com/green/bigpicture/
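A minimal sketch of the PUE ratio defined above; the power figures are made-up inputs.

```python
# Sketch: Power Usage Effectiveness (PUE) = total facility power / IT power.
# The input numbers are made up for illustration.

def pue(total_facility_kw, it_equipment_kw):
    """A PUE of 1.0 would mean all facility power goes to IT equipment."""
    return total_facility_kw / it_equipment_kw

if __name__ == "__main__":
    # e.g., 1.5 MW drawn by the building for 1.0 MW of IT load -> PUE 1.5
    print(pue(1500.0, 1000.0))
```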

Need for energy-proportional computing


[Figure: average activity distribution of a sample of two Google clusters, each containing over 20,000 servers, over a period of 3 months (January-March 2013)]

Variable utilization leaves servers idle → wasted power consumption → need for mechanisms that adjust power consumption to server load.

Software

•  Platform-level software
–  Operating system
–  Virtual machines
•  Cluster-level infrastructure software
–  Resource management
–  Hardware abstraction (e.g., global distributed file systems and message passing)
–  Programming frameworks
•  Applications
–  Transactional → data centers
–  Batch → supercomputers


Supercomputer and data center networks

•  Data centers use Ethernet, which is good enough for request-level parallelism with limited inter-thread communication.

•  Supercomputers use InfiniBand and other proprietary networks, which enable various network topologies that facilitate quick communication and synchronization among threads on different servers.


Indirect and direct networks


[Figure: examples of indirect and direct networks]
•  Direct network: each node combines computing and a switch
•  Indirect network: the network is a “box” of switches with no servers inside it

Network metrics

•  Network size: number of nodes
•  Node degree: number of ports per switch (switch complexity)
•  Network diameter: longest shortest path between any two nodes in the network
•  Network bandwidth: best-case total bandwidth
•  Bisection bandwidth: worst-case total bandwidth when one half of the nodes communicates with the other half (see the sketch after this list)
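To make these metrics concrete, the sketch below evaluates them for an N-node ring; the choice of a ring (one of the direct networks covered later) is my own, not from the slides.

```python
# Sketch: the four metrics above, evaluated for an N-node ring
# (each node has one switch with links to its two ring neighbors).

def ring_metrics(n):
    return {
        "size": n,              # number of nodes
        "degree": 2 + 1,        # two ring links plus one port to the node
        "diameter": n // 2,     # longest shortest path around the ring
        "bisection_links": 2,   # any half/half cut severs exactly 2 links
    }

print(ring_metrics(8))
# {'size': 8, 'degree': 3, 'diameter': 4, 'bisection_links': 2}
```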


Indirect networks: crossbar switch


•  Offers the best latency, as the diameter is just 1
•  Point-to-point connections between N pairs of nodes can be established as long as the pairs are distinct
•  Bandwidth scales linearly with the number of nodes
•  Size is not scalable, as the number of crosspoint switches grows quadratically with the number of nodes

Indirect networks: Omega and butterfly


[Figures: Omega network and butterfly network]

•  Number of switches = (N/2) log N → addresses the size-scalability problem of the crossbar
•  Diameter grows as log N
•  The Omega network uses the perfect-shuffle exchange
•  More prone to contention: consider node 0 communicating with node 5 while node 4 wants to communicate with node 6 in the Omega network (see the sketch below)
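A small sketch, assuming destination-tag self-routing in an 8-input Omega network, that reproduces the contention example above: the requests 0 → 5 and 4 → 6 both need the same line after the first stage.

```python
# Sketch: destination-tag (self-routing) in an N=8 Omega network, used to
# show the contention example from the slide: 0 -> 5 collides with 4 -> 6.

N = 8
NBITS = 3  # log2(N) stages of 2x2 switches

def shuffle(pos):
    """Perfect-shuffle exchange: rotate the 3-bit line address left by one."""
    return ((pos << 1) | (pos >> (NBITS - 1))) & (N - 1)

def route(src, dst):
    """Return the line occupied after each stage for a message src -> dst."""
    pos, trace = src, []
    for stage in range(NBITS):
        pos = shuffle(pos)                        # wiring between stages
        bit = (dst >> (NBITS - 1 - stage)) & 1    # routing bit for this stage
        pos = (pos & ~1) | bit                    # 2x2 switch: upper/lower out
        trace.append(pos)
    assert pos == dst                             # self-routing reaches dst
    return trace

a, b = route(0, 5), route(4, 6)
print("0 -> 5 uses lines", a)                     # [1, 2, 5]
print("4 -> 6 uses lines", b)                     # [1, 3, 6]
conflicts = [s for s, (x, y) in enumerate(zip(a, b)) if x == y]
print("contention at stage(s):", conflicts)       # stage 0: both need line 1
```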

Indirect networks: trees


Binary tree
•  Number of switches = N - 1
•  Simple structure
•  Bisection BW = 1

Fat tree
•  Links do not all have the same BW; links get thicker (i.e., more BW) closer to the root
•  How do we build a fat tree out of skinny switches?

Direct networks: linear arrays and rings


•  Great aggregate BW → can exchange N - 1 messages at a time
•  Bisection BW = 1 for an array and 2 for a ring
•  The ring can be laid out (folded) to keep links short

Direct networks: meshes and tori


[Figures: mesh, 2D torus, 3D torus]

•  Improves bisection BW and diameter
•  Increases the number of links and requires more ports per switch (i.e., the degree increases)
•  What is the diameter? (see the sketch below)
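As a hedged answer to the diameter question, here is a small sketch using the standard formulas for a k x k mesh and torus.

```python
# Sketch: diameters of a k x k 2D mesh and 2D torus (standard formulas,
# offered as a hedged answer to the question on the slide).

def mesh_diameter(k):
    # Farthest pair: opposite corners -> (k - 1) hops in each dimension.
    return 2 * (k - 1)

def torus_diameter(k):
    # Wraparound links halve the worst-case distance per dimension.
    return 2 * (k // 2)

for k in (4, 8):
    print(f"{k}x{k}: mesh {mesh_diameter(k)}, torus {torus_diameter(k)}")
# 4x4: mesh 6, torus 4
# 8x8: mesh 14, torus 8
```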

Designing a torus topology for an HPC cluster
•  Cluster of 192 blades; each 10U chassis holds 16 blades and a switch with 32 ports
•  Racks: 12 10U chassis → 4 chassis per rack for a total of 3 racks
•  2D torus: a 4x3 torus; each switch has 32 ports → 16 ports to the blades and 16 ports to the four neighbors (4 cables to each; port budget checked in the sketch below)


[example from clusterdesign.org]
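A small sketch that re-checks the port budget in this example; all the numbers come from the slide above.

```python
# Sketch: port budget for one chassis switch in the 4x3 2D torus example.

BLADES_PER_CHASSIS  = 16
SWITCH_PORTS        = 32
TORUS_NEIGHBORS     = 4      # up/down/left/right in a 2D torus
CABLES_PER_NEIGHBOR = 4      # trunked links to each neighbor

blade_ports   = BLADES_PER_CHASSIS                      # 16
network_ports = TORUS_NEIGHBORS * CABLES_PER_NEIGHBOR   # 16

assert blade_ports + network_ports == SWITCH_PORTS
print("ports to blades:", blade_ports)
print("ports to torus neighbors:", network_ports)
print("chassis needed for 192 blades:", 192 // BLADES_PER_CHASSIS)  # 12
```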

Direct networks: hypercubes


•  n-dimensional hypercubes
•  Number of nodes = 2^n
•  One hop to n neighbors → n + 1 ports per switch; diameter = n
•  What is the bisection bandwidth? (see the sketch below)
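As a hedged answer to the bisection question, here is a small sketch that splits the hypercube on its most significant address bit and counts the links crossing the cut; the count comes out to 2^(n-1) = N/2 links.

```python
# Sketch: bisection bandwidth of an n-dimensional hypercube, counted as the
# number of links crossing a half/half split on the most significant bit.

def hypercube_bisection_links(n):
    crossing = 0
    for u in range(2 ** n):
        for d in range(n):                # each node links to n neighbors
            v = u ^ (1 << d)              # flip one address bit
            if u < v and (u >> (n - 1)) != (v >> (n - 1)):
                crossing += 1             # link crosses the MSB cut
    return crossing

for n in (2, 3, 4):
    print(f"{n}-cube bisection links:", hypercube_bisection_links(n))
# 2-cube: 2, 3-cube: 4, 4-cube: 8  ->  2^(n-1) = N/2 links
```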

Summary


Supercomputer and data center issues:
•  Performance evaluation
•  Server makeup
•  Storage hierarchy
•  Network topologies
•  Support infrastructure for cooling and power delivery
•  Power consumption concerns