© 2011 IBM Corporation
Experiences in Application Specific Supercomputer Design
Reasons, Challenges and Lessons Learned
Heiko J Schick – IBM Deutschland R&D GmbH
January 2011
Agenda
The Road to Exascale
Reasons for Application Specific Supercomputers
Example: QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Challenges and Lessons Learned
Where are we now? Blue Gene/Q !!!
BG/Q Overview
– 16384 cores per compute rack
– Water cooled, 42U compute rack
– PowerPC compliant 64-bit microprocessor
– Double precision, quad-pipe floating point acceleration on each core
– The system is scalable to 1024 compute racks, each with 1024 compute node ASICs
BG/Q Compute Node
– 16 PowerPC processing cores per node, 4-way multithreaded, 1.6 GHz
– 16 GB DDR3 SDRAM memory per node with 40 GB/s DRAM access
– 3 GF/W
• 200 GF/chip
• 60 W/node (all-inclusive: DRAM, power conversion, etc.)
– Integrated network
Projected Performance Development
Almost a doubling every year !!!
The Big Leap from Petaflops to Exaflops
We will hit 20 Petaflops in 2011/2012. Research for ~2018 Exascale is now beginning.
The IT/CMOS industry is trying to double performance every 2 years. The HPC industry is trying to double performance every year.
Technology disruptions in many areas
– BAD NEWS: Scalability of current technologies?
• Silicon power, interconnect, memory, packaging
– GOOD NEWS: Emerging technologies?
• Memory technologies (e.g. storage class memory)
Exploiting exascale machines
– Want to maximize science output per €
– Need multiple partner applications to evaluate HW trade-offs
Extrapolating an Exaflop in 2018
Standard technology scaling will not get us there in 2018
| Metric | BlueGene/L (2005) | Exaflop, directly scaled | Exaflop compromise using traditional technology | Assumption for "compromise guess" |
| Node peak performance | 5.6 GF | 20 TF | 20 TF | Same node count (64k) |
| Hardware concurrency / node | 2 | 8,000 | 1,600 | Assume 3.5 GHz |
| System power in compute chip | 1 MW | 3.5 GW | 25 MW | Expected based on technology improvement through 4 technology generations (only compute chip power scaling; I/Os also scaled the same way) |
| Link bandwidth (each unidirectional 3-D link) | 1.4 Gbps | 5 Tbps | 1 Tbps | Not possible to maintain the bandwidth ratio |
| Wires per unidirectional 3-D link | 2 | 400 wires | 80 wires | A large wire count will eliminate high density and drive links onto cables, where they are 100x more expensive; assume 20 Gbps signaling |
| Pins in network on node | 24 pins | 5,000 pins | 1,000 pins | 20 Gbps differential assumed; 20 Gbps over copper will be limited to 12 inches, so optics will be needed for in-rack interconnects; 10 Gbps is possible today in both copper and optics |
| Power in network | 100 kW | 20 MW | 4 MW | 10 mW/Gbps assumed. Now: 25 mW/Gbps for long distance (greater than 2 feet on copper), both ends, one direction; 45 mW/Gbps for optics, both ends, one direction, plus 15 mW/Gbps of electrical. Electrical power in future: separately optimized links for power |
| Memory bandwidth / node | 5.6 GB/s | 20 TB/s | 1 TB/s | Not possible to maintain external bandwidth per Flop |
| L2 cache / node | 4 MB | 16 GB | 500 MB | About 6-7 technology generations with expected eDRAM density improvements |
| Data pins associated with memory / node | 128 data pins | 40,000 pins | 2,000 pins | 3.2 Gbps per pin |
| Power in memory I/O (not DRAM) | 12.8 kW | 80 MW | 4 MW | 10 mW/Gbps assumed; most current power is in the address bus. Future: probably about 15 mW/Gbps, maybe down to 10 mW/Gbps (2.5 mW/Gbps is C·V²·f for random data on data pins); address power is higher |
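A quick back-of-the-envelope check on the memory-I/O "compromise" column (assuming the 64k-node system size of the BlueGene/L column is carried over; this calculation is not from the original slide):

```latex
% 2,000 data pins at the assumed 3.2 Gbps per pin:
\[
  2000 \text{ pins} \times 3.2\ \mathrm{Gbps/pin} = 6400\ \mathrm{Gbps}
  \approx 0.8\ \mathrm{TB/s\ per\ node} \approx 1\ \mathrm{TB/s}
\]
% at the assumed 10 mW/Gbps of I/O power, over ~64k nodes:
\[
  6400\ \mathrm{Gbps} \times 10\ \mathrm{mW/Gbps} = 64\ \mathrm{W\ per\ node},
  \qquad 64\ \mathrm{W} \times 65536 \approx 4.2\ \mathrm{MW} \approx 4\ \mathrm{MW}
\]
```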
Building Blocks of Matter
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Quarks are the constituents of matter; they interact strongly by exchanging gluons.
Particular phenomena
– Confinement
– Asymptotic freedom (Nobel Prize 2004)
Theory of strong interactions = Quantum Chromodynamics (QCD)
Balanced Hardware
Example caxpy:
| Processor | FPU throughput [FLOPS / cycle] | Memory bandwidth [words / cycle] | Balance [FLOPS / word] |
| apeNEXT | 8 | 2 | 4 |
| QCDOC (MM) | 2 | 0.63 | 3.2 |
| QCDOC (LS) | 2 | 2 | 1 |
| Xeon | 2 | 0.29 | 7 |
| GPU | 128 x 2 | 17.3 (*) | 14.8 |
| Cell/B.E. (MM) | 8 x 4 | 1 | 32 |
| Cell/B.E. (LS) | 8 x 4 | 8 x 4 | 1 |
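For reference, a scalar C version of the caxpy kernel behind this comparison might look as follows (a minimal sketch, not the vectorized SPE implementation; the exact FLOPS-per-word figure depends on how loads and stores are counted):

```c
/* caxpy: y := a*x + y on single-precision complex vectors.  Per element it
 * performs 8 floating-point operations (4 mul + 4 add) while loading x[i]
 * and y[i] and storing y[i], so its arithmetic intensity is low and the
 * memory-bandwidth column of the table above dominates on most machines. */
#include <complex.h>
#include <stddef.h>

void caxpy(size_t n, float complex a,
           const float complex *x, float complex *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* 4 mul + 4 add = 8 FLOPS per element */
}
```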
Balanced Systems ?!?
… but are they Reliable, Available and Serviceable ?!?
Collaboration and Credits
QPACE = QCD Parallel Computing on the Cell Broadband Engine™ (Cell/B.E.)
Academic Partners
– University Regensburg: S. Heybrock, D. Hierl, T. Maurer, N. Meyer, A. Nobile, A. Schaefer, S. Solbrig, T. Streuer, T. Wettig
– University Wuppertal: Z. Fodor, A. Frommer, M. Huesken
– University Ferrara: M. Pivanti, F. Schifano, R. Tripiccione
– University Milano: H. Simma
– DESY Zeuthen: D. Pleiter, K.-H. Sulanke, F. Winter
– Research Lab Juelich: M. Drochner, N. Eicker, T. Lippert
Industrial Partner
– IBM (DE, US, FR): H. Baier, H. Boettiger, A. Castellane, J.-F. Fauh, U. Fischer, G. Goldrian, C. Gomez, T. Huth, B. Krill, J. Lauritsen, J. McFadden, I. Ouda, M. Ries, H.J. Schick, J.-S. Vogt
Main Funding
– DFG (SFB TR55), IBM
Support by Others
– Eurotech (IT), Knuerr (DE), Xilinx (US)
Production Chain
Major steps
– Pre-integration at University Regensburg
– Integration at IBM / Boeblingen
– Installation at FZ Juelich and University Wuppertal
Concept
System
– Node card with IBM® PowerXCell™ 8i processor and network processor (NWP)
• Important feature: fast double-precision arithmetic
– Commodity processor interconnected by a custom network
– Custom system design
– Liquid cooling system
Rack parameters
– 256 node cards
• 26 TFLOPS peak (double precision)
• 1 TB memory
– O(35) kW power consumption
Applications
– Target sustained performance of 20-30%
– Optimized for calculations in theoretical particle physics: simulation of Quantum Chromodynamics
Networks
Torus network (illustrated in the sketch after this list)
– Nearest-neighbor communication, 3-dimensional torus topology
– Aggregate bandwidth of 6 GByte/s per node and direction
– Remote DMA communication (local store to local store)
Interrupt tree network
– Evaluation of global conditions and synchronization
– Global exceptions
– 2 signals per direction
Ethernet network
– 1 Gigabit Ethernet link per node card to rack-level switches (switched network)
– I/O to parallel file system (user input / output)
– Linux network boot
– Aim of O(10) GB/s bandwidth per rack
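The nearest-neighbor pattern of the torus network can be illustrated with standard MPI Cartesian-topology calls (QPACE itself uses its own low-level torus communication library; MPI is used here only because its API makes the periodic 3-D neighbor exchange explicit):

```c
/* Illustration only: a periodic 3-D nearest-neighbour exchange, the traffic
 * pattern the QPACE torus network is built for. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3] = {0, 0, 0}, periods[3] = {1, 1, 1};   /* periodic = torus */
    int nprocs, rank;
    MPI_Comm torus;

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);                  /* factor into a 3-D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);
    MPI_Comm_rank(torus, &rank);

    double sendbuf = (double)rank, recvbuf;
    for (int dim = 0; dim < 3; dim++) {                /* x, y, z direction */
        int src, dst;
        MPI_Cart_shift(torus, dim, 1, &src, &dst);     /* +1 neighbour      */
        MPI_Sendrecv(&sendbuf, 1, MPI_DOUBLE, dst, 0,
                     &recvbuf, 1, MPI_DOUBLE, src, 0,
                     torus, MPI_STATUS_IGNORE);
    }
    printf("rank %d exchanged values with its 3-D neighbours\n", rank);

    MPI_Finalize();
    return 0;
}
```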
Rack
– Backplane (8 per rack)
– Power supply and power adapter card (24 per rack)
– Root card (16 per rack)
– Node card (256 per rack)
Node Card
Components
– IBM PowerXCell 8i processor, 3.2 GHz
– 4 GB DDR2 memory, 800 MHz, with ECC
– Network processor (NWP): Xilinx Virtex-5 LX110T FPGA
– Ethernet PHY
– 6 x 1 GB/s external links using the PCI Express physical layer
– Service processor (SP): Freescale MCF52211
– Flash (firmware and FPGA configuration)
– Power subsystem
– Clocking
Network Processor
– FlexIO interface to the PowerXCell 8i processor, 2 bytes wide with 3 GHz bit rate
– Gigabit Ethernet
– UART for the firmware / Linux console
– UART for SP communication
– SPI master (boot flash)
– SPI slave for training and configuration
– GPIO
[Node card photo: memory, PowerXCell 8i processor, network processor (FPGA), network PHYs]
[Node card block diagram: the PowerXCell 8i processor connects to DDR2 memory (800 MHz) and, via FlexIO (6 GB/s), to the Virtex-5 FPGA network processor. The FPGA (4*8*2*6 = 384 I/Os at 250 MHz, of 680 available on the LX110T) drives the 6 x 1 GB/s PHYs of the compute network, the Gigabit Ethernet PHY, SPI flash and I2C; a Freescale MCF52211 service processor (RS232/UART, SPI), clocking and the power subsystem complete the card.]
Network Processor
[Network processor block diagram: FlexIO interface to the processor; network logic with routing, arbitration, FIFOs and configuration; link interfaces and PHYs for the torus directions (x+, x-, ..., z-); Ethernet interface and PHY; global signals; serial interfaces; SPI flash.]
FPGA resource utilization
| Resource | Utilization |
| Slices | 92 % |
| Pins | 86 % |
| LUT-FF pairs | 73 % |
| Flip-flops | 55 % |
| LUTs | 53 % |
| BRAM / FIFOs | 35 % |

Utilization by module
| Module | Flip-flops | LUTs |
| Processor interface | 53 % | 46 % |
| Torus | 36 % | 39 % |
| Ethernet | 4 % | 2 % |
Torus Network Architecture
2-sided communication
– Node A initiates send, node B initiates receive
– Send and receive commands have to match
– Multiple use of the same link via virtual channels
Send / receive from / to local store or main memory
– CPU → NWP
• CPU moves data and control info to the NWP
• Back-pressure controlled
– NWP → NWP
• Independent of the processor
• Each datagram has to be acknowledged
– NWP → CPU (sketched below)
• CPU provides credits to the NWP
• NWP writes data into the processor
• Completion indicated by a notification
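A conceptual model of the NWP → CPU credit/notification handshake described above, with hypothetical names and everything collapsed into one address space purely to show the flow (not the QPACE API):

```c
/* CPU hands receive credits to the NWP; the NWP consumes one credit per
 * datagram it writes into the processor and then raises a notification
 * the CPU polls before consuming the data. */
#include <stdio.h>

#define MAX_CREDITS 4

struct rx_slot {
    int payload;            /* stands in for the datagram written by the NWP */
    int notification;       /* set by the NWP once the payload is complete   */
};

static struct rx_slot slots[MAX_CREDITS];
static int credits = 0;     /* credits currently owned by the NWP */

/* CPU side: grant one credit, i.e. offer an empty slot to the NWP. */
static void cpu_post_credit(int slot)
{
    slots[slot].notification = 0;
    credits++;
}

/* NWP side: deliver a datagram only if a credit is available. */
static int nwp_deliver(int slot, int data)
{
    if (credits == 0)
        return -1;                 /* back-pressure: no credit, no write */
    credits--;
    slots[slot].payload = data;    /* write data into the processor...   */
    slots[slot].notification = 1;  /* ...then indicate completion        */
    return 0;
}

int main(void)
{
    cpu_post_credit(0);
    if (nwp_deliver(0, 42) == 0 && slots[0].notification)
        printf("datagram received: %d\n", slots[0].payload);
    return 0;
}
```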
Torus Network Reconfiguration
Torus network PHYs provide 2 interfaces
– Used for network reconfiguration by selecting the primary or secondary interface
Example – 1x8 or 2x4 node-cards
Partition sizes: (1, 2, 2N) x (1, 2, 4, 8, 16) x (1, 2, 4, 8)
– N = number of racks connected via cables (enumerated in the sketch below)
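A small sketch that enumerates the partition sizes given by this formula (N = 4 is an arbitrary example value, not taken from the slides):

```c
/* Enumerate the possible torus partition sizes (1,2,2N) x (1,2,4,8,16) x (1,2,4,8). */
#include <stdio.h>

int main(void)
{
    const int N = 4;                       /* example rack count */
    const int dx[] = {1, 2, 2 * N};
    const int dy[] = {1, 2, 4, 8, 16};
    const int dz[] = {1, 2, 4, 8};

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 5; j++)
            for (int k = 0; k < 4; k++)
                printf("%2d x %2d x %d = %4d node cards\n",
                       dx[i], dy[j], dz[k], dx[i] * dy[j] * dz[k]);
    return 0;
}
```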
Cooling
Concept
– Node card mounted in a housing = heat conductor
– Housing connected to a liquid-cooled cold plate
– Critical thermal interfaces
• Processor – thermal box
• Thermal box – cold plate
– Dry connection between node card and cooling circuit
Node card housing
– The closed node card housing acts as a heat conductor
– The heat conductor is linked to the liquid-cooled "cold plate"
– The cold plate is placed between two rows of node cards
Simulation results for one cold plate
– Ambient 12 °C
– Water 10 L/min
– Load 4224 Watt (2112 Watt per side)
Project Review
Hardware design
– Almost all critical problems solved in time
– Network processor implementation was a challenge
– No serious problems due to wrong design decisions
Hardware status
– Manufacturing quality good: small bone pile, few defects during operation
Time schedule
– Essentially stayed within the planned schedule
– Implementation of system / application software delayed
Summary
QPACE is a new, scalable LQCD machine based on the PowerXCell 8i processor.
Design highlights
– FPGA directly attached to the processor
– LQCD-optimized, low-latency torus network
– Novel, cost-efficient liquid cooling system
– High packaging density
– Very power efficient architecture
O(20-30%) sustained performance for key LQCD kernels is reached / feasible
→ O(10-16) TFLOPS / rack (SP)
Power Efficiency
Challenge #1: Data Ordering
InfiniBand test failed on a cluster with 14 blade servers
– Nodes were connected via InfiniBand DDR adapters
– IMB (Pallas) stresses MPI traffic over the InfiniBand network
– The system fails after a couple of minutes, waiting endlessly for an event
– The system runs stably if the global setting for strong ordering is set (default is relaxed)
– The problem was meanwhile reproduced with the same symptoms on InfiniBand SDR hardware
– Changing from relaxed to strong ordering changes performance significantly!
First indication points to a DMA ordering issue
– InfiniBand adapters do consecutive writes to memory, sending out data followed by status
– The InfiniBand software stack polls regularly on the status
– If the status is updated before the data arrives, we clearly have an issue
Challenge #1: Data Ordering (continued)
Ordering of device-initiated write transactions
– The device (InfiniBand, GbE, ...) writes data to two different memory locations in host memory
– The first transaction writes the data block (multiple writes)
– The second transaction writes the status (data ready)
– It must be ensured that the status does not reach host memory before the complete data
• If not, software may consume the data before it is valid!
Solution 1:
– Always use strong ordering, i.e. every node in the path from device to host sends data out in the order received
– Challenge: I/O bandwidth impact, which can be significant
Solution 2 (illustrated below):
– Provide a means to enforce ordering of the second write behind the first, but leave all other writes unordered
– Better performance
– Challenge: might need device firmware and/or software support
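The hazard and the spirit of Solution 2 can be sketched with C11 release/acquire semantics (illustrative only; the real enforcement happens in the I/O fabric or device firmware, not in host code like this):

```c
/* The device writes a data block and then a status word; the host polls the
 * status and then reads the data.  Under relaxed ordering the status may
 * become visible first, so only the status update needs ordering -- here a
 * release store paired with an acquire load. */
#include <stdatomic.h>
#include <string.h>
#include <stdio.h>

static char data[512];
static atomic_int status;           /* 0 = empty, 1 = data ready */

void device_write(const char *msg)  /* "InfiniBand adapter" side */
{
    strcpy(data, msg);              /* first transaction: data block      */
    atomic_store_explicit(&status, 1, memory_order_release);
                                    /* second transaction: status -- the
                                       release keeps it behind the data,
                                       all other writes stay unordered    */
}

void host_poll(void)                /* software stack side */
{
    while (atomic_load_explicit(&status, memory_order_acquire) == 0)
        ;                           /* poll until status says "ready"     */
    printf("consumed: %s\n", data); /* data is guaranteed valid here      */
}

int main(void)
{
    device_write("payload then status");
    host_poll();
    return 0;
}
```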
Challenge #2: Data is Everything
BAD NEWS: There are many ways in which an application can be accelerated.
– An inline accelerator is an accelerator that runs sequentially with the main compute engine.
– A core accelerator is a mechanism that accelerates the performance of a single core. A core may run multiple hardware threads in an SMT implementation.
– A chip accelerator is an off-chip mechanism that boosts the performance of the primary compute chip. Graphics accelerators are typically of this type.
– A system accelerator is a network-attached appliance that boosts the performance of a primary multinode system. Azul is an example of a system accelerator.
Challenge #2: Data is Everything (continued)
GOOD NEWS: Application acceleration is possible!
– It is all about data:
• Who owns it?
• Where is it now?
• Where is it needed next?
• How much does it cost to send it from now to next?
– Scientists, computer architects, application developers and system administrators need to work together closely.
Challenge #3: Traffic Pattern
IDEA: Open QPACE to a broader range of HPC applications (e.g. High Performance LINPACK).
High-speed point-to-point interconnect with a 3D torus topology
Direct SPE-to-SPE communication between neighboring nodes for good nearest neighbor performance
Challenge #3: Traffic Pattern (continued)
BAD NEWS: High Performance LINPACK Requirements
– Matrix stored in main memory
• Experiments show: performance gains with increasing memory size
– MPI communications:
• Between processes in the same row/column of the process grid
• Message sizes: 1 kB … 30 MB
– Efficient Level 3 BLAS routines (DGEMM, DTRSM, …), as in the sketch below
– Space trade-offs and complexity lead to a PPE-centric programming model!
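For orientation, the Level 3 BLAS call at the heart of HPL's trailing-matrix update looks like this through the standard CBLAS interface (illustrative sizes, not QPACE-specific code):

```c
/* C := C - A*B, the DGEMM update dominating HPL's runtime. */
#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    const int m = 512, n = 512, k = 128;   /* example panel sizes */
    double *A = calloc((size_t)m * k, sizeof *A);
    double *B = calloc((size_t)k * n, sizeof *B);
    double *C = calloc((size_t)m * n, sizeof *C);

    /* C = -1.0 * A * B + 1.0 * C  (trailing-matrix update in LU) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, -1.0, A, k, B, n, 1.0, C, n);

    free(A); free(B); free(C);
    return 0;
}
```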
Challenge #3: Traffic Pattern (continued)
GOOD NEWS: We have an FPGA. ;-)
– DMA Engine was added to the NWP design on the FPGA and can fetch data from main memory.
– PPE is responsible for MM-to-MM message transfers.
– SPE is only used for computation offload.
Challenge #4: Algorithmic Performance
BAD NEWS: Algorithmic performance doesn’t necessarily reflect machine performance.
– Numerical solution of a sparse matrix problem via an iterative method
– If the norm of the residual is sufficiently small, the algorithm has converged and the calculation is finished (see the sketch below)
– The difference between the algorithms is the number of auxiliary vectors used (number in brackets)
Source: Andrea Nobile (University of Regensburg)
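A minimal sketch of the stopping rule referred to above: an iterative solver runs until the residual norm drops below a tolerance, and the iteration count is what "algorithmic performance" measures (Jacobi iteration on a toy system is used here purely as an example; the actual QPACE solvers are Krylov-type methods for the lattice Dirac operator):

```c
/* Iterate until ||b - A x|| falls below a tolerance. */
#include <math.h>
#include <stdio.h>

#define N 4

int main(void)
{
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1, 2, 2, 1}, x[N] = {0}, xn[N];
    double tol = 1e-10, rnorm;
    int it = 0;

    do {
        for (int i = 0; i < N; i++) {           /* one Jacobi sweep */
            double s = b[i];
            for (int j = 0; j < N; j++)
                if (j != i) s -= A[i][j] * x[j];
            xn[i] = s / A[i][i];
        }
        for (int i = 0; i < N; i++) x[i] = xn[i];

        rnorm = 0.0;                            /* residual norm ||b - A x|| */
        for (int i = 0; i < N; i++) {
            double r = b[i];
            for (int j = 0; j < N; j++) r -= A[i][j] * x[j];
            rnorm += r * r;
        }
        rnorm = sqrt(rnorm);
        it++;
    } while (rnorm > tol);

    printf("converged in %d iterations, residual %.2e\n", it, rnorm);
    return 0;
}
```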
Thank you very much for your attention.
Disclaimer
IBM®, DB2®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix and Informix® Dynamic ServerTM, IBM, BladeCenter and POWER and others are trademarks of the IBM Corporation in the US and/or other countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Other company, product, or service names may be trademarks or service marks of others. The information and materials are provided on an "as is" basis and are subject to change.