Dr. George Chiu, IEEE Fellow, IBM T.J. Watson Research Center, Yorktown Heights, NY
Architecture of the IBM Blue Gene Supercomputer
President Obama Honors IBM's Blue Gene Supercomputer With National Medal of Technology and Innovation
Ninth time IBM has received the nation's most prestigious tech award
Blue Gene has led to breakthroughs in science, energy efficiency and analytics
WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement. President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.

Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels – all without the time and money that would have been required to physically complete these tasks.

The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would have required a dedicated power plant capable of generating power to thousands of homes. The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high-performance computing technology, according to the latest Supercomputing 'Green500 List' announced by Green500.org in July 2009.
CMOS Scaling in the Petaflop Era
- Three decades of exponential clock rate (and electrical power!) growth has ended
- Instruction Level Parallelism (ILP) growth has ended
- Single-threaded performance improvement is dead (Bill Dally)
- Yet Moore's Law continues in transistor count
- Industry response: multi-core (i.e., double the number of cores every 18 months instead of the clock frequency (and power!))
Source: “The Landscape of Computer Architecture,” John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007
Performance Improvement Trend
[Figure: log-scale plot of frequency (GHz), ops/cycle, and number of compute engines for Clusters, Blue Gene, and projected Future (2010-2015) systems]
Over the next 8-10 years:
- Frequency might improve by 2x
- Ops/cycle might improve by 2-4x
- The only opportunity for dramatic performance improvement is in the number of compute engines
Source: Tilak Agerwala, ICS 08
Blue Gene Roadmap
- QCDSP – 600 GF, based on TI DSP C31 (1998)
- QCDOC – 20 TF, based on IBM 180 nm ASIC (2003)
- BG/L (5.7 TF/rack) – 130 nm ASIC (1999-2004 GA)
  - 104 racks, 212,992 cores, 596 TF/s, 210 MF/W
  - Dual-core system-on-chip
  - 0.5/1 GB/node
- BG/P (13.9 TF/rack) – 90 nm ASIC (2004-2007 GA)
  - 72 racks, 294,912 cores, 1 PF/s, 357 MF/W
  - Quad-core SOC, DMA
  - 2/4 GB/node
  - SMP support, OpenMP, MPI
- BG/Q (209 TF/rack) – 20 PF/s
HPCC 2009
IBM BG/P, 0.501 PF peak (36 racks):
- Class 1: Number 1 on G-Random Access (117 GUPS)
- Class 2: Number 1
Cray XT5, 2.331 PF peak:
- Class 1: Number 1 on G-HPL (1533 TF/s)
- Class 1: Number 1 on EP-Stream (398 TB/s)
- Class 1: Number 1 on G-FFT (11 TF/s)
Source: www.top500.org
BlueGene/P packaging hierarchy (chip → compute card → node card → rack → system):
- Chip: 4 processors; 13.6 GF/s, 8 MB eDRAM
- Compute Card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
- Node Card: 32 compute cards (32 chips, 4x4x2), 0-1 I/O cards; 435 GF/s, 64 (128) GB
- Rack: 32 node cards, cabled 8x8x16; 13.9 TF/s, 2 (4) TB
- System: 72 racks, 72x32x32; 1 PF/s, 144 (288) TB
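The per-level peak numbers above compose multiplicatively. As a sanity check, here is a small C sketch of the arithmetic, assuming the PPC450's 850 MHz clock and 4 flops/cycle per core from the double FPU (values not stated in the table itself):

```c
#include <stdio.h>

/* Sanity-check the BG/P packaging arithmetic from the table above.
 * Assumes 850 MHz PPC450 cores and 4 flops/cycle per core (double FPU). */
int main(void)
{
    double chip_gf = 0.850 * 4 /* cores */ * 4 /* flops/cycle */;  /* 13.6 GF/s  */
    double card_gf = chip_gf;             /* 1 chip per compute card             */
    double node_gf = card_gf * 32;        /* 32 compute cards per node card      */
    double rack_tf = node_gf * 32 / 1e3;  /* 32 node cards per rack: ~13.9 TF/s  */
    double sys_pf  = rack_tf * 72 / 1e3;  /* 72 racks: ~1.0 PF/s                 */

    printf("chip %.1f GF/s, node card %.1f GF/s, rack %.1f TF/s, system %.2f PF/s\n",
           chip_gf, node_gf, rack_tf, sys_pf);
    return 0;
}
```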
[Figure: BlueGene/P compute ASIC block diagram – four PPC450 cores, each with 32 KB L1 instruction and 32 KB L1 data caches and a double FPU; per-core L2 with snoop filter; multiplexing switches connecting the cores to two shared 4 MB eDRAM banks usable as L3 cache or on-chip memory, each with a shared L3 directory with ECC (512b data, 72b ECC paths); two DDR-2 controllers with ECC driving the 13.6 GB/s DDR-2 DRAM bus; DMA engine, shared SRAM, arbiter, and hybrid PMU with 256x64b SRAM; network interfaces for the 3D torus (6 bidirectional links at 3.4 Gb/s), the collective network (3 bidirectional links at 6.8 Gb/s), 4 global barriers/interrupts, 10 Gbit Ethernet, and JTAG access]
[Figure: relative utilization of Blue Gene/P chip area by component type – hard cores, eDRAMs, I/O cells, decaps, fuse/BIST, soft cores, arrays, and custom logic]
Port snoop filter
- Each of the 4 port filters contains three complementary filters for optimal filtering (a simplified sketch of the decision logic follows below):
  - Snoop cache: keeps track of snoops. Addresses that were recently invalidated need not be invalidated again, because we know they are no longer in L1.
  - Stream registers: addresses requested by L1 from L2 are monitored and stored in stream registers. Using this information, we can discard some invalidations, because we know the corresponding lines are not in L1.
  - Range filter: address ranges are set to filter all coherence requests with addresses either within or outside the specified address range.
- All filters run concurrently:
  - Combined rejection of unnecessary snoops is typically over 90%
  - All required snoops are forwarded to the L1s
  - Performance improvements of up to 35%, depending on the application
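As a rough illustration of how the three filters combine, here is a minimal, hypothetical C sketch of the decision logic. The table sizes, the 32-byte line assumption, and the outside-the-range filtering policy are illustrative choices, not the actual BG/P hardware parameters; initialization, replacement, and stream-register aging are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define SNOOP_CACHE_ENTRIES 8   /* illustrative sizes, not the real hardware */
#define STREAM_REGISTERS    4
#define LINE_SHIFT          5   /* assume 32-byte L1 lines for the sketch */

typedef struct {
    uint64_t recently_invalidated[SNOOP_CACHE_ENTRIES];      /* snoop cache */
    struct { uint64_t base; uint64_t mask; } stream[STREAM_REGISTERS];
    uint64_t range_lo, range_hi;  /* range filter (here: filter outside range) */
} snoop_filter_t;

/* Record an L1 line fill so later snoops to provably absent lines can be dropped.
 * Bits where the new line differs from the stored base become "don't care". */
static void note_l1_load(snoop_filter_t *f, int reg, uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    f->stream[reg].mask &= ~(f->stream[reg].base ^ line);
    f->stream[reg].base  = line;
}

/* Returns true if the snoop must be forwarded to L1, false if it can be discarded. */
static bool snoop_must_forward(const snoop_filter_t *f, uint64_t snp_addr)
{
    uint64_t line = snp_addr >> LINE_SHIFT;

    /* Range filter: everything outside the configured range is known-absent. */
    if (snp_addr < f->range_lo || snp_addr > f->range_hi)
        return false;

    /* Snoop cache: a line we already invalidated cannot still be in L1. */
    for (int i = 0; i < SNOOP_CACHE_ENTRIES; i++)
        if (f->recently_invalidated[i] == line)
            return false;

    /* Stream registers: forward only if some register could contain this line. */
    for (int r = 0; r < STREAM_REGISTERS; r++)
        if (((line ^ f->stream[r].base) & f->stream[r].mask) == 0)
            return true;   /* possibly in L1 -> must snoop */

    return false;          /* no register can contain it -> discard */
}
```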
[Figure: port snoop filter block diagram – the incoming snoop address/request is checked in parallel by the snoop cache, the stream registers (updated from local processor cache misses and stream register status), and the range filter; decision logic, gated by the enable control for the port snoop filter, either discards the snoop or forwards the request into the snoop queue and returns a token]
[Figure: snoop filter efficiency – simulated filter rate (%) for FFT, Barnes, LU, Ocean, Raytrace, and Cholesky; bigger is better]
Execution Modes in BG/P per Node
[Figure: mapping of processes (P) and threads (T) onto the four cores of a node; hardware abstractions shown in black, software abstractions in blue]
- SMP Mode: 1 process, 1-4 threads/process
- Dual Mode: 2 processes, 1-2 threads/process
- Quad Mode (Virtual Node Mode, VNM): 4 processes, 1 thread/process
Next Generation HPC:
- Many core
- Expensive memory
- Two-tiered programming model (a hybrid sketch follows below)
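The two-tiered model referred to above is typically MPI across processes plus threads (e.g., OpenMP) within a process, matching the SMP/Dual/Quad modes. A minimal hybrid sketch in C, using only generic MPI and OpenMP calls (nothing BG/P-specific is assumed):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal hybrid MPI+OpenMP sketch: one MPI rank per node (SMP mode) with
 * OpenMP threads filling the node's cores. Launched in dual or quad (VNM)
 * mode, the same code simply sees more ranks and fewer threads per rank. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Second tier: shared-memory parallelism across the cores of one process. */
    #pragma omp parallel reduction(+:local_sum)
    {
        int nthreads = omp_get_num_threads();  /* 1-4 depending on mode */
        local_sum += 1.0 / nthreads;           /* placeholder per-thread work */
    }

    /* First tier: message passing across nodes (or processes within a node). */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d global_sum=%f\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```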
[Figure, drawn to scale: rack cooling evolution – air-cooled BG/L (25 kW/rack, 3000 CFM/rack), air-cooled BG/P (40 kW/rack, 5000 CFM/rack), and the hydro-air concept for BG/P (40 kW/rack, 5000 CFM/row) using air-to-water heat exchangers; racks with cards and fans, airflow, and air plenums indicated]
[Figure: main memory capacity per rack for LRZ IA64, Cray XT4, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]
[Figure: peak memory bandwidth per node (bytes/flop), 0-2 scale, for BG/P 4-core, Roadrunner, Cray XT3 2-core, Cray XT5 4-core, POWER5, Itanium 2, Sun TACC, and SGI ICE]
[Figure: main memory bandwidth per rack for LRZ Itanium, Cray XT5, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]
BlueGene/P Interconnection Networks

3-Dimensional Torus
- Interconnects all compute nodes (73,728)
- Virtual cut-through hardware routing
- 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
- 0.5 µs latency between nearest neighbors, 5 µs to the farthest; MPI: 3 µs latency for one hop, 10 µs to the farthest
- Communications backbone for computations
- 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth

Collective Network
- One-to-all broadcast functionality
- Reduction operations functionality
- 6.8 Gb/s of bandwidth per link per direction
- Latency of one-way tree traversal 1.3 µs, MPI 5 µs
- ~62 TB/s total binary tree bandwidth (72k-node machine)
- Interconnects all compute and I/O nodes (1152)

Low-Latency Global Barrier and Interrupt
- Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
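To illustrate how these networks are typically exercised from application code, here is a minimal, generic C/MPI sketch: a periodic 3D Cartesian communicator maps nearest-neighbor exchange onto the torus, while MPI_Allreduce is the kind of operation the collective network accelerates. The grid dimensions are chosen by MPI_Dims_create and are illustrative only, not tied to a real partition size.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int dims[3] = {0, 0, 0};
    int periods[3] = {1, 1, 1};            /* periodic = torus wrap-around */
    MPI_Dims_create(nranks, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* reorder */, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);

    double halo_out = (double)rank, halo_in[6], local = 1.0, total;
    MPI_Request reqs[12];
    int nreq = 0;

    /* Exchange one value with each of the six torus neighbors (+/-x, +/-y, +/-z). */
    for (int dim = 0; dim < 3; dim++) {
        int lo, hi;
        MPI_Cart_shift(torus, dim, 1, &lo, &hi);
        MPI_Irecv(&halo_in[2*dim],   1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
        MPI_Irecv(&halo_in[2*dim+1], 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
        MPI_Isend(&halo_out, 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
        MPI_Isend(&halo_out, 1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    /* Global reduction over all ranks. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, torus);

    if (rank == 0)
        printf("grid %dx%dx%d, sum=%.0f\n", dims[0], dims[1], dims[2], total);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```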
[Figure: interprocessor peak bandwidth per node (bytes/flop), 0-0.8 scale, for BG/L and BG/P, Cray XT5 4-core, Cray XT4 2-core, NEC Earth Simulator, POWER5, Itanium 2, Sun TACC, an x86 cluster, Dell Myrinet, and Roadrunner]
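For context on the bytes/flop charts, a quick back-of-envelope check using the BG/P per-node figures quoted earlier (13.6 GF/s peak, 13.6 GB/s DDR2 memory bandwidth, 5.1 GB/s torus injection bandwidth):

```c
#include <stdio.h>

/* Back-of-envelope bytes/flop for a BG/P node, using figures quoted earlier. */
int main(void)
{
    double gflops = 13.6, mem_gbs = 13.6, torus_gbs = 5.1;
    printf("memory: %.2f byte/flop, torus: %.3f byte/flop\n",
           mem_gbs / gflops, torus_gbs / gflops);   /* 1.00 and ~0.375 */
    return 0;
}
```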
[Figure: total power consumption of the BlueGene/P chip configuration – average node power and average memory power (W) when idle and when running UMT2k, SPHOT, SPPM, CrystalMK, and DGEMM]
IBM® System Blue Gene®/P Solution: Expanding the Limits of Breakthrough Science
Summary

Blue Gene/P: Facilitating Extreme Scalability
- Ultrascale capability computing when nothing else will satisfy
- Provides customers with enough computing resources to help solve grand challenge problems
- Provides competitive advantages for customers' applications looking for extreme computing power
- Energy-conscious solution supporting green initiatives
- Familiar open/standards operating environment
- Simple porting of parallel codes

Key Solution Highlights
- Leadership performance, space-saving design, low power requirements, high reliability, and easy manageability