Dr. George Chiu, IEEE Fellow, IBM T.J. Watson Research Center, Yorktown Heights, NY
Architecture of the IBM Blue Gene Supercomputer
President Obama Honors IBM's Blue Gene Supercomputer With National Medal of Technology and Innovation
Ninth time IBM has received the nation's most prestigious tech award
Blue Gene has led to breakthroughs in science, energy efficiency and analytics
WASHINGTON, D.C. - 18 Sep 2009: President Obama recognized IBM (NYSE: IBM) and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the country's most prestigious award given to leading innovators for technological achievement. President Obama will personally bestow the award at a special White House ceremony on October 7. IBM, which earned the National Medal of Technology and Innovation on eight other occasions, is the only company recognized with the award this year.

Blue Gene's speed and expandability have enabled business and science to address a wide range of complex problems and make more informed decisions -- not just in the life sciences, but also in astronomy, climate, simulations, modeling and many other areas. Blue Gene systems have helped map the human genome, investigated medical therapies, safeguarded nuclear arsenals, simulated radioactive decay, replicated brain power, flown airplanes, pinpointed tumors, predicted climate trends, and identified fossil fuels – all without the time and money that would have been required to physically complete these tasks.

The system also reflects breakthroughs in energy efficiency. With the creation of Blue Gene, IBM dramatically shrank the physical size and energy needs of a computing system whose processing speed would have required a dedicated power plant capable of generating power to thousands of homes. The influence of the Blue Gene supercomputer's energy-efficient design and computing model can be seen today across the Information Technology industry. Today, 18 of the top 20 most energy-efficient supercomputers in the world are built on IBM high-performance computing technology, according to the latest Supercomputing 'Green500 List' announced by Green500.org in July 2009.
CMOS Scaling in the Petaflop Era
- Three decades of exponential clock rate (and electrical power!) growth has ended
- Instruction Level Parallelism (ILP) growth has ended
- Single-threaded performance improvement is dead (Bill Dally)
- Yet Moore's Law continues in transistor count
- Industry response: multi-core (i.e., double the number of cores every 18 months instead of the clock frequency (and power!))
Source: “The Landscape of Computer Architecture,” John Shalf, NERSC/LBNL, presented at ISC07, Dresden, June 25, 2007
Performance Improvement Trend
[Figure: log-scale plot of frequency (GHz), ops/cycle, and number of compute engines for Clusters, Blue Gene, and projected Future (2010-2015) systems]
Over the next 8-10 years:
- Frequency might improve by 2x
- Ops/cycle might improve by 2-4x
- The only opportunity for dramatic performance improvement is in the number of compute engines
Source: Tilak Agerwala, ICS 08
Blue Gene Roadmap
- QCDSP – 600 GF, based on TI DSP C31 (1998)
- QCDOC – 20 TF, based on IBM 180 nm ASIC (2003)
- BG/L (5.7 TF/rack) – 130 nm ASIC (1999-2004 GA)
  - 104 racks, 212,992 cores, 596 TF/s, 210 MF/W
  - Dual-core system-on-chip
  - 0.5/1 GB/node
- BG/P (13.9 TF/rack) – 90 nm ASIC (2004-2007 GA)
  - 72 racks, 294,912 cores, 1 PF/s, 357 MF/W
  - Quad-core SOC, DMA
  - 2/4 GB/node
  - SMP support, OpenMP, MPI
- BG/Q (209 TF/rack) – 20 PF/s
HPCC 2009
IBM BG/P, 0.501 PF peak (36 racks):
- Class 1: Number 1 on G-Random Access (117 GUPS)
- Class 2: Number 1
Cray XT5, 2.331 PF peak:
- Class 1: Number 1 on G-HPL (1533 TF/s)
- Class 1: Number 1 on EP-Stream (398 TB/s)
- Class 1: Number 1 on G-FFT (11 TF/s)
Source: www.top500.org
BlueGene/P packaging hierarchy (chip → compute card → node card → rack → system):
- Chip: 4 processors; 13.6 GF/s, 8 MB eDRAM
- Compute Card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
- Node Card: 32 compute cards (32 chips, 4x4x2), 0-1 I/O cards; 435 GF/s, 64 (128) GB
- Rack: 32 node cards, cabled 8x8x16; 13.9 TF/s, 2 (4) TB
- System: 72 racks, 72x32x32; 1 PF/s, 144 (288) TB
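The per-level peak numbers above compose multiplicatively. As a sanity check, here is a small C sketch of the arithmetic, assuming the PPC450's 850 MHz clock and 4 flops/cycle per core from the double FPU (values not stated in the table itself):

```c
#include <stdio.h>

/* Sanity-check the BG/P packaging arithmetic from the table above.
 * Assumes 850 MHz PPC450 cores and 4 flops/cycle per core (double FPU). */
int main(void)
{
    double chip_gf = 0.850 * 4 /* cores */ * 4 /* flops/cycle */;  /* 13.6 GF/s  */
    double card_gf = chip_gf;             /* 1 chip per compute card             */
    double node_gf = card_gf * 32;        /* 32 compute cards per node card      */
    double rack_tf = node_gf * 32 / 1e3;  /* 32 node cards per rack: ~13.9 TF/s  */
    double sys_pf  = rack_tf * 72 / 1e3;  /* 72 racks: ~1.0 PF/s                 */

    printf("chip %.1f GF/s, node card %.1f GF/s, rack %.1f TF/s, system %.2f PF/s\n",
           chip_gf, node_gf, rack_tf, sys_pf);
    return 0;
}
```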
[Figure: BlueGene/P compute ASIC block diagram – four PPC450 cores, each with 32 KB L1 instruction and 32 KB L1 data caches and a double FPU; per-core L2 with snoop filter; multiplexing switches connecting the cores to two shared 4 MB eDRAM banks usable as L3 cache or on-chip memory, each with a shared L3 directory with ECC (512b data, 72b ECC paths); two DDR-2 controllers with ECC driving the 13.6 GB/s DDR-2 DRAM bus; DMA engine, shared SRAM, arbiter, and hybrid PMU with 256x64b SRAM; network interfaces for the 3D torus (6 bidirectional links at 3.4 Gb/s), the collective network (3 bidirectional links at 6.8 Gb/s), 4 global barriers/interrupts, 10 Gbit Ethernet, and JTAG access]
[Figure: relative utilization of Blue Gene/P chip area by component type – hard cores, eDRAMs, I/O cells, decaps, fuse/BIST, soft cores, arrays, and custom logic]
Port snoop filter
- Each of the 4 port filters contains three complementary filters for optimal filtering (a simplified sketch of the decision logic follows below):
  - Snoop cache: keeps track of snoops. Addresses that were recently invalidated need not be invalidated again, because we know they are no longer in L1.
  - Stream registers: addresses requested by L1 from L2 are monitored and stored in stream registers. Using this information, we can discard some invalidations, because we know the corresponding lines are not in L1.
  - Range filter: address ranges are set to filter all coherence requests with addresses either within or outside the specified address range.
- All filters run concurrently:
  - Combined rejection of unnecessary snoops is typically over 90%
  - All required snoops are forwarded to the L1s
  - Performance improvements of up to 35%, depending on the application
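As a rough illustration of how the three filters combine, here is a minimal, hypothetical C sketch of the decision logic. The table sizes, the 32-byte line assumption, and the outside-the-range filtering policy are illustrative choices, not the actual BG/P hardware parameters; initialization, replacement, and stream-register aging are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define SNOOP_CACHE_ENTRIES 8   /* illustrative sizes, not the real hardware */
#define STREAM_REGISTERS    4
#define LINE_SHIFT          5   /* assume 32-byte L1 lines for the sketch */

typedef struct {
    uint64_t recently_invalidated[SNOOP_CACHE_ENTRIES];      /* snoop cache */
    struct { uint64_t base; uint64_t mask; } stream[STREAM_REGISTERS];
    uint64_t range_lo, range_hi;  /* range filter (here: filter outside range) */
} snoop_filter_t;

/* Record an L1 line fill so later snoops to provably absent lines can be dropped.
 * Bits where the new line differs from the stored base become "don't care". */
static void note_l1_load(snoop_filter_t *f, int reg, uint64_t addr)
{
    uint64_t line = addr >> LINE_SHIFT;
    f->stream[reg].mask &= ~(f->stream[reg].base ^ line);
    f->stream[reg].base  = line;
}

/* Returns true if the snoop must be forwarded to L1, false if it can be discarded. */
static bool snoop_must_forward(const snoop_filter_t *f, uint64_t snp_addr)
{
    uint64_t line = snp_addr >> LINE_SHIFT;

    /* Range filter: everything outside the configured range is known-absent. */
    if (snp_addr < f->range_lo || snp_addr > f->range_hi)
        return false;

    /* Snoop cache: a line we already invalidated cannot still be in L1. */
    for (int i = 0; i < SNOOP_CACHE_ENTRIES; i++)
        if (f->recently_invalidated[i] == line)
            return false;

    /* Stream registers: forward only if some register could contain this line. */
    for (int r = 0; r < STREAM_REGISTERS; r++)
        if (((line ^ f->stream[r].base) & f->stream[r].mask) == 0)
            return true;   /* possibly in L1 -> must snoop */

    return false;          /* no register can contain it -> discard */
}
```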
[Figure: port snoop filter block diagram – the incoming snoop address/request is checked in parallel by the snoop cache, the stream registers (updated from local processor cache misses and stream register status), and the range filter; decision logic, gated by the enable control for the port snoop filter, either discards the snoop or forwards the request into the snoop queue and returns a token]
[Figure: snoop filter efficiency – simulated filter rate (%) for FFT, Barnes, LU, Ocean, Raytrace, and Cholesky; bigger is better]
Execution Modes in BG/P per Node
[Figure: mapping of processes (P) and threads (T) onto the four cores of a node; hardware abstractions shown in black, software abstractions in blue]
- SMP Mode: 1 process, 1-4 threads/process
- Dual Mode: 2 processes, 1-2 threads/process
- Quad Mode (Virtual Node Mode, VNM): 4 processes, 1 thread/process
Next Generation HPC:
- Many core
- Expensive memory
- Two-tiered programming model (a hybrid sketch follows below)
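The two-tiered model referred to above is typically MPI across processes plus threads (e.g., OpenMP) within a process, matching the SMP/Dual/Quad modes. A minimal hybrid sketch in C, using only generic MPI and OpenMP calls (nothing BG/P-specific is assumed):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Minimal hybrid MPI+OpenMP sketch: one MPI rank per node (SMP mode) with
 * OpenMP threads filling the node's cores. Launched in dual or quad (VNM)
 * mode, the same code simply sees more ranks and fewer threads per rank. */
int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Second tier: shared-memory parallelism across the cores of one process. */
    #pragma omp parallel reduction(+:local_sum)
    {
        int nthreads = omp_get_num_threads();  /* 1-4 depending on mode */
        local_sum += 1.0 / nthreads;           /* placeholder per-thread work */
    }

    /* First tier: message passing across nodes (or processes within a node). */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d global_sum=%f\n", nranks, global_sum);

    MPI_Finalize();
    return 0;
}
```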
[Figure, drawn to scale: rack cooling evolution – air-cooled BG/L (25 kW/rack, 3000 CFM/rack), air-cooled BG/P (40 kW/rack, 5000 CFM/rack), and the hydro-air concept for BG/P (40 kW/rack, 5000 CFM/row) using air-to-water heat exchangers; racks with cards and fans, airflow, and air plenums indicated]
[Figure: main memory capacity per rack for LRZ IA64, Cray XT4, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]
[Figure: peak memory bandwidth per node (bytes/flop), 0-2 scale, for BG/P 4-core, Roadrunner, Cray XT3 2-core, Cray XT5 4-core, POWER5, Itanium 2, Sun TACC, and SGI ICE]
[Figure: main memory bandwidth per rack for LRZ Itanium, Cray XT5, ASC Purple, Roadrunner (RR), BG/P, Sun TACC, and SGI ICE]
BlueGene/P Interconnection Networks

3-Dimensional Torus
- Interconnects all compute nodes (73,728)
- Virtual cut-through hardware routing
- 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
- 0.5 µs latency between nearest neighbors, 5 µs to the farthest; MPI: 3 µs latency for one hop, 10 µs to the farthest
- Communications backbone for computations
- 1.7/3.9 TB/s bisection bandwidth, 188 TB/s total bandwidth

Collective Network
- One-to-all broadcast functionality
- Reduction operations functionality
- 6.8 Gb/s of bandwidth per link per direction
- Latency of one-way tree traversal 1.3 µs, MPI 5 µs
- ~62 TB/s total binary tree bandwidth (72k-node machine)
- Interconnects all compute and I/O nodes (1152)

Low-Latency Global Barrier and Interrupt
- Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
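To illustrate how these networks are typically exercised from application code, here is a minimal, generic C/MPI sketch: a periodic 3D Cartesian communicator maps nearest-neighbor exchange onto the torus, while MPI_Allreduce is the kind of operation the collective network accelerates. The grid dimensions are chosen by MPI_Dims_create and are illustrative only, not tied to a real partition size.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int dims[3] = {0, 0, 0};
    int periods[3] = {1, 1, 1};            /* periodic = torus wrap-around */
    MPI_Dims_create(nranks, 3, dims);

    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* reorder */, &torus);

    int rank;
    MPI_Comm_rank(torus, &rank);

    double halo_out = (double)rank, halo_in[6], local = 1.0, total;
    MPI_Request reqs[12];
    int nreq = 0;

    /* Exchange one value with each of the six torus neighbors (+/-x, +/-y, +/-z). */
    for (int dim = 0; dim < 3; dim++) {
        int lo, hi;
        MPI_Cart_shift(torus, dim, 1, &lo, &hi);
        MPI_Irecv(&halo_in[2*dim],   1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
        MPI_Irecv(&halo_in[2*dim+1], 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
        MPI_Isend(&halo_out, 1, MPI_DOUBLE, hi, 0, torus, &reqs[nreq++]);
        MPI_Isend(&halo_out, 1, MPI_DOUBLE, lo, 0, torus, &reqs[nreq++]);
    }
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

    /* Global reduction over all ranks. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, torus);

    if (rank == 0)
        printf("grid %dx%dx%d, sum=%.0f\n", dims[0], dims[1], dims[2], total);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}
```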
[Figure: interprocessor peak bandwidth per node (bytes/flop), 0-0.8 scale, for BG/L and BG/P, Cray XT5 4-core, Cray XT4 2-core, NEC Earth Simulator, POWER5, Itanium 2, Sun TACC, an x86 cluster, Dell Myrinet, and Roadrunner]
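For context on the bytes/flop charts, a quick back-of-envelope check using the BG/P per-node figures quoted earlier (13.6 GF/s peak, 13.6 GB/s DDR2 memory bandwidth, 5.1 GB/s torus injection bandwidth):

```c
#include <stdio.h>

/* Back-of-envelope bytes/flop for a BG/P node, using figures quoted earlier. */
int main(void)
{
    double gflops = 13.6, mem_gbs = 13.6, torus_gbs = 5.1;
    printf("memory: %.2f byte/flop, torus: %.3f byte/flop\n",
           mem_gbs / gflops, torus_gbs / gflops);   /* 1.00 and ~0.375 */
    return 0;
}
```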
[Figure: total power consumption of the BlueGene/P chip configuration – average node power and average memory power (W) when idle and when running UMT2k, SPHOT, SPPM, CrystalMK, and DGEMM]
IBM® System Blue Gene®/P Solution: Expanding the Limits of Breakthrough Science
Summary

Blue Gene/P: Facilitating Extreme Scalability
- Ultrascale capability computing when nothing else will satisfy
- Provides customers with enough computing resources to help solve grand challenge problems
- Provides competitive advantages for customers' applications looking for extreme computing power
- Energy-conscious solution supporting green initiatives
- Familiar open/standards operating environment
- Simple porting of parallel codes

Key Solution Highlights
- Leadership performance, space-saving design, low power requirements, high reliability, and easy manageability