Blue Gene: A Next Generation Supercomputer (BlueGene/P)
Transcript of Blue Gene: A Next Generation Supercomputer (BlueGene/P)
© 2007 IBM Corporation
Presented by Alan Gara (chief architect) representing the Blue Gene team.
Outline of Talk
• A brief sampling of applications on BlueGene/L
• A detailed look at the next-generation BlueGene (BlueGene/P)
• Future challenges, motivated by a look at computing ~10 to 15 years out
  – Insight into the future-generation BlueGene/Q machine
Blue Gene Roadmap (chart: performance vs. year, 2004-2010)
• Blue Gene/L (PPC 440, 700 MHz): scalable to 360 TFlops; LA 12/04, GA 6/05; 1 GB version available 1Q06
• Blue Gene/P (PPC 450, 0.85 GHz): scalable to 1 PFlops; SoC design in Cu-08 (9SF) technology
• Blue Gene/Q (Power architecture): scales to 10s of PFlops
Providing unmatched sustained $/performance and Watts/performance for scalable applications.
The primary IBM system research vehicle, which influences our more traditional PowerPC product line.
Top500 list (June 2007):

#  | Installation                   | Rmax (TFlops) | Vendor
1  | BlueGene/L DOE/NNSA/LLNL       | 280.6         | IBM
2  | ORNL                           | 101.7         | Cray
3  | Sandia (Red Storm)             | 101.4         | Cray/Sandia
4  | BlueGene/L at Watson           | 91.2          | IBM
5  | BlueGene/L at Stony Brook/BNL  | 82.1          | IBM
6  | ASC Purple LLNL                | 75.76         | IBM
7  | BlueGene/L at RPI              | 73.0          | IBM
8  | NCSA                           | 62.6          | Dell
9  | Barcelona PowerPC blades       | 62.6          | IBM
10 | Leibniz Rechenzentrum          | 56.5          | SGI
Car-Parrinello Molecular Dynamics (CPMD): studying the effect of dopants on SiO2/Si boundaries
• Formation of a non-abrupt SiO2/Si interface correctly predicted "from scratch"
• When nitrogen and hafnium are introduced during the simulation process, detrimental defects are revealed
• Simulations from first principles to understand the physics and chemistry of current technology and guide the design of next-generation materials
• Characterization of materials currently under experimental test
Blue Brain
With EPFL: simulating the neocortical column
• Our understanding of the brain is limited by insufficient information and complexity
• Overcome limitations of neuroscientific experimentation
• Inform experimental design and theory
• Enable scientific discovery for understanding brain function and diseases
• Finally feasible!!! (although by no means finished) on 8,096 BG/L processors
• 100,000 morphologically complex neurons in real time
(Diagram, courtesy of Henry Markram, EPFL: human neocortex, total area ~1570 cm² (~50 cm x 35 cm), thickness ~3 mm, ~1 million columns; each column ~300 µm across, ~3 mm tall, ~10,000 neurons)
POP2 0.1° benchmark (courtesy of M. Taylor, John Dennis)
(Chart residue: 20% of time in solver vs. 71% of time in solver; projected BlueGene/P comparison point for the same system (node) size)
Carbon footprint for POP2 @ 0.1° (courtesy of John Dennis)
BlueGene/P in Focus
BlueGene/P Architectural Highlights
• Scaled performance through density and frequency bump
  – 2x performance through doubling the processors/node
  – 1.2x from a frequency bump due to technology
• Enhanced function
  – 4-way SMP
  – DMA, remote put/get, user-programmable memory prefetch (see the sketch after this list)
  – Greatly enhanced 64-bit performance counters (including the 450 core)
• Hold BlueGene/L packaging as much as possible:
  – Improve networks through higher-speed signaling on the same wires
  – Improve power efficiency through aggressive power management
• Higher signaling rate
  – 2.4x higher bandwidth, improved latency for the Torus and Tree networks
  – 10x higher bandwidth for Ethernet I/O
• 72K nodes in 72 racks should hit 1.00 PF peak
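The DMA engine and remote put/get are exposed to applications through messaging libraries rather than programmed directly. As a hedged illustration only (the slide does not show the BG/P-specific API, so standard MPI-2 one-sided operations stand in for it; the buffer sizes and names are invented for the example), the communication style looks like this:

/* Sketch of the remote "put" style of communication that a hardware DMA
 * engine is designed to accelerate.  Standard MPI-2 one-sided calls are used
 * as a stand-in for the actual BG/P messaging library; N and the buffer names
 * are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { N = 1024 };
    static double sendbuf[N], window_buf[N];
    for (int i = 0; i < N; i++) { sendbuf[i] = rank; window_buf[i] = -1.0; }

    /* Expose window_buf so other ranks can target it with one-sided puts;
     * the receiving processor does not have to participate in the transfer. */
    MPI_Win win;
    MPI_Win_create(window_buf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    int target = (rank + 1) % nprocs;           /* put into the next rank */
    MPI_Put(sendbuf, N, MPI_DOUBLE, target, 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                      /* completes all outstanding puts */

    if (rank == 0)
        printf("rank 0 received %.0f from rank %d\n", window_buf[0], nprocs - 1);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}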
BGP comparison with BGL

Property | BG/L | BG/P
Node Properties
  Node Processors | 2x 440 PowerPC | 4x 450 PowerPC
  Processor Frequency | 0.7 GHz | 0.85 GHz (target)
  Coherency | Software managed | SMP
  L1 Cache (private) | 32 KB/processor | 32 KB/processor
  L2 Cache (private) | 14-stream prefetching | 14-stream prefetching
  L3 Cache size (shared) | 4 MB | 8 MB
  Main Store/node | 512 MB / 1 GB | 2 GB
  Main Store Bandwidth | 5.6 GB/s (16B wide) | 13.6 GB/s (2x 16B wide)
  Peak Performance | 5.6 GF/node | 13.6 GF/node
Torus Network
  Bandwidth | 6*2*175 MB/s = 2.1 GB/s | 6*2*425 MB/s = 5.1 GB/s
  Hardware Latency (nearest neighbor) | 200 ns (32B packet), 1.6 µs (256B packet) | 160 ns (32B packet), 500 ns (256B packet)
  Hardware Latency (worst case) | 6.4 µs (64 hops) | 5 µs (64 hops)
Collective Network
  Bandwidth | 2*350 MB/s = 700 MB/s | 2*0.85 GB/s = 1.7 GB/s
  Hardware Latency (round trip, worst case) | 5.0 µs | 4 µs
System Properties (72K nodes)
  Peak Performance | 410 TF | 1 PF
  Total Power | 1.7 MW | 2.7 MW
BlueGene/P node (block diagram):
• Four PPC 450 cores, each with FPU, L1 cache, and prefetching L2, connected through multiplexing switches to two shared 4 MB eDRAM L3 banks
• Per-core L3 access: data read @ 7 GB/s, data write @ 7 GB/s, instruction @ 7 GB/s; L3 ports: 14 GB/s read (each), 14 GB/s write (each)
• Two DDR-2 controllers; 13.6 GB/s external DDR2 DRAM bus (2 x 16B bus @ 425 MHz)
• DMA module allows remote direct "put"/"get"; the arbiter provides 4 symmetric ports for collective, torus, and global barriers
• Network interfaces: Torus (6 x 3.4 Gb/s bidirectional), Collective (3 x 6.8 Gb/s bidirectional), Barrier, JTAG/Control Network, 10 Gb Ethernet (6 x 3.5 Gb/s bidirectional, to the 10 Gb physical layer; shares I/O with the Torus)
Blue Gene/P packaging hierarchy:
• Chip: 4 processors; 13.6 GF/s; 8 MB EDRAM
• Compute Card: 1 chip, 20 DRAMs; 13.6 GF/s; 2.0 GB DDR; supports 4-way SMP
• Node Card: 32 compute cards (32 chips, 4x4x2), 0-2 I/O cards; 435 GF/s; 64 GB
• Rack: 32 Node Cards (1,024 chips, 4,096 processors); 14 TF/s; 2 TB; cabled 8x8x16
• System: 1 to 72 or more racks; 1 PF/s +; 144 TB +
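The per-level figures above are mutually consistent; a quick check using only the numbers on the slide:

\begin{align*}
\text{Rack peak}     &= 32 \times 32 \times 13.6\ \mathrm{GF/s} \approx 13.9\ \mathrm{TF/s} \approx 14\ \mathrm{TF/s},\\
\text{System peak}   &= 72 \times 32 \times 32 \times 13.6\ \mathrm{GF/s} \approx 1.0\ \mathrm{PF/s},\\
\text{System memory} &= 72 \times 1024 \times 2\ \mathrm{GB} = 144\ \mathrm{TB}.
\end{align*}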
Front End Node / Service Node: System p servers running Linux (SLES10)
HPC software: compilers, GPFS, ESSL, LoadLeveler
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
Blue Gene/P Interconnection Networks
• 3-Dimensional Torus (see the sketch after this list)
  – Interconnects all compute nodes; the communications backbone for computations
  – Adaptive cut-through hardware routing
  – 3.4 Gb/s on all 12 node links (5.1 GB/s per node)
  – 0.5 µs latency between nearest neighbors, 5 µs to the farthest
  – MPI: 3 µs latency for one hop, 10 µs to the farthest
  – 1.7/2.6 TB/s bisection bandwidth, 188 TB/s total bandwidth (72K machine)
• Collective Network
  – Interconnects all compute and I/O nodes (1,152)
  – One-to-all broadcast functionality
  – Reduction operations functionality
  – 6.8 Gb/s of bandwidth per link
  – Latency of one-way tree traversal 2 µs, MPI 5 µs
  – ~62 TB/s total binary-tree bandwidth (72K machine)
• Low-Latency Global Barrier and Interrupt
  – Latency of one way to reach all 72K nodes 0.65 µs, MPI 1.6 µs
• Other networks
  – 10 Gb Functional Ethernet: I/O nodes only
  – 1 Gb Private Control Ethernet: provides JTAG access to hardware; accessible only from the Service Node system
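A minimal sketch of how an application typically expresses torus-style nearest-neighbor communication from MPI (hedged: this uses only generic MPI Cartesian-topology calls, not a BG/P-specific interface; the placement of ranks onto physical torus coordinates is left to the system software, and the halo variables are illustrative):

/* Minimal sketch: a periodic 3-D Cartesian communicator is the natural MPI
 * abstraction for a 3-D torus; with reorder=1 the implementation is free to
 * map ranks onto torus coordinates. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);         /* factor the job into 3 dimensions */

    int periods[3] = {1, 1, 1};               /* periodic in x, y, z: a torus */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &torus);
    MPI_Comm_rank(torus, &rank);

    /* One nearest-neighbor (one-hop) exchange along +x / -x; y and z are analogous. */
    int left, right;
    MPI_Cart_shift(torus, /*direction=*/0, /*displacement=*/1, &left, &right);

    double halo_send = (double)rank, halo_recv = -1.0;
    MPI_Sendrecv(&halo_send, 1, MPI_DOUBLE, right, 0,
                 &halo_recv, 1, MPI_DOUBLE, left,  0, torus, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("torus dims %dx%dx%d, received %.0f from -x neighbor\n",
               dims[0], dims[1], dims[2], halo_recv);

    MPI_Comm_free(&torus);
    MPI_Finalize();
    return 0;
}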
BlueGene/P Software
• BG/L applications easily port to 4-way virtual-node BG/P, and may gain performance through new BG/P features.
• Programming-model changes:
  – Mixed OpenMP + MPI supported (OpenMP across the 4-way node); see the sketch after this list
  – Virtual Node mode supported as in BG/L; in BG/P, 4 MPI tasks/node
  – pthreads supported; the number of threads is limited to the number of cores (4)
• The DMA engine enables effective offloading of messaging and increases the value of overlapping computation with communication.
• The messaging library utilizes DMA and is built around put/get functionality.
• The HPC toolkit will enable access to performance counters (BG/P has processor counters).
• The BG/L model of a high-performance kernel on compute nodes and Linux on I/O nodes is retained.
  – Working on supporting dynamic linking on the high-performance kernel.
• The above also enables new applications for BG/P.
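A minimal sketch of the mixed OpenMP + MPI model referenced above (assumptions: one MPI task per node with OpenMP spread over the 4 cores, e.g. OMP_NUM_THREADS=4; the array and its size are illustrative):

/* Minimal sketch of the mixed MPI + OpenMP model: one MPI task per node,
 * OpenMP across the node's cores.  Compile with an OpenMP-enabled MPI
 * compiler, e.g. "mpicc -fopenmp" (illustrative). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    enum { N = 1 << 20 };
    static double a[N];
    double local = 0.0, global = 0.0;

    /* OpenMP parallelizes the node-local work across the cores. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        local += a[i];
    }

    /* MPI (and, on BG/P, the collective network) handles the cross-node sum. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %e (threads per task: %d)\n",
               global, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}

In virtual-node mode the same code would instead run as 4 single-threaded MPI tasks per node.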
Future Challenges: Insights for BlueGene/Q
Challenges for the future
We can get an understanding of the challenges by projecting to the issues of 2023 (the Exaflop era):
• Power is a fundamental problem that is pervasive at many system levels (compute, memory, disk)
• Memory cost and performance are not keeping pace with compute potential
• Network performance (bandwidth and latency) will be both costly (bandwidth) and hard to scale to Exaflops
• Ease of use, to extract the promised performance from the compute, will be the main focus; big peak Flops is mainly a power problem
• Reliability at the Exaflop scale will require a holistic approach at the architecture level; this results both from a weakening of the underlying silicon technology and from the sheer number of logic elements
Supercomputer Peak Performance
(Chart: peak speed in flops, 1E+2 to 1E+17, vs. year introduced, 1940-2020; doubling time = 1.5 yr. Machines plotted include ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), ILLIAC IV, CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, S-810/20, SX-2, VP2600/10, SX-3/44, i860 (MPPs), Delta, CM-5, Paragon, T3D, NWT, T3E, CP-PACS, ASCI Red, ASCI Red Option, SX-4, Blue Pacific, ASCI White, SX-5, Earth Simulator, ASCI Q, ASCI Purple, Red Storm, Blue Gene/L.)
Annotations:
• Current/Past: performance growth through exponential processor performance growth
• Near Term: performance through exponential growth in parallelism
• Long Term: power cost = system cost
Extrapolating an Exaflop in 2023

Property | BlueGene/L (2005) | Exaflop, directly scaled | Exaflop, "educated guess" | Assumption for "educated guess"
Node peak performance | 5.6 GF | 20 TF | 20 TF | Same node count (64K)
Number of hardware threads/node | 2 | 8,000 | 1,600 | Assume 3.5 GHz, 3-D packaging
System power, compute chips | 1 MW | 4 GW | 50 MW | 80x improvement (very optimistic)
Link bandwidth (each unidirectional 3-D link) | 1.4 Gbps | 5 Tbps | 1 Tbps | Not possible to maintain bandwidth ratio
Wires per unidirectional 3-D link | 2 | 500 | 100 | Large wire count will eliminate high density and drive links onto cables, where they are 100x more expensive
Pins in network on node | 24 | 6,000 | 1,200 | 20 Gbps differential assumed
Power in network | 100 kW | 38 MW | 8 MW | 10 mW/Gbps assumed
Memory bandwidth/node | 5.6 GB/s | 20 TB/s | 2 TB/s | Not possible to maintain external bandwidth/Flop
L2 cache/node | 4 MB | 16 GB | 500 MB | About 6-7 technology generations
Data pins associated with memory/node | 128 | 32,000 | 4,000 | 5 Gbps per pin
Power in memory I/O (not DRAM) | 12.8 kW | 50 MW | 6 MW | 5 mW/Gbps assumed
Total problem size (QCD example) | 64^3 x 128 | 128^3 x 256 | 128^3 x 256 | Approx. equal time to science
QCD CG single-iteration time | 2.3 ms | 9.4 µs | 15 µs | Requires: (1) fast global sum, (2) hardware offload for messaging (driverless messaging)
Memory footprint/node | 2.7 MB | 42 MB | 42 MB | Memory footprint is no problem
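For example, the memory-I/O power rows follow directly from the pin-count and signaling assumptions stated in the table (taking the same 64K-node count):

\begin{align*}
P_{\text{mem I/O}} &= \text{pins/node} \times \text{rate per pin} \times \text{power per Gbps} \times \text{nodes},\\
\text{directly scaled:}\quad & 32{,}000 \times 5\ \mathrm{Gbps} \times 5\ \mathrm{mW/Gbps} \times 65{,}536 \approx 52\ \mathrm{MW}\ (\approx 50\ \mathrm{MW}),\\
\text{educated guess:}\quad & 4{,}000 \times 5\ \mathrm{Gbps} \times 5\ \mathrm{mW/Gbps} \times 65{,}536 \approx 6.6\ \mathrm{MW}\ (\approx 6\ \mathrm{MW}).
\end{align*}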
• Power associated with external memory will force high-efficiency computing to reside inside the chip (or chip stack).
• Network scaling will be both a latency and a bandwidth problem: bandwidth is a cost problem, and latency will require hardware offload to avoid nearly all software layers.
• Processing in a node will be done via thousands of hardware units, each of which is only somewhat faster than today's.
System Power Efficiency (bar chart, GFLOP/s per Watt): BG/L 0.23, BG/P 0.35, Red Storm 0.02, Thunderbird 0.02, Purple 0.02.
(Chart: GFLOP/s per Watt vs. year, 1997-2007, log scale from 0.001 to 1. Machines plotted include QCDSP Columbia, QCDOC Columbia/IBM, Blue Gene/L, Blue Gene/P, Fujitsu Bioserver, ASCI White (Power 3), ASCI Q, Earth Simulator, NCSA (Xeon), LLNL (Itanium 2), ECMWF p690 (Power 4+), NASA SGI, SX-8, Cray XT3, and a projected "?". Annotations contrast a power-efficient design focus with single-thread-focused and commodity-driven designs.)
• Large peak power-efficiency advantage
• Still need dramatic improvement to enable computing in the future (see the estimate below)
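To put "dramatic improvement" in numbers, at the ~0.35 GFLOP/s per Watt peak efficiency shown for BG/P above, an Exaflop-class machine would draw roughly

\[
P \approx \frac{10^{9}\ \mathrm{GFLOP/s}}{0.35\ \mathrm{GFLOP/s\ per\ W}} \approx 2.9\ \mathrm{GW}.
\]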
The Power Problem
(Diagram: field-effect transistor gate; "thick" gate oxide vs. scaled gate oxide at 1.2 nm oxynitride)
• Oxide thickness is near the limit.
• Traditional CMOS scaling has ended.
• Density improvements will continue, but power efficiency from technology will only improve very slowly.
Peak | Projected Year | BlueGene/L | MareNostrum | Earth Simulator
250 TF | 2005 | 1.0 MWatt | 5 MWatt | 100 MWatt
1 PF | 2008 | 2.5 MWatt | 15 MWatt | 200 MWatt
10 PF | 2012 | 25 MWatt | 150 MWatt | 2 GWatt
100 PF | 2019 | 250 MWatt | 1.5 GWatt | 20 GWatt
1000 PF | 2025 | 2.5 GWatt | 15 GWatt | 200 GWatt
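Since each column scales linearly with peak performance, the 1 PF row gives each machine's implied power efficiency:

\[
\frac{1\ \mathrm{PF}}{2.5\ \mathrm{MW}} = 0.40\ \mathrm{GF/W}\ (\text{BlueGene/L}),\qquad
\frac{1\ \mathrm{PF}}{15\ \mathrm{MW}} \approx 0.067\ \mathrm{GF/W}\ (\text{MareNostrum}),\qquad
\frac{1\ \mathrm{PF}}{200\ \mathrm{MW}} = 0.005\ \mathrm{GF/W}\ (\text{Earth Simulator}).
\]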
If power efficiency does not improve, CMOS alone will no longer enable faster computers with similar power.
• The solution is not known!
• Architecture can help (to some extent); witness the better power efficiency of commodity processors from "simplification"
• New circuits can also help
• This problem needs to be addressed now
Summary/Conclusion
• BlueGene/L has achieved an application reach far broader than expected (or targeted in the design)
  – Partnership and collaboration have been critical to exploiting BlueGene/L
• BlueGene/P is an architectural evolution from BlueGene/L
  – Enhancements over BlueGene/L, such as a hardware DMA engine, promise the same or better per-node scaling on BlueGene/P
  – BlueGene/P offers a fully coherent 4-way node with a software stack designed to exploit parallelism
  – BlueGene/P will offer approximately a 2-3x speedup with respect to BlueGene/L for the same node count
• Future trends
  – Power will be a severe constraint in the future (and already is now)
  – Large systems will have millions of threads, each similar in performance to today's
  – The challenges of power will apply to all systems (commercial and HPC)
    • Market forces in the commodity commercial world could push in a different direction, potentially not well aligned with HPC
  – Reliability of future systems will require a holistic approach to reach extreme levels of scalability
  – Latency in networks will become a pinch point for capability computing