Transcript of "Exascale Evolution", Brad Benton, IBM, March 15, 2010.

Page 1

Exascale Evolution

www.openfabrics.org

Brad Benton, IBM
March 15, 2010

Page 2

Agenda

• Exascale Challenges

• On the Path to Exascale: A Look at Blue Waters


Page 3

Exascale Challenges


Page 4

Exascale Challenges

• Challenges at every level of system design
  – Managing 500M to 1B (most likely heterogeneous) cores
  – Programming models to exploit multi-core + accelerators
  – Interconnect (see the back-of-envelope note after this list)
    • How will IB/RC scale to exascale?
    • How do we "get off the bus"?
    • How can we put more capability in the interconnect?
  – Power Management
    • Power vs. performance tradeoffs
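One concrete reading of "How will IB/RC scale?" (a back-of-envelope sketch; the ~1 KB-per-QP figure is an assumption, not from the slides): reliable connection (RC) transport needs a queue pair per communicating peer, so fully connected communication at exascale implies

$$ P \approx 10^{6}\ \text{peers} \;\Rightarrow\; (P-1) \times \sim\!1\ \text{KB of QP context} \approx 1\ \text{GB of connection state per node} $$

which is one motivation for datagram-style or connectionless alternatives.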


Page 5

Exascale Challenges

• Challenges at every level of system design
  – Resilience/Fault-Tolerance
    • At this scale, something will always be broken or in the process of breaking
  – Development Environment/Performance Tuning
  – Workflow Management/Process Steering
  – Data Management/Storage/Visualization


Page 6

Exascale Challenges

• Resiliency/Fault-Tolerance
  – F/T Model (sketched in code after this list)
    • Fault detection
    • Fault isolation
    • Fault containment
    • Fault recovery
    • Re-integration
  – Software Resiliency
    • More than just checkpoint/restart
    • Containers/virtualization
    • Suspend/migrate/resume
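A minimal sketch of the five-phase F/T model as a state machine (the types and transitions are hypothetical, for illustration only; this is not an IBM or OFED API):

```c
/* Sketch of the five-phase fault-tolerance cycle as a state machine.
 * All names here are hypothetical illustrations. */
#include <stdbool.h>

typedef enum {
    FT_HEALTHY,
    FT_DETECTED,     /* fault detection   */
    FT_ISOLATED,     /* fault isolation   */
    FT_CONTAINED,    /* fault containment */
    FT_RECOVERED,    /* fault recovery    */
    FT_REINTEGRATED  /* re-integration    */
} ft_state;

ft_state ft_step(ft_state s, bool fault_seen, bool recovery_ok)
{
    switch (s) {
    case FT_HEALTHY:      return fault_seen ? FT_DETECTED : FT_HEALTHY;
    case FT_DETECTED:     return FT_ISOLATED;     /* pin down the failing part     */
    case FT_ISOLATED:     return FT_CONTAINED;    /* keep the fault from spreading */
    case FT_CONTAINED:    return recovery_ok ? FT_RECOVERED : FT_CONTAINED;
    case FT_RECOVERED:    return FT_REINTEGRATED; /* bring the part back online    */
    case FT_REINTEGRATED: return FT_HEALTHY;
    }
    return s;
}
```

The point of the cycle (versus plain checkpoint/restart) is that a component can leave and rejoin the running system without a global restart.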


Page 7

Programming Models

• MPI
  – Will it survive in an exascale world? (Its demise was predicted at petascale, but it seems to be doing okay.)

• Evolve hybrid language models: MPI + "What?" (see the sketch below)
  – OpenMP
  – GPU accelerators (CUDA, OpenCL)
  – PGAS languages
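A minimal MPI + OpenMP hybrid sketch (standard APIs; the funneled threading level is one common choice, not something the slides mandate):

```c
/* Minimal hybrid MPI + OpenMP "hello": one MPI rank per node or socket,
 * OpenMP threads across the cores within it.
 * Build (typical): mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Request thread support so OpenMP regions can coexist with MPI;
     * FUNNELED means only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d: thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```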

• Greater exploitation of autotuning, i.e., programs that write programs (e.g., FFTW's planner, shown below)
  – ATLAS
  – FFTW
  – IBM HPC Toolkit has some of this
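FFTW is a concrete example of the autotuning idea: its planner times candidate algorithms at plan time and keeps the fastest. A minimal sketch using the standard FFTW3 API:

```c
/* FFTW's planner as autotuning: FFTW_MEASURE runs and times candidate
 * transform algorithms, then keeps the fastest plan for this machine.
 * Build (typical): cc autotune.c -lfftw3 -lm */
#include <fftw3.h>

int main(void)
{
    const int n = 1 << 20;
    double       *in  = fftw_alloc_real(n);
    fftw_complex *out = fftw_alloc_complex(n / 2 + 1);

    /* FFTW_MEASURE searches by timing real runs; FFTW_ESTIMATE would
     * skip the search and use heuristics instead. */
    fftw_plan p = fftw_plan_dft_r2c_1d(n, in, out, FFTW_MEASURE);

    /* FFTW_MEASURE overwrites the arrays during planning, so fill after. */
    for (int i = 0; i < n; i++)
        in[i] = (double)i / n;

    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```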


Page 8


On the Path to Exascale:

A Look at Blue Waters


Page 9

NCSA Blue Waters

• Joint effort between NCSA and the University of Illinois: http://www.ncsa.illinois.edu/BlueWaters/

• First deliverable of a system based on PERCS technology (2011)

• Will be the world's first sustained-petascale system for open scientific research

• See http://www.ncsa.illinois.edu/BlueWaters/pdfs/snir-power7.pdf for more detailed information


Page 10

Blue Waters Overview

• Approximately 10 PF/s peak
• More than 300,000 cores (homogeneous)
• More than 1 petabyte of memory
• More than 10 petabytes of disk storage
• More than 0.5 exabyte of archival storage
• More than 1 PF/s sustained on scientific applications


Page 11

Building Blue Waters

Power7 Chip: 8 cores, 32 threads; L1, L2, L3 cache (32 MB); up to 256 GF (peak); 45 nm technology

Multi-chip Module: 4 Power7 chips; 128 GB memory; 512 GB/s memory bandwidth; 1 TF (peak)

Router: 1,128 GB/s bandwidth

IH Server Node: 8 MCMs (256 cores); 1 TB memory; 8 TF (peak); fully water cooled

Blue Waters Building Block: 32 IH server nodes; 32 TB memory; 256 TF (peak); 4 storage systems; 10 tape drive connections

Blue Waters: ~1 PF sustained; >300,000 cores; >1 PB of memory; >10 PB of disk storage; ~500 PB of archival storage; >100 Gbps connectivity

Blue Waters is built from components that can also be used to build systems with a wide range of capabilities, from deskside to beyond Blue Waters.

Blue Waters will be the most powerful computer in the world for scientific research when it comes on line in the summer of 2011.

Page 12

Power7 Chip: Computational Heart of Blue Waters

• Base Technology
  – 45 nm, 576 mm²
  – 1.2 B transistors

• Chip
  – 8 cores
  – 12 execution units/core
  – 1-, 2-, 4-way SMT/core
  – Up to 4 FMAs/cycle
  – Caches
    • 32 KB I- and D-cache, 256 KB L2 per core
    • 32 MB L3 (private/shared)
  – Dual DDR3 memory controllers
    • 128 GB/s peak memory bandwidth (1/2 byte/flop)
  – Clock range of 3.5 to 4 GHz
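The 256 GF chip peak follows from the FMA issue rate at the top of the clock range (a worked check, not on the slide):

$$ 8\ \text{cores} \times 4\ \tfrac{\text{FMA}}{\text{core}\cdot\text{cycle}} \times 2\ \tfrac{\text{flop}}{\text{FMA}} \times 4\ \text{GHz} = 256\ \text{GF/s}, \qquad \frac{128\ \text{GB/s}}{256\ \text{GF/s}} = \tfrac{1}{2}\ \tfrac{\text{byte}}{\text{flop}} $$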

[Figure: a Power7 chip and the quad-chip MCM built from four of them]


Page 13

High-End Server Resilience


Page 14

Feeds and Speeds per MCM

• 32 cores
• 8 flop/cycle per core
• 4 threads per core max
• 3.5 to 4 GHz
• 1 TF/s (peak)
• 32 MB L3
• 512 GB/s memory BW (0.5 byte/flop)
• 800 W (0.8 W per GF/s)
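A quick consistency check on these numbers (using the 4 GHz end of the clock range):

$$ 32\ \text{cores} \times 8\ \tfrac{\text{flop}}{\text{cycle}} \times 4\ \text{GHz} = 1024\ \text{GF/s} \approx 1\ \text{TF/s}, \qquad \frac{512\ \text{GB/s}}{1024\ \text{GF/s}} = 0.5\ \tfrac{\text{byte}}{\text{flop}}, \qquad \frac{800\ \text{W}}{1024\ \text{GF/s}} \approx 0.8\ \tfrac{\text{W}}{\text{GF/s}} $$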


Page 15

First Level Interconnect (L-Local): HUB-to-HUB copper wiring, 256 cores

[Figure: drawer-level wiring diagram. Eight quad-chip modules (QCM 0-7, four Power7 chips each) connect through eight HUB modules (HUB 0-7) over copper L-Local links. Each HUB fans out optically via 2,304-fiber 'L-Link' connections and 64/40 optical 'D-Links'. The drawer also carries 17 PCIe slots, 16 DIMM slots per node (N0-N7), and top and bottom DCA power connectors.]

ONE DRAWER: 8 MCMs, 32 chips, 256 cores

Page 16

Interconnect: 1.1 TB/s HUB

• 192 GB/s host connection
• 336 GB/s to the 7 other local nodes in the same drawer
• 240 GB/s to local-remote nodes in the same supernode (4 drawers)
• 320 GB/s to remote nodes
• 40 GB/s to general purpose I/O
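The headline 1.1 TB/s is the sum of the listed link bandwidths:

$$ 192 + 336 + 240 + 320 + 40 = 1128\ \text{GB/s} \approx 1.1\ \text{TB/s} $$

This matches the 1,128 GB/s router figure on the "Building Blue Waters" slide.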


Page 17


Page 18

Second Level Interconnect: optical 'L-Remote' links from the HUBs construct a Super Node (4 CECs), 1,024 cores

[Figure: four drawers joined by optical L-Link cables into one Super Node (32 nodes / 4 CECs).]

ONE SUPERNODE: 4 drawers, 32 MCMs, 128 chips, 1,024 cores

Page 19

BPA: 200 to 480 Vac / 370 to 575 Vdc; redundant power; direct site power feed; PDU elimination

WCU: facility water input; 100% heat to water; redundant cooling; CRAH eliminated

Storage Unit: 4U; 0 to 6 per rack; up to 384 SFF DASD per unit; file system

CECs: 2U; 1 to 12 CECs per rack; 256 cores each; 128 SN DIMM slots per CEC (8, 16, (32) GB DIMMs); 17 PCIe slots; embedded switch; redundant DCA; NW fabric; up to 3,072 cores and 24.6 TB (49.2 TB) per rack

Rack: 990.6 w x 1828.8 d x 2108.2 h mm (39"w x 72"d x 83"h); ~2,948 kg (~6,500 lbs)

Rack components: compute, storage, switch; 100% cooling; PDU eliminated

Input: 8 water lines, 4 power cords
Output: ~100 TFLOPs / 24.6 TB / 153.5 TB; 192 PCIe 16x / 12 PCIe 8x

Page 20

How does this affect OFA?

• Blue Waters can connect externally via PCIe devices (e.g., InfiniBand) as needed

• The Blue Waters interconnect
  – is RDMA based
  – is not InfiniBand (or iWARP or RoCEE)
  – has hardware support for Global Shared Memory

• The pendulum is swinging back to proprietary interconnects (at least at IBM)

• Is there a path to OFA compatibility? (a sketch of the existing OFA-side interface follows this list)
  – How can/should OFA accept/support new/different RDMA interconnects?
  – How can/should IBM work with OFA to embrace new interconnect technologies?
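For context, this is roughly what the OFA-side interface looks like today: a minimal sketch of an RDMA write via libibverbs (the helper name and the already-completed setup are assumptions; only the verbs calls themselves are the real API). A new RDMA interconnect would need to map onto something like this, or motivate an extension:

```c
/* Sketch of an RDMA write through the existing OFA verbs interface
 * (libibverbs). qp, mr, remote_addr, and rkey are assumed to have been
 * set up already. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof wr);
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* completion on the send CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);     /* 0 on success */
}
```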


Page 21

Exascale Evolution

• Technical Evolution is not always in a straight line

• Different technologies evolve at different times and rates

• e.g., Blue Waters is not a direct descendant of RoadRunner/Cell, but rather of POWER/Federation/SP

• Reaching exascale will require the consolidation and continued evolution of multiple technologies
