Dally SC10

8/8/2019 Dally SC10

1/55

8/8/2019 Dally SC10

2/55

To ExaScale and Beyond

2The GPU is the Computer

3

The GPU Advantage

1

GPU Computing

8/8/2019 Dally SC10

3/55

The GPU Advantag

8/8/2019 Dally SC10

4/55

A Tale of Two Machines

The GPU Advantag

8/8/2019 Dally SC10

5/55

Tianhe-1Aat NSC Tianjin

8/8/2019 Dally SC10

6/55

Tianhe-1Aat NSC Tianjin

The Worlds Fastest Supercomputer

2.507 Petaflop

7168 Tesla M2050 GPUs

8/8/2019 Dally SC10

7/55

Tesla M2050 GPUs

8/8/2019 Dally SC10

8/55

3 of Top5 Supercomputers

0

500

1000

1500

2000

2500

Tianhe-1A Jaguar Nebulae Tsubame Hoppe

Gigaflops

8/8/2019 Dally SC10

9/55

0

500

1000

1500

2000

2500

Tianhe-1A Jaguar Nebulae Tsubame Hoppe

Gigaflops

Top 5 Performance and Power

8/8/2019 Dally SC10

10/55

NVIDIA/NCSAGreen 500 Entry

8/8/2019 Dally SC10

11/55

NVIDIA/NCSAGreen 500 Entry

8/8/2019 Dally SC10

12/55

NVIDIA/NCSA Green 500 Entry

128 nodes, each with:1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOP peak)

1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOP peak)

4x QDR Infiniband

4 GB DRAM

Theoretical Peak Perf: 68.95 TF

Footprint: ~20 ft^2 => 3.45 TF/ft^2Cost: $500K (street price) => 137.9 MF/$

Linpack: 33.62 TF, 36.0 kW => 934 MF/W

8/8/2019 Dally SC10

13/55

Efficiency and Programmability

The GPU Advantag

8/8/2019 Dally SC10

14/55

GPU200pJ/Instruction

CPU2nJ/Instructio

8/8/2019 Dally SC10

15/55

8/8/2019 Dally SC10

16/55

CUDA GPU Roadmap

16

2

4

6

8

10

12

14

DPG

FLOPSperWatt

2007 2009 2011

TeslaFermi

Kepler

8/8/2019 Dally SC10

17/55

Efficiency and Programmability

The GPU Advantag

8/8/2019 Dally SC10

18/55

CUDA Enables Programmability

The GPU Advantag

8/8/2019 Dally SC10

19/55

CUDA C: C with a Few Keywords

void saxpy_serial(int n, float a, float *x, float *y){ for (int i = 0; i < n; ++i)y[i] = a*x[i] + y[i];}

// Invoke serial SAXPY kernelsaxpy_serial(n, 2.0, x, y);__global__ void saxpy_parallel(int n, float a, float *x, float *{

int i = blockIdx.x*blockDim.x + threadIdx.x;if (i < n) y[i] = a*x[i] + y[i];}// Invoke parallel SAXPY kernel with 256 threads/blockint nblocks = (n + 255) / 256;saxpy_parallel(n, 2.0, x, y);

8/8/2019 Dally SC10

20/55

DirectX

GPU Computing Ecosystem

Languages & APIs

ToIntegratedDevelopment Environment

Parallel Nsight for MS Visual Studio

MathematicalPackages

Cons&

Research & Education

All Major Platforms

Libraries

Fortran

8/8/2019 Dally SC10

21/55

GPU Computing ToBy the Numbers:

CUDA Capable GPUs200 Million

CUDA Toolkit Downloads600,000

Active GPU Computing Developers100,000

Members in Parallel Nsight Developer 8,000

Universities Teaching CUDA Worldwid362

CUDA Centers of Excellence Worldwid11

8/8/2019 Dally SC10

22/55

To ExaScale and Beyo

8/8/2019 Dally SC10

23/55

Science Needs 1000x More Computing

1,000,000,000

1,000,000

1,000

1

Gigaflops

1982 1997 2003 2006 2010

Estrogen Receptor36K atoms

F1-ATPase327K atoms

Ribosome2.7M atoms

Chromatophore50M atoms

BPTI3K atoms

1 Exaflop

1 Petaflop

Ran for 8 months tosimulate 2 nanoseco

8/8/2019 Dally SC10

24/55

DARPA Study Identifies Four Challenges fExaScale Computing

Report published September

Four Major ChallengesEnergy and Power challenge

Memory and Storage challenge

Concurrency and Locality challeng

Resiliency challenge

Number one issue is power

Extrapolations of current architectutechnology indicate over 100MW fo

Power also constrains what we can

Available atwww.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf

8/8/2019 Dally SC10

25/55

Power is THE Probl

8/8/2019 Dally SC10

26/55

A GPU is the Solution

Power is THE Probl

O S G O S

8/8/2019 Dally SC10

27/55

ExaFLOPS at 20MW = 50GFLOPS/W

0.1

1

10

100

2010 2013

GF

LOPS/W(

Core)

GPU5GFLOPS/W

50GFLOPS/W

8/8/2019 Dally SC10

28/55

50GFLOPS/W10x Energy Gap for Todays GPU

0.1

1

10

100

2010 2013

GF

LOPS/W(

Core)

GPU5GFLOPS/W

10x

ExaFLOPS Gap

GPU Cl th G ith

8/8/2019 Dally SC10

29/55

0.1

1

10

100

2010 2013

GF

LOPS/W(

Core)

GPUs Close the Gap withProcess and Architecture

GPU Cl th G

8/8/2019 Dally SC10

30/55

0.1

1

10

100

2010 2013

GFLOPS/W

GPUs Close the Gapwith Process and Architecture

GPU Cl th G

8/8/2019 Dally SC10

31/55

0.1

1

10

100

2010 2013

GFLOPS/W

GPUs Close the GapWith CPUs, a Gap Remains

100

8/8/2019 Dally SC10

32/55

GPUs Close the GapWith CPUs, a Gap Remains

Heterogeneous Compis Required to get to

0.1

1

10

100

2010 2013

GFLOPS

/W

8/8/2019 Dally SC10

33/55

NVIDIAs Extreme-Scale Computing Pro

Echelon

8/8/2019 Dally SC10

34/55

Echelon Team

S Sk h

8/8/2019 Dally SC10

35/55

System Sketch

LoC

Echelon System

Cabinet 0 (C0) 2.6PF, 205TB/s, 32TB

Module 0 (M)) 160TF, 12.8TB/s, 2TB M15

Node 0 (N0) 20TF, 1.6TB/s, 256GB

Processor Chip (PC)

L0

C0

SM0

L0

C7

NoC

SM127

MC NICL20 L21023

DRAMCube

DRAMCube

NVRAM

High-Radix Router Module (RM)

CN

Dragonfly Interconnect (optical fiber)

N7

LC0

LC7

E ti M d l

8/8/2019 Dally SC10

36/55

Execution Model

A B

Active Message

Abstract MemoryHierarchy

Global Address Space

ThreadObject

B

Load/Store

A

B

The High Cost of Data Movement

8/8/2019 Dally SC10

37/55

The High Cost of Data MovementFetching operands costs more than computing on th

20mm

64-bit DP20pJ 26 pJ 256 pJ

1 nJ

500 pJ Efficieoff-chi

28nm

256-bitbuses

16 nJDRAMRd/Wr

256-bit access8 kB SRAM

50 pJ

8/8/2019 Dally SC10

38/55

An NVIDIA ExaScale Mac

Lane 4 DFMAs 20GFLOPS

8/8/2019 Dally SC10

39/55

Lane 4 DFMAs, 20GFLOPS

DFMA DFMA DFMA DFMA

MainRegisters

LSI L

Operand Registers

L0 I$

L0 D$

SM 8 lanes 160GFLOPS

8/8/2019 Dally SC10

40/55

SM 8 lanes 160GFLOPS

P P P P P P P

Switch

L1$

8/8/2019 Dally SC10

41/55

Node MCM 20TF 256GB

8/8/2019 Dally SC10

42/55

Node MCM 20TF + 256GB

GPU Chip20TF DP256MB

1.4TB/sDRAM BW

150

Netw

DRAMStack

DRAMStack

DRAMStack

NVMemory

Cabinet 128 Nodes 2 56PF 3

8/8/2019 Dally SC10

43/55

32 Modules, 4 Nodes/Module,Central Router Module(s), Dragonfly Interconnec

NODE

NODE

NODE

NODE

MODULE

NODE

NODE

NODE

NODE

MODULE

ROUTER

ROUTER

ROUTER

ROUTER

MODULE

NODE

NODE

NODE

NODE

MODULE

Cabinet 128 Nodes 2.56PF 3

System to ExaScale and Beyo

8/8/2019 Dally SC10

44/55

Dragonfly Interconnect400 Cabinets is ~1EF and ~15MW

System to ExaScale and Beyo

8/8/2019 Dally SC10

45/55

8/8/2019 Dally SC10

46/55

GPU Computing Enables ExaScaleAt Reasonable Power2

The GPU is the ComputerA general purpose computing engine, not just an a3

GPU Computing is #1 TodayOn Top 500 AND Dominant on Green 5001

GPU Computing is the Fut

The Real Challenge is Software4

8/8/2019 Dally SC10

47/55

8/8/2019 Dally SC10

48/55

8/8/2019 Dally SC10

49/55

Optimize the Storage Hierarchy

2Tailor Memory to the Application3

Data Movement Dominates Power1

Power is THE Problem

Some Applications Have Hierarchical Re-U

8/8/2019 Dally SC10

50/55

pp

0

20

40

60

80

100

120

1.0E+0 1.0E+3 1.0E+6 1.0E+9

%Miss

Size

DGEMM

Applications with Hierarchical

8/8/2019 Dally SC10

51/55

ppReuse Want a Deep Storage Hierarchy

P P P P P P P P P P P P P P

L2 L2 L2 L2

L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1

L3

Some Applications Have

8/8/2019 Dally SC10

52/55

Plateaus in Their Working Sets

0

20

40

60

80

100

120

1.0E+0 1.0E+3 1.0E+6 1.0E+9

%Miss

Size

Table

Applications with Plateaus

8/8/2019 Dally SC10

53/55

Want a Shallow Storage Hierarchy

P P P P P P P P P P P P P P

NoC

L2 L2 L2 L

L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1

Configurable Memory Can Do Both

8/8/2019 Dally SC10

54/55

At the Same Time

Flat hierarchy for large working sets

Deep hierarchy for reuse

Shared memory for explicit management

Cache memory for unpredictable sharing

P

L1

SRAM SRAM SRAM SR

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

P

L1

NoC

Configurable Memory

8/8/2019 Dally SC10

55/55

g yReduces Distance and Energy

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

P L1

SRAM

P L1

SRAM

P L1 P L1

ROUTER

ROUTER

ROUTER

ROUTER

Dally SC10

Documents

Transcript of Dally SC10