Dally SC10
Transcript of Dally SC10
-
8/8/2019 Dally SC10
1/55
-
8/8/2019 Dally SC10
2/55
To ExaScale and Beyond
2The GPU is the Computer
3
The GPU Advantage
1
GPU Computing
-
8/8/2019 Dally SC10
3/55
The GPU Advantag
-
8/8/2019 Dally SC10
4/55
A Tale of Two Machines
The GPU Advantag
-
8/8/2019 Dally SC10
5/55
Tianhe-1Aat NSC Tianjin
-
8/8/2019 Dally SC10
6/55
Tianhe-1Aat NSC Tianjin
The Worlds Fastest Supercomputer
2.507 Petaflop
7168 Tesla M2050 GPUs
-
8/8/2019 Dally SC10
7/55
Tesla M2050 GPUs
-
8/8/2019 Dally SC10
8/55
3 of Top5 Supercomputers
0
500
1000
1500
2000
2500
Tianhe-1A Jaguar Nebulae Tsubame Hoppe
Gigaflops
-
8/8/2019 Dally SC10
9/55
0
500
1000
1500
2000
2500
Tianhe-1A Jaguar Nebulae Tsubame Hoppe
Gigaflops
Top 5 Performance and Power
-
8/8/2019 Dally SC10
10/55
NVIDIA/NCSAGreen 500 Entry
-
8/8/2019 Dally SC10
11/55
NVIDIA/NCSAGreen 500 Entry
-
8/8/2019 Dally SC10
12/55
NVIDIA/NCSA Green 500 Entry
128 nodes, each with:1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOP peak)
1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOP peak)
4x QDR Infiniband
4 GB DRAM
Theoretical Peak Perf: 68.95 TF
Footprint: ~20 ft^2 => 3.45 TF/ft^2Cost: $500K (street price) => 137.9 MF/$
Linpack: 33.62 TF, 36.0 kW => 934 MF/W
-
8/8/2019 Dally SC10
13/55
Efficiency and Programmability
The GPU Advantag
-
8/8/2019 Dally SC10
14/55
GPU200pJ/Instruction
CPU2nJ/Instructio
-
8/8/2019 Dally SC10
15/55
-
8/8/2019 Dally SC10
16/55
CUDA GPU Roadmap
16
2
4
6
8
10
12
14
DPG
FLOPSperWatt
2007 2009 2011
TeslaFermi
Kepler
-
8/8/2019 Dally SC10
17/55
Efficiency and Programmability
The GPU Advantag
-
8/8/2019 Dally SC10
18/55
CUDA Enables Programmability
The GPU Advantag
-
8/8/2019 Dally SC10
19/55
CUDA C: C with a Few Keywords
void saxpy_serial(int n, float a, float *x, float *y){ for (int i = 0; i < n; ++i)y[i] = a*x[i] + y[i];}
// Invoke serial SAXPY kernelsaxpy_serial(n, 2.0, x, y);__global__ void saxpy_parallel(int n, float a, float *x, float *{
int i = blockIdx.x*blockDim.x + threadIdx.x;if (i < n) y[i] = a*x[i] + y[i];}// Invoke parallel SAXPY kernel with 256 threads/blockint nblocks = (n + 255) / 256;saxpy_parallel(n, 2.0, x, y);
-
8/8/2019 Dally SC10
20/55
DirectX
GPU Computing Ecosystem
Languages & APIs
ToIntegratedDevelopment Environment
Parallel Nsight for MS Visual Studio
MathematicalPackages
Cons&
Research & Education
All Major Platforms
Libraries
Fortran
-
8/8/2019 Dally SC10
21/55
GPU Computing ToBy the Numbers:
CUDA Capable GPUs200 Million
CUDA Toolkit Downloads600,000
Active GPU Computing Developers100,000
Members in Parallel Nsight Developer 8,000
Universities Teaching CUDA Worldwid362
CUDA Centers of Excellence Worldwid11
-
8/8/2019 Dally SC10
22/55
To ExaScale and Beyo
-
8/8/2019 Dally SC10
23/55
Science Needs 1000x More Computing
1,000,000,000
1,000,000
1,000
1
Gigaflops
1982 1997 2003 2006 2010
Estrogen Receptor36K atoms
F1-ATPase327K atoms
Ribosome2.7M atoms
Chromatophore50M atoms
BPTI3K atoms
1 Exaflop
1 Petaflop
Ran for 8 months tosimulate 2 nanoseco
-
8/8/2019 Dally SC10
24/55
DARPA Study Identifies Four Challenges fExaScale Computing
Report published September
Four Major ChallengesEnergy and Power challenge
Memory and Storage challenge
Concurrency and Locality challeng
Resiliency challenge
Number one issue is power
Extrapolations of current architectutechnology indicate over 100MW fo
Power also constrains what we can
Available atwww.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf
-
8/8/2019 Dally SC10
25/55
Power is THE Probl
-
8/8/2019 Dally SC10
26/55
A GPU is the Solution
Power is THE Probl
O S G O S
-
8/8/2019 Dally SC10
27/55
ExaFLOPS at 20MW = 50GFLOPS/W
0.1
1
10
100
2010 2013
GF
LOPS/W(
Core)
GPU5GFLOPS/W
50GFLOPS/W
-
8/8/2019 Dally SC10
28/55
50GFLOPS/W10x Energy Gap for Todays GPU
0.1
1
10
100
2010 2013
GF
LOPS/W(
Core)
GPU5GFLOPS/W
10x
ExaFLOPS Gap
GPU Cl th G ith
-
8/8/2019 Dally SC10
29/55
0.1
1
10
100
2010 2013
GF
LOPS/W(
Core)
GPUs Close the Gap withProcess and Architecture
GPU Cl th G
-
8/8/2019 Dally SC10
30/55
0.1
1
10
100
2010 2013
GFLOPS/W
GPUs Close the Gapwith Process and Architecture
GPU Cl th G
-
8/8/2019 Dally SC10
31/55
0.1
1
10
100
2010 2013
GFLOPS/W
GPUs Close the GapWith CPUs, a Gap Remains
100
-
8/8/2019 Dally SC10
32/55
GPUs Close the GapWith CPUs, a Gap Remains
Heterogeneous Compis Required to get to
0.1
1
10
100
2010 2013
GFLOPS
/W
-
8/8/2019 Dally SC10
33/55
NVIDIAs Extreme-Scale Computing Pro
Echelon
-
8/8/2019 Dally SC10
34/55
Echelon Team
S Sk h
-
8/8/2019 Dally SC10
35/55
System Sketch
LoC
Echelon System
Cabinet 0 (C0) 2.6PF, 205TB/s, 32TB
Module 0 (M)) 160TF, 12.8TB/s, 2TB M15
Node 0 (N0) 20TF, 1.6TB/s, 256GB
Processor Chip (PC)
L0
C0
SM0
L0
C7
NoC
SM127
MC NICL20 L21023
DRAMCube
DRAMCube
NVRAM
High-Radix Router Module (RM)
CN
Dragonfly Interconnect (optical fiber)
N7
LC0
LC7
E ti M d l
-
8/8/2019 Dally SC10
36/55
Execution Model
A B
Active Message
Abstract MemoryHierarchy
Global Address Space
ThreadObject
B
Load/Store
A
B
The High Cost of Data Movement
-
8/8/2019 Dally SC10
37/55
The High Cost of Data MovementFetching operands costs more than computing on th
20mm
64-bit DP20pJ 26 pJ 256 pJ
1 nJ
500 pJ Efficieoff-chi
28nm
256-bitbuses
16 nJDRAMRd/Wr
256-bit access8 kB SRAM
50 pJ
-
8/8/2019 Dally SC10
38/55
An NVIDIA ExaScale Mac
Lane 4 DFMAs 20GFLOPS
-
8/8/2019 Dally SC10
39/55
Lane 4 DFMAs, 20GFLOPS
DFMA DFMA DFMA DFMA
MainRegisters
LSI L
Operand Registers
L0 I$
L0 D$
SM 8 lanes 160GFLOPS
-
8/8/2019 Dally SC10
40/55
SM 8 lanes 160GFLOPS
P P P P P P P
Switch
L1$
-
8/8/2019 Dally SC10
41/55
Node MCM 20TF 256GB
-
8/8/2019 Dally SC10
42/55
Node MCM 20TF + 256GB
GPU Chip20TF DP256MB
1.4TB/sDRAM BW
150
Netw
DRAMStack
DRAMStack
DRAMStack
NVMemory
Cabinet 128 Nodes 2 56PF 3
-
8/8/2019 Dally SC10
43/55
32 Modules, 4 Nodes/Module,Central Router Module(s), Dragonfly Interconnec
NODE
NODE
NODE
NODE
MODULE
NODE
NODE
NODE
NODE
MODULE
ROUTER
ROUTER
ROUTER
ROUTER
MODULE
NODE
NODE
NODE
NODE
MODULE
Cabinet 128 Nodes 2.56PF 3
System to ExaScale and Beyo
-
8/8/2019 Dally SC10
44/55
Dragonfly Interconnect400 Cabinets is ~1EF and ~15MW
System to ExaScale and Beyo
-
8/8/2019 Dally SC10
45/55
-
8/8/2019 Dally SC10
46/55
GPU Computing Enables ExaScaleAt Reasonable Power2
The GPU is the ComputerA general purpose computing engine, not just an a3
GPU Computing is #1 TodayOn Top 500 AND Dominant on Green 5001
GPU Computing is the Fut
The Real Challenge is Software4
-
8/8/2019 Dally SC10
47/55
-
8/8/2019 Dally SC10
48/55
-
8/8/2019 Dally SC10
49/55
Optimize the Storage Hierarchy
2Tailor Memory to the Application3
Data Movement Dominates Power1
Power is THE Problem
Some Applications Have Hierarchical Re-U
-
8/8/2019 Dally SC10
50/55
pp
0
20
40
60
80
100
120
1.0E+0 1.0E+3 1.0E+6 1.0E+9
%Miss
Size
DGEMM
Applications with Hierarchical
-
8/8/2019 Dally SC10
51/55
ppReuse Want a Deep Storage Hierarchy
P P P P P P P P P P P P P P
L2 L2 L2 L2
L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1
L3
Some Applications Have
-
8/8/2019 Dally SC10
52/55
Plateaus in Their Working Sets
0
20
40
60
80
100
120
1.0E+0 1.0E+3 1.0E+6 1.0E+9
%Miss
Size
Table
Applications with Plateaus
-
8/8/2019 Dally SC10
53/55
Want a Shallow Storage Hierarchy
P P P P P P P P P P P P P P
NoC
L2 L2 L2 L
L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1
Configurable Memory Can Do Both
-
8/8/2019 Dally SC10
54/55
At the Same Time
Flat hierarchy for large working sets
Deep hierarchy for reuse
Shared memory for explicit management
Cache memory for unpredictable sharing
P
L1
SRAM SRAM SRAM SR
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
P
L1
NoC
Configurable Memory
-
8/8/2019 Dally SC10
55/55
g yReduces Distance and Energy
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
P L1
SRAM
P L1
SRAM
P L1 P L1
ROUTER
ROUTER
ROUTER
ROUTER