Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd...
Transcript of Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd...
![Page 1: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/1.jpg)
Bill Dally | Chief Scientist and SVP, Research NVIDIA | Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale
![Page 2: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/2.jpg)
Scientific Discovery and Business Analytics
Driving an Insatiable Demand for More Computing Performance
![Page 3: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/3.jpg)
Compute Communication Memory &
Storage
HPC Analytics
![Page 4: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/4.jpg)
Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011
The End of Historic Scaling
![Page 5: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/5.jpg)
“Moore’s Law gives us more transistors…
Dennard scaling made them useful.”
Bob Colwell, DAC 2013, June 4, 2013
![Page 6: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/6.jpg)
18,688 NVIDIA Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack
Numerous real science applications
2.14GF/W – Most efficient accelerator
TITAN
![Page 7: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/7.jpg)
18,688 NVIDIA Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack
Numerous real science applications
2.14GF/W – Most efficient accelerator
TITAN
![Page 8: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/8.jpg)
20PF 18,000 GPUs
10MW 2 GFLOPs/W ~10
7 Threads
You Are Here
1,000PF (50x) 72,000HCNs (4x)
20MW (2x) 50 GFLOPs/W (25x)
~1010 Threads (1000x)
2013
2020
![Page 9: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/9.jpg)
0
5
10
15
20
25
30
35
40
45
50
2013 2014 2015 2016 2017 2018 2019 2020
GFLO
PS/W
Needed
Process
EFFICIENCY GAP
![Page 10: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/10.jpg)
![Page 11: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/11.jpg)
CIRCUITS 3X
PROCESS 2.2X
1
10
2013 2014 2015 2016 2017 2018 2019 2020
GFLO
PS/W
Needed
Process
![Page 12: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/12.jpg)
Simpler Cores = Energy Efficiency
Source: Azizi [PhD 2010]
![Page 13: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/13.jpg)
CPU 1690 pJ/flop
Optimized for Latency
Caches
Westmere 32 nm
GPU 140 pJ/flop
Optimized for Throughput
Explicit Management of On-chip Memory
Kepler 28 nm
![Page 14: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/14.jpg)
1
10
2013 2014 2015 2016 2017 2018 2019 2020
GFLO
PS/W
Needed
Process
CIRCUITS 3X
PROCESS 2.2X
ARCHITECTURE 4X
![Page 15: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/15.jpg)
Programmers, Tools, and Architecture Need to Play Their Positions
Programmer
Architecture Tools
![Page 16: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/16.jpg)
Programmers, Tools, and Architecture Need to Play Their Positions
Algorithm
All of the parallelism
Abstract locality
Fast mechanisms
Exposed costs
Combinatorial optimization
Mapping
Selection of mechanisms
Programmer
Architecture Tools
![Page 17: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/17.jpg)
DRAM I/O DRAM I/O DRAM I/O DRAM I/ONW I/O
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
LO
C
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
NOC
SM
SM
SM
SM
DRAM I/O DRAM I/O DRAM I/O DRAM I/ONW I/O
DR
AM
I/OD
RA
M I/O
DR
AM
I/OD
RA
M I/O
NW
I/O
DR
AM
I/OD
RA
M I/O
DR
AM
I/OD
RA
M I/O
NW
I/O
L2
Banks
XBAR
NOC
SM
La
ne
La
ne
La
ne
La
ne
La
ne
La
ne
La
ne
La
ne
SM
SM
![Page 18: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/18.jpg)
• <1ms Latency
• Scalable bandwidth
• Small messages 50% @ 32B
• Global adaptive routing
• PGAS
• Collectives & Atomics
• MPI Offload
An Enabling HPC Network
![Page 19: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/19.jpg)
An Open HPC Network Ecosystem
Common:
• Software-NIC API
• NIC-Router Channel
Processor/NIC Vendors
System Vendors
Networking Vendors
![Page 20: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/20.jpg)
Programmer
Architecture Tools
Programming
Parallelism
Heterogeneity
Hierarchy
Power
25x Efficiency with 2.2x from process
![Page 21: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/21.jpg)
“Super” Computing From Super Computers to Super Phones
![Page 22: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/22.jpg)
Backup
![Page 23: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/23.jpg)
![Page 24: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/24.jpg)
Source: Moore, Electronics 38(8) April 19, 1965
In The Past, Demand Was Fueled
by Moore’s Law
![Page 25: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/25.jpg)
Source: Dally et al. “The Last Classical Computer”, ISAT Study, 2001
ILP Was Mined Out in 2001
0.0001
0.001
0.01
0.1
1
10
100
1000
10000
100000
1000000
10000000
1980 1990 2000 2010 2020
Perf (ps/Inst)
Linear (ps/Inst)
30:1
1,000:1
30,000:1
![Page 26: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/26.jpg)
Source: Moore, ISSCC Keynote, 2003
Voltage Scaling Ended in 2005
![Page 27: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/27.jpg)
Moore’s law is alive and well, but…
Instruction-level parallelism (ILP) was mined out in 2001
Voltage scaling (Dennard scaling) ended in 2005
Most power is spent on communication
What does this mean to you?
Summary
![Page 28: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/28.jpg)
Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011
The End of Historic Scaling
![Page 29: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/29.jpg)
All performance is from parallelism
Machines are power limited (efficiency IS performance)
Machines are communication limited (locality IS performance)
In the Future
![Page 30: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/30.jpg)
Two Major Challenges
Programming
Parallel (1010 threads)
Hierarchical
Heterogeneous
Energy Efficiency
25x in 7 years (~2.2x from process)
![Page 31: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/31.jpg)
1
10
2013 2014 2015 2016 2017 2018 2019 2020
GFLO
PS/W
Needed
Process
![Page 32: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/32.jpg)
How is Power Spent in a CPU?
In-order Embedded OOO Hi-perf
Clock + Control Logic
24%
Data Supply
17%
Instruction Supply
42%
Register File
11%
ALU 6% Clock + Pins
45%
ALU
4%
Fetch
11%
Rename
10%
Issue
11%
RF
14%
Data Supply
5%
Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)
![Page 33: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/33.jpg)
Processor Technology 40 nm 10nm
Vdd (nominal) 0.9 V 0.7 V
DFMA energy 50 pJ 7.6 pJ
64b 8 KB SRAM Rd 14 pJ 2.1 pJ
Wire energy (256 bits, 10mm) 310 pJ 174 pJ
Memory Technology 45 nm 16nm
DRAM interface pin bandwidth 4 Gbps 50 Gbps
DRAM interface energy 20-30 pJ/bit 2 pJ/bit
DRAM access energy 8-15 pJ/bit 2.5 pJ/bit
Keckler [Micro 2011], Vogelsang [Micro 2010]
Energy Shopping List
FP Op lower bound
=
4 pJ
![Page 34: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/34.jpg)
![Page 35: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/35.jpg)
CIRCUITS 3X
PROCESS 2.2X
1
10
2013 2014 2015 2016 2017 2018 2019 2020
GFLO
PS/W
Needed
Process
![Page 36: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/36.jpg)
Throughput-Optimized Core (TOC)
Latency-Optimized Core
(LOC)
PC
PC
Branch Predict
I$
Register Rename
ALU 1 ALU 2 ALU 4 ALU 3
Reorder Buffer
Instruction Window
ALU 1 ALU 2 ALU 4 ALU 3
I$
Select
Register File
PCs
Register File
![Page 37: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/37.jpg)
SIMT Lanes
Streaming Multiprocessor (SM)
Warp Scheduler
Shared Memory 32 KB
15% of SM Energy Main Register File
32 banks
ALU SFU MEM TEX
![Page 38: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/38.jpg)
Hierarchical Register File
0%
20%
40%
60%
80%
100%
Perc
ent
of
All V
alu
es
Pro
duced
Read >2 Times
Read 2 Times
Read 1 Time
Read 0 Times
0%
20%
40%
60%
80%
100%
Perc
ent
of
All V
alu
es
Pro
duced
Lifetime >3
Lifetime 3
Lifetime 2
Lifetime 1
![Page 39: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/39.jpg)
Register File Caching (RFC)
S F U
M E M
T E X
Operand Routing
Operand Buffering
MRF 4x128-bit Banks (1R1W)
RFC 4x32-bit (3R1W) Banks
ALU
![Page 40: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/40.jpg)
Energy Savings from RF Hierarchy 54% Energy Reduction
Source: Gebhart, et. al (Micro 2011)
![Page 41: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/41.jpg)
Two Major Challenges
Programming
Parallel (1010 threads)
Hierarchical
Heterogeneous
Energy Efficiency
25x in 7 years (~2.2x from process)
![Page 42: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/42.jpg)
Skills on LinkedIn Size (approx) Growth (rel)
C++ 1,000,000 -8%
Javascript 1,000,000 -1%
Python 429,000 7%
Fortran 90,000 -11%
MPI 21,000 -3%
x86 Assembly 17,000 -8%
CUDA 14,000 9%
Parallel programming 13,000 3%
OpenMP 8,000 2%
TBB 389 10%
6502 Assembly 256 -13%
Source: linkedin.com/skills (as of Jun 11, 2013)
Mainstream
Programming
Parallel and
Assembly
Programming
![Page 43: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/43.jpg)
forall molecule in set: # 1E6 molecules
forall neighbor in molecule.neighbors: # 1E2 neighbors ea
forall force in forces: # several forces
# reduction
molecule.force += force(molecule, neighbor)
Parallel Programming is Easy
![Page 44: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/44.jpg)
We Can Make It Hard
pid = fork() ; // explicitly managing threads
lock(struct.lock) ; // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock) ;
code = send(pid, tag, &msg) ; // partition across nodes
![Page 45: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/45.jpg)
Programmers, Tools, and Architecture Need to Play Their Positions
Programmer
Architecture Tools
![Page 46: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/46.jpg)
Programmers, Tools, and Architecture Need to Play Their Positions
Algorithm
All of the parallelism
Abstract locality
Fast mechanisms
Exposed costs
Combinatorial optimization
Mapping
Selection of mechanisms
Programmer
Architecture Tools
![Page 47: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/47.jpg)
OpenACC: Easy and Portable
!$acc parallel loop do i = 1, 20*128 !dir$ unroll 1000 do j = 1, 5000000 fa(i) = a * fa(i) + fb(i) end do end do
do i = 1, 20*128 do j = 1, 5000000 fa(i) = a * fa(i) + fb(i) end do end do Serial Code: SAXPY
![Page 48: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/48.jpg)
Conclusion
![Page 49: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/49.jpg)
Source: C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011
The End of Historic Scaling
![Page 50: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/50.jpg)
Parallelism is the source of all performance
Power limits all computing
Communication dominates power
![Page 51: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/51.jpg)
Two Challenges
Programming
Parallelism
Heterogeneity
Hierarchy
Power
25x Efficiency with 2.2x from process
![Page 52: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/52.jpg)
Programmer
Architecture Tools
Programming
Parallelism
Heterogeneity
Hierarchy
Power
25x Efficiency with 2.2x from process
![Page 53: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/53.jpg)
“Super” Computing From Super Computers to Super Phones
![Page 54: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/54.jpg)
64-bit DP 20pJ 26 pJ 256 pJ
1 nJ
500 pJ Efficient off-chip link
256-bit buses
16 nJ DRAM Rd/Wr
256-bit access 8 kB SRAM 50 pJ
20mm
Communication Takes More Energy Than Arithmetic
![Page 55: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/55.jpg)
Key to Parallelism: Independent operations on independent
data
sum(map(multiply, x, x))
every pair-wise multiply is independent
parallelism is permitted
![Page 56: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/56.jpg)
Flat computation
Key to Locality: Data decomposition should drive mapping
total = sum(x)
vs.
tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)
![Page 57: Efficiency and Programmability: Enablers for ExaScale · Processor Technology 40 nm 10nm Vdd (nominal) 0.9 V 0.7 V DFMA energy 50 pJ 7.6 pJ 64b 8 KB SRAM Rd 14 pJ 2.1 pJ Wire energy](https://reader033.fdocuments.us/reader033/viewer/2022051912/60034f5e2115fd396c6f1765/html5/thumbnails/57.jpg)
Explicit decomposition
Key to Locality: Data decomposition should drive mapping
total = sum(x)
vs.
tiles = split(x)
partials = map(sum, tiles)
total = sum(partials)