HOT CHIPS 2014 NVIDIA’S DENVER PROCESSOR Boggs, CPU Architecture Co-authors: Gary Brown, Bill...
Darrell Boggs, CPU Architecture
Co-authors: Gary Brown, Bill Rozas,
Nathan Tuck, K S Venkatraman
HOT CHIPS 2014
NVIDIA’S DENVER PROCESSOR
The First 64-bit Android Kepler-Class Chip
with Dual Denver CPUs TEGRA K1
TEGRA K1 192-core Kepler-Class Chip
One Chip, Two Versions (Pin Compatible)

| | Quad A15 CPUs | Dual Denver CPUs |
|---|---|---|
| Architecture | 32-bit | 64-bit |
| Issue width | 3-way superscalar | 7-way superscalar |
| Clock | Up to 2.3GHz | Up to 2.5GHz |
| L1$ | 32K+32K | 128K+64K |
DENVER VALUE PROPOSITION
Highest performance and very energy-efficient ARMv8 processor
Greater dynamic sharing with GPU
Extended battery life
Low latency power-state transition
Best web browsing experience
Designed to bring PC-class performance to the ARM ecosystem
Content creation
Gaming
Enterprise applications
DENVER CPU: Highest Perf ARMv8 CPU
7-wide superscalar
Aggressive HW prefetcher
Dynamic Code Optimization: optimize once, use many times
OOO execution without the power
[Block diagram: Denver Core. IFU and BPU; L1 Inst Cache (128 K, 4 way); ARM HW Decoder; Scheduler; L1 Data Cache (64 K, 4 way); execution units JSR, IEU0, IEU1, FP0, FP1, LS0, LS1; HW Prefetch.]
TEGRA K1 SUPERSCALAR ARCHITECTURE
[Figure: execution-unit diagrams. Cortex-A15: Decoder (3 wide) and Scheduler feeding Branch, Integer, Integer, Mul, FP/NEON, FP/NEON, Load, and Store units. Denver: Decoder (predecode, 8 wide) and Scheduler feeding Branch, Integer + Mul, Integer, FP/NEON, FP/NEON, and two Load/Store/Integer units.]

| | Cortex-A15 (Tegra K1-32) | Denver (Tegra K1-64) |
|---|---|---|
| Branch | 1 | 1 |
| Integer | 2 | 2 (+ Mul) + 2 |
| Multiply | 1 | (included under Integer) |
| Floating Point/NEON | 2 x 64-bit | 2 x 128-bit |
| LD/ST | 1 LD and 1 ST | 2 LD and/or ST |
| Peak IPC | 3 | 7+ |
Pipeline Microarchitecture – Mispredict Penalty
[Pipeline diagram: Denver. Stage labels: IP1, IC2, IW3, IN4, IN5, SB1, SB2, EB0, EB1, EA2, ED3, EL4, EE5, ES6, EW7. Annotations: ITLB, I$ Rd, Way Sel, Pick, Dec, PB, Sch, RF Rd, Bypass, ALU, JCC CMP, RF wr, Correct Target, Misprediction Signal, 13 Cycle Penalty.]
Tegra K1-32: 15 cycle mispredict
Tegra K1-64: 13 cycle mispredict (lower is better)
Higher ILP and efficiency
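To put the two penalties in perspective, here is a minimal sketch of the usual first-order estimate: average stall cycles per instruction = penalty x branch fraction x mispredict rate. Only the 15 and 13 cycle penalties come from the slide. The branch fraction and mispredict rate are illustrative assumptions, not measured Tegra K1 figures.

```python
def mispredict_cpi_overhead(penalty_cycles, branch_fraction, mispredict_rate):
    """Average stall cycles added per instruction by branch mispredictions."""
    return penalty_cycles * branch_fraction * mispredict_rate


# Assumed workload characteristics (hypothetical, for illustration only).
BRANCH_FRACTION = 0.20   # one instruction in five is a branch
MISPREDICT_RATE = 0.05   # 5% of branches are mispredicted

# Penalties as quoted on the slide.
for name, penalty in (("Tegra K1-32", 15), ("Tegra K1-64", 13)):
    overhead = mispredict_cpi_overhead(penalty, BRANCH_FRACTION, MISPREDICT_RATE)
    print(f"{name}: {overhead:.3f} stall cycles per instruction")
```

Under these assumed rates the shorter penalty removes about 0.02 stall cycles per instruction. The real benefit depends on the workload's actual branch behavior.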
CORE CLUSTER RETENTION STATE
New power management state: CC4
Allows cache and architectural state retention
Allows voltage to be reduced below Vmin to a retention voltage
Fast entry and exit latencies enable wider range of use
DENVER IDLE POWER IMPROVES WITH RETENTION
[Chart annotations: overhead from energy required to flush L2; leakage; efficiency crossover; power penalty if entered for short durations.]
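The crossover idea can be sketched with a simple linear energy model: each idle state costs a fixed entry/exit overhead plus a steady idle power for as long as the core stays there. The state names and every number below are hypothetical placeholders chosen only to show the shape of the trade-off. They are not Denver measurements.

```python
from typing import NamedTuple


class IdleState(NamedTuple):
    name: str
    overhead_uj: float    # fixed entry + exit energy, in microjoules
    idle_power_mw: float  # steady power while idle, in milliwatts


def idle_energy_uj(state, duration_ms):
    """Total energy for one idle period (mW * ms = uJ)."""
    return state.overhead_uj + state.idle_power_mw * duration_ms


def crossover_ms(a, b):
    """Idle duration at which states a and b cost the same energy."""
    if a.idle_power_mw == b.idle_power_mw:
        raise ValueError("states with equal idle power never cross over")
    return (b.overhead_uj - a.overhead_uj) / (a.idle_power_mw - b.idle_power_mw)


# Hypothetical numbers: retention keeps caches alive (small overhead, some
# leakage); a deeper power-gated state must flush L2 first (large overhead,
# little leakage).
RETENTION = IdleState("retention", overhead_uj=5.0, idle_power_mw=10.0)
POWER_GATED = IdleState("power-gated", overhead_uj=100.0, idle_power_mw=1.0)

t = crossover_ms(RETENTION, POWER_GATED)
print(f"crossover at about {t:.1f} ms of idle time")
```

Below the crossover the low-overhead state wins, which matches the slide's note about a power penalty when a deep state is entered for short durations.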
DYNAMIC CODE OPTIMIZATION: OPTIMIZE ONCE, USE MANY TIMES
[Diagram labels: Instructions; Denver Hardware (Hardware Decoder, Execution Units); Optimizer; Optimization µ-interrupt.]
[Diagram labels: Instructions; Denver Hardware (Hardware Decoder, Execution Units); Optimizer; Optimization Cache (Optimized µcode); Dynamic Profile Information.]
Unrolls loops
Renames registers
Reorders loads and stores
Improves control flow
Removes unused computation
Hoists redundant computation
Sinks uncommonly executed computation
Improves scheduling
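As an illustration of one item on this list, "removes unused computation", here is a minimal sketch of dead-code elimination over a straight-line trace. The toy instruction format `(dest, op, srcs)`, the register names, and the rule that only `store` has side effects are all assumptions made for the example. Nothing here reflects Denver's actual optimizer or its µcode format.

```python
def remove_unused(instructions, live_out):
    """Drop instructions whose results are never used (backward liveness pass).

    instructions: list of (dest, op, srcs) tuples; dest is None for stores.
    live_out: registers that must still hold their values after the trace.
    """
    live = set(live_out)
    kept = []
    for dest, op, srcs in reversed(instructions):
        # Keep anything with a side effect or whose result is needed later.
        if op == "store" or dest in live:
            kept.append((dest, op, srcs))
            live.discard(dest)   # this instruction defines dest...
            live.update(srcs)    # ...and needs its sources to be live
    kept.reverse()
    return kept


trace = [
    ("r1", "add", ("r0", "r0")),
    ("r2", "mul", ("r1", "r1")),   # r2 is never read: unused computation
    ("r3", "sub", ("r1", "r0")),
    (None, "store", ("r3", "r4")),
]

for instruction in remove_unused(trace, live_out=set()):
    print(instruction)
```

Walking the trace backwards means an instruction is kept only if something after it reads its result, so the `mul` is dropped while the `add` that feeds the stored value survives.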
[Diagram labels: Instructions; Denver Hardware (Hardware Decoder, Execution Units); Optimization Cache (Optimized µcode); Optimization Lookup; optimized code execution from optimization cache.]
[Diagram labels: Instructions; Denver Hardware (Hardware Decoder, Execution Units); Optimization Lookup; Optimizer; Optimization Cache (four Optimized µcode entries); Dynamic Profile Information; Chaining.]
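The "optimize once, use many times" flow in these diagrams can be modeled roughly as follows: look up each code region in an optimization cache, fall back to the hardware decoder on a miss while counting executions, and run the optimizer once a region looks hot. The class name, the hot threshold, and the string stand-ins for optimized routines are all hypothetical. The slides do not give Denver's real policy or data structures, and chaining between optimized regions is not modeled.

```python
HOT_THRESHOLD = 3  # assumed: executions before a region is optimized


class DcoModel:
    """Toy model of lookup, profiling, and one-time optimization."""

    def __init__(self, hot_threshold=HOT_THRESHOLD):
        self.hot_threshold = hot_threshold
        self.profile = {}        # region start address -> execution count
        self.opt_cache = {}      # region start address -> optimized routine
        self.optimizer_runs = 0

    def execute_region(self, pc):
        """Return which path executed the region starting at pc."""
        if pc in self.opt_cache:                 # optimization lookup hit
            return "optimized"
        # Miss: run through the hardware decoder and record profile data.
        self.profile[pc] = self.profile.get(pc, 0) + 1
        if self.profile[pc] >= self.hot_threshold:
            # Optimize once; later executions reuse the cached result.
            self.opt_cache[pc] = f"ucode@{pc:#x}"
            self.optimizer_runs += 1
        return "hw_decoder"


model = DcoModel()
paths = [model.execute_region(0x1000) for _ in range(10)]
print(paths.count("hw_decoder"), "decoded,", paths.count("optimized"), "from cache")
```

In this toy run the optimizer is invoked once, and every later execution of the region is served from the cache, which is where the amortization in the slide title comes from.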
DENVER PERFORMANCE
[Bar chart: performance relative to Tegra K1 32 (y-axis 0% to 300%). Benchmarks: DMIPS, SPECInt 2K, SPECFP 2K, AnTuTu 4, Geekbench 3 Single-Core, Google Octane v2.0, 16MB Memcpy (GB/s), 16MB Memset (GB/s), 16MB Memread (GB/s). Processors compared: Baytrail (Celeron N2910), Krait-400 (8974-AA), iPhone 5s (A7 Cyclone), Haswell (Celeron 2955U), Tegra K1 64.]
DCO: AN EXAMPLE SPECINT – CRAFTY EXECUTION (full benchmark run)
[Chart: % Exec Types, a profile of execution. Legend: Optimized ucode execution, HW Decoder execution, Dynamic Code Optimizer.]
[Chart panel added: New ARM Code, a profile of new ARM instructions.]
[Chart panel added: IPC, a profile of Instructions Per Cycle.]
DCO: AN EXAMPLE SPECINT – CRAFTY EXECUTION (3% of benchmark run)
[Chart panels: % Exec Types, New ARM Code, IPC.]
CONCLUSION
Dynamic Code Optimization is the architecture of the future
Breaks the out-of-order window physical limitation
Opens synergy between HW and SW that current architectures lack
Improves efficiency by optimizing once and using many times
Delivering PC-class performance to mobile form factors
Enables PC-class gaming experience
Enables true enterprise applications
Enables content creation
ACKNOWLEDGMENT
We would like to thank the CPU team at NVIDIA for all the creativity, hard work, and dedication in bringing this vision to reality.