Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was...
Transcript of Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was...
![Page 1: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/1.jpg)
Alex Ramirez, NVIDIA Research
Embedded Supercomputing at NVIDIA
![Page 2: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/2.jpg)
2
The project Build HPC using embedded commodity technology
Tibidabo Tegra 2
KAYLA Tegra 3 +
Quadro GPU
Pedraforca Tegra 3 + Tesla K20
Mont-Blanc Exynos 5250
![Page 3: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/3.jpg)
3
Tegra enabled ARM Multicore testing But the embedded GPU was not programmable …
Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …
Tegra 3 4x ARM Cortex-A9 VFP for 64-bit FP 3x faster GPU …
Tegra 4 4x ARM Cortex-A15
72-core embedded GPU …
![Page 4: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/4.jpg)
4
Samsung Exynos 5250
2x ARM Cortex-A15 @ 1.7 GHz
4-core ARM Mali T604
OpenCL 1.1
Dual-channel DDR3
USB 3.0 to 1 GbE bridge
First embedded + programmable GPU
![Page 5: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/5.jpg)
5
NVIDIA Tegra K1
4x ARM Cortex-A15 @ 2.3 GHz
32KB + 32KB L1 cache
2 MB L2 cache
192-core Kepler GPU
CUDA 6
The first embedded CUDA GPU
![Page 6: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/6.jpg)
6
Jetson DevKit Clusters @ SC’14 Lead the way … and they will follow
![Page 7: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/7.jpg)
7
NVIDIA Tegra K1-64 (Denver)
2x NVIDIA Denver CPU @ 2.5 GHz
64-bit ARM v8
128KB + 64KB L1 cache
2 MB L2 cache
192-core Kepler GPU
CUDA 6
Pin compatible with Tegra K1
64-bit ARM CPU
![Page 8: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/8.jpg)
8
NVIDIA Denver CPU
7-wide superscalar
2x FP pipelines
Dynamic code optimization
OOO performance with in-order execution
Aggressive HW prefetcher
Highest performance ARMv8 Mobile CPU
![Page 9: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/9.jpg)
9
NVIDIA Tegra X1
4x ARM Cortex A57
48KB + 32KB L1 cache
2 MB L2 cache
4x ARM Cortex A53
32KB + 32KB L1 cache
512KB L1 cache
256-core Maxwell GPU
Embedded GPU overhaul
![Page 10: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/10.jpg)
10
Drive-PX
2x Tegra X1
4x ARM Cortex-A57
4x ARM Cortex-A53
256-core Maxwell GPU
2.3 TFLOPS (32-bit FP)
1 GbE cluster interconnect
Embedded supercomputer for autonomous driving
![Page 11: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/11.jpg)
11
Drive-PX2
2x Tegra X2
4x ARM Cortex-A57
2x NVIDIA Denver2
2x 3840-core Pascal GPU
8 TFLOPS (64-bit FP)
24 TFLOPS (16-bit FP)
1 GbE cluster interconnect
Discrete GPU for embedded deep learning capabilities
![Page 12: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/12.jpg)
12
This is what 8 TFLOPS looked like in 2001
ASCI White, LLNC, CA
![Page 13: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/13.jpg)
13
This is what 8 TFLOPS look like in 2016 Drive-PX2, under the hood of your next self-driving car
![Page 14: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/14.jpg)
14
This is what 40 TFLOPS looked like in 2004 MareNostrum, Barcelona Supercomputing Center, Spain
![Page 15: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/15.jpg)
15
This is what 40 TFLOPS look like in 2016 DGX-1 Deep Learning System
![Page 16: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/16.jpg)
16
NVlink in NVIDIA’s NUMA link
NVlink across Tegra SoCs NVlink across Tegra and GPU
For GPU’s only?
Tegra
XYZ
Tegra
XYZ
Tegra
XYZ
Tegra
XYZ
Tegra
XYZ
Tesla
Ω
HBM
HBM
HBM
HBM
Tegra
XYZ
Tesla
Ω
HBM
HBM
HBM
HBM
* Diagram does not imply any current or planned NVIDIA products
![Page 17: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/17.jpg)
17
The self-hosted GPU accelerator ?
NVlink across Tegra and GPU Self-hosted GPU package with heterogeneous memory
DDR + HBM
Why integrate the CPU with the GPU?
Tegra
XYZ
Tesla
Ω
HBM
HBM
HBM
HBM
Tegra
XYZ
Tesla
Ω
HBM
HBM
HBM
HBM
* Diagram does not imply any current or planned NVIDIA products
![Page 18: Embedded Supercomputing at NVIDIA · Tegra enabled ARM Multicore testing But the embedded GPU was not programmable … Tegra 2 2x ARM Cortex-A9 VFP for 64-bit FP Low-power GPU …](https://reader030.fdocuments.us/reader030/viewer/2022040919/5e966e89ffbddf4515798e66/html5/thumbnails/18.jpg)
18
Conlusions
Embedded SoCs are gaining HPC compute capabilities
The benefits of SoC customization have not yet been exploited
Scaling performance across multiple SoC is still challenging …
… but connecting many of the is getting easier
Heterogeneous IP blocks across heterogeneous SoCs
Heterogeneous memories
Homogeneous programming model !?
Unleash the genie …
Phenomenal cosmic powers … Itty bitty living space