Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 •...
Transcript of Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 •...
![Page 1: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/1.jpg)
Full-Stack Optimizations for
Next-Generation Deep-Learning Accelerators
Sophia [email protected]
Electrical Engineering and Computer Sciences
![Page 2: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/2.jpg)
Growing
Demand in
Computing
OpenAI2
![Page 3: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/3.jpg)
Slowing Supply in ComputingAMD, HotChips, 2019
3
![Page 4: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/4.jpg)
4
Slowing
Supply in
Computing
Growing
Demand in
Computing
![Page 5: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/5.jpg)
5
Slowing
Supply in
Computing
Growing
Demand in
Computing
![Page 6: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/6.jpg)
6
Slowing
Supply in
Computing
Growing
Demand in
Computing
Domain-Specific
Accelerators
![Page 7: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/7.jpg)
Domain-Specific Accelerators
7
• Customized hardware
designed for a domain of
applications.
Apple M1 Chip
2020
CPU
CPU
GPU
* AnandTech
Neural
Engine
Domain-Specific
Accelerators
![Page 8: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/8.jpg)
Full-Stack Optimization for DL Accelerators
• MAGNet [ICCAD’2019]• Simba [MICRO’2019, VLSI’2019]
Design of
Accelerators
• Chipyard [IEEE Micro’2020]• Gemmini [DAC’2021]
Integration of
Accelerators
• CoSA [ISCA’2021]Scheduling of
Accelerators
8
![Page 9: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/9.jpg)
Full-Stack
Optimization
for DL
Accelerators
Design of Accelerators
Integration of Accelerators
Scheduling of Accelerators
9
![Page 10: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/10.jpg)
Scalable Inference Accelerators
10
• Need for fast and efficient inference accelerators from mobile to datacenter.
Motivation
• High design cost of building unique hardware for each design target.
Challenge
• Deep learning inference is intrinsically scalable with abundant parallelism.
• Recent advances in package-level integration for multi-chip-module-based designs.
Opportunities
![Page 11: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/11.jpg)
The Multi-Chip-Module Approach
11
• Advantages:• Build systems larger than reticle limit
• Smaller chips are cheaper to design
• Smaller chips have higher yield
• Faster time-to-market
• Challenges:• Area, energy, and latency for chip-to-
chip communication
Ref: Zimmer et al., VLSI 2019
![Page 12: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/12.jpg)
Simba: Scaling Inference with MCM-based Architecture
12
Simba Testchip:
Simba Characterization:
Simba NoP-Aware Tiling:
• Package and chiplet architecture
• Processing element design
• Baseline uniform tiling across chiplets and PEs
• Comparison with GPUs
• NoP bandwidth sensitivity
• NoP latency sensitivity
• Non-uniform work partitioning
• Communication-aware data placement
• Cross-layer pipelining
Best Paper Award at MICRO’2019, CACM Research Highlights
![Page 13: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/13.jpg)
Simba: Scalable MCM-Based Architecture
13
Core area 111.6 mm2
Voltage 0.52-1.1 V
Frequency 0.48-1.8 GHz
SRAM624KB/chip
23MB/package
Package and chiplet spec
6mm^2 chiplet in TSMC 16nm
36 chiplets/package
Chip-to-chip interconnect
Ground-Referenced Signaling
Efficient compute tiles
128 TOPS0.11 pJ/Op
8-bit integer datapathRef: Zimmer et al., VLSI 2019
![Page 14: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/14.jpg)
Simba Characterization
• Comparison with GPUs running ResNet-50
14
![Page 15: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/15.jpg)
Simba Characterization
15
• Layer Sensitivity
• Running three ResNet-50 layers across different number of chiplets.
• Increasing the number of active chiplets does not always translate to performance gains.
• The cost of communication hinders the
ability to exploit parallelism.
![Page 16: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/16.jpg)
Design of Accelerators
Integration of Accelerators
Scheduling of Accelerators
16
Full-Stack
Optimization
for DL
Accelerators
![Page 17: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/17.jpg)
Accelerators don’t exist in isolation.
17
http://vlsiarch.eecs.harvard.edu/research/accelerators/die-photo-
analysis/
Maltiel consulting estimates
Shao et al. IEEE Micro 2015
![Page 18: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/18.jpg)
Mobile SoC Usecase
• Mainstream architecture has long focused on general-purpose CPUs and GPUs.
• In an SoC, multiple IP blocks are active at the same time and communicate frequently with each other.
• Example:• Recording a 4K video
• Camera -> ISP• “Preview stream” for display• “Video stream” for storage
• DRAM for data sharing
18
Two Billion Devices and Counting: An Industry Perspective on the State of Mobile Computer Architecture, IEEE Micro’2018
![Page 19: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/19.jpg)
SoC Framework
19
https://github.com/ucb-bar/chipyard
[IEEE Micro’2020]
• Integrated design, simulation and implementation
environment for specialized SoCs.
![Page 20: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/20.jpg)
Gemmini: Full-System Co-Design of Hardware Accelerators
• Full-stack
• Includes OS
• End-to-end workloads
• “Multi-level” API
• Full-SoC
• Host CPUs
• Shared memory hierarchies
• Virtual address translation
20
https://github.com/ucb-bar/gemmini
[DAC’2021]
![Page 21: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/21.jpg)
Gemmini Case Study: Allocating on-chip SRAM
•Where to allocated SRAM?
• Private within each IP
• Shared
21
https://github.com/ucb-bar/gemmini
[DAC’2021]
SHARED
IP0
IP1
IP2
![Page 22: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/22.jpg)
Gemmini Case Study: Allocating on-chip SRAM
•Where to allocated SRAM?
• Private within each IP
• Shared
22
https://github.com/ucb-bar/gemmini
[DAC’2021]
• Application dependent.
Single-Core SoC
• SoC configuration dependent.
Dual-Core SoC
SHARED
IP0
IP1
IP2
![Page 23: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/23.jpg)
Design of Accelerators
Integration of Accelerators
Scheduling of Accelerators
23
Full-Stack
Optimization
for DL
Accelerators
![Page 24: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/24.jpg)
Large Space of Mapping Algorithms to ML Hardware
24
Scheduling
Algorithm Hardware
[ISCA’2021]
![Page 25: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/25.jpg)
CoSA: Constrained-Optimization for Spatial Architecture
25 [ISCA’2021]
![Page 26: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/26.jpg)
CoSA: Constrained-Optimization for Spatial Architecture
26
2.5x speedup
compared to SoTA
with 90x faster
time-to-solution.
[ISCA’2021]
![Page 27: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/27.jpg)
Acknowledgement
• Thanks collaborators from UC Berkeley and NVIDIA!
27
Seah KimJenny HuangHasan Genc
![Page 28: Full-Stack Optimizations for Next-Generation Deep-Learning ...Domain-Specific Accelerators 7 • Customized hardware designed for a domain of applications. Apple M1 Chip 2020 CPU CPU](https://reader035.fdocuments.us/reader035/viewer/2022071610/614a08a512c9616cbc6926c1/html5/thumbnails/28.jpg)
Design of Accelerators
Integration of Accelerators
Scheduling of Accelerators
28
Full-Stack
Optimization
for DL
Accelerators