Paul Whatmough
Arm ML Research Lab
Harvard SoC Designs
…how to win friends and influence people with successful Arm-based SoC tape outs
Why build test chips in academia?
"If I just do simulation I can write twice as many papers…"
Circuits research
• Measured test chip essential for tier-1 publication
• Often need to characterize devices and passives
Architecture research
• Understand the whole stack, soup to nuts
• Know you are solving a real problem
• Add real impact to your work; not just another academic paper
All models are wrong, some are useful
• Many things are hard to simulate convincingly
• Data to build models to design better circuits
Training for researchers
• Build a deep understanding of real computers
• Problem solving, teamwork, time management
• Extremely valuable depth of experience
Harvard / Arm collaboration on ML hardware
Harvard VLSI-Architecture group
• vlsiarch.eecs.harvard.edu
Harvard NLP group
• nlp.seas.harvard.edu
[Diagram: collaboration spanning algorithms, VLSI circuits, and computer architecture, plus the RoboBees project]
David Brooks, Gu-Yeon Wei (Harvard VLSI-Architecture), Paul Whatmough (Arm), Sasha Rush (Harvard NLP)
Deep learning hardware for IoT
Next-generation UI/UX
• Small form-factor without a big screen or input device
• Speech detection/recognition/synthesis, gaze detection, biometric authentication (e.g. face detection)
• Personalize interface and predict user decision-making
Efficient DNN inference hardware on embedded devices
• Privacy, latency and energy issues with transmitting data
• Efficiency to handle large compute and storage demand
• But, still care about programmer efficiency and flexibility
Hardware architecture specialization
What is the gap in energy efficiency and flexibility?
• Measured results comparing energy efficiency across compute sub-systems
How can we allow hardware accelerators and CPUs to share data efficiently?
• Demonstrate lightweight L2$ coherent data sharing via accelerator coherency port (ACP)
[Diagram: flexibility vs. efficiency spectrum: CPU → GPU → NPU → single-function hardware]
Recent tape outs at Harvard
SM2 – 28nm DNN ENGINE
Programmable DNN Classifier for IoT
• Parallelism/reuse: 8-way SIMD, 10X data reuse @ 128b/cycle BW
• Small data-types: 8-bit weights, –30% energy
• Sparse activation data: +4X throughput and –4X energy
• Algorithmic resilience: +50% throughput or –30% energy
[Whatmough et al., ISSCC ’17, JSSC ’18]
[Block diagram: sparse FC-DNN pipeline: AGU + node SRAM, node list / sparse index, activation SRAM, operand load, weight SRAM, 8-way fixed-point SIMD MAC unit, bias + ReLU activation, with interlock/stall control]
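The sparsity mechanism is easiest to see in software. Below is a minimal sketch (a hypothetical Python model, not the RTL): the pipeline walks a node list of nonzero activation indices and issues MACs only for those inputs, so ReLU-induced zeros translate directly into skipped work.

```python
import numpy as np

def sparse_fc_layer(x, W, b):
    """Fully-connected layer that skips zero activations.

    Hypothetical model of the slide's sparsity feature: only nonzero
    inputs trigger weight fetches and MACs, so an input vector with
    ~75% zeros does roughly 4x less work (and uses 4x less energy).
    """
    node_list = np.flatnonzero(x)        # sparse index of active nodes
    acc = b.astype(np.int32).copy()      # 32-bit accumulators
    for i in node_list:                  # zero activations cost nothing
        acc += W[:, i].astype(np.int32) * int(x[i])  # 8-bit weight MACs
    return np.maximum(acc, 0)            # ReLU

# Toy usage: 6 of 8 inputs are zero, so only 2 weight columns are touched
x = np.array([0, 3, 0, 0, 0, 0, 7, 0], dtype=np.int8)
W = np.random.randint(-128, 127, size=(4, 8), dtype=np.int8)
b = np.zeros(4, dtype=np.int32)
print(sparse_fc_layer(x, W, b))
```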
SM2 – System architecture
[Block diagram: 28nm SoC test chip: a Cortex-M0 subsystem (32b AHB/APB: I-MEM 64KB, D-MEM 64KB, UARTs, GPIO, timers, watchdog, BIST, RTC, SYSCTL) bridged to an accelerator subsystem (128b AXI: DNN Engine with 1MB W-MEM); USB, scan, and DCO interfaces; separate supply domains for the SoC (VSOC), DCO (VDCO), accelerator logic (VACC), SRAM periphery (VMEMP), and SRAM core (VMEMC); asynchronous FCLK/HCLK clock domains]
[Whatmough et al., ISSCC ’17, JSSC ’18]
SM3 – 16nm DNN ENGINE v2
Test Chip Summary
Technology: 16nm FinFET
On-chip Model Size: 1 MB
Layer Support: FC-DNN
Voltage Range: 0.4 – 1 V
Frequency Range: 57 – 1360 MHz
Peak Eff. (16-bit): 750 GOPS/W
MNIST Energy / Accuracy: 151 nJ / 98.51%
Resilience Features: Razor + adaptive clock
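For intuition, the headline efficiency converts directly to energy per operation. The arithmetic below is mine, not from the paper, and the ops-per-inference figure is only an upper bound, since a real inference will not run at peak efficiency.

```python
# My arithmetic, not from the paper: relate GOPS/W to pJ/op.
peak_gops_per_w = 750                          # test-chip summary above
pj_per_op = 1e12 / (peak_gops_per_w * 1e9)     # W = J/s, so GOPS/W = ops/J
print(f"{pj_per_op:.2f} pJ/op")                # -> 1.33 pJ/op

# If the whole 151 nJ MNIST inference ran at peak efficiency, it would
# buy ~113k operations; the real figure is lower, so treat as a bound.
print(f"{151e-9 * peak_gops_per_w * 1e9 / 1e3:.0f}k ops (upper bound)")
```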
[Die photo: 2.5 mm × 2.5 mm: weight SRAMs, accelerator, Arm M0, DCO, noise generator]
[Chart: energy-per-inference comparison across an analog SNN (65nm), digital SNN (28nm), digital DNN (28nm), and digital DNN (16nm): values 2.44, 1.81, 1.01, 0.75, with a 2.38X improvement highlighted]
[Whatmough et al., Hot Chips ’17; Lee et al., ESSCIRC ’18 and JSSC ’19]
SM3 – IoT application demos
Labeled Faces in the Wild (FACE)
Keyword Spotting (KWS)
Human Activity Recognition (HAR)
• Opportunity (HAR1): Detect 18 gestures w/ 7 sensors
• Smartphone (HAR2): Detect basic activities w/ smartphone inertial measurement unit (IMU)
• PAMAP2: Detect activities of daily living w/ 4 sensors
• Daphnet Freezing of Gait: Detect freezing incidents for Parkinson’s patients
Handwritten digit classification (MNIST)
[Kodali et al., ICCD’17]
SMIV – 16nm SoC to evaluate ML hardware
[Block diagram: SMIV 16nm SoC: Arm Cortex-A53 64-bit dual-core CPU cluster with 2MB L2; cache-coherent datapath accelerators (ACC0–ACC3) plus an FC engine with SRAM, attached via the Accelerator Coherency Port (ACP); 2x2 EFLX4K eFPGA; 4 banks x 1MB SRAM; NIC-400 128-bit and 64-bit interconnects; an always-on (AON) 32-bit Cortex-M0 cluster on a 32-bit AHB interconnect with peripherals; ThinLink IO bridge out to DRAM and PCIe]
[Whatmough et al., Hotchips ’18, VLSI ’19]
SMIV – GEMM performance across accelerators
Dual-Core CPU (A53)
• C/C++ or BLAS library
• 18.9 GOPS/W (1x)
Dual-Core CPU with SIMD enabled (A53)
• Assembly or BLAS library
• 58.7 GOPS/W (3.1x)
4-Tile Embedded FPGA (eFPGA)
• Verilog/VHDL/HLS, reconfigurable
• 312 GOPS/W (16.5x)
Quad-Core Datapath Accelerator (CCA)
• Verilog/VHDL/HLS, fixed functionality
• >1 TOPS/W (54.9x)
[Chart: measured GEMM efficiency vs. programmability: 18.9 GOPS/W (software programmable) → 58.7 GOPS/W → 312.4 GOPS/W (reconfigurable) → 1.04 TOPS/W (fixed)]
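The multipliers in parentheses are simply each design point's measured efficiency divided by the plain-C A53 baseline. A quick check (assuming 1.04 TOPS/W for the CCA point):

```python
# Sanity-check the relative-efficiency factors against the A53 baseline.
baseline = 18.9                    # GOPS/W, dual-core A53, plain C/C++
measured = {
    "CPU + SIMD":      58.7,       # GOPS/W
    "eFPGA (4 tiles)": 312.4,
    "CCA (4 cores)":   1040.0,     # 1.04 TOPS/W
}
for name, gops_per_w in measured.items():
    print(f"{name}: {gops_per_w / baseline:.1f}x")
# -> 3.1x, 16.5x, 55.0x (the slide quotes 54.9x for the last point,
#    presumably computed from an unrounded efficiency value)
```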
[Whatmough et al., Hotchips ’18, VLSI ’19]
SMIV – Gem5-Aladdin model
• Leverages the gem5-Aladdin framework
• Compares different IO interfaces between the CPU and accelerator across a variety of DNN workloads running on an NVDLA-like accelerator
• Shows that ACP, with a direct interface to the CPU’s L2 cache, can improve energy-delay product over using DMA (see the sketch below)
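To illustrate the metric being improved, here is a toy energy-delay-product (EDP) comparison; all numbers are hypothetical placeholders, not measurements from the study.

```python
# Toy energy-delay-product (EDP) comparison. All numbers below are
# hypothetical placeholders, not measurements from the study.
def edp(energy_joules, delay_seconds):
    return energy_joules * delay_seconds

# DMA pays for software cache flushes/invalidates around each transfer;
# ACP reads directly through the CPU's L2, avoiding that overhead.
dma_edp = edp(energy_joules=1.0e-3, delay_seconds=10e-3)
acp_edp = edp(energy_joules=0.9e-3, delay_seconds=8e-3)
print(f"EDP improvement: {dma_edp / acp_edp:.2f}x")  # -> 1.39x here
```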
• Open questions remain around multiple accelerators and other system-level issues

Excerpt from the paper ([Xi et al., in review]):
…accelerators that want to access coherent memory actually also want a local cache.
C. SoC-Accelerator Interfacing
There have been a few publications over the years that investigated SoC-accelerator interfacing. Sadri et al. performed an early investigation into the performance of the ARM Accelerator Coherency Port on the Xilinx Zynq platform and found that when the CPU and accelerator communicate over ACP, they can cooperatively perform an FIR filter on images about 20% faster and consume 20% less energy than by using DMA [37]. However, this study only looked at simple filters on images, and the physical platform limited the ACP bandwidth to 2GB/s, which is far too low for contemporary DNN workloads. Moreau et al. also used the ACP interface when building SNNAP because of its lower round-trip latency [38], but they also used the Zynq platform and would be equally bandwidth-limited. Recent work by Shao et al. found that software-based cache-coherency management for DMA interfaces contributes a significant amount of overhead to the end-to-end runtime of an accelerated workload, and the so-called serial data arrival effect limits how much speedup can be gained from additional datapath parallelism [22]. However, their cache explorations also assume the presence of a private accelerator cache.
IV. EXPERIMENTAL INFRASTRUCTURE
In this section, we describe the SoC that forms the basis for our studies and our experimental methodology: the accelerator architecture, the networks that are studied, the software that implements them, and the performance and energy modeling infrastructure.
A. Baseline System Architecture
Figure 1 shows the baseline SoC used in this paper. The microarchitectural parameters of the SoC components are listed in Table II. The SoC consists of an out-of-order CPU cluster with 64KB of private L1 instruction and data caches and 2MB of shared L2 last-level cache. The main system interconnect is a 256-bit full-duplex coherent crossbar. The CPUs are clocked at 2.5GHz, while the system bus and LLC run at half the processor speed, providing up to 40GB/s of interconnect bandwidth. The system is backed by 4GB of LP-DDR4 DRAM running at 1600MHz, providing up to 25.6GB/s of bandwidth.
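Both bandwidth figures follow from width times clock. A quick check (the 16-bit LPDDR4 channel width is my assumption; the text only states four channels):

```python
# Interconnect: 256-bit crossbar at half the 2.5 GHz CPU clock.
bus_bytes_per_cycle = 256 // 8          # 32 B per cycle
bus_ghz = 1.25
print(bus_bytes_per_cycle * bus_ghz)    # -> 40.0 GB/s

# DRAM: LPDDR4 at 1600 MHz is double data rate, so 3200 MT/s per pin.
# Assuming x16 channels (my assumption; the text only says 4 channels):
mt_per_s, bits_per_channel, channels = 3200e6, 16, 4
print(mt_per_s * bits_per_channel / 8 * channels / 1e9)  # -> 25.6 GB/s
```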
This paper is not about designing a new accelerator, but rather about extracting more performance per watt from an existing design without changing its core datapath. In this paper, we use an accelerator design inspired by the Nvidia Deep Learning Accelerator (NVDLA) [39]. We choose this design because it is an open-source project backed by a major hardware vendor. As shown on the left side of Figure 1, the design is based around a convolution accelerator consisting of 8 PEs.
[Fig. 1: The baseline SoC used in this paper: CPU0/CPU1 with private L1 caches and a shared 2MB L2 on the coherent system bus, LPDDR4 memory controller with 4GB LPDDR4; the accelerator (8 PEs with MACC arrays, banked input/weight/output scratchpads B0–B7, and control logic) attaches via DMA, ACP, and an IO interface]
TABLE II: SoC microarchitectural parameters
Component  | Parameters
CPU Core   | Out-of-order x86 @ 2.5GHz, 8-µop issue width, 192-entry ROB
L1 Cache   | 64KB i-cache & d-cache, 4-way associative, 64B cacheline, LRU, 2-cycle access latency
L2 Cache   | 2MB, 16-way, LRU, 20-cycle access latency
DRAM       | LP-DDR4 @ 1600MHz, 4GB, 4 channels, 25.6GB/s
System Bus | 256-bit fully-coherent crossbar @ 1.25GHz, 40GB/s
Each PE operates on a different output feature map and contains a 32-way MACC unit for spatial reduction along the channel dimension. The dataflow is described in Figure 2. In total, there are 256 multipliers. Inputs and weights are 16-bit fixed point; outputs are accumulated in 32-bit fixed point and reduced to 16-bit before being written to the scratchpad. In the emerging vernacular commonly used to describe DNN dataflows, this dataflow is L0 weight-stationary (weights are reused at the register level within a MACC array) and L1 input/output-stationary (on every cycle, inputs are reread from the SRAM and outputs are accumulated in place in the SRAM). It is backed by three distinct SRAMs, one each for inputs, weights, and outputs. We only model the core datapath and dataflow of NVDLA; other specific features like its convolution buffer are not modeled.
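A loop-nest sketch of this dataflow may help; it is my reconstruction from the text above, not NVDLA source code. Each PE owns one output feature map, its 32-wide MACC reduces across input channels, weights sit in PE registers (L0 weight-stationary), and partial sums accumulate in place in the output scratchpad (L1 output-stationary).

```python
import numpy as np

def conv_dataflow(inp, weights, out):
    """Loop-nest model of the described dataflow (my reconstruction).

    inp:     [C][H][W]    input scratchpad
    weights: [K][C][R][S] weight scratchpad
    out:     [K][H'][W']  output scratchpad, accumulated in place
    8 PEs x 32-way MACC = 256 multipliers; PEs and MACC lanes
    appear here as ordinary loops/vector ops.
    """
    K, C, R, S = weights.shape
    _, H, W = inp.shape
    for k in range(K):                        # each PE owns one output map
        for r in range(R):
            for s in range(S):
                w_vec = weights[k, :, r, s]   # held in PE registers (L0 WS)
                for y in range(H - R + 1):
                    for x in range(W - S + 1):
                        # 32-way reduction over channels; partial sum is
                        # read-modify-written in the output SRAM (L1 OS)
                        out[k, y, x] += np.dot(w_vec, inp[:, y + r, x + s])

# Toy check: 3-channel 5x5 input, two 3x3 kernels -> 2x3x3 output
inp = np.random.randn(3, 5, 5)
wts = np.random.randn(2, 3, 3, 3)
out = np.zeros((2, 3, 3))
conv_dataflow(inp, wts, out)
```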
In addition to convolution, our accelerator also supports inner products by mapping them onto the convolution hardware. To reduce memory bandwidth requirements, the accelerator supports reading sparse compressed fully-connected weights, but they are first decompressed into the internal scratchpads before running the inner product. The compression format is based on the CSC format described by Han et al. [40]. Weight compression for convolutional layers is not supported, since much research in the machine learning and architecture communities has observed that they are harder to prune than fully-connected layers [23], [40]. Max and average pooling and batch normalization are natively supported, as are a range of activation functions.
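For concreteness, below is a sketch of the decompression step in the spirit of the Han et al. [40] scheme (relative row offsets per column); the chip's exact bit-level encoding may differ.

```python
import numpy as np

def decompress_csc(values, rel_rows, col_ptr, shape):
    """Expand CSC-compressed FC weights into a dense scratchpad image.

    In the spirit of Han et al. [40]: per column, each nonzero is stored
    with a row offset relative to the previous nonzero (the first offset
    is absolute). The chip's actual encoding may differ from this sketch.
    """
    rows, cols = shape
    dense = np.zeros(shape, dtype=np.int16)
    for c in range(cols):
        row = 0
        for i in range(col_ptr[c], col_ptr[c + 1]):
            row += rel_rows[i]          # cumulative row index
            dense[row, c] = values[i]
    return dense

# Toy column: nonzeros at rows 1 and 4 of a 6x1 matrix
vals = np.array([7, -3], dtype=np.int16)
rel  = np.array([1, 3])                 # 0+1=1, then 1+3=4
ptr  = np.array([0, 2])
print(decompress_csc(vals, rel, ptr, (6, 1)).ravel())  # [ 0 7 0 0 -3 0]
```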
The accelerator runs at 1GHz. It is attached to the system bus by a 256-bit bus for DMA operations. To support delegated coherency, we also add a special port called the Accelerator Coherency Port (ACP).
[Xi et al., in review]
CHIPKIT: Tutorial on Agile Research Test Chips @MICRO’19
microarch.org/micro52/program/workshops.html#chipkit
CHIPKIT: Tutorial on Agile Research Test Chips @MICRO’19
Outline
• Research test chips: fabrication routes, process technologies, project planning
• Test chip architectures: CPUs, peripherals, memories, interconnects and frameworks
• Design methodologies for custom blocks: Verilog, SystemVerilog, HLS and beyond
• Physical design flow: linting, synthesis, place and route, DRC/LVS, timing closure
• Bring up and test: packaging, PCBs, clocking, testing flows
Supporting materials
• CHIPKIT – a collection of generators and RTL glue for rapid SoC design
• SM2 – a simple Arm Cortex-M0 based scaffold for rapid (and functional!) test chips
Arm research enablement offerings
SoC HW/SW co-development with DesignStart
• DesignStart Eval - Cortex-M0/M3 based systems, evaluation with obfuscated RTL
• DesignStart Pro Academic - Cortex-M0/M3 based systems, RTL for SoC design
Compute systems modelling and architecture exploration
• gem5 - CPU and system modelling
IP Building blocks*
• Design IP – CPUs, Interconnects, peripherals
• Physical IP – Standard cells, Memory compilers, POP IP
www.arm.com/resources/research/enablement
*Any logic Arm IP that is not part of DesignStart will be provided on a case-by-case basis, depending on the research project scope, objectives, and alignment with the Arm research agenda
Acknowledgments
Harvard University
• Faculty: Gu-Yeon Wei, David Brooks, Sasha Rush
• Many talented PhD students and post-docs
• Generous sponsors
People, papers, software
• vlsiarch.eecs.harvard.edu
• nlp.seas.harvard.edu
P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks and G.-Y. Wei, IEEE Int. Solid-State Circuits Conf. (ISSCC), 2017
P. N. Whatmough, S. K. Lee, D. Brooks and G.-Y. Wei, IEEE Journal of Solid-State Circuits (JSSC), 2018
S. K. Lee, P. N. Whatmough, N. Mulholland, P. Hansen, D. Brooks and G.-Y. Wei, IEEE European Solid-State Circuits Conf. (ESSCIRC), 2018
S. K. Lee, P. N. Whatmough, D. Brooks and G.-Y. Wei, IEEE Journal of Solid-State Circuits (JSSC), 2019
P. N. Whatmough, S. K. Lee, M. Donato, H.-C. Hsueh, S. L. Xi, U. Gupta, L. Pentecost, G. Ko, D. Brooks and G.-Y. Wei, Symposium on VLSI Circuits (VLSI), 2019
Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다! धन्यवाद!