Mixed-Signal Techniques for Embedded
Machine Learning Systems
Boris Murmann
June 16, 2019
Applications

Cloud vs. edge trade-offs:
- Speed of response
- Bandwidth utilized
- Privacy
- Power consumed

Conversational interfaces: …natural? …seamless? …real-time?
Task Complexity, Memory and Classification Energy

[Chart: classification energy per task, spanning pJ to J —
Face Detection (SVM) → Digit Recognition → Image Classification,
10 classes (Binary CNN) → Image Classification, 1000 classes
(MobileNet v1) → Self-Driving (ResNet-50 HD)]
[Same chart as the previous slide, annotated: "My main interest"]
Edge Inference System

[Block diagram: Sensors → A/D → Wake-up Detector → Pre-Processing →
A/D → Deep Neural Network. Memory/compute hierarchy: DRAM → SRAM
Buffer → PE Array → Register File → Compute (MAC)]
Opportunities for Analog/Mixed-Signal Design

[Same block diagram, with the mixed-signal opportunities highlighted:
Analog Feature Extraction at the sensor interface, Analog MAC in the
compute array, and In-Memory Computing in the memory hierarchy]
Outline

Data-Compressive Imager for Object Detection
► Omid-Zohoor & Young, TCSVT 2018 & ISSCC 2019
Mixed-Signal ConvNet
► Bankman, ISSCC 2018 & JSSC 2019
RRAM-based ConvNet with In-Memory Compute
► Ongoing work
Wake-Up Detector with Hand-Crafted Features

[Signal chain: Sensor Signal → Feature Extractor → A/D → Simple
Classifier → Label, with a Training Algorithm fed by Labeled Training
Data. The feature extractor turns high-dimensional data (the "data
deluge") into a low-dimensional representation]
Analog Feature Extractor

Moving feature extraction in front of the converter gives:
- Low-rate and/or low-resolution ADC
- Low data rate digital I/O
- Reduced memory requirements

[Signal chain: Sensor Signal → analog Feature Extractor → A/D →
Simple Classifier → Label, trained as before; only the
low-dimensional representation is digitized]
Histogram of Oriented Gradients
Analog Gradient Computation

[Pixel patch P with horizontal neighbors H1, H2 and vertical
neighbors V1, V2]

Linear gradients:
G_H = H1 − H2 (horizontal), G_V = V1 − V2 (vertical)

Bright patch: G_H = 400 mV − 100 mV = 300 mV
Dark patch: G_H = ¼ · 400 mV − ¼ · 100 mV = 75 mV
High Dynamic Range Images

Gradient magnitude varies significantly across the image, so
digitizing the computed gradients would require high-resolution
ADCs (≥ 9b).
► But we want to produce as little data as possible
Ratio-Based ("Log") Gradients

G_H = H1 / H2, G_V = V1 / V2

Illumination invariant: scaling both pixels by the illumination I
cancels in the ratio,
G_H = (I × H1) / (I × H2) = H1 / H2
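The invariance can be checked numerically. A minimal sketch (not from the talk), using the bright/dark-patch numbers with pixel values in millivolts and the darker patch modeled as ¼ the illumination:

```python
# Linear vs. ratio ("log") gradients under an illumination change.
# Pixel values in mV; the 1/4 factor models the darker patch.

def linear_gradient(h1, h2):
    return h1 - h2      # scales with illumination

def ratio_gradient(h1, h2):
    return h1 / h2      # (I*H1)/(I*H2) = H1/H2: illumination cancels

h1, h2 = 400, 100       # bright patch, mV
i = 0.25                # illumination scale for the dark patch

print(linear_gradient(h1, h2))          # 300 mV
print(linear_gradient(i * h1, i * h2))  # 75.0 mV: 4x smaller
print(ratio_gradient(h1, h2))           # 4.0
print(ratio_gradient(i * h1, i * h2))   # 4.0: unchanged
```

The linear gradient shrinks with the scene illumination, while the ratio stays fixed, which is what allows a coarse quantizer on the next slide.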
Ratio Quantization

Quantize each gradient ratio G_H = H1/H2, G_V = V1/V2 into three
levels (1.5b):
- Negative edge: ratio below the lower threshold (≈ 1/2)*
- No edge: ratio between the thresholds
- Positive edge: ratio above the upper threshold (≈ 2)*

*Empirically determined thresholds

Example patch: H1/H2 = 4.2 → positive edge (G_H = +1);
V1/V2 = 1.3 → no edge (G_V = 0)
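A software model of the three-level (1.5b) quantizer might look as follows; the thresholds 1/2 and 2 are assumed values, since the slide only notes that the real thresholds were determined empirically:

```python
def quantize_ratio(num, den, t_lo=0.5, t_hi=2.0):
    """Map a pixel ratio to {-1, 0, +1}: negative, none, positive edge.

    t_lo/t_hi are assumed defaults, not the chip's calibrated thresholds.
    """
    r = num / den
    if r < t_lo:
        return -1   # negative edge
    if r > t_hi:
        return +1   # positive edge
    return 0        # no edge

print(quantize_ratio(4.2, 1.0))   # 1: strong positive edge
print(quantize_ratio(1.3, 1.0))   # 0: inside the dead zone, no edge
print(quantize_ratio(1.0, 4.2))   # -1: negative edge
```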
HOG Feature Compression with 1.5b Gradients

- Quantized (1.5b) gradients
- Fewer angle bins
- Quantized histogram magnitudes
- 25× less data in the HOG features compared to the 8-bit image
Log vs. Linear Gradients

[Image sequence comparing log and linear gradients as illumination
decreases]
Prototype Chip

Row Buffers with Pixel Binning (Image Pyramid)

Ratio-to-Digital Converter (RDC)

Data-Driven Spec Derivation
G_H = (H1 + v_n) / (H2 + v_n)
Chip Summary
• 0.13 µm CIS 1P4M
• 5 µm 4T pixels
• QVGA 320(V) × 240(H)
• 229 mW @ 30 FPS

Supply Voltages
• Pixel: 2.5 V
• Analog: 1.5 V, 2.5 V
• Digital: 0.9 V
Sample Images

Results using Deformable Parts Model detection & a custom database
(PascalRAW)
Comparison to State of the Art

Information Preservation
Use Log Gradients as ConvNet Input?

Ongoing work; comparable performance using ResNet-10 (PascalRAW dataset)
Digital ConvNet Fabric

Can We Play Mixed-Signal Tricks in a ConvNet?

BinaryNet (Courbariaux et al., NIPS 2016)
- Weights and activations constrained to +1 and −1; multiplication
  becomes XNOR
- Minimizes D/A and A/D overhead
- Nice option for small/medium-size problems and mixed-signal
  exploration
Mixed-Signal Binary CNN Processor
Bankman et al., ISSCC 2018

1. Binary CNN with "CMOS-inspired" topology, engineered for minimal
   circuit-level path loading
2. Hardware architecture amortizes memory access across many
   computations, with all memory on chip (328 KB)
3. Energy-efficient switched-capacitor neuron for wide vector
   summation, replacing the digital adder tree
Original BinaryNet Topology
Zhao et al., FPGA 2017

- 88.54% accuracy on CIFAR-10
- 1.67 MB weight memory (68% in FC layers)
- 27.9 mJ/classification on an FPGA
Mixed-Signal BinaryNet Topology

Sacrificed accuracy for regularity and energy efficiency:
- 86.05% accuracy on CIFAR-10
- 328 KB weight memory
- 3.8 µJ per classification
Neuron

Naïve Sequential Computation

Weight-Stationary
[Figure: 256 units; ×2 (north/south), ×4 (neuron mux)]

Weight-Stationary and Data-Parallel
[Parallel broadcast]
Complete Architecture
Neuron Function

z = sgn( Σ_{i=0}^{1023} w_i·x_i + b )

- w_i: filter weights; x_i: image patch (2×2×256 = 1024 elements)
- b: filter bias, 9b sign-magnitude: b = (−1)^s Σ_{j=0}^{7} 2^j·b_j
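As a reference model, the neuron function can be written directly in software; the sign convention at exactly zero and the example weight/activation vectors are assumptions for illustration:

```python
def sign_magnitude_bias(s, bits):
    """9b sign-magnitude bias: sign bit s, magnitude bits b_0..b_7 (LSB first)."""
    mag = sum(b << j for j, b in enumerate(bits))
    return -mag if s else mag

def binary_neuron(w, x, b):
    """z = sgn(sum_i w_i*x_i + b) with w_i, x_i in {-1, +1}."""
    acc = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if acc >= 0 else -1   # sgn(0) = +1 assumed

w = [1, -1] * 512                  # 1024 binary weights (illustrative)
x = [1] * 1024                     # 1024 binary activations (illustrative)
b = sign_magnitude_bias(0, [1, 0, 1, 0, 0, 0, 0, 0])   # +5

print(binary_neuron(w, x, b))      # dot product is 0, bias +5 -> 1
```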
Switched-Capacitor Implementation

v_diff / V_DD = (C_u / C_tot) ( Σ_{i=0}^{1023} w_i·x_i + b ),
b = (−1)^s Σ_{j=0}^{7} 2^j·b_j

- Weights × inputs on the unit capacitors; bias & offset calibration
- Batch normalization folded into weight signs and bias
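A rough numeric sketch of the charge-sharing relation; taking C_tot as just the 1024 unit caps (parasitics ignored) and using the 0.6 V neuron supply as the reference are simplifying assumptions:

```python
C_u = 1e-15                  # 1 fF unit capacitor (from the slides)
N = 1024                     # vector length
C_tot = N * C_u              # assumption: unit caps only, no parasitics
VDD = 0.6                    # assumption: VNEU = 0.6 V as the reference

def neuron_voltage(dot_plus_bias):
    """v_diff = VDD * (C_u / C_tot) * (sum_i w_i*x_i + b)."""
    return VDD * (C_u / C_tot) * dot_plus_bias

print(neuron_voltage(128))   # 0.6 * 128/1024 = 0.075 V
print(neuron_voltage(-1024)) # full negative scale: -0.6 V
```

Each unit capacitor contributes VDD/1024 to the summation node, so the comparator only has to resolve the sign of a few-hundred-millivolt full-scale signal.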
Behavioral Simulations

Significant margin in noise, offset, and mismatch (V_FS = 460 mV):
C_u = 1 fF, v_os = 4.6 mV RMS, v_n = 460 µV RMS
"Memory-Cell-Like" Processing Element

- 1 fF metal-oxide-metal fringe capacitor
- Standard-cell-based, 42 transistors
- 24107 F² area
Die Photo

- TSMC 28nm HPL 1P8M
- 6 mm² area
- 328 KB SRAM
- 10 MHz clock

Supply Voltages
- VDD (digital logic): 0.6 V – 1.0 V
- VMEM (SRAM): 0.53 V – 1.0 V
- VNEU (neuron array): 0.6 V
- VCOMP (comparators): 0.8 V
Measured Classification Accuracy

10 chips, 180 runs each through the 10,000 CIFAR-10 test images
(VDD = 0.8 V, VMEM = 0.8 V):
- Mean accuracy µ = 86.05% (σ = 0.40%), same as the perfect digital
  model
- 3.8 µJ/classification, plus 0.43 µJ in the 1.8 V I/O
- 237 FPS at 899 µW
Comparison to Synthesized Digital

[Chart: energy @ 86.05% CIFAR-10 accuracy for three Binary CNN
processor implementations — Synthesized Digital (BinarEye, Moons et
al., CICC 2018), Hand-Designed Digital (Projected), and Mixed-Signal
(Bankman et al., ISSCC 2018)]
CIFAR-10 Energy vs. Accuracy

- Neuromorphic: [1] TrueNorth, Esser, PNAS 2016
- GPU: [2] Zhao, FPGA 2017
- FPGA: [2] Zhao, FPGA 2017; [3] Umuroglu, FPGA 2017
- MCU: [4] CMSIS-NN, Lai, arXiv 2018
- Memory-like, mixed-signal: [5] Bankman, ISSCC 2018
- BinarEye, digital: [6] Moons, CICC 2018
- In-memory, mixed-signal: [7] Jia, arXiv 2018

*Energy excludes off-chip DRAM
Limitations of Mixed-Signal BinaryNet

- Poor programmability
- Relatively limited accuracy (even on CIFAR-10) due to 1b arithmetic
- Energy advantage over customized digital is not revolutionary
  ► Same SRAM, essentially the same dataflow
- Need a more "analog" memory system to unleash larger gains
  ► In-memory computing
BinaryNet Synapse versus Resistive RAM

BinaryNet synapse (SRAM-based PE):
• 0.93 fJ per 1b-MAC in 28 nm
• 24107 F²
• Single-bit

Resistive RAM (Tsai, 2018):
• Energy: TBD
• 25 F²
• Multi-bit (?)
Matrix-Vector Multiplication with Resistive Cells

- Typically use two cells to achieve pos/neg weights (other schemes
  possible)
- Array: 1024 rows × 256 columns

RRAM Density — BinaryNet Example

- 25 F² cell @ F = 90 nm → side length s = 0.45 µm
- 2 × 1024 × 0.45 µm × 256 × 0.45 µm = 0.106 mm² per layer (2× for
  the differential cell pairs)
- 0.84 mm² for the complete 8-layer network
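The density arithmetic is easy to reproduce:

```python
import math

F_um = 0.09                       # feature size: 90 nm, in um
side = math.sqrt(25) * F_um       # 25 F^2 cell -> side s = 0.45 um

rows, cols, diff = 1024, 256, 2   # differential (2x) cell pairs
layer_mm2 = diff * rows * side * cols * side * 1e-6   # um^2 -> mm^2

print(round(side, 3))             # 0.45
print(round(layer_mm2, 3))        # 0.106 mm^2 per layer
print(round(8 * layer_mm2, 3))    # 0.849: ~0.84 mm^2 for 8 layers
```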
Ongoing Research

- What is the best architecture?
- How many levels can be stored in each cell?
- What is the most efficient readout?
- Can we cope with nonidealities using training techniques?
(Content deleted for online distribution)
VGG-7 Experiment (4.8 Million Parameters)

- Six 3×3 convolutional layers (input depths 3, 128, 128, 256, 256,
  512), followed by an 8192 → 10 fully-connected classifier
- 2-bit weights, 2-bit activations

Accuracy on CIFAR-10:
- 2b quantization only: 93%
- 2b quantization + RRAM/ADC model: 92%

Work in progress!

Energy Model for Column in Conv6 Layer

Optimal energy/MAC = 0.38 fJ
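The quoted parameter count can be sanity-checked from the layer dimensions; the conv output widths (128, 128, 256, 256, 512, 512) are an assumption inferred from the listed kernel depths, and biases/batch-norm parameters are ignored:

```python
# Weight count for the VGG-7 sketch above (weights only).

in_depths  = [3, 128, 128, 256, 256, 512]     # kernel depths from the slide
out_widths = [128, 128, 256, 256, 512, 512]   # assumed output widths

conv_params = sum(3 * 3 * d * w for d, w in zip(in_depths, out_widths))
fc_params = 8192 * 10                         # fully-connected classifier
total = conv_params + fc_params

print(total)   # 4656512 -> ~4.7M weights, close to the quoted 4.8M
```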
Summary

- Analog feature extraction is attractive for wake-up detectors
- Adding analog compute in ConvNets can be beneficial when it
  simultaneously lets us reduce data movement
  ► In-memory analog compute looks most promising
  ► Can consider SRAM or emerging memories (e.g., RRAM)
- Expect significant progress as more application drivers for
  "machine learning at the edge" emerge