![Page 1: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/1.jpg)
ApproxHPVM: Accuracy-aware
Optimizations for Heterogeneous SoCs
Vikram Adve
With:
Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp Srivastava,
Yasmin Sarita, Nathan Zhou, Sarita Adve and Sasa Misailovic
University of Illinois at Urbana-Champaign
Supported by: NSF, SRC, DARPA, Amazon, Intel
![Page 2: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/2.jpg)
Software Multiplexing:
Full-system Code Deduplication
Sean Bartell, Will Dietz and Vikram Adve
University of Illinois at Urbana-Champaign
Supported by: ONR, NSF
[OOPSLA 2018]
[Submission, 2019]
![Page 3: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/3.jpg)
Sources of Code Duplication Across Programs
• Duplicated libraries across applications
• Duplicated functions within / across applications
• Duplicated code fragments within / across applications
Software multiplexing is a compiler strategy
to address all three issues
(current system addresses first two)
![Page 4: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/4.jpg)
Example: LEMP Stack for Web Servers
LEMP = Linux + nginx + mariadb + PHP
37 programs combined in ONE!
Fully automatic and no change in functionality
56%
![Page 5: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/5.jpg)
Example 2: OpenWRT Router Configurations
65%
47% 49%
• 3 package sets for OpenWRT: 62 (Home), 103 (Server)
• Baseline: “clang -Oz” + LLVM outliner on each binary
![Page 6: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/6.jpg)
Goal: Programmability for Heterogeneous Parallel Systems
Mobile / embedded SoCs
Supercomputers
Cloud with accelerators
Key to Programmability:
Common abstractions for heterogeneous parallel hardware
Heterogeneous Parallel Virtual Machine
Use HPVM for:
1. Portable object code
2. Retargetable parallel compiler IR and system
3. Run-time scheduling
[Figure: HPVM: IR and Tools. Front ends: C+HPVM, Keras, TensorFlow, Halide, other DSLs. HPVM components: Virtual ISA, Runtime Scheduler, Translators. Targets: CPUs + Vector SIMD Units, GPU, DSP, FPGA, Domain-specific Accelerators, …]
Kotsifakou et al., PPOPP 2018
![Page 7: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/7.jpg)
Abstraction of Parallel Computation
Dataflow Graph
with side effects
Vector
VA = load <4 x float>* A
VB = load <4 x float>* B
…
VC = fmul <4 x float> VA, VB
Hierarchical
or
• Graph nodes – coarse-grain or fine-grain computational tasks
• Graph edges – explicit data transfer between nodes
• Loads and stores – implicit communication via shared memory
• Hierarchical – multiple levels of nested parallelism
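The four bullets above can be sketched as a small data structure. This is only an illustrative model, not the HPVM IR itself: the `Node` class, its fields, and the example task names (`load`, `blur`, `sharpen`, `store`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A task in the graph; a node with children is an internal (nested) graph."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Explicit data-transfer edges between this node's children:
    # (producer name, consumer name).
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def depth(self) -> int:
        """Number of nesting levels at and below this node."""
        if self.is_leaf():
            return 1
        return 1 + max(child.depth() for child in self.children)

    def leaves(self) -> List[str]:
        """Names of all leaf tasks, in declaration order."""
        if self.is_leaf():
            return [self.name]
        out: List[str] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


# Hypothetical graph: the middle stage of a three-stage pipeline is itself
# a nested two-node graph, giving two levels of hierarchy under the root.
filters = Node(
    "filters",
    children=[Node("blur"), Node("sharpen")],
    edges=[("blur", "sharpen")],
)
root = Node(
    "root",
    children=[Node("load"), filters, Node("store")],
    edges=[("load", "filters"), ("filters", "store")],
)
```

Implicit communication through loads and stores is deliberately not modelled here; only the explicit edges and the nesting are.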
![Page 8: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/8.jpg)
HPVM Abstractions
[Figure: node instantiation. A node marked [N] in the Static Dataflow Graph becomes instances 1, 2, …, N in the Dynamic Dataflow Graph]
✓ Graph Structure – coarse grain task parallelism, streams, pipelines
✓ Graph hierarchy – nested parallelism
✓ Node Instantiation – captures SPMD-style data parallelism
✓ Vector instructions in leaf nodes – fine grain vector parallelism
✓ Supports high-level optimizations
✓ Captures FPGAs, some semi-custom hardware
N different parallelism models → single unified parallelism model!
![Page 9: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/9.jpg)
Simplifying Approximate Computing
We know applications can be 2x-10x more efficient
by slightly relaxing accuracy / quality requirements.
Can we make these strategies easier to use?
Sharif et al., OOPSLA 2019
![Page 10: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/10.jpg)
Domains for Approximate Computing
An approximate answer is enough in some important domains
Machine Learning, Data Sciences, Image Processing
![Page 11: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/11.jpg)
Heterogeneity of Approximation Methods
• Diverse approximate computing methods
Software: Loop perforation, barrier elision, reduction sampling,
function substitution, relaxed synchronization, …
Hardware: Numerical precision, caches, custom accelerators
• Example 1: GPU
➢Choices of precision: FP32, FP16, Int8, Int4, …
• Example 2: PROMISE(*)
➢Programmable deep in-memory analog ML accelerator
➢Choices of bit-line swing voltage: 7 different levels of accuracy
(*) Kang et al., ISCA 2018
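As a concrete instance of one software technique named above, here is a minimal sketch of loop perforation: the loop visits only every `stride`-th element and accepts the resulting error. The function names and the data are made up for illustration and are not taken from the talk.

```python
def mean_exact(values):
    """Baseline: visit every element."""
    total = 0.0
    for i in range(len(values)):
        total += values[i]
    return total / len(values)


def mean_perforated(values, stride):
    """Loop perforation: execute only every `stride`-th iteration."""
    total = 0.0
    count = 0
    for i in range(0, len(values), stride):
        total += values[i]
        count += 1
    return total / count


# Illustrative data: 1000 values cycling through 0..9.
data = [float(i % 10) for i in range(1000)]

exact = mean_exact(data)                   # all 1000 iterations
approx = mean_perforated(data, stride=4)   # 250 iterations
rel_error = abs(approx - exact) / abs(exact)
```

With this data the perforated loop does a quarter of the work and lands about 11% away from the exact mean. Whether that is acceptable depends on the application's quality requirement.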
![Page 12: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/12.jpg)
ApproxHPVM: Key Goals
• Applications should only specify high-level goals
➢Maximum acceptable loss in quality, e.g., inference error, PSNR
➢End-to-end metrics, not per function or pipeline stage or …
➢System should select and optimize approximation techniques
• (Often) Want object-code portability
➢Approximation choices are highly system-dependent
➢Can make orders-of-magnitude difference in performance, energy
![Page 13: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/13.jpg)
How to Map to Heterogeneous Approx. SoC?
• SoC with processing elements A, B, C and D
• Application with end-to-end quality spec (e.g., max error)
[Figure: application task graph + End-to-End Quality Specification, mapped (Map) onto SoC processing elements A, B, C, D]
![Page 14: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/14.jpg)
Three Specific Challenges
❑ Map end-to-end quality spec to individual task quality specs
❑ Flexible code-gen to heterogeneous h/w
❑ Cost of mapping
[Figure: application task graph + End-to-End Quality Specification, mapped (Map) onto processing elements labelled A, B1, B2, C1, C2, D]
![Page 15: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/15.jpg)
ApproxHPVM IR: Intrinsics & Quality Metrics
ApproxHPVM IR Portable Quality metrics:
L1 Error: $L1_e = \frac{L1(A - G)}{L1(G)}$
L2 Error: $L2_e = \frac{L2(A - G)}{L2(G)}$
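A small sketch of how the two relative error metrics could be computed. The slide does not define A and G; this sketch assumes A is the approximate output and G is the exact ("golden") output, both flattened to lists of floats. The helper names are hypothetical.

```python
import math


def l1_norm(xs):
    """Sum of absolute values."""
    return sum(abs(x) for x in xs)


def l2_norm(xs):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in xs))


def relative_error(approx, golden, norm):
    """norm(A - G) / norm(G), with A = approx and G = golden (assumed meaning)."""
    diff = [a - g for a, g in zip(approx, golden)]
    return norm(diff) / norm(golden)


# Made-up tensors for illustration.
golden = [3.0, 4.0]
approx = [3.0, 3.5]

l1e = relative_error(approx, golden, l1_norm)  # 0.5 / 7
l2e = relative_error(approx, golden, l2_norm)  # 0.5 / 5
```

Both metrics are normalized by the size of the golden output, so they do not depend on the scale of the tensor values.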
ApproxHPVM IR Tensor Intrinsics:
![Page 16: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/16.jpg)
Keras -> ApproxHPVM
Convolution layer in Keras
Conv2D(Num_filters, Kernel_sizes, activation="relu", padding="same")
![Page 17: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/17.jpg)
Keras -> ApproxHPVM
define i8* @tensorConvNode(i8* %input, i8* %filter) {
  %result = call i8* @tensor.conv(i8* %input, i8* %filter, i32* %padding)
  ret i8* %result
}
define i8* @tensorAddNode(i8* %input, i8* %bias_weights) {
  %result = call i8* @tensor.add(i8* %input, i8* %bias_weights)
  ret i8* %result
}
define i8* @tensorReluNode(i8* %input) {
  %result = call i8* @tensor.relu(i8* %input)
  ret i8* %result
}
Conv2D(Num_filters, Kernel_sizes, activation="relu", padding="same")
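The slide shows one Keras layer becoming three dataflow nodes (convolution, bias add, ReLU). A minimal sketch of that decomposition follows. `lower_conv2d` is a hypothetical helper written for this example, not part of the actual Keras front end, and it handles only the ReLU case the slide shows.

```python
from typing import List, Optional


def lower_conv2d(activation: Optional[str] = None, use_bias: bool = True) -> List[str]:
    """Return the tensor operations one Conv2D layer expands into, in order.

    The names mirror the calls on the slide (tensor.conv, tensor.add,
    tensor.relu); everything else about this function is an assumption.
    """
    ops = ["tensor.conv"]
    if use_bias:
        # The bias is applied as a separate add node after the convolution.
        ops.append("tensor.add")
    if activation == "relu":
        ops.append("tensor.relu")
    elif activation is not None:
        # Only the activation shown on the slide is covered by this sketch.
        raise ValueError(f"unsupported activation in this sketch: {activation}")
    return ops


# Conv2D(..., activation="relu", padding="same") -> three nodes.
nodes = lower_conv2d(activation="relu")
```

Keeping each operation in its own node means each one can be given its own approximation choice later.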
![Page 18: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/18.jpg)
Keras Compiler Workflow
![Page 19: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/19.jpg)
Autotuning Workflow
Use autotuning to determine per-task quality metrics
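The slide does not say how the autotuner searches. The sketch below is one simple way to picture the problem: choose a setting per task so that total cost is lowest while the end-to-end loss stays within budget. The exhaustive search, the additive loss model, and every number and task name are assumptions made for illustration; a real tuner would measure quality rather than add up estimates.

```python
from itertools import product

# Hypothetical per-task choices: (label, cost units, estimated loss in 0.1% units).
CHOICES = {
    "conv1": [("fp32", 10, 0), ("fp16", 6, 3), ("int8", 4, 12)],
    "conv2": [("fp32", 10, 0), ("fp16", 5, 5), ("int8", 3, 18)],
    "dense": [("fp32", 10, 0), ("fp16", 7, 1)],
}


def autotune(choices, max_loss):
    """Exhaustively pick the cheapest configuration whose summed loss fits the budget.

    Returns (cost, loss, {task: label}) or None if nothing fits.
    """
    tasks = list(choices)
    best = None
    for combo in product(*(choices[t] for t in tasks)):
        cost = sum(c[1] for c in combo)
        loss = sum(c[2] for c in combo)  # toy additive end-to-end model
        if loss <= max_loss and (best is None or cost < best[0]):
            best = (cost, loss, {t: c[0] for t, c in zip(tasks, combo)})
    return best


tight = autotune(CHOICES, max_loss=10)  # budget of 1.0%
loose = autotune(CHOICES, max_loss=20)  # budget of 2.0%
```

Under the tighter budget every task settles on fp16. The looser budget lets `conv1` drop to int8, which shows how an end-to-end goal turns into different per-task settings.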
![Page 20: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/20.jpg)
Experimental Setup
Hardware Platform:
➢ Nvidia Jetson TX2 GPU: direct execution
➢ PROMISE: simulations, validated against previous chips
Benchmarks:
➢ DNNs: FC-4, LeNet, AlexNet, ResNet, VGG-16, MobileNet, Shallow MobileNet
➢ Image processing pipelines of 3-4 filters: Gaussian, Motion blur, Sharpen, Outline, Emboss
Quality comparisons:
➢ Baseline: FP32
➢ DNNs application goal: Inference accuracy loss <= 1%, loss <= 2%
➢ Image processing application goal: PSNR = 30 dB, PSNR = 20 dB
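For reference, PSNR can be computed from the mean squared error between the baseline output and the approximate output. This sketch uses the standard definition with an assumed 8-bit peak value of 255; the pixel values are made up.

```python
import math


def psnr(reference, approximate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((r - a) ** 2 for r, a in zip(reference, approximate)) / len(reference)
    if mse == 0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Illustrative pixels: every approximate value is off by exactly 1.
reference = [100.0, 150.0, 200.0, 250.0]
approximate = [101.0, 149.0, 201.0, 249.0]

quality = psnr(reference, approximate)
meets_30db = quality >= 30.0
```

Higher PSNR means the approximate output is closer to the baseline, so a 30 dB goal is stricter than a 20 dB goal.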
![Page 21: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/21.jpg)
DNN Speedup
1-2% loss of inference accuracy gives
2x-9x speedups in most networks
[Figure: y-axis: Speedup over FP32]
![Page 22: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/22.jpg)
DNNs: Energy Savings
[Figure: y-axis: Energy savings vs. FP32]
1-2% loss of inference accuracy gives
2x-11x energy savings in most networks
![Page 23: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/23.jpg)
Image Benchmarks Speedup
[Figure: y-axis: Speedup over FP32]
PSNR of 30dB or 20dB gives
1x-6x speedups in image processing pipelines
6.1x
![Page 24: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/24.jpg)
Image Benchmarks: Energy Savings
[Figure: y-axis: Energy reduction vs. FP32]
PSNR of 30dB or 20dB gives
1.2x-8x energy reduction in image processing pipelines
![Page 25: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/25.jpg)
Hardware-agnostic vs Hardware-specific
[Figure: y-axis: Slowdown vs. H/w Specific]
Hw-agnostic tuning is 0-25% worse than hw-specific
(Up to 45% in one case)
![Page 26: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/26.jpg)
Ongoing Research (1)
Extending ApproxHPVM for flexibility, efficiency
• Algorithmic approximations as well as system-level
• Dynamic selection of approximations
• More DNN architectures: deeper n/ws, recurrent n/ws
• New domains beyond tensor operations
![Page 27: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/27.jpg)
Ongoing Research (2)
Domain-specific programming of edge systems
• Example: ARM (+ GPU) (+ DNN) (+ FPGA)
• Users: Crop scientists, civil engrs, medical researchers…
• Can we enable non-expert users to program complex
heterogeneous SoCs?
➢Very high-level DSLs
➢Automatic partitioning, approximation, mapping, code generation
➢Automatic run-time scheduling, performance analysis
![Page 28: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/28.jpg)
Ongoing Research (3)
Hardware-agnostic programming of FPGAs
• FPGAs are becoming widely available in data centers
• Application users lack hardware expertise
[Figure: HPVM-OpenCL FPGA compilation flow. Labels: HPVM virtual object code; Transformations; Code Gen; Kernel (.cl); AOC Compiler with Intermediate Compilation and Full Compilation; Optimization Report; Analyze Report; Bitstream (.aocx)]
Goal: Use compiler optimizations to achieve high-perf. FPGA designs from hardware-agnostic code
![Page 29: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/29.jpg)
Ongoing Research (4)
Integrate ApproxHPVM with Jasmine Toolflow
• Improve hw-agnostic tuning to match hw-specific
• Partition application + iterate through design space
• Explore approximate hw, sw mechanisms
![Page 30: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/30.jpg)
DSSoC: Hardware Design Space Exploration
[Figure: DSSoC design space exploration. Panel labels: Ontology discovery using graph analytics & static analysis (Ontology 1, 2, 3, …, n; Convolution, Conv 1D, Conv 2D, MatMul, ReLU, …); Ontology learning with "Training" Set workloads WL1, WL2, WL3, …, WLn and a "Test" Set; Workload Mix (38%, 41%, 3%); CNN design space with CPU, GPU and Acc combinations; DSSoC design exploration (NOC architectures, Design constraints, Physical interface); Accelerator Pareto curves. Components: HPVM; Jasmine: Design Space Exploration; ESP: SoC Design Framework; DSSoC Applications. Other labels: Compiler flow, Training flow, Design flow; Hierarchical static DDG, Dynamic DDG, Hierarchical DFG; Performance results; CV. Affiliations shown: (IBM, Columbia, Harvard, Illinois), (Harvard), (Columbia)]
![Page 31: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/31.jpg)
Summary
HPVM: portability + performance for heterogeneous systems
ApproxHPVM: easy access to approximation techniques
Ongoing research:
➢Application-driven hardware design
➢Rich compiler infrastructure for DSLs
➢Easy programming of energy-efficient edge compute systems
Questions?
![Page 32: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/32.jpg)
• EXTRA SLIDES
![Page 33: ApproxHPVM: Accuracy-aware Optimizations for ......ApproxHPVM: Accuracy-aware Optimizations for Heterogeneous SoCs Vikram Adve With: Maria Kotsifakou, Hashim Sharif, Adel Ejjeh, Prakalp](https://reader034.fdocuments.us/reader034/viewer/2022042709/5f4355cf0105115996570d8c/html5/thumbnails/33.jpg)
Example: HPVM CAVA DFG Structure
[Figure: CAVA dataflow graph. Labels: Host Code; Raw image; Root Node; Demosaic Node; Denoise Node; White Balance Node; Gamut Mapping Node; Tone Mapping Node; Output image. Other labels: Ts, Tw, Control Points, Weights, Coeffs, L2 distance]