SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)
Markus Willems
Product Manager
Synopsys
SoC Design
• What to do when the performance of your main processor is insufficient?
– Go multicore?
• Application mapping difficult, resource utilisation unbalanced
– Add hardwired accelerators?
• Balanced but inflexible
ASIPs: application-specific processors
• Anything between general-purpose µP and hardwired data-path
• Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
– Hardware efficiency with software programmability
Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions
Architectural Optimization Space
ASIP architectural optimization space: parallelism and specialization
Parallelism
• Instruction-level parallelism (ILP)
– Orthogonal instruction set (VLIW)
– Encoded instruction set
• Data-level parallelism
– Vector processing (SIMD)
• Task-level parallelism
– Multicore
– Multithreading
Specialization
• App.-specific data types
– Integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
– App.-spec. data processing: any exotic operator; single or multi-cycle
– App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
– App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Connectivity & storage matching application’s data-flow
– Distributed regs, sub-ranges
– Multiple mem’s, sub-ranges
• Pipeline
IP Designer: ASIP Design and Programming
Synopsys - Full Spectrum Processor Technology Provider
32-bit ARC HS Processors
High-Performance for Embedded Applications
[Block diagram: 10-stage pipeline; instruction CCM, instruction cache, data cache, data CCM; ARCv2 ISA / DSP; user defined extensions; ARC floating point unit; MAC & SIMD; multiplier, ALU, divider, late ALU; real-time trace; memory protection unit; JTAG. Some blocks are marked optional.]
• Over 3100 DMIPS @ 1.6 GHz*
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• HS Family products
– HS34 CCM, HS36 CCM plus I&D cache
– HS234, HS236 dual-core
– HS434, HS436 quad-core
• Configurable so each instance can be optimized for performance and power
• Custom instructions enable integration of proprietary hardware
*Worst case 28-nm silicon and conditions
Pedestrian Detection and HOG
• Pedestrian detection
– Standard feature in luxury vehicles
– Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
– Low cost
– Low power (small form factor, and/or battery powered)
– Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)
Histogram Of Oriented Gradients
Processing pipeline: Grey scale conversion → Scale to multiple resolutions → Gradient computation → Histogram computation per block → Normalization of the histograms → SVM per window position → Non-max suppression
Gradient Computation
Apply Sobel operators in the horizontal and vertical directions.
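The gradient step of the pipeline can be sketched in plain Python as follows. The 3x3 kernels are the textbook Sobel operators, which is an assumption because the slide's own kernels are not reproduced here; the function name `sobel_gradients` and the decision to leave border pixels at zero are hypothetical choices for illustration.

```python
import math

# Textbook 3x3 Sobel kernels (assumed; the slide's own kernels are not shown).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_gradients(image):
    """Return (magnitude, angle) maps for a grey-scale image.

    `image` is a list of rows of numbers. Border pixels are left at zero.
    Angles are unsigned orientations in degrees, in the range [0, 180).
    """
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    angle = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            # Correlate the 3x3 neighbourhood with both kernels.
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            magnitude[y][x] = math.hypot(gx, gy)
            angle[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return magnitude, angle
```

The per-pixel work is a small fixed set of multiply-adds on neighbouring pixels, which is the kind of regular data-parallel kernel that the vector (SIMD) options discussed earlier target.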
Scale to Multiple Resolutions
Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames.
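A minimal sketch of the multi-resolution idea: the window stays at 64x128 pixels while the frame is shrunk level by level until the window no longer fits. The function name `pyramid_sizes` and the default `scale_step` of 1.25 are assumptions for illustration; the deck does not state which scale factors are used.

```python
WINDOW_W, WINDOW_H = 64, 128  # fixed detection window from the slide


def pyramid_sizes(frame_w, frame_h, scale_step=1.25):
    """List the (width, height) of every scaled frame that still fits the window.

    `scale_step` is an assumed down-scaling ratio between successive levels.
    """
    sizes = []
    scale = 1.0
    while True:
        w, h = int(frame_w / scale), int(frame_h / scale)
        if w < WINDOW_W or h < WINDOW_H:
            break  # the detection window no longer fits
        sizes.append((w, h))
        scale *= scale_step
    return sizes
```

Calling `pyramid_sizes(640, 480)` gives the list of frame sizes that the later stages would each have to process, which is why the rescaling stage multiplies the workload of everything downstream.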
Histogram Computation
The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of the orientation of gradients.
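The cell and block structure can be sketched as below. This is a simplified illustration, not the implementation from the deck: the 9-bin count is an assumption, each pixel votes into a single bin, and the Gaussian weighting is left out to keep the example short. The function names are hypothetical.

```python
def cell_histogram(magnitude, angle, top, left, cell=8, bins=9):
    """Orientation histogram of one cell.

    Each pixel votes with its gradient magnitude into the bin that holds its
    unsigned orientation (degrees in [0, 180)). The bin count is assumed and
    the Gaussian block weighting is omitted for brevity.
    """
    bin_width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            index = int(angle[y][x] / bin_width) % bins
            hist[index] += magnitude[y][x]
    return hist


def block_histograms(magnitude, angle, top, left, cell=8, bins=9):
    """The 4 cell histograms of the 2x2-cell block whose corner is (top, left)."""
    return [cell_histogram(magnitude, angle,
                           top + dy * cell, left + dx * cell, cell, bins)
            for dy in (0, 1) for dx in (0, 1)]
```

The inner loop is a data-dependent scatter into a small table, which is awkward for a general-purpose instruction set and is the kind of operation that dedicated histogram registers and instructions are meant to speed up.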
Normalization of the Histograms
(1) L2 normalization
(2) Clipping (saturation)
(3) L2 normalization
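The three normalization steps can be sketched as follows. The clipping threshold of 0.2 and the small `eps` guard against division by zero are assumptions taken from common HOG practice; the deck does not give the values it uses, and the function names are hypothetical.

```python
import math


def l2_normalize(vector, eps=1e-6):
    """Scale a vector to unit L2 norm; `eps` avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


def normalize_block(vector, clip=0.2):
    """(1) L2 normalization, (2) clipping (saturation), (3) L2 normalization."""
    first = l2_normalize(vector)
    clipped = [min(v, clip) for v in first]  # assumed threshold of 0.2
    return l2_normalize(clipped)
```

The clipping step limits the influence of a few very strong gradients before the block is normalized a second time.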
Support Vector Machine
Linear classification of histograms for every 64x128 window position.
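A linear classifier reduces to one dot product per window position, as in this minimal sketch. The names `svm_score` and `classify_windows`, the `threshold` parameter and the flat-list descriptor layout are assumptions for illustration; the trained weights would come from the reference model.

```python
def svm_score(descriptor, weights, bias):
    """Linear SVM decision value: dot(weights, descriptor) + bias."""
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


def classify_windows(descriptors, weights, bias, threshold=0.0):
    """Return (window_index, score) for every window scoring above threshold."""
    hits = []
    for index, descriptor in enumerate(descriptors):
        score = svm_score(descriptor, weights, bias)
        if score > threshold:
            hits.append((index, score))
    return hits
```

Because the same weight vector is applied at every window position, the workload is a long run of multiply-accumulate operations, a good match for MAC and SIMD hardware.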
Non-Max Suppression
Cluster the multi-scale dense scan of detection windows and select unique detections.
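The deck does not say which clustering method is used. As a stand-in, the sketch below uses greedy overlap-based suppression, a common substitute technique: windows are visited from highest to lowest score and a window is kept only if it does not overlap a window already kept. The box format `(x, y, width, height)` and the `overlap` threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def non_max_suppression(detections, overlap=0.5):
    """Keep the best-scoring box of each group of overlapping detections.

    `detections` is a list of (score, box) pairs, with boxes already mapped
    back to original-frame coordinates.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, other) <= overlap for _, other in kept):
            kept.append((score, box))
    return kept
```

This stage handles only the few windows that pass the classifier, so its cost is small compared with the dense per-pixel stages.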
HOG Functional Validation on ARC HS (640 x 480 pixels)
[Block diagram: the tasks Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM and Non-max suppression mapped onto a subsystem with HS, subsystem control, DCCM, DMA/sync/I/O, an AXI local interconnect, and ASIP1, ASIP2 … ASIPn with blocks labelled "D" on a dedicated streaming interconnect (FIFOs).]
Profiling (640 x 480 pixels, at 30 FPS)
• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed point profiling results: 2.4 G cycles per frame

Task | ARC HS G cycles | % | # ARC HS equivalent
Grey scale conversion | 0.1 | 0.2% | 0.07
Scale to multiple resolutions | 1.6 | 2.3% | 1.0
Gradient computation | 17.3 | 26% | 10.8
Histogram computation per block | 31.9 | 47% | 20.0
Normalization of the histograms | 1.2 | 1.8% | 0.8
SVM per window position | 15.7 | 23% | 9.8
Non-max suppression | 0.004 | 0.01% | 0.002
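The profiling columns are related by simple arithmetic, sketched below. The sketch assumes that the "G cycles" column is the per-second load at 30 FPS and that one ARC HS supplies 1.6 G cycles per second (the 1.6 GHz clock quoted for the HS); under that assumption the "# ARC HS equivalent" column is the cycle count divided by 1.6. The per-task values are copied from the profiling table in stage order; the names are hypothetical.

```python
HS_CLOCK_GHZ = 1.6  # ARC HS clock quoted in the deck

# G cycles per task, copied from the profiling table.
G_CYCLES = {
    "Grey scale conversion": 0.1,
    "Scale to multiple resolutions": 1.6,
    "Gradient computation": 17.3,
    "Histogram computation per block": 31.9,
    "Normalization of the histograms": 1.2,
    "SVM per window position": 15.7,
    "Non-max suppression": 0.004,
}


def hs_equivalents(g_cycles, clock_ghz=HS_CLOCK_GHZ):
    """Number of ARC HS cores needed per task at the given clock."""
    return {task: cycles / clock_ghz for task, cycles in g_cycles.items()}


def shares(g_cycles):
    """Each task's share of the total cycle count, in percent."""
    total = sum(g_cycles.values())
    return {task: 100.0 * cycles / total for task, cycles in g_cycles.items()}
```

Summing the equivalents gives a little over 42 cores. Gradient, histogram and SVM together account for well over 90% of the cycles, which makes them the natural first candidates for offloading.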
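Task Assignment #2
[Block diagram, configuration 2: the tasks Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM and Non-max suppression mapped onto HS with DCCM, subsystem control and DMA/sync/I/O on the AXI local interconnect, and onto ASIP1, ASIP2 and ASIP4 with blocks labelled "D" on the dedicated streaming interconnect (FIFOs); L3 ext. DRAM.]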
ASIP Example: HISTOGRAM
• Vector-slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms
4x size increase & 200x speedup (relative to RISC template)
Implemented in less than 1 week
Task Assignment #3
[Block diagram, configuration 3: same subsystem as before (HS with DCCM, subsystem control, DMA/sync/I/O, AXI local interconnect, dedicated streaming interconnect with FIFOs, L3 ext. DRAM), now with ASIP1, ASIP2, ASIP3 and ASIP4, each with a block labelled "D".]
Task Assignment #4
[Block diagram, configuration 4: HS with DCCM, subsystem control, DMA/sync/I/O, AXI local interconnect, dedicated streaming interconnect (FIFOs) and L3 ext. DRAM, with ASIP1’, ASIP2, ASIP3 and ASIP4, each with a block labelled "D".]
Task Assignment #4 (configuration 4’)
[Block diagram, configuration 4’: HS with DCCM, L2 SRAM, DMA/sync/I/O, AXI local interconnect, dedicated streaming interconnect (FIFOs) and L3 ext. DRAM, with ASIP1’, ASIP2, ASIP3 and ASIP4, each with a block labelled "D".]
Comparison
Config. | Platform configuration | # HS (MHz) | # ASIP (MHz) | ARC functions | ASIP functions
1 | HS | ~40 | 0 | All | None
2 | HS + ASIPs | 2 (1600) | 2.5 (500) | Greyscale, Rescaling, Normalization, Non-max suppr., Display | Gradient, Histogram, SVM
3 | HS + ASIPs | 1 (1600) | 3.5 (500) | Greyscale, Rescaling, Non-max suppr., Display | Gradient, Histogram, Normalization, SVM
4 | HS + ASIPs | 1 (500) | 4 (500) | Greyscale, Non-max suppr., Display | Rescaling, Gradient, Histogram, Normalization, SVM
Final Results
• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130 mW
• Power/performance/area via ASIPs
– Scaling due to multi-core, specialization and SIMD usage
– Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture
Scenario: Need for Flexible FEC Core
• Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi
• Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area
[Figure: standard-specific FEC cores (DVB-X? LDPC-A, UMTS Turbo-B, .11n LDPC-C, .16e LDPC-D, 3GPP-LTE turbo-A, .11n Vit) versus a single FlexFEC core (turbo/LDPC/Vit).]
Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6
• ILP: 2 FU (scalar + vector unit)
• ILP: 6 FU (1 scalar + 5 vector units)
– No duplication of arithmetic functionality
– For exploiting ILP to increase throughput
– 2 FUs for local memory access
Fast Area/Performance Trade-off (40nm logical synthesis, processor only)
[Plot: cycle count (axis 0 to 100) versus total number of processor functional units (2 to 6) for ldpc - layer 6, ldpc - layer 8, turbo - beta and turbo - output; area annotations 0.177 sqmm and 0.189 sqmm.]
Architectural Exploration
FU Utilization: 2 → 5
[Two bar charts of FU utilization (axis 0 to 100) for layer6, layer7, layer8, alpha, beta and output. First chart series: scalar, vector. Second chart series: scalar, vector alu, vector spec, vector vmem, vector bg vmem.]
• Vector slot separated in different FUs without overlapping functionality
• Local memory access congestion
Architectural Exploration
More Balanced FU Utilization: 5 → 6
[Bar chart of FU utilization (axis 0 to 90) for ldpc - layer6, ldpc - layer7, ldpc - layer8, turbo - alpha, turbo - beta and turbo - output. Series: scalar, vector alu, vector spec, vector vmem, vector vmem2, vector bg vmem.]
Highly Efficient C-compilation
Vast Majority of 6 FU Used
Latest IP Available from IMEC
Blox-LDPC ASIP
adInstances available
Conclusion
• ASIPs enable programmable accelerators
• IP Designer enables efficient design and programming of ASIPs
• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
• ASIPs enable balanced multicore SoC architectures