Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n...
Transcript of Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n...
![Page 1: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/1.jpg)
Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping
Takuya Kojima, Naoki Ando, Yusuke Matsushita, Hayate Okuhara,
Nguyen Anh Vu Doan and Hideharu AmanoKeio University, Japan
International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2018), Toronto, Canada
![Page 2: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/2.jpg)
Outlinen Introductionn A CGRA Architecturen Three Types of Control
1. Pipeline Structure Control2. Body Bias Control3. Application Mapping
n New Mapping Optimization Methodn Real Chip Implementationn Experimental Resultsn Conclusion �
![Page 3: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/3.jpg)
Importance of Low Power Consumption
nForthcomingnIoT devicesnWearable computingnSensor network
nChallengesnHigh performance
nFor image processingnLow Power Consumption
nFor long battery life�
![Page 4: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/4.jpg)
SF-CGRAs: Straight-Forward Coarse-Grained Reconfigurable Arrays
n Key features of straight-forward CGRAs
Permutation Network
PEPE
PEPE
PEPE
PEPE
Pipeline Register
Permutation Network
PEPE
PEPE
PEPE
PEPE
Date Mem
ory
n Limited data flow directionn Less frequent reconfiguration
n Pipelined PE arraynHigh energy efficiency �
![Page 5: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/5.jpg)
VPCMA: Variable Pipelined
Cool Mega Array [1]
n PE array consists of
n 8 x 12 PEs
n 7 pipeline registers
n PE has
n No Register file
n No clock tree
n Pipeline register works in
1. latch mode
2. bypass mode
n μ-Controller
n Controls data transfer
data mem. ↔ PE array
PE PE
PE PE
PE PE
PE PE
PE-Array
���
PE PE
PE PE
PE PE
PE PE
���
Data Manipulator
Data Memory���
���
μ-co
ntr
oller
���
���
Pipeline
Registers
or
[1] N.Ando, et al. "Variable pipeline structure for Coarse Grained Reconfigurable Array CMA."
Field-Programmable Technology, 2016.
�
![Page 6: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/6.jpg)
Pipeline Structure Control
6
4th PE Row
3rd PE Row
2nd PE Row
1st PE Row
PipelineRegister
Number of Pipeline StageLarge Small
Operating FrequencyThroughput
Glitch PropagationDynamic Power of Registers & Clock
1st stage
3rd stage
2nd stage
![Page 7: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/7.jpg)
Pipeline Structure Control
7
4th PE Row
3rd PE Row
2nd PE Row
1st PE Row
PipelineRegister
2nd stage
1st stage
Number of Pipeline StageLarge Small
Operating FrequencyThroughput
Glitch PropagationDynamic Power of Registers & Clock
![Page 8: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/8.jpg)
Body Bias Effects on SOTB
n Tradeoff between leak power and performance
Decreaseof Static Power
PerformanceEnhancement
Zero BiasReverse Bias Forward Bias
n SOTB Technologyn 65 nmnOne of FD-SOIn Body Biasing
�
![Page 9: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/9.jpg)
Row-level Body Bias Control
9
Probability of Leak Power Reduction
Delay timein case of no control
Delay timein case ofrow-level control
AND
SL
MULT
ADD
2 Stage Pipeline 4 Stage PipelineDelay Time of PEfor Each Opcode
Time Deadline
![Page 10: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/10.jpg)
How to map an applicationto the PE array?
n An app. is represented as a data flow graph (DFG)
n Various Mappings exist
+
OR
>><<
×
− −
OR
+ ×
<< >>
Example of Application DFG PE Array
n High Performance
n Large Power
Mapping Eval.map
��
![Page 11: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/11.jpg)
How to map an applicationto the PE array?
n An app. is represented as a data flow graph (DFG)
n Various Mappings exist
+
OR
>><<
×
−
−OR
+ ×<< >>
Example of Application DFG
PE Array
n Small Power
n Low Performance
Mapping Eval.
map
��
![Page 12: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/12.jpg)
Complexity of Mapping Optimization
3. Body Bias Voltage(BBV) for Each Row2. Pipeline
Structure
(1( . ( ( -)
( 11 (2 2 () 2 ) 2 1
DynamicPower
StaticPower
(# of Rows)^(# of voltages)patterns
NP-Complete Problem
128 patterns
n Tradeoff between leak power and dynamic power
Interdependenton each other
control
![Page 13: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/13.jpg)
Related work
1. Performance & power optimization for CGRA[2]
n Considering VDD control
n Optimization Priority: Performace > Power
2. Body bias domain size exploration for CGRAs[3]
n Analysis of area overhead and power reduction effects
n Not taking care of the dynamic power
3. Pipeline & body bias optimization for CGRAs [4]
n Method using integer-linear-program
n Assuming static mapping
��
[2] Gu, Jiangyuan, et al. "Energy-aware loops mapping on multi-vdd CGRAs without performance degradation.”
Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 2017.
[3] Y.Matsushita, “Body Bias Grain Size Exploration for a Coarse Grained Reconfigurable Accelerator”,
Proc. of the 26th The International Conference on Field-Programmable Logic and Applications (FPL),2016.
[4] T. Kojima, et al. “Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures”.
IEICE Transactions on Information and Systems, Vol. E101-D,No. 6, June 2018.
![Page 14: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/14.jpg)
Is optimizing only the power consumption enough?
n Several requirementsn Power Consumptionn Performance (Operating Frequency)n Throughput
n Multi-Objective Optimization brings usersnA variety of choicesnBalancing the tradeoffs
Power
PerformanceThroughput��
![Page 15: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/15.jpg)
Proposal: Use Multi-Objective Optimization
n Non-dominated Sorting Genetic Algorithm-II (NSGA-II)nMulti-Objective Genetic Algorithm
n In this workn1-point crossovernCommonly-used probability [5]
n0.7 crossover probabilityn0.3 mutation probability
n300 generations
[5] L. Davis. “Adapting operator probabilities in genetic algorithms”. In Proceedings of the third international conference on Genetic algorithms, pp. 61–69, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.
��
![Page 16: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/16.jpg)
Gene & Evaluation of Individuals
��������� DFG Mapping Pipeline Structure
Degree of Parallelism
Total WireLength
DynamicPower
Static Power
Total Power
AnalyzeEach Path
RoutingTargetFreq.
ILP Solverfor BBVs
GlitchEstimation
BBV forEach Row
��
Critical Path Delay
![Page 17: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/17.jpg)
Gene & Evaluation of Individuals
��������� DFG Mapping Pipeline Structure
Degree of
ParallelismTotal Wire
Length
Dynamic
Power
Static
Power
Total
Power
Analyze
Each Path
RoutingTarget
Freq.
ILP Solver
for BBVs
Glitch
Estimation
BBV forEach Row
��
Critical
Path Delay
• Dynamic power model
• Proposed in [6]
• Considering glitch
propagation
• Based on results
of real chip
measurements
[6] T.Kojima, et al. “Glitch-aware
variable pipeline optimization for
CGRAs”. ReConFig2017, pp. 1–6,
Dec 2017.
![Page 18: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/18.jpg)
Gene & Evaluation of Individuals
��������� DFG Mapping Pipeline Structure
Degree of Parallelism
Total WireLength
DynamicPower
Static Power
Total Power
AnalyzeEach Path
RoutingTargetFreq.
ILP Solverfor BBVs
GlitchEstimation
BBV forEach Row
��
Critical Path Delay
• An Integer Linear Program (ILP)• Minimizes the static power• Considers timing constraints• Takes within 0.1 sec
• The same method as proposedin [4]
[4] T. Kojima, et al. “Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures”. IEICE Transactions on Information and Systems, Vol. E101-D,No. 6, June 2018.
![Page 19: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/19.jpg)
An Implemented Real Chip “CCSOTB2”
n CCSOTB2n VPCMA Architecturen SOTB 65nm Technologyn 5 Body Bias Domains
n Design: Verilog HDLn Synthesis: Synopsys Design Compilern Place & Route: Synopsys IC Compiler
6mm
3mm
TCI
PE Array
Body Bias Domainsdomain1 1-5th PE Rowsdomain2 6th PE Row domain3 7th PE Row domain4 8th PE Row domain5 other parts
��
![Page 20: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/20.jpg)
Preliminary Experiments
n Leak power of PE row is measuredn BBV: -0.8 ~ +0.4 V (step: 0.2 V)
n Maximum Operating Freq.n 30MHzn due to bottleneck in μ-controller
CCSOTB2 Chip Artex-7FPGA
Experimental Environment
Mother Board
��
ZeroBias
![Page 21: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/21.jpg)
Benchmark Applications
n4 simple image processing applicationnAssuming 30MHz frequency
Name Descriptionaf 24bit alpha blender
gray 24bit gray scalesepia 8bit sepia filter
sf 24 bit sepia filter
��
![Page 22: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/22.jpg)
Proposed method vs. Black-Diamond
nBlack-Diamond [7]ndoes not support pipeline control
nor body bias controlnStatic mapping regardless of user’s requirements
nCombine with pipeline optimization[6]nConsidering glitch effects
[6] T.Kojima, et al. “Glitch-aware variable pipeline optimization for CGRAs”. ReConFig2017, pp. 1–6, Dec 2017.[7] V.Tunbunheng , et al. “Black-diamond: a retargetable compiler using graph with configuration bitsfor dynamically reconfigurable architectures”. In Proc. of The 14th SASIMI, pp. 412–419, 2007. ��
![Page 23: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/23.jpg)
Mapping quality
Black-Diamond withpipeline optimization Proposed method
Difference of mapping results (gray application)
0.0 V0.0 V
-0.4 V-0.4 V-0.4 V
��
![Page 24: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/24.jpg)
Mapping quality
Black-Diamond withpipeline optimization Proposed method
Difference of mapping results (af application)
0.0 V0.0 V
0.0 V0.0 V-0.2 V
��
![Page 25: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/25.jpg)
Power reduction
n For all applications, the total power is reducedn In average, 14.2 % reduction is achieved
��
![Page 26: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/26.jpg)
Conclusionn A new optimization method based on a multi-
objective genetic algorithm is proposedn Three controls are considered simultaneously
1. Pipeline structure control2. Body bias control3. Application mapping
n Real chip experiments shows 14.2% power reduction
��
![Page 27: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic](https://reader033.fdocuments.us/reader033/viewer/2022042310/5ed847a40fa3e705ec0e2e2c/html5/thumbnails/27.jpg)
2 222 2
2