Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n...

27
Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping Takuya Kojima , Naoki Ando, Yusuke Matsushita, Hayate Okuhara, Nguyen Anh Vu Doan and Hideharu Amano Keio University, Japan International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2018), Toronto, Canada

Transcript of Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n...

Page 1: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping

Takuya Kojima, Naoki Ando, Yusuke Matsushita, Hayate Okuhara,

Nguyen Anh Vu Doan and Hideharu AmanoKeio University, Japan

International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2018), Toronto, Canada

Page 2: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Outlinen Introductionn A CGRA Architecturen Three Types of Control

1. Pipeline Structure Control2. Body Bias Control3. Application Mapping

n New Mapping Optimization Methodn Real Chip Implementationn Experimental Resultsn Conclusion �

Page 3: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Importance of Low Power Consumption

nForthcomingnIoT devicesnWearable computingnSensor network

nChallengesnHigh performance

nFor image processingnLow Power Consumption

nFor long battery life�

Page 4: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

SF-CGRAs: Straight-Forward Coarse-Grained Reconfigurable Arrays

n Key features of straight-forward CGRAs

Permutation Network

PEPE

PEPE

PEPE

PEPE

Pipeline Register

Permutation Network

PEPE

PEPE

PEPE

PEPE

Date Mem

ory

n Limited data flow directionn Less frequent reconfiguration

n Pipelined PE arraynHigh energy efficiency �

Page 5: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

VPCMA: Variable Pipelined

Cool Mega Array [1]

n PE array consists of

n 8 x 12 PEs

n 7 pipeline registers

n PE has

n No Register file

n No clock tree

n Pipeline register works in

1. latch mode

2. bypass mode

n μ-Controller

n Controls data transfer

data mem. ↔ PE array

PE PE

PE PE

PE PE

PE PE

PE-Array

���

PE PE

PE PE

PE PE

PE PE

���

Data Manipulator

Data Memory���

���

μ-co

ntr

oller

���

���

Pipeline

Registers

or

[1] N.Ando, et al. "Variable pipeline structure for Coarse Grained Reconfigurable Array CMA."

Field-Programmable Technology, 2016.

Page 6: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Pipeline Structure Control

6

4th PE Row

3rd PE Row

2nd PE Row

1st PE Row

PipelineRegister

Number of Pipeline StageLarge Small

Operating FrequencyThroughput

Glitch PropagationDynamic Power of Registers & Clock

1st stage

3rd stage

2nd stage

Page 7: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Pipeline Structure Control

7

4th PE Row

3rd PE Row

2nd PE Row

1st PE Row

PipelineRegister

2nd stage

1st stage

Number of Pipeline StageLarge Small

Operating FrequencyThroughput

Glitch PropagationDynamic Power of Registers & Clock

Page 8: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Body Bias Effects on SOTB

n Tradeoff between leak power and performance

Decreaseof Static Power

PerformanceEnhancement

Zero BiasReverse Bias Forward Bias

n SOTB Technologyn 65 nmnOne of FD-SOIn Body Biasing

Page 9: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Row-level Body Bias Control

9

Probability of Leak Power Reduction

Delay timein case of no control

Delay timein case ofrow-level control

AND

SL

MULT

ADD

2 Stage Pipeline 4 Stage PipelineDelay Time of PEfor Each Opcode

Time Deadline

Page 10: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

How to map an applicationto the PE array?

n An app. is represented as a data flow graph (DFG)

n Various Mappings exist

+

OR

>><<

×

− −

OR

+ ×

<< >>

Example of Application DFG PE Array

n High Performance

n Large Power

Mapping Eval.map

��

Page 11: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

How to map an applicationto the PE array?

n An app. is represented as a data flow graph (DFG)

n Various Mappings exist

+

OR

>><<

×

−OR

+ ×<< >>

Example of Application DFG

PE Array

n Small Power

n Low Performance

Mapping Eval.

map

��

Page 12: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Complexity of Mapping Optimization

3. Body Bias Voltage(BBV) for Each Row2. Pipeline

Structure

(1( . ( ( -)

( 11 (2 2 () 2 ) 2 1

DynamicPower

StaticPower

(# of Rows)^(# of voltages)patterns

NP-Complete Problem

128 patterns

n Tradeoff between leak power and dynamic power

Interdependenton each other

control

Page 13: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Related work

1. Performance & power optimization for CGRA[2]

n Considering VDD control

n Optimization Priority: Performace > Power

2. Body bias domain size exploration for CGRAs[3]

n Analysis of area overhead and power reduction effects

n Not taking care of the dynamic power

3. Pipeline & body bias optimization for CGRAs [4]

n Method using integer-linear-program

n Assuming static mapping

��

[2] Gu, Jiangyuan, et al. "Energy-aware loops mapping on multi-vdd CGRAs without performance degradation.”

Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 2017.

[3] Y.Matsushita, “Body Bias Grain Size Exploration for a Coarse Grained Reconfigurable Accelerator”,

Proc. of the 26th The International Conference on Field-Programmable Logic and Applications (FPL),2016.

[4] T. Kojima, et al. “Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures”.

IEICE Transactions on Information and Systems, Vol. E101-D,No. 6, June 2018.

Page 14: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Is optimizing only the power consumption enough?

n Several requirementsn Power Consumptionn Performance (Operating Frequency)n Throughput

n Multi-Objective Optimization brings usersnA variety of choicesnBalancing the tradeoffs

Power

PerformanceThroughput��

Page 15: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Proposal: Use Multi-Objective Optimization

n Non-dominated Sorting Genetic Algorithm-II (NSGA-II)nMulti-Objective Genetic Algorithm

n In this workn1-point crossovernCommonly-used probability [5]

n0.7 crossover probabilityn0.3 mutation probability

n300 generations

[5] L. Davis. “Adapting operator probabilities in genetic algorithms”. In Proceedings of the third international conference on Genetic algorithms, pp. 61–69, San Francisco, CA, USA, 1989. Morgan Kaufmann Publishers Inc.

��

Page 16: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Gene & Evaluation of Individuals

��������� DFG Mapping Pipeline Structure

Degree of Parallelism

Total WireLength

DynamicPower

Static Power

Total Power

AnalyzeEach Path

RoutingTargetFreq.

ILP Solverfor BBVs

GlitchEstimation

BBV forEach Row

��

Critical Path Delay

Page 17: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Gene & Evaluation of Individuals

��������� DFG Mapping Pipeline Structure

Degree of

ParallelismTotal Wire

Length

Dynamic

Power

Static

Power

Total

Power

Analyze

Each Path

RoutingTarget

Freq.

ILP Solver

for BBVs

Glitch

Estimation

BBV forEach Row

��

Critical

Path Delay

• Dynamic power model

• Proposed in [6]

• Considering glitch

propagation

• Based on results

of real chip

measurements

[6] T.Kojima, et al. “Glitch-aware

variable pipeline optimization for

CGRAs”. ReConFig2017, pp. 1–6,

Dec 2017.

Page 18: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Gene & Evaluation of Individuals

��������� DFG Mapping Pipeline Structure

Degree of Parallelism

Total WireLength

DynamicPower

Static Power

Total Power

AnalyzeEach Path

RoutingTargetFreq.

ILP Solverfor BBVs

GlitchEstimation

BBV forEach Row

��

Critical Path Delay

• An Integer Linear Program (ILP)• Minimizes the static power• Considers timing constraints• Takes within 0.1 sec

• The same method as proposedin [4]

[4] T. Kojima, et al. “Optimization of body biasing for variable pipelined coarse-grained reconfigurable architectures”. IEICE Transactions on Information and Systems, Vol. E101-D,No. 6, June 2018.

Page 19: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

An Implemented Real Chip “CCSOTB2”

n CCSOTB2n VPCMA Architecturen SOTB 65nm Technologyn 5 Body Bias Domains

n Design: Verilog HDLn Synthesis: Synopsys Design Compilern Place & Route: Synopsys IC Compiler

6mm

3mm

TCI

PE Array

Body Bias Domainsdomain1 1-5th PE Rowsdomain2 6th PE Row domain3 7th PE Row domain4 8th PE Row domain5 other parts

��

Page 20: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Preliminary Experiments

n Leak power of PE row is measuredn BBV: -0.8 ~ +0.4 V (step: 0.2 V)

n Maximum Operating Freq.n 30MHzn due to bottleneck in μ-controller

CCSOTB2 Chip Artex-7FPGA

Experimental Environment

Mother Board

��

ZeroBias

Page 21: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Benchmark Applications

n4 simple image processing applicationnAssuming 30MHz frequency

Name Descriptionaf 24bit alpha blender

gray 24bit gray scalesepia 8bit sepia filter

sf 24 bit sepia filter

��

Page 22: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Proposed method vs. Black-Diamond

nBlack-Diamond [7]ndoes not support pipeline control

nor body bias controlnStatic mapping regardless of user’s requirements

nCombine with pipeline optimization[6]nConsidering glitch effects

[6] T.Kojima, et al. “Glitch-aware variable pipeline optimization for CGRAs”. ReConFig2017, pp. 1–6, Dec 2017.[7] V.Tunbunheng , et al. “Black-diamond: a retargetable compiler using graph with configuration bitsfor dynamically reconfigurable architectures”. In Proc. of The 14th SASIMI, pp. 412–419, 2007. ��

Page 23: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Mapping quality

Black-Diamond withpipeline optimization Proposed method

Difference of mapping results (gray application)

0.0 V0.0 V

-0.4 V-0.4 V-0.4 V

��

Page 24: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Mapping quality

Black-Diamond withpipeline optimization Proposed method

Difference of mapping results (af application)

0.0 V0.0 V

0.0 V0.0 V-0.2 V

��

Page 25: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Power reduction

n For all applications, the total power is reducedn In average, 14.2 % reduction is achieved

��

Page 26: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

Conclusionn A new optimization method based on a multi-

objective genetic algorithm is proposedn Three controls are considered simultaneously

1. Pipeline structure control2. Body bias control3. Application mapping

n Real chip experiments shows 14.2% power reduction

��

Page 27: Real Chip Evaluation of a Low Power CGRA with Optimized Application Mapping · 2020-05-14 · n Analysis of area overhead and power reduction effects n Not taking care of the dynamic

2 222 2

2