A Coarse-Grained Reconfigurable Wavelet Denoiser Exploiting the Multi-Dataflow Composer Tool
A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator
description
Transcript of A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator
![Page 1: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/1.jpg)
A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator
Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki Murakami
Kyushu University, Fukuoka, Japan
![Page 2: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/2.jpg)
2
OUTLINE
Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 3: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/3.jpg)
3
OUTLINE
Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 4: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/4.jpg)
4
Designing Embedded Systems
Embedded Microprocessors Application-Specific Integrated Circuits (ASICs) Application-Specific Instruction set Processors (ASIPs) Extensible Processors
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2
LD/ST: Load / Store
CFU: Custom Functional Unit
![Page 5: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/5.jpg)
5
Extensible Processors
Improving the performance and energy efficiency of the embedded processors
maintaining compatibility and flexibility. Using an accelerator for accelerating frequently executed portions of
applications Accelerator implementations
custom hardware (such as ASIP or Extensible Processors) reconfigurable fine/coarse grain hw
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2LD/ST: Load / Store
CFU: Custom Functional Unit
![Page 6: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/6.jpg)
6
Custom Instructions
Critical segments Most frequently executed portions of the applications
Instruction set customization hardware/software partitioning (Identifying critical segments in applications)
Custom Instructions (CIs) are extracted from critical segments of an application and executed on a Custom Functional Unit (CFU)
A CI is represented as a DFG
1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2
ADDU
SRA
SLT
SUBU
BNE
R3R0 R0R0
R10
R8
R2
R2
R30x3
400488
2
3
4
5
1A Custom Instruction
![Page 7: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/7.jpg)
7
Reconfigurable Processors
Using a reconfigurable accelerator (RAC) instead of custom functional units
Adding and generating custom instructions after fabrication
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2RFU Config Mem
CFU: Custom Functional Unit
RFU: Reconfigurable Functional Unit
![Page 8: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/8.jpg)
8
Baseline Processor
ALU
Register File
RAC...
...
...
ConfigurationMemory
How a Reconfigurable Processor Works
GPP: General Purpose Processor
RAC: Reconfigurable Accelerator
400680 subiu $25,$25,1400688 lbu $13,0($7)400690 lbu $2,0($4)400698 sll $2,$2,0x184006a0 sra $14,$2,0x184006a8 addiu $4,$4,14006b0 srl $8,$2,0x1c4006b8 sll $2,$8,0x24006c0 addu $2,$2,$254006c8 lw $2,0($2)4006d0 xori $13,$13,14006d8 addu $10,$10,$2400680 subiu $25,$25,1400698 sll $2,$2,0x184006a0 sra $14,$2,0x18400688 lbu $13,0($7)4006e0 bgez $10,4006f0 ... Hot Basic Block
Reconfigurable Processor
![Page 9: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/9.jpg)
9
OUTLINE
Introduction Problem Definition and Basic Concepts Overall Design Approach Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 10: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/10.jpg)
10
Designing a Reconfigurable Accelerator (RAC)
Multitude of design parameters e.g. number of functional units, input/output ports, type of
functional units and etc.
Several design parameters high complexity in the RAC design the requirements for a methodological approach
A major challenge finding the right balance between the different quality
requirements (e.g. speedup, area, energy consumption)
![Page 11: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/11.jpg)
11
Traditional Design Process
Describing a reference model Verifying the model functional correctness Obtaining a rough estimates of performance Manual or semi-automatic generation of several
alternative designs Choosing the most suitable design based on various
performance metrics
![Page 12: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/12.jpg)
12
Design Space Exploration
Design Space Exploration (DSE) the process of analyzing several “functionally equivalent”
implementation alternatives to identify an optimal solution Can become too computationally expensive
Example: a design with four tasks on an architecture with three processing
modules and each have four possible configurations results in 500 design alternatives
Our Hybrid (analytical & quantitative) DSE approach reduces substantially the design time and effort
![Page 13: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/13.jpg)
13
Assumptions
RAC: a matrix of FUs the width equal to w and height equal to h basically has a combinational logic
Basic Elements: Functional Units (logic resources) Multiplexers (routing resources)
FUs in RAC are fully connected Except the lack of connections from lower to upper rows
RAC Input Ports
RAC Output Ports
2
2FU
1
1FU
2
1FU
1
2FU Row1
FU FU FU FU
Col1
Row5
...
...
h
w
![Page 14: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/14.jpg)
14
Assumptions
CIs (DFGs) are mapped onto the RAC and executed at runtime
1: SUBU R3, R0, R32: ADDU R10, R0, R03: SRA R8, R10, 0x34: SLT R2, R3, R85: BNE R0,400488, R2
ADDU
SRA
SLT
SUBU
BNE
R3R0 R0R0
R10
R8
R2
R2
R30x3
400488
2
3
4
5
1A Custom Instruction
Data Flow Graph
2
3
4
1
RAC Initial Architecture
5
![Page 15: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/15.jpg)
15
OUTLINE
Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 16: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/16.jpg)
16
The Problem
Designing a RAC while optimizing Speedup & Area Design parameters
the RAC’s width and height the number of FUs, the number of input and output ports
Speedup
CCyclesOnRATotalClock
ocessorseCyclesOnBaTotalClockSpeedup
Pr
![Page 17: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/17.jpg)
17
Design Methodology- Tool Chain
Our Approach: A Hybrid (Analytical & Quantitative) DSE Approach
CI ExtractorCIs
(DFGs)
CI (DFG)Analyzer
Required Informationfor DSE
Design SpaceExploration
Accelerator FinalSpecification
Adjusting RACArchitecure
Specification ofAccelerator
Application
![Page 18: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/18.jpg)
18
Problem Formulation
: the fraction of DFGs (CIs) in applications which have the width equal to i and height equal to j.
: percentage of application execution time concerns to all DFGs with the width of 4 and height of 3 is 7%.
ij
W
i
H
j
ij CPUCCCPUCC
1 1
)()(
jiFUCPUCC ij
ij )()(
2
4
6
1 3
5
2
4
5
1 3
ij
%743
![Page 19: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/19.jpg)
19
Problem Formulation
ij
W
i
wh
H
j
ij
wh RACCCRACCC
1 1
)()(
: the number of clock cycles for executing DFG(i,j) on a RAC (w,h)
)( wh
ij RACCC
if one or both dimensions of a DFGs are greater than RAC’s dimensions,
temporal partitioning: divides a DFG into time exclusive smaller DFGs
reconfiguration overhead time for loading subsequent partitions of a DFG from the configuration memory onto the RAC
![Page 20: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/20.jpg)
20
Problem Formulation
hjwi ,)3hjwi ,)2
)()( wh
wh
ij RACDRACCC hjwi ,)1
hjwi ,)4
RACh
w
RAC
DFG
RAC
h
w
RAC
DFGh
w
RAC
DFG
![Page 21: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/21.jpg)
21
Delay of RAC
wkMUXDFUDRACDh
i
ki
h
i
ki
wh ,...,1,)()()(
1
11
)( kjMUXD = delay of mux(
kjs2 to 1)
kjmk
js 2log )1()1( wwjmkjRAC Input Ports
RAC Output Ports
22FU
1
1FU
21FU
1
2FU Row1
FU FU FU FU
Col1
Row5
...
...
h
wRAC’s Delay Path
![Page 22: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/22.jpg)
22
Area of RAC
= delay of mux(kjs2 to 1)
w
i
h
j
ij
w
i
h
j
ij
wh MUXAFUARACA
1
1
11 1
)(*2)()(
)( ijMUXA
RAC Input Ports
RAC Output Ports
2
2FU
1
1FU
2
1FU
1
2FU Row1
FU FU FU FU
Col1
Row5
...
...
h
w
![Page 23: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/23.jpg)
23
Optimization Problem
Objective: Maximize ( ) )
ij
W
i
wh
H
j
ij
ij
W
i
H
j
ij
wh
wh
RACCC
CPUCC
RACCC
CPUCCRACS
1 1
1 1
)(
)(
)(
)()(
)( whRACS
![Page 24: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/24.jpg)
24
OUTLINE
Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 25: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/25.jpg)
25
Designing RAC for a Reconfigurable Processor
AMBER: a reconfigurable processor targeted for embedded systems Main components
a base processor (general RISC processor) Sequencer and a coarse-grained reconfigurable functional unit (RFU)
Register File
ID/EXE Reg
RFU
ConfigurationMemory
ALUs
MUX SequencerSequencer
Table
EXE/MEM Reg
GPP Augmented HW
AMBER’s Architecture
![Page 26: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/26.jpg)
26
Quantitative Approach
22 applications of Mibench were attempted Applications were executed on Simplescalar and profiled Hot segments and DFGs are extracted from the applications No limitation in the RFU’s initial architecture DFGs are mapped on the initial RFU architecture Feedbacks of mapping
Specification of the designed architecture for AMBER’s RFU
![Page 27: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/27.jpg)
27
Hybrid DSE Approach
FU and various size multiplexers Synthesized using Hitachi 0.18um Measuring delay and area
DFGs are analyzed to extract required information quantitatively
Reconfiguration penalty time: 1 clock cycle
The base processor clock frequency: 166MHz
![Page 28: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/28.jpg)
28
Speedup Evaluation
14
710
1316
191 4 7
10
13
16
19
0
0.5
1
1.5
2
2.5
Speedup
Width
Height
0-0.5 0.5-1 1-1.5 1.5-2 2-2.5
![Page 29: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/29.jpg)
29
Speedup Evaluation
Increasing RAC’s width more parallelism
Widths larger than 6 negative effect of growing the number of muxes and their sizes no more speedup achievable
Increasing RAC’s Height longer delay speedup declines and area increases
![Page 30: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/30.jpg)
30
Comparison
![Page 31: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/31.jpg)
31
OUTLINE
Introduction Problem Definition and Basic Concepts Hybrid DSE Approach for Designing RAC Case study: Designing RAC for a Reconfigurable Processor Conclusion
![Page 32: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/32.jpg)
32
CONCLUSION
Hybrid DSE approach Uses realistic data from the attempted applications Substantially reduces design time and designer efforts Can be used for shrinking a large design space Easily extendable to apply new design parameters More suitable where the new applications are added
![Page 33: A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator](https://reader036.fdocuments.us/reader036/viewer/2022070417/568152f3550346895dc10f3f/html5/thumbnails/33.jpg)
Thanks for your attention!