High Performance, Low Power Reconfigurable Processor for Embedded Systems
description
Transcript of High Performance, Low Power Reconfigurable Processor for Embedded Systems
High Performance, Low Power Reconfigurable Processor for Embedded Systems
Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
Kyushu University ISOCC 2007@Seoul, Korea 2/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 3/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 4/38
Oct 16, 2007
Introduction
Designing Embedded Systems Embedded Microprocessors Application Specific Integrated Circuits (ASICs) Application Specific Instruction set Processors (ASIPs) Extensible Processors
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2
LD/ST: Load / Store
CFU: Custom Functional Unit
Kyushu University ISOCC 2007@Seoul, Korea 5/38
Oct 16, 2007
Goal
Improving the performance and energy efficiency of embedded processors, while maintaining compatibility and flexibility.
Kyushu University ISOCC 2007@Seoul, Korea 6/38
Oct 16, 2007
Extensible Processors
Enhancing the performance of a processor in embedded systems Using an accelerator for accelerating frequently executed portions of applications
Accelerator implementations reconfigurable fine/coarse grain hw custom hardware (such as ASIP or Extensible Processors)
AND AND
OR
AND2_OR
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2
LD/ST: Load / Store
CFU: Custom Functional Unit
Kyushu University ISOCC 2007@Seoul, Korea 7/38
Oct 16, 2007
Custom Instructions
Critical segments Most frequently executed portions of the applications
Instruction set customization hardware/software partitioning
Custom Instructions (CIs) are extracted from critical segments of an application and executed on an Custom Functional Unit (CFU)
A CI is represented as a Data Flow Graph (DFG)
CI or DFG generation levels high level or binary level (DFG nodes are the instructions level operations)
Kyushu University ISOCC 2007@Seoul, Korea 8/38
Oct 16, 2007
Adaptive Extensible Processors
Issues of Extensible Processors High NRE (Non-Recurring Engineering) and manufacturing costs Long time-to-market
Adaptive Extensible Processor Adding and generating custom instructions after fabrication Using a reconfigurable functional unit (RFU) instead of custom functional unit
CPU
Instruction Dispatcher
Register File
+ & x LD/ST CFU1 CFU2RFU Config Mem
CFU: Custom Functional Unit
RFU: Reconfigurable Functional Unit
Kyushu University ISOCC 2007@Seoul, Korea 9/38
Oct 16, 2007
General Overview of the Proposed Architecture
GPP: General Purpose Processor
RFU: Reconfigurable Functional Unit
Register File
ID/EXE Reg
RFUALU
MUX
EXE/MEM Reg
GPP Augmented HW
400680 subiu $25,$25,1400688 lbu $13,0($7)400690 lbu $2,0($4)400698 sll $2,$2,0x184006a0 sra $14,$2,0x184006a8 addiu $4,$4,14006b0 srl $8,$2,0x1c4006b8 sll $2,$8,0x24006c0 addu $2,$2,$254006c8 lw $2,0($2)4006d0 xori $13,$13,14006d8 addu $10,$10,$2400680 subiu $25,$25,1400698 sll $2,$2,0x184006a0 sra $14,$2,0x18400688 lbu $13,0($7)4006e0 bgez $10,4006f0 ....
Hot Basic Block
Config Memory
Kyushu University ISOCC 2007@Seoul, Korea 10/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 11/38
Oct 16, 2007
Control Data Flow Graph: Definition
CDFG The DFG containing control instructions (e.g. branch instruction)
The sequence of execution changes due to the result of a branch instructions
Types of CDFGs: CDFGs containing at most one branch instruction (last instruction)
accelerator does not need to support conditional execution
CDFGs containing more than one branch instruction accelerator should support conditional execution
Kyushu University ISOCC 2007@Seoul, Korea 12/38
Oct 16, 2007
Why need to support Control Instructions- Motivations (1/2)
Quantitative analysis approach using applications of Mibench DFG extraction process
Short distance control instructions small size DFGs (SSDFG) SSDFGs do not offer noticeable speedup have to be run on the base
processor
1 2 5 beq 7 8bgez3 9 10 bne 12 13 14 15 . . .bne16
Kyushu University ISOCC 2007@Seoul, Korea 13/38
Oct 16, 2007
Why need to support Control Instructions- Motivations (2/2)
Analysis on 17 application of Mibench bitcount
almost 92% of application is hot 32% out of 92% of hot portions do not worth to be accelerated due to the SSDFGs
fft, fft(inv) and sha include few branch instructions supporting conditional execution results in no considerable speedup
0102030405060708090
100
adpcm
(enc)
adpcm
(dec
)
bitcounts
blowfis
h
blowfis
h (dec
)
basic
mat
hcj
peg crc
dijkst
ra
djpeg fft
fft (i
nv)la
me
patric
iash
a
strin
gsear
ch
susa
n
%
Percentage of hot portions Percentage of eliminated hot portions due to SSDFGs
Kyushu University ISOCC 2007@Seoul, Korea 14/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 15/38
Oct 16, 2007
Basic Requirements
inst. # address inst. operands (dest, src1, src2)
1 400418 addu R13 R0 R02 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400450 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R09 400458 sra R8 R9 0x310 400460 slt R2 R3 R911 400468 bne 400488 R212 400470 addiu R10 R0 413 400478 subu R3 R3 R914 400480 addu R8 R8 R915 400488 sra R9 R9 0x116 400490 slt R2 R3 R917 400498 bne 4004b8 R218 4004a0 ori R10 R10 219 4004a8 subu R3 R3 R920 4004b0 addu R8 R8 R921 4004b8 sra R9 R9 0x122 4004c0 slt R2 R3 R923 4004c8 bne 4004e0 R224 4004d0 ori R10 R10 125 4004d8 addu R8 R8 R9
19,20
13: subu
16: slt
17:bne
21: sra
22: slt
15: sra
23:bne
19
3, 7
3, 7
253, 7, 19
R3
R3
R3
R9
R9
R3
R24004e0
4004b8
0x1
R90x1
R9
1 5 6 7 84 11 15
17 16212223242526
... ... ...
...
Control Flow Graph
DFG nodes receive their input from a single source CDFG nodes can have multiple sources The correct source is selected at run time according to the results of branches
Kyushu University ISOCC 2007@Seoul, Korea 16/38
Oct 16, 2007
Basic Requirements
Capability of selective receiving of inputs from both accelerator primary inputs and output of other instructions (FUs) for each node
Capability of selecting the valid outputs from several outputs generated by accelerator according to conditions made by branch instructions
Accelerator should be equipped by control path to provide with the correct selection of inputs and outputs for each FU and entire accelerator
Kyushu University ISOCC 2007@Seoul, Korea 17/38
Oct 16, 2007
CDFG Generation-Example (1/3)
BB1 BB2 BB3
BB4
2 5 beq 7 8bgez3 9
18 17 16 14151920bne
bne10 11 1210
…………….30
load load
Kyushu University ISOCC 2007@Seoul, Korea 18/38
Oct 16, 2007
Example
2 5 beq 7 8bgez3 10
1920bne
bne11 120
exit4
exit3
exit1
exit2
2 5 beq 7 8bgez3 9
18 17 16 14151920bne
bne10 11 1210
BB1BB2 BB3
BB4
…………….30
CDFG Generation-Example (2/3)
Kyushu University ISOCC 2007@Seoul, Korea 19/38
Oct 16, 2007
CDFG Generation- Example (3/3)
inst. # address inst. operands (dest, src1, src2)
1 400418 lw R23 100 R22 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400468 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R09 400458 lw R8 R9 0x310 400460 slt R2 R3 R9
13 400478 bne 4004a8 R214 400480 addiu R10 R0 415 400488 subu R3 R3 R916 400490 addu R8 R8 R917 400498 sra R9 R9 0x118 4004a0 slt R2 R3 R9
21 4004b8 bne 400410 R2
19 4004a8 ori R10 R10 220 4004b0 subu R3 R3 R9
22 4004c0 slt R2 R3 R9
11 40046812 400470
0 400410 addu R13 R0 R0
inst. # address inst. operands (dest, src1, src2)
1 400410 lw R23 100 R2
2 400420 addiu R4 R4 23 400428 subu R3 R2 R114 400430 bgez 400440 R35 400438 addiu R13 R0 86 400440 beq 400468 R137 400448 subu R3 R0 R38 400450 addu R10 R0 R0
9 400460 lw R8 R9 0x310 400458 slt R2 R3 R9
13 400478 bne 4004a8 R214 400480 addiu R10 R0 415 400488 subu R3 R3 R916 400490 addu R8 R8 R917 400498 sra R9 R9 0x118 4004a0 slt R2 R3 R9
21 4004b8 bne 400410 R2
19 4004a8 ori R10 R10 220 4004b0 subu R3 R3 R9
22 4004c0 slt R2 R3 R9
11 40046812 400470
0 400418 addu R13 R0 R0
Code before generating MECI Code after generating MECI
ori R10 R10 1addu R8 R8 R9
ori R10 R10 1addu R8 R8 R9
2 5 beq 7 8bgez3 10
1920bne
bne11 120exit4
exit3
exit1
exit2
Kyushu University ISOCC 2007@Seoul, Korea 20/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 21/38
Oct 16, 2007
CDFG Temporal Partitioning Algorithms
CDFGs extracted from various applications have different sizes for some of the CDFGs
• the whole of CDFG can not be mapped on the accelerator due to the resource limitations of the accelerator
Resource constraints the number of inputs, outputs, logics and routing resource
constraints
Kyushu University ISOCC 2007@Seoul, Korea 22/38
Oct 16, 2007
CDFG Temporal Partitioning Algorithms
Temporal partitioning Temporally divides a DFG into a number of smaller partitions each partition can fit into the target hardware dependencies among the nodes are not violated
Temporal Partitioning algorithms for CDFGs Not-Taken Path Traversing alg. (NTPT) Frequency based TP alg.
Kyushu University ISOCC 2007@Seoul, Korea 23/38
Oct 16, 2007
TP Based on Not-Taken Paths (NTPT)
Adds instructions from not-taken path of a control instruction to a partition until violating the target hardware architectural constraints or reaching to a terminator control instruction
Generating a new partition is started with the branch instructions which at least one of their taken or not-taken instructions has not been located in the current partition
Terminator instruction an instruction which changes execution direction of the program an exit point for a CDFG
Kyushu University ISOCC 2007@Seoul, Korea 24/38
Oct 16, 2007
TP Based on Not-Taken Paths (NTPT)
2 5 beq 7 8bgez3 9
18 17 161920bne
bne10 11 1210
…
0 1 2 53 bgez beq 7 8 9 10beq 7 8 9 10 11 12 bne 151414 15
…
Kyushu University ISOCC 2007@Seoul, Korea 25/38
Oct 16, 2007
Frequency-Based TP Algorithm
NTPT algorithm instructions were selected only from not-taken paths of branches
Frequency based algorithm execution frequency of taken and not-taken paths is an effective factor
for in adding them to the current partition a frequency threshold is defined to determine that whether instruction is
critical or not one of the taken or not-taken paths or both of them can be critical
Kyushu University ISOCC 2007@Seoul, Korea 26/38
Oct 16, 2007
TP Based on Not-Taken Paths (NTPT)
2 5 beq 7 8bgez3 9
18 17 161920bne
bne10 11 12100 1 2 53 bgez beq 7 8 9 10 11 12 bne 151414 15
50%
50%
90%
10%
11
80%
20%
12
…
…
Kyushu University ISOCC 2007@Seoul, Korea 27/38
Oct 16, 2007
Evaluating Proposed Algorithms
Average Partition No. Efficiency
0
1
2
3
4
5
6
Av
era
ge
Pa
rtit
ion
No
.
NTPT Alg. Frequency Based Partitioning Alg.
012345678
Eff
icie
nc
y
NTPT Alg. Frequency Based Partitioning Alg.
Kyushu University ISOCC 2007@Seoul, Korea 28/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 29/38
Oct 16, 2007
Case study: Extending an Extensible Processor to Support Conditional Execution
AMBER an extensible processor targeted
for embedded systems Goal: accelerating application
execution base processor a general
RISC processor Including: profiler, sequencer and
a coarse grain RFU RFU: an accelerator without
conditional execution support
Register File
ID/EXE Reg
RFU
Multi-ContextMemory
Cache
Functional Unit
MUX SequencerSequencer
Tabel
EXE/MEM Reg Profiler ProfilerTable
GPP Augmented HW
OnlineTraining
Kyushu University ISOCC 2007@Seoul, Korea 30/38
Oct 16, 2007
Extending AMBER’s RFU to Support Conditional Execution (CRFU)
The selectors of muxes are used for choosing data for FU inputs controlled by the configuration bits
The outputs of FUs are only applied to the Selector-Muxes in the lower-level rows
Connections from input ports toinputs of the rows
RFU Input Ports
RFU Output Ports
Outputs of 2nd row to theinputs of 4th and 5th rows
Row1
Row5
FU FU FU FU
Outputs of 1st row to theinputs of 3rd, 4th and 5th rows
Kyushu University ISOCC 2007@Seoul, Korea 31/38
Oct 16, 2007
CRFU
FU1 FU2
FU3 FU4
ConfigurationBits
ConfigurationBits
DataSelectionMux
Branch result from FU1
Branch result from FU2
Configuration Bits
Branch result from FU1
Branch result from FU2
Configuration Bits
Configuration Bits
Configuration Bits
Selection using branch results
Inputs of Selector-Mux (one-bit width) originate from the FUs executing branch instructions
Data Selector-MuxSelection using configuration bits
Selection using configuration bits
Kyushu University ISOCC 2007@Seoul, Korea 32/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 33/38
Oct 16, 2007
Experimental Results
Synthesis result of the extended RFU Synopsys tools and Hitachi 0.18μm area 2.1 mm2.
CDFG configuration bit-stream 615 bits Control signals: 375 bits
Executing applications on the Simplescalar as ISS Profiling and extracting hot instruction sequence
NTPT temporal partitioning algorithm for generating mappable CDFGs
Kyushu University ISOCC 2007@Seoul, Korea 34/38
Oct 16, 2007
Experimental Results-Speedup
Speedup Comparison
11.21.41.61.8
22.22.42.62.8
3
adpcm
(enc)
adpcm
(dec
)
blowfis
h (enc)
blowfis
h (dec
) cr
c
dijkst
ra
Avera
ge
Sp
eed
up
DFG
CDFG
Average Speedup: 2.1 (CDFG) versus 1.1 (DFG)
Kyushu University ISOCC 2007@Seoul, Korea 35/38
Oct 16, 2007
Experimental Results-Energy Saving
Energy Saving
01020304050607080
adpcm
(enc)
adpcm
(dec
)
blowfis
h(enc)
blowfis
h(dec
)cr
c
dijkst
ra
Avera
ge
To
tal E
ner
gy
Red
uct
ion
(%
) C
DF
G v
s D
FG
DFGCDFG
Average Energy Saving: 43% (CDFG) versus 21% (DFG)
Kyushu University ISOCC 2007@Seoul, Korea 36/38
Oct 16, 2007
Experimental Results-Comparison
Kyushu University ISOCC 2007@Seoul, Korea 37/38
Oct 16, 2007
Outline
Introduction Motivations for supporting Control Instructions Basic Requirements for Supporting Conditional Execution Algorithms for CDFG Temporal Partitioning Case study: Extending an Extensible Processor to Support
Conditional Execution Experimental Results Conclusion
Kyushu University ISOCC 2007@Seoul, Korea 38/38
Oct 16, 2007
Conclusion
Handling branch instruction & extending DFGs to CDFGs Basic requirements for an accelerator featuring conditional
execution CDFG temporal partitioning algorithms
NTPT traverse not-taken path of the branch instructions Frequency-based temporal partitioning algorithm
Extending AMBER’s RFU to support conditional execution Increasing speedup Reducing energy consumption
Thank You!