Post on 23-Feb-2016
ALU-Array based Reconfigurable Accelerator for Energy Efficient Executions
Koji Inoue†, Hamid Noori‡, Farhad Mehdipour†, Takaaki Hanada†, and Kazuaki Murakami†
†Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
‡School of Electrical and Computer Engineering, University of Tehran
Outline
• Introduction
• ADEXOR: Adaptive Extensible Processor
  – Overview
  – Microarchitecture
  – Coarse-grained Reconfigurable Functional Unit
• Evaluation
• Conclusions
Motivation and Solution
• Embedded processors have to achieve
  – Low cost
  – High performance
  – Low power or low energy consumption
• Key point
  – How can processors adapt to target applications?
• Solution: ASIP with reconfigurability
  – Application-specific ISA
    • Provide custom instructions (CIs)
  – Implement reconfigurable FUs
ADaptive EXtensible processOR (ADEXOR)
GPP: General Purpose Processor
CRFU: Coarse-grained Reconfigurable Functional Unit
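[Figure: ADEXOR block diagram. The GPP pipeline (Register File, ID/EXE Reg, ALU, EXE/MEM Reg) is extended with augmented hardware: the CRFU, the RFU configuration memory, a MUX, and a counter. Labels on the augmented hardware read "Triggered by mtc1 or sequencer" and "Indexed by mtc1 or sequencer".]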
Hot Basic Block (instruction trace):
400680 subiu $25,$25,1
400688 lbu $13,0($7)
400690 lbu $2,0($4)
400698 sll $2,$2,0x18
4006a0 sra $14,$2,0x18
4006a8 addiu $4,$4,1
4006b0 srl $8,$2,0x1c
4006b8 sll $2,$8,0x2
4006c0 addu $2,$2,$25
4006c8 bgez $10,4006f0
4006d0 xori $13,$13,1
4006d8 addu $10,$10,$2
400680 subiu $25,$25,1
400698 sll $2,$2,0x18
4006a0 sra $14,$2,0x18
400688 lbu $13,0($7)
4006e0 bgez $10,4006f0
....
• Has a coarse-grained reconfigurable functional unit
• Supports efficient “Multi-Exits CIs”
• Achieves high performance and low energy
CRFU Microarchitecture
• 16 FUs controlled by configuration bits
• MUX-based interconnection between FUs
• Data from early stages can be transferred directly to the output ports
[Figure: CRFU array. FUs are arranged in rows (Row 1 to Row 5), with configuration bits feeding the FUs. FU operations: adder/subtractor, AND, OR, XOR, barrel shifter.]
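The idea of FUs steered by configuration bits can be illustrated with a small software model. This is only a sketch under assumptions: the `FUConfig` record, the `("in", i)` / `("fu", j)` source encoding, the 32-bit width, and the shift variants are hypothetical and are not the actual ADEXOR configuration format. Each FU picks an operation and two sources through a MUX-like selector, and any FU result (including one from an early stage) may be routed to an output.

```python
from dataclasses import dataclass

MASK = 0xFFFFFFFF  # assumption: 32-bit datapath
MAX_FUS = 16       # FU count stated on the slide

# Operations named on the slide; the two shift variants are an assumption.
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "sll": lambda a, b: (a << (b & 31)) & MASK,
    "srl": lambda a, b: a >> (b & 31),
}


@dataclass(frozen=True)
class FUConfig:
    """Hypothetical per-FU configuration: operation plus two source selectors."""
    op: str
    src_a: tuple  # ("in", i) = CRFU input i, ("fu", j) = result of earlier FU j
    src_b: tuple


def run_crfu(config, inputs, outputs):
    """Evaluate the configured FUs in order and return the selected results."""
    if len(config) > MAX_FUS:
        raise ValueError("the array has only 16 FUs")
    results = []

    def select(src):
        # Models the MUX in front of an FU input.
        kind, index = src
        if kind == "in":
            return inputs[index] & MASK
        if kind == "fu" and index < len(results):
            return results[index]
        raise ValueError(f"invalid source {src!r}")

    for fu in config:
        a = select(fu.src_a)
        b = select(fu.src_b)
        results.append(OPS[fu.op](a, b))
    # Any FU may drive an output, so early-stage data can leave the array.
    return [results[i] for i in outputs]


config = [
    FUConfig("add", ("in", 0), ("in", 1)),  # FU 0: in0 + in1
    FUConfig("xor", ("fu", 0), ("in", 2)),  # FU 1: FU0 result xor in2
]
print(run_crfu(config, inputs=[3, 4, 1], outputs=[0, 1]))  # [7, 6]
```

Output 0 here is the early-stage result of FU 0, which bypasses FU 1 on its way out.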
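[Figure: control-flow graph with numbered instruction nodes (up to 30) grouped into basic blocks BB1 to BB6. Branch nodes are labeled beq, bgez, and bne, and edges carry probabilities of 50%/50%, 60%/40%, and 95%/5%.]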
Supporting Multi-Exits Custom Instructions (MECIs)
Multiple-Exits Custom Instruction: Conditional Execution + Hot-Path Selection
• Assume that one CI can include at most 16 nodes
• Required nodes: 16 (adpcm)
[Figure: selected custom instruction with two exit points, each labeled "Exit".]
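Hot-path selection under a node budget can be sketched as follows. This is a simple greedy stand-in and not necessarily the authors' algorithm: starting from an entry block, it follows the most probable successor and stops when the next block would exceed the 16-node limit. The block names, block sizes, and edge probabilities below are hypothetical and do not reproduce the control-flow graph on the slide.

```python
def select_hot_path(blocks, edges, entry, max_nodes=16):
    """Greedily grow a hot path from `entry` within a node budget.

    blocks: block name -> number of instruction nodes in the block
    edges:  block name -> {successor name: branch probability}
    Returns (path, nodes_used, exits); exits are the successors that leave the path.
    """
    path = []
    used = 0
    current = entry
    while current is not None and current not in path:
        size = blocks[current]
        if used + size > max_nodes:
            break  # the next block does not fit into one custom instruction
        path.append(current)
        used += size
        successors = edges.get(current, {})
        # Follow the most likely successor (hot-path selection).
        current = max(successors, key=successors.get) if successors else None
    exits = sorted(
        {s for block in path for s in edges.get(block, {}) if s not in path}
    )
    return path, used, exits


# Hypothetical example graph.
blocks = {"A": 4, "B": 3, "C": 5, "D": 6, "E": 4}
edges = {
    "A": {"B": 0.6, "C": 0.4},
    "B": {"D": 0.95, "E": 0.05},
    "D": {"E": 0.6, "A": 0.4},
}
print(select_hot_path(blocks, edges, "A"))  # (['A', 'B', 'D'], 13, ['C', 'E'])
```

Blocks C and E become exits of the custom instruction. E is excluded because adding its 4 nodes would bring the total to 17, above the 16-node limit.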
Experimental Setup (1/2)
Base Processor Configuration
Issue: 1-way
L1 Instruction Cache: 32K, 4-way, 1 cycle latency, miss penalty 20 cycles
L1 Data Cache: 16K, 4-way, 1 cycle latency, miss penalty 20 cycles
ALUs: 1 integer unit, 1 floating point unit
Multiplier: 1 integer (5 cycles)
Divider: 1 integer (8 cycles)
Branch predictor: bimodal
Branch prediction table size: 256
Extra branch misprediction: 3
Register File: 4 read ports, 2 write ports
Clock Frequency: 135 MHz
Experimental Setup (2/2)
[Figure: CRFU integration into the pipeline, drawn twice on the slide. Labels: DEC/EXE Pipeline Registers; ALU, MUL/DIV, CRFU; register file (Reg0 to Reg31); CRFU Input Regs with enable (En); Config Memory and Counter, triggered by mtc1 or sequencer; counter input from the decode stage; Result bus; EXE/MEM Pipeline Registers.]
arch1: (4-read/2-write)
• Clock freq: 135 MHz
• RF read/write access
  – 5, 6, 7, or 8 inputs: +1 extra cycle
  – 3 or 4 outputs: +1 extra cycle
  – 5 or 6 outputs: +2 extra cycles
• CRFU execution
  – arch-1-var: variable (1 or 2 cycles)
  – arch-1-fix: 2 cycles
arch2: (8-read/4-write)
• Clock freq: 130 MHz
• RF read/write access
  – Inputs: no extra cycle
  – 5 or 6 outputs: +1 extra cycle
• CRFU execution
  – arch-2-var: variable (1 or 2 cycles)
  – arch-2-fix: 2 cycles
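The timing rules for the two architectures can be combined into a small calculator. This is a sketch under assumptions: it treats the extra register-file cycles as simply added to the CRFU execution cycles, and it assumes that up to 4 inputs and 2 outputs (arch1) or 4 outputs (arch2) cost nothing extra, which follows from the port counts but is not stated on the slide. The function name `ci_latency` is hypothetical. For the "fix" variants, pass `exec_cycles=2`.

```python
def ci_latency(arch, n_inputs, n_outputs, exec_cycles):
    """Cycles for one custom instruction: CRFU execution plus extra RF cycles."""
    if exec_cycles not in (1, 2):
        raise ValueError("CRFU execution takes 1 or 2 cycles")
    if not (0 <= n_inputs <= 8 and 0 <= n_outputs <= 6):
        raise ValueError("the slide covers at most 8 inputs and 6 outputs")

    if arch == "arch1":  # 4 read ports, 2 write ports
        extra_in = 1 if n_inputs >= 5 else 0
        if n_outputs <= 2:
            extra_out = 0
        elif n_outputs <= 4:
            extra_out = 1
        else:
            extra_out = 2
    elif arch == "arch2":  # 8 read ports, 4 write ports
        extra_in = 0
        extra_out = 1 if n_outputs >= 5 else 0
    else:
        raise ValueError(f"unknown architecture {arch!r}")

    return exec_cycles + extra_in + extra_out


print(ci_latency("arch1", n_inputs=6, n_outputs=5, exec_cycles=2))  # 5
print(ci_latency("arch2", n_inputs=6, n_outputs=5, exec_cycles=2))  # 3
```

For the same custom instruction, arch2 saves two cycles here, at the cost of a slightly lower clock frequency (130 MHz instead of 135 MHz).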
Performance Evaluation
[Figure: bar chart of speedup (y-axis from 1 to 5) for arch1-var, arch2-fix, and arch2-var.]
Energy Consumption
Pros.
• Low activity of hardware components
  – I-Cache, Bpred
  – Decoder
  – Register File
  – Functional Unit
• Higher I-Cache hit rates
  – Reduce the energy for off-chip accesses
Cons.
• RFU configuration
  – Accessing the config. memory
  – Setting control signals in the RFU
• Increased complexity
  – Communication between the processor’s data-path and the RFU
Total Energy Reduction
[Figure: bar chart of total energy reduction (%), y-axis from 0 to 80, per benchmark (basic..., qsort, susan, c-jpeg, d-jpeg, lame, dijkstra, patricia, stringse..., blowfish, sha, adpcm, fft, crc, gsm) for clk-gating-arch2-var, arch2-var, arch2-fix, and arch1-var.]
Temperature Analysis
[Figure: temperature (℃), y-axis from 45 to 48, versus clock frequency (130 MHz, 260 MHz, 390 MHz, 520 MHz, 650 MHz).]
[Figure: CRFU floor plan (1.7 x 1.7 mm²) showing the placement of the 16 FUs.]
Conclusions
• ADEXOR: Adaptive Extensible Processor
  – Has a coarse-grained reconfigurable functional unit
  – Supports multi-exit custom instructions
• Performance / Energy Analysis
  – 5x speedup (best case)
  – 60% energy reduction (best case)
• Future Work
  – Extend for 3D-IC implementation
Acknowledgement
• This research was supported in part by
  – New Energy and Industrial Technology Development Organization
  – The chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Hitachi Ltd. and Dai Nippon Printing Corporation.
Backup Slides
Area overhead (1/2)
• VHDL & Hitachi 0.18 μm library
  – Base processor: 4.5 mm²
  – CRFU: 1.7 mm²
• CACTI 4.2 (0.18 μm)
  – I-Cache & D-Cache (32KB 4-way): 2.25 mm²
  – Configuration Memory (SRAM, for 32 MECIs): 0.56 mm²
  – Sequencer (CAM, 32 entries): 0.092 mm²
• Base Processor (with caches)
  – Area: 9.0 mm²
Area overhead (2/2)
Area Overhead
arch1/mtc1: 25.1%
arch1/sequencer: 26.1%
arch2/mtc1: 30%
arch2/sequencer: 31%
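The arch1 percentages can be reproduced from the component areas listed on the previous slide. The sketch below assumes that overhead is the added area divided by the base processor area including caches, and that the 2.25 mm² figure applies to each of the two caches (which matches the stated 9.0 mm² total). The arch2 figures are not recomputed because the area of the additional register-file ports is not listed.

```python
# Component areas in mm^2, taken from the "Area overhead (1/2)" slide.
BASE_CORE = 4.5
CACHE_EACH = 2.25   # assumption: per cache (I-Cache and D-Cache)
CRFU = 1.7
CONFIG_MEM = 0.56
SEQUENCER = 0.092


def overhead_percent(added_area, base_area):
    """Added area as a percentage of the base processor area."""
    return 100.0 * added_area / base_area


base_total = BASE_CORE + 2 * CACHE_EACH  # 9.0 mm^2, matches the slide

arch1_mtc1 = overhead_percent(CRFU + CONFIG_MEM, base_total)
arch1_seq = overhead_percent(CRFU + CONFIG_MEM + SEQUENCER, base_total)

print(f"arch1/mtc1:      {arch1_mtc1:.1f}%")  # 25.1%
print(f"arch1/sequencer: {arch1_seq:.1f}%")   # 26.1%
```

Under these assumptions the mtc1 variant adds the CRFU and the configuration memory, and the sequencer variant adds the 32-entry CAM on top of that.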
Access Reduction
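[Figure: bar chart of access reduction (%), y-axis from -5 to 95, per benchmark (basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch) plus the averages avg-seq and avg-mtc1, for the decoder, branch predictor, register file, I-cache, ALU, and I-cache misses. An inset compares seq and mtc1 (ticks at 15, 35, 55).]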
Energy Consumption Breakdown for arch1/invoke-mtc1
[Figure: stacked bar chart of the energy consumption breakdown, y-axis from 0 to 100, per benchmark (basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch), split into Base Processor, CRFU, and Config Mem.]
Energy Consumption Breakdown for arch2/invoke-seq
[Figure: stacked bar chart of the energy consumption breakdown, y-axis from 0 to 100, per benchmark (basicmath, bitcount, qsort, susan, cjpeg, djpeg, dijkstra, patricia, blowfish, rijndael, sha, adpcm, crc, fft, gsm, stringsearch), split into Base Processor, CRFU, Config Mem, Reg File, and Sequencer.]